How to gradually enrich OMOP mappings with SSSOM

This document is a guide for OMOP ETL developers to think about gradually improving the (documentation of the) strength of evidence for their vocabulary mappings.

Example table from OMOP

Generated manually with Athena on the 20th July 2023. The start and end dates are invented.

| concept_id_1 | concept_id_2 | relationship_id | valid_start_date | valid_end_date | invalid_reason |
|---|---|---|---|---|---|
| 44499396 | 4028717 | Maps to | 19700101 | 20991231 | |
| 45586281 | 4028717 | Maps to | 73754 | 20991231 | |

Level 1: Basic mapping table, basic provenance

The SSSOM metadata provided is conceptually correct, but fictitious.

The reader should imagine the table below being provided as a separate CONCEPT_MAPPINGS.CSV table that can be joined on subject_id -> concept_id_1 and object_id -> concept_id_2 for all rows whose relationship_id is Maps to (this assumes that the concept_id_1, concept_id_2 tuple is unique for Maps to).

| subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id |
|---|---|---|---|---|---|---|---|---|
| OMOP:44499396 | OMOP:4028717 | omoprel:mapsTo | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 |
| OMOP:45586281 | OMOP:4028717 | omoprel:mapsTo | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | |
| OMOP:45610575 | OMOP:441554 | omoprel:mapsTo | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | |

What we see here:

  1. all identifiers are prefixed to make sure they are interpreted correctly when they are reused. This includes OMOP ids (e.g. OMOP:44499396) as well as ORCIDs (OPTIONAL)
  2. "Maps to" is encoded using a proper identifier rather than a string (OPTIONAL)
  3. All three mappings have a mapping_justification to distinguish for example if the mapping was determined by human manual curation (semapv:ManualMappingCuration) or lexical matching (semapv:LexicalMatching). Many other justifications exist and/or can be created. If the justification for the mapping is unknown, we can make our lack of knowledge transparent by using semapv:UnspecifiedMatching.
  4. author_id, in the case of semapv:ManualMappingCuration, tells us who determined the mapping. This is basic provenance. If the identity of the author can be connected with a public record such as ORCID, this can help mapping users to increase trust in a mapping. reviewer_id tells us that a human looked at the mapping after it was proposed by a tool and "signed off" on it. This can be valuable, again, to increase trust.
  5. If the match was generated by a tool, some basic provenance is added (mapping_tool, mapping_tool_version).
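
The join described at the start of this level can be sketched in a few lines of Python. This is a minimal illustration, not part of any OMOP or SSSOM tooling: the helper names strip_prefix and join_mappings are hypothetical, the rows are copied from the two example tables above, and the sketch assumes that the (concept_id_1, concept_id_2) pair is unique for Maps to.

```python
# Rows copied from the example tables above; in practice they would be read
# from the OMOP relationship table and from CONCEPT_MAPPINGS.CSV.
concept_relationship = [
    {"concept_id_1": 44499396, "concept_id_2": 4028717, "relationship_id": "Maps to"},
    {"concept_id_1": 45586281, "concept_id_2": 4028717, "relationship_id": "Maps to"},
]
sssom_rows = [
    {"subject_id": "OMOP:44499396", "object_id": "OMOP:4028717",
     "mapping_justification": "semapv:ManualMappingCuration"},
    {"subject_id": "OMOP:45586281", "object_id": "OMOP:4028717",
     "mapping_justification": "semapv:LexicalMatching"},
    {"subject_id": "OMOP:45610575", "object_id": "OMOP:441554",
     "mapping_justification": "semapv:UnspecifiedMatching"},
]


def strip_prefix(curie, prefix="OMOP"):
    """Turn a prefixed identifier such as 'OMOP:44499396' into the integer 44499396."""
    found, _, local_id = curie.partition(":")
    if found != prefix or not local_id:
        raise ValueError(f"expected a {prefix}: identifier, got {curie!r}")
    return int(local_id)


def join_mappings(relationships, mappings):
    """Attach SSSOM metadata to each 'Maps to' row, keyed on the concept id pair."""
    by_pair = {
        (strip_prefix(m["subject_id"]), strip_prefix(m["object_id"])): m
        for m in mappings
    }
    joined = []
    for rel in relationships:
        if rel["relationship_id"] != "Maps to":
            continue
        metadata = by_pair.get((rel["concept_id_1"], rel["concept_id_2"]))
        if metadata is not None:
            joined.append({**rel, **metadata})
    return joined


joined = join_mappings(concept_relationship, sssom_rows)
for row in joined:
    print(row["concept_id_1"], row["concept_id_2"], row["mapping_justification"])
```

Because the prefix is checked before it is stripped, an identifier from another namespace raises an error instead of silently joining on the wrong concept, which is the reason for prefixing identifiers in the first place.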

Level 2: Curate semantic mapping predicate

| subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id |
|---|---|---|---|---|---|---|---|---|
| OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 |
| OMOP:45586281 | OMOP:4028717 | skos:exactMatch | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | |
| OMOP:45610575 | OMOP:441554 | skos:exactMatch | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | |

What do we see here?

  1. Rather than Maps to, the mapping predicate (e.g. skos:exactMatch) is a semantic mapping predicate from a standardised vocabulary (SKOS). Here, we distinguish between skos:exactMatch and skos:broadMatch, but there are other predicates; see, for example, the Semantic Mapping Vocabulary.
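
One way a downstream user might exploit the curated predicate is to separate mappings that preserve meaning from those that lose detail. The following is a small sketch under stated assumptions: the rows are taken from the Level 2 table, the function names are hypothetical, and treating every predicate other than skos:exactMatch as "lossy" is a simplification chosen for this example, not a rule from SSSOM or OMOP.

```python
from collections import defaultdict

# (subject_id, predicate_id, object_id) triples from the Level 2 table.
mappings = [
    ("OMOP:44499396", "skos:broadMatch", "OMOP:4028717"),
    ("OMOP:45586281", "skos:exactMatch", "OMOP:4028717"),
    ("OMOP:45610575", "skos:exactMatch", "OMOP:441554"),
]


def group_by_predicate(rows):
    """Collect (subject, object) pairs under each mapping predicate."""
    grouped = defaultdict(list)
    for subject_id, predicate_id, object_id in rows:
        grouped[predicate_id].append((subject_id, object_id))
    return dict(grouped)


def is_lossy(predicate_id):
    """Assumption for this sketch: only skos:exactMatch preserves the full meaning."""
    return predicate_id != "skos:exactMatch"


grouped = group_by_predicate(mappings)
lossy_subjects = [subject for subject, predicate, _ in mappings if is_lossy(predicate)]

for predicate, pairs in grouped.items():
    print(predicate, len(pairs))
print("Subjects mapped with loss of detail:", lossy_subjects)
```

With Maps to alone, all three rows would look identical; with the curated predicate, an analyst can see that OMOP:44499396 was mapped to a broader concept.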

Level 3: Document confidence widely

confidence is an incredibly useful metric for downstream users, including ETL engineers and data analysts. In an ideal world, all mappings have some kind of confidence associated with them. A confidence score of 0.9 should be read as "the strength of evidence provided in this record/table row (i.e. the mapping justification) leads us to believe the mapping (e.g. OMOP:44499396 --[skos:broadMatch]--> OMOP:4028717) is correct with 90% confidence".

| subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id | confidence |
|---|---|---|---|---|---|---|---|---|---|
| OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 | 0.9 |
| OMOP:45586281 | OMOP:4028717 | skos:exactMatch | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | | 0.8 |
| OMOP:45610575 | OMOP:441554 | skos:exactMatch | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | | 0.6 |

What do we see here?

  • For matching tools, confidence can be calculated by proxies such as "lexical similarity", "edit distance", "cosine similarity of node embeddings" and other metrics. In the example above, Usagi has determined that the subject and object match, but it was only 80% sure (we don't know why; recording that is a more advanced use of SSSOM).
  • For cases where an external mapping is re-used in an ETL, confidence describes the level of trust you as an ETL expert have in the fidelity of the mapping provided by the source.
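
To make the idea of a lexical proxy concrete, here is a minimal sketch that derives a confidence value from string similarity using Python's standard difflib module. It does not reproduce how Usagi or any other tool computes its scores. The function names, the review threshold of 0.8, and the two labels are all assumptions made up for this illustration.

```python
from difflib import SequenceMatcher


def lexical_confidence(subject_label, object_label):
    """Return a similarity ratio in [0, 1], ignoring case and surrounding whitespace."""
    a = subject_label.strip().lower()
    b = object_label.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def needs_review(confidence, threshold=0.8):
    """Flag mappings whose confidence falls below a (hypothetical) review threshold."""
    return confidence < threshold


# Hypothetical labels, not taken from the OMOP vocabulary.
score = lexical_confidence("Acute asthma", "Asthma")
print(f"confidence={score:.2f} needs_review={needs_review(score)}")
```

A score produced this way only measures how alike two strings are, so it is best recorded together with mapping_justification semapv:LexicalMatching so that users know what kind of evidence the number reflects.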

Level 4: Document curation rules

| subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id | confidence | curation_rule |
|---|---|---|---|---|---|---|---|---|---|---|
| OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 | 0.9 | OHDSI_CURATION_RULE:19 |

What do we see here?

  • For manual matches, it is often unclear by what criteria a match was established. Documenting the curation rules can help increase consistency in manual curation and transparency for downstream users.
  • OHDSI_CURATION_RULE:19 is a rule defined by your own curation rulebook. This can be anything. For example, OHDSI_CURATION_RULE:19 could correspond to the following rule:
OHDSI_CURATION_RULE:19 = If the subject concept does not have an exact match in the object source vocabulary, we select the nearest broad ("up-hill") concept applicable. Conceptually, if both terms existed in the same terminology, the subject concept could be defined as a subconcept of the object concept. The determination for both criteria (nearest broad, conceptually subconcept) is performed through medical expert judgement.
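
A rulebook is most useful when mapping tables are checked against it. The sketch below shows one possible check, assuming the rulebook is kept as a simple lookup from rule identifier to rule text. The RULEBOOK structure, the function name check_curation_rules, and the policy that every manually curated mapping must cite a known rule are assumptions for illustration, not part of SSSOM.

```python
# Hypothetical rulebook: curation_rule identifier -> abbreviated rule text.
RULEBOOK = {
    "OHDSI_CURATION_RULE:19": (
        "If the subject concept has no exact match in the object vocabulary, "
        "select the nearest broad concept applicable."
    ),
}


def check_curation_rules(rows, rulebook):
    """Return a list of problems for manual mappings with a missing or unknown rule."""
    problems = []
    for row in rows:
        if row["mapping_justification"] != "semapv:ManualMappingCuration":
            continue  # this sketch only requires rules for manual curation
        pair = f'{row["subject_id"]} -> {row["object_id"]}'
        rule = row.get("curation_rule", "")
        if not rule:
            problems.append(f"{pair}: manual mapping without curation_rule")
        elif rule not in rulebook:
            problems.append(f"{pair}: unknown rule {rule}")
    return problems


# Row from the Level 4 table.
rows = [
    {"subject_id": "OMOP:44499396", "object_id": "OMOP:4028717",
     "mapping_justification": "semapv:ManualMappingCuration",
     "curation_rule": "OHDSI_CURATION_RULE:19"},
]
print(check_curation_rules(rows, RULEBOOK))
```

Running such a check as part of the ETL build would catch manual mappings that were added without recording the criterion behind them.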