How to gradually enrich OMOP mappings with SSSOM
This document is a guide for OMOP ETL developers to think about gradually improving the (documentation of the) strength of evidence for their vocabulary mappings.
Example table from OMOP
Generated manually with Athena on the 20th July 2023. The start and end dates are invented.
concept_id_1 | concept_id_2 | relationship_id | valid_start_date | valid_end_date | invalid_reason |
---|---|---|---|---|---|
44499396 | 4028717 | Maps to | 19700101 | 20991231 | |
45586281 | 4028717 | Maps to | 73754 | 20991231 | |
Level 1: Basic mapping table, basic provenance
The SSSOM metadata provided is conceptually correct, but fictitious.
The reader should imagine this being provided as a separate CONCEPT_MAPPINGS.CSV table that can be joined on `subject_id` -> `concept_id_1`, `object_id` -> `concept_id_2` for all rows with a `Maps to` `relationship_id` (this assumes that the `concept_id_1`, `concept_id_2` tuple is unique for `Maps to`).
subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id |
---|---|---|---|---|---|---|---|---|
OMOP:44499396 | OMOP:4028717 | omoprel:mapsTo | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 |
OMOP:45586281 | OMOP:4028717 | omoprel:mapsTo | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | |
OMOP:45610575 | OMOP:441554 | omoprel:mapsTo | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | |
What we see here:
- All identifiers are prefixed to make sure they are interpreted correctly when they are reused. This includes OMOP ids (e.g. `OMOP:44499396`) as well as ORCIDs (OPTIONAL).
- "Maps to" is encoded using a proper identifier rather than a string (OPTIONAL).
- All three mappings have a `mapping_justification` to distinguish, for example, whether the mapping was determined by human manual curation (`semapv:ManualMappingCuration`) or lexical matching (`semapv:LexicalMatching`). Many other justifications exist and/or can be created. If the justification for the mapping is unknown, we can make our lack of knowledge transparent by using `semapv:UnspecifiedMatching`.
- `author_id`, in the case of `semapv:ManualMappingCuration`, tells us who determined the mapping. This is basic provenance. If the identity of the author can be connected with a public record such as ORCID, this can help mapping users to increase trust in a mapping.
- `reviewer_id` tells us that a human looked at the mapping after it was proposed by a tool, and "signed off" on it. This can be valuable, again, to increase trust.
- If the match was generated by a tool, some basic provenance is added (`mapping_tool`, `mapping_tool_version`).
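The join described at the start of this level can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two CSV snippets are reduced copies of the tables above, and the helper names `local_id` and `join_mappings` are hypothetical, not part of OMOP or SSSOM tooling. The sketch strips the `OMOP:` prefix from the SSSOM identifiers and left-joins the metadata onto the `Maps to` rows.

```python
import csv
import io

# Reduced, illustrative copies of the two tables above.
CONCEPT_RELATIONSHIP = """concept_id_1,concept_id_2,relationship_id
44499396,4028717,Maps to
45586281,4028717,Maps to
"""

CONCEPT_MAPPINGS = """subject_id,object_id,predicate_id,mapping_justification
OMOP:44499396,OMOP:4028717,omoprel:mapsTo,semapv:ManualMappingCuration
OMOP:45586281,OMOP:4028717,omoprel:mapsTo,semapv:LexicalMatching
"""


def local_id(curie: str) -> int:
    """Strip the prefix from a prefixed identifier such as 'OMOP:44499396'."""
    prefix, _, local = curie.partition(":")
    if prefix != "OMOP" or not local.isdigit():
        raise ValueError(f"not an OMOP concept identifier: {curie!r}")
    return int(local)


def join_mappings(relationship_csv: str, sssom_csv: str) -> list:
    """Attach SSSOM metadata to each 'Maps to' row, keyed on the concept id pair."""
    metadata = {}
    for row in csv.DictReader(io.StringIO(sssom_csv)):
        key = (local_id(row["subject_id"]), local_id(row["object_id"]))
        metadata[key] = row

    joined = []
    for row in csv.DictReader(io.StringIO(relationship_csv)):
        if row["relationship_id"] != "Maps to":
            continue
        key = (int(row["concept_id_1"]), int(row["concept_id_2"]))
        # Left join: rows without SSSOM metadata are kept unchanged.
        joined.append({**row, **metadata.get(key, {})})
    return joined


rows = join_mappings(CONCEPT_RELATIONSHIP, CONCEPT_MAPPINGS)
for r in rows:
    print(r["concept_id_1"], r["concept_id_2"], r.get("mapping_justification"))
```

Keying the lookup on the `(concept_id_1, concept_id_2)` pair relies on the uniqueness assumption stated above; if that assumption does not hold, later metadata rows would silently overwrite earlier ones.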
Level 2: Curate semantic mapping predicate
subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id |
---|---|---|---|---|---|---|---|---|
OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 |
OMOP:45586281 | OMOP:4028717 | skos:exactMatch | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | |
OMOP:45610575 | OMOP:441554 | skos:exactMatch | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | |
What do we see here?
- Rather than `Maps to`, the mapping predicate (e.g. `skos:exactMatch`) is a semantic mapping predicate from a standardised vocabulary (SKOS). Here, we distinguish between `skos:exactMatch` and `skos:broadMatch`, but there are other predicates; see, for example, the Semantic Mapping Vocabulary.
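One practical benefit of curated predicates is that downstream users can select mappings by their semantics. The following sketch groups the three rows of the Level 2 table by `predicate_id`; the function name `group_by_predicate` is hypothetical and the rows are reduced to the columns needed here.

```python
from collections import defaultdict

# Rows copied from the Level 2 table above (only the columns needed here).
MAPPINGS = [
    {"subject_id": "OMOP:44499396", "object_id": "OMOP:4028717", "predicate_id": "skos:broadMatch"},
    {"subject_id": "OMOP:45586281", "object_id": "OMOP:4028717", "predicate_id": "skos:exactMatch"},
    {"subject_id": "OMOP:45610575", "object_id": "OMOP:441554", "predicate_id": "skos:exactMatch"},
]


def group_by_predicate(mappings):
    """Group (subject, object) pairs by their mapping predicate."""
    groups = defaultdict(list)
    for m in mappings:
        groups[m["predicate_id"]].append((m["subject_id"], m["object_id"]))
    return dict(groups)


by_predicate = group_by_predicate(MAPPINGS)
exact_only = by_predicate.get("skos:exactMatch", [])
broad_only = by_predicate.get("skos:broadMatch", [])
print(len(exact_only), "exact,", len(broad_only), "broad")
```

An analysis that cannot tolerate loss of granularity could, for instance, restrict itself to `exact_only`, which would not be possible while every row carries the same `Maps to` predicate.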
Level 3: Document confidence widely
`confidence` is an incredibly useful metric for downstream users, including ETL engineers and data analysts. In an ideal world, all mappings have some kind of `confidence` associated with them. A `confidence` score should be read as "the strength of evidence provided in this record/table row (i.e. the mapping justification) leads us to believe the mapping (e.g. `OMOP:44499396 --[skos:broadMatch]--> OMOP:4028717`) is correct with 90% confidence", which corresponds to a value of 0.9 in the table.
subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id | confidence |
---|---|---|---|---|---|---|---|---|---|
OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 | 0.9 |
OMOP:45586281 | OMOP:4028717 | skos:exactMatch | OHDSI:Odysseus | OHDSI_TOOLS:Usagi | 1.4.3 | semapv:LexicalMatching | ORCID:0000-0003-4147-1485 | | 0.8 |
OMOP:45610575 | OMOP:441554 | skos:exactMatch | OHDSI:UMLS | | | semapv:UnspecifiedMatching | | | 0.6 |
What do we see here?
- For matching tools, confidence can be calculated by proxies such as "lexical similarity", "edit distance", "cosine similarity of node embeddings" and other metrics. In the example above, Usagi has determined that the subject and object match, but it was only 80% sure (we don't know why; documenting that is more advanced SSSOM).
- Where an external mapping is re-used in the ETL, `confidence` describes the level of trust you as an ETL expert have in the fidelity of the mapping provided by the source.
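As a rough illustration of a lexical proxy, the sketch below derives a score from the similarity ratio of two concept labels using Python's standard `difflib`, and then filters mappings by a confidence threshold. This is an assumption-laden example: it is not how Usagi or any other tool computes its score, the labels are hypothetical, and the function names `lexical_confidence` and `at_least` are invented for this guide.

```python
from difflib import SequenceMatcher


def lexical_confidence(subject_label: str, object_label: str) -> float:
    """Return a similarity ratio in [0, 1] for two labels.

    Labels are lower-cased and whitespace-normalised first. This is a generic
    proxy, not the scoring used by any particular tool.
    """
    a = " ".join(subject_label.lower().split())
    b = " ".join(object_label.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def at_least(mappings, minimum):
    """Keep only the mappings whose confidence meets the threshold."""
    return [m for m in mappings if m["confidence"] >= minimum]


# Confidence values copied from the Level 3 table above.
MAPPINGS = [
    {"subject_id": "OMOP:44499396", "confidence": 0.9},
    {"subject_id": "OMOP:45586281", "confidence": 0.8},
    {"subject_id": "OMOP:45610575", "confidence": 0.6},
]

# Hypothetical labels; identical strings after normalisation score 1.0.
print(lexical_confidence("Heart  Failure", "heart failure"))
print([m["subject_id"] for m in at_least(MAPPINGS, 0.8)])
```

A threshold such as 0.8 is a project-specific choice; recording `confidence` in the table is what makes such a choice possible for downstream users in the first place.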
Level 4: Document curation rules
subject_id | object_id | predicate_id | mapping_provider | mapping_tool | mapping_tool_version | mapping_justification | reviewer_id | author_id | confidence | curation_rule |
---|---|---|---|---|---|---|---|---|---|---|
OMOP:44499396 | OMOP:4028717 | skos:broadMatch | OHDSI:Odysseus | | | semapv:ManualMappingCuration | | ORCID:0000-0003-4147-1485 | 0.9 | OHDSI_CURATION_RULE:19 |
What do we see here?
- For manual matches, it is often unclear by what criteria a match was established. Documenting the curation rules can help increase consistency in manual curation, and transparency for downstream users.
- `OHDSI_CURATION_RULE:19` is a rule defined by your own curation rulebook. This can be anything. For example, `OHDSI_CURATION_RULE:19` could correspond to the following rule:

OHDSI_CURATION_RULE:19 = If the subject concept does not have an exact match in the object source vocabulary, we select the nearest broad ("up-hill") concept applicable. Conceptually, if both terms existed in the same terminology, the subject concept could be defined as a subconcept of the object concept. The determination for both criteria (nearest broad, conceptually subconcept) is performed through medical expert judgement.
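Once a rulebook exists, a simple consistency check becomes possible: every manually curated mapping should cite a rule that the rulebook actually defines. The sketch below assumes a rulebook held as a plain dictionary and treats `curation_rule` as a single value per row; both are simplifying assumptions, and `curation_rule_problems` is a hypothetical helper name.

```python
# Hypothetical rulebook: identifiers and wording are defined by each curation team.
RULEBOOK = {
    "OHDSI_CURATION_RULE:19": "No exact match available: select the nearest broader concept.",
}

MAPPINGS = [
    {
        "subject_id": "OMOP:44499396",
        "object_id": "OMOP:4028717",
        "mapping_justification": "semapv:ManualMappingCuration",
        "curation_rule": "OHDSI_CURATION_RULE:19",
    },
    {
        "subject_id": "OMOP:45586281",
        "object_id": "OMOP:4028717",
        "mapping_justification": "semapv:LexicalMatching",
        "curation_rule": "",
    },
]


def curation_rule_problems(mappings, rulebook):
    """Return messages for manually curated mappings lacking a known curation rule."""
    problems = []
    for m in mappings:
        if m["mapping_justification"] != "semapv:ManualMappingCuration":
            continue  # Rules are only checked for manually curated rows.
        rule = m.get("curation_rule", "")
        if not rule:
            problems.append(f"{m['subject_id']}: no curation rule recorded")
        elif rule not in rulebook:
            problems.append(f"{m['subject_id']}: unknown rule {rule}")
    return problems


problems = curation_rule_problems(MAPPINGS, RULEBOOK)
print(problems)
```

Running such a check as part of the ETL build is one way to keep the rulebook and the mapping table from drifting apart.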