How to gradually enrich OMOP mappings with SSSOM

This document is a guide for OMOP ETL developers to think about gradually improving the (documentation of the) strength of evidence for their vocabulary mappings.

Example table from OMOP

Generated manually with Athena on the 20th July 2023. The start and end dates are invented.

concept_id_1	concept_id_2	relationship_id	valid_start_date	valid_end_date	invalid_reason
44499396	4028717	Maps to	19700101	20991231
45586281	4028717	Maps to	73754	20991231

Level 1: Basic mapping table, basic provenance

The SSSOM metadata provided is conceptually correct, but fictitious.

The reader should imagine this being provided as a separate CONCEPT_MAPPINGS.CSV table that can be joined on subject_id -> concept_id_1 and object_id -> concept_id_2 for all rows whose relationship_id is Maps to. This assumes that the (concept_id_1, concept_id_2) tuple is unique among the Maps to rows.

subject_id	object_id	predicate_id	mapping_provider	mapping_tool	mapping_tool_version	mapping_justification	reviewer_id	author_id
OMOP:44499396	OMOP:4028717	omoprel:mapsTo	OHDSI:Odysseus			semapv:ManualMappingCuration		ORCID:0000-0003-4147-1485
OMOP:45586281	OMOP:4028717	omoprel:mapsTo	OHDSI:Odysseus	OHDSI_TOOLS:Usagi	1.4.3	semapv:LexicalMatching	ORCID:0000-0003-4147-1485
OMOP:45610575	OMOP:441554	omoprel:mapsTo	OHDSI:UMLS			semapv:UnspecifiedMatching

What do we see here?

All identifiers are prefixed so that they are interpreted correctly when they are reused. This includes OMOP ids (e.g. OMOP:44499396) as well as ORCIDs (OPTIONAL).
"Maps to" is encoded using a proper identifier (omoprel:mapsTo) rather than a string (OPTIONAL).
All three mappings have a mapping_justification to distinguish, for example, whether the mapping was determined by human manual curation (semapv:ManualMappingCuration) or by lexical matching (semapv:LexicalMatching). Many other justifications exist and/or can be created. If the justification for the mapping is unknown, we can make our lack of knowledge transparent by using semapv:UnspecifiedMatching.
author_id, in the case of semapv:ManualMappingCuration, identifies the person who determined the mapping. This is basic provenance. If the identity of the author can be connected with a public record such as ORCID, this can help mapping users to increase their trust in a mapping. reviewer_id tells us that a human looked at the mapping after it was proposed by a tool and "signed off" on it. This can be valuable, again, to increase trust.
If the match was generated by a tool, some basic provenance about that tool is added (mapping_tool, mapping_tool_version).
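
The join between the OMOP relationship table and the separate SSSOM table described at the start of this level could be sketched as follows. This is a minimal illustration only, using the Python standard library: the function names (read_tsv, strip_prefix, join_mappings) are hypothetical, only a subset of the columns is included, and it assumes that the OMOP: prefix is followed directly by the numeric concept id and that each (concept_id_1, concept_id_2) pair is unique among the Maps to rows.

```python
import csv
import io

# Rows copied from the example tables above (tab-separated, subset of columns).
CONCEPT_RELATIONSHIP_TSV = (
    "concept_id_1\tconcept_id_2\trelationship_id\n"
    "44499396\t4028717\tMaps to\n"
    "45586281\t4028717\tMaps to\n"
)
CONCEPT_MAPPINGS_TSV = (
    "subject_id\tobject_id\tpredicate_id\tmapping_justification\n"
    "OMOP:44499396\tOMOP:4028717\tomoprel:mapsTo\tsemapv:ManualMappingCuration\n"
    "OMOP:45586281\tOMOP:4028717\tomoprel:mapsTo\tsemapv:LexicalMatching\n"
    "OMOP:45610575\tOMOP:441554\tomoprel:mapsTo\tsemapv:UnspecifiedMatching\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def strip_prefix(curie):
    """Assumed convention: 'OMOP:44499396' -> '44499396'."""
    prefix, _, local_id = curie.partition(":")
    if prefix != "OMOP":
        raise ValueError(f"unexpected prefix in {curie!r}")
    return local_id


def join_mappings(relationships, mappings):
    """Left-join 'Maps to' rows with their SSSOM justification, if any."""
    by_pair = {
        (strip_prefix(m["subject_id"]), strip_prefix(m["object_id"])): m
        for m in mappings
    }
    joined = []
    for rel in relationships:
        if rel["relationship_id"] != "Maps to":
            continue
        key = (rel["concept_id_1"], rel["concept_id_2"])
        meta = by_pair.get(key, {})
        joined.append(
            {**rel, "mapping_justification": meta.get("mapping_justification")}
        )
    return joined


joined_rows = join_mappings(
    read_tsv(CONCEPT_RELATIONSHIP_TSV), read_tsv(CONCEPT_MAPPINGS_TSV)
)
for row in joined_rows:
    print(row["concept_id_1"], row["concept_id_2"], row["mapping_justification"])
```

Rows in the OMOP table without SSSOM metadata keep a mapping_justification of None, so the join never drops an existing mapping.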

Level 2: Curate semantic mapping predicate

subject_id	object_id	predicate_id	mapping_provider	mapping_tool	mapping_tool_version	mapping_justification	reviewer_id	author_id
OMOP:44499396	OMOP:4028717	skos:broadMatch	OHDSI:Odysseus			semapv:ManualMappingCuration		ORCID:0000-0003-4147-1485
OMOP:45586281	OMOP:4028717	skos:exactMatch	OHDSI:Odysseus	OHDSI_TOOLS:Usagi	1.4.3	semapv:LexicalMatching	ORCID:0000-0003-4147-1485
OMOP:45610575	OMOP:441554	skos:exactMatch	OHDSI:UMLS			semapv:UnspecifiedMatching

What do we see here?

Rather than Maps to, the mapping predicate (e.g. skos:exactMatch) is now a semantic mapping predicate from a standardised vocabulary (SKOS). Here, we distinguish between skos:exactMatch and skos:broadMatch, but other predicates exist; see, for example, the Semantic Mapping Vocabulary.

Level 3: Document confidence widely

confidence is an incredibly useful metric for downstream users, including ETL engineers and data analysts. In an ideal world, all mappings have some kind of confidence associated with them. A confidence score of 0.9 should be read as: "the strength of evidence provided in this record/table row (i.e. the mapping justification) leads us to believe that the mapping (e.g. OMOP:44499396 --[skos:broadMatch]--> OMOP:4028717) is correct with 90% confidence".

subject_id	object_id	predicate_id	mapping_provider	mapping_tool	mapping_tool_version	mapping_justification	reviewer_id	author_id	confidence
OMOP:44499396	OMOP:4028717	skos:broadMatch	OHDSI:Odysseus			semapv:ManualMappingCuration		ORCID:0000-0003-4147-1485	0.9
OMOP:45586281	OMOP:4028717	skos:exactMatch	OHDSI:Odysseus	OHDSI_TOOLS:Usagi	1.4.3	semapv:LexicalMatching	ORCID:0000-0003-4147-1485		0.8
OMOP:45610575	OMOP:441554	skos:exactMatch	OHDSI:UMLS			semapv:UnspecifiedMatching			0.6

What do we see here?

For matching tools, confidence can be calculated by proxies such as "lexical similarity", "edit distance", "cosine similarity of node embedding" and other metrics. In the example above, Usagi has determined that the subject and object match, but it was only 80% sure (we don't know why; recording that is more advanced SSSOM).
For cases where an external mapping is reused during ETL, confidence describes the level of trust that you, as an ETL expert, have in the fidelity of the mapping provided by the source.
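
As an illustration of how a downstream user might act on the confidence column, the sketch below keeps only mappings at or above a chosen threshold. The function name filter_by_confidence, the threshold of 0.8, and the decision to exclude rows without a recorded confidence are all assumptions made for this example, not SSSOM or OMOP recommendations.

```python
# Subset of the columns from the Level 3 table above.
MAPPINGS = [
    {"subject_id": "OMOP:44499396", "object_id": "OMOP:4028717",
     "predicate_id": "skos:broadMatch", "confidence": "0.9"},
    {"subject_id": "OMOP:45586281", "object_id": "OMOP:4028717",
     "predicate_id": "skos:exactMatch", "confidence": "0.8"},
    {"subject_id": "OMOP:45610575", "object_id": "OMOP:441554",
     "predicate_id": "skos:exactMatch", "confidence": "0.6"},
]


def filter_by_confidence(mappings, threshold):
    """Return the mappings whose confidence is at least `threshold`.

    Rows with no confidence value are excluded (an assumption of this sketch).
    """
    kept = []
    for mapping in mappings:
        raw = mapping.get("confidence", "")
        if raw == "":
            continue
        if float(raw) >= threshold:
            kept.append(mapping)
    return kept


trusted = filter_by_confidence(MAPPINGS, threshold=0.8)
for mapping in trusted:
    print(mapping["subject_id"], mapping["predicate_id"], mapping["object_id"])
```

Whether to drop, flag, or manually review low-confidence mappings is a project decision; the filter above only shows that the column makes such a decision possible.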

Level 4: Document curation rules

subject_id	object_id	predicate_id	mapping_provider	mapping_tool	mapping_tool_version	mapping_justification	reviewer_id	author_id	confidence	curation_rule
OMOP:44499396	OMOP:4028717	skos:broadMatch	OHDSI:Odysseus			semapv:ManualMappingCuration		ORCID:0000-0003-4147-1485	0.9	OHDSI_CURATION_RULE:19

What do we see here?

For manual matches, it is often unclear by what criteria a match was established. Documenting the curation rules can help increase consistency in manual curation, and transparency for downstream users.
OHDSI_CURATION_RULE:19 is a rule defined by your own curation rulebook. This can be anything. For example OHDSI_CURATION_RULE:19 could correspond to the following rule:

OHDSI_CURATION_RULE:19 = If the subject concept does not have an exact match in the object source vocabulary, we select the nearest broad ("up-hill") concept applicable. Conceptually, if both terms existed in the same terminology, the subject concept could be defined as a subconcept of the object concept. The determination for both criteria (nearest broad, conceptually subconcept) is performed through medical expert judgement.
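
A rulebook like this can be kept in machine-readable form so that the curation_rule column can be resolved to its text. The sketch below is one hypothetical way to do that: the dictionary CURATION_RULEBOOK and the function describe_rule are invented for illustration, and the rule text is abbreviated from the definition above.

```python
# Hypothetical in-memory rulebook: rule identifier -> rule text (abbreviated).
CURATION_RULEBOOK = {
    "OHDSI_CURATION_RULE:19": (
        "If the subject concept does not have an exact match in the object "
        "source vocabulary, we select the nearest broad concept applicable."
    ),
}


def describe_rule(mapping, rulebook):
    """Return the rule text for a mapping, or None if no rule is recorded.

    Raises KeyError when a mapping cites a rule the rulebook does not define,
    so that undocumented rules are noticed early.
    """
    rule_id = mapping.get("curation_rule")
    if not rule_id:
        return None
    if rule_id not in rulebook:
        raise KeyError(f"curation rule {rule_id!r} is not in the rulebook")
    return rulebook[rule_id]


example_mapping = {
    "subject_id": "OMOP:44499396",
    "object_id": "OMOP:4028717",
    "predicate_id": "skos:broadMatch",
    "curation_rule": "OHDSI_CURATION_RULE:19",
}
rule_text = describe_rule(example_mapping, CURATION_RULEBOOK)
print(rule_text)
```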