Confidence

SSSOM supports annotating confidence in several ways, both for individual mapping records and for whole mapping sets.

Confidence in Positive Semantic Mappings

The following example shows a manually curated semantic mapping between two disease resources, annotated with high confidence (0.99).

#curie_map:
#  mesh: https://meshb.nlm.nih.gov/record/ui?ui=
#  MONDO: http://purl.obolibrary.org/obo/MONDO_
#  oboinowl: http://www.geneontology.org/formats/oboInOwl#
#  orcid: https://orcid.org/
#  semapv: https://w3id.org/semapv/vocab/
#  skos: http://www.w3.org/2004/02/skos/core#
#mapping_set_id: https://w3id.org/biopragmatics/biomappings/sssom/positive.sssom.tsv
subject_id  subject_label   predicate_id    object_id   object_label    mapping_justification   author_id   confidence
MONDO:0000455   cone dystrophy  skos:exactMatch mesh:D000077765 Cone Dystrophy  semapv:ManualMappingCuration    orcid:0000-0003-4423-4370   0.99

The following example shows a medium-confidence semantic mapping produced by a lexical matching process. The mapping is actually incorrect, yet the lexical matching process assigned it a confidence of 0.65.

#curie_map:
#  DOID: http://purl.obolibrary.org/obo/DOID_
#  orcid: https://orcid.org/
#  semapv: https://w3id.org/semapv/vocab/
#  skos: http://www.w3.org/2004/02/skos/core#
#  umls: https://uts.nlm.nih.gov/uts/umls/concept/
#mapping_set_id: https://w3id.org/biopragmatics/biomappings/sssom/negative.sssom.tsv
subject_id  subject_label   predicate_id    object_id   object_label    mapping_justification   confidence
DOID:0050052    Rocky Mountain spotted fever    skos:exactMatch umls:C0035795   Rocky mountain spotted fever vaccine    semapv:LexicalMatching   0.65

When a semantic mapping does not explicitly specify a confidence, confidence estimation algorithms should treat its confidence as 1.0 by default.
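The default described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not part of any SSSOM tooling: the function name `read_confidences`, the constant `DEFAULT_CONFIDENCE`, the `example.org` mapping set ID, and the `EX:` identifiers are all hypothetical, and the sketch assumes a tab-separated file whose metadata lines start with `#`.

```python
import csv

# Default taken from the text above; the constant name is an assumption.
DEFAULT_CONFIDENCE = 1.0


def read_confidences(sssom_text):
    """Return (subject_id, object_id, confidence) triples from SSSOM/TSV text."""
    # Lines starting with '#' carry the embedded metadata, not mapping records.
    body = [
        line
        for line in sssom_text.splitlines()
        if line and not line.startswith("#")
    ]
    triples = []
    for row in csv.DictReader(body, delimiter="\t"):
        # A missing column and an empty cell are both treated as "not specified".
        raw = (row.get("confidence") or "").strip()
        confidence = float(raw) if raw else DEFAULT_CONFIDENCE
        triples.append((row["subject_id"], row["object_id"], confidence))
    return triples


# Hypothetical input: the second record (EX: identifiers) has no confidence.
example = "\n".join([
    "#mapping_set_id: https://example.org/demo.sssom.tsv",
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\tconfidence",
    "MONDO:0000455\tskos:exactMatch\tmesh:D000077765\tsemapv:ManualMappingCuration\t0.99",
    "EX:0001\tskos:exactMatch\tEX:0002\tsemapv:ManualMappingCuration\t",
])

for subject, obj, confidence in read_confidences(example):
    print(subject, obj, confidence)
```

The second record falls back to the default because its confidence cell is empty; a file with no confidence column at all would be handled the same way.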

Confidence with Negated Semantic Mappings

SSSOM has explicit support for curating negative semantic mappings (i.e., subject-predicate-object triples known to be false) by using the predicate_modifier column.

The following example shows a high-confidence negative semantic mapping: Rocky Mountain spotted fever (a disease curated in DOID) is not the same as Rocky mountain spotted fever vaccine (a vaccine curated in UMLS).

#curie_map:
#  DOID: http://purl.obolibrary.org/obo/DOID_
#  orcid: https://orcid.org/
#  semapv: https://w3id.org/semapv/vocab/
#  skos: http://www.w3.org/2004/02/skos/core#
#  umls: https://uts.nlm.nih.gov/uts/umls/concept/
#mapping_set_id: https://w3id.org/biopragmatics/biomappings/sssom/negative.sssom.tsv
subject_id  subject_label   predicate_id    predicate_modifier  object_id   object_label    mapping_justification   author_id   confidence
DOID:0050052    Rocky Mountain spotted fever    skos:exactMatch Not umls:C0035795   Rocky mountain spotted fever vaccine    semapv:ManualMappingCuration    orcid:0000-0003-4423-4370   1.0

It's also possible to curate a negative semantic mapping with low confidence, but this is done less commonly in practice. Both human curators and semantic mapping prediction workflows typically focus on the production of positive knowledge.

There is also a large number of trivial negative semantic mappings, which curators typically do not record and which algorithms that consume semantic mappings typically ignore.

When a negative semantic mapping does not explicitly specify a confidence, confidence estimation algorithms should treat its confidence as 1.0 by default.
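As a rough illustration of how a consumer might keep positive and negative semantic mappings apart, the following Python sketch partitions records on the predicate_modifier column and applies the 1.0 default to both kinds. The function name `partition_mappings` and the dictionary-based record layout are assumptions made for this example, not an SSSOM API.

```python
# Default taken from the text above; the constant name is an assumption.
DEFAULT_CONFIDENCE = 1.0


def partition_mappings(rows):
    """Split mapping records (dicts keyed by SSSOM column name) into
    positive and negative lists of (subject, predicate, object, confidence)."""
    positive, negative = [], []
    for row in rows:
        raw = (row.get("confidence") or "").strip()
        confidence = float(raw) if raw else DEFAULT_CONFIDENCE
        record = (
            row["subject_id"],
            row["predicate_id"],
            row["object_id"],
            confidence,
        )
        # A predicate_modifier of "Not" marks the triple as known to be false.
        if (row.get("predicate_modifier") or "").strip() == "Not":
            negative.append(record)
        else:
            positive.append(record)
    return positive, negative


rows = [
    {
        "subject_id": "MONDO:0000455",
        "predicate_id": "skos:exactMatch",
        "object_id": "mesh:D000077765",
        "confidence": "0.99",
    },
    {
        # Negative mapping with no explicit confidence: defaults to 1.0.
        "subject_id": "DOID:0050052",
        "predicate_id": "skos:exactMatch",
        "predicate_modifier": "Not",
        "object_id": "umls:C0035795",
    },
]

positive, negative = partition_mappings(rows)
print(positive)
print(negative)
```

Keeping the two lists separate avoids accidentally treating a confident negative mapping as a confident positive one when only the confidence column is inspected.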

Estimating Overall Confidence in a Mapping Set

There are two places where the confidence in a mapping set can be reported:

1. The creator of the mapping set can report their confidence in the mapping set with the mapping_set_confidence slot in the mapping set's metadata.
2. The maintainer of a mapping set registry who indexes a mapping set can report their own confidence in the mapping set.

In some situations, it may be sufficient to choose a mapping set confidence based on knowledge about the scope/domain of the mapping set, who the curators were, etc.

Alternatively, an empirical confidence can be estimated by randomly sampling semantic mappings from the mapping set, manually reviewing them, and reporting the proportion that were correct as a decimal value between zero and one. This estimate becomes more accurate as the size of the sample increases, so it is suggested to sample at least 50 to 100 semantic mappings.
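The sampling procedure above might be sketched as follows. The helper names `sample_for_review` and `empirical_confidence`, the fixed seed, the `EX:` identifiers, and the review outcome (93 of 100 judged correct) are all hypothetical and serve only to illustrate the arithmetic; the manual review step itself happens outside the code.

```python
import random


def sample_for_review(mappings, sample_size=100, seed=0):
    """Draw a random sample of mappings (without replacement) for manual review."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    k = min(sample_size, len(mappings))
    return rng.sample(mappings, k)


def empirical_confidence(review_results):
    """Return the proportion of reviewed mappings judged correct.

    review_results is a list of booleans, one per reviewed mapping.
    """
    if not review_results:
        raise ValueError("at least one reviewed mapping is required")
    return sum(review_results) / len(review_results)


# Hypothetical mapping set of 500 records, identified here only by a label.
mappings = [f"EX:{i:04d}" for i in range(500)]
to_review = sample_for_review(mappings, sample_size=100)

# Hypothetical outcome of the manual review: 93 correct, 7 incorrect.
review_results = [True] * 93 + [False] * 7
print(len(to_review), empirical_confidence(review_results))
```

The resulting value is a decimal between zero and one, which is the form in which a mapping set confidence is reported.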

When a registry does not explicitly specify its confidence in a mapping set, confidence estimation algorithms should treat the registry confidence as 1.0 by default.