How to pick the right mapping predicates
A mapping predicate such as `skos:exactMatch` specifies the semantics of the mapping relation - in other words, it defines how a computer (and human!) should interpret the mapping when it is being used. For example, a computer program may be allowed to merge nodes in a knowledge graph only when they are `skos:exactMatch`, but not when they are, say, `skos:closeMatch`.
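To make this concrete, here is a minimal Python sketch of such a merge step. The node identifiers and mapping tuples are hypothetical; the point is only that the merge is gated on the predicate:

```python
# Minimal sketch: merge knowledge-graph nodes only for skos:exactMatch
# mappings. Node IDs and mapping tuples below are hypothetical examples.

def merge_nodes(edges, mappings):
    """Rewrite edge endpoints so that exactly-matched nodes collapse into
    a single canonical node; all other mapping predicates are ignored."""
    canonical = {}
    for subject, predicate, obj in mappings:
        if predicate == "skos:exactMatch":
            canonical[subject] = obj  # collapse subject into object
    return [(canonical.get(s, s), p, canonical.get(o, o)) for s, p, o in edges]

edges = [("DB:gala_1", "rdf:type", "DB:Gala")]
mappings = [
    ("DB:Gala", "skos:exactMatch", "FOODON:Gala"),    # merged
    ("DB:Gala", "skos:relatedMatch", "FOODON:Apple"), # NOT merged
]
print(merge_nodes(edges, mappings))
# → [('DB:gala_1', 'rdf:type', 'FOODON:Gala')]
```

A real pipeline would also merge node properties and track provenance, but the gating logic stays the same.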
Picking the right predicate to specify the meaning of your mapping is often a difficult process. The following guide should help you to understand the most widely used mapping predicates and when they are appropriate.
Table of contents
- The three primary concerns for selecting a mapping predicate
- The 3-step process for selecting an appropriate mapping predicate
- Frequently asked questions about mapping predicates
Throughout this guide, a mapping has three parts:

- `subject`: the entity that is being mapped
- `object`: the entity that the `subject` is mapped to
- `predicate`: the semantic mapping relationship used
The three primary concerns for selecting a mapping predicate
There are at least three things you need to decide before selecting an appropriate mapping predicate:
What is the precision of the mapping?
As a curator, you should try to investigate the intended meaning of both the subject and the object. This task usually involves trying to find out as much as possible about the mapped identifiers: What is their human readable definition? Are there any logical axioms that could help with understanding the intended meaning? Sometimes, this even involves asking the respective stewards of the database or ontology for clarification. Important: The key here is "intended meaning". For example, when you see
`FOODON:Apple` (FOODON is an ontology), you do not try to figure out what an apple is, but what thing in the world (in your conceptual model of the world) the FOODON developers intended the `FOODON:Apple` identifier to refer to. This might be an apple that you can eat, or a cultivar!
The precision is simply: is the mapping `exact`, `close`, `broad`, `narrow` or `related`? Here is a basic guide about how to think of each:
- `exact`: The two terms are intended to refer to the same thing. For example, both the subject and the object identifiers refer to the concept of the Gala cultivar.
- `close`: The two terms are intended to refer to roughly the same thing, but not quite. This is a hazy category and should be avoided in practice, because when taken too literally, most mappings could be interpreted as close mappings - which defeats the point of creating mappings that are intended to be useful (see the "use case" considerations later in this document). An example of a `close` mapping is one between the "heart" concept in a database of anatomical entities for biological research on chimpanzees and the "human heart" concept in an electronic health record for humans.
- `broad`: The object is conceptually broader than the subject. For example, "human heart" in an electronic health record refers to "heart" in a general anatomy ontology that covers all species, such as Uberon. Another example is "Gala (cultivar)" in one ontology or database to "Apple (cultivar)" in another: "Apple (cultivar)" has a broader meaning than "Gala (cultivar)". For a good mapping, it is advisable that `broad` and `narrow` are applied a bit more strictly than is technically permitted by the SKOS specification: both the subject and the object should belong to the same category. For example, you should use `broad` (or `narrow`) only if both the subject and the object are "cultivars" (in the above example).
- `narrow`: The object is conceptually narrower than the subject. For example, "Apple (cultivar)" is a narrow match to "Gala (cultivar)". Think of it as the opposite of `broad`.

Note that `broad` and `narrow` are so-called inverse categories: if "Gala (cultivar)" is a `broad` match to "Apple (cultivar)", then "Apple (cultivar)" is a `narrow` match to "Gala (cultivar)"! One note of caution: `narrow` matches generally have less useful applications than `broad` ones. For example, if we want to group subject entities in a database under an ontology to make them queryable in a knowledge graph, only `broad` matches to the ontology can be used for retrieval. If we map "Gala (cultivar)" in a database to "Apple (cultivar)" in an ontology, and we write a semantic query to obtain all records that are about "Apple (cultivar)" according to the ontology, we obtain the "Gala (cultivar)" records. This is not true the other way around: if the ontology term is more specific than the database term, it can't be used to group the database data.
- `related`: The subject refers to an analogous concept of a different category. For example, "Apple" and "Apple tree" are considered `related` matches, but not `exact` matches, as "Apple" is of the "fruit" category and "Apple tree" of the "tree" category. Other examples include: "disease" and "phenotype", "chemical" and "chemical exposure", "car" and "car manufacturing process". In general, `related` mappings should be reserved for "direct analogues". For example, we should not also map to broader categories of analogues, like "Gala (cultivar)" to "Apple tree". This causes a proliferation of very low-value mappings (see the use case section later).
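The inverse relationship between `broad` and `narrow` noted above can be sketched as a small Python helper (the mapping tuples are hypothetical; `exact`, `close` and `related` are symmetric, so they map to themselves):

```python
# Sketch: invert a mapping by swapping subject and object and flipping
# the predicate. broadMatch/narrowMatch are inverses of each other;
# exactMatch, closeMatch and relatedMatch are symmetric.

INVERSE = {
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
    "skos:exactMatch": "skos:exactMatch",
    "skos:closeMatch": "skos:closeMatch",
    "skos:relatedMatch": "skos:relatedMatch",
}

def invert(mapping):
    subject, predicate, obj = mapping
    return (obj, INVERSE[predicate], subject)

print(invert(("ONT1:GalaCultivar", "skos:broadMatch", "ONT2:AppleCultivar")))
# → ('ONT2:AppleCultivar', 'skos:narrowMatch', 'ONT1:GalaCultivar')
```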
What is the acceptable degree of noise of the mapping?
"Noise" is the permissible margin of error for some target use case. Depending on what you want to do with your mappings, different quality levels are acceptable. This section is not exhaustive.
While reading through this section, you should keep one thing in mind: it is never a good idea to think about mappings as "correct" or "wrong". Even the exact same identifier (for example in Wikidata, or in the biomedical data domain) can mean something very different depending on which database uses it, or in which part of which data model (or value set) it is used. Mapping should therefore be perceived as an inexact art where the goal is not "correctness" but "fitness for purpose": can the mappings deliver the use case I am interested in? In the following, we will take a closer look at the varying levels of noise you may need to weigh against each other.
- "zero-noise". Some mappings directly inform decision processes of downstream consumers, such as clinical decision support or manufacturing. For example, in an electronic health record (EHR) system we may want to know what the latest recommended drugs (or contra-indications) for a condition are; the disease-drug relationships may be curated using one terminology such as OMOP, while the EHR is represented using ICD10-CM (a clinical terminology used widely by hospitals). In these cases, noise should be zero or close to zero, as patient lives depend on the correctness of these mappings.
- "low-noise". Most mappings are used to augment/inform processes that are a bit upstream of the final consumer. For example, mappings are used to group data for analysis or to make related data easier to find during search (enhancing search indexing semantically). The final consumer does not immediately "see" the mappings, only the consequences of applying them. In these cases, a bit of noise may be acceptable, i.e. some mappings that are "not quite right". Practically, this is very often the case where data sources are aligned automatically to enable searches across them, and a few bad mappings are better than having none.
- "high-noise": Some use cases employ data processing approaches that are themselves highly resilient to noise, like Machine Learning. Here, even a larger number of mappings (in a knowledge graph for example) which are "not quite right", or noisy, may be acceptable (if the signal to noise ratio is still ok, i.e. there are "more good than bad" mappings).
There is no easy formula by which you can decide what level of noise is acceptable. Your use case will determine this. What you, as the steward of your organisation's mapping data, should consider is that there is (roughly) an order of magnitude in cost involved between the three levels:
- "high-noise": Very cheap to generate. Automated matching tools can be used to generate the mappings, with no human review required. Your system may implement a way for your consumers to flag up bad results which can be traced back to a bad mapping, and simply exclude them moving forward.
- "low-noise": Moderately expensive. Most mappings are generated using automated matchers, but then confirmed by a human curator. The confirmation process can often be "hand-wavy" to weed out obviously bad mappings, and does not involve the same rigour that "zero-noise" mappings would require, in order to remain scalable to large volumes of mappings. Such a "hand-wavy" confirmative review can take 10 to 100 seconds (if a quick lookup is required).
- "zero-noise": Very expensive. Every mapping must be carefully reviewed by a human curator, sometimes by a group of curators. In our experience, reviewing or establishing a mapping like this (manually) can take anything between 10 and 30 minutes - occasionally more.
You can use these estimated costs for mapping review to determine how much it would cost to apply the same level of rigour to your own mappings.
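As a back-of-the-envelope sketch, the review times above can be turned into a cost estimate. The per-mapping times below are illustrative mid-range values taken from this guide (~30 seconds for low-noise review, ~20 minutes for zero-noise review); your own numbers will differ:

```python
# Back-of-the-envelope review-cost estimate, using illustrative mid-range
# times from this guide: low-noise ~30 s/mapping, zero-noise ~20 min/mapping;
# high-noise mappings need no human review at all.

REVIEW_SECONDS = {"high-noise": 0, "low-noise": 30, "zero-noise": 20 * 60}

def review_hours(n_mappings, noise_level):
    """Estimated human review effort, in hours, for a mapping set."""
    return n_mappings * REVIEW_SECONDS[noise_level] / 3600

print(round(review_hours(10_000, "low-noise"), 1))   # → 83.3
print(round(review_hours(10_000, "zero-noise"), 1))  # → 3333.3
```

The roughly order-of-magnitude gap between the two printed numbers is the cost difference the text describes.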
What is the intended use case?
This section is informative, not exhaustive, and will give you a sense of how use cases affect your choice of mapping predicate.
We have covered some implications of use cases in the sections above:
- Some use cases require lower levels of noise, others can live with higher levels of noise.
- Mappings are rarely 100% exact when mapping across semantic spaces (different database, ontologies, terminologies). What matters is not "correctness" - what matters is that the mappings are "fit for purpose" (i.e. useful for your use case).
- Some mappings may be of more value for your use case than others (for example, `exact` mappings may be more valuable than `broad` mappings). You can find the right cost-benefit balance by weighing the value of such mappings against the cost of generating and maintaining them. `close` mappings may often have a very low value, but if your acceptable level of noise is high, just generate them, since they don't cost you anything!
Other key considerations, covered in the following sections, are:
Semantic frameworks for analysis and querying
There are four semantic frameworks/formalisms that SSSOM supports by default: (1) SPARQL/RDF(S) (querying integrated knowledge with basic SPARQL); (2) Simple Knowledge Organization System (SKOS); (3) Web Ontology Language (OWL); (4) no formalism (property graphs, non-semantic use cases). We will briefly discuss the implications of each for your use cases.
- SPARQL/RDF(S) is a very general semantic framework that allows querying across property paths. Many SPARQL engines provide at least the RDFS entailment regime, which allows for some (basic) semantic reasoning (subClassOf, property domains). This is the most likely semantic framework of choice if your use case involves semantic queries, such as those involving sub-class groupings.
- SKOS is a semantic framework that layers on top of RDF and specifies semantics for a handful of properties that are useful for building taxonomies which do not seek to follow the rigorous semantics of class-level modelling constructs such as subClassOf. We have no experience with SKOS reasoners, and do not know if any exist. This means, in effect, that this semantic framework has exactly the same considerations as the SPARQL/RDF(S) one above.
- OWL is a very powerful semantic framework that is based on formal logic. Ontologies represented in OWL offer support for complex expressions of knowledge, way beyond what RDFS and SKOS can do. OWL is the semantic framework of choice if the goal is to build and reason over an integrated (merged) ontology. An example use case where OWL is the appropriate framework is integration of species-specific anatomy ontologies under species-neutral ones, see for example Uberon. A basic rule of thumb is: unless you know positively that you have to reason over the merged graph, i.e. set of all ontologies you have mapped across, OWL is probably overkill and should be avoided.
- Using no semantic framework does not mean semantic mappings are useless! Many extremely useful applications exist for mappings which do not involve a semantic framework, such as those related to Labelled Property Graphs (for example neo4j). Even if you just want to translate your data into a graph, it is useful to know the semantics of your mappings as they can inform your graph queries.
Other semantic frameworks exist such as rule-based systems (e.g. Datalog, SWRL), but they are not used as widely as the above in our domain.
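The sub-class grouping query mentioned for SPARQL/RDF(S) can be sketched without a triple store. The example below (all identifiers are illustrative, and the subclass lookup is a single level for brevity) retrieves every database record falling under an ontology class by combining subClassOf edges with `broad`/`exact` mappings from database terms to ontology terms:

```python
# Sketch: group database records under an ontology class by combining
# subClassOf edges (ontology side) with broadMatch/exactMatch mappings
# (database term -> ontology term). All identifiers are illustrative,
# and only direct subclasses are considered for brevity.

subclass_of = {"FOODON:GalaCultivar": "FOODON:AppleCultivar"}
mappings = [("DB:Gala", "skos:exactMatch", "FOODON:GalaCultivar"),
            ("DB:Fuji", "skos:broadMatch", "FOODON:AppleCultivar"),
            ("DB:Cox", "skos:broadMatch", "FOODON:PearCultivar")]
records = {"rec1": "DB:Gala", "rec2": "DB:Fuji", "rec3": "DB:Cox"}

def records_under(ont_class):
    # Ontology classes covered: the class itself plus its direct subclasses.
    covered = {c for c in list(subclass_of) + [ont_class]
               if c == ont_class or subclass_of.get(c) == ont_class}
    db_terms = {s for s, p, o in mappings
                if o in covered and p in ("skos:exactMatch", "skos:broadMatch")}
    return sorted(r for r, t in records.items() if t in db_terms)

print(records_under("FOODON:AppleCultivar"))
# → ['rec1', 'rec2']
```

In a real deployment this would be a SPARQL property-path query over the merged graph; the grouping logic, however, is the same.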
Instance vs Property vs Concept-level mapping
To pick the correct mapping predicate, it is important to understand whether you are mapping concepts or instances:
- Concept-level: the entity being mapped constitutes a class or a concept. A concept can be thought of as a collection or set of individuals. For example, "Apple" could refer to the class of all apples.
- Instance-level: the entity being mapped constitutes an individual or an instance. An instance is a single real-world entity, such as Barack Obama. Instances are members of classes/concepts. For example, Barack Obama belongs to the class of "Person", or "Former Presidents". Another example is an individual apple on a shelf in a supermarket ("Gala Apple 199999"), which is an instance of the "Apple" class.
Note that notions like `broad` and `narrow` make no sense when mapping instances. We typically try to avoid the SKOS vocabulary for mapping instances and use `owl:sameAs` instead. Note that `owl:sameAs` does have implications for reasoning, but it is also the preferred property when working within the RDF/SPARQL semantic framework.
If the mapping involves an instance and a class, you have hit a corner case of the SSSOM use case. This case can still be represented, but instance-concept relationships are not widely thought of as "mappings".
In much the same way as concepts and instances, you can also map properties or "relationships":
- Property-level: the entities being mapped are both properties, like, for example, rdfs:label, skos:prefLabel, RO:0000050 (part of).
Note that it does not make sense to try to map instances of concepts, or concepts, directly to properties. There are no relationships that would support such a mapping.
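The entity-type rules in this section can be sketched as a small validation helper. This is a hypothetical function that only encodes the guidance above (SKOS match predicates for concept-concept mappings, `owl:sameAs` for instance-instance mappings, no instance/concept-to-property mappings); it deliberately simplifies the property-property case:

```python
# Hypothetical helper encoding this section's guidance: SKOS match
# predicates apply to concept-concept mappings, owl:sameAs to
# instance-instance mappings; instances and concepts cannot be mapped
# to properties. Property-property checks are simplified to "allowed".

SKOS_MATCHES = {"skos:exactMatch", "skos:closeMatch", "skos:broadMatch",
                "skos:narrowMatch", "skos:relatedMatch"}

def predicate_ok(subject_type, object_type, predicate):
    if "property" in (subject_type, object_type):
        # Properties may only be mapped to other properties.
        return subject_type == object_type
    if subject_type == object_type == "concept":
        return predicate in SKOS_MATCHES
    if subject_type == object_type == "instance":
        return predicate == "owl:sameAs"
    return False  # instance-to-concept: a corner case, not really a mapping

print(predicate_ok("concept", "concept", "skos:exactMatch"))    # → True
print(predicate_ok("instance", "instance", "skos:narrowMatch")) # → False
```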
Typical use cases
Typical use cases for mappings include:
- Semantic data integration. This often involves linking data to ontologies or semantic layers in knowledge graphs. Data from one source (such as an EHR) is translated to another (such as OMOP, see above). To analyse the data semantically, the most valuable links are `broad` ones, as these allow you to directly query the ontology to retrieve instance data. `narrow` matches are less useful for such a use case, but may be consulted as the "next best thing" to an exact mapping. Often, a low level of noise is acceptable.
- Data translation. Similar to data integration, but we want to map as precisely as possible. Only `exact` matches really matter if we want to make sure that data annotated with one ontology means the exact same thing as data annotated with another. Noise in the mappings is often not acceptable. An example of this is one source annotating all its genes using HUGO Gene Nomenclature Committee (HGNC) identifiers while another uses NCBI Gene database identifiers. `close` matches are mostly meaningless here - we need a 1:1 translation table with next to zero noise.
- Ontology and knowledge graph merging. Here, the key issue is that `exact` matches have as little noise as possible. Some merging approaches use probabilistic algorithms to weed out potentially bad mappings (low levels of noise may be acceptable, see for example boomer), but any naive merging approach, which is still prevalent in the knowledge graph world, will usually do the following: (1) merge all `exact` matches into one "node" in the knowledge graph and (2) redirect all data asserted against any of these `exact` matches to that newly created node.
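For the data translation use case above, a strict 1:1 translation table can be built from `exact` matches only. This sketch (the gene identifiers are illustrative) refuses ambiguous exact mappings, since translation needs next-to-zero noise:

```python
# Sketch: build a strict 1:1 translation table from exact matches only.
# Ambiguous (1:n or n:1) exact mappings are rejected, because data
# translation tolerates next to zero noise. Identifiers are illustrative.

def translation_table(mappings):
    table = {}
    seen_objects = set()
    for subject, predicate, obj in mappings:
        if predicate != "skos:exactMatch":
            continue  # broad/narrow/close matches are ignored for translation
        if subject in table or obj in seen_objects:
            raise ValueError(f"ambiguous exact mapping for {subject} / {obj}")
        table[subject] = obj
        seen_objects.add(obj)
    return table

mappings = [("HGNC:11998", "skos:exactMatch", "NCBIGene:7157"),
            ("HGNC:1097", "skos:exactMatch", "NCBIGene:673"),
            ("HGNC:1097", "skos:closeMatch", "NCBIGene:9999")]
print(translation_table(mappings))
# → {'HGNC:11998': 'NCBIGene:7157', 'HGNC:1097': 'NCBIGene:673'}
```

Note how the `close` match is silently dropped rather than translated - exactly the behaviour the data translation use case calls for.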
The 3-step process for selecting an appropriate mapping predicate
The following 3-step process condenses the sections above into a simple-to-follow algorithm.
Given two terms A and B:
- Target semantic framework: Does your use case require OWL reasoning over the merged subject and object sources?
- If yes, use OWL vocabulary for properties
- If no, use RDF/SPARQL/SKOS vocabulary for properties
- Are A and B instances, properties or concepts?
- If A and B are instances, use only vocabulary suitable for instances
- If A and B are concepts, use only vocabulary suitable for concepts
- If A and B are properties, use only vocabulary suitable for properties
- If either one of A or B is an instance and the other is a concept, use only vocabulary suitable for describing instance-class relationships
- Is A roughly the same as B?
  - If yes: does the difference between "truly exact" and your understanding of B constitute an "acceptable noise level"?
    - If yes: the mapping is `exact`
    - If no: the mapping is `close`
  - If no: determine the precision as described above (`broad`, `narrow` or `related`)
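The precision decision in step 3 can be sketched as a small helper function. This is a simplified, hypothetical encoding for concept-to-concept mappings outside OWL; the boolean inputs stand in for the human judgement the steps above require:

```python
# Simplified sketch of step 3 of the predicate-selection process, for
# concept-to-concept mappings in the SKOS/RDF vocabulary. The booleans
# stand in for a curator's judgement about "roughly the same" and
# "acceptable noise level".

def pick_predicate(roughly_same, noise_acceptable, precision=None):
    """precision: 'broad', 'narrow' or 'related' when A and B are not
    roughly the same (determined as described earlier in the guide)."""
    if roughly_same:
        return "skos:exactMatch" if noise_acceptable else "skos:closeMatch"
    return {"broad": "skos:broadMatch",
            "narrow": "skos:narrowMatch",
            "related": "skos:relatedMatch"}[precision]

print(pick_predicate(True, True))                       # → skos:exactMatch
print(pick_predicate(False, False, precision="broad"))  # → skos:broadMatch
```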
You can now select the mapping predicate based on the table below:
| Mapping predicate | Precision | Suitable semantic framework | Suitable entity types | Acceptable noise |
|-------------------|-----------|-----------------------------|-----------------------|------------------|
| `skos:exactMatch` | exact | RDF(S)/SPARQL, SKOS | concepts | low |
| `skos:closeMatch` | close | RDF(S)/SPARQL, SKOS | concepts | high |
| `skos:broadMatch` | broad | RDF(S)/SPARQL, SKOS | concepts | low to high |
| `skos:narrowMatch` | narrow | RDF(S)/SPARQL, SKOS | concepts | low to high |
| `skos:relatedMatch` | related | RDF(S)/SPARQL, SKOS | concepts | high |
| `owl:sameAs` | exact | OWL, RDF(S)/SPARQL | instances | zero to low |
Note that "acceptable noise" refers to "what is acceptable for the target semantic framework". When using OWL, even a bit of noise can have huge consequences for reasoning, so it is not advisable to use the OWL vocabulary in cases where there is a lot of noise.
Frequently asked questions
- None of the mapping predicates listed here seem to fit for my use case. Can I define my own?
The SSSOM specification is currently open to specifying new mapping predicates. However, it is always advisable to open an issue to discuss such cases with the wider community - there may be some benefit in standardising predicates from the start!