1st Mapping Commons Workshop on Simple Standard for Sharing Ontology Mappings
When: 03.09.2021
For a list of participants see:
- Wikidata: https://www.wikidata.org/wiki/Q108394519
- Scholia: https://scholia.toolforge.org/event/Q108394519
In 2020, we introduced the Simple Standard for Sharing Ontology Mappings (SSSOM) as a way for the mapping community to exchange and consolidate mappings using a simple TSV format (a minimal sketch of such a file follows the list below). SSSOM seeks to solve, in particular, the following problems:
- Standardising the mapping metadata that is necessary to drive data transformation and knowledge graph merging use cases
- Enabling effective merging and filtering of mapping sets
- Standardising the representation of mapping sets across formats such as RDF/XML, JSON-LD, TSV, and others.
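As context for the discussion, here is a minimal, non-normative sketch of what writing such a TSV might look like. The column names (subject_id, subject_label, predicate_id, object_id, object_label, confidence) follow the draft SSSOM spec, and the HP-to-MP row is the example used in the SSSOM documentation; the confidence value is illustrative.

```python
import csv

# Illustrative only: a tiny SSSOM-style mapping set using a subset of the
# draft spec's columns. Check the SSSOM documentation for the authoritative
# slot list and controlled vocabularies.
COLUMNS = ["subject_id", "subject_label", "predicate_id",
           "object_id", "object_label", "confidence"]

ROWS = [
    {"subject_id": "HP:0009124", "subject_label": "Abnormal adipose tissue morphology",
     "predicate_id": "skos:exactMatch",
     "object_id": "MP:0000003", "object_label": "abnormal adipose tissue morphology",
     "confidence": 0.95},  # confidence value is made up for illustration
]

with open("example.sssom.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=COLUMNS, delimiter="\t")
    writer.writeheader()
    writer.writerows(ROWS)
```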
The purpose of this 3-hour workshop:
- Describing current use cases to the community, and ensuring all community use cases are documented and understood
- Establishing a user forum for getting support and providing feedback
- Defining a simple governance strategy for the organic evolution of the standard
- Describing a number of key open issues:
- The representation of complex mappings
- The representation of curation rules
- The problem of predicate modification
- The alignment with external standards such as PROV-O and Alignment API
- Defining the path to a stable SSSOM beta release and rallying support for the paper
Resources:
Outcomes
- Members of the SSSOM core team are organised as a GitHub team: https://github.com/orgs/mapping-commons/teams/sssom-core
- If you want to become a member, please make an issue here: https://github.com/mapping-commons/SSSOM/issues
- We added some of the questions asked to the new SSSOM FAQ: https://mapping-commons.github.io/sssom/faq/
- Governance proposal (comments welcome): https://github.com/mapping-commons/SSSOM/issues/82
- Governance will evolve over time; the standard and its governance will evolve together.
- We will use versioning (similar to SemVer) and should denote when backwards-incompatible changes happen
- The 5-Star system for open FAIR mappings is now in its first official version.
Discussion summary
- Ben Gyori: beyond the format itself, it would be interesting to discuss whether there will be a central repository, or whether primary developers will publish their mappings as an export hosted alongside their other artifacts. Would there be a process to pull those in?
- Nico: uptake of new publication systems takes a long time, so this could take a while -> it may be better to promote adoption at the ontology level. This could also have the side benefit of providing a point of introspection
- John G: I totally would want BioPortal to be capable of managing the RDF produced from SSSOM resources, and for Bioportal to be a mapping resource and not simply an ontology resource. I suspect the RDF patterns that SSSOM is defining are the gold we'll need for that gold standard for exchanging mappings. Uploading the RDF files can trivially be done in a naive way of course, but integrating that RDF knowledge into Bioportal to make them maximally useful as a separate kind of resource is obviously 'real work' (and so schedules are unknowable).
- John: how dependent is the library on the software itself? Is it an exchange principle?
- Nico: LinkML has the advantage that it gets JSON and TTL outputs for free if we use it. It would also be advantageous if more people used this standard for metamodeling, to create similar outputs for different models
- Charlie: using "frontmatter" format for SSSOM TSV files, like how github is using frontmatter in Jekyll (ref: https://blog.datacite.org/using-yaml-frontmatter-with-csv/) (http://csvy.org/) (see the frontmatter sketch after this discussion list)
- John G: Analogous to frontmatter format, I keep being drawn to the SKOS Play format as an alternate (but I think TTL-compatible) format for the SSSOM content. How bad would that be? (I can create a ticket)
- Charlie: Requirements for a default JSON-LD context (e.g., prefix -> URL prefix mapping) (a minimal context sketch appears after this list)
- How should it be maintained? Should it continue to be manually curated, or is an automated export from something like the Bioregistry a good idea? If it's automatically exported from the Bioregistry, what kinds of interactions might users want to have via the Bioregistry issue tracker to propose improvements? Similarly, we can make tutorials for directly creating PRs.
- Charlie: How prefixes should be stylized/what is the business logic/decision tree for using OBO Library PURLs, Identifiers.org URLs, Bioregistry URLs, first-party provider URLs, etc. based on what's available and mapped between various first-party providers, third-party providers (e.g., ChemSpider InChI resolver), and meta-providers (e.g., Identifiers.org, OntoBee)? This is both a concern for "best practices" in SSSOM defining a custom context and also when using or extending the default context.
- Charlie: How to represent mappings where the curator is unsure if the relation is correct or not? This happens often when curating equivalences, e.g., in Biomappings https://github.com/biomappings/biomappings/blob/master/src/biomappings/resources/unsure.tsv
- Tiffany: Is it important to know why someone feels more or less confident about a mapping? If so, is there also a way to include that in the measure of “confidence”?
- Sue: In practice I’ve tended to add comments when I am uncertain and have questions. Possibly this could be formalized?
- Davera, clinical use case discussion. Overall issue: mapping sets of things to a single term is a goal for clinical mappings
- mapping recommendations/rationale exercise
- staging and diagnosis information (like stage 1 or stage 2 of a given cancer)
- Select a set of stages - this is challenging wrt mappings
- Different kinds of scales describing the same thing are hard/sometimes not "kosher" to mix
- Phenotypes rely on capturing human-readable data on the decision logic of how mappings are applied by the standards implementation team
- Proposed to look at the HL7 Implementation profiles as a way to incorporate an approach to this complex mapping challenge
- Melissa: rename SSSOM to Slytherin Standard.
- Charlie 100% supports this (Tiffany: +1; Alex +1)
- Kristin also likes this.
- John used it.
- John G: Ontology repositories are mappings-motivated, to both provide to users good mappings, and to provide good ways for users or managers to ingest, manage, apply, and create mapping knowledge. Ontology repositories are presumably also capable of storing mappings in their semantic (RDF-equivalent) format. With this in mind, is the concept of a "mapping server" equivalent, complementary, or antagonistic to the existing ontology repositories?
- John G: Need to consider identification and versioning of the mapping artifacts. It's one thing to say "We have all the mapping artifacts and we are giving those out", but (just like ontologies) citing a mapping artifact requires that you have a unique identifier for that artifact, and that the identifier incorporates the fact that the artifact may have multiple versions. Ideally the SSSOM artifacts (like ontologies) would (a) be accessible in a defined format at the identifier IRI, and (b) include their identifiers within the SSSOM artifact. I am thinking that an SSSOM artifact is inherently a semantic artifact, and therefore it should follow semantic namespace declaration principles in this regard.
- Julie: w3id supports regex-based redirects (for PURLs)
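Following up on the csvy/frontmatter point above: a rough sketch of mapping-set metadata as a YAML block ahead of the TSV rows. Whether SSSOM should use ----delimited frontmatter, a #-commented YAML header, or a separate YAML file is exactly the open question, so the delimiters below are placeholders; mapping_set_id, license, and curie_map are slots from the draft metadata model.

```python
# Sketch of the "frontmatter" idea: mapping-set metadata as a YAML block ahead
# of the TSV rows (csvy-style). The '---' delimiters are placeholders; the
# actual SSSOM convention was still under discussion at the workshop.
EXAMPLE = """\
---
mapping_set_id: https://example.org/mappings/hp-mp.sssom.tsv
license: https://creativecommons.org/publicdomain/zero/1.0/
curie_map:
  HP: http://purl.obolibrary.org/obo/HP_
  MP: http://purl.obolibrary.org/obo/MP_
---
subject_id\tpredicate_id\tobject_id\tconfidence
HP:0009124\tskos:exactMatch\tMP:0000003\t0.95
"""

def split_frontmatter(text):
    """Separate the YAML header from the TSV body (naive sketch)."""
    _, yaml_block, tsv_block = text.split("---\n", 2)
    return yaml_block, tsv_block
```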
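On the default JSON-LD context requirement above, the core of such a context is just a prefix -> URL prefix map. A minimal sketch with illustrative prefixes only; whether the map is hand-curated or exported from the Bioregistry is the open question noted in the discussion.

```python
import json

# Minimal sketch of a default JSON-LD context giving prefix -> URL prefix
# expansions. The set of prefixes shown is illustrative, not the actual
# default context shipped with SSSOM.
DEFAULT_CONTEXT = {
    "@context": {
        "HP": "http://purl.obolibrary.org/obo/HP_",
        "MP": "http://purl.obolibrary.org/obo/MP_",
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "sssom": "https://w3id.org/sssom/",
    }
}

print(json.dumps(DEFAULT_CONTEXT, indent=2))
```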
Breakout sessions
Curation rules: documenting the decision rules on how a mapping was determined
Effective definition of inclusion criteria/exclusion criteria:
- Inclusion example: exact mappings were created between two ontologies because the terms have a string match, or a string match to a synonym plus an xref
- Exclusion example: Only matched on an acronym
Match types
- Cover partial string matches
Other
- documentation
- criteria to distinguish exact from narrow/broad - how exact is exact
- Line between close/narrow/broad
- Direction of narrow and broad
- DOS: I'd favor manual mapping be done on definitions + context in ontology, leaving lexical mappings to machines.
- What metadata could we add to the header to make clear the criteria used? One thing it might be useful to record is whether the ontology context (relationships & location in classification) of mapped terms was used (some ontologies/taxonomies have poor quality graphs but high quality term definitions). (See the strawman sketch at the end of this section.)
- Source string match to target (lexical exact, stem, word [synonym and type]) - needed for both source and target; how do synonyms fit in?
- documentation
- Needs:
- Generalized patterns that relate file header information to row-level information
- Need more expressivity in the match type
- Inclusion and exclusion criteria
- Best practices guide
- Algorithm/tool/similarity measure for computationally derived mappings
- Specificity with respect to the parent concept or portion of the hierarchy that the concept is from
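As a strawman for the header-metadata question raised above, inclusion/exclusion criteria could be recorded as mapping-set-level fields. None of the field names below (inclusion_criteria, exclusion_criteria, context_used) are part of the SSSOM spec; they are hypothetical and only illustrate the kind of information the breakout discussed.

```python
# Strawman only: hypothetical mapping-set-level fields for recording curation
# rules. None of these field names are defined in the SSSOM spec.
curation_rules_header = {
    "inclusion_criteria": [
        "exact string match between term labels",
        "string match to a synonym plus a shared xref",
    ],
    "exclusion_criteria": [
        "match on an acronym only",
    ],
    "context_used": "definitions and ontology context, not lexical match alone",
}
```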
Mapping provenance and alignment with external provenance standards
- Problems:
- We need to distinguish original and derived mappings
- We need to somehow “encode” how a derived mapping was created (for example, through a walk)
- Useful to capture as part of the PROV activity (a PROV-O sketch appears at the end of this section):
- agents (wasAssociatedWith some)
- mapping tool
- creator
- algorithms
- semantic similarity etc
- Why provenance: “I don't trust mappings from source X…”
- Who did it? What tools were used? These are the most important.
- When completed, how often updated
- Which version of the ontology was the mapping generated from? (There are 20-30 provenance-related properties that could be relevant.)
- list of most-recommended terms as a template: https://github.com/sifrproject/MOD-Ontology/blob/master/mod-v1.4_properties_template.ttl
- Activity manual mapping -> Activity reconciliation
- Inputs and outputs of activities?
- Mapping set activities vs mapping activities
- Shahim: Generic tagging mechanism
- users add tags k:v
- Suggestion: we open the “other” field to arbitrary JSON; if we see that people use something a lot, we allow promoting it to the top level (look at FHIR as an example) (see the sketch at the end of this section)
- James' counter-suggestion: open up the column space and allow QNames in there? It is like Shahim's suggestion, just at the top level
- John: look at the SKOS Play convert tool; it implements arbitrary triples, so you add whatever properties you want in the top row (the top row is actually the first row after the column header row, which begins with an "Identifier" cell)
- Thomas: While it's nice to have the ability to express complex prov (and we should think about it), the important prov files are not that complex. The minimum should be something like: creator, creation date, algorithm, ... see below "list of critical (minimum) prov information"
- John: the world is changing; we can assume a bit more complexity
- All: A short list of the critical provenance information is needed, but there should also be a mechanism to add other ecosystem-specific provenance.
- Versioning:
- We need to carefully think through versioning of mapping sets. Versioning should be similar to ontology artefacts, with version IRIs and PURLs
- W3id supports regex based redirects
- We need to introduce versioning for the SSSOM standard itself, with some way to indicate whether breaking changes were introduced
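A rough sketch of the PROV activity pattern discussed above, using rdflib and standard PROV-O terms (prov:Activity, prov:wasAssociatedWith, prov:used, prov:wasGeneratedBy); the agent, tool, and artefact IRIs are made up for illustration.

```python
from rdflib import Graph, Namespace, RDF

# Rough sketch of the PROV activity pattern from the breakout. The PROV-O
# terms are standard; all example IRIs below are made up for illustration.
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/")

g = Graph()
activity = EX["mapping-activity-1"]
g.add((activity, RDF.type, PROV.Activity))
g.add((activity, PROV.wasAssociatedWith, EX["curator-jane"]))     # creator (agent)
g.add((activity, PROV.wasAssociatedWith, EX["tool-lexmatch"]))    # mapping tool (agent)
g.add((activity, PROV.used, EX["hp-2021-08-02.owl"]))             # ontology version used
g.add((EX["hp-mp-mappings-1.0"], PROV.wasGeneratedBy, activity))  # resulting mapping set

print(g.serialize(format="turtle"))
```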
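On the generic tagging suggestion above, a sketch of what arbitrary key:value tags tucked into the "other" field as JSON might look like on a single mapping record. The "other" slot comes from the draft spec, but treating it as free-form JSON is exactly the proposal under discussion, and the tag keys shown are hypothetical.

```python
import json

# Sketch of the tagging suggestion: arbitrary key:value tags carried in the
# "other" field as JSON. The tag keys are hypothetical; widely used tags
# could later be promoted to first-class columns (cf. the FHIR analogy).
mapping_record = {
    "subject_id": "HP:0009124",
    "predicate_id": "skos:exactMatch",
    "object_id": "MP:0000003",
    "other": json.dumps({
        "review_status": "needs_second_reviewer",  # hypothetical tag
        "project": "example-disease-portal",       # hypothetical tag
    }),
}
```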
Representing predicate modification: negation, inverse, direct, indirect etc
https://github.com/mapping-commons/sssom/issues/40
- Negative mappings (e.g., not equivalent to, not related to) have a clear use case in supporting semi-automated curation of mappings to avoid zombie mappings.
- We agreed adding additional syntax to SSSOM would make it less simple and likely less accessible.
- Two candidate solutions for including negative mappings remain: curating a controlled vocabulary of negative relationships (e.g., sssom:notEquivalentTo) OR adding a predicated modifier column. We considered parallel discussions in the LinkML community and examined the use of predicate modifiers in the Gene Ontology Annotation database. Both solutions could work, but we were hesitant to commit to one during the meeting.
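To make the two candidate solutions concrete, here is a sketch of how the same negative assertion might be written under each option; neither the negative-predicate vocabulary (e.g., a term like sssom:notExactMatch) nor a predicate modifier column had been adopted at the time, so both encodings are illustrative only.

```python
# Two illustrative encodings of "HP:0009124 is NOT an exact match to MP:0000001".
# Neither encoding had been adopted by SSSOM at the time of the workshop.

# Option 1: a controlled vocabulary of negative predicates (hypothetical term).
negative_predicate_row = {
    "subject_id": "HP:0009124",
    "predicate_id": "sssom:notExactMatch",  # hypothetical negative predicate
    "object_id": "MP:0000001",
}

# Option 2: keep the positive predicate and add a predicate modifier column.
predicate_modifier_row = {
    "subject_id": "HP:0009124",
    "predicate_id": "skos:exactMatch",
    "predicate_modifier": "Not",            # hypothetical modifier value
    "object_id": "MP:0000001",
}
```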
Mapping (clinical etc) data model elements and values
https://github.com/mapping-commons/sssom/issues/43
Use cases for complex mappings
- https://github.com/mapping-commons/sssom/issues/61
- The main outcome of this discussion was that the participants urged to keep the “Simple” in SSSOM, and that any decision to capture more complex mapping cases should be driven by a very strong use case
- For the first release of the SSSOM standard, we will not worry about complex mappings
Next steps
- Declare stable first version for SSSOM spec (September 2021)
- Write manuscript (September/October 2021)
- Dockerise all mapping related tooling, for example for generation, reconciliation, transformation etc. (December 2021)
- Work with OAEI to publish automated mappings more systematically in SSSOM, including better mapping justifications/curation rules (Early 2022)
- Work with @cmungall & @balhoff to integrate mapping reconciliation as a first-class citizen into mapping pipelines (February 2022)
- Extend OxO to fully support SSSOM data model (prototype SSSOM browser April 2022).