The SSSOM Toolkit
In the following, we give a brief introduction to the SSSOM toolkit. For more detailed documentation, please refer to https://mapping-commons.github.io/sssom-py.
Pre-requisites
- Complete the basic SSSOM tutorial
- Install the SSSOM toolkit. Alternatively, you can install the Ontology Development Kit (ODK) and follow the tutorial using its Docker image.
- We are assuming a Unix shell for this tutorial, but most of the principles should apply to the Windows CMD as well. Windows users may prefer to install the ODK (see above).
Overview
The SSSOM toolkit (STK), previously known as sssom-py, is a set of utility methods for processing SSSOM files, packaged as a command line client (CLI) and a Python package. In the following, we will extract mappings from an ontology and process them with the CLI. The goal is to give a sense of the functionality of the toolkit. Additional and more up-to-date information on usage can be found here.
Table of Contents
- parse: Extracting mappings from an external source
- merge: Combining mappings from several sources
- convert: Converting an SSSOM mapping table into different formats
Extracting mappings from an external source
One key issue developers face is converting the various existing mapping formats into a common representation (e.g. SSSOM). The SSSOM toolkit (STK) already implements parsers for a number of commonly used mapping formats:
- OWL Ontologies
- Alignment API Format (format used by the Ontology Alignment Evaluation Initiative, OAEI)
- Parsers for the SNOMED mapping format and FHIR Concept Map are in development (as of June 2022).
Here we use Uberon, an anatomy ontology in the biomedical domain.
wget http://purl.obolibrary.org/obo/uberon/uberon-base.json -O uberon-base.json
Feel free to download the file manually if you do not have wget installed.
Now use sssom parse to extract all the mappings provided by the ontology. As there are multiple JSON-based formats that can be parsed, you have to tell sssom which format you are using: --input-format obographs-json.
sssom parse uberon-base.json --input-format obographs-json --output uberon.sssom.tsv
From a CLI design perspective, we already notice a few things:
- uberon-base.json is passed to the STK as an argument (without an option like -i). This is the case for most primary inputs (mapping tables, source files) throughout the SSSOM client.
- The output generated by the above command is large. There seem to be a lot of messages where some URL does not follow any known prefixes:
WARNING:root:http://dbpedia.org/ontology/AnatomicalStructure does not follow any known prefixes
WARNING:root:http://uri.neuinfo.org/nif/nifstd/nlx_subcell_100205 does not follow any known prefixes
WARNING:root:http://neurolex.org/wiki/Category:Embryonic_organism does not follow any known prefixes
WARNING:root:http://www.informatics.jax.org/cookbook/figures/figure20.shtml does not follow any known prefixes
WARNING:root:http://mbe.oxfordjournals.org/content/26/3/613/F1.large.jpg does not follow any known prefixes
WARNING:root:http://palaeos.com/vertebrates/glossary/images/450x218xEctocuneiform.gif.pagespeed.ic.kaiuLYQELL.png does not follow any known prefixes
WARNING:root:http://palaeos.com/vertebrates/bones/dermal/images/289x311xPalatine1.gif.pagespeed.ic.tglmNBrF4D.png does not follow any known prefixes
WARNING:root:http://uri.neuinfo.org/nif/nifstd/nifext_14 does not follow any known prefixes
....
Understanding why these warnings appear explains a lot about how SSSOM treats entities in general.
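To get an overview of which namespaces trigger these warnings, it can help to tally them by host name. The following is a minimal sketch, not part of the STK: it embeds a few of the warning lines shown above as a string, and the helper names `unknown_uris` and `count_by_host` are made up for illustration. In practice you would read the captured output of `sssom parse` instead of the embedded sample.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# A few warning lines copied from the output above (illustrative sample).
LOG = """\
WARNING:root:http://dbpedia.org/ontology/AnatomicalStructure does not follow any known prefixes
WARNING:root:http://uri.neuinfo.org/nif/nifstd/nlx_subcell_100205 does not follow any known prefixes
WARNING:root:http://neurolex.org/wiki/Category:Embryonic_organism does not follow any known prefixes
WARNING:root:http://uri.neuinfo.org/nif/nifstd/nifext_14 does not follow any known prefixes
"""

PATTERN = re.compile(r"^WARNING:root:(\S+) does not follow any known prefixes$")


def unknown_uris(log_text):
    """Return the URIs named in 'does not follow any known prefixes' warnings."""
    uris = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            uris.append(match.group(1))
    return uris


def count_by_host(uris):
    """Count URIs per host name, most frequent first."""
    return Counter(urlsplit(uri).netloc for uri in uris).most_common()


hosts = count_by_host(unknown_uris(LOG))
for host, count in hosts:
    print(f"{count}\t{host}")
```

Hosts that appear often and clearly identify a vocabulary (such as dbpedia.org) are good candidates for a custom prefix; one-off links to images or web pages usually are not.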
Why are there so many "does not follow any known prefixes" warnings?
CURIEs are a key concept for the representation of SSSOM documents, in particular its table. All fields that constitute a reference to some entity, such as ids (subject_id, object_id, predicate_id) and other fields such as mapping_justification, are represented in CURIE syntax.
The Semantic Web uses URIs (which look more like URLs than CURIEs) to refer to entities. There is, however, no standard protocol for translating a URI into a Compact URI (or CURIE).
Efforts such as https://bioregistry.io/, https://github.com/prefixcommons or https://identifiers.org/ try to bring some organisation to prefixes. In particular, the first two curate maps between prefixes and URIs.
- URI: http://purl.obolibrary.org/obo/MONDO_0000001
- CURIE: MONDO:0000001
- PREFIX: MONDO
- URI expansion: http://purl.obolibrary.org/obo/MONDO_
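The relationship between these four terms can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used by the STK: the functions `expand` and `contract` are hypothetical helpers, and the prefix map contains only the MONDO expansion from the list above plus the dbpedia namespace that appears in the warnings.

```python
# Prefix map: prefix -> URI expansion.
PREFIX_MAP = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "dbpedia": "http://dbpedia.org/ontology/",
}


def expand(curie, prefix_map):
    """Turn a CURIE such as 'MONDO:0000001' into a full URI."""
    prefix, _, local_id = curie.partition(":")
    if prefix not in prefix_map:
        raise KeyError(f"unknown prefix: {prefix}")
    return prefix_map[prefix] + local_id


def contract(uri, prefix_map):
    """Turn a URI into a CURIE, or return None if no URI expansion matches."""
    # Try longer expansions first so that nested namespaces resolve correctly.
    candidates = sorted(prefix_map.items(), key=lambda item: len(item[1]), reverse=True)
    for prefix, expansion in candidates:
        if uri.startswith(expansion):
            return f"{prefix}:{uri[len(expansion):]}"
    return None


print(expand("MONDO:0000001", PREFIX_MAP))
print(contract("http://purl.obolibrary.org/obo/MONDO_0000001", PREFIX_MAP))
# No expansion in the map covers this URI, so contraction gives None.
print(contract("http://neurolex.org/wiki/Category:Embryonic_organism", PREFIX_MAP))
```

A URI for which `contract` finds no match corresponds to the situation reported by the "does not follow any known prefixes" warnings: without a suitable entry in the prefix map, there is no way to write the entity as a CURIE.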
The problem is that, over the years, many very idiosyncratic URIs were used to denote entities in ontologies. While the STK tries to figure out the correct prefixes using https://bioregistry.io/, it often fails. In these cases, users must provide their own prefix map.
Let's create a simple one and save it as metadata.yml (we call it "metadata" because we will add more metadata to it in this tutorial):
curie_map:
  dbpedia: http://dbpedia.org/ontology/
We can now use this in addition to the default prefix maps:
sssom parse uberon-base.json --input-format obographs-json --metadata metadata.yml --prefix-map-mode merged --output uberon.sssom.tsv
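To take a quick look at the resulting mapping table without the STK, the TSV file can be read with the Python standard library. The sketch below assumes the SSSOM/TSV layout in which metadata is embedded as `#`-prefixed lines above the column header. The sample rows are made up for illustration and are not actual output of the command above, and `read_mappings` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up sample in SSSOM/TSV layout: commented metadata, header row, mappings.
SAMPLE = "\n".join([
    "#curie_map:",
    "#  UBERON: http://purl.obolibrary.org/obo/UBERON_",
    "#  dbpedia: http://dbpedia.org/ontology/",
    "subject_id\tpredicate_id\tobject_id\tmapping_justification",
    "UBERON:0000061\toboInOwl:hasDbXref\tdbpedia:AnatomicalStructure\tsemapv:UnspecifiedMatching",
    "UBERON:0000062\toboInOwl:hasDbXref\tEXAMPLE:0001\tsemapv:UnspecifiedMatching",
]) + "\n"


def read_mappings(handle):
    """Yield one dict per mapping row, skipping '#'-prefixed metadata lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(data_lines, delimiter="\t")


rows = list(read_mappings(io.StringIO(SAMPLE)))
predicates = Counter(row["predicate_id"] for row in rows)
print(len(rows), dict(predicates))
```

To inspect the real file, pass an open file handle instead of the sample, for example `with open("uberon.sssom.tsv", newline="") as handle: rows = list(read_mappings(handle))`.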
Combining mappings from several sources
Converting an SSSOM mapping table into different formats
Other methods:
- cliquesummary
- correlations
- crosstab
- dedupe
- diff
- dosql
- partition
- ptable
- reconcile-prefixes
- rewire
- sort
- sparql
- split
- validate
Under construction.