
The SSSOM Toolkit

The following is a brief introduction to the SSSOM toolkit. For more detailed documentation, please refer to https://mapping-commons.github.io/sssom-py.

Pre-requisites

Overview

The SSSOM toolkit (STK), previously known as sssom-py, is a set of utility methods for processing SSSOM files, packaged as a command-line interface (CLI) and a Python package. In the following, we will extract mappings from an ontology and process them with the CLI. The goal is to give a sense of the functionality of the toolkit. Additional and more up-to-date information on usage can be found in the sssom-py documentation at https://mapping-commons.github.io/sssom-py.

Table of Contents

  1. parse: Extracting mappings from an external source
  2. merge: Combining mappings from several sources
  3. convert: Converting an SSSOM mapping table into different formats

Extracting mappings from an external source

One key issue developers face is converting the many existing mapping formats into a common representation (e.g. SSSOM). The SSSOM toolkit (STK) already implements parsers for a number of commonly used mapping formats:

  1. OWL Ontologies
  2. Alignment API Format (format used by the Ontology Alignment Evaluation Initiative, OAEI)
  3. SNOMED mapping format and FHIR Concept Map (parsers were under development as of June 2022)

Here we use Uberon, an anatomy ontology in the biomedical domain.

wget http://purl.obolibrary.org/obo/uberon/uberon-base.json -O uberon-base.json

Feel free to download the file manually if you do not have wget installed.

Now use sssom parse to extract all the mappings provided by the ontology. As there are multiple JSON-based formats that can be parsed, you have to tell sssom which format you are using: --input-format obographs-json.

sssom parse uberon-base.json --input-format obographs-json --output uberon.sssom.tsv
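To inspect a result like uberon.sssom.tsv from Python without the toolkit, a minimal sketch using only the standard library could look like the following. It assumes the SSSOM/TSV layout in which metadata lines are prefixed with # and precede a tab-separated table. The sample content, the prefixes A and B, and the helper read_sssom_tsv are made up for illustration and are not part of the STK.

```python
import csv
import io

# Made-up sample in the SSSOM/TSV layout: '#'-prefixed metadata, then a TSV table.
SAMPLE = """\
#curie_map:
#  A: http://example.org/a/
#  B: http://example.org/b/
subject_id\tpredicate_id\tobject_id\tmapping_justification
A:1\tskos:exactMatch\tB:9\tsemapv:ManualMappingCuration
A:2\tskos:exactMatch\tB:8\tsemapv:ManualMappingCuration
"""


def read_sssom_tsv(handle):
    """Split an SSSOM/TSV stream into metadata lines and mapping rows."""
    metadata_lines, table_lines = [], []
    for line in handle:
        if line.startswith("#"):
            # Drop the leading '#' so the remaining text is plain YAML.
            metadata_lines.append(line[1:].rstrip("\n"))
        else:
            table_lines.append(line)
    # The first non-comment line is the column header.
    rows = list(csv.DictReader(table_lines, delimiter="\t"))
    return metadata_lines, rows


metadata, rows = read_sssom_tsv(io.StringIO(SAMPLE))
print(len(rows), rows[0]["subject_id"], rows[0]["object_id"])
```

For a real file, replace io.StringIO(SAMPLE) with an open file handle, e.g. open("uberon.sssom.tsv", encoding="utf-8").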

From a CLI design perspective we already notice a few things:

  • uberon-base.json is passed to the STK as an argument (without an option like -i). This is the case for most primary inputs (mapping tables, source files) throughout the SSSOM client.
  • The output generated by the above command is large. There seem to be a lot of messages where some URL does not follow any known prefixes:
WARNING:root:http://dbpedia.org/ontology/AnatomicalStructure does not follow any known prefixes
WARNING:root:http://uri.neuinfo.org/nif/nifstd/nlx_subcell_100205 does not follow any known prefixes
WARNING:root:http://neurolex.org/wiki/Category:Embryonic_organism does not follow any known prefixes
WARNING:root:http://www.informatics.jax.org/cookbook/figures/figure20.shtml does not follow any known prefixes
WARNING:root:http://mbe.oxfordjournals.org/content/26/3/613/F1.large.jpg does not follow any known prefixes
WARNING:root:http://palaeos.com/vertebrates/glossary/images/450x218xEctocuneiform.gif.pagespeed.ic.kaiuLYQELL.png does not follow any known prefixes
WARNING:root:http://palaeos.com/vertebrates/bones/dermal/images/289x311xPalatine1.gif.pagespeed.ic.tglmNBrF4D.png does not follow any known prefixes
WARNING:root:http://uri.neuinfo.org/nif/nifstd/nifext_14 does not follow any known prefixes
....

Understanding these warnings explains a lot about how SSSOM treats entities in general.
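To see which namespaces cause most of these warnings, it can help to group them by host. The following is a minimal sketch, assuming the warnings have been captured as lines of text in exactly the form shown above. The helper unknown_hosts is hypothetical and not part of the STK.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the warning format shown in the output above.
WARNING_RE = re.compile(r"^WARNING:root:(\S+) does not follow any known prefixes$")


def unknown_hosts(log_lines):
    """Count 'unknown prefix' warnings per URL host."""
    counts = Counter()
    for line in log_lines:
        match = WARNING_RE.match(line.strip())
        if match:
            counts[urlsplit(match.group(1)).netloc] += 1
    return counts


log = [
    "WARNING:root:http://dbpedia.org/ontology/AnatomicalStructure does not follow any known prefixes",
    "WARNING:root:http://uri.neuinfo.org/nif/nifstd/nlx_subcell_100205 does not follow any known prefixes",
    "WARNING:root:http://neurolex.org/wiki/Category:Embryonic_organism does not follow any known prefixes",
    "WARNING:root:http://uri.neuinfo.org/nif/nifstd/nifext_14 does not follow any known prefixes",
]

hosts = unknown_hosts(log)
for host, count in hosts.most_common():
    print(host, count)
```

Hosts that appear often are good candidates for an entry in a custom prefix map.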

Why are there so many "does not follow any known prefixes" warnings?

CURIEs are a key concept in the representation of SSSOM documents, in particular the mapping table. All fields that reference an entity, such as the IDs (subject_id, object_id, predicate_id) and other fields such as mapping_justification, are represented in CURIE syntax.

The Semantic Web uses URIs (which look more like URLs than CURIEs) to refer to entities. There is, however, no standard protocol for translating a URI into a Compact URI (CURIE).

Efforts such as https://bioregistry.io/, https://github.com/prefixcommons or https://identifiers.org/ try to bring some organisation to prefixes. In particular, the first two curate maps between prefixes and URIs.

  • URI: http://purl.obolibrary.org/obo/MONDO_0000001
  • CURIE: MONDO:0000001
  • PREFIX: MONDO
  • URI expansion: http://purl.obolibrary.org/obo/MONDO_
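The relationship between the four items above can be sketched in a few lines of Python. This is an illustration of the general idea of a prefix map, not the STK's actual implementation. The functions expand and contract are hypothetical helpers, and picking the longest matching expansion is an assumption made here to handle overlapping namespaces.

```python
# Prefix map taken from the example above: prefix -> URI expansion.
PREFIX_MAP = {"MONDO": "http://purl.obolibrary.org/obo/MONDO_"}


def expand(curie, prefix_map):
    """Turn a CURIE such as MONDO:0000001 into a full URI."""
    prefix, _, local_id = curie.partition(":")
    return prefix_map[prefix] + local_id


def contract(uri, prefix_map):
    """Turn a URI into a CURIE, or return None if no prefix matches."""
    best = None
    for prefix, expansion in prefix_map.items():
        # Assumption: prefer the longest expansion when several match.
        if uri.startswith(expansion) and (best is None or len(expansion) > len(best[1])):
            best = (prefix, expansion)
    if best is None:
        return None
    return f"{best[0]}:{uri[len(best[1]):]}"


print(expand("MONDO:0000001", PREFIX_MAP))
print(contract("http://purl.obolibrary.org/obo/MONDO_0000001", PREFIX_MAP))
print(contract("http://dbpedia.org/ontology/AnatomicalStructure", PREFIX_MAP))
```

The last call returns None because no expansion in the map matches the URI, which is the situation in which a tool has to warn about an unknown prefix.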

The problem is that, over the years, many very idiosyncratic URIs were used to denote entities in ontologies. While the STK tries to figure out the correct prefixes using https://bioregistry.io/, it often fails. In these cases, users must provide their own prefix map.

Let's create a simple one and save it as metadata.yml (we call it "metadata" because we will add more metadata to it in this tutorial):

curie_map:
  dbpedia: http://dbpedia.org/ontology/

We can now use this in addition to the default prefix maps:

sssom parse uberon-base.json --input-format obographs-json --metadata metadata.yml --prefix-map-mode merged --output uberon.sssom.tsv
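Conceptually, combining a custom prefix map with default ones can be pictured as a dictionary merge. The sketch below is an illustration only. DEFAULT_PREFIXES is a one-entry stand-in for the much larger default maps, and letting custom entries win on conflict is an assumption, not confirmed STK behaviour. The helpers merge_prefix_maps and is_covered are hypothetical.

```python
# Stand-in for the default prefix maps (assumption: the real ones are far larger).
DEFAULT_PREFIXES = {"UBERON": "http://purl.obolibrary.org/obo/UBERON_"}

# The custom map from metadata.yml above.
CUSTOM_PREFIXES = {"dbpedia": "http://dbpedia.org/ontology/"}


def merge_prefix_maps(defaults, custom):
    """Combine two prefix maps; custom entries override defaults (assumption)."""
    merged = dict(defaults)
    merged.update(custom)
    return merged


def is_covered(uri, prefix_map):
    """Return True if any URI expansion in the map is a prefix of the URI."""
    return any(uri.startswith(expansion) for expansion in prefix_map.values())


merged = merge_prefix_maps(DEFAULT_PREFIXES, CUSTOM_PREFIXES)
uri = "http://dbpedia.org/ontology/AnatomicalStructure"
print(is_covered(uri, DEFAULT_PREFIXES), is_covered(uri, merged))
```

With the merged map, the dbpedia URI from the first warning is covered, while URIs from other unknown namespaces would still need their own entries.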

Combining mappings from several sources

Converting an SSSOM mapping table into different formats

Other methods:

  • cliquesummary
  • correlations
  • crosstab
  • dedupe
  • diff
  • dosql
  • partition
  • ptable
  • reconcile-prefixes
  • rewire
  • sort
  • sparql
  • split
  • validate

Under construction.