Python API

Datamodel

Top-level Document object

Additional SSSOM object models.

class sssom.sssom_document.MappingSetDocument(mapping_set, converter)[source]

Bases: object

Represents a single SSSOM document.

A document is simply a holder for a MappingSet object plus a CURIE map

mapping_set: MappingSet

The main part of the document: a set of mappings plus metadata.

property prefix_map: Dict[str, str]: Get a prefix map.

A Document holds a MappingSet which is a collection of mappings

I/O

Conversion between TSV/pandas, internal datamodel and RDF

I/O utilities for SSSOM.

sssom.io.annotate_file(input, output=None, replace_multivalued=False, **kwargs)[source]

Annotate a file i.e. add custom metadata to the mapping set.

Parameters:

input (str) – SSSOM tsv file to be queried over.
output (Optional[TextIO]) – Output location.
replace_multivalued (bool) – Multivalued slots should be replaced or not, defaults to False
kwargs – Options provided by the user which are added to the metadata (e.g. --mapping_set_id http://example.org/abcd)

Return type:

MappingSetDataFrame

Returns:

Annotated MappingSetDataFrame object.

sssom.io.convert_file(input_path, output, output_format=None)[source]

Convert a file from one format to another.

Parameters:

input_path (str) – The path to the input SSSOM tsv file
output (TextIO) – The path to the output file. If none is given, will default to using stdout.
output_format (Optional[str]) – The format to which the SSSOM TSV should be converted.

Return type:

None

sssom.io.extract_iris(input, converter)[source]

Recursively extracts a list of IRIs from a string or file.

Parameters:

input (Union[str, Path, Iterable[Union[str, Path]]]) – CURIE OR list of CURIEs OR file path containing the same.
converter (Converter) – Prefix map of mapping set (possibly) containing custom prefix:IRI combination.

Return type:

List[str]

Returns:

A list of IRIs.
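To illustrate the CURIE-to-IRI expansion that this function depends on, here is a minimal standalone sketch. It assumes a plain dict as the prefix map instead of a Converter; the helper names expand_curie and expand_all and the example.org prefixes are hypothetical and not part of sssom.

```python
from typing import Dict, Iterable, List


def expand_curie(curie: str, prefix_map: Dict[str, str]) -> str:
    """Expand a CURIE such as 'x:001' to an IRI (hypothetical helper)."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"Cannot expand {curie!r}: unknown prefix")
    return prefix_map[prefix] + local_id


def expand_all(curies: Iterable[str], prefix_map: Dict[str, str]) -> List[str]:
    """Expand every CURIE in an iterable, preserving order."""
    return [expand_curie(curie, prefix_map) for curie in curies]


# Illustrative prefix map; these IRIs are made up for the example.
example_prefix_map = {"x": "http://example.org/x/", "y": "http://example.org/y/"}
iris = expand_all(["x:001", "y:002"], example_prefix_map)
```

The real function additionally accepts file paths and nested inputs, which this sketch leaves out.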

sssom.io.filter_file(input, output=None, **kwargs)[source]

Filter a dataframe by dynamically generating queries based on user input.

e.g. sssom filter --subject_id x:% --subject_id y:% --object_id y:% --object_id z:% tests/data/basic.tsv

yields the query:

"SELECT * FROM df WHERE (subject_id LIKE 'x:%' OR subject_id LIKE 'y:%') AND (object_id LIKE 'y:%' OR object_id LIKE 'z:%')" and displays the output.

Parameters:

input (str) – SSSOM TSV file to be queried over.
output (Optional[TextIO]) – Output location.
kwargs – Filter options provided by the user which generate queries (e.g. --subject_id x:%).

Raises:

ValueError – If parameter provided is invalid.

Return type:

MappingSetDataFrame

Returns:

Filtered MappingSetDataFrame object.
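The query generation described for filter_file can be sketched as follows: values for the same column are joined with OR, and the per-column groups are joined with AND. This is an illustration under that assumption, not the library's implementation; build_filter_query is a hypothetical name.

```python
from typing import Dict, Sequence


def build_filter_query(filters: Dict[str, Sequence[str]], table: str = "df") -> str:
    """Build a SELECT with OR within one column and AND across columns."""
    clauses = []
    for column, patterns in filters.items():
        # All patterns for one column are alternatives.
        ors = " OR ".join(f"{column} LIKE '{pattern}'" for pattern in patterns)
        clauses.append(f"({ors})")
    # Plain string interpolation: fine for a sketch, unsafe for untrusted input.
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)


query = build_filter_query(
    {"subject_id": ["x:%", "y:%"], "object_id": ["y:%", "z:%"]}
)
```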

sssom.io.get_metadata_and_prefix_map(metadata_path=None, *, prefix_map_mode=None)[source]

Load metadata and a prefix map in a deprecated way.

Return type:: Tuple[Converter, Dict[str, Any]]

Deprecated since version 0.4.3: This functionality for loading SSSOM metadata from a YAML file is deprecated from the public API since it has internal assumptions which are usually not valid for downstream users.

sssom.io.parse_file(input_path, output, *, input_format=None, metadata_path=None, prefix_map_mode=None, clean_prefixes=True, strict_clean_prefixes=True, embedded_mode=True, mapping_predicate_filter=None)[source]

Parse an SSSOM metadata file and write to a table.

Parameters:

input_path (str) – The path to the input file in one of the legal formats, e.g. obographs, alignment-api-xml
output (TextIO) – The path to the output file.
input_format (Optional[str]) – The string denoting the input format.
metadata_path (Optional[str]) – The path to a file containing the sssom metadata (including prefix_map) to be used during parse.
prefix_map_mode (Optional[Literal['metadata_only', 'sssom_default_only', 'merged']]) – Defines whether the prefix map in the metadata should be extended or replaced with the SSSOM default prefix map derived from the bioregistry.
clean_prefixes (bool) – If True (default), records with unknown prefixes are removed from the SSSOM file.
strict_clean_prefixes (bool) – If True (default), clean_prefixes() will be in strict mode.
embedded_mode (bool) – If True (default), the dataframe and metadata are exported in one file (tsv), else two separate files (tsv and yaml).
mapping_predicate_filter (Optional[tuple]) – Optional list of mapping predicates or filepath containing the same.

Return type:

None

sssom.io.run_sql_query(query, inputs, output=None)[source]

Run a SQL query over one or more SSSOM files.

Each of the N inputs is assigned a table name df1, df2, …, dfN

Alternatively, the filenames can be used as table names. These are first stemmed, e.g. ~/dir/my.sssom.tsv becomes a table called 'my'.

Example: sssom dosql -Q "SELECT * FROM df1 WHERE confidence>0.5 ORDER BY confidence" my.sssom.tsv
Example: sssom dosql -Q "SELECT file1.*,file2.object_id AS ext_object_id, file2.object_label AS ext_object_label FROM file1 INNER JOIN file2 WHERE file1.object_id = file2.subject_id" file1.sssom.tsv file2.sssom.tsv

Parameters:

query (str) – Query to be executed over a pandas DataFrame (msdf.df).
inputs (List[str]) – Input files that form the source tables for query.
output (Optional[TextIO]) – Output.

Return type:

MappingSetDataFrame

Returns:

Filtered MappingSetDataFrame object.
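The two table-naming schemes for run_sql_query (positional df1..dfN and stemmed filenames) can be sketched as below. The sketch assumes that stemming means dropping the directory and everything after the first dot; table_names is a hypothetical helper, not an sssom function.

```python
from pathlib import Path
from typing import Dict, List


def table_names(inputs: List[str]) -> Dict[str, str]:
    """Map each candidate table name to the input file it refers to."""
    names: Dict[str, str] = {}
    for index, path in enumerate(inputs, start=1):
        names[f"df{index}"] = path  # positional name: df1, df2, ...
        # Assumed stemming rule: file name up to the first dot.
        names[Path(path).name.split(".", 1)[0]] = path
    return names


names = table_names(["~/dir/my.sssom.tsv", "other.sssom.tsv"])
```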

sssom.io.split_file(input_path, output_directory)[source]

Split an SSSOM TSV by prefixes and relations.

Parameters:

input_path (str) – The path to the input file in one of the legal formats, e.g. obographs, alignment-api-xml
output_directory (Union[str, Path]) – The directory to which the split file should be exported.

Return type:

None

sssom.io.validate_file(input_path, validation_types)[source]

Validate the incoming SSSOM TSV according to the SSSOM specification.

Parameters:

input_path (str) – The path to the input file in one of the legal formats, e.g. obographs, alignment-api-xml
validation_types (List[SchemaValidationType]) – A list of validation types to run.

Return type:

None

Utils

Utils - currently boomer-specific

Utility functions.

class sssom.util.EntityPair(subject_entity, object_entity)[source]

Bases: object

A tuple of entities.

Note that (e1,e2) == (e2,e1)
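One way to obtain the order-insensitive equality noted above is to store the two entities in sorted order and derive both equality and hashing from that. The class below is a hypothetical stand-in written for illustration; it makes no claim about how EntityPair is implemented.

```python
class SymmetricPair:
    """Hypothetical stand-in for an order-insensitive pair of entities."""

    def __init__(self, subject_entity: str, object_entity: str) -> None:
        # Sorting makes (a, b) and (b, a) share the same internal state.
        self.first, self.second = sorted((subject_entity, object_entity))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SymmetricPair):
            return NotImplemented
        return (self.first, self.second) == (other.first, other.second)

    def __hash__(self) -> int:
        # Hash must agree with equality so pairs work as dict keys.
        return hash((self.first, self.second))


pair_ab = SymmetricPair("x:1", "y:2")
pair_ba = SymmetricPair("y:2", "x:1")
```

Because hashing agrees with equality, both orderings collapse to one key when such pairs are used in sets or dictionaries.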

sssom.util.KEY_FEATURES = ['subject_id', 'predicate_id', 'object_id', 'predicate_modifier']: The 4 columns whose combination would be used as primary keys while merging/grouping

class sssom.util.MappingSetDataFrame(df, converter=<factory>, metadata=<factory>)[source]

Bases: object

A collection of mappings represented as a DataFrame, together with additional metadata.

clean_context()[source]

Clean up the context.

Return type:: None

clean_prefix_map(strict=True)[source]

Remove unused prefixes from the internal prefix map based on the internal dataframe.

Parameters:: strict (bool) – If True, raise an error if any prefix in the dataframe is not listed in the 'curie_map'.
Raises:: ValueError – If prefixes absent in ‘curie_map’ and strict flag = True
Return type:: None

classmethod from_mapping_set(mapping_set, *, converter=None)[source]

Instantiate from a mapping set and an optional converter.

Parameters:

mapping_set (MappingSet) – A mapping set
converter (Union[None, Mapping[str, str], Converter]) – A prefix map or pre-instantiated converter. If none given, uses a default prefix map derived from the Bioregistry.

Return type:

MappingSetDataFrame

Returns:

A mapping set dataframe

classmethod from_mapping_set_document(doc)[source]

Instantiate from a mapping set document.

Return type:: MappingSetDataFrame

classmethod from_mappings(mappings, *, converter=None, metadata=None)[source]

Instantiate from a list of mappings, mapping set metadata, and an optional converter.

Return type:: MappingSetDataFrame

merge(*msdfs, inplace=True)[source]

Merge one or more MappingSetDataFrames with this one.

Parameters:

msdfs (MappingSetDataFrame) – One or more MappingSetDataFrames to merge with self
inplace (bool) – If True, the given MappingSetDataFrames are merged into the calling MappingSetDataFrame; if False, the merged data frame is simply returned.

Return type:

MappingSetDataFrame

Returns:

Merged MappingSetDataFrame

property prefix_map: Get a simple, bijective prefix map.

remove_mappings(msdf)[source]

Remove mappings in right msdf from left msdf.

Parameters:: msdf (MappingSetDataFrame) – MappingSetDataframe object to be removed from primary msdf object.
Return type:: None

standardize_references()[source]

Standardize this MSDF’s dataframe and metadata with respect to its converter.

Return type:: None

to_mapping_set()[source]

Get a mapping set.

Return type:: MappingSet

to_mapping_set_document()[source]

Get a mapping set document.

Return type:: MappingSetDocument

to_mappings()[source]

Get a list of mappings.

Return type:: List[Mapping]

classmethod with_converter(converter, df, metadata=None)[source]

Instantiate with a converter instead of a vanilla prefix map.

Return type:: MappingSetDataFrame

class sssom.util.MappingSetDiff(unique_tuples1=None, unique_tuples2=None, common_tuples=None, combined_dataframe=None)[source]

Bases: object

Represents a difference between two mapping sets.

Currently this is limited to diffs at the level of entity-pairs. For example, if file1 has A owl:equivalentClass B, and file2 has A skos:closeMatch B, this is considered a mapping in common.

combined_dataframe: Optional[DataFrame] = None: Dataframe that combines the left and right dataframes, with information injected into the comment column
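A diff at the level of entity pairs can be sketched with plain sets: each mapping is reduced to its unordered subject/object pair, and the predicate is ignored. The tuples and variable names below are illustrative assumptions, not the library's data structures.

```python
from typing import FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject_id, predicate_id, object_id)


def entity_pairs(mappings: List[Triple]) -> Set[FrozenSet[str]]:
    """Reduce mappings to unordered subject/object pairs, ignoring the predicate."""
    return {frozenset((s, o)) for s, _p, o in mappings}


file1 = [("A", "owl:equivalentClass", "B"), ("A", "skos:exactMatch", "C")]
file2 = [("A", "skos:closeMatch", "B"), ("D", "skos:exactMatch", "E")]

pairs1, pairs2 = entity_pairs(file1), entity_pairs(file2)
common = pairs1 & pairs2   # pairs found in both files
unique1 = pairs1 - pairs2  # pairs only in file1
unique2 = pairs2 - pairs1  # pairs only in file2
```

Here A/B counts as a common mapping even though the two files use different predicates, matching the example in the description.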

sssom.util.add_default_confidence(df, confidence=nan)[source]

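Add a confidence column to the DataFrame if absent and initialize it to the given confidence value (nan by default, per the signature above).

If a confidence column already exists, only the missing (None) entries are filled in with that value.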

Parameters:: df (DataFrame) – DataFrame whose confidence column needs to be filled.
Return type:: DataFrame
Returns:: DataFrame with a complete confidence column.

sssom.util.are_params_slots(params)[source]

Check if parameters conform to the slots in MAPPING_SET_SLOTS.

Parameters:: params (dict) – Dictionary of parameters.
Raises:: ValueError – If params are not slots.
Return type:: bool
Returns:: True/False

sssom.util.assign_default_confidence(df)[source]

Assign numpy.nan to confidence values that are blank.

Parameters:: df (DataFrame) – SSSOM DataFrame
Return type:: Tuple[DataFrame, DataFrame]
Returns:: A Tuple consisting of the original DataFrame and dataframe consisting of empty confidence values.

sssom.util.augment_metadata(msdf, meta, replace_multivalued=False)[source]

Augment metadata with parameters passed.

Parameters:

msdf (MappingSetDataFrame) – MappingSetDataFrame (MSDF) object.
meta (dict) – Dictionary that needs to be added/updated to the metadata of the MSDF.
replace_multivalued (bool) – Multivalued slots should be replaced or not, defaults to False.

Raises:

ValueError – If type of slot is neither str nor list.

Return type:

MappingSetDataFrame

Returns:

MSDF with updated metadata.

sssom.util.collapse(df)[source]

Collapse rows with the same S/P/O and combine their confidence values.

Return type:: DataFrame

sssom.util.compare_dataframes(df1, df2)[source]

Perform a diff between two SSSOM dataframes.

Parameters:

df1 (DataFrame) – A mapping dataframe
df2 (DataFrame) – A mapping dataframe

Return type:

MappingSetDiff

Returns:

A mapping set diff

Warning

currently does not discriminate between mappings with different predicates

sssom.util.create_entity(identifier, mappings)[source]

Create an Entity object.

Parameters:

identifier (str) – Entity Id
mappings (Dict[str, Any]) – Mapping dictionary

Return type:

Uriorcurie

Returns:

An Entity object

sssom.util.dataframe_to_ptable(df, *, inverse_factor=None, default_confidence=None)[source]

Export a KBOOM table.

Parameters:

df (DataFrame) – Pandas DataFrame
inverse_factor (Optional[float]) – Multiplier to (1 - confidence), defaults to 0.5
default_confidence (Optional[float]) – Default confidence to be assigned if absent.

Raises:

ValueError – Predicate value error
ValueError – Predicate type value error

Returns:

List of rows

sssom.util.deal_with_negation(df)[source]

Combine negative and positive rows with matching [SUBJECT_ID, OBJECT_ID, CONFIDENCE] combination.

Rule: negative trumps positive if the absolute values of the confidence values are equal.

Parameters:: df (DataFrame) – Merged Pandas DataFrame
Return type:: DataFrame
Returns:: Pandas DataFrame with negations addressed
Raises:: ValueError – If the dataframe is none after assigning default confidence

sssom.util.filter_out_prefixes(df, filter_prefixes, features=None, require_all_prefixes=False)[source]

Filter out rows which contain a CURIE with a prefix in the filter_prefixes list.

Parameters:

df (DataFrame) – Pandas DataFrame of SSSOM Mapping
filter_prefixes (List[str]) – List of prefixes
features (Optional[list]) – List of dataframe column names to consider
require_all_prefixes (bool) – If True, all prefixes must be present in a row to be filtered out

Return type:

DataFrame

Returns:

Pandas Dataframe

sssom.util.filter_prefixes(df, filter_prefixes, features=None, require_all_prefixes=True)[source]

Filter out rows which do NOT contain a CURIE with a prefix in the filter_prefixes list.

Parameters:

df (DataFrame) – Pandas DataFrame of SSSOM Mapping
filter_prefixes (List[str]) – List of prefixes
features (Optional[list]) – List of dataframe column names to consider
require_all_prefixes (bool) – If True, all prefixes must be present in a row to be filtered out

Return type:

DataFrame

Returns:

Pandas Dataframe

sssom.util.filter_redundant_rows(df, ignore_predicate=False)[source]

Remove rows if there is another row with same S/O and higher confidence.

Parameters:

df (DataFrame) – Pandas DataFrame to filter
ignore_predicate (bool) – If true, the predicate_id column is ignored, defaults to False

Return type:

DataFrame

Returns:

Filtered pandas DataFrame
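The rule "drop a row if another row with the same key has higher confidence" can be sketched on plain dictionaries instead of a DataFrame. The function name, the tie handling (first row wins) and the example rows are assumptions made for this illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def drop_lower_confidence(rows: List[Row], ignore_predicate: bool = False) -> List[Row]:
    """Keep, per key, only the row with the highest confidence."""
    best: Dict[Tuple[Any, ...], Row] = {}
    for row in rows:
        key: Tuple[Any, ...] = (row["subject_id"], row["object_id"])
        if not ignore_predicate:
            key += (row["predicate_id"],)  # S/P/O key instead of S/O
        # On ties the first row seen is kept (an arbitrary choice here).
        if key not in best or row["confidence"] > best[key]["confidence"]:
            best[key] = row
    return list(best.values())


rows = [
    {"subject_id": "x:1", "predicate_id": "skos:exactMatch", "object_id": "y:1", "confidence": 0.9},
    {"subject_id": "x:1", "predicate_id": "skos:exactMatch", "object_id": "y:1", "confidence": 0.6},
    {"subject_id": "x:1", "predicate_id": "skos:closeMatch", "object_id": "y:1", "confidence": 0.7},
]
kept = drop_lower_confidence(rows)
kept_ignoring_predicate = drop_lower_confidence(rows, ignore_predicate=True)
```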

sssom.util.get_all_prefixes(msdf)[source]

Fetch all prefixes in the MappingSetDataFrame.

Parameters:

msdf (MappingSetDataFrame) – MappingSetDataFrame

Raises:

ValidationError – If slot is wrong.

Return type:

Set[str]

Returns:

Set of all prefixes.

sssom.util.get_dict_from_mapping(map_obj)[source]

Get information for linkml objects (MatchTypeEnum, PredicateModifierEnum) from the Mapping object and return the dictionary form of the object.

Parameters:: map_obj (Union[Any, Dict[Any, Any], Mapping]) – Mapping object
Return type:: dict
Returns:: Dictionary

sssom.util.get_file_extension(file)[source]

Get file extension.

Parameters:: file (Union[str, Path, TextIO]) – File path
Return type:: str
Returns:: format of the file passed, default tsv

sssom.util.get_prefix_from_curie(curie)[source]

Get the prefix from a CURIE.

Return type:: str

sssom.util.get_prefixes_used_in_metadata(meta)[source]

Get a set of prefixes used in CURIEs in the metadata.

Return type:: Set[str]

sssom.util.get_prefixes_used_in_table(df, converter)[source]

Get a set of prefixes used in CURIEs in key feature columns in a dataframe.

Return type:: Set[str]

sssom.util.get_row_based_on_hierarchy(df)[source]

Get row based on hierarchy of predicates.

The hierarchy is as follows:

1. owl:equivalentClass
2. owl:equivalentProperty
3. rdfs:subClassOf
4. rdfs:subPropertyOf
5. owl:sameAs
6. skos:exactMatch
7. skos:closeMatch
8. skos:broadMatch
9. skos:narrowMatch
10. oboInOwl:hasDbXref
11. skos:relatedMatch
12. rdfs:seeAlso

Parameters:: df (DataFrame) – Dataframe containing multiple predicates for same subject and object.
Return type:: DataFrame
Returns:: Dataframe with a single row which ranks higher in the hierarchy.
Raises:: KeyError – if no rows are available
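Selecting a row by predicate hierarchy can be sketched as picking the row whose predicate appears earliest in the ordered list given above. The function name and the example rows are hypothetical; rows are plain dictionaries rather than DataFrame rows.

```python
from typing import Dict, List

# Predicates in descending priority, copied from the hierarchy above.
PREDICATE_HIERARCHY = [
    "owl:equivalentClass",
    "owl:equivalentProperty",
    "rdfs:subClassOf",
    "rdfs:subPropertyOf",
    "owl:sameAs",
    "skos:exactMatch",
    "skos:closeMatch",
    "skos:broadMatch",
    "skos:narrowMatch",
    "oboInOwl:hasDbXref",
    "skos:relatedMatch",
    "rdfs:seeAlso",
]


def pick_by_hierarchy(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the row whose predicate ranks highest in the hierarchy."""
    if not rows:
        raise KeyError("no rows available")
    # list.index raises ValueError for predicates outside the hierarchy.
    return min(rows, key=lambda row: PREDICATE_HIERARCHY.index(row["predicate_id"]))


candidates = [
    {"subject_id": "x:1", "predicate_id": "skos:closeMatch", "object_id": "y:1"},
    {"subject_id": "x:1", "predicate_id": "owl:sameAs", "object_id": "y:1"},
]
chosen = pick_by_hierarchy(candidates)
```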

sssom.util.group_mappings(df)[source]

Group mappings by EntityPairs.

Return type:: Dict[EntityPair, List[Series]]

sssom.util.inject_metadata_into_df(msdf)[source]

Inject the key-value pairs of the metadata dictionary as columns into the DataFrame of a MappingSetDataFrame.

Parameters:: msdf (MappingSetDataFrame) – MappingSetDataFrame with metadata separate.
Return type:: MappingSetDataFrame
Returns:: MappingSetDataFrame with metadata as columns

sssom.util.invert_mappings(df, subject_prefix=None, merge_inverted=True, update_justification=True, predicate_invert_dictionary=None)[source]

Switch subjects and objects based on their prefixes and adjust predicates accordingly.

Parameters:

df (DataFrame) – Pandas dataframe.
subject_prefix (Optional[str]) – Prefix of subjects desired.
merge_inverted (bool) – If True (default), add the inverted dataframe to the input; otherwise, return only the inverted data.
update_justification (bool) – If True (default), the justification is updated to "semapv:MappingInversion"; otherwise it is left as it is.
predicate_invert_dictionary (Optional[dict]) – Dictionary (e.g. loaded from a YAML file) providing the inverse mapping for predicates.

Return type:

DataFrame

Returns:

Pandas dataframe with all subject IDs having the same prefix.
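The inversion idea can be sketched on dictionary rows: swap subject and object, replace the predicate with its inverse, and update the justification. The inverse table below is a small assumed example, and both function names are hypothetical; the library's own dictionary and edge-case handling may differ.

```python
from typing import Dict, List

# Assumed inverse table for illustration; symmetric predicates map to themselves.
INVERSE_PREDICATES = {
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
    "skos:exactMatch": "skos:exactMatch",
}


def invert_row(row: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the row with subject/object swapped and predicate inverted."""
    inverted = dict(row)
    inverted["subject_id"], inverted["object_id"] = row["object_id"], row["subject_id"]
    inverted["predicate_id"] = INVERSE_PREDICATES[row["predicate_id"]]
    inverted["mapping_justification"] = "semapv:MappingInversion"
    return inverted


def invert_to_subject_prefix(rows: List[Dict[str, str]], subject_prefix: str) -> List[Dict[str, str]]:
    """Invert rows whose object, but not subject, carries the desired prefix."""
    result = []
    for row in rows:
        subject_ok = row["subject_id"].startswith(subject_prefix + ":")
        object_ok = row["object_id"].startswith(subject_prefix + ":")
        result.append(invert_row(row) if object_ok and not subject_ok else row)
    return result


rows = [
    {
        "subject_id": "y:1",
        "predicate_id": "skos:broadMatch",
        "object_id": "x:1",
        "mapping_justification": "semapv:ManualMappingCuration",
    }
]
inverted_rows = invert_to_subject_prefix(rows, "x")
```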

sssom.util.is_multivalued_slot(slot)[source]

Check whether the slot is multivalued according to the SSSOM specification.

Parameters:: slot (str) – Slot name
Return type:: bool
Returns:: True if the slot is multivalued, otherwise False

sssom.util.merge_msdf(*msdfs, reconcile=False)[source]

Merge multiple MappingSetDataFrames into one.

Parameters:

msdfs (MappingSetDataFrame) – A Tuple of MappingSetDataFrames to be merged
reconcile (bool) – If True, dedupe (remove redundant lower-confidence mappings) and reconcile (if the msdf contains a higher-confidence negative mapping, remove the lower-confidence positive one; if the confidence is the same, prefer HumanCurated; if both are HumanCurated, prefer the negative mapping). Defaults to False.

Return type:

MappingSetDataFrame

Returns:

Merged MappingSetDataFrame.

sssom.util.parse(filename)[source]

Parse a TSV to a pandas frame.

Return type:: DataFrame

sssom.util.raise_for_bad_path(file_path)[source]

Raise exception if file path is invalid.

Parameters:: file_path (Union[str, Path]) – File path
Raises:: FileNotFoundError – Invalid file path
Return type:: None

sssom.util.reconcile_prefix_and_data(msdf, prefix_reconciliation)[source]

Reconcile the prefix_map and apply the corresponding CURIE prefix changes in the dataframe.

Parameters:

msdf (MappingSetDataFrame) – Mapping Set DataFrame.
prefix_reconciliation (dict) – Prefix reconciliation dictionary from a YAML file

Return type:

MappingSetDataFrame

Returns:

Mapping Set DataFrame with reconciled prefix_map and data.

This method is built on curies.remap_curie_prefixes() and curies.rewire(). Note that if you want to overwrite a CURIE prefix in the Bioregistry extended prefix map, you need to provide a place for the old one to go, as in {"geo": "ncbi.geo", "geogeo": "geo"}. Just doing {"geogeo": "geo"} would not work since geo already exists.
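The effect of such a remapping on individual CURIEs can be sketched without the curies package. The point of the sketch is that each prefix is looked up exactly once, so the renames behave as if applied simultaneously. The helper remap_curie and the example identifiers are made up for this illustration.

```python
from typing import Dict


def remap_curie(curie: str, remapping: Dict[str, str]) -> str:
    """Rename the prefix of one CURIE; each prefix is looked up exactly once."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return curie  # not a CURIE, leave unchanged
    return f"{remapping.get(prefix, prefix)}:{local_id}"


remapping = {"geo": "ncbi.geo", "geogeo": "geo"}
remapped = [remap_curie(curie, remapping) for curie in ["geo:GSE1", "geogeo:0001"]]
```

The old geo identifiers move to ncbi.geo, which frees the geo prefix for what used to be geogeo; without the first entry the two would collide.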

sssom.util.remove_unmatched(df)[source]

Remove rows where no match is found.

TODO: https://github.com/OBOFoundry/SSSOM/issues/28

Parameters:: df (DataFrame) – Pandas DataFrame
Return type:: DataFrame
Returns:: Pandas DataFrame with 'PREDICATE_ID' not 'noMatch'.

sssom.util.safe_compress(uri, converter)[source]

Parse a CURIE from an IRI.

Parameters:

uri (str) – The URI to parse. If this is already a CURIE, return directly.
converter (Converter) – Converter used for compression

Return type:

str

Returns:

A CURIE
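Compression from an IRI to a CURIE can be sketched with a plain prefix map. The sketch assumes that input which matches no URI prefix (for instance, something that is already a CURIE) is returned unchanged; compress_or_keep and the example.org namespaces are hypothetical and stand in for a Converter.

```python
from typing import Dict


def compress_or_keep(uri: str, prefix_map: Dict[str, str]) -> str:
    """Compress an IRI to a CURIE, or return the input if nothing matches."""
    # Try the longest URI prefix first so nested namespaces resolve correctly.
    for prefix, uri_prefix in sorted(prefix_map.items(), key=lambda item: -len(item[1])):
        if uri.startswith(uri_prefix):
            return f"{prefix}:{uri[len(uri_prefix):]}"
    return uri  # assumed fallback: already a CURIE or unknown namespace


example_prefix_map = {
    "x": "http://example.org/x/",
    "xsub": "http://example.org/x/sub/",
}
curie_plain = compress_or_keep("http://example.org/x/001", example_prefix_map)
curie_nested = compress_or_keep("http://example.org/x/sub/002", example_prefix_map)
curie_already = compress_or_keep("x:001", example_prefix_map)
```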

sssom.util.sort_df_rows_columns(df, by_columns=True, by_rows=True)[source]

Canonical sorting of DataFrame columns.

Parameters:

df (DataFrame) – Pandas DataFrame with random column sequence.
by_columns (bool) – Boolean flag to sort columns canonically.
by_rows (bool) – Boolean flag to sort rows by column #1 (ascending order).

Return type:

DataFrame

Returns:

Pandas DataFrame columns sorted canonically.

sssom.util.sort_sssom(df)[source]

Sort SSSOM by columns.

Parameters:: df (DataFrame) – SSSOM DataFrame to be sorted.
Return type:: DataFrame
Returns:: Sorted SSSOM DataFrame

sssom.util.to_mapping_set_dataframe(doc)[source]

Convert MappingSetDocument into MappingSetDataFrame.

Parameters:: doc (MappingSetDocument) – MappingSetDocument object
Return type:: MappingSetDataFrame
Returns:: MappingSetDataFrame object