Python API
Datamodel
Top-level Document object
Additional SSSOM object models.
- class sssom.sssom_document.MappingSetDocument(mapping_set, converter)[source]
Bases:
object
Represents a single SSSOM document.
A document is simply a holder for a MappingSet object plus a CURIE map.
- mapping_set: MappingSet
The main part of the document: a set of mappings plus metadata.
- property prefix_map: Dict[str, str]
Get a prefix map.
A Document holds a MappingSet, which is a collection of mappings.
I/O
Conversion between TSV/pandas, internal datamodel and RDF
I/O utilities for SSSOM.
- sssom.io.annotate_file(input, output=None, replace_multivalued=False, **kwargs)[source]
Annotate a file i.e. add custom metadata to the mapping set.
- Parameters:
input (str) – SSSOM TSV file to be queried over.
output (Optional[TextIO]) – Output location.
replace_multivalued (bool) – Whether multivalued slots should be replaced; defaults to False.
kwargs – Options provided by the user, which are added to the metadata (e.g. --mapping_set_id http://example.org/abcd).
- Return type:
MappingSetDataFrame
- Returns:
Annotated MappingSetDataFrame object.
- sssom.io.convert_file(input_path, output, output_format=None)[source]
Convert a file from one format to another.
- Parameters:
input_path (str) – The path to the input SSSOM TSV file.
output (TextIO) – The path to the output file. If none is given, defaults to stdout.
output_format (Optional[str]) – The format to which the SSSOM TSV should be converted.
- Return type:
None
- sssom.io.extract_iris(input, converter)[source]
Recursively extracts a list of IRIs from a string or file.
- Parameters:
input (Union[str, Path, Iterable[Union[str, Path]]]) – A CURIE, a list of CURIEs, or a file path containing the same.
converter (Converter) – Prefix map of the mapping set, (possibly) containing custom prefix:IRI combinations.
- Return type:
List[str]
- Returns:
A list of IRIs.
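As a rough illustration of what turning CURIEs into IRIs involves, the sketch below expands CURIEs against a plain prefix dictionary. It is not the sssom implementation: `expand_curies` and the sample prefix map are hypothetical, and the documented function works with a Converter and can also read its input from files.

```python
from typing import Dict, Iterable, List


def expand_curies(curies: Iterable[str], prefix_map: Dict[str, str]) -> List[str]:
    """Expand each CURIE whose prefix is known; skip the rest (hypothetical helper)."""
    iris = []
    for curie in curies:
        prefix, _, local_id = curie.partition(":")
        # Only expand well-formed CURIEs with a registered prefix.
        if local_id and prefix in prefix_map:
            iris.append(prefix_map[prefix] + local_id)
    return iris


# Assumed example prefix map; real mapping sets carry their own.
prefix_map = {"GO": "http://purl.obolibrary.org/obo/GO_"}
iris = expand_curies(["GO:0008150", "unknown:1"], prefix_map)
print(iris)
```

Skipping unknown prefixes is a design choice of this sketch only; whether the library skips or raises in that case is not stated here.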
- sssom.io.filter_file(input, output=None, **kwargs)[source]
Filter a dataframe by dynamically generating queries based on user input.
e.g. sssom filter --subject_id x:% --subject_id y:% --object_id y:% --object_id z:% tests/data/basic.tsv
yields the query:
"SELECT * FROM df WHERE (subject_id LIKE 'x:%' OR subject_id LIKE 'y:%') AND (object_id LIKE 'y:%' OR object_id LIKE 'z:%')"
and displays the output.
- Parameters:
input (str) – DataFrame to be queried over.
output (Optional[TextIO]) – Output location.
kwargs – Filter options provided by the user, which generate queries (e.g. --subject_id x:%).
- Raises:
ValueError – If a provided parameter is invalid.
- Return type:
MappingSetDataFrame
- Returns:
Filtered MappingSetDataFrame object.
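The query generation described above (OR within one column, AND across columns) can be sketched in a few lines. This is a minimal illustration, not the library's code: `build_filter_query` is a hypothetical name, and it interpolates values directly into the SQL string, which is acceptable for a demonstration but not for untrusted input.

```python
from typing import Dict, Sequence


def build_filter_query(table: str, filters: Dict[str, Sequence[str]]) -> str:
    """Build a SELECT statement: patterns for one column are OR-ed, columns are AND-ed."""
    clauses = []
    for column, patterns in filters.items():
        ors = " OR ".join(f"{column} LIKE '{pattern}'" for pattern in patterns)
        clauses.append(f"({ors})")
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)


query = build_filter_query(
    "df",
    {"subject_id": ["x:%", "y:%"], "object_id": ["y:%", "z:%"]},
)
print(query)
```

With the inputs from the command-line example, the sketch reproduces the query shown in the description.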
- sssom.io.get_metadata_and_prefix_map(metadata_path=None, *, prefix_map_mode=None)[source]
Load metadata and a prefix map in a deprecated way.
- Return type:
Tuple[Converter, Dict[str, Any]]
Deprecated since version 0.4.3: This functionality for loading SSSOM metadata from a YAML file is deprecated from the public API since it has internal assumptions which are usually not valid for downstream users.
- sssom.io.parse_file(input_path, output, *, input_format=None, metadata_path=None, prefix_map_mode=None, clean_prefixes=True, strict_clean_prefixes=True, embedded_mode=True, mapping_predicate_filter=None)[source]
Parse an SSSOM metadata file and write to a table.
- Parameters:
input_path (str) – The path to the input file in one of the legal formats, e.g. obographs, alignmentapi-xml.
output (TextIO) – The path to the output file.
input_format (Optional[str]) – The string denoting the input format.
metadata_path (Optional[str]) – The path to a file containing the SSSOM metadata (including prefix_map) to be used during parsing.
prefix_map_mode (Optional[Literal['metadata_only', 'sssom_default_only', 'merged']]) – Defines whether the prefix map in the metadata should be extended or replaced with the SSSOM default prefix map derived from the bioregistry.
clean_prefixes (bool) – If True (default), records with unknown prefixes are removed from the SSSOM file.
strict_clean_prefixes (bool) – If True (default), clean_prefixes() will be in strict mode.
embedded_mode – If True (default), the dataframe and metadata are exported in one file (tsv), else two separate files (tsv and yaml).
mapping_predicate_filter (Optional[tuple]) – Optional list of mapping predicates or filepath containing the same.
- Return type:
None
- sssom.io.run_sql_query(query, inputs, output=None)[source]
Run a SQL query over one or more SSSOM files.
Each of the N inputs is assigned a table name df1, df2, …, dfN.
Alternatively, the filenames can be used as table names. These are first stemmed, e.g. ~/dir/my.sssom.tsv becomes a table called 'my'.
- Example:
sssom dosql -Q "SELECT * FROM df1 WHERE confidence>0.5 ORDER BY confidence" my.sssom.tsv
- Example:
sssom dosql -Q "SELECT file1.*,file2.object_id AS ext_object_id, file2.object_label AS ext_object_label FROM file1 INNER JOIN file2 WHERE file1.object_id = file2.subject_id" file1.sssom.tsv file2.sssom.tsv
- Parameters:
query (str) – Query to be executed over a pandas DataFrame (msdf.df).
inputs (List[str]) – Input files that form the source tables for the query.
output (Optional[TextIO]) – Output.
- Return type:
MappingSetDataFrame
- Returns:
Filtered MappingSetDataFrame object.
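To make the table-naming rule concrete, the sketch below stems a file name and runs the first example query against an in-memory table. It uses the standard-library sqlite3 module purely as a stand-in for whatever SQL engine the library uses over DataFrames; `table_name` and the sample rows are hypothetical.

```python
import sqlite3
from pathlib import Path


def table_name(path: str) -> str:
    """Stem a file name down to the part before the first dot (hypothetical helper)."""
    return Path(path).name.split(".")[0]


name = table_name("~/dir/my.sssom.tsv")

# Made-up mappings standing in for the contents of an SSSOM TSV file.
rows = [("a:1", "b:1", 0.9), ("a:2", "b:2", 0.4), ("a:3", "b:3", 0.7)]

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {name} (subject_id TEXT, object_id TEXT, confidence REAL)")
conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)

result = conn.execute(
    f"SELECT * FROM {name} WHERE confidence > 0.5 ORDER BY confidence"
).fetchall()
conn.close()
print(name, result)
```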
- sssom.io.split_file(input_path, output_directory)[source]
Split an SSSOM TSV by prefixes and relations.
- Parameters:
input_path (str) – The path to the input file in one of the legal formats, e.g. obographs, alignmentapi-xml.
output_directory (Union[str, Path]) – The directory to which the split file should be exported.
- Return type:
None
- sssom.io.validate_file(input_path, validation_types)[source]
Validate the incoming SSSOM TSV according to the SSSOM specification.
- Parameters:
input_path (str) – The path to the input file in one of the legal formats, e.g. obographs, alignmentapi-xml.
validation_types (List[SchemaValidationType]) – A list of validation types to run.
- Return type:
None
Utils
Utils - currently boomer-specific
Utility functions.
- class sssom.util.EntityPair(subject_entity, object_entity)[source]
Bases:
object
A tuple of entities.
Note that (e1,e2) == (e2,e1)
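The order-insensitive equality noted above can be obtained by normalising the pair before comparing. The class below is a hypothetical stand-in that shows one way to do this; it is not the EntityPair implementation.

```python
class SymmetricPair:
    """A pair whose equality and hash ignore argument order (illustrative only)."""

    def __init__(self, first: str, second: str) -> None:
        # Sorting gives (e1, e2) and (e2, e1) the same canonical form.
        self.pair = tuple(sorted((first, second)))

    def __eq__(self, other: object) -> bool:
        return isinstance(other, SymmetricPair) and self.pair == other.pair

    def __hash__(self) -> int:
        return hash(self.pair)


forward = SymmetricPair("a:1", "b:1")
backward = SymmetricPair("b:1", "a:1")
print(forward == backward, len({forward, backward}))
```

Because the hash is derived from the same canonical form, both orderings collapse to a single key when used in a set or dictionary, which is what grouping by entity pair relies on.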
- sssom.util.KEY_FEATURES = ['subject_id', 'predicate_id', 'object_id', 'predicate_modifier']
The 4 columns whose combination would be used as primary keys while merging/grouping
- class sssom.util.MappingSetDataFrame(df, converter=<factory>, metadata=<factory>)[source]
Bases:
object
A collection of mappings represented as a DataFrame, together with additional metadata.
- clean_prefix_map(strict=True)[source]
Remove unused prefixes from the internal prefix map based on the internal dataframe.
- Parameters:
strict (bool) – If True, raises an error if any prefix in the dataframe is not listed in the 'curie_map'.
- Raises:
ValueError – If prefixes are absent from the 'curie_map' and strict is True.
- Return type:
None
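The behaviour described for clean_prefix_map can be sketched on plain Python data. This is an assumption-laden illustration: `clean_prefix_map_sketch` is a hypothetical function operating on a list of CURIEs and a dictionary, whereas the method itself works on the internal dataframe and converter.

```python
from typing import Dict, Iterable


def clean_prefix_map_sketch(
    curies: Iterable[str], prefix_map: Dict[str, str], strict: bool = True
) -> Dict[str, str]:
    """Keep only prefixes that are used; in strict mode, fail on unknown prefixes."""
    used = {curie.split(":", 1)[0] for curie in curies if ":" in curie}
    missing = used - prefix_map.keys()
    if missing and strict:
        raise ValueError(f"Prefixes missing from prefix map: {sorted(missing)}")
    return {prefix: uri for prefix, uri in prefix_map.items() if prefix in used}


prefix_map = {"a": "http://example.org/a/", "b": "http://example.org/b/", "c": "http://example.org/c/"}
cleaned = clean_prefix_map_sketch(["a:1", "b:2"], prefix_map)
print(cleaned)
```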
- classmethod from_mapping_set(mapping_set, *, converter=None)[source]
Instantiate from a mapping set and an optional converter.
- Parameters:
mapping_set (MappingSet) – A mapping set.
converter (Union[None, Mapping[str, str], Converter]) – A prefix map or pre-instantiated converter. If none is given, a default prefix map derived from the Bioregistry is used.
- Return type:
MappingSetDataFrame
- Returns:
A mapping set dataframe
- classmethod from_mapping_set_document(doc)[source]
Instantiate from a mapping set document.
- Return type:
MappingSetDataFrame
- classmethod from_mappings(mappings, *, converter=None, metadata=None)[source]
Instantiate from a list of mappings, mapping set metadata, and an optional converter.
- Return type:
MappingSetDataFrame
- merge(*msdfs, inplace=True)[source]
Merge one or more MappingSetDataFrames into this one.
- Parameters:
msdfs (MappingSetDataFrame) – One or more MappingSetDataFrame(s) to merge with self.
inplace (bool) – If True, the given MappingSetDataFrames are merged into the calling MappingSetDataFrame; if False, the merged data frame is simply returned.
- Return type:
MappingSetDataFrame
- Returns:
Merged MappingSetDataFrame
- property prefix_map
Get a simple, bijective prefix map.
- remove_mappings(msdf)[source]
Remove mappings in right msdf from left msdf.
- Parameters:
msdf (MappingSetDataFrame) – MappingSetDataFrame object whose mappings are removed from the primary msdf object.
- Return type:
None
- standardize_references()[source]
Standardize this MSDF’s dataframe and metadata with respect to its converter.
- Return type:
None
- class sssom.util.MappingSetDiff(unique_tuples1=None, unique_tuples2=None, common_tuples=None, combined_dataframe=None)[source]
Bases:
object
Represents a difference between two mapping sets.
Currently this is limited to diffs at the level of entity-pairs. For example, if file1 has A owl:equivalentClass B, and file2 has A skos:closeMatch B, this is considered a mapping in common.
- combined_dataframe: Optional[DataFrame] = None
Dataframe that combines the left and right dataframes, with information injected into the comment column.
- sssom.util.add_default_confidence(df, confidence=nan)[source]
Add confidence column to DataFrame if absent and initializes to 0.95.
If confidence column already exists, only fill in the None ones by 0.95.
- Parameters:
df (DataFrame) – DataFrame whose confidence column needs to be filled.
- Return type:
DataFrame
- Returns:
DataFrame with a complete confidence column.
- sssom.util.are_params_slots(params)[source]
Check if parameters conform to the slots in MAPPING_SET_SLOTS.
- Parameters:
params (dict) – Dictionary of parameters.
- Raises:
ValueError – If params are not slots.
- Return type:
bool
- Returns:
True/False
- sssom.util.assign_default_confidence(df)[source]
Assign numpy.nan to confidence values that are blank.
- Parameters:
df (DataFrame) – SSSOM DataFrame
- Return type:
Tuple[DataFrame, DataFrame]
- Returns:
A tuple consisting of the original DataFrame and a DataFrame consisting of empty confidence values.
- sssom.util.augment_metadata(msdf, meta, replace_multivalued=False)[source]
Augment metadata with parameters passed.
- Parameters:
msdf (MappingSetDataFrame) – MappingSetDataFrame (MSDF) object.
meta (dict) – Dictionary that needs to be added/updated to the metadata of the MSDF.
replace_multivalued (bool) – Whether multivalued slots should be replaced; defaults to False.
- Raises:
ValueError – If the type of a slot is neither str nor list.
- Return type:
MappingSetDataFrame
- Returns:
MSDF with updated metadata.
- sssom.util.collapse(df)[source]
Collapse rows with the same S/P/O and combine their confidence.
- Return type:
DataFrame
- sssom.util.compare_dataframes(df1, df2)[source]
Perform a diff between two SSSOM dataframes.
- Parameters:
df1 (DataFrame) – A mapping dataframe
df2 (DataFrame) – A mapping dataframe
- Return type:
MappingSetDiff
- Returns:
A mapping set diff
Warning
Currently does not discriminate between mappings with different predicates.
- sssom.util.create_entity(identifier, mappings)[source]
Create an Entity object.
- Parameters:
identifier (str) – Entity ID
mappings (Dict[str, Any]) – Mapping dictionary
- Return type:
Uriorcurie
- Returns:
An Entity object
- sssom.util.dataframe_to_ptable(df, *, inverse_factor=None, default_confidence=None)[source]
Export a KBOOM table.
- Parameters:
df (DataFrame) – Pandas DataFrame
inverse_factor (Optional[float]) – Multiplier to (1 - confidence), defaults to 0.5
default_confidence (Optional[float]) – Default confidence to be assigned if absent.
- Raises:
ValueError – Predicate value error
ValueError – Predicate type value error
- Returns:
List of rows
- sssom.util.deal_with_negation(df)[source]
Combine negative and positive rows with matching [SUBJECT_ID, OBJECT_ID, CONFIDENCE] combination.
Rule: negative trumps positive if modulus of confidence values are equal.
- Parameters:
df (DataFrame) – Merged Pandas DataFrame
- Return type:
DataFrame
- Returns:
Pandas DataFrame with negations addressed
- Raises:
ValueError – If the dataframe is none after assigning default confidence
- sssom.util.filter_out_prefixes(df, filter_prefixes, features=None, require_all_prefixes=False)[source]
Filter out rows which contain a CURIE with a prefix in the filter_prefixes list.
- Parameters:
df (DataFrame) – Pandas DataFrame of SSSOM mappings
filter_prefixes (List[str]) – List of prefixes
features (Optional[list]) – List of dataframe column names to consider
require_all_prefixes (bool) – If True, all prefixes must be present in a row for it to be filtered out
- Return type:
DataFrame
- Returns:
Pandas Dataframe
- sssom.util.filter_prefixes(df, filter_prefixes, features=None, require_all_prefixes=True)[source]
Filter out rows which do NOT contain a CURIE with a prefix in the filter_prefixes list.
- Parameters:
df (DataFrame) – Pandas DataFrame of SSSOM mappings
filter_prefixes (List[str]) – List of prefixes
features (Optional[list]) – List of dataframe column names to consider
require_all_prefixes (bool) – If True, all prefixes must be present in a row to be filtered out
- Return type:
DataFrame
- Returns:
Pandas Dataframe
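The two prefix filters are mirror images of each other: one drops rows that match the prefix list, the other keeps them. The sketch below shows one reading of that behaviour on plain dictionaries. The function names, the default feature columns and the exact meaning given to the all/any switch are assumptions made for illustration, not the library's code.

```python
from typing import Dict, List, Sequence, Set

Row = Dict[str, str]
DEFAULT_FEATURES = ("subject_id", "object_id")  # assumed columns for this sketch


def _hits(row: Row, prefixes: Set[str], features: Sequence[str]) -> List[bool]:
    """For each feature column, report whether its CURIE prefix is in the list."""
    return [row[feature].split(":", 1)[0] in prefixes for feature in features]


def keep_matching(rows, prefixes, features=DEFAULT_FEATURES, require_all=True):
    """Keep rows whose columns match the prefix list (all columns, or any one)."""
    test = all if require_all else any
    return [row for row in rows if test(_hits(row, prefixes, features))]


def drop_matching(rows, prefixes, features=DEFAULT_FEATURES, require_all=False):
    """Drop rows whose columns match the prefix list (all columns, or any one)."""
    test = all if require_all else any
    return [row for row in rows if not test(_hits(row, prefixes, features))]


rows = [
    {"subject_id": "x:1", "object_id": "y:1"},
    {"subject_id": "x:2", "object_id": "z:2"},
    {"subject_id": "w:3", "object_id": "z:3"},
]
kept = keep_matching(rows, {"x", "y"})
remaining = drop_matching(rows, {"x", "y"})
print(kept, remaining)
```

The defaults mirror the signatures above: keeping requires every considered column to match, while dropping is triggered by any single match.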
- sssom.util.filter_redundant_rows(df, ignore_predicate=False)[source]
Remove rows if there is another row with same S/O and higher confidence.
- Parameters:
df (DataFrame) – Pandas DataFrame to filter
ignore_predicate (bool) – If True, the predicate_id column is ignored; defaults to False
- Return type:
DataFrame
- Returns:
Filtered pandas DataFrame
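The rule above amounts to keeping, for each key, only the row with the highest confidence. A minimal sketch on plain dictionaries follows; `keep_highest_confidence` is a hypothetical name, and the choice of key columns is an assumption based on the ignore_predicate description.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def keep_highest_confidence(rows: List[Row], ignore_predicate: bool = False) -> List[Row]:
    """For each subject/object (and optionally predicate) key, keep the best-scoring row."""
    best: Dict[tuple, Row] = {}
    for row in rows:
        key = (row["subject_id"], row["object_id"])
        if not ignore_predicate:
            key += (row["predicate_id"],)
        # Replace the stored row only when the new one has strictly higher confidence.
        if key not in best or row["confidence"] > best[key]["confidence"]:
            best[key] = row
    return list(best.values())


rows = [
    {"subject_id": "a:1", "predicate_id": "skos:exactMatch", "object_id": "b:1", "confidence": 0.6},
    {"subject_id": "a:1", "predicate_id": "skos:exactMatch", "object_id": "b:1", "confidence": 0.9},
    {"subject_id": "a:1", "predicate_id": "skos:closeMatch", "object_id": "b:1", "confidence": 0.7},
]
per_predicate = keep_highest_confidence(rows)
per_pair = keep_highest_confidence(rows, ignore_predicate=True)
print(per_predicate, per_pair)
```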
- sssom.util.get_all_prefixes(msdf)[source]
Fetch all prefixes in the MappingSetDataFrame.
- Parameters:
msdf (MappingSetDataFrame) – MappingSetDataFrame
- Raises:
ValidationError – If slot is wrong.
- Return type:
Set[str]
- Returns:
Set of all prefixes.
- sssom.util.get_dict_from_mapping(map_obj)[source]
Get information for linkml objects (MatchTypeEnum, PredicateModifierEnum) from the Mapping object and return the dictionary form of the object.
- Parameters:
map_obj (Union[Any, Dict[Any, Any], Mapping]) – Mapping object
- Return type:
dict
- Returns:
Dictionary
- sssom.util.get_file_extension(file)[source]
Get file extension.
- Parameters:
file (Union[str, Path, TextIO]) – File path
- Return type:
str
- Returns:
Format of the file passed; defaults to tsv.
- sssom.util.get_prefixes_used_in_metadata(meta)[source]
Get a set of prefixes used in CURIEs in the metadata.
- Return type:
Set[str]
- sssom.util.get_prefixes_used_in_table(df, converter)[source]
Get a set of prefixes used in CURIEs in key feature columns in a dataframe.
- Return type:
Set[str]
- sssom.util.get_row_based_on_hierarchy(df)[source]
Get row based on hierarchy of predicates.
The hierarchy is as follows:
- owl:equivalentClass
- owl:equivalentProperty
- rdfs:subClassOf
- rdfs:subPropertyOf
- owl:sameAs
- skos:exactMatch
- skos:closeMatch
- skos:broadMatch
- skos:narrowMatch
- oboInOwl:hasDbXref
- skos:relatedMatch
- rdfs:seeAlso
- Parameters:
df (DataFrame) – Dataframe containing multiple predicates for the same subject and object.
- Return type:
DataFrame
- Returns:
Dataframe with a single row, the one which ranks highest in the hierarchy.
- Raises:
KeyError – If no rows are available
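Selecting a row by predicate hierarchy can be sketched as a lookup into an ordered list. The sketch assumes that predicates listed earlier rank higher, which the description implies but does not state outright; `top_ranked` is a hypothetical helper working on dictionaries rather than a DataFrame.

```python
from typing import Dict, List

# Order copied from the hierarchy above; earlier is assumed to mean higher rank.
PREDICATE_HIERARCHY = [
    "owl:equivalentClass",
    "owl:equivalentProperty",
    "rdfs:subClassOf",
    "rdfs:subPropertyOf",
    "owl:sameAs",
    "skos:exactMatch",
    "skos:closeMatch",
    "skos:broadMatch",
    "skos:narrowMatch",
    "oboInOwl:hasDbXref",
    "skos:relatedMatch",
    "rdfs:seeAlso",
]
RANK = {predicate: index for index, predicate in enumerate(PREDICATE_HIERARCHY)}


def top_ranked(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the row whose predicate comes first in the hierarchy."""
    if not rows:
        raise KeyError("no rows available")
    return min(rows, key=lambda row: RANK[row["predicate_id"]])


rows = [
    {"subject_id": "a:1", "predicate_id": "skos:closeMatch", "object_id": "b:1"},
    {"subject_id": "a:1", "predicate_id": "owl:sameAs", "object_id": "b:1"},
]
winner = top_ranked(rows)
print(winner["predicate_id"])
```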
- sssom.util.group_mappings(df)[source]
Group mappings by EntityPairs.
- Return type:
Dict[EntityPair, List[Series]]
- sssom.util.inject_metadata_into_df(msdf)[source]
Inject metadata dictionary key-value pairs into the DataFrame columns of a MappingSetDataFrame.
- Parameters:
msdf (MappingSetDataFrame) – MappingSetDataFrame with metadata separate.
- Return type:
MappingSetDataFrame
- Returns:
MappingSetDataFrame with metadata as columns
- sssom.util.invert_mappings(df, subject_prefix=None, merge_inverted=True, update_justification=True, predicate_invert_dictionary=None)[source]
Switch subjects and objects based on their prefixes and adjust predicates accordingly.
- Parameters:
df (DataFrame) – Pandas dataframe.
subject_prefix (Optional[str]) – Prefix of subjects desired.
merge_inverted (bool) – If True (default), add the inverted dataframe to the input; else, just return the inverted data.
update_justification (bool) – If True (default), the justification is updated to "semapv:MappingInversion"; else it is left as it is.
predicate_invert_dictionary (Optional[dict]) – YAML file providing the inverse mapping for predicates.
- Return type:
DataFrame
- Returns:
Pandas dataframe with all subject IDs having the same prefix.
- sssom.util.is_multivalued_slot(slot)[source]
Check whether the slot is multivalued according to the SSSOM specification.
- Parameters:
slot (str) – Slot name
- Return type:
bool
- Returns:
Whether the slot is multivalued
- sssom.util.merge_msdf(*msdfs, reconcile=False)[source]
Merge multiple MappingSetDataFrames into one.
- Parameters:
msdfs (MappingSetDataFrame) – A tuple of MappingSetDataFrames to be merged.
reconcile (bool) – If True, dedupe (remove redundant lower-confidence mappings) and reconcile (if the msdf contains a higher-confidence negative mapping, remove the lower-confidence positive one; if confidence is the same, prefer HumanCurated; if both are HumanCurated, prefer the negative mapping). Defaults to False.
- Return type:
MappingSetDataFrame
- Returns:
Merged MappingSetDataFrame.
- sssom.util.pandas_set_no_silent_downcasting(no_silent_downcasting=True)[source]
Set the pandas future.no_silent_downcasting option. For context, see https://github.com/pandas-dev/pandas/issues/57734.
- sssom.util.raise_for_bad_path(file_path)[source]
Raise exception if file path is invalid.
- Parameters:
file_path (Union[str, Path]) – File path
- Raises:
FileNotFoundError – Invalid file path
- Return type:
None
- sssom.util.reconcile_prefix_and_data(msdf, prefix_reconciliation)[source]
Reconcile the prefix_map and translate the CURIE switch in the dataframe.
- Parameters:
msdf (MappingSetDataFrame) – Mapping Set DataFrame.
prefix_reconciliation (dict) – Prefix reconciliation dictionary from a YAML file
- Return type:
MappingSetDataFrame
- Returns:
Mapping Set DataFrame with reconciled prefix_map and data.
This method is built on curies.remap_curie_prefixes() and curies.rewire(). Note that if you want to overwrite a CURIE prefix in the Bioregistry extended prefix map, you need to provide a place for the old one to go, as in {"geo": "ncbi.geo", "geogeo": "geo"}. Just doing {"geogeo": "geo"} would not work since geo already exists.
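The effect of the two-entry remapping on the data side can be illustrated with plain strings. This sketch only rewrites CURIE prefixes in a list; it does not reproduce the curies functions named above or touch a prefix map, and `remap_prefix` and the sample identifiers are hypothetical.

```python
from typing import Dict


def remap_prefix(curie: str, remapping: Dict[str, str]) -> str:
    """Replace the CURIE's prefix if it appears in the remapping; look it up only once."""
    prefix, separator, local_id = curie.partition(":")
    return remapping.get(prefix, prefix) + separator + local_id


remapping = {"geo": "ncbi.geo", "geogeo": "geo"}
curies = ["geo:GSE1", "geogeo:00000001", "GO:0008150"]
remapped = [remap_prefix(curie, remapping) for curie in curies]
print(remapped)
```

Because each CURIE is looked up exactly once against its original prefix, the old geo identifiers move to ncbi.geo while the geogeo identifiers take over geo, without the two rules chaining into each other.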
- sssom.util.remove_unmatched(df)[source]
Remove rows where no match is found.
TODO: https://github.com/OBOFoundry/SSSOM/issues/28
- Parameters:
df (DataFrame) – Pandas DataFrame
- Return type:
DataFrame
- Returns:
Pandas DataFrame with 'PREDICATE_ID' not 'noMatch'.
- sssom.util.safe_compress(uri, converter)[source]
Parse a CURIE from an IRI.
- Parameters:
uri (str) – The URI to parse. If this is already a CURIE, return directly.
converter (Converter) – Converter used for compression
- Return type:
str
- Returns:
A CURIE
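Compression of an IRI into a CURIE, with pass-through for input that does not match any known URI prefix, might look like the sketch below. It uses a plain dictionary instead of a Converter, and both `compress_iri` and the longest-match rule are assumptions of this illustration rather than a description of the library's internals.

```python
from typing import Dict, Optional, Tuple


def compress_iri(uri: str, prefix_map: Dict[str, str]) -> str:
    """Compress using the longest matching URI prefix; return the input unchanged otherwise."""
    best: Optional[Tuple[str, str]] = None
    for prefix, uri_prefix in prefix_map.items():
        if uri.startswith(uri_prefix) and (best is None or len(uri_prefix) > len(best[1])):
            best = (prefix, uri_prefix)
    if best is None:
        # Nothing matched, e.g. because the input is already a CURIE.
        return uri
    return f"{best[0]}:{uri[len(best[1]):]}"


prefix_map = {
    "OBO": "http://purl.obolibrary.org/obo/",
    "GO": "http://purl.obolibrary.org/obo/GO_",
}
compressed = compress_iri("http://purl.obolibrary.org/obo/GO_0008150", prefix_map)
unchanged = compress_iri("GO:0008150", prefix_map)
print(compressed, unchanged)
```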
- sssom.util.sort_df_rows_columns(df, by_columns=True, by_rows=True)[source]
Canonical sorting of DataFrame columns.
- Parameters:
df (DataFrame) – Pandas DataFrame with random column sequence.
by_columns (bool) – Boolean flag to sort columns canonically.
by_rows (bool) – Boolean flag to sort rows by column #1 (ascending order).
- Return type:
DataFrame
- Returns:
Pandas DataFrame columns sorted canonically.
- sssom.util.sort_sssom(df)[source]
Sort SSSOM by columns.
- Parameters:
df (DataFrame) – SSSOM DataFrame to be sorted.
- Return type:
DataFrame
- Returns:
Sorted SSSOM DataFrame
- sssom.util.to_mapping_set_dataframe(doc)[source]
Convert MappingSetDocument into MappingSetDataFrame.
- Parameters:
doc (MappingSetDocument) – MappingSetDocument object
- Return type:
MappingSetDataFrame
- Returns:
MappingSetDataFrame object