obnb.label
Labelset collection object
- class obnb.label.LabelsetCollection[source]
Collection of labelsets.
This class is used for managing a collection of labelsets.
Example GMT (Gene Matrix Transpose):
Geneset1 Description1 Gene1 Gene2 Gene3
Geneset2 Description2 Gene2 Gene4 Gene5 Gene6
Example internal data for a label collection with above GMT data:
self.entity_ids = ['Gene1', 'Gene2', 'Gene3', 'Gene4', 'Gene5', 'Gene6']
self.entity.prop = {'Noccur': [1, 2, 1, 1, 1, 1]}
self.label_ids = ['Geneset1', 'Geneset2']
self.prop = {
    'Info': ['Description1', 'Description2'],
    'Labelset': [
        {'Gene1', 'Gene2', 'Gene3'},
        {'Gene2', 'Gene4', 'Gene5', 'Gene6'}
    ]
}
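The mapping from GMT rows to the internal data above can be sketched in plain Python. This is only an illustration: `parse_gmt_text` is a hypothetical helper, not part of obnb, and sorting the entity IDs is an assumption made here for reproducibility.

```python
from collections import Counter


def parse_gmt_text(text, sep="\t"):
    """Parse GMT-formatted text into a plain dict (hypothetical helper, not obnb)."""
    label_ids, info, labelsets = [], [], []
    for line in text.strip().splitlines():
        # Each row: label ID, description, then the member entities.
        label_id, description, *entities = line.split(sep)
        label_ids.append(label_id)
        info.append(description)
        labelsets.append(set(entities))
    # Noccur: number of labelsets in which each entity occurs.
    noccur = Counter(entity for labelset in labelsets for entity in labelset)
    entity_ids = sorted(noccur)  # assumption: sorted order
    return {
        "entity_ids": entity_ids,
        "Noccur": [noccur[entity] for entity in entity_ids],
        "label_ids": label_ids,
        "Info": info,
        "Labelset": labelsets,
    }


gmt_text = (
    "Geneset1\tDescription1\tGene1\tGene2\tGene3\n"
    "Geneset2\tDescription2\tGene2\tGene4\tGene5\tGene6\n"
)
parsed = parse_gmt_text(gmt_text)
print(parsed["entity_ids"])
print(parsed["Noccur"])
```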
Initialize LabelsetCollection object.
- add_labelset(lst, label_id, label_info=None)[source]
Add a new labelset.
- Parameters:
lst (list of str) – list of IDs of entities belonging to the input label
label_id (str) – name of label
label_info (str) – description of label
- apply(filter_func, inplace=False, progress_bar=False)[source]
Apply filter to labelsets.
See obnb.label.filters for more info.
- Parameters:
filter_func –
inplace (bool) – whether or not to modify the original object. If True, apply the filter directly on the original object; otherwise, apply the filter on a copy of the original object and return that copy (default: False).
progress_bar (bool) – whether or not to display a progress bar for filtering (default: False).
- Returns:
Labelset collection object after filtering.
- property entity_ids
List of all entity IDs that are part of at least one labelset.
- export(path)[source]
Export self as a ‘.lsc’ file.
Notes
‘.lsc’ is a csv file storing entity labels in matrix form, where first column is entity IDs, first and second rows correspond to label ID and label information respectively. If an entity ‘i’ is annotated with a label ‘j’, the corresponding ‘ij’ entry is marked as ‘1’, else if it is considered a negative for that label, it is marked as ‘-1’, otherwise it is ‘0’, standing for neutral.
entity_idmap is necessary since not all entities are guaranteed to be part of at least one label.
- Parameters:
path (str) – path to file to save, including file name, with/without extension.
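The matrix layout described in the notes can be sketched as follows. This is an illustration in plain Python rather than the obnb writer; `lsc_rows` is a hypothetical helper, and the text in the first cell of the two header rows is a placeholder chosen for this sketch.

```python
def lsc_rows(entity_ids, label_ids, label_info, positives, negatives):
    """Build the rows of an '.lsc'-style matrix (hypothetical helper, not obnb)."""
    # Header rows: label IDs, then label information.
    # The leading cell text is a placeholder assumed for this sketch.
    rows = [["Label ID", *label_ids], ["Label Info", *label_info]]
    for entity in entity_ids:
        row = [entity]
        for label_id in label_ids:
            if entity in positives[label_id]:
                row.append(1)  # annotated positive
            elif entity in negatives.get(label_id, set()):
                row.append(-1)  # considered negative
            else:
                row.append(0)  # neutral
        rows.append(row)
    return rows


rows = lsc_rows(
    entity_ids=["Gene1", "Gene2", "Gene3"],
    label_ids=["Geneset1", "Geneset2"],
    label_info=["Description1", "Description2"],
    positives={"Geneset1": {"Gene1", "Gene2"}, "Geneset2": {"Gene2"}},
    negatives={"Geneset1": {"Gene3"}},
)
for row in rows:
    print(",".join(str(cell) for cell in row))
```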
- export_gmt(path)[source]
Export self as a ‘.gmt’ (Gene Matrix Transpose) file.
- Parameters:
path (str) – path to file to save, including file name, with/without extension.
- classmethod from_dict(input_dict)[source]
Load data from entity label dictionary.
- Parameters:
input_dict (Dict[str,str]) – A dictionary mapping from entities to their unique label IDs.
- classmethod from_gmt(path, sep='\\t')[source]
Construct LabelsetCollection object from GMT file.
- Parameters:
path (str) – path to the .gmt file.
sep (str) – separator used in the GMT file.
- classmethod from_ontology_graph(graph, min_size=10, namespace=None)[source]
Construct LabelsetCollection object from an annotated ontology.
- Parameters:
graph (OntologyGraph) –
min_size (int) –
namespace (str | None) –
- get_negative(label_id)[source]
Return set of negative samples of a labelset.
Note
If negative samples are not available, the complement of the labelset is used.
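The fallback in the note can be sketched as below. This is not the obnb implementation: `negatives_or_complement` is a hypothetical helper, and taking the complement relative to all entity IDs in the collection is an assumption.

```python
def negatives_or_complement(entity_ids, positives, explicit_negatives=None):
    """Return explicit negatives if given, else the complement of the positives."""
    if explicit_negatives is not None:
        return set(explicit_negatives)
    # Assumption: the complement is taken over all entities in the collection.
    return set(entity_ids) - set(positives)


entity_ids = ["Gene1", "Gene2", "Gene3", "Gene4"]
fallback = negatives_or_complement(entity_ids, {"Gene1", "Gene2"})
explicit = negatives_or_complement(entity_ids, {"Gene1", "Gene2"}, {"Gene4"})
print(sorted(fallback), sorted(explicit))
```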
- get_y(target_ids, labelset_name=None, return_y_mask=False)[source]
Return the y matrix.
- Parameters:
target_ids (Tuple[str,...]) – Tuple of entity IDs used to order the rows.
labelset_name (Optional[str]) – A specific labelset to use; if not set, use all the labelsets (default: None).
return_y_mask (bool) – If set to True, then additionally return a mask indicating the positive and negative entries. In other words, the neutrals, or examples whose labels are not confidently known as positives or negatives, are deselected in the mask.
- Return type:
Union[ndarray,Tuple[ndarray,ndarray]]
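How the y matrix and its mask relate to positives, negatives, and neutrals can be sketched with nested lists. This is an illustration only: `build_y` is a hypothetical helper, and obnb returns ndarrays rather than lists.

```python
def build_y(target_ids, positives, negatives):
    """Return (y, mask) as nested lists; rows follow target_ids, columns follow labelsets."""
    y, mask = [], []
    for entity in target_ids:
        y.append([entity in pos for pos in positives])
        # The mask selects entries known to be positive or negative; neutrals are False.
        mask.append(
            [entity in pos or entity in neg for pos, neg in zip(positives, negatives)]
        )
    return y, mask


target_ids = ("Gene1", "Gene2", "Gene3")
positives = [{"Gene1"}, {"Gene2", "Gene3"}]
negatives = [{"Gene2"}, {"Gene1"}]
y, y_mask = build_y(target_ids, positives, negatives)
print(y)
print(y_mask)
```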
- iapply(filter_func, progress_bar=False)[source]
Apply filter to labelsets inplace.
This is a shortcut for calling self.apply(filter_func, inplace=True).
- Parameters:
progress_bar (bool) –
- items()[source]
Yield label name and the corresponding label set.
- Return type:
Iterator[Tuple[int,Set[str]]]
- property label_ids
list of str: list of all labelset names.
- load_entity_properties(path, prop_name, default_val, default_type, interpreter=<class 'int'>, comment='#', skiprows=0)[source]
Load entity properties from file.
The file is tab separated with two columns, first column contains entities IDs, second column contains corresponding properties of entities.
- Parameters:
path (str) – path to the entity properties file.
default_val – default value of property of an entity if not specified.
default_type (type) – default type of the property.
interpreter – function to transform property value from string to some other value.
- pop_entity(entity_id)[source]
Pop an entity from entity list and remove it from all labelsets.
Note
Unlike pop_labelset, if a labelset becomes empty after the removal, the labelset itself is NOT removed. This makes it more convenient to compare labelset sizes before and after filtering.
- pop_labelset(label_id)[source]
Pop a labelset.
Note
This also removes any entity that no longer belongs to any labelset.
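The asymmetry between the two pop operations can be sketched on a plain dict of sets. Both functions are hypothetical stand-ins written for this illustration, not the obnb methods.

```python
def pop_entity(labelsets, entity_ids, entity_id):
    """Remove an entity everywhere; labelsets that become empty are kept."""
    entity_ids.remove(entity_id)
    for members in labelsets.values():
        members.discard(entity_id)


def pop_labelset(labelsets, entity_ids, label_id):
    """Remove a labelset, then drop entities left without any labelset."""
    removed = labelsets.pop(label_id)
    remaining = set().union(*labelsets.values())
    entity_ids[:] = [entity for entity in entity_ids if entity in remaining]
    return removed


labelsets = {"Geneset1": {"Gene1"}, "Geneset2": {"Gene2", "Gene3"}}
entity_ids = ["Gene1", "Gene2", "Gene3"]

pop_entity(labelsets, entity_ids, "Gene1")
after_pop_entity = {name: set(members) for name, members in labelsets.items()}

pop_labelset(labelsets, entity_ids, "Geneset2")
print(after_pop_entity, labelsets, entity_ids)
```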
- read_gmt(path, sep='\\t', reload=False)[source]
Load data from Gene Matrix Transpose .gmt file.
- Parameters:
path (str) – path to the .gmt file.
sep (str) – separator used in the GMT file.
reload (bool) – Remove existing labelsets before loading if set to True.
- read_ontology_graph(graph, min_size=10, namespace=None)[source]
Load labelset collection from an annotated ontology graph.
- Parameters:
graph (OntologyGraph) – The annotated ontology graph to be read.
min_size (int) – Minimum number of positive examples in order to be loaded as a label set (default: 10).
namespace (str, optional) – If set, only load terms that are inherited from the term specified as namespace; otherwise load all terms (default: None).
- reset_labelset(label_id)[source]
Reset an existing labelset to an empty set.
Set the labelset back to empty and decrement Noccur of all entities belonging to the labelset by 1.
- property sizes: List[int]
Sizes of the labelsets.
- split(splitter, target_ids=None, labelset_name=None, mask_names=None, consider_negative=False, **kwargs)[source]
Split the entities based on the labelsets.
- Parameters:
splitter (Callable[[ndarray,ndarray],Iterator[Tuple[ndarray,...]]]) – A splitter function that splits the entities based on their labels and, optionally, an entity property.
target_ids (Optional[Tuple[str,...]]) – Tuple of entity IDs for the output masks and label vector to align with. Use self.entity_ids if not specified.
labelset_name (Optional[str]) – Indicates which specific labelset to split. Split based on all available sets if not specified.
mask_names (Optional[Tuple[str,...]]) – Names of the masks for the splits generated by the splitter. If not specified, use ('train', 'test') when the splitter generates two splits and use ('train', 'val', 'test') when the splitter generates three splits.
consider_negative (bool) – Only use annotated negatives and remove neutral data points where we do not know for sure they are negatives (default: False).
- Return type:
Tuple[ndarray,Dict[str,ndarray]]
Note
The consider_negative option currently only works when one explicitly specifies the labelset_name. In the future, this option might also be supported with multiple labelsets.
- Raises:
ValueError – If the length of the specified mask_names does not match the number of splits generated by the splitter, or if the number of splits generated by the splitter is neither two nor three but mask_names is not specified, or if the specified target_ids does not contain all of entity_ids.
IDNotExistError – If the specified labelset_name does not exist or the specified property_name does not exist.
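The rule for naming the masks, including the two ValueError cases, can be sketched as follows. `resolve_mask_names` is a hypothetical helper that mirrors the documented behaviour; it is not taken from the obnb source.

```python
def resolve_mask_names(n_splits, mask_names=None):
    """Pick mask names for the splits, following the documented defaults."""
    if mask_names is None:
        defaults = {2: ("train", "test"), 3: ("train", "val", "test")}
        if n_splits not in defaults:
            raise ValueError(
                f"mask_names must be specified when there are {n_splits} splits"
            )
        return defaults[n_splits]
    if len(mask_names) != n_splits:
        raise ValueError(
            f"Expected {n_splits} mask names, got {len(mask_names)}"
        )
    return tuple(mask_names)


print(resolve_mask_names(2))
print(resolve_mask_names(3))
print(resolve_mask_names(4, ("a", "b", "c", "d")))
```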
- to_df()[source]
Construct label sets info dataframe.
The first three columns of the table correspond to the name, info, and the number of positive examples for each labelset. The rest of the columns contain the positive examples, padded with None.
- Return type:
DataFrame
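The table layout described above can be sketched without pandas. `labelset_rows` is a hypothetical helper; sorting the members within each row is an assumption made so the output is deterministic.

```python
def labelset_rows(label_ids, info, labelsets):
    """Build one row per labelset: name, info, size, then members padded with None."""
    width = max((len(members) for members in labelsets), default=0)
    rows = []
    for name, description, members in zip(label_ids, info, labelsets):
        ordered = sorted(members)  # assumption: sorted for reproducibility
        padding = [None] * (width - len(ordered))
        rows.append([name, description, len(ordered), *ordered, *padding])
    return rows


table = labelset_rows(
    ["Geneset1", "Geneset2"],
    ["Description1", "Description2"],
    [{"Gene1", "Gene2", "Gene3"}, {"Gene2", "Gene4", "Gene5", "Gene6"}],
)
for row in table:
    print(row)
```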
- update_labelset(lst, label_id)[source]
Update an existing labelset.
Take a list of entity IDs and update the current labelset whose label name matches label_id. Any ID in the input list lst that does not exist in the entity list will be added to the entity list. Increment the Noccur property of any entities newly added to the labelset by 1.
Note: label_id must already exist; use .add_labelset() for adding a new labelset.
- Parameters:
lst (list of str) – list of entity IDs to be added to the labelset, can be redundant.
- Raises:
TypeError – if lst is not list type or any element within lst is not str type
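The bookkeeping described for updating a labelset can be sketched on plain dicts. This is an illustration under assumptions: `update_labelset` here is a hypothetical function, and raising KeyError for an unknown label is a choice made for this sketch only.

```python
def update_labelset(labelsets, noccur, lst, label_id):
    """Add entities to an existing labelset and bump Noccur for newly added ones."""
    if not isinstance(lst, list) or not all(isinstance(item, str) for item in lst):
        raise TypeError("lst must be a list of str")
    if label_id not in labelsets:
        raise KeyError(label_id)  # assumption: error type chosen for this sketch
    # Redundant IDs and IDs already in the labelset are counted only once.
    for entity in set(lst) - labelsets[label_id]:
        labelsets[label_id].add(entity)
        noccur[entity] = noccur.get(entity, 0) + 1


labelsets = {"Geneset1": {"Gene1", "Gene2"}}
noccur = {"Gene1": 1, "Gene2": 1}
update_labelset(labelsets, noccur, ["Gene2", "Gene3", "Gene3"], "Geneset1")
print(labelsets, noccur)
```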
Labelset collection filters
Compose – Composition of filters.
EntityExistenceFilter – Filter entities by list of entities of interest.
EntityRangeFilterNoccur – Filter entities based on number of occurrence.
LabelsetExistenceFilter – Filter labelset by list of labelsets of interest.
LabelsetNonRedFilter – Filter out redundant labelsets in a labelset collection.
LabelsetPairwiseFilterJaccard – Filter labelsets based on Jaccard index.
LabelsetPairwiseFilterOverlap – Filter labelsets based on the Overlap coefficient.
LabelsetRangeFilterSize – Filter labelsets based on size.
LabelsetRangeFilterSplit – Filter labelsets based on number of positives in each dataset split.
NegativeGeneratorHypergeom – Filter based on enrichment (hypergeometric test).
Filter objects for preprocessing the labelset collection.
- class obnb.label.filters.Compose(*filters, log_level='WARNING')[source]
Composition of filters.
Initialize composition.
- Parameters:
log_level (Literal['CRITICAL', 'ERROR', 'WARNING', 'INFO', 'DEBUG', 'NOTSET']) –
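The idea of composing filters can be sketched with plain functions on a dict of sets. Everything here is hypothetical (`compose`, `keep_entities`, `drop_small`), and applying the filters left to right is an assumption about the order rather than a statement about obnb.

```python
def compose(*filters):
    """Chain filter functions; assumption: they are applied left to right."""
    def composed(labelsets):
        for apply_filter in filters:
            labelsets = apply_filter(labelsets)
        return labelsets
    return composed


def keep_entities(allowed):
    """Hypothetical entity filter: keep only entities in `allowed`."""
    return lambda labelsets: {k: v & allowed for k, v in labelsets.items()}


def drop_small(min_size):
    """Hypothetical labelset filter: drop labelsets smaller than `min_size`."""
    return lambda labelsets: {k: v for k, v in labelsets.items() if len(v) >= min_size}


labelsets = {"A": {"g1", "g2", "g3"}, "B": {"g1", "g4"}, "C": {"g5"}}
pipeline = compose(keep_entities({"g1", "g2", "g3", "g4"}), drop_small(2))
filtered = pipeline(labelsets)
print(filtered)
```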
- class obnb.label.filters.EntityExistenceFilter(target_lst, remove_specified=False, **kwargs)[source]
Filter entities by list of entities of interest.
Example
The following example removes any entities in the labelset_collection that are not present in the specified entity_id_list.
>>> existence_filter = EntityExistenceFilter(entity_id_list)
>>> labelset_collection.apply(existence_filter, inplace=True)
Alternatively, setting remove_specified=True preserves (instead of removes) only the entities not present in the entity_id_list.
Initialize EntityExistenceFilter object.
- Parameters:
target_lst (List[str]) –
remove_specified (bool) –
- property mod_name
Name of modification to entity.
- class obnb.label.filters.EntityRangeFilterNoccur(min_val=None, max_val=None, **kwargs)[source]
Filter entities based on number of occurrence.
Example
The following example removes any entity that occurs to be positive in more than 10 labelsets.
>>> labelset_collection.apply(EntityRangeFilterNoccur(max_val=10),
...                           inplace=True)
Initialize EntityRangeFilterNoccur object.
- Parameters:
min_val (float | None) –
max_val (float | None) –
- property mod_name
Name of modification to entity.
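Filtering entities by their number of occurrences can be sketched as below. `filter_by_noccur` is a hypothetical function on a dict of sets; treating both bounds as inclusive is an assumption consistent with the example above, where max_val=10 removes entities occurring in more than 10 labelsets.

```python
from collections import Counter


def filter_by_noccur(labelsets, min_val=None, max_val=None):
    """Keep entities whose occurrence count lies within [min_val, max_val]."""
    noccur = Counter(entity for members in labelsets.values() for entity in members)
    keep = {
        entity
        for entity, count in noccur.items()
        if (min_val is None or count >= min_val)
        and (max_val is None or count <= max_val)
    }
    # Labelsets that become empty are kept in this sketch.
    return {name: members & keep for name, members in labelsets.items()}


labelsets = {"A": {"g1", "g2"}, "B": {"g1", "g3"}, "C": {"g1"}}
filtered_noccur = filter_by_noccur(labelsets, max_val=2)
print(filtered_noccur)
```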
- class obnb.label.filters.LabelsetExistenceFilter(target_lst, remove_specified=False, **kwargs)[source]
Filter labelset by list of labelsets of interest.
Example
The following example removes any labelset in the labelset_collection whose label name does not match any of the elements in label_name_list.
>>> labelset_existence_filter = LabelsetExistenceFilter(label_name_list)
>>> labelset_collection.apply(labelset_existence_filter, inplace=True)
Alternatively, setting remove_specified=True preserves (instead of removes) only the labelsets not present in the label_name_list.
Initialize LabelsetExistenceFilter object.
- Parameters:
target_lst (List[str]) –
remove_specified (bool) –
- property mod_name
Name of modification to entity.
- class obnb.label.filters.LabelsetNonRedFilter(*thresholds, **kwargs)[source]
Filter out redundant labelsets in a labelset collection.
The detailed procedure can be found in the supplementary data of https://doi.org/10.1093/bioinformatics/btaa150
In brief, given a labelset collection, a graph of labelsets is first constructed based on the redundancy score function of interest. Here, we use the combination of Jaccard index and overlap coefficient. Then, for each connected component in this graph, retrieve representative labelsets according to the sum of the proportions of genes in a gene set that are contained in any other gene sets within that component.
Initialize BaseLabelsetNonRedFilter object.
- Parameters:
thresholds (Tuple[float,float]) – Thresholds for Jaccard index and overlap coefficient, respectively. If a pair of gene sets have Jaccard index and overlap coefficient above the specified thresholds simultaneously, then an edge is added connecting the two gene sets. Accept values within [0, 1].
inclusive – Whether or not to include values exactly at the threshold when constructing the labelset graph.
- get_nonred_label_ids(g, lsc)[source]
Extract non-redundant labelsets.
- Parameters:
g (SparseGraph) – The labelset graph connecting different labelsets according to the extent to which they are redundant. See construct_labelset_graph().
lsc (LabelsetCollection) – The labelset collection object.
- Returns:
The set of non-redundant labelset IDs.
- Return type:
set[str]
- property mod_name
Name of modification to entity.
- property params: List[str]
Parameter list.
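The graph-construction step of the redundancy filter can be sketched in plain Python: connect two labelsets when both scores exceed their thresholds, then collect connected components. This is a simplified illustration, not the obnb code; the function names are hypothetical, the strict comparison is an assumption, and the selection of representatives within each component is omitted.

```python
from itertools import combinations


def jaccard(a, b):
    return len(a & b) / len(a | b)


def overlap(a, b):
    return len(a & b) / min(len(a), len(b))


def redundancy_components(labelsets, jaccard_thresh, overlap_thresh):
    """Group labelsets (assumed non-empty) into connected components of redundancy."""
    adjacency = {name: set() for name in labelsets}
    for u, v in combinations(labelsets, 2):
        a, b = labelsets[u], labelsets[v]
        # Edge only if both scores are above their thresholds (strict: assumption).
        if jaccard(a, b) > jaccard_thresh and overlap(a, b) > overlap_thresh:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


labelsets = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3"},
    "C": {"g7", "g8"},
}
components = redundancy_components(labelsets, 0.5, 0.7)
print(components)
```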
- class obnb.label.filters.LabelsetPairwiseFilterJaccard(max_val, size_constraint='smaller', inclusive=True, **kwargs)[source]
Filter labelsets based on Jaccard index.
The Jaccard index is computed as the size of the intersection divided by the size of the union of two sets.
Example
>>> labelset_collection.iapply(LabelsetPairwiseFilterJaccard(0.7))
Initialize the pairwise labelset filter.
- Parameters:
max_val (float) –
size_constraint (str) – If set to ‘larger’ (or ‘smaller’), then only make the pairwise comparison if the current labelset is larger (or smaller) than the target labelset. Finally, ‘none’ is the same as setting to both ‘larger’ and ‘smaller’ (default: ‘smaller’).
inclusive (bool) – Whether or not to make the comparison if the two labelsets have the same size (default: True).
- class obnb.label.filters.LabelsetPairwiseFilterOverlap(max_val, size_constraint='smaller', inclusive=True, **kwargs)[source]
Filter labelsets based on the Overlap coefficient.
The Overlap coefficient is computed as the size of the intersection divided by the minimum size of the two sets.
Example
>>> labelset_collection.iapply(LabelsetPairwiseFilterOverlap(0.8))
Initialize the pairwise labelset filter.
- Parameters:
max_val (float) –
size_constraint (str) – If set to ‘larger’ (or ‘smaller’), then only make the pairwise comparison if the current labelset is larger (or smaller) than the target labelset. Finally, ‘none’ is the same as setting to both ‘larger’ and ‘smaller’ (default: ‘smaller’).
inclusive (bool) – Whether or not to make the comparison if the two labelsets have the same size (default: True).
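As a worked comparison of the Jaccard index and the Overlap coefficient, take the two gene sets from the GMT example at the top of this page, A = {Gene1, Gene2, Gene3} and B = {Gene2, Gene4, Gene5, Gene6}. The symbols A, B, J, and O are notation introduced here for illustration only.

```latex
\begin{align*}
|A \cap B| &= 1, \qquad |A \cup B| = 6, \qquad \min(|A|, |B|) = 3, \\
J(A, B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{1}{6} \approx 0.17, \\
O(A, B) &= \frac{|A \cap B|}{\min(|A|, |B|)} = \frac{1}{3} \approx 0.33.
\end{align*}
```

Because the smaller set is never larger than the union, the Overlap coefficient is always at least as large as the Jaccard index for the same pair of non-empty sets.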
- class obnb.label.filters.LabelsetRangeFilterSize(min_val=None, max_val=None, **kwargs)[source]
Filter labelsets based on size.
Example
The following example removes any labelset that has fewer than 10 or more than 100 positives.
>>> labelset_collection.apply(
...     LabelsetRangeFilterSize(min_val=10, max_val=100), inplace=True)
Initialize LabelsetRangeFilterSize object.
- Parameters:
min_val (float | None) –
max_val (float | None) –
- property mod_name
Name of modification to entity.
- class obnb.label.filters.LabelsetRangeFilterSplit(min_val, splitter, count_negatives=True, **kwargs)[source]
Filter labelsets based on number of positives in each dataset split.
Initialize LabelsetRangeFilterTrainTestPos object.
- Parameters:
count_negatives (bool) – Whether or not to filter based on the number of negatives in each split (default: True).
min_val (float) –
splitter (Callable[[ndarray, ndarray], Iterator[Tuple[ndarray, ...]]]) –
- get_val_getter(lsc)[source]
Return the value getter.
The value getter finds the minimum number of positives for a labelset across all the dataset splits.
- property mod_name
Name of modification to entity.
- property params: List[str]
Parameter list.
- class obnb.label.filters.NegativeGeneratorHypergeom(p_thresh, **kwargs)[source]
Filter based on enrichment (hypergeometric test).
Given a labelset, it compares all pairs of labelsets via the hypergeometric test. If the p-value is less than p_thresh, then exclude the entities from that labelset that are not positive from training/testing sets, i.e., set them to neutral.
Example
The following example sets up the negatives for each labelset using a p-value threshold of 0.05.
>>> labelset_collection.apply(NegativeGeneratorHypergeom(0.05),
...                           inplace=True)
Initialize NegativeGeneratorHypergeom object.
- Parameters:
p_thresh (float) – p-value threshold of the hypergeometric test.
- property params: List[str]
Parameter list.
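A hypergeometric enrichment p-value for the overlap between two labelsets can be computed with the standard library alone. This is a sketch of the general test, not necessarily the exact computation in obnb: the use of the upper tail P(X >= k) and the choice of population size are assumptions, and `hypergeom_upper_tail` is a hypothetical name.

```python
from math import comb


def hypergeom_upper_tail(k, population, n_success, n_draws):
    """P(X >= k) for X ~ Hypergeometric(population, n_success, n_draws)."""
    upper = min(n_success, n_draws)
    favorable = sum(
        comb(n_success, i) * comb(population - n_success, n_draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(population, n_draws)


# Two labelsets of 5 entities each, sharing 4, in a population of 20 entities.
p_value = hypergeom_upper_tail(k=4, population=20, n_success=5, n_draws=5)
p_thresh = 0.05
print(p_value, p_value < p_thresh)
```

With these numbers the overlap is significant at the 0.05 level, so under the description above the non-positive entities of the enriched labelset would be treated as neutral rather than negative.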
Labelset collection splits
AllHoldout – Holdout all available data points.
RandomRatioHoldout – Randomly holdout some ratio of the dataset.
RatioHoldout – Holdout a portion of the dataset.
ThresholdHoldout – Split the dataset according to some threshold values.
RandomRatioPartition – Randomly partition the dataset based on ratios.
RatioPartition – Split the dataset into parts of size proportional to some ratio.
ThresholdPartition – Split the dataset according to some threshold values.
Generating data splits from the labelset collection.
- class obnb.label.split.AllHoldout(*, shuffle=False, random_state=None)[source]
Holdout all available data points.
Initialize the AllHoldout object.
- class obnb.label.split.RandomRatioHoldout(ratio, *, shuffle=True, random_state=None)[source]
Randomly holdout some ratio of the dataset.
Initialize RandomRatioHoldout.
- class obnb.label.split.RandomRatioPartition(*ratios, shuffle=True, random_state=None)[source]
Randomly partition the dataset based on ratios.
Initialize RandomRatioPartition.
- class obnb.label.split.RatioHoldout(ratio, *, property_converter, ascending=True)[source]
Holdout a portion of the dataset.
First sort the dataset entities (data points) based on a 1-dimensional entity property passed in as x, either ascendingly or descendingly. Then take the top data points, with the portion defined by the ratio input.
Initialize the RatioHoldout object.
- Parameters:
ratio (float) – Ratio of holdout.
ascending (bool) –
- get_split_idx(x_sorted_val)[source]
Return the split index based on the split ratio.
- Return type:
int
- Parameters:
x_sorted_val (ndarray) –
- property ratio: float
Ratio of each split.
- class obnb.label.split.RatioPartition(*ratios, property_converter, ascending=True)[source]
Split the dataset into parts of size proportional to some ratio.
First sort the dataset entities (data points) based on a 1-dimensional entity property passed in as x, either ascendingly or descendingly. Then split the dataset based on the defined ratios.
Initialize the RatioPartition object.
- Parameters:
ratios (float) – Ratio of each split.
ascending (bool) –
- get_split_idx(x_sorted_val)[source]
Return the split index based on the split ratios.
- Return type:
List[int]
- Parameters:
x_sorted_val (ndarray) –
- property ratios: Tuple[float, ...]
Ratio of each split.
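The sort-then-split idea behind the ratio-based partition can be sketched as follows. `ratio_partition` is a hypothetical function that returns index lists; rounding the cumulative boundaries with `round` is an assumption, and obnb may place the cut points differently.

```python
from itertools import accumulate


def ratio_partition(x, ratios, ascending=True):
    """Sort indices by property x, then cut them into parts proportional to ratios."""
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=not ascending)
    # Cumulative boundaries; the last one is dropped and replaced by len(x).
    cuts = [round(len(x) * total) for total in accumulate(ratios)][:-1]
    edges = [0, *cuts, len(x)]
    return [order[start:stop] for start, stop in zip(edges, edges[1:])]


x = [5, 1, 4, 2, 3, 0, 9, 7, 8, 6]
parts = ratio_partition(x, (0.6, 0.2, 0.2))
print(parts)
```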
- class obnb.label.split.ThresholdHoldout(threshold, *, property_converter, ascending=True)[source]
Split the dataset according to some threshold values.
First sort the dataset entities (data points) based on a 1-dimensional entity property passed in as x, either ascendingly or descendingly. When sorted ascendingly, the holdout split would be entities that have properties with values up to but not including the threshold value.
Example
Suppose we have some dataset with properties x, then given the specified threshold, we would split the dataset as follows
>>> x = [0, 1, 1, 1, 2, 3, 4]
>>> threshold = 2
>>>
>>> holdout = [0, 1, 1, 1]
Initialize the ThresholdHoldout object.
- Parameters:
threshold (float) – Threshold used to determine the splits.
ascending (bool) –
- get_split_idx(x_sorted_val)[source]
Return the split index based on the cut threshold.
- Return type:
int
- Parameters:
x_sorted_val (ndarray) –
- property threshold: float
Threshold for splitting.
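The "up to but not including the threshold" rule for the ascending case can be sketched with the standard `bisect` module. `threshold_holdout` is a hypothetical function that returns the held-out property values, matching the example above.

```python
from bisect import bisect_left


def threshold_holdout(x, threshold):
    """Return the sorted property values strictly below the threshold."""
    x_sorted = sorted(x)
    # bisect_left gives the first position whose value is >= threshold.
    split_idx = bisect_left(x_sorted, threshold)
    return x_sorted[:split_idx]


x = [0, 1, 1, 1, 2, 3, 4]
holdout = threshold_holdout(x, threshold=2)
print(holdout)
```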
- class obnb.label.split.ThresholdPartition(*thresholds, property_converter, ascending=True)[source]
Split the dataset according to some threshold values.
First sort the dataset entities (data points) based on a 1-dimensional entity property passed in as x, either ascendingly or descendingly. When sorted ascendingly, the first partition would be entities that have properties with values up to but not including the first (smallest) threshold value, and the second partition would be the entities that have properties with values starting (inclusive) from the first threshold value up to the second threshold value (not inclusive).
Example
Suppose we have some dataset with properties x, then given the specified thresholds, we would split the dataset as follows
>>> x = [0, 1, 1, 1, 2, 3, 4]
>>> thresholds = (1, 3)
>>>
>>> split1 = [0]
>>> split2 = [1, 2, 3, 4]
>>> split3 = [5, 6]
Initialize the ThresholdPartition object.
- Parameters:
thresholds (float) – Thresholds used to determine the splits.
ascending (bool) –
- get_split_idx(x_sorted_val)[source]
Return the split index based on the cut thresholds.
- Return type:
List[int]
- Parameters:
x_sorted_val (ndarray) –
- property thresholds: Tuple[float, ...]
Thresholds for splitting.
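The threshold-based partition for the ascending case can be sketched in the same way, this time returning indices as in the example above. `threshold_partition` is a hypothetical function, not the obnb implementation.

```python
from bisect import bisect_left


def threshold_partition(x, thresholds):
    """Split indices of x into parts [.., t1), [t1, t2), ..., [tn, ..) by value."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    x_sorted = [x[i] for i in order]
    # Each cut is the first sorted position whose value is >= the threshold.
    cuts = [bisect_left(x_sorted, t) for t in sorted(thresholds)]
    edges = [0, *cuts, len(x)]
    return [order[start:stop] for start, stop in zip(edges, edges[1:])]


x = [0, 1, 1, 1, 2, 3, 4]
split1, split2, split3 = threshold_partition(x, (1, 3))
print(split1, split2, split3)
```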