mowl.datasets
Base dataset
This module contains classes intended to deal with mOWL datasets.
- class mowl.datasets.base.Dataset(ontology, validation=None, testing=None)
Bases: object
This class represents an mOWL dataset.
- Parameters
ontology (org.semanticweb.owlapi.model.OWLOntology) – The ontology containing the training data of the dataset.
validation (org.semanticweb.owlapi.model.OWLOntology, optional) – The ontology containing the validation data of the dataset, defaults to None.
testing (org.semanticweb.owlapi.model.OWLOntology, optional) – The ontology containing the testing data of the dataset, defaults to None.
- property ontology
Training ontology
- Return type
org.semanticweb.owlapi.model.OWLOntology
- property validation
Validation ontology
- Return type
org.semanticweb.owlapi.model.OWLOntology
- property testing
Testing ontology
- Return type
org.semanticweb.owlapi.model.OWLOntology
- property classes
List of classes in the dataset. The classes are collected from the training, validation and testing ontologies using the OWLAPI method ontology.getClassesInSignature().
- Return type
- property individuals
List of individuals in the dataset. The individuals are collected from the training, validation and testing ontologies using the OWLAPI method ontology.getIndividualsInSignature().
- Return type
- property object_properties
List of object properties (relations) in the dataset. The object properties are collected from the training, validation and testing ontologies using the OWLAPI method ontology.getObjectPropertiesInSignature().
- Return type
- property evaluation_classes
List of classes used for evaluation. Depending on the dataset, this property returns either a single OWLClasses object (as in PPIYeastDataset) or a tuple of OWLClasses objects (as in GDAHumanDataset). If not overridden, it returns the classes in the testing ontology, obtained from the OWLAPI method getClassesInSignature(), as an OWLClasses object.
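Because evaluation_classes may be either a single collection or a tuple of collections, calling code has to cope with both shapes. The following is a minimal sketch of that pattern using plain Python stand-ins. ToyDataset, ToyPairDataset, evaluation_groups and the gene:/disease: prefixes are hypothetical names chosen for illustration; none of them is part of mOWL.

```python
class ToyDataset:
    """Stand-in for a dataset; NOT the mOWL Dataset class."""

    def __init__(self, testing_classes):
        self._testing_classes = list(testing_classes)

    @property
    def evaluation_classes(self):
        # Default behaviour: every class found in the testing split.
        return list(self._testing_classes)


class ToyPairDataset(ToyDataset):
    """Stand-in for a dataset that evaluates two groups of classes."""

    @property
    def evaluation_classes(self):
        # Override: return two groups as a tuple (prefixes are hypothetical).
        genes = [c for c in self._testing_classes if c.startswith("gene:")]
        diseases = [c for c in self._testing_classes if c.startswith("disease:")]
        return genes, diseases


def evaluation_groups(dataset):
    """Normalize the result to a tuple so callers can treat both shapes alike."""
    result = dataset.evaluation_classes
    return result if isinstance(result, tuple) else (result,)


single = evaluation_groups(ToyDataset(["protein:1", "protein:2"]))
paired = evaluation_groups(ToyPairDataset(["gene:1", "disease:2", "gene:3"]))
```

Wrapping the single-collection case in a one-element tuple keeps downstream evaluation code free of type checks.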
- class mowl.datasets.base.PathDataset(ontology_path: str, validation_path: Optional[str] = None, testing_path: Optional[str] = None)
Bases: Dataset
Loads the dataset from ontology documents.
- class mowl.datasets.base.TarFileDataset(tarfile_path: str, *args, **kwargs)
Bases: PathDataset
Loads the dataset from a tar file.
- Parameters
tarfile_path (str) – Location of the tar file
**kwargs – See below
- Keyword Arguments
dataset_name (str): Name of the dataset
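As a rough illustration of what loading from a tar file involves, the sketch below extracts an archive with the standard-library tarfile module and maps the extracted files to paths that a path-based loader could then open. The helper extract_dataset, the directory layout (one top-level directory named after the dataset) and the file names ontology.owl, valid.owl and test.owl are assumptions made for this example; the reference above does not specify how mOWL lays out its archives.

```python
import os
import tarfile
import tempfile


def extract_dataset(tarfile_path, dataset_name):
    """Extract the archive beside itself and map file stems to extracted paths.

    Assumes (hypothetically) that the archive holds one top-level directory
    named after the dataset.
    """
    target_root = os.path.dirname(os.path.abspath(tarfile_path))
    with tarfile.open(tarfile_path) as archive:
        archive.extractall(target_root)
    dataset_dir = os.path.join(target_root, dataset_name)
    return {
        os.path.splitext(name)[0]: os.path.join(dataset_dir, name)
        for name in sorted(os.listdir(dataset_dir))
    }


# Build a throwaway archive so the sketch runs without any real dataset.
with tempfile.TemporaryDirectory() as source_root, \
        tempfile.TemporaryDirectory() as work_root:
    toy_dir = os.path.join(source_root, "toy")
    os.makedirs(toy_dir)
    for file_name in ("ontology.owl", "valid.owl", "test.owl"):  # hypothetical names
        with open(os.path.join(toy_dir, file_name), "w") as handle:
            handle.write("")
    archive_path = os.path.join(work_root, "toy.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(toy_dir, arcname="toy")
    extracted = extract_dataset(archive_path, "toy")
    found_splits = sorted(extracted)
```

The resulting paths correspond to the kind of arguments PathDataset accepts (ontology_path, validation_path, testing_path).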
- class mowl.datasets.base.RemoteDataset(url: str, data_root='./')
Bases: TarFileDataset
Loads the dataset from a remote URL.
- class mowl.datasets.base.Entities(collection)
Bases: object
Abstract class containing OWL entities indexed by their IRIs.
- check_owl_type(collection)
This method checks whether the elements in the provided collection are of the correct type.
- to_dict()
Generates a dictionary keyed by the IRIs of the OWL entities, with the corresponding OWL entities as values.
- to_index_dict()
Generates a dictionary keyed by OWL objects, with the corresponding indices as values.
- property as_str
Returns the list of entities as string names.
- property as_owl
Returns the list of entities as OWL objects.
- property as_dict
Returns the dictionary of entities indexed by their names.
- property as_index_dict
Returns the dictionary mapping entity names to their indices.
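The two dictionary views described for Entities can be pictured with a small stand-in container. This is a sketch only: ToyEntities is a hypothetical class, entities are represented by placeholder strings rather than OWL objects, the index dictionary is keyed by IRI string for simplicity, and sorting by IRI is an assumption made here so that indices are reproducible.

```python
class ToyEntities:
    """Stand-in container keyed by IRI; NOT the mOWL Entities class."""

    def __init__(self, collection):
        # collection: iterable of (iri, entity) pairs.
        # Sorting by IRI is an assumption of this sketch, for stable indices.
        self._by_iri = dict(sorted(collection, key=lambda pair: pair[0]))

    def to_dict(self):
        """IRI -> entity."""
        return dict(self._by_iri)

    def to_index_dict(self):
        """IRI -> consecutive integer index, in the stored order."""
        return {iri: index for index, iri in enumerate(self._by_iri)}


entities = ToyEntities([
    ("http://example.org/Mother", "<Mother entity>"),
    ("http://example.org/Father", "<Father entity>"),
])
by_iri = entities.to_dict()
by_index = entities.to_index_dict()
```

An integer index per entity is what embedding models typically need to look up a row in an embedding matrix.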
- class mowl.datasets.base.OWLClasses(collection)
Bases: Entities
Class containing OWL classes indexed by their IRIs.
Built-in datasets
- class mowl.datasets.builtin.PPIYeastDataset(url=None)
Bases: RemoteDataset
This dataset represents protein–protein interactions in yeast. The data consist of the Gene Ontology released on 20-10-2021 and protein interaction data from String Database version 11.5. Protein interaction data were randomly split 90:5:5 across the training, validation and testing ontologies; Gene Ontology functional annotations of proteins are part of the training ontology only. Each protein interaction is represented as an axiom of the form \(protein_1 \sqsubseteq interacts\_with . protein_2\).
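A random 90:5:5 split of interaction pairs can be sketched as follows. This is an illustration of the idea only, not the procedure used to build the dataset: split_pairs, the fixed seed and the synthetic protein names are assumptions made for the example.

```python
import random


def split_pairs(pairs, ratios=(90, 5, 5), seed=0):
    """Shuffle pairs and cut them into train/valid/test by integer ratios."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_valid = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test split
    return train, valid, test


# Synthetic interaction pairs standing in for real protein identifiers.
interactions = [(f"protein_{i}", f"protein_{i + 1}") for i in range(100)]
train, valid, test = split_pairs(interactions)
```

Passing ratios=(80, 10, 10) gives an 80:10:10 split with the same function.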
- property evaluation_classes
Classes that are used in evaluation
- class mowl.datasets.builtin.PPIYeastSlimDataset(*args, **kwargs)
Bases: PPIYeastDataset
Reduced version of PPIYeastDataset. The training ontology is built from the Slim Yeast subset of the Gene Ontology.
- class mowl.datasets.builtin.GDADataset(url=None)
Bases: RemoteDataset
Abstract class for gene–disease association datasets. A dataset of this kind represents the gene–disease associations in a particular species and is built from phenotypic annotations of genes and diseases. Gene annotations were taken from the Mouse/Human Orthology with Phenotype Annotations document, and disease annotations from the HPO annotations for rare disease document. These annotations were added to the Unified Phenotype Ontology (uPheno) to build the training ontology. Furthermore, gene–disease associations were obtained from the Associations of Mouse Genes with DO Diseases file. Associations for human and mouse were extracted separately (to build separate datasets), and each set was randomly split 80:10:10: the first part was added to the training ontology, and the other two parts formed the validation and testing ontologies, respectively.
- property evaluation_classes
List of classes used for evaluation. Depending on the dataset, this property returns either a single OWLClasses object (as in PPIYeastDataset) or a tuple of OWLClasses objects (as in GDAHumanDataset). If not overridden, it returns the classes in the testing ontology, obtained from the OWLAPI method getClassesInSignature(), as an OWLClasses object.
- class mowl.datasets.builtin.GDAHumanDataset
Bases: GDADataset
- class mowl.datasets.builtin.GDAHumanELDataset
Bases: GDADataset
This dataset is a reduced version of GDAHumanDataset. The training ontology contains axioms in the \(\mathcal{EL}\) language.
- class mowl.datasets.builtin.GDAMouseDataset
Bases: GDADataset
- class mowl.datasets.builtin.GDAMouseELDataset
Bases: GDADataset
This dataset is a reduced version of GDAMouseDataset. The training ontology contains axioms in the \(\mathcal{EL}\) language.
- class mowl.datasets.builtin.FamilyDataset(url=None)
Bases: RemoteDataset
This dataset represents a family domain. It is a short ontology with 12 axioms describing family relationships. The axioms are:
\[\begin{aligned}
Male & \sqsubseteq Person \\
Female & \sqsubseteq Person \\
Father & \sqsubseteq Male \\
Mother & \sqsubseteq Female \\
Father & \sqsubseteq Parent \\
Mother & \sqsubseteq Parent \\
Female \sqcap Male & \sqsubseteq \perp \\
Female \sqcap Parent & \sqsubseteq Mother \\
Male \sqcap Parent & \sqsubseteq Father \\
\exists hasChild.Person & \sqsubseteq Parent \\
Parent & \sqsubseteq Person \\
Parent & \sqsubseteq \exists hasChild.\top
\end{aligned}\]
- property evaluation_classes
List of classes used for evaluation. Depending on the dataset, this property returns either a single OWLClasses object (as in PPIYeastDataset) or a tuple of OWLClasses objects (as in GDAHumanDataset). If not overridden, it returns the classes in the testing ontology, obtained from the OWLAPI method getClassesInSignature(), as an OWLClasses object.
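To make the FamilyDataset axioms more concrete, the sketch below takes only the seven atomic subclass axioms from the list of twelve (those of the form A ⊑ B with class names on both sides) and computes all named superclasses of a class by following them transitively. It deliberately ignores the conjunction, disjointness and existential axioms, so it is not a reasoner, and the names SUBCLASS_OF and superclasses are hypothetical.

```python
# Atomic subclass axioms from the family ontology: child -> direct parents.
SUBCLASS_OF = {
    "Male": {"Person"},
    "Female": {"Person"},
    "Father": {"Male", "Parent"},
    "Mother": {"Female", "Parent"},
    "Parent": {"Person"},
}


def superclasses(name):
    """Return every named class reachable through atomic subclass axioms."""
    found = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


father_supers = superclasses("Father")
person_supers = superclasses("Person")
```

Here Father reaches Person along two paths (through Male and through Parent), and Person has no named superclass.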
Dataset for \(\mathcal{EL}\) language
- class mowl.datasets.el.ELDataset(ontology, class_index_dict=None, object_property_index_dict=None, extended=True, device='cpu')
Bases: object
This class provides data-related methods for working with the \(\mathcal{EL}\) description logic language. In general, it receives an ontology, normalizes it into 4 or 7 \(\mathcal{EL}\) normal forms and returns one torch.utils.data.Dataset per normal form. In the process, the names of classes and object properties are mapped to integer values to create the datasets; the corresponding dictionaries can be supplied as input or created from scratch.
- Parameters
ontology (org.semanticweb.owlapi.model.OWLOntology) – Input ontology that will be normalized into \(\mathcal{EL}\) normal forms.
extended (bool, optional) – If true, the normalization process returns 7 normal forms; if false, only 4. See Embedding the EL language for more information. Defaults to True.
class_index_dict (dict, optional) – Dictionary mapping class names to indices. If not provided, a dictionary will be created from the ontology classes. Defaults to None.
object_property_index_dict (dict, optional) – Dictionary mapping object property names to indices. If not provided, a dictionary will be created from the ontology object properties. Defaults to None.
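The name-to-index mapping described above can be sketched in plain Python. The example builds an index dictionary from the class names appearing in a few subclass axioms and then encodes each axiom as a pair of integers, which is the kind of row a per-normal-form dataset would hold. The helper build_index_dict, the sorted ordering and the tuple representation of axioms are assumptions of this sketch, not a description of how ELDataset is implemented.

```python
def build_index_dict(names):
    """Assign consecutive integers to distinct names, in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(names)))}


# Axioms of the form C ⊑ D, written as (subclass, superclass) pairs.
axioms = [("Father", "Male"), ("Mother", "Female"), ("Male", "Person")]

# Create the dictionary from scratch, as happens when none is supplied.
class_index_dict = build_index_dict(name for axiom in axioms for name in axiom)

# Encode every axiom as a pair of integer indices.
encoded = [(class_index_dict[sub], class_index_dict[sup]) for sub, sup in axioms]
```

Supplying a previously built dictionary instead of creating a new one keeps indices consistent across several datasets, for example between training and testing data.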
- get_gci_datasets()
Returns a dictionary containing the name of the normal forms as keys and the corresponding datasets as values. This method will return 7 datasets if the class parameter extended is True, otherwise it will return only 4 datasets.
- Return type
- property class_index_dict
Returns indexed dictionary with class names present in the dataset.
- Return type