DL2Vec
This example corresponds to the paper Predicting candidate genes from phenotypes, functions and anatomical site of expression.
DL2Vec is a graph-based machine-learning method for learning from biomedical ontologies. It transforms an ontology into a graph following a set of rules, generates random walks over the resulting graph, and feeds the walks to a Word2Vec model, which produces embeddings of the original ontology classes. Here, the algorithm generates numerical representations of genes and diseases from the background knowledge in the Gene Ontology, extended to incorporate phenotypes, functions of the gene products, and anatomical locations of gene expression. These representations of genes and diseases are then used to predict candidate genes for a given disease.
To show an example of DL2Vec, we need three components:
The ontology projector
The random walks generator
The Word2Vec model
import sys
sys.path.append('../../')
import mowl
mowl.init_jvm("10g")
from mowl.datasets.builtin import GDAMouseDataset
from mowl.projection import DL2VecProjector
from mowl.walking import DeepWalk
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec
Projecting the ontology
We project the ontology using the DL2VecProjector class. The rules used to project the ontology can be found at Projecting ontologies into graphs. The outcome of the projection algorithm is an edge list.
dataset = GDAMouseDataset()
projector = DL2VecProjector(bidirectional_taxonomy=True)
edges = projector.project(dataset.ontology)
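To make the idea of projection concrete, here is a minimal toy sketch (not the mowl API) of the kind of transformation the projector performs: subclass axioms become labeled edges, and with `bidirectional_taxonomy=True` an inverse `superclassof` edge is added as well. The axiom tuples below are illustrative placeholders.

```python
# Toy sketch (not the mowl implementation): turn "A SubClassOf B" axioms
# into a labeled edge list, mimicking the taxonomy part of the projection.
axioms = [("Gene1", "subclassof", "GeneClass"),
          ("Disease1", "subclassof", "DiseaseClass")]

edges = []
for sub, _, sup in axioms:
    edges.append((sub, "subclassof", sup))
    # bidirectional_taxonomy=True also adds the inverse edge:
    edges.append((sup, "superclassof", sub))

print(edges[0])  # ('Gene1', 'subclassof', 'GeneClass')
```

Each edge keeps its label, which matters later because the walker includes edge labels in the walks.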
Generating random walks
The random walks are generated using the DeepWalk class. This class implements the DeepWalk algorithm with one modification: edge labels are included as part of the walks.
walker = DeepWalk(20, # number of walks per node
20, # walk length
0.1, # restart probability
workers=4) # number of threads
walks = walker.walk(edges)
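The walking procedure can be sketched in pure Python. This is an illustrative toy version, not the mowl implementation: at each step the walker either restarts at the origin node (with the restart probability) or follows a random outgoing edge, appending the edge label and the target node to the walk.

```python
# Toy sketch of a random walk with restart over a labeled edge list;
# edge labels become tokens in the walk, as in mowl's DeepWalk.
import random

edges = [("a", "rel1", "b"), ("b", "rel2", "c"), ("c", "rel1", "a")]
graph = {}
for src, rel, dst in edges:
    graph.setdefault(src, []).append((rel, dst))

def walk(start, length, restart_prob, rng):
    path = [start]
    node = start
    for _ in range(length):
        if rng.random() < restart_prob or node not in graph:
            node = start              # restart at the origin node
        else:
            rel, node = rng.choice(graph[node])
            path.append(rel)          # include the edge label as a token
        path.append(node)
    return path

print(walk("a", 5, 0.1, random.Random(0)))
```

In the real pipeline this is done for every node in the graph, several walks per node, in parallel across workers.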
Training the Word2Vec model
To train the Word2Vec model, we rely on the Gensim library:
walks_file = walker.outfile
sentences = LineSentence(walks_file)
model = Word2Vec(sentences, vector_size=100, epochs=20, window=5, min_count=1, workers=4)
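LineSentence expects a plain-text file with one sentence per line and space-separated tokens, which is exactly the format of the walks file. A small stdlib-only sketch of that file format (the walk tokens below are illustrative):

```python
# Sketch of the walks-file format consumed by gensim's LineSentence:
# one walk per line, tokens separated by spaces.
import os
import tempfile

walks = [["a", "rel1", "b"], ["b", "rel2", "c"]]
path = os.path.join(tempfile.mkdtemp(), "walks.txt")
with open(path, "w") as f:
    for w in walks:
        f.write(" ".join(w) + "\n")

# LineSentence yields each line back as a list of tokens:
with open(path) as f:
    sentences = [line.split() for line in f]
print(sentences[0])  # ['a', 'rel1', 'b']
```

Because edge labels are tokens in the walks, they also receive embeddings, but only the class embeddings are used downstream.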
Evaluating the embeddings
We can evaluate the embeddings using the EmbeddingsRankBasedEvaluator class. First, we need to do some data preparation.
from mowl.evaluation.rank_based import EmbeddingsRankBasedEvaluator
from mowl.evaluation.base import CosineSimilarity
from mowl.projection import TaxonomyWithRelationsProjector
We are going to evaluate the plausibility of gene-disease associations: each gene is scored against all possible diseases, and we check the rank of the true disease association.
genes, diseases = dataset.evaluation_classes
projector = TaxonomyWithRelationsProjector(taxonomy=False,
relations=["http://is_associated_with"])
evaluation_edges = projector.project(dataset.testing)
filtering_edges = projector.project(dataset.ontology)
assert len(evaluation_edges) > 0
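The ranking protocol can be illustrated with a toy sketch (not the evaluator's implementation): score one gene against every candidate disease, sort the candidates by score, and locate the rank of the true disease. Hits@k is then the fraction of test associations whose true disease ranks within the top k. The disease names and scores below are made up.

```python
# Toy ranking sketch: rank all candidate diseases for one gene by score
# (higher = more plausible) and locate the true association.
scores = {"disease_a": 0.2, "disease_b": 0.9, "disease_c": 0.5}
true_disease = "disease_c"

ranking = sorted(scores, key=scores.get, reverse=True)
rank = ranking.index(true_disease) + 1  # 1-based rank of the true disease

hits_at_1 = int(rank <= 1)
hits_at_10 = int(rank <= 10)
print(rank, hits_at_1, hits_at_10)  # 2 0 1
```

The "filtered" variants of these metrics ignore candidates that already appear as true associations in the training set (here supplied via `training_set=filtering_edges`).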
The gene-disease associations will be scored using cosine similarity. For that reason, we use the CosineSimilarity class.
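For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal pure-Python sketch of the score (the evaluator computes this over the learned Word2Vec vectors):

```python
# Cosine similarity: dot(u, v) / (||u|| * ||v||); ranges from -1 to 1,
# with 1 meaning the vectors point in the same direction.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
```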
vectors = model.wv
evaluator = EmbeddingsRankBasedEvaluator(
    vectors,
    evaluation_edges,
    CosineSimilarity,
    training_set=filtering_edges,
    head_entities=genes.as_str,
    tail_entities=diseases.as_str,
    device='cpu'
)
evaluator.evaluate(show=True)
Hits@1: 0.00 Filtered: 0.00
Hits@10: 0.01 Filtered: 0.01
Hits@100: 0.37 Filtered: 0.37
MR: 574.23 Filtered: 574.23
AUC: 0.93 Filtered: 0.93
Evaluation finished. Access the results using the "metrics" attribute.
Total running time of the script: (29 minutes 50.812 seconds)
Estimated memory usage: 4769 MB