DL2Vec
This example corresponds to the paper Predicting candidate genes from phenotypes, functions and anatomical site of expression.
DL2Vec is a graph-based machine-learning method for learning from biomedical ontologies. It transforms an ontology into a graph following a set of rules, generates random walks over the resulting graph, and feeds the walks to a Word2Vec model, which produces embeddings of the original ontology classes. Here, the algorithm generates numerical representations of genes and diseases from the background knowledge in the Gene Ontology, extended to incorporate phenotypes, functions of the gene products, and anatomical locations of gene expression. These representations of genes and diseases are then used to predict candidate genes for a given disease.
To show an example of DL2Vec, we need three components:
The ontology projector
The random walks generator
The Word2Vec model
import sys
sys.path.append('../../')
import mowl
mowl.init_jvm("10g")
from mowl.datasets.builtin import GDAMouseDataset
from mowl.projection import DL2VecProjector
from mowl.walking import DeepWalk
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec
Projecting the ontology
We project the ontology using the DL2VecProjector class. The rules used to project the ontology can be found at Projecting ontologies into graphs. The outcome of the projection algorithm is an edge list.
dataset = GDAMouseDataset()
projector = DL2VecProjector(bidirectional_taxonomy=True)
edges = projector.project(dataset.ontology)
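To make the idea of projection concrete, here is a minimal toy sketch (not the mowl API) of the kind of transformation the projector performs: subclass axioms become labeled edges, and with `bidirectional_taxonomy=True` an inverse `superclassof` edge is added as well. The axiom tuples below are illustrative placeholders.

```python
# Toy sketch (not the mowl implementation): turn "A SubClassOf B" axioms
# into a labeled edge list, mimicking the taxonomy part of the projection.
axioms = [("Gene1", "subclassof", "GeneClass"),
          ("Disease1", "subclassof", "DiseaseClass")]

edges = []
for sub, _, sup in axioms:
    edges.append((sub, "subclassof", sup))
    # bidirectional_taxonomy=True also adds the inverse edge:
    edges.append((sup, "superclassof", sub))

print(edges[0])  # ('Gene1', 'subclassof', 'GeneClass')
```

Each edge keeps its label, which matters later because the walker includes edge labels in the walks.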
Generating random walks
The random walks are generated using the DeepWalk class. This class implements the DeepWalk algorithm with one modification: edge labels are included as part of the walks.
walker = DeepWalk(20, # number of walks per node
20, # walk length
0.1, # restart probability
workers=4) # number of threads
walks = walker.walk(edges)
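The walking procedure can be sketched in pure Python. This is an illustrative toy version, not the mowl implementation: at each step the walker either restarts at the origin node (with the restart probability) or follows a random outgoing edge, appending the edge label and the target node to the walk.

```python
# Toy sketch of a random walk with restart over a labeled edge list;
# edge labels become tokens in the walk, as in mowl's DeepWalk.
import random

edges = [("a", "rel1", "b"), ("b", "rel2", "c"), ("c", "rel1", "a")]
graph = {}
for src, rel, dst in edges:
    graph.setdefault(src, []).append((rel, dst))

def walk(start, length, restart_prob, rng):
    path = [start]
    node = start
    for _ in range(length):
        if rng.random() < restart_prob or node not in graph:
            node = start              # restart at the origin node
        else:
            rel, node = rng.choice(graph[node])
            path.append(rel)          # include the edge label as a token
        path.append(node)
    return path

print(walk("a", 5, 0.1, random.Random(0)))
```

In the real pipeline this is done for every node in the graph, several walks per node, in parallel across workers.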
Training the Word2Vec model
To train the Word2Vec model, we rely on the Gensim library:
walks_file = walker.outfile
sentences = LineSentence(walks_file)
model = Word2Vec(sentences, vector_size=100, epochs=20, window=5, min_count=1, workers=4)
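LineSentence expects a plain-text file with one sentence per line and space-separated tokens, which is exactly the format of the walks file. A small stdlib-only sketch of that file format (the walk tokens below are illustrative):

```python
# Sketch of the walks-file format consumed by gensim's LineSentence:
# one walk per line, tokens separated by spaces.
import os
import tempfile

walks = [["a", "rel1", "b"], ["b", "rel2", "c"]]
path = os.path.join(tempfile.mkdtemp(), "walks.txt")
with open(path, "w") as f:
    for w in walks:
        f.write(" ".join(w) + "\n")

# LineSentence yields each line back as a list of tokens:
with open(path) as f:
    sentences = [line.split() for line in f]
print(sentences[0])  # ['a', 'rel1', 'b']
```

Because edge labels are tokens in the walks, they also receive embeddings, but only the class embeddings are used downstream.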
Evaluating the embeddings
We can evaluate the embeddings using the EmbeddingsRankBasedEvaluator class. First, we need to do some data preparation.
from mowl.evaluation.rank_based import EmbeddingsRankBasedEvaluator
from mowl.evaluation.base import CosineSimilarity
from mowl.projection import TaxonomyWithRelationsProjector
We are going to evaluate the plausibility of gene-disease associations: each gene is scored against all possible diseases, and we check the rank of the true disease association.
genes, diseases = dataset.evaluation_classes
projector = TaxonomyWithRelationsProjector(taxonomy=False,
relations=["http://is_associated_with"])
evaluation_edges = projector.project(dataset.testing)
filtering_edges = projector.project(dataset.ontology)
assert len(evaluation_edges) > 0
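The ranking protocol can be illustrated with a toy sketch (not the evaluator's implementation): score one gene against every candidate disease, sort the candidates by score, and locate the rank of the true disease. Hits@k is then the fraction of test associations whose true disease ranks within the top k. The disease names and scores below are made up.

```python
# Toy ranking sketch: rank all candidate diseases for one gene by score
# (higher = more plausible) and locate the true association.
scores = {"disease_a": 0.2, "disease_b": 0.9, "disease_c": 0.5}
true_disease = "disease_c"

ranking = sorted(scores, key=scores.get, reverse=True)
rank = ranking.index(true_disease) + 1  # 1-based rank of the true disease

hits_at_1 = int(rank <= 1)
hits_at_10 = int(rank <= 10)
print(rank, hits_at_1, hits_at_10)  # 2 0 1
```

The "filtered" variants of these metrics ignore candidates that already appear as true associations in the training set (here supplied via `training_set=filtering_edges`).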
The gene-disease associations will be scored using cosine similarity. For that reason, we use the CosineSimilarity class.
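For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal pure-Python sketch of the score (the evaluator computes this over the learned Word2Vec vectors):

```python
# Cosine similarity: dot(u, v) / (||u|| * ||v||); ranges from -1 to 1,
# with 1 meaning the vectors point in the same direction.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
```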
vectors = model.wv
evaluator = EmbeddingsRankBasedEvaluator(
    vectors,
    evaluation_edges,
    CosineSimilarity,
    training_set=filtering_edges,
    head_entities=genes.as_str,
    tail_entities=diseases.as_str,
    device='cpu'
)
evaluator.evaluate(show=True)
Hits@1: 0.00 Filtered: 0.00
Hits@10: 0.01 Filtered: 0.01
Hits@100: 0.37 Filtered: 0.37
MR: 574.23 Filtered: 574.23
AUC: 0.93 Filtered: 0.93
Evaluation finished. Access the results using the "metrics" attribute.
Total running time of the script: (29 minutes 50.812 seconds)
Estimated memory usage: 4769 MB