Research

Alignment-based techniques dominate sequence (DNA, protein) comparison in bioinformatics. However, they often fail to produce useful results when sequences are divergent but functionally similar, and the underlying algorithms have quadratic complexity. These limitations have motivated researchers to work on alternative ‘alignment-free’ approaches. One of the more common ideas in this direction is the bag-of-words approach, which represents a sequence as a vector of k-mer (subsequence) counts. Such vectors are easy to compute but very high dimensional, sometimes with a dimension even larger than the length of the sequence itself. Bag-of-words approaches are also criticized for their limited ability to incorporate contextual information, which can be very useful. A step forward is to develop approaches that map sequences into a low-dimensional vector space while keeping biological relations intact (i.e., functionally similar sequences are mapped closer together in the vector space than others).

Recent developments in representation learning research (specifically in NLP) have opened the possibility of exploring similar techniques for biological sequences. Such techniques can include contextual information while computing low-dimensional representations. They also make it possible to include prior knowledge (such as class information) that is not always evident from the sequences themselves; for example, apparently divergent sequences may belong to the same class. Furthermore, using such information-rich representations as input to machine learning algorithms improves their performance on a given task. Some of this work is summarized below:

Learning distributed representation for biological sequence analysis

Keywords: Representation Learning, Alignment-free methods, Classification

Biological sequences are strings of characters over an alphabet and hence are not in a form that can be used directly with machine learning algorithms. Thus, it is essential to develop efficient approaches for generating vector embeddings that capture biological information directly from the sequences. In this project, we develop a word-embedding-based representation learning framework, Seq2Vec, that can be trained to generate embeddings for any given set of biological sequences. As an outcome of the project, we demonstrated the application of such embeddings to protein family classification tasks.
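As a rough illustration of how a sequence can be treated as text, the sketch below splits a sequence into overlapping k-mers ("words") and counts them. The token list is the kind of "sentence" a word-embedding model could be trained on, and the counts correspond to the bag-of-words baseline. The function names, the choice of k = 3, and the toy peptide are assumptions made for illustration only; the tokenization actually used by Seq2Vec is not described here.

```python
from collections import Counter


def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence, k=3):
    """Bag-of-words view: how often each k-mer occurs in the sequence."""
    return Counter(kmer_tokens(sequence, k))


# Hypothetical toy peptide, used only for illustration.
tokens = kmer_tokens("MKVLAAGMKV", k=3)
counts = kmer_counts("MKVLAAGMKV", k=3)
```

With an alphabet of 20 amino acids there are 20^k possible k-mers, which is why the count vector can quickly become longer than the sequence it describes.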

Metric Learning on Biological Sequences Embeddings

Keywords: Metric Learning, Classification, Retrieval

Machine learning algorithms that rely on distance metrics are computationally efficient and can handle large datasets; however, default distances in the embedded space often yield inadequate accuracy. In this work, we propose a framework that, instead of relying on default distance metrics (Euclidean, cosine), uses a Mahalanobis distance learned from the available class label information. As an outcome of the work, we show performance improvements from using the learned Mahalanobis distance over the default Euclidean metric in both retrieval and classification tasks.
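To make the distance concrete: a Mahalanobis distance can be written as sqrt((x - y)^T M (x - y)) with M = L^T L, which equals the Euclidean distance after applying the linear map L to both points. The sketch below only evaluates this distance for a given L; how L is learned from class labels is not shown, and the matrices and vectors are made-up values rather than anything from this work.

```python
import math


def mahalanobis(x, y, L):
    """Distance sqrt((x - y)^T M (x - y)) with M = L^T L.

    L is a (hypothetical) learned linear map given as a list of rows;
    applying it to x - y and taking the Euclidean norm is equivalent.
    """
    diff = [a - b for a, b in zip(x, y)]
    projected = [sum(w * d for w, d in zip(row, diff)) for row in L]
    return math.sqrt(sum(p * p for p in projected))


identity = [[1.0, 0.0], [0.0, 1.0]]
# Made-up map that stretches the first axis and shrinks the second.
stretch = [[2.0, 0.0], [0.0, 0.5]]

x, y = [1.0, 2.0], [4.0, 6.0]
d_euclid = mahalanobis(x, y, identity)  # identity map: plain Euclidean distance
d_learned = mahalanobis(x, y, stretch)
```

With the identity map the function reduces to the Euclidean distance, so a learned L can be read as reweighting and mixing embedding dimensions so that same-class sequences end up closer together.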

Supervised approach for learning embeddings for biological sequences

Keywords: Supervised learning, Alignment-free approaches, Retrieval, Classification

In this project, we introduce novel supervised representation learning methods, SuperVec and SuperVecX, that incorporate class label information along with contextual information when learning sequence embeddings. We also provide a hierarchical approach on top of these embedding methods, specifically for homologous sequence retrieval tasks. The other contribution of this work is a set of hybrid approaches that use standard alignment-based (BLAST) and alignment-free (SuperVec(X)) retrieval methods together to improve sequence retrieval performance. As an outcome of the project, we showed the success of the proposed methods in terms of retrieval performance and computational efficiency on the homologous sequence retrieval task. We also showed that such embeddings can be useful for other downstream bioinformatics applications, such as various protein classification tasks.

Protein-Protein Interaction Prediction

Keywords: Paired-input problem, Prediction, Sequence embeddings

Protein-protein interaction (PPI) prediction is a paired-input problem in which a prediction is made for two objects together. In this ongoing work, we explore the usability of representation learning approaches as feature constructors for paired samples. On this problem, such features can offer computational and performance advantages over the prevalent high-dimensional feature vectors that rely on physicochemical properties. We establish that low-dimensional representations of protein pairs generally give better performance than feature vectors based on physicochemical properties. The success of such methods would also help in extending them to include meta-information from protein interaction networks or annotations from the Gene Ontology graph.
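One simple way to turn two per-protein embeddings into a single feature vector for a pair is to combine them with operations that do not depend on the order of the pair. The sketch below uses the element-wise sum and absolute difference. This particular combination, the function name, and the example vectors are assumptions for illustration; the pair construction used in this work is not specified here.

```python
def pair_features(u, v):
    """Order-independent feature vector for a pair of embeddings.

    Element-wise sum and absolute difference are both symmetric in
    (u, v), so swapping the two proteins gives the same features.
    """
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    sums = [a + b for a, b in zip(u, v)]
    abs_diffs = [abs(a - b) for a, b in zip(u, v)]
    return sums + abs_diffs


# Made-up 3-dimensional embeddings for two hypothetical proteins.
protein_a = [0.5, -1.0, 2.0]
protein_b = [1.5, 1.0, 0.0]
features = pair_features(protein_a, protein_b)
```

Plain concatenation of the two embeddings is another option, but it gives different features for (A, B) and (B, A), which a downstream classifier would then have to learn to treat as the same pair.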

Semi-supervised framework for representation learning

Keywords: Representation Learning, Semi-Supervised approach, Alignment-free approach

Acquiring quality labels (such as functional annotations) for biological sequences is generally a costly and time-consuming process, so the desired annotations are unavailable for many sequences. Given this limited availability of labels, our focus in this work is to develop a semi-supervised representation learning approach that can generate quality sequence embeddings while requiring only a few labels to train the model. We expect such an approach to improve upon unsupervised methods and to give results comparable to supervised methods on downstream bioinformatics tasks such as sequence retrieval and protein family classification.