
Research Interest

My main research interest lies in facilitating the integration of experimental and computational research, in particular in computational genomics and metagenomics. Genome and metagenome projects have produced ever larger sequence databases, much of which remains unanalyzed. My goal is to develop algorithms and tools to analyze the large amounts of data generated by sequencing projects. As computation becomes the more costly part of the sequence analysis pipeline, efficient algorithms become crucial.

Biological data analysis aims to answer questions that are fundamental to our understanding of life, while also leading to important findings in medicine and drug discovery. To address any scientific question, I believe that we should use the knowledge we already have and apply it to different problems. I have learned from experience that different domains share common characteristics and therefore common solutions. For example, the same machine learning algorithms used in speech recognition are now being tailored to analyze biological data. In addition to taking advantage of our accumulated knowledge, I strongly believe that individual skills are crucial to a successful research career. While I have always tried to build on other researchers' ideas, I also strive to be an independent thinker and to use my critical thinking, experience, creativity and communication skills to develop my own approach to a given problem. Beyond independent work, I strongly believe that we need to collaborate with colleagues who have shared and complementary interests. I have worked with an interdisciplinary team with members from the computer science and statistics departments as well as the medical school. I know firsthand the benefits of working together to learn more about a subject and achieve a common goal.

My research has focused on the problem of gene finding in genomic and metagenomic data. A number of competing methodologies have been developed to identify genes and classify DNA sequences as coding or non-coding. This classification is fundamental to gene-finding tools and is one of the most challenging tasks in computational biology. In earlier work, I developed an information-theoretic measure called MIM that discriminates coding from non-coding sequences. Based on this amino acid mutual information measure, I also developed a species-independent iterative approach that learns MIM-based models and classifies sequences from novel genomes. This algorithm distinguishes coding from non-coding sequences in bacterial and archaeal genomes with high accuracy. Using a set of representative genes, I created universal coding models, which the algorithm uses to classify all of the sequences in the genome being analyzed. New coding profiles are then calculated from that classification, and the algorithm alternates between classifying sequences and rebuilding the models. It iterates until it converges or reaches the specified maximum number of iterations. The experimental results support my hypothesis that the universal amino acid profiles provide sufficient breadth to serve as a starting point for the iterative MIM algorithm. The result is a powerful species-independent method for classifying coding and non-coding DNA sequences, which can be incorporated into any existing gene finder or used to post-process its classifications.
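
The alternation between classification and model rebuilding described above can be sketched roughly as follows. This is only an illustration of the loop structure, not the MIM method itself: it substitutes plain amino acid frequency profiles and a log-odds score for the MIM measure, which is not specified here, and all names (`build_profile`, `log_odds`, `iterative_classify`, `max_iter`) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(seqs, pseudocount=1.0):
    """Estimate amino acid frequencies from a set of sequences.

    Stand-in for a real model: a smoothed frequency profile, NOT MIM.
    """
    counts = Counter()
    for s in seqs:
        counts.update(ch for ch in s if ch in ALPHABET)
    total = sum(counts[a] for a in ALPHABET) + pseudocount * len(ALPHABET)
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}


def log_odds(seq, coding, noncoding):
    """Score a sequence: positive values favour the coding profile."""
    return sum(math.log(coding[ch] / noncoding[ch])
               for ch in seq if ch in ALPHABET)


def iterative_classify(seqs, coding, noncoding, max_iter=20):
    """Alternate between classifying sequences and rebuilding the profiles.

    Starts from the supplied (e.g. universal) profiles and stops when the
    labels no longer change or max_iter is reached.
    """
    labels = None
    for _ in range(max_iter):
        new_labels = [log_odds(s, coding, noncoding) > 0 for s in seqs]
        if new_labels == labels:  # converged: classification is stable
            break
        labels = new_labels
        pos = [s for s, is_coding in zip(seqs, labels) if is_coding]
        neg = [s for s, is_coding in zip(seqs, labels) if not is_coding]
        if not pos or not neg:  # cannot rebuild both profiles
            break
        coding, noncoding = build_profile(pos), build_profile(neg)
    return labels
```

In this sketch the initial `coding` and `noncoding` profiles play the role of the universal models, and each pass replaces them with profiles learned from the genome under analysis.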

Computational gene-finding algorithms have proven robust at identifying genes in complete genomes. However, Next Generation Sequencing (NGS) has presented new challenges because the data are incomplete and fragmented. Over the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and to identify the coding ORFs, bypassing other challenging tasks such as assembling the metagenome. I have developed a robust metagenomic gene identifier called MGC, which outperforms state-of-the-art metagenomic gene prediction algorithms. MGC uses a two-stage machine learning approach and computes several models that classify ORFs extracted from fragmented sequences. Its main difference from other approaches is the hypothesis that sequences need separate models based on their local GC content, in order to avoid the noise introduced into a single model trained on sequences from the entire GC spectrum. MGC provides evidence that learning separate models for several predefined GC content ranges, rather than a single model, improves the performance of the neural network. In addition to its direct use in finding genes, MGC's improvement lays the groundwork for further investigation into using GC content to partition training data in machine learning based gene finders. I am currently working on improving MGC's ability to handle various read lengths and on incorporating error models into the algorithm.
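
The idea of routing each ORF to a model trained for its GC content range can be illustrated with a short sketch. It is not MGC's implementation: the range boundaries in `GC_BOUNDARIES` are invented for illustration, since the actual predefined ranges are not given here, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical boundaries between GC content ranges (as fractions).
# Four boundaries give five ranges: <0.35, 0.35-0.45, ..., >=0.65.
GC_BOUNDARIES = [0.35, 0.45, 0.55, 0.65]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_bin(seq, boundaries=GC_BOUNDARIES):
    """Index of the GC range a sequence falls into (0 = lowest GC)."""
    return bisect_right(boundaries, gc_content(seq))


def group_by_gc_bin(orfs):
    """Partition ORFs by GC range, so each group can train or query
    its own model instead of sharing one model across the spectrum."""
    groups = {}
    for orf in orfs:
        groups.setdefault(gc_bin(orf), []).append(orf)
    return groups
```

At training time each group would be used to fit its own classifier; at prediction time `gc_bin` selects which classifier scores a given ORF.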

Currently, most metagenomic analysis tools rely on BLASTX for annotation and phylogenetic profiling. The idea of the metagenomic pipeline is to take an environmental sample, extract the DNA and sequence a set of reads. These reads are then searched against reference databases in order to infer taxa or identify genes. Researchers use BLASTX against the NCBI-nr database even though it is at least six times slower, because searching metagenomic reads against the nucleotide database yields a hit rate of less than 1%. Proteins allow us to go further back in evolutionary time and detect homology, but the cost is high. If we can find the correct reading frame and submit only the sequences that code for a protein, we can cut this computational cost by at least a factor of six. This immediate benefit is why I believe we need to advocate for the use of this new generation of gene finders in the metagenomic pipeline. We also need to continue our search for better and more efficient tools. My immediate research goal is to improve on these tools and produce a metagenomic gene finder as powerful as those that exist for genomic analysis.
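
The factor of six comes from the reading frames: a translated search such as BLASTX considers all six frames of each read (three per strand), whereas a gene finder that identifies the coding frame leaves only one. A minimal sketch of the six frames of a read, with hypothetical function names, is:

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(read):
    """Split a read into codons for each of its six reading frames.

    Keys are (strand, offset) pairs; trailing bases that do not fill a
    complete codon are dropped.
    """
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            usable = len(seq) - (len(seq) - offset) % 3
            frames[(strand, offset)] = [
                seq[i:i + 3] for i in range(offset, usable, 3)
            ]
    return frames
```

A gene finder that reports the strand and offset of a coding ORF lets the pipeline submit a single frame per read, and reads with no predicted coding ORF need not be searched at all.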

In the next few years, I plan to continue building powerful algorithms and tools that allow us to better analyze the large-scale datasets that are being sequenced on a daily basis. I am open to interdisciplinary collaborations, so if you or your lab has large-scale genomic or metagenomic data, I would be glad to contribute to its analysis.

Achraf El Allali
