Computational genomics

Knowledge integration and exploratory multi-omics data analysis

Our research

Computational life science is a diverse field, with many layers and scales of information available in private and public databases. We apply computational techniques for knowledge extraction, representation and integration to facilitate exploratory analysis of omics datasets. Our interests include the representation and clustering of antimicrobial peptides as k-mer graphs, the inference of large networks from gene expression data, and the application of machine-learning techniques to phenotype prediction.


Human health and food safety

Metagenomics, the study of the genomic diversity of microbes, is increasingly used in food safety, environmental studies, and human and animal health. Recent advances in high-throughput sequencing technologies have enabled the characterization and comparison of microbial communities in very diverse environments. One of the major research challenges is gaining insight into the function, structure and organisation of these communities. For example, characterizing the composition and activity of metagenomes across different individuals (healthy and diseased subjects) is important for understanding the role of the microbiota in disease development.

Can we predict from the gut microbiome whether someone has a disease, a predisposition to it, or even how far the disease has progressed? In high-dimensional metagenomics data, the phenotype or trait of the host organism may not be easy to detect, so the ability to predict it becomes a powerful analytical tool. By analysing microbiome data across human, mouse, and environmental samples and applying RoDEO (Robust Differential Gene Expression) combined with machine learning methods, we have been able to accurately predict phenotypes or traits of host organisms. We are investigating how bacteria may affect our health and how to predict the disease status of an individual from his or her gut microbiome. The approach has the potential to support disease diagnosis and improve the future personalisation of medicines.
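As a rough illustration of what predicting a host phenotype from microbial abundances can look like, the sketch below normalises per-taxon read counts and assigns a new sample to the phenotype with the nearest average profile. It does not implement RoDEO or the machine learning methods used in our work: the relative-abundance normalisation and nearest-centroid classifier are simple stand-ins, and the function names, labels and counts are all made up for the example.

```python
import math
from collections import defaultdict


def relative_abundance(counts):
    """Scale raw per-taxon read counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def fit_centroids(samples, labels):
    """Average the normalised profiles of the samples sharing each label."""
    grouped = defaultdict(list)
    for counts, label in zip(samples, labels):
        grouped[label].append(relative_abundance(counts))
    return {
        label: [sum(column) / len(profiles) for column in zip(*profiles)]
        for label, profiles in grouped.items()
    }


def predict(centroids, counts):
    """Return the label whose centroid is closest in Euclidean distance."""
    profile = relative_abundance(counts)
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))


# Hypothetical toy data: read counts for three taxa in four training samples.
train_counts = [[80, 15, 5], [70, 20, 10], [10, 30, 60], [20, 20, 60]]
train_labels = ["healthy", "healthy", "disease", "disease"]

centroids = fit_centroids(train_counts, train_labels)
print(predict(centroids, [60, 30, 10]))
print(predict(centroids, [5, 25, 70]))
```

Real metagenomics data has thousands of features and far fewer samples, so a practical pipeline would also need robust normalisation, feature selection and cross-validation.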

Whole-metagenome analysis

Current taxonomic classification methods focus on sequencing specific marker genes, such as 16S rRNA, and rely on existing microbial reference databases, which are often incomplete. Whole-metagenome shotgun sequencing is more informative, but it generates huge collections of short reads. The need to analyse, assemble or align these reads makes whole-metagenome analysis both data- and computation-intensive. We are developing high-performance computing (HPC) tools for metagenomics analyses.

Soil metagenomics

Soil is the most biodiverse environment on Earth, with up to 10 billion bacterial cells estimated to reside in a single gram of soil. Yet we have very little understanding of the microbial populations that are essential for maintaining soil health. Through a large-scale metagenomics analysis, we are trying to understand which microbial populations are present in different soil samples and how their concentration affects soil quality. Furthermore, we are trying to understand soil as a living system whose characteristics are defined by the interplay between its inhabitant species.

We are setting up comprehensive computational workflows, developed in-house and in close collaboration with our industrial and academic partners, to process voluminous metagenomics datasets rapidly and efficiently on high-performance computing (HPC) clusters. We apply computational and statistical techniques at scale to build a better understanding of soil.

Plant genomics

Wheat genomics

Wheat is the most widely cultivated grain for human consumption. To keep pace with the increase in human population and with environmental changes, it is vital to increase and sustain the yield of wheat crops. The size and complexity of the wheat genome make it a challenging system to study. We are interested in understanding the regulatory mechanisms in wheat that are responsible for key agricultural traits of interest to plant breeders. We are combining traditional computational genomics approaches with Big Data techniques to handle the complexities that arise from studying such large systems.

Blog post: Thanksgiving Stuffing and Addressing the Concerns of Future Food Security

Graph-based representation of antimicrobial peptides

Antimicrobial peptides are a unique and diverse group of molecules, divided into subgroups based on their amino-acid composition and structure, that have been demonstrated to kill bacteria, viruses and fungi, and even transformed or cancerous cells. Because antimicrobial resistance is a threat to global health, there is a need to discover new antimicrobial peptides. High-throughput simulation, machine learning, and data analysis and representation can help accelerate the discovery process. A large number of protein and peptide sequences, annotated with a range of information and properties, are available for analysis in public databases such as UniProt, InterPro and CAMPR3. We want to explore these datasets from a genomic perspective and cluster sequences that share some functionality.

In support of this activity, we are developing a k-mer based framework for the clustering, graph representation and visualization of amino-acid sequences, specifically antimicrobial peptides, based on their functionalities, properties and structural features. By extracting antimicrobial signals from sequences, the tool can provide insights into the data and guide the discovery of novel antimicrobial peptides.
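To make the idea of k-mer based clustering more concrete, here is a minimal sketch, not the framework itself. It assumes one simple scheme: each sequence is reduced to its set of overlapping k-mers, two sequences are linked when the Jaccard similarity of their k-mer sets reaches a threshold, and clusters are the connected components of the resulting graph. The function names, the threshold and the toy sequences are hypothetical choices made for illustration.

```python
from itertools import combinations


def kmers(sequence, k=3):
    """Return the set of overlapping k-mers in an amino-acid sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(peptides, k=3, threshold=0.5):
    """Link peptides whose k-mer sets have Jaccard similarity >= threshold."""
    profiles = {name: kmers(seq, k) for name, seq in peptides.items()}
    graph = {name: set() for name in peptides}
    for u, v in combinations(peptides, 2):
        if jaccard(profiles[u], profiles[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def clusters(graph):
    """Connected components of the graph, each as a sorted list of names."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Made-up toy sequences, for illustration only.
toy_peptides = {
    "pep_a": "KKLLKKAGLL",
    "pep_b": "KKLLKKAGIL",
    "pep_c": "RWRWRWFF",
    "pep_d": "RWRWRWFY",
}
print(clusters(similarity_graph(toy_peptides)))
# prints [['pep_a', 'pep_b'], ['pep_c', 'pep_d']]
```

The all-pairs comparison is quadratic in the number of sequences, so a collection the size of a public database would need indexing or sketching of the k-mer sets rather than this direct approach.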


[1] A.P. Carrieri, N. Haiminen, L. Parida
“Host Phenotype Prediction from Differentially Abundant Microbes Using RoDEO,”
In: A. Bracciali, G. Caravagna, D. Gilbert, R. Tagliaferri (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2016. Lecture Notes in Computer Science, vol 10477. Springer, 2017.

[2] N. Haiminen, M. Klaas, Z. Zhou, F. Utro, P. Cormican, T. Didion, C. Jensen, C.C. Mason, S. Barth, L. Parida
“Comparative exomics of Phalaris cultivars under salt stress,”
BMC Genomics 15(6), 1–12, 2014.


Niina Haiminen
IBM T.J. Watson Research Center

Laxmi Parida
IBM T.J. Watson Research Center

Our collaborators at STFC

Philippe Gambron
Will Rowe
Martyn Winn