05 Aug 2021

Tapping into the inner rhythm of living organisms with AI and ML

In recently published research in PNAS, IBM and Earlham Institute scientists demonstrate the power of AI- and ML-based approaches for deeper insight into the circadian clock and how it is regulated.

IBM's Dr. Laura-Jayne Gardiner, AI and informatics for life sciences, and Prof. Anthony Hall, Head of Plant Genomics, Earlham Institute

Anyone who’s flown a great distance by plane would probably agree that jetlag is one of the biggest drags of the whole traveling experience. While there are all kinds of methods to trick the body, it’s hard to go against our natural, inner rhythm which regulates our 24-hour sleep-wake cycles.

So, why do our bodies get out of whack when we fly to another time zone?

As part of a collaboration between IBM Research Europe and the Earlham Institute, we set out to explore how artificial intelligence (AI) and machine learning (ML) could help scientists understand more about the inner 24-hour cycles—or circadian rhythms—that are part of an organism’s internal body clock. Understanding more about circadian regulation and function in living organisms could potentially lead to new discoveries about how human bodies tick, for example, or how to influence crop yields.

In our new paper, published in PNAS, we demonstrate the power of AI- and ML-based approaches for more cost-effective analysis and deeper insight into the circadian clock and how it is regulated [1].

Detecting circadian rhythmicity

Circadian rhythms are innate to most living organisms and are critical to life on Earth; the human sleep-wake cycle is the best-known example. The word circadian originates from the Latin phrase “circa diem,” which means “around a day.”

Essentially, our inner rhythm is driven by a circadian clock, which is a biochemical oscillator synchronized with solar time, or the position of the sun in the sky.

In most living things, including animals, plants, fungi and even cyanobacteria, internally synchronized circadian clocks make it possible for an organism to anticipate daily environmental changes corresponding with the day-night cycle and adjust its biology and behavior accordingly. So, when we experience jetlag, there is a chronobiological problem. That is, our body clocks are out of whack because the normal external cues such as light or temperature have changed.

On a more biological level, the circadian clock temporally orchestrates physiology, biochemistry, and metabolism across the 24-hour day-night cycle. This is why being out of sync can affect our fitness levels, our health, or our ability to survive. In other organisms, such as plants, it controls when they grow or when they flower.

If a gene is involved in the circadian clock, it typically oscillates between off and on states over a 24-hour period. This pattern is called circadian rhythmicity. With current methods, detecting circadian rhythmicity is challenging because it requires long, high-resolution time-series datasets, known as transcriptomic datasets, that use sequencing technologies to measure gene expression throughout the day. These datasets are both expensive and time-consuming for laboratory scientists to generate. Consequently, our knowledge of how genes are controlled and regulated in a circadian clock is, to date, limited.
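To make the idea of rhythmicity concrete, here is a minimal sketch that fits a fixed 24-hour cosine to an expression time series and reports its amplitude and peak time. This is a classical cosinor-style estimate, not the ML method from the paper. The function name, the evenly spaced sampling, and the synthetic expression values are all assumptions made for illustration.

```python
import math

def cosinor_fit(times_h, expression, period_h=24.0):
    """Estimate mean, amplitude, and peak time of a fixed-period cosine.

    Assumes evenly spaced samples covering a whole number of periods, so the
    cosine and sine projections are orthogonal (a simplifying assumption).
    """
    n = len(times_h)
    mean = sum(expression) / n
    omega = 2.0 * math.pi / period_h
    # Project the mean-centred signal onto cosine and sine components.
    a = 2.0 / n * sum((y - mean) * math.cos(omega * t)
                      for t, y in zip(times_h, expression))
    b = 2.0 / n * sum((y - mean) * math.sin(omega * t)
                      for t, y in zip(times_h, expression))
    amplitude = math.hypot(a, b)
    peak_time = (math.atan2(b, a) / omega) % period_h
    return mean, amplitude, peak_time

# Synthetic (invented) gene: baseline 5, amplitude 2, peak 6 hours into the cycle,
# sampled every 4 hours for 48 hours.
times = list(range(0, 48, 4))
expr = [5.0 + 2.0 * math.cos(2.0 * math.pi * (t - 6.0) / 24.0) for t in times]
mean, amplitude, peak_time = cosinor_fit(times, expr)
```

A transcript with a large fitted amplitude relative to its noise would be a candidate for circadian rhythmicity. The sketch also shows why dense sampling matters: with only a few timepoints, the fit has little data to separate a real oscillation from noise.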

Arabidopsis thaliana

Our research specifically involves applying ML to predict complex temporal circadian gene expression patterns in Arabidopsis thaliana, a small flowering weed. Arabidopsis thaliana is a popular scientific model organism used to study plant biology and genetics. It was the first plant to have its genome sequenced, and biologists and geneticists now use that genome to aid our understanding of the molecular biology and genetics of many plant traits, including circadian regulation.

IBM’s Dr. Laura-Jayne Gardiner, an AI and informatics for life sciences scientist, and Prof. Anthony Hall, head of Plant Genomics, Earlham Institute, in a field of Arabidopsis thaliana.

Taking newly generated datasets, published temporal datasets and Arabidopsis genomes, we trained ML models to make predictions about circadian gene regulation and expression patterns. Our ML models classify circadian expression patterns using progressively fewer transcriptomic timepoints, with improved accuracy compared to the existing state-of-the-art models.

We also introduced model interpretation to quantify the best transcriptomic timepoints for sampling. We believe that predictive insight on when to sample will be a valuable reference for experimental biologists when planning experiments. Next, we redefined the field by developing ML models that distinguish circadian transcripts without using any transcriptomic timepoint information, relying instead on DNA sequence features generated from public genomic resources. This allows us to predict the circadian regulation of genes simply by analyzing the genome sequence.

Our decision to do this was based on the theory that a major mechanism of gene expression control, be it circadian or other mechanisms, is through transcription factors (and other factors) that bind to regulatory DNA sequences. Transcription factors are vital molecules that can control gene expression, directing when, where and to what degree genes are expressed. They bind to specific sequences of DNA and control the transcription of DNA into mRNA.
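One simple way to turn a regulatory DNA sequence into numeric features is to count short subsequences (k-mers), since binding sites are themselves short sequence patterns. The sketch below is only an assumption about what such features could look like; the function name, the choice of k, and the toy sequence are illustrative and not taken from the paper.

```python
from collections import Counter

def kmer_counts(sequence, k=6):
    """Count every overlapping k-mer made only of A, C, G, T in a sequence."""
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts

# Toy (invented) promoter fragment in which one 9-base motif occurs twice.
toy_promoter = "AAAATATCTAAAATATCT"
features = kmer_counts(toy_promoter, k=9)
```

Each distinct k-mer becomes one feature column, so a model trained on these counts can only learn from sequence content, with no expression timepoints involved.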

Explainable AI

What makes our models more informative is our use of explainable AI algorithms. We used the interpretation of our ML models to illuminate what’s inside the “black box,” so that we can better understand the predictions they make. We used a transcript-specific local model explanation to rank DNA sequence features, which provides a detailed profile of the potential circadian regulatory mechanisms for each transcript.

Using the local explanation derived from ranked DNA sequence features allows us to distinguish the temporal phase of transcript expression and in doing so, reveal hidden sub-classes within the circadian class, e.g., whether a transcript is likely to show its peak expression in the day, or night.
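As a rough illustration of what a local, per-transcript explanation looks like, the sketch below ranks features by their contribution to a single prediction. It does not call the SHAP software. Instead it uses the closed form that applies to a linear model with independent features, where each feature's contribution is its weight times its deviation from the background mean. The weights, feature names, and values are hypothetical.

```python
def linear_local_explanation(weights, background_means, features):
    """Rank features by absolute contribution to one prediction.

    For a linear model with independent features, the contribution of
    feature i is weight_i * (x_i - mean_i).
    """
    contributions = {
        name: weights[name] * (features[name] - background_means[name])
        for name in weights
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical model weights and k-mer counts for a single transcript.
weights = {"kmer_A": 0.8, "kmer_B": -0.5, "kmer_C": 0.1}
background_means = {"kmer_A": 1.0, "kmer_B": 2.0, "kmer_C": 0.5}
transcript = {"kmer_A": 3.0, "kmer_B": 0.0, "kmer_C": 0.5}

ranking = linear_local_explanation(weights, background_means, transcript)
```

Because the ranking is computed per transcript, two transcripts in the same predicted class can have very different top features, which is what makes it possible to look for sub-classes among them.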

Model interpretation and explanation provide the backbone of our methodological advances because they give insight into biological processes and experimental design. This approach can optimize sampling strategies when we predict circadian transcripts using reduced numbers of transcriptomic timepoints.
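The sampling-strategy idea can be sketched as a small search: score every combination of a given number of timepoints, keep the best few, and count how often each timepoint appears among them. In the sketch below, the scoring function is a deliberately simple stand-in for a trained classifier's accuracy, and the 4-hour spacing between 24 and 68 hours is an assumption. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def timepoint_popularity(timepoints, subset_size, score_fn, top_k=10):
    """Count how often each timepoint appears in the top_k scoring subsets."""
    scored = [(score_fn(combo), combo)
              for combo in combinations(timepoints, subset_size)]
    # Highest score first; ties keep their enumeration order (stable sort).
    scored.sort(key=lambda item: item[0], reverse=True)
    counts = Counter()
    for _, combo in scored[:top_k]:
        counts.update(combo)
    return counts

def toy_spread_score(combo):
    """Stand-in for model accuracy: rewards widely spaced sampling times."""
    ordered = sorted(combo)
    return min(later - earlier for earlier, later in zip(ordered, ordered[1:]))

# Assumed sampling times in hours: 24, 28, ..., 68 (12 timepoints).
timepoints = list(range(24, 72, 4))
popularity = timepoint_popularity(timepoints, subset_size=3,
                                  score_fn=toy_spread_score)
```

In a real analysis, `score_fn` would retrain or re-evaluate the classifier on each subset, which is far more expensive than this toy score, so the exhaustive search shown here may need to be replaced by a sampled or greedy search.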

Finally, our models predict the circadian time from a single transcriptomic timepoint. This identifies novel marker transcripts that are most impactful for accurate predictions, which could facilitate the identification of altered circadian clock function from existing datasets. These applications of explainable AI could redefine how we reuse public data and how we generate testable hypotheses to understand gene expression control.
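One practical detail when predicting a time of day is that the clock wraps around, so a prediction of 23 hours for a true time of 1 hour is off by 2 hours, not 22. The source does not say how prediction error was measured; the helper below is a minimal sketch of one reasonable choice, with a hypothetical function name.

```python
def circular_error_h(predicted_h, true_h, period_h=24.0):
    """Smallest distance in hours between two times on a repeating clock."""
    diff = abs(predicted_h - true_h) % period_h
    return min(diff, period_h - diff)

error = circular_error_h(23.0, 1.0)  # 2.0 hours, not 22.0
```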

This chart shows the combinations of timepoints that gave the highest accuracy as we reduce the number of timepoints from 12 to 3. Counts 0-10 represent the number of times each timepoint appeared in the 10 most-accurate combinations. Higher numbers (in red) show a more “popular” sampling time to be selected for high accuracy. Labels N3-N11 show the number of timepoints. Labels avexp_24-avexp_68 show sampling times in hours.

For the example gene PHYA, a line plot of the gene’s expression across the best combination of timepoints in each reduced set, 12-3.

This dendrogram clusters known circadian transcripts according to their ML model explanations, i.e., the most important DNA sequences (model explanations calculated using the software called SHAP).

Beyond plants

Our research describes a series of AI- and ML-based approaches that have the potential to enable more cost-effective analysis and insight into circadian regulation and function. While we initially worked with the model plant Arabidopsis, where extensive genome resources allow experimental validation of findings, this approach has widespread implications for other complex or temporal gene expression patterns, as well as other Arabidopsis ecotypes—some of which we have tested already. Furthermore, in our published work we adapted our ML approach for wheat to show that our methods allow accurate analysis of key food crops.

Our ML models and their application in crops, where circadian rhythms are critical to maintaining healthy growth and development, could potentially lead to increased yields as agricultural scientists and farmers begin to use the model to understand the inner rhythms of the plants they grow and harvest.

But the technology we developed goes beyond the scope of plants. We are now looking at different species to investigate the circadian clock and its link to disease. In humans, for example, dysregulation of the circadian clock has been associated with a range of diseases from depression to cancer.

References

  1. Gardiner, L., Rusholme-Pilcher, R., Colmer, J., et al. Interpreting machine learning models to investigate circadian regulation and facilitate exploration of clock function. PNAS. August 10, 2021 118 (32) e2103070118