Using AI To Make Sense Of The Human Genome

Whilst sequencing the first human genome cost billions of dollars and took years, it’s now increasingly possible to sequence your DNA for around $1,000. The amount of data available has skyrocketed, but our ability to derive insights from it has lagged behind. As a result, the genetic revolution we hoped for when the genome was first sequenced has failed to materialize.

It’s a problem that Google are tackling head on via a new tool, called DeepVariant, which utilizes AI to try and develop a better understanding of our genome. The system aims to autonomously identify mutations in the sequencing data, and especially to distinguish them from random sequencing errors. It’s a task that trips up scientists, but one for which machine learning is well suited.
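To make the task concrete, here is a minimal sketch of how a variant might be told apart from a random sequencing error at a single position in the genome. It is not DeepVariant’s method, which relies on a trained neural network. It is a much simpler binomial likelihood model of the kind used in classical statistical callers, and the function names, the genotype labels and the 1% error rate are all assumptions made for illustration.

```python
import math

def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Log-likelihood of each diploid genotype given read counts at one site.

    ref_count / alt_count are the numbers of reads supporting the reference
    and the alternate base. error_rate is an assumed per-read error rate.
    """
    # Probability that a single read shows the alternate base under each genotype.
    p_alt = {
        "hom_ref": error_rate,        # alternate reads can only be errors
        "het": 0.5,                   # half the reads come from the alternate allele
        "hom_alt": 1.0 - error_rate,  # reference reads can only be errors
    }
    n = ref_count + alt_count
    log_choose = math.log(math.comb(n, alt_count))
    return {
        genotype: log_choose
        + alt_count * math.log(p)
        + ref_count * math.log(1.0 - p)
        for genotype, p in p_alt.items()
    }

def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Return the genotype label with the highest likelihood."""
    scores = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(scores, key=scores.get)

# One stray alternate read out of 30 is best explained as an error,
# whereas a roughly even split points to a real heterozygous variant.
print(call_genotype(ref_count=29, alt_count=1))
print(call_genotype(ref_count=14, alt_count=16))
```

A model this simple ignores base quality, mapping quality and the surrounding sequence context, which is exactly the kind of information a learned model can exploit.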

It’s part of a growing range of tools that exist to make sense of genetic data, with VarDict and GATK among the leading tools in the space. It’s perhaps fair to say, however, that DeepVariant will be the most sophisticated of the bunch.

The project has been spun out of the Google Brain and Verily initiatives, both of which use AI to make sense of the rapid expansion of medical data. The team took genome sequences from the Genome in a Bottle (GIAB) project and used them to train their AI algorithm until it was capable of interpreting data with a high degree of accuracy.

Big data

It’s the kind of project that Google have form in. Earlier this year their Verily arm launched a new venture to apply big data analytics to healthcare like never before.

The venture, known as the Baseline Project, is aiming to recruit 10,000 people to participate in a multi-year study into finding predictors for heart disease and cancer. Participants will subject themselves to extensive monitoring and testing via a study watch that will record their activity levels in real-time. In addition to the readings from the watch, participants will undergo x-rays and heart scans, and will have their genome mapped and blood tested at regular intervals over a four-year period.

“No one has done this kind of deep dive on so many individuals. This depth has never been attempted,” the team say. “It’s to enable generations to come to mine it, to ask questions, without presupposing what the questions are.”

I’ve written a number of times about the growing role of data in healthcare research, and Google are developing the infrastructure to support and capitalize on this. The study will try and capture as much information as possible, with participants volunteering stool, saliva and even tear samples in a project that is likely to cost upwards of $100 million.

Genomic insight

They aren’t the only ones using such an approach to try and give us a greater understanding of genetic data, however. Last year the University of California San Diego released a new search engine that aims to make it easier to search genomic data records.

The search engine, called GeNemo, has been documented in a recently published paper, and is designed specifically for searching functional genomic data.

Functional genomics data is valuable as it helps to record the range of activities of each piece of the genome. The new search engine hopes to help researchers uncover the various functional aspects of certain parts of the genome that we believe are responsible for disease.

The search engine allows users to query a range of databases, including the entire ENCODE dataset. The search algorithm utilizes pattern matching to offer richer results than traditional text-based searches.

Swiss startup Sophia Genetics are arguably the market leaders in this space. They claim to have the largest clinical genomics community in the world, with an AI-powered platform to help make sense of the genetic data collected.

The company, which recently raised $30 million in a funding round led by Balderton Capital, have deployed their platform in 334 hospitals across 53 countries. To date they’ve analyzed the genomic data of over 125,000 patients from around the world.

Privacy concerns

One of the appealing aspects of the Sophia approach is that they only process the anonymized data collected by the hospitals themselves. It’s something that Verily don’t do with their Baseline Project, with ownership of the data sitting squarely with Google themselves.

A recent paper published in PLOS Biology by a pair of health law researchers from the University of Alberta argues that the whole industry currently lacks basic legal and ethical principles around consent, with the problem only likely to intensify as more genomic data is generated.

With projects such as the UK Biobank, researchers can embark upon projects with hundreds of thousands of participants. Issues around the ownership of those samples, and the consent given by participants for their use, persist however. The authors contend that we need real policy movement in the area to cover these concerns, especially as industry is getting increasingly involved.

“The international research community has built a massive and diverse research infrastructure on a foundation that has the potential to collapse, in bits or altogether. This issue would benefit from more explicit recognition of the vast disconnect between the current practices and the realities of the law, research ethics and public perceptions,” they say.

It’s a topic that was explored in some depth in a recent report from Professor Dame Sally Davies on the current state of genomic service provision in NHS England.

The report examines the potential for genomics to significantly improve the health of the nation. It provides clear evidence of its potential in areas such as screening, disease diagnoses and personalized prevention services.

The report goes on to highlight some serious shortfalls in areas such as infrastructure, public engagement, organization of research and the provision of services, before providing clear recommendations on how each of these gaps can be addressed and access to genomic services widened.

It’s clear that this is an area undergoing some pretty rapid changes, and as such will be one that demands attention in the coming years.