Using Machine Learning To Mine Complex Datasets

It’s widely recognised that the rising power of AI is largely driven by the increase in data available to train the algorithms. In our ‘big data’ era, the quantity of data is seldom the issue; analyzing it successfully is often much harder.

New research from the Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab) and UC Berkeley uses machine learning to enable scientists to derive insights from incredibly complex datasets in record time.

“Take a human cell, for example. There are 10^170 possible molecular interactions in a single cell. That creates considerable computing challenges in searching for relationships,” the authors explain. “Our method enables the identification of interactions of high order at the same computational cost as main effects – even when those interactions are local with weak marginal effects.”

Unique requirements

The team highlight that machine learning projects in science have different requirements from those in other sectors. Whereas in some sectors it is acceptable not to understand how an algorithm reached its conclusion, in science it is not.

A detailed understanding of how and why something happens allows scientists to model the process and test whether it can be improved.  As such, explainability is crucial for machine learning when used in scientific projects.

This is especially difficult in complex systems, which generally involve a huge number of variables, including variables that behave in nonlinear ways. This makes building a model that shows cause and effect very difficult.

“Unfortunately, in biology, you come across interactions of order 30, 40, 60 all the time,” the authors explain. “It’s completely intractable with traditional approaches to statistical learning.”
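To give a sense of why high-order interactions are intractable to search exhaustively, here is a minimal sketch that counts the candidate feature subsets of a given size. The feature count of 1,000 is a hypothetical figure chosen purely for illustration, not a number from the research, and the function name is ours.

```python
from math import comb


def candidate_interactions(num_features: int, order: int) -> int:
    """Count the distinct feature subsets of size `order`.

    Each subset is one candidate interaction that an exhaustive
    search would have to evaluate.
    """
    return comb(num_features, order)


# Hypothetical feature count, chosen only for illustration.
NUM_FEATURES = 1000

for order in (2, 3, 30):
    count = candidate_interactions(NUM_FEATURES, order)
    # Report the order of magnitude rather than the full integer.
    print(f"order {order:>2}: roughly 10^{len(str(count)) - 1} candidate subsets")
```

The count grows so quickly with the order that checking every subset stops being feasible long before order 30, which is the problem the authors describe.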

Random forests

The team's method, iterative random forests (iRF), uses random forests to translate the internal state of the algorithm into a more human-readable interpretation. They believe their approach will allow researchers to safely search for complex interactions without incurring the huge computational costs usually involved in identifying them.

“There is no difference in the computational cost of detecting an interaction of order 30 versus an interaction of order two,” they say. “And that’s a sea change.”
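One way to picture how tree-based models can surface interactions of any order is to read the sets of features used along each root-to-leaf path. The sketch below is a toy illustration under our own assumptions, not the authors' procedure, whose details the article does not give. The `Node` class, the hand-built trees and the feature names A, B and C are all hypothetical.

```python
from collections import Counter
from typing import Iterator, List, Optional


class Node:
    """Toy decision-tree node. A leaf has feature=None."""

    def __init__(self, feature: Optional[str] = None,
                 left: "Optional[Node]" = None,
                 right: "Optional[Node]" = None) -> None:
        self.feature = feature
        self.left = left
        self.right = right


def path_feature_sets(node: Node,
                      seen: frozenset = frozenset()) -> Iterator[frozenset]:
    """Yield, for every leaf, the set of features split on along its path."""
    if node.feature is None:
        yield seen
        return
    used = seen | {node.feature}
    yield from path_feature_sets(node.left, used)
    yield from path_feature_sets(node.right, used)


def count_path_feature_sets(forest: List[Node]) -> Counter:
    """Count how often each feature set appears across all leaves."""
    counts: Counter = Counter()
    for tree in forest:
        counts.update(path_feature_sets(tree))
    return counts


# Two hand-built trees standing in for a fitted forest (hypothetical).
tree_a = Node("A", Node("B", Node(), Node()), Node())
tree_b = Node("B", Node(), Node("A", Node("C", Node(), Node()), Node()))

counts = count_path_feature_sets([tree_a, tree_b])
for features, n in counts.most_common():
    print(sorted(features), n)
```

In this toy version the work is one pass over the leaves, so a path that uses three features costs no more to record than one that uses two. Feature sets that recur across many trees would be the candidates worth following up.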

The algorithm was put through its paces on a couple of genomics problems, one involving the role of gene enhancers in fruit flies, and the other alternative splicing in a human-derived cell line.  In both experiments, the algorithm was able to confirm previous findings, whilst also discovering some higher-order interactions for the team to follow up on in subsequent work.

The team are now testing the algorithm on a number of problems in other domains, and are confident that their work represents a fundamental shift in how science can be performed.

“We do prediction, but we introduce stability on top of prediction in iRF to more reliably learn the underlying structure in the predictors,” they say.  “This enables us to learn how to engineer systems for goal-oriented optimization and more accurately targeted simulations and follow-up experiments.”