At Enveda, we love a hackathon. They allow us to engage intensely with a particular question, with curiosity and creativity. In a recent hackathon, members of our data science team asked a tantalizing question: how do you estimate an unknown?
One thing that unites Envedians is our belief that much of nature is fundamentally unknown. “Science has made incredible progress in characterizing biology,” says Viswa Colluru, CEO and founder of Enveda. “But we have only just scratched the surface of what nature can teach us.”
One of the unknowns of biology that is common fodder for speculation at Enveda is the size of life’s chemical space. As we say, life is chemistry. Every biological process is driven by a series of chemical reactions. Science knows a lot about the machinery (proteins and RNA) that catalyzes these reactions, and the instructions for that machinery (DNA), but we know shockingly little about the molecules themselves that participate in and are transformed by these reactions. We don’t even have a solid estimate of how many different kinds of these molecules, known as metabolites, exist across the tree of life. “It’s like knowing everything about a factory, except what it makes,” says Colluru.
There are two types of metabolites, primary and secondary. Primary metabolites are the basic building blocks of life, involved in central metabolism, replicating DNA, synthesizing proteins, and other highly conserved processes. Secondary metabolites are much more specialized. A secondary metabolite could make a plant resistant to a pathogen, attractive to a pollinator, or symbiotic with a bacterium. Many of humanity’s most powerful medicines, such as aspirin and warfarin, are derived from plant secondary metabolites. But how many of these metabolites exist is unknown.
“When we looked into the literature for estimates of the number of different plant secondary metabolites, we were pretty surprised to find that there really wasn’t a solid, evidence-backed number,” said Chloe Engler Hart, an ML scientist at Enveda. “Many people cite a paper estimating about 200,000 metabolites, but this number was estimated based on known metabolites 20 years ago and was much smaller than we expected, especially given that there are estimated to be over 400,000 species of plants.”
“We got really interested in this question because it gets to the core of Enveda’s mission. We’re surveying plant secondary metabolites, looking for the molecules with the greatest potential to inspire new medicines. If there are only 200,000 secondary metabolites in plants, then that’s not a very big pool. However, if our intuition is right and there are many more metabolites than previously estimated, then that increases the likelihood that we’ll find lots of different therapeutic molecules for a huge range of conditions,” she said.
“One of the reasons we do hackathons is to blow off scientific steam,” said August Allen, CTO of Enveda. “We often have these questions, like the one posed by Chloe and her team, that are really intriguing but not necessarily mission critical. Hackathons give us the opportunity to flex our skills and chase down the questions that really excite us.”
“We were really excited about estimating the size of plant chemical space and thought that with the five of us on the team we could make significant progress in the week allotted. But what really made this work possible was a newly published dataset that contained data from one thousand different plants,” Chloe said.
This dataset, called ENPKG, contains the same type of data that Enveda uses to discover and characterize metabolites: mass spectrometry. This data type can be used to identify molecules and their structures, but the analysis is technically challenging and there is no single standard way to do it. “We used four different methods for analyzing the mass spec data to get a likely range of metabolites per plant,” Chloe explained.
“We then fit that data to a model and used it to project how many metabolites there are across the 400,000 species of plants,” she said. “We also supplemented these studies with work on two of the biggest literature databases of plant metabolites.”
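For readers curious what "fit a model and project" can look like in practice, here is a minimal sketch. It is not the team's method: the article does not say which model was used, so the power-law accumulation curve below, the function names `fit_power_law` and `project`, and every number in the example are assumptions chosen purely for illustration. The idea is to record how the cumulative count of unique metabolites grows as more species are sampled, fit a curve to that growth, and extrapolate to roughly 400,000 species.

```python
import math


def fit_power_law(species_counts, cumulative_metabolites):
    """Least-squares fit of N(s) = a * s**b, done as a straight line in log-log space.

    ASSUMPTION: a power law is only one plausible accumulation model;
    the source does not state which model the team actually fit.
    """
    xs = [math.log(s) for s in species_counts]
    ys = [math.log(n) for n in cumulative_metabolites]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


def project(a, b, n_species):
    """Extrapolate the fitted curve to a larger number of species."""
    return a * n_species ** b


# PLACEHOLDER inputs: synthetic values generated from an exact power law,
# used only to exercise the code. They are not measurements from ENPKG
# or from the study.
species = [10, 50, 100, 500, 1000]
cumulative = [2000 * s ** 0.6 for s in species]

a, b = fit_power_law(species, cumulative)
estimate = project(a, b, 400_000)
```

In a real analysis, the fit would be repeated for each of the four mass spec methods to give a range rather than a single number, and extrapolating from one thousand sampled plants to hundreds of thousands of species is highly sensitive to the choice of model.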
“What we found is that there are very likely millions, if not tens of millions, of unique metabolites in plants, confirming our hypothesis that plant chemical space is far more vast than previously appreciated,” Chloe said.
“Of course, there are caveats to our study,” Chloe explains. “Each of the four methods we used to analyze the mass spec data has its own drawbacks, plus the samples from the one thousand plants were not all processed in exactly the same way.” The results of this hackathon work are posted in a preprint on bioRxiv, and Figure 2 lays out some of the main limitations and how they may impact the final estimate.
“Our work is a solid start to understanding the scope of plant chemical space, but to really answer the question we would need far more data. In particular, we would need a broader diversity of plants, and we would need samples from the roots, stems, and leaves, as these produce different secondary metabolites. Additionally, we also know that metabolite production can vary with time of day, season, and other biotic and abiotic factors,” she said.
“What makes me excited about the output of this hackathon is that it uses the best evidence we currently have to show that we have only just dipped our toes into the ocean of plant chemistry,” says Colluru. “There is still so much to explore and the platform we have built is the key to scaling this exploration so that we can turn a biological unknown into a medicine that benefits patients. It is both science for science’s sake and science for the patient’s sake.”