With apps that empower drug discovery scientists to optimise models for their lead compounds right at their desks, Predictive Analytics is no longer the preserve of the select few and is on the cusp of becoming readily accessible. Here, Sean McGee welcomes us to the era of on-demand analytics…
Today it appears as though Predictive Analytics (PA) – employing data to build statistical models yielding predictions that help make decisions – has thoroughly infiltrated every aspect of modern business, from investment banking projections to environmental toxicology studies to ice cream recipes.
PA has enabled teams in almost every industry to leverage existing knowledge for data-driven decision making that can optimise their products, processes, and even their overall business. The utility and staying power of PA are difficult to deny. However, until recently its technical complexities limited adoption to the select few who could afford a data scientist. This exclusive environment has now reached a tipping point, as new technologies have catalysed the democratisation of PA and created a new wave of “on demand” analytics. These applications can guide even novice users to quickly and easily develop compelling models that provide a data-based foundation for decisions, and many industries are capitalising on the resulting benefits.
The declining productivity of biopharmaceutical R&D is pressuring many organisations to seek out new paths to efficient innovation. For years cheminformatics scientists and medicinal chemists have relied on quantitative structure-activity relationship (QSAR) models to identify promising candidates and drive lead optimisation, narrowing a large field of compounds down to one. This, coupled with other technological advances in biopharmaceutical research such as next-generation sequencing or “lab-on-a-chip” microfluidics devices, has drastically improved the early discovery process in biopharmaceuticals. However, industry pressure for even faster, leaner innovation has extended traditional, exclusive PA beyond its useful limit. As a result, many scientists working in early biopharmaceutical discovery and research have begun to turn to “on demand” analytics to spur novel discoveries.
To implement this new approach throughout early biopharmaceutical research effectively, teams need to democratise the role of data scientists. It is important to note that this approach neither diminishes data scientists’ utility nor changes their role; instead, it gives non-expert users the ability to tailor their models to better support individual decisions. The key to this distinction lies in the conversion of global models into local ones.
The dichotomy between global and local models hinges on the fact that a given model cannot be effectively applied in every possible situation. This is due to the applicability domain inherent in every model. In the case of cheminformatics this refers to “the physico-chemical, structural or biological space, knowledge or information on which the training set of the model has been developed, and for which it is applicable to make predictions for new compounds.”¹
For example, a model designed for a collection of different fullerenes would be unlikely to predict as accurately for a set of beta-lactam drugs, since the two groups are too chemically dissimilar. Because global models are designed to work across an organisation’s entire compound portfolio, a collection of hundreds of thousands of molecules, they offer wide strategic benefit: they are applicable to a broad range of discovery projects. However, this also means many end users will not get the best predictions for their individual work, as they manage a much smaller set of compounds which may or may not lie within the model’s effective applicability domain. As a result, these end users require more specialised local models, optimised for their own compounds, to fully leverage the benefits of PA. This need is the crux of the problem for today’s exclusive PA.
Currently, it can take an inexperienced scientist weeks to develop a new QSAR model from scratch. This is due to the inherent technical complexities in designing the model, “teaching” it to the computer and validating its predictive capabilities. Designing the model involves selecting its inputs and outputs – its parameters, descriptors, and independent and dependent variables. Next, the scientist needs to define and run a “training set” of pre-existing data through the model to uncover the trends that will provide the foundation for future predictions, effectively “teaching” the computer. Finally, the scientist validates the model, determining whether it is accurate enough to provide compelling predictions and whether it is overfitting the data, that is, describing random error or noise rather than the actual relationships between its variables.
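The three steps above (design, train, validate) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the descriptor data are synthetic, ordinary least squares stands in for whatever modelling method a team actually uses, and the function names are hypothetical. The point is only the shape of the workflow: fit on a training set, then compare the fit on held-out compounds. A training score far above the held-out score is one common sign of overfitting.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns [intercept, coefficients...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients to a descriptor matrix."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def train_and_validate(X, y, test_fraction=0.25, seed=0):
    """Hold out a random subset, fit on the rest, score both subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    coef = fit_linear(X[train_idx], y[train_idx])
    r2_train = r_squared(y[train_idx], predict_linear(coef, X[train_idx]))
    r2_test = r_squared(y[test_idx], predict_linear(coef, X[test_idx]))
    return coef, r2_train, r2_test

# Synthetic stand-in for a compound set: 80 compounds, 3 descriptors,
# and an "activity" that depends on the first two descriptors plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(80, 3))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.3, size=80)

coef, r2_train, r2_test = train_and_validate(X, y)
print(f"R^2 on training set: {r2_train:.3f}")
print(f"R^2 on held-out set: {r2_test:.3f}")
```

In practice a single hold-out split is the simplest possible check; cross-validation or an external test set would give a more robust picture.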
The validation step is critical, as the model’s end users must be able to trust the predictive outcomes of these models in their research. As a result, data scientists, who have the ability to develop predictive models much faster, are constantly facing a mountain of requests for project-centric local models instead of designing and maintaining global models which have much wider applicability and strategic value to the organisation at large. End users, on the other hand, must either wait in line for their local model to be developed by the data science team or struggle through the process themselves, often resulting in wasted time and a poorer quality model.
On demand analytics fills this gap. It facilitates the conversion of broadly applicable global models into local models, allowing end users to optimise them for their own set of compounds. These end users can now more effectively develop predictions in silico, and data scientists can now focus on building and publishing more valuable global models for the organisation at large rather than spending time on one-off projects. These analytics applications allow an end user to access a global model, such as solubility vs. COX-2 inhibition, and quickly make changes with predefined functions to shape the model’s applicability domain around his or her compound library. These predefined functions can help the scientist to determine how the results will be presented, such as in a heat map or a box-and-whisker plot, supporting faster analysis and decision making. These applications can also aggregate data from a variety of sources, allowing disparate variables to be compared side by side, unlocking the potential for further discovery. Previously, any one of these steps would have needed a team of data scientists to implement it on a case-by-case basis, but the technologies that constitute on demand analytics allow individual end users to carry out this work at their desks.
While democratising model optimisation is catalysing a fundamental shift in early pharmaceutical research, teams must validate their models to ensure high quality predictions. An on demand approach can work here too, building automated validation functions into the model’s design. For example, the model can generate warnings for out-of-bounds descriptor or principal component values and can flag structural features in test compounds not seen in the training set. By automatically ensuring that input data does not stray far from the original model’s applicability domain, this capability helps to maintain prediction quality. Other functions enable scientists to quickly build error models for their predictive models, ensuring that their results are not being influenced by overfitting. All of this frees up data scientists, allowing them to focus on creating and publishing new global models for widespread use.
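As a rough illustration of the automated checks described above, the sketch below flags descriptor values that fall outside the range seen in the training set and lists structural features absent from it. This is a simple "bounding box" view of the applicability domain, chosen here for brevity and not necessarily what any particular product does. The descriptor names, feature labels, numeric values and function names are all hypothetical.

```python
import numpy as np

def fit_descriptor_bounds(train_descriptors):
    """Record the per-descriptor minimum and maximum seen in the training set."""
    X = np.asarray(train_descriptors, dtype=float)
    return X.min(axis=0), X.max(axis=0)

def out_of_domain_warnings(compound, bounds, names):
    """Return one warning per descriptor that lies outside the training range."""
    lower, upper = bounds
    values = np.asarray(compound, dtype=float)
    warnings = []
    for name, value, low, high in zip(names, values, lower, upper):
        if value < low or value > high:
            warnings.append(
                f"{name}={value:g} outside training range [{low:g}, {high:g}]"
            )
    return warnings

def unseen_features(compound_features, training_features):
    """List structural features of a test compound never seen in training."""
    return sorted(set(compound_features) - set(training_features))

# Hypothetical training set: three descriptors per compound (illustrative values).
names = ["logP", "mol_weight", "h_bond_donors"]
training = [
    [1.2, 180.2, 1],
    [2.8, 310.4, 2],
    [3.5, 421.0, 0],
    [0.4, 250.7, 3],
]
bounds = fit_descriptor_bounds(training)

inside = out_of_domain_warnings([2.0, 300.0, 1], bounds, names)
outside = out_of_domain_warnings([2.0, 612.0, 1], bounds, names)
novel = unseen_features({"amide", "nitro"}, {"amide", "ester", "phenyl"})

print(inside)   # no warnings expected for a compound within all ranges
print(outside)  # one warning expected, for mol_weight
print(novel)    # features not present in the training set
```

A range check of this kind is cheap and easy to explain to end users, but it treats each descriptor independently; distance-based or principal-component-based checks can catch unusual combinations that fall inside every individual range.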
On demand analytics gives the power of PA to the people, adding laser precision to predictions and avoiding the crystal ball of an overly generalised model. In cheminformatics research, this rapidly gives scientists a deeper understanding of trends within their individual compound libraries. They can feel more confident in their predictions and optimise their lead compounds for their own parameters. Data scientists can spend their time developing more powerful models for use by the entire organisation rather than working on one-off projects. They can develop new predefined functions and build quality checkpoints into processes to ensure that end users get the most out of their local models. On demand analytics is more than just a set of technologies; it is creating a new mind-set and charting a new course for innovation and discovery in cheminformatics research.
Author: Sean McGee is Product Marketing Manager with Dassault Systèmes BIOVIA specialising in software supporting chemistry-focused research and development.
1. Jaworska J, Nikolova-Jeliazkova N, Aldenberg T. “QSAR Applicability Domain Estimation by Projection of the Training Set in Descriptor Space: A Review.” Alternatives to Laboratory Animals. Vol. 33. (2005): 445-459. Print.