In the second of our three-part special on the changing attitudes and approaches of publishers, reviewers and scientists to scientific publishing, we learn how medical research charity LifeArc and science software specialists SciBite are planning to use a pioneering machine-learning approach to shorten the ‘data mining’ process…
At every stage of a researcher’s career, literature reviews are an integral part of scientific investigation. They help to ensure researchers are aware of the current landscape, latest findings and newest technologies in their field.
To keep ahead of the game, many researchers now invest time in ‘horizon scanning’: the systematic examination of potential opportunities and developments at the forefront of current thinking and planning across their respective disciplines. However, the quantity of publications and information available is vast and still growing.
Data mining and horizon scanning by conventional methods are becoming progressively more difficult and time-consuming for researchers, which can lead to delayed discoveries. In otherwise fast-paced industries, this lengthy process is coming under scrutiny. One area where such delays could have significant repercussions is the life science industry, where innovation and progress are key to ensuring medicines and technologies continue to evolve to meet patient needs.
Beating the noise
At medical research charity LifeArc, researchers experienced this challenge first hand. This led to a collaboration with scientific data experts SciBite to develop a machine learning technology with the potential to accelerate their research.
The team at LifeArc is constantly on the lookout for early insights into novel technologies, new drug targets, biomarkers and rare disease connections. Analysts can spend multiple hours per day sifting through publications on PubMed, as well as publicly available grant information and a range of biotech-focussed news websites, in an attempt to identify articles of interest amongst the background ‘noise’. Manually searching different sources with multiple keywords or phrases is resource intensive, which means the initial stages of research can only be performed at a restricted frequency and with a limited depth of review.
“We knew this process was inefficient and we were keen to find an alternative way to triage publications and data in order to free up our analysts’ time and allow them to focus on the most important and relevant findings,” explains Ben Cryar, Senior Analyst, Opportunity Assessment Group, LifeArc. “Ultimately we want to streamline the research process so that we can focus on the next steps in identifying the most exciting new discoveries.”
Together, LifeArc and SciBite set up the pilot study with two key aims in mind:
- To align multiple unstructured scientific information sources including publications, news feeds and clinical trial records and create a richly annotated index of connected data.
- To search and analyse data to identify research findings which could inform novel drug, diagnostic and medical technology discoveries.
SciBite’s solution is semi-automated software that combines several artificial intelligence approaches, including semantic searching and machine learning, to sift through tens of millions of documents and identify genes, diseases, devices and many more scientific concepts. Working together with LifeArc, SciBite manually curated a library comprising tens of millions of synonyms tailored specifically to LifeArc’s internal vocabularies, such as compound identities and study codes. This provided the foundation for automated pre-processing and what they call ‘semantic enrichment’. It was these two steps that allowed the team to attain the high-quality, contextualised data necessary for machine learning to be effective.
“The technology is designed to reduce the time spent mining data by up to 80%, providing researchers with a subset of scientifically relevant information filtered from the vast amounts of raw data in a rapid, easy-to-interpret manner, allowing them to focus and accelerate their research,” says Neal Dunkinson, Head of Technical Sales at SciBite. “We want more and more of the hidden knowledge in scientific content to be unlocked by simple services provided by our platform, helping application developers and informatics professionals build even more intelligent systems.”
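To give a feel for what dictionary-driven ‘semantic enrichment’ involves, here is a minimal sketch in Python. It is not SciBite’s software, whose internals the article does not describe. It assumes a tiny, hand-written synonym library (the `SYNONYMS` table and the `annotate` function are hypothetical names invented for this illustration) and simply tags every synonym it finds in a piece of text with an entity type and a canonical label.

```python
import re

# Hypothetical synonym library: lower-cased surface form -> (entity type, canonical label).
# A real curated library would hold millions of entries; these four are illustrative.
SYNONYMS = {
    "cystic fibrosis": ("DISEASE", "cystic fibrosis"),
    "mucoviscidosis": ("DISEASE", "cystic fibrosis"),
    "cftr": ("GENE", "CFTR"),
    "abcc7": ("GENE", "CFTR"),
}


def build_pattern(synonyms):
    """Compile one case-insensitive regex matching any synonym as a whole word."""
    # Longest terms first, so multi-word synonyms win over shorter overlapping ones.
    terms = sorted(synonyms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def annotate(text, synonyms=SYNONYMS):
    """Return one annotation per synonym occurrence, in order of appearance."""
    pattern = build_pattern(synonyms)
    annotations = []
    for match in pattern.finditer(text):
        entity_type, canonical = synonyms[match.group(1).lower()]
        annotations.append({
            "start": match.start(),
            "end": match.end(),
            "text": match.group(1),
            "type": entity_type,
            "id": canonical,
        })
    return annotations


if __name__ == "__main__":
    sentence = "Mucoviscidosis is caused by mutations in ABCC7."
    for annotation in annotate(sentence):
        print(annotation)
```

Because both synonyms of a concept map to the same canonical label, documents that use different names for the same gene or disease end up linked in the index, which is the property a downstream machine learning step would rely on.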
The team say the technology will improve the ability of researchers to identify emerging ‘mega trends’ in medical research, with the ultimate aim of fast-tracking discoveries in the healthcare space.
Ben Cryar, Senior Analyst at LifeArc, again: “Going forward, using this technology at LifeArc, we hope to identify breakthrough discoveries and invest in those technologies with the greatest potential to have a positive impact on patients’ lives.”
So does it actually work?
The proof, as always, is in the pudding. SciBite are keen to show what their clever algorithms can do and think one area that could yield insights hidden in plain sight, as it were, is the study of rare diseases.
Rare diseases affect approximately one in twenty people worldwide and treatment options are often limited, with many people remaining undiagnosed for many years or their entire lives. In the UK around 3.5 million people are affected and, of the 7000 recognised rare diseases, only 400 have licensed treatments. These diseases are often chronic, life-threatening and isolating for the sufferers and their families.
In the past two decades, rare diseases have attracted increasing attention and national government support. With the availability of genomic profiling, the number of people receiving rare disease diagnoses is on the rise. However, characterisation of rare diseases at the molecular level remains a challenge, as research in this area is limited.
The fact that some rare diseases share similar phenotypes with common, well-understood conditions forms the basis of an inference-led approach to understanding them. However, evidence of disease similarity is often hidden within unstructured biomedical literature, and identifying relevant links by hand would require a time-consuming and costly review process.
The technology developed by SciBite includes a method which quantifies disease similarities identified within biomedical literature based on their phenotypes. As a first step, semantic analytics is used to extract co-occurring pairs of conditions and clinical signs from over 25 million MEDLINE abstracts. Machine learning algorithms are then used to rank these relationships and predict how scientifically significant they are, for example based on how often the terms co-occur compared to how often they appear independently. The resulting information is subsequently used to create a knowledge graph representing the strength of connectivity between diseases based on shared phenotypes or ‘phenotype signatures’. Where there is strong overlap in phenotype signatures, it can be hypothesised that a disease pair shares an underlying mechanistic relationship, and this hypothesis can be used to classify poorly characterised diseases.
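The steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SciBite’s method: the article does not name the ranking algorithm, so the sketch substitutes pointwise mutual information (PMI), a standard statistic that compares how often two terms co-occur with how often they appear independently, and it uses Jaccard similarity as the measure of signature overlap. The abstracts, disease names and function names are all invented for the example.

```python
import math
from collections import Counter
from itertools import combinations

# Invented toy data: each abstract reduced to the diseases and clinical signs tagged in it.
ABSTRACTS = [
    {"diseases": {"disease_A"}, "signs": {"seizures", "ataxia"}},
    {"diseases": {"disease_A"}, "signs": {"seizures", "hypotonia"}},
    {"diseases": {"disease_B"}, "signs": {"seizures", "ataxia"}},
    {"diseases": {"disease_B"}, "signs": {"ataxia"}},
    {"diseases": {"disease_C"}, "signs": {"rash"}},
    {"diseases": {"disease_C"}, "signs": {"rash", "fever"}},
    {"diseases": {"disease_C"}, "signs": {"rash"}},
    {"diseases": {"disease_C"}, "signs": {"fever"}},
]


def pmi_scores(abstracts):
    """Score each (disease, sign) pair by pointwise mutual information.

    PMI = log2( P(disease, sign) / (P(disease) * P(sign)) ), with probabilities
    estimated as the fraction of abstracts mentioning the term(s).
    """
    n = len(abstracts)
    disease_counts, sign_counts, pair_counts = Counter(), Counter(), Counter()
    for abstract in abstracts:
        disease_counts.update(abstract["diseases"])
        sign_counts.update(abstract["signs"])
        pair_counts.update(
            (d, s) for d in abstract["diseases"] for s in abstract["signs"]
        )
    return {
        (d, s): math.log2((count * n) / (disease_counts[d] * sign_counts[s]))
        for (d, s), count in pair_counts.items()
    }


def phenotype_signatures(scores, threshold=0.0):
    """Keep, for each disease, the signs whose PMI exceeds the (assumed) threshold."""
    signatures = {}
    for (disease, sign), score in scores.items():
        if score > threshold:
            signatures.setdefault(disease, set()).add(sign)
    return signatures


def signature_overlap(signatures):
    """Build weighted graph edges: Jaccard similarity of two diseases' signatures."""
    edges = {}
    for d1, d2 in combinations(sorted(signatures), 2):
        shared = signatures[d1] & signatures[d2]
        if shared:
            edges[(d1, d2)] = len(shared) / len(signatures[d1] | signatures[d2])
    return edges


if __name__ == "__main__":
    scores = pmi_scores(ABSTRACTS)
    graph = signature_overlap(phenotype_signatures(scores))
    for pair, weight in graph.items():
        print(pair, round(weight, 3))
```

In this toy corpus, disease_A and disease_B end up connected because their signatures share seizures and ataxia, while disease_C stays isolated. A production system would work from millions of abstracts and would need to guard against the well-known tendency of PMI to overrate pairs seen only once or twice.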
The approach can be extended even further by including additional data sources, such as gene association data and protein-protein interaction data, to go beyond phenotypes and classify diseases based on richer signatures comprising genomic, proteomic and phenotypic information.
The team at SciBite say this showcases just how useful this machine learning technology can be at the research stage, ultimately working towards life science discoveries to improve patients’ lives and care.
LifeArc is a medical research charity that pioneers new ways to turn great science into greater patient impact. They have a 25-year legacy of helping scientists and organisations advance their research into therapeutics and diagnostics for patients, including Keytruda (cancer), Actemra (rheumatoid arthritis), Tysabri (multiple sclerosis) and Entyvio (Crohn’s disease), and a test for antimicrobial resistance. Find out more at Lifearc.org
To learn more about SciBite and its solutions please contact email@example.com or visit http://www.scibite.com/