Pharmaceutical research and development has historically been shrouded in mystery: a secretive activity conducted behind closed doors to protect commercial advantage. So, as big data continues to transform the industry, must we remain so reluctant to share data? Katharine Briggs looks at the benefits, challenges and considerations surrounding the sharing of proprietary data.
One of the challenges in medical research is the scarcity of real-world data available to academic researchers and other interested parties seeking to develop new and improved drugs.
Data sharing does happen in the pharmaceutical industry, but it is not yet standard practice and remains the preserve of special projects. One such example is the ChEMBL database. Hosted by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), ChEMBL is a vast online database containing bioactivity data on more than 1.6 million drugs and drug-like small molecules and their targets. Originally developed as a private resource by a biotechnology firm, it was acquired by EMBL in 2008 and has become a valued public resource for virtual screening, drug design and product development.
Share and share alike
ChEMBL is used by academics and industries of all sizes, strengthening innovation from new research, and the discovery of new treatments and drugs benefiting human health and agriculture. In the Strategic Vision for UK e-infrastructure report, Professor Dominic Tildesley of Unilever identified the ChEMBL database as a crucial part of the company’s development of antiperspirants. Unilever used the database to identify active components for antiperspirants and the ChEMBL data to build a model of their inhibition activity. Similarly, chemists from agrochemicals business Syngenta use ChEMBL in their product development. Mark Forster from Syngenta says of the database: “ChEMBL has links between both chemistry and biology data which makes it searchable in ways that the underlying literature would not be. I’m sure life science research would be greatly hindered.”
Increased collaboration and dissemination of data is not only in the interest of public health but is also increasingly required by funding organisations. It is a vital part of achieving a reduction in animal testing. Aside from the ethical benefits, a reduction in animal testing also saves time and money, and the knowledge gained from sharing data could enable more informed decisions about what substances to test and what tests to perform. An initiative led by the NC3Rs and the MHRA, involving 32 organisations sharing data for 137 compounds and 259 studies, identified that the use of recovery animals could be reduced by up to 66%, saving thousands of animals globally each year.
A case for data sharing can also be made on the basis of the ethos of science described by Robert Merton, which states that scientific findings should be made available to the entire scientific community to allow other researchers to conduct their own analyses and verify the results. Independent replication of research findings is seen as the fundamental mechanism by which scientific evidence accumulates to support a hypothesis. The field of genomics is regarded as a leader in the development of infrastructure, resources and policies that promote data sharing, and this is cited as one of the main reasons for the rapid advance in genetic research compared to other areas of biomedicine.
Don’t be left out
A key obstacle to data collaboration is the perceived need within industry to protect proprietary information. However, organisations need to be clear about how much of a competitive advantage they will lose by sharing data versus the knowledge they will gain. How unique is the knowledge they hold versus the knowledge their competitors could bring to the table? Consideration should also be given to the risk of not taking part in data sharing, as those organisations that participate will have a competitive and economic advantage over those that do not.
Frustratingly, big data in pharma is often ‘locked’ inside PDFs sitting in individual company archives where it is unavailable even for internal analysis, so companies are often ‘protecting’ data they aren’t actually able to use themselves. Providing access to a larger pool of data can reveal patterns that are simply not visible in smaller component datasets, where such relationships may be represented by only one or two chemicals.
It is often the case that only regulatory bodies have ready access to pooled datasets from multiple companies and therefore the opportunity to identify these broader patterns by performing cross-company analyses. This can present problems when pharmaceutical businesses submit a new drug application as broader regulatory knowledge can lead to challenges and assertions that need to be addressed, resulting in delays and the need for additional data generation for the pharmaceutical company. Research data can be valuable many years after it has been generated and fresh eyes can reveal new insights beyond those originally identified. In addition, new research topics and fields are emerging between the boundaries of traditional disciplines. By sharing data, companies can gain from external expertise in the same or different fields, opening up the data to be explored and used in ways which may not have originally been envisioned.
Academics, small biotechs, SMEs (small and medium-sized enterprises) and contractors can be included as collaborators, broadening the skills and experience still further and creating relationships which can be built on in the future. There is also an opportunity to improve data quality, as providing access to other experts will help identify errors and inconsistencies, similar to the crowdsourcing model used by ChemSpider. As the costs of generating the data are also shared, it opens up the possibility for exploratory research that otherwise might not be commercially viable.
So how can pharmaceutical businesses overcome the challenges and concerns relating to data collaboration in order to reap the benefits? Regulations to protect the privacy of personal health information are often seen as potential barriers to data sharing due to the risk of accidental, malicious or compelled disclosure. However, data can still be shared as long as privacy safeguards are in place. Redacting data to strip out individual identifiers, statistically altering data in ways which do not compromise secondary analysis and placing restrictions on access to data are all simple steps that can be taken to secure it.
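As an illustration of the first two safeguards, the sketch below strips direct identifiers, replaces the subject identifier with a salted hash and coarsens exact ages into bands. It is a minimal example only: the field names, the salt handling and the ten-year band width are assumptions made for illustration, and steps like these do not by themselves guarantee that a dataset is anonymous.

```python
import hashlib

# Hypothetical field names; a real dataset would define its own list.
DIRECT_IDENTIFIERS = {"name", "address", "date_of_birth"}


def pseudonymise(subject_id: str, salt: str) -> str:
    """Replace a subject ID with a salted SHA-256 digest, truncated for readability."""
    digest = hashlib.sha256((salt + subject_id).encode("utf-8")).hexdigest()
    return digest[:12]


def age_band(age: int, width: int = 10) -> str:
    """Generalise an exact age into a band such as '40-49'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def redact(record: dict, salt: str) -> dict:
    """Return a copy of the record with identifiers removed or coarsened."""
    shared = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    shared["subject_id"] = pseudonymise(str(record["subject_id"]), salt)
    shared["age"] = age_band(int(record["age"]))
    return shared
```

The same salt always produces the same pseudonym, so records for one subject can still be linked for secondary analysis; the salt itself would need to stay with the data owner.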
A survey of 1329 scientists suggested that another concern amongst the pharmaceutical community was the idea that data could be misused. However, this risk can be mitigated by measures that are easily put in place: creating an End User Licence under which users must agree to certain conditions of use, including specific authorisation requirements from the data owner, and limiting access to certain users.
Data being stored in disparate repositories, in different formats and using potentially incompatible data types presents another significant technical challenge, but not one that is insurmountable, although the additional resource needed to convert the data to an agreed format will add to the costs of data sharing. It also makes sense to opt for platform-independent file formats for exporting and importing data, such as XML (Extensible Markup Language), CSV (comma-separated values) or SDF (structure-data file), which can be opened using several software applications. Using the same format for exporting and importing data does not, however, avoid differences in what data are captured or how those data are captured, e.g. as a number or as text. Here, data standards such as SEND (Standard for Exchange of Nonclinical Data) can ensure that the data being captured are compatible.
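To make the point about platform-independent formats concrete, here is a minimal sketch that writes the same records to CSV and XML and reads them back using only the Python standard library. The records, column names and XML element names are hypothetical and do not follow SEND or any other standard.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records; the column names are illustrative only.
FIELDS = ["compound_id", "assay", "value", "unit"]
RECORDS = [
    {"compound_id": "CMP-001", "assay": "hERG inhibition", "value": "12.5", "unit": "uM"},
    {"compound_id": "CMP-002", "assay": "hERG inhibition", "value": "3.1", "unit": "uM"},
]


def to_csv(records, fields):
    """Serialise records as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


def to_xml(records):
    """Serialise records as XML, one <result> element per record."""
    root = ET.Element("results")
    for record in records:
        item = ET.SubElement(root, "result")
        for key, value in record.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Parse the XML produced by to_xml back into a list of dictionaries."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in item} for item in root]
```

Both round trips return the original records, but every value comes back as text: neither format records on its own whether "12.5" was meant as a number, which is the gap an agreed data standard is intended to close.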
A controlled vocabulary is preferred when capturing qualitative data in order to avoid problems due to differences in spelling and terminology. Quantitative data should ideally be captured using standardised units to simplify data mining and analysis. However, this is not always practical as recalculation of values can lead to an increase in the number of errors introduced during data entry. When designing the schema, an assessment also needs to be made as to whether precise figures will always be given, or if greater than/less than values and number ranges also need to be captured.
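These schema decisions can be sketched in a few lines of Python. The example below maps free-text findings onto preferred terms and parses a reported value that may be exact, qualified (such as "<5") or a range (such as "10-20"). The vocabulary entries, the `Measurement` structure and the accepted syntax are all assumptions made for illustration; negative numbers and units are deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled vocabulary: each accepted synonym maps to one preferred term.
PREFERRED_TERMS = {
    "necrosis, hepatocellular": "necrosis, hepatocellular",
    "hepatocellular necrosis": "necrosis, hepatocellular",
    "liver cell necrosis": "necrosis, hepatocellular",
}


def normalise_term(raw: str) -> str:
    """Map a free-text finding to its preferred term, ignoring case and extra spaces."""
    key = " ".join(raw.lower().split())
    if key not in PREFERRED_TERMS:
        raise ValueError(f"term not in controlled vocabulary: {raw!r}")
    return PREFERRED_TERMS[key]


@dataclass(frozen=True)
class Measurement:
    qualifier: str          # "=", "<", ">", "<=", ">=" or "range"
    low: float              # the value itself, or the lower bound of a range
    high: Optional[float]   # upper bound, set only for ranges


_NUMBER = r"\d+(?:\.\d+)?"
_QUALIFIED = re.compile(rf"^(<=|>=|<|>|=)?\s*({_NUMBER})$")
_RANGE = re.compile(rf"^({_NUMBER})\s*-\s*({_NUMBER})$")


def parse_value(raw: str) -> Measurement:
    """Parse an exact value, a greater/less-than value or a number range."""
    text = raw.strip()
    match = _RANGE.match(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        if low > high:
            raise ValueError(f"range bounds are reversed: {raw!r}")
        return Measurement("range", low, high)
    match = _QUALIFIED.match(text)
    if match:
        return Measurement(match.group(1) or "=", float(match.group(2)), None)
    raise ValueError(f"unrecognised value: {raw!r}")
```

Storing the qualifier separately from the number means a value reported as "<5" is never silently treated as exactly 5 during later analysis, and rejecting unknown terms at entry keeps spelling variants out of the shared dataset.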
In the case of confidential data, an honest broker can be used to protect the security of sensitive data. This organisation needs to be trusted by all partners, as it will have access to all the data and be responsible for controlling access for the other partners. A not-for-profit or academic organisation is likely to be preferred over a commercial one for this reason.
Evolution of sharing
Over the past decade, data sharing within the pharmaceutical industry has evolved from being virtually non-existent to a landscape where most companies will have gained experience through one or more initiatives. However, for the pharmaceutical sector to truly benefit, data collaboration needs to be incorporated into business as usual, rather than remaining the preserve of special projects. Data still exists within silos and the people who could do something useful with that data often don’t have access to it. There remains a fear in the sector that sharing data gives away commercial advantages when, in fact, sharing information could significantly reduce overheads and speed up the development of new drugs. With the rising cost of clinical trials and health data, the industry needs to look at collaboration as the way forward. Sharing data is not without its challenges, but with the right partners, the benefits far outweigh the risks.
Read our companion case study article: Big data for understanding small molecules; the eTOX consortium
Author: Katharine Briggs is Research Leader at Lhasa Limited, a not-for-profit organisation and educational charity that facilitates collaborative data sharing projects in the pharmaceutical, cosmetics and chemistry-related industries. lhasalimited.org