OpenTrials is a collaborative and open database for all available structured data and documents on all clinical trials, threaded together by individual trial. With a versatile and expandable data schema, it is initially designed to host and match the following documents and data for each trial: registry entries; links, abstracts, or texts of academic journal papers; portions of regulatory documents describing individual trials; structured data on methods and results extracted by systematic reviewers or other researchers; clinical study reports; and additional documents such as blank consent forms, blank case report forms, and protocols. The intention is to create an open, freely re-usable index of all such information and to increase discoverability, facilitate research, identify inconsistent data, enable audits on the availability and completeness of this information, support advocacy for better data and drive up standards around open data in evidence-based medicine. The project has phase I funding. This will allow us to create a practical data schema and populate the database initially through web-scraping, basic record linkage techniques, crowd-sourced curation around selected drug areas, and import of existing sources of structured data and documents. It will also allow us to create user-friendly web interfaces onto the data and conduct user engagement workshops to optimise the database and interface designs. Where other projects have set out to manually and perfectly curate a narrow range of information on a smaller number of trials, we aim to use a broader range of techniques and attempt to match a very large quantity of information on all trials. We are currently seeking feedback and additional sources of structured data.
Trials are used to inform decision making, but there are several ongoing problems with information management on clinical trials, including publication bias, selective outcome reporting, lack of information on methodological flaws, and duplication of effort for search and extraction of data, which have a negative impact on patient care. Randomised trials are used to detect differences between treatments because they are less vulnerable to confounding, and because biases can be minimised within the trial itself. The broader structural problems external to each individual trial result in additional biases, which can exaggerate or attenuate the apparent benefits of treatments.
To take the example of publication bias, the results of trials are commonly and legally withheld from doctors, researchers and patients, more so when they have unwelcome results [1, 2], and there are no clear data on how much is missing for each treatment, sponsor, research site, or investigator, which undermines efforts at audit and accountability. Information that is publicly available in strict legal terms can still be difficult to identify and access if, for example, it is contained in a poorly indexed regulatory document or a results portal that is not commonly accessed [4, 5]. In addition, different reports on the same trial can often describe inconsistent results because of, for example, diverse analytic approaches to the same data in different reports or undisclosed primary outcome switching and other forms of misreporting [4, 6]. There is also considerable inefficiency and duplication of effort around extracting structured data from trial reports to conduct systematic reviews, for example, and around indexing these data to make them more discoverable and more used. Lastly, although large collections of structured “open data” on clinical trials would be valuable for research and clinical activity, including linkage to datasets other than those on trials, there is little available and it can be hard to search or access.
In 1999, Altman and Chalmers described a concept of “threaded publications”, whereby all publications related to a trial could be matched together: the published protocol, the results paper, secondary commentaries, and so forth. This suggestion has been taken up by the Linked Reports of Clinical Trials project, a collaboration of academic publishers which was launched in 2011 with the aim of using the existing CrossMark system for storing metadata on academic publications as a place where publishers can store a unique identifier (ID) on each trial to create a thread of published academic journal articles.
We have obtained funding for phase I of a project that expands this vision, going further than linking all academic papers on each trial: an open database of all structured data and documents on all clinical trials, cross-referenced and indexed by trial. The intention is to create a freely re-usable index of all such information to increase discoverability, facilitate audit on accessibility of information, increase demand for structured data, facilitate annotation, facilitate research, drive up standards around open data in evidence-based medicine, and help address inefficiencies and unnecessary duplication in search, research, and data extraction. Presenting such information coherently will also make different sources more readily comparable and auditable. The project will be built as structured “open data”, a well-recognised concept in information policy work described as “data that can be freely used, modified, and shared by anyone for any purpose”.
This article describes our specific plans, the types of documents and data we will be including, our methods for populating the database, and our proposed presentations of the data to various different types of users. We do not have funding to manually populate the entire database for all data and documents on all trials, and such a task would likely be unmanageably large in any case. In the first phase, we aim to create an empty database with a sensible data schema, or structure, and then populate this through a combination of donations of existing sets of data on clinical trials, scraping and then matching existing data on clinical trials, with the option for users of the site to upload missing documents or links, and manual curation for a subset of trials. We will also create user-friendly windows onto this data. Our project start date was April 2015; our first user engagement workshop was in April 2015; and, after consultation on features and design, our first major coding phase will start in September 2015. We are keen to hear from anyone with suggestions, feature requests, or criticisms, as well as from anybody able to donate structured data on clinical trials, as described below.
A description of the main classes of documents and data included is presented below and in Fig. 1. In overview, where possible, we will be collecting and matching registry entries; links, abstracts, or texts of academic journal papers; portions of regulatory documents describing trials; structured data extracted by systematic reviewers or other researchers; clinical study reports; additional documents such as blank consent forms; and protocols.
Registers are a valuable source of structured data on ongoing and completed trials. There are two main categories of register: industry registers, containing information on some or all trials conducted by one company, and national registers, containing information on some or all trials conducted in one territory or covered by one regulator. National registers generally consist of structured data on 20 standard data fields set out by the World Health Organisation (WHO); industry and specialty registers are more variable. The WHO International Clinical Trials Registry Platform is a “registry of registers” combining the contents of a large number of registers in one place. The simple act of aggregating, deduplicating, and then comparing registers can in itself be valuable. For example, in preliminary coding and matching work, we have found that trials listed in one register as “completed” may be listed as “ongoing” in another; thus, anyone looking only in the register where the trial was “ongoing” would not have known that results were, in fact, overdue. Similarly, where the text field for primary outcome has been changed during a trial, this can be identified in serial data on one registry and flagged up on the page for that trial. Registers presenting structured data have consistent and clearly denoted fields containing information on features such as the number of participants, the interventions (ideally using standard dictionaries and data schemas for consistency with other structured data), inclusion and exclusion criteria, primary and secondary outcomes, location of trial sites, and so forth. This information is ready to be extracted, processed, or presented.
As a very simple example, after extracting this information, one can calculate the total number of trial participants on an intervention globally, restrict a search to include only large trials, or facilitate search of ongoing trials within 50 miles of a location, on a specific condition, where data quality permits.
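The operations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`trial_id`, `register`, `status`, `participants`, `lat`, `lon`), the sample values, and the function names are all hypothetical and are not the OpenTrials schema. It shows three things: flagging trials whose status differs between registers, summing participants for an intervention after deduplication, and finding ongoing trials within a radius using the haversine great-circle formula.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

# Hypothetical, already-extracted register records; every field name is illustrative.
RECORDS = [
    {"trial_id": "T1", "register": "A", "status": "completed",
     "intervention": "drug-x", "participants": 1200, "lat": 51.75, "lon": -1.26},
    {"trial_id": "T1", "register": "B", "status": "ongoing",
     "intervention": "drug-x", "participants": 1200, "lat": 51.75, "lon": -1.26},
    {"trial_id": "T2", "register": "A", "status": "ongoing",
     "intervention": "drug-x", "participants": 80, "lat": 55.95, "lon": -3.19},
]


def conflicting_statuses(records):
    """Return {trial_id: {register: status}} for trials whose registers disagree."""
    by_trial = defaultdict(dict)
    for rec in records:
        by_trial[rec["trial_id"]][rec["register"]] = rec["status"]
    return {tid: dict(s) for tid, s in by_trial.items() if len(set(s.values())) > 1}


def total_participants(records, intervention):
    """Sum participants per intervention, counting each trial once (first record seen)."""
    first_seen = {}
    for rec in records:
        first_seen.setdefault(rec["trial_id"], rec)
    return sum(r["participants"] for r in first_seen.values()
               if r["intervention"] == intervention)


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine, mean Earth radius of 3958.8 miles)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(a))


def ongoing_near(records, lat, lon, max_miles=50):
    """Sorted IDs of trials that any register lists as ongoing within max_miles."""
    return sorted({r["trial_id"] for r in records
                   if r["status"] == "ongoing"
                   and miles_between(lat, lon, r["lat"], r["lon"]) <= max_miles})
```

With the sample records, `conflicting_statuses(RECORDS)` flags only `T1`, which is the kind of “completed” versus “ongoing” disagreement described above. Keeping the first record seen per trial is a deliberately naive deduplication rule; a real pipeline would need an explicit policy for which register to trust.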
Academic journals are one source of information on clinical trials, in the form of semi-structured free text, although they have increasingly been found to be flawed vehicles for such data. For example, they are less complete than clinical study reports, inconsistent with mandated structured data on registers, and permissive on undisclosed switching of primary outcomes and other forms of misreporting. Journal articles on trials include other document types, such as commentaries and protocols. Academic journal articles reporting trial results can be matched against registry entries through various imperfect techniques, such as searching for trial ID numbers in metadata on PubMed (for very recent publications only) while applying standard search filters for trials, or using record linkage techniques on other features such as intervention or population.
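One of the imperfect matching techniques mentioned above, searching for trial ID numbers, can be illustrated with a short Python sketch. It assumes the abstract or metadata text has already been retrieved as a plain string; the function names and the `articles` mapping are hypothetical. The pattern covers only two common identifier shapes (ClinicalTrials.gov IDs of the form `NCT` plus eight digits, and ISRCTN IDs of the form `ISRCTN` plus eight digits); other registers would need their own patterns.

```python
import re

# Two common identifier shapes; an optional space is tolerated after the prefix.
TRIAL_ID_PATTERN = re.compile(r"\b(NCT|ISRCTN)\s?(\d{8})\b", re.IGNORECASE)


def extract_trial_ids(text):
    """Return the sorted, normalised trial IDs mentioned in free text."""
    return sorted({m.group(1).upper() + m.group(2)
                   for m in TRIAL_ID_PATTERN.finditer(text)})


def link_articles(articles, registry_ids):
    """Map each article key to the known registry IDs cited in its text.

    `articles` is a hypothetical {article_key: abstract_text} mapping and
    `registry_ids` a set of IDs already held in the database.
    """
    links = {}
    for key, abstract in articles.items():
        hits = [tid for tid in extract_trial_ids(abstract) if tid in registry_ids]
        if hits:
            links[key] = hits
    return links
```

For example, `extract_trial_ids("Trial registration: NCT01234567; ISRCTN 12345678.")` returns `["ISRCTN12345678", "NCT01234567"]`. Articles that never cite a registration number yield nothing, which is why this approach would need to be combined with record linkage on other features.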
Regulatory documents are an important and often neglected source of information on trials. Clinical study reports are extremely lengthy documents produced for industry-sponsored trials. They have a closely defined structure, which academic researchers have recently begun to access more frequently [14, 17]. At the other end of the spectrum for length, there will often be free text descriptions of the methods and results of clinical trials mixed in with other information in bundles of regulatory documents released by the U.S. Food and Drug Administration and indexed on the Drugs@FDA website or as part of the European public assessment report published by the European Medicines Agency for approved uses of approved drugs. These documents are generally neglected by clinicians and researchers, poorly indexed, and hard to access and navigate. For example, the description of one trial may be buried in a few paragraphs in the middle of a long and poorly structured file, containing multiple documents, each covering multiple different issues around the approval of a product.
Structured data on the results of clinical trials is available from two main sources: registers that accept results reporting, such as ClinicalTrials.gov and ISRCTN (International Standard Randomised Controlled Trial Number), and structured data that has been manually extracted from free text reports on trials by researchers conducting systematic reviews or other research. This can include structured data on the characteristics of the trial (such as number of participants or a description of the interventions using standard dictionaries) or the results of a trial (to populate fields in meta-analysis software), as well as data on the conduct of a trial or its methodological shortcomings; for example, many trials have had their risk of bias graded on various aspects of trial design using standard tools such as the Cochrane Risk of Bias Assessment Tool. There is also a Systematic Review Data Repository (SRDR) archiving structured data that has been extracted manually in the course of producing systematic reviews. SRDR is managed by the Agency for Healthcare Research and Quality (AHRQ), which has already begun to pool such data.
Trial paperwork includes protocols, lay summaries, and statistical analysis plans, as well as documents often currently regarded as “internal”, such as blank case report forms, blank consent forms, ethical approval documents, and patient information sheets. These are generally poorly accessible and rarely indexed, but they can contain salient information. For example, it was only by examination of case report forms that the team conducting the Cochrane review on oseltamivir and complications of influenza were able to establish that the diagnostic criterion for pneumonia was “patient self-report” rather than more conventional methods such as chest x-ray, sputum, and/or medical examination. As another example, when presented with a trial in which the control group received a treatment which seems to be lower than the usual standard of care, a researcher or other interested party may wish to see the consent form to establish whether the benefits and risks of participation were clearly explained to patients. Lastly, ethics committee or institutional review board paperwork may contain information on how any potential risks were discussed or mitigated or may act as an additional source of information to identify undisclosed switching of primary and secondary endpoints. By placing all of this information side by side, identifying such inconsistencies becomes more straightforward and therefore may reasonably be expected to become more commonplace.
Manually populating the database for all documents and data on all trials would be desirable, but it would be a major information curation project requiring very significant financial support. We initially aim to populate the database in sections, with breadth and depth in different areas, through a range of approaches, including web-scraping, basic record linkage techniques, curated crowd-sourcing, and imports or donations of existing structured and linked data.
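To give a sense of what “basic record linkage” can mean in practice, the following Python sketch links records from two sources by comparing normalised trial titles. It is an illustration only, not a description of the project's actual matching code: the choice of title as the linking field, the use of the standard library's `difflib.SequenceMatcher`, the threshold of 0.85, and all function names are assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def normalise(title):
    """Lower-case a title, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised titles."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def link_records(source, target, threshold=0.85):
    """Link each source ID to its best-scoring target ID at or above the threshold.

    `source` and `target` are hypothetical {record_id: title} mappings.
    """
    links = {}
    for sid, stitle in source.items():
        best_id, best_score = None, 0.0
        for tid, ttitle in target.items():
            score = similarity(stitle, ttitle)
            if score > best_score:
                best_id, best_score = tid, score
        if best_id is not None and best_score >= threshold:
            links[sid] = best_id
    return links
```

Comparing every source record with every target record is quadratic and would not scale to all trials without a blocking step (for example, comparing only records that share a sponsor or start year). Matches near the threshold are natural candidates for the curated crowd-sourcing mentioned above.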