Clinical trials are pivotal for developing new medical treatments, but they carry risks, such as patient mortality and enrollment failure, that can waste efforts spanning more than a decade. Applying artificial intelligence (AI) to predict key events in clinical trials holds great potential for providing insights that guide trial design. However, data collection and question definition are complex and require medical expertise, which has so far hindered the involvement of AI. This paper tackles these challenges by presenting a comprehensive suite of 23 meticulously curated AI-ready datasets covering multi-modal input features and 8 crucial prediction challenges in clinical trial design: trial duration, patient dropout rate, serious adverse events, mortality rate, trial approval outcome, trial failure reason, drug dose finding, and eligibility criteria design. Furthermore, we provide basic validation methods for each task to ensure the datasets' usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating the development of medical solutions.
Attribute | Phase I | Phase II | Phase III |
Time spent | 1-2 years | 1-2 years | 2-3 years |
Cost (USD) | 225 M | 225 M | 250 M |
Result | 5-10 candidates | 2-5 candidates | 1-2 candidates |
Major objective | safety | safety and dosing | safety and efficacy |
# of patients | 20-80 | 100-300 | 300-3000 |
Recruited patients | healthy | with diseases | with diseases |
Task | # trials (I/II/III/IV) | # drugs | # medical devices | # other interventions | # diseases | Interventional studies (%) |
trial duration forecasting | 143.8K (13.5K/13.4K/9.2K/7.1K) | 40.8K | 21.1K | 83.6K | 44.6K | 77.3% |
patient dropout event forecasting | 62.1K (4.2K/15.8K/11.5K/6.9K) | 29.7K | 10.9K | 20.7K | 21.9K | 94.5% |
serious adverse event forecasting | 31.3K (2.0K/8.1K/4.8K/2.9K) | 15.9K | 6.6K | 12.4K | 15.9K | 96.0% |
mortality event prediction | 31.3K (2.0K/8.1K/4.8K/2.9K) | 15.9K | 6.6K | 12.4K | 15.9K | 96.0% |
trial approval forecasting | 43.2K (4.5K/12.5K/9.2K/4.5K) | 24.1K | 3.3K | 12.6K | 19.5K | 93.0% |
trial failure reason identification | 41.4K (4.3K/8.8K/4.2K/3.5K) | 17.7K | 6.6K | 16.9K | 21.9K | 86.8% |
eligibility criteria design | 136.4K (19.4K/14.2K/10.8K/10.6K) | 48.5K | 16.2K | 75.0K | 36.6K | 84.9% |
drug dose finding | 12.8K (0/12.8K/0/0) | 11.0K | 0.1K | 1.2K | 7.3K | 100% |
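The trial-count column in the table above packs a total and four per-phase counts into one cell, for example "62.1K (4.2K/15.8K/11.5K/6.9K)". As a reading aid, the following minimal Python sketch splits such a cell and reports how much of the total the four phase counts account for. The function names (`parse_count`, `parse_trials_cell`, `phase_summary`) are hypothetical helpers introduced here, not part of any released dataset tooling, and the sketch makes no claim about what the remainder represents, since this section does not say.

```python
import re


def parse_count(text):
    """Convert a rounded count such as '13.5K' or '0' into an integer."""
    text = text.strip()
    if text.upper().endswith("K"):
        # round() absorbs floating-point noise from the multiplication
        return round(float(text[:-1]) * 1000)
    return round(float(text))


def parse_trials_cell(cell):
    """Split 'TOTAL (I/II/III/IV)' into the total and a per-phase dict."""
    match = re.fullmatch(r"\s*(\S+)\s*\(([^)]*)\)\s*", cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    total = parse_count(match.group(1))
    phases = [parse_count(part) for part in match.group(2).split("/")]
    if len(phases) != 4:
        raise ValueError(f"expected four phase counts in: {cell!r}")
    return total, dict(zip(("I", "II", "III", "IV"), phases))


def phase_summary(cell):
    """Report the total, the per-phase counts, and the uncovered remainder."""
    total, by_phase = parse_trials_cell(cell)
    covered = sum(by_phase.values())
    return {
        "total": total,
        "by_phase": by_phase,
        "sum_of_phases": covered,
        "remainder": total - covered,
    }


if __name__ == "__main__":
    # Cells copied from the table; values are rounded to 0.1K in the source.
    for task, cell in [
        ("patient dropout event forecasting", "62.1K (4.2K/15.8K/11.5K/6.9K)"),
        ("drug dose finding", "12.8K (0/12.8K/0/0)"),
    ]:
        print(task, phase_summary(cell))
```

Because the source values are rounded to the nearest 0.1K, any sums computed this way are approximate.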
(a) A histogram showing the distribution of start dates for the selected trials reveals a steady increase in the number of initiated trials over time, reflecting the growing demand for new treatments.
(b) A statistical breakdown of the clinical trials by phase indicates that the majority of trials are in Phase II.
(c) The frequency of events varies across different phases, as exemplified by the dropout rates among participants.