Methodology Overview

TrialBench: Multi-Modal AI-Ready Datasets for Clinical Trial Prediction

Jintai Chen, Yaojun Hu, Mingchen Cai, Yingzhou Lu, Yue Wang, Xu Cao, Miao Lin, Hongxia Xu, Jian Wu, Cao Xiao, Jimeng Sun, Lucas Glass, Kexin Huang, Marinka Zitnik, Tianfan Fu*

Abstract

Clinical trials are pivotal for developing new medical treatments but typically carry risks such as patient mortality and enrollment failure that waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to predict key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex data collection and question definition requiring medical expertise have hindered the involvement of AI thus far. This paper tackles these challenges by presenting a comprehensive suite of 23 meticulously curated AI-ready datasets covering multi-modal input features and 8 crucial prediction challenges in clinical trial design, encompassing prediction of trial duration, patient dropout rate/event, serious adverse event, mortality event, trial approval outcome, trial failure reason, drug dose, and design of eligibility criteria. Furthermore, we provide basic validation methods for each task to ensure the datasets' usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating medical solution development.

As of 2025-3-4, the data had been downloaded 0 times.

Download SubTask Datasets: Python & R Instructions

Python
  1. Install trialbench:
    pip install trialbench
  2. Manually download mesh_embeddings.txt.gz:
    Download mesh_embeddings.txt.gz
    Copy it to the trialbench data directory:
    cp mesh_embeddings.txt.gz your_path_to/miniconda3/envs/trialbench/lib/python3.10/site-packages/trialbench/data/mesh-embeddings/
  3. Download and load datasets:
    import trialbench

    # Download all datasets (optional)
    trialbench.function.download_all_data('data/')

    # Load data
    task = 'dose'
    phase = 'All'

    # Dataloader format
    train_loader, valid_loader, test_loader, num_classes, tabular_input_dim = (
        trialbench.function.load_data(task, phase, data_format='dl')
    )

    # Or as Pandas DataFrame
    train_df, valid_df, test_df, num_classes, tabular_input_dim = (
        trialbench.function.load_data(task, phase, data_format='df')
    )
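The copy command in step 2 hard-codes a site-packages path that differs between machines and environments. As a minimal sketch, the destination can instead be resolved from the installed package itself. The helper names `mesh_embeddings_dir` and `install_mesh_embeddings` are hypothetical and not part of trialbench; the only thing taken from the instructions above is the `data/mesh-embeddings/` location inside the package.

```python
import importlib.util
import shutil
from pathlib import Path


def mesh_embeddings_dir(package: str = "trialbench") -> Path:
    """Return <package dir>/data/mesh-embeddings for an installed package.

    Hypothetical helper: it only mirrors the directory layout named in the
    installation steps and does not call any trialbench API.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        raise ModuleNotFoundError(f"package {package!r} is not installed")
    package_dir = Path(next(iter(spec.submodule_search_locations)))
    return package_dir / "data" / "mesh-embeddings"


def install_mesh_embeddings(archive: str, package: str = "trialbench") -> Path:
    """Copy the manually downloaded archive into the package data directory."""
    target_dir = mesh_embeddings_dir(package)
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(archive, target_dir))
```

Calling `install_mesh_embeddings("mesh_embeddings.txt.gz")` from the directory that holds the downloaded file should then be equivalent to the `cp` command, provided it is run inside the environment where trialbench is installed.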
R
  1. Download R package:
    Please download r.trialbench_0.0.0.9000.tar.gz from here.
    After downloading, decompress it:
    tar -xzvf r.trialbench_0.0.0.9000.tar.gz
  2. Install dependencies and Python environment:
    install.packages("reticulate")
    library(reticulate)
    conda_create("r_trialbench", python_version = "3.10")
    use_condaenv("r_trialbench")
    reticulate::py_install("trialbench", pip = TRUE)
  3. Manually download mesh_embeddings.txt.gz:
    Download mesh_embeddings.txt.gz
    Copy it to the trialbench data directory:
    cp mesh_embeddings.txt.gz your_path_to/miniconda3/envs/r_trialbench/lib/python3.10/site-packages/trialbench/data/mesh-embeddings/
  4. Source the R functions and load data:
    # Load R functions (replace with your actual path)
    source("your_path_to/r.trialbench/R/function.R", encoding = "UTF-8")

    # Download all datasets (optional)
    download_all_data("data/")

    # Load data
    task <- "dose"
    phase <- "All"
    data_list <- load_data(task, phase)
    train_df <- data_list$train_df
    valid_df <- data_list$valid_df
    test_df <- data_list$test_df

Comparison of different phases from several angles

| Aspect | Phase I | Phase II | Phase III |
|---|---|---|---|
| Time spent | 1-2 years | 1-2 years | 2-3 years |
| Cost (USD) | 225 M | 225 M | 250 M |
| Result | 5-10 candidates | 2-5 candidates | 1-2 candidates |
| Major objective | safety | safety and dosing | safety and efficacy |
| Number of patients | 20-80 | 100-300 | 300-3000 |
| Recruited patients | healthy | with diseases | with diseases |
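To make the cumulative burden of the three phases concrete, the per-phase figures can be added up. The sketch below is illustrative arithmetic only: it assumes the phases run back to back with no gaps and that each cost figure applies to one phase, neither of which the table states. The names `PHASES`, `total_duration_years`, and `total_cost_musd` are hypothetical.

```python
# Per-phase figures copied from the comparison table.
# "years" is a (low, high) range; "cost_musd" is in millions of US dollars.
PHASES = {
    "Phase I": {"years": (1, 2), "cost_musd": 225},
    "Phase II": {"years": (1, 2), "cost_musd": 225},
    "Phase III": {"years": (2, 3), "cost_musd": 250},
}


def total_duration_years(phases: dict) -> tuple[int, int]:
    """Sum the low and high ends of each phase's duration range.

    Assumption: phases are strictly sequential with no idle time between them.
    """
    low = sum(p["years"][0] for p in phases.values())
    high = sum(p["years"][1] for p in phases.values())
    return low, high


def total_cost_musd(phases: dict) -> int:
    """Sum the per-phase cost figures (millions of US dollars)."""
    return sum(p["cost_musd"] for p in phases.values())
```

Under these assumptions, Phases I through III together take 4 to 7 years and cost 700 M USD.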

Statistics of all the curated AI-solvable clinical trial datasets

| Task | # trials (I/II/III/IV) | # drugs | # medical devices | # other interventions | # diseases | Interventional studies (%) |
|---|---|---|---|---|---|---|
| trial duration forecasting | 143.8K (13.5K/13.4K/9.2K/7.1K) | 40.8K | 21.1K | 83.6K | 44.6K | 77.3% |
| patient dropout event forecasting | 62.1K (4.2K/15.8K/11.5K/6.9K) | 29.7K | 10.9K | 20.7K | 21.9K | 94.5% |
| serious adverse event forecasting | 31.3K (2.0K/8.1K/4.8K/2.9K) | 15.9K | 6.6K | 12.4K | 15.9K | 96.0% |
| mortality event prediction | 31.3K (2.0K/8.1K/4.8K/2.9K) | 15.9K | 6.6K | 12.4K | 15.9K | 96.0% |
| trial approval forecasting | 43.2K (4.5K/12.5K/9.2K/4.5K) | 24.1K | 3.3K | 12.6K | 19.5K | 93.0% |
| trial failure reason identification | 41.4K (4.3K/8.8K/4.2K/3.5K) | 17.7K | 6.6K | 16.9K | 21.9K | 86.8% |
| eligibility criteria design | 136.4K (19.4K/14.2K/10.8K/10.6K) | 48.5K | 16.2K | 75.0K | 36.6K | 84.9% |
| drug dose finding | 12.8K (0/12.8K/0/0) | 11.0K | 0.1K | 1.2K | 7.3K | 100% |
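In the "# trials" column, the four per-phase counts in parentheses do not always add up to the total in front of them. A small parser makes that gap easy to quantify. This is a sketch over the cell strings exactly as printed in the table; the function names are hypothetical, and the reading that the remainder consists of trials without a single Phase I-IV label is an assumption the table itself does not confirm.

```python
import re


def parse_count(token: str) -> int:
    """Convert a table count such as '13.5K' or '0' to an integer."""
    token = token.strip()
    if token.upper().endswith("K"):
        # round() guards against floating-point noise such as 143800.00000000003
        return round(float(token[:-1]) * 1000)
    return round(float(token))


def parse_trials_cell(cell: str) -> tuple[int, list[int]]:
    """Split 'TOTAL (I/II/III/IV)' into the total and the per-phase counts."""
    match = re.fullmatch(r"\s*([\d.]+K?)\s*\(([^)]*)\)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognised cell format: {cell!r}")
    total = parse_count(match.group(1))
    phases = [parse_count(part) for part in match.group(2).split("/")]
    return total, phases


def phase_labelled_share(cell: str) -> float:
    """Fraction of the total accounted for by the four listed phase counts."""
    total, phases = parse_trials_cell(cell)
    return sum(phases) / total
```

For example, `phase_labelled_share("143.8K (13.5K/13.4K/9.2K/7.1K)")` comes to roughly 0.30 for trial duration forecasting, whereas the drug dose finding row, `"12.8K (0/12.8K/0/0)"`, is fully covered by Phase II.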

Result Analysis

(a) A histogram showing the distribution of start dates for the selected trials reveals a steady increase in the number of initiated trials over time, reflecting the growing demand for new treatments.
(b) A statistical breakdown of the clinical trials by phase indicates that the majority of trials are in Phase II.
(c) The frequency of events varies across different phases, as exemplified by the dropout rates among participants.

World Map