TrialBench - Research Project

Abstract

Clinical trials are pivotal for developing new medical treatments but typically carry risks such as patient mortality and enrollment failure that waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to predict key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex data collection and question definition requiring medical expertise have hindered the involvement of AI thus far. This paper tackles these challenges by presenting a comprehensive suite of 23 meticulously curated AI-ready datasets covering multi-modal input features and 8 crucial prediction challenges in clinical trial design, encompassing prediction of trial duration, patient dropout rate/event, serious adverse event, mortality event, trial approval outcome, trial failure reason, drug dose, and design of eligibility criteria. Furthermore, we provide basic validation methods for each task to ensure the datasets' usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating medical solution development.

Download SubTask Datasets

Trial Duration Forecasting
Patient Dropout Event Forecasting
Serious Adverse Event Forecasting
Mortality Event Prediction
Trial Approval Forecasting
Trial Failure Reason Identification
Eligibility Criteria Design
Drug Dose Finding

Download SubTask Datasets: Python & R Instructions

Python

Install trialbench:
pip install trialbench
Manually download mesh_embeddings.txt.gz:
Download mesh_embeddings.txt.gz
Copy it to the trialbench data directory:

cp mesh_embeddings.txt.gz your_path_to/miniconda3/envs/trialbench/lib/python3.10/site-packages/trialbench/data/mesh-embeddings/
Download and load datasets:
import trialbench

# Download all datasets (optional)
trialbench.function.download_all_data('data/')

# Load data
task = 'dose'
phase = 'All'

# Dataloader format
train_loader, valid_loader, test_loader, num_classes, tabular_input_dim = trialbench.function.load_data(task, phase, data_format='dl')

# Or as Pandas DataFrame
train_df, valid_df, test_df, num_classes, tabular_input_dim = trialbench.function.load_data(task, phase, data_format='df')
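
The manual copy step above hard-codes a miniconda path that differs between machines. As a minimal sketch, the same copy can be done from Python by asking the interpreter where trialbench is installed. The helper names (embeddings_dir, find_trialbench_root, install_embeddings) are hypothetical and not part of trialbench. The only detail taken from the instructions is the data/mesh-embeddings/ location inside the package. Creating that directory when it is missing is an assumption of this sketch.

```python
import importlib.util
import shutil
from pathlib import Path


def embeddings_dir(package_root: Path) -> Path:
    """Return the mesh-embeddings directory inside a trialbench install."""
    return package_root / "data" / "mesh-embeddings"


def find_trialbench_root() -> Path:
    """Locate the installed trialbench package without importing it."""
    spec = importlib.util.find_spec("trialbench")
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError("trialbench is not installed in this environment")
    # spec.origin points at trialbench/__init__.py; its parent is the package root.
    return Path(spec.origin).parent


def install_embeddings(archive: Path, package_root: Path) -> Path:
    """Copy the downloaded archive into the package's mesh-embeddings directory."""
    target_dir = embeddings_dir(package_root)
    # Assumption: create the directory if the installed package does not ship it.
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(archive, target_dir))


# Example usage (run inside the environment where trialbench is installed):
# install_embeddings(Path("mesh_embeddings.txt.gz"), find_trialbench_root())
```

Run it with the Python interpreter of the environment that holds trialbench (for the R workflow, the r_trialbench conda environment), so that the package lookup resolves to the right site-packages directory.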

R

Download the R package:
Please download r.trialbench_0.0.0.9000.tar.gz from here.
After downloading, decompress it:
tar -xzvf r.trialbench_0.0.0.9000.tar.gz
Install dependencies and Python environment:
install.packages("reticulate")
library(reticulate)
conda_create("r_trialbench", python_version = "3.10")
use_condaenv("r_trialbench")
reticulate::py_install("trialbench", pip = TRUE)
Manually download mesh_embeddings.txt.gz:
Download mesh_embeddings.txt.gz
Copy it to the trialbench data directory:

cp mesh_embeddings.txt.gz your_path_to/miniconda3/envs/r_trialbench/lib/python3.10/site-packages/trialbench/data/mesh-embeddings/
Source the R functions and load data:
# Load R functions (replace with your actual path)
source("your_path_to/r.trialbench/R/function.R", encoding = "UTF-8")

# Download all datasets (optional)
download_all_data("data/")

# Load data
task <- "dose"
phase <- "All"
data_list <- load_data(task, phase)
train_df <- data_list$train_df
valid_df <- data_list$valid_df
test_df <- data_list$test_df


	Phase I	Phase II	Phase III
Time spent	1-2 years	1-2 years	2-3 years
Cost (USD)	225 M	225 M	250 M
Result	5-10 candidates	2-5 candidates	1-2 candidates
Major objective	safety	safety and dosing	safety and efficacy
Number of patients	20-80	100-300	300-3000
Recruited patients	healthy	with diseases	with diseases


Tasks	# trials (I/II/III/IV)	# drugs	# medical devices	# other interventions	# diseases	Interventional studies (%)
trial duration forecasting	143.8K (13.5K/13.4K/9.2K/7.1K)	40.8K	21.1K	83.6K	44.6K	77.3%
patient dropout event forecasting	62.1K (4.2K/15.8K/11.5K/6.9K)	29.7K	10.9K	20.7K	21.9K	94.5%
serious adverse event forecasting	31.3K (2.0K/8.1K/4.8K/2.9K)	15.9K	6.6K	12.4K	15.9K	96.0%
mortality event prediction	31.3K (2.0K/8.1K/4.8K/2.9K)	15.9K	6.6K	12.4K	15.9K	96.0%
trial approval forecasting	43.2K (4.5K/12.5K/9.2K/4.5K)	24.1K	3.3K	12.6K	19.5K	93.0%
trial failure reason identification	41.4K (4.3K/8.8K/4.2K/3.5K)	17.7K	6.6K	16.9K	21.9K	86.8%
eligibility criteria design	136.4K (19.4K/14.2K/10.8K/10.6K)	48.5K	16.2K	75.0K	36.6K	84.9%
drug dose finding	12.8K (0/12.8K/0/0)	11.0K	0.1K	1.2K	7.3K	100%
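
The "# trials" cells pack a total and four per-phase counts into one string. The sketch below shows one way to split such a cell into integers for further analysis. The function names are hypothetical, and the parser assumes every cell follows the "total (I/II/III/IV)" pattern with an optional K suffix, as in the table above.

```python
import re


def parse_count(text: str) -> int:
    """Convert a count such as '13.5K' or '0' into an integer."""
    text = text.strip()
    if text.upper().endswith("K"):
        return round(float(text[:-1]) * 1000)
    return round(float(text))


def parse_trials_cell(cell: str) -> tuple[int, dict[str, int]]:
    """Split 'total (I/II/III/IV)' into the total and a per-phase mapping."""
    match = re.fullmatch(r"\s*(\S+)\s*\(([^)]*)\)\s*", cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    total = parse_count(match.group(1))
    phases = [parse_count(part) for part in match.group(2).split("/")]
    if len(phases) != 4:
        raise ValueError(f"expected four phase counts, got {len(phases)}")
    return total, dict(zip(("I", "II", "III", "IV"), phases))


# Cell copied from the "trial duration forecasting" row.
total, by_phase = parse_trials_cell("143.8K (13.5K/13.4K/9.2K/7.1K)")
print(total, by_phase, sum(by_phase.values()))
```

For this row the four phase counts add up to 43.2K, which is less than the 143.8K total. The table does not say how the remaining trials are categorized, so per-phase counts should not be assumed to partition the total.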

TrialBench: Multi-Modal AI-Ready Datasets for Clinical Trial Prediction

Comparison of different phases from several angles

Statistics of all the curated AI-solvable clinical trial datasets