ChemPile advances chemical AI with over 75 billion curated multimodal tokens spanning educational
materials, research literature, code, and molecular representations.
Integrating structured notations (SMILES, SELFIES), text, images, and code across disciplines, it features
standardized splits and permissive licensing for effortless model training.
Its consistent design supports robust cross-domain benchmarking while maintaining scientific rigor.
By offering machine-learning-ready resources through easy-to-use infrastructure, we hope that ChemPile will serve as a foundation for chemical AI.
Introduction
ChemPile bridges the gap between chemistry's complexity and AI's potential by curating unified,
multimodal training data that mirrors human expertise. While foundation models could revolutionize drug discovery
and materials science, progress is hindered by fragmented, narrow datasets limited to single modalities
(e.g., SMILES strings). These resources fail to capture the interconnected reasoning and
contextual fluency of real-world chemistry, resulting in models that struggle to generalize or synthesize insights.
ChemPile addresses this by structuring over 75 billion tokens into a holistic learning
ecosystem, designed to emulate how chemists build mastery—through foundational concepts,
research exposure, and multimodal problem-solving.
The dataset integrates six synergistic components:
ChemPile-Education: textbooks/lectures for core principles.
ChemPile-Paper: cutting-edge chemical research literature.
ChemPile-(M)LIFT: multimodal chemical properties.
ChemPile-Reasoning: step-by-step problem-solving.
ChemPile-Code: executable computational scripts.
ChemPile-Caption: image-text pairs for molecular analysis.
Hosted on HuggingFace with standardized splits and open licensing, ChemPile spans chemical subfields from biochemistry
through organic chemistry to materials science, enabling cross-domain generalization. By unifying high-quality, machine-learning-ready data,
it aims to democratize chemical AI development.
This sunburst visualization shows the different components of ChemPile distributed according to token count.
Smaller datasets are scaled to a minimum size to ensure visibility.
These foundational datasets together form the comprehensive ChemPile collection.
Examples from ChemPile
Explore examples from different components of ChemPile.
Each example demonstrates the diverse content and formats available across the dataset.
A critical factor in developing robust foundation models is the scale of the training data.
As illustrated in the bar plot below, ChemPile significantly surpasses other publicly available chemical datasets in size.
For instance, ChemDFM, the largest reported chemical foundation model, was trained on 34 billion tokens, a dataset that is over 50% smaller than ChemPile,
even with its inclusion of general-purpose data like Wikipedia and the WuDao Corpora.
Figure 1: Comparison of ChemPile's token count with other chemical datasets, demonstrating ChemPile's significant size advantage.
Other prominent chemical datasets, including LlaSMol and ChemDual, are orders of magnitude smaller.
This makes ChemPile the most substantial open chemical dataset currently available,
providing the necessary scale for training powerful and insightful foundation models.
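The "over 50% smaller" comparison with ChemDFM can be checked with a few lines of arithmetic. The sketch below uses only the two token counts quoted in the text; it treats "over 75 billion" as a lower bound of 75 billion, which is an assumption made here for illustration.

```python
# Token counts quoted in the text, in billions.
CHEMPILE_TOKENS_B = 75  # "over 75 billion" tokens, treated here as a lower bound
CHEMDFM_TOKENS_B = 34   # size of the ChemDFM training corpus


def percent_smaller(smaller: float, larger: float) -> float:
    """Return how much smaller `smaller` is than `larger`, as a percentage."""
    return (1 - smaller / larger) * 100


reduction = percent_smaller(CHEMDFM_TOKENS_B, CHEMPILE_TOKENS_B)
print(f"ChemDFM's corpus is {reduction:.1f}% smaller than ChemPile")
# prints: ChemDFM's corpus is 54.7% smaller than ChemPile
```

Because 75 billion is a lower bound on ChemPile's size, the true gap is at least this large.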
A Rich Tapestry of Chemical Information
Beyond its impressive size, ChemPile boasts exceptional diversity, a crucial element for training versatile
large language models (LLMs), as the interplay of data sources significantly influences model generalization.
Current understanding of optimal data mixing for LLMs is an evolving area of study, with different
combinations often leading to varied performance outcomes.
Figure 2: Comparison of dataset embeddings with other available chemistry datasets showing ChemPile's superior diversity.
ChemPile is intentionally designed to be maximally diverse to facilitate research in this domain.
As Figure 2 shows, embeddings of ChemPile data occupy a broader region of the embedding space
than many other chemical datasets combined.
This breadth is achieved by incorporating data from a
wide array of sources—ranging from structured chemical databases and lecture transcripts to
entirely novel, purpose-built data—and by representing chemical entities through multiple
modalities and textual formats.
Unlike other large chemical datasets, ChemPile is a carefully assembled collection of
distinct subsets, each curated to encapsulate specific chemical knowledge or to foster
particular capabilities in models trained upon them. Furthermore, ChemPile acknowledges that chemical
substances can be described in numerous ways. This includes various string-based notations such as
IUPAC names, SMILES, SELFIES, and InChI, alongside visual representations like molecular
drawings and images extracted from chemical textbooks.
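To make the idea of multiple string notations concrete, here is a minimal sketch that stores one molecule (ethanol) in four notations and renders the same statement once per notation. The `MoleculeRecord` schema, the `LABELS` mapping, and the sentence template are hypothetical and chosen for illustration; they are not taken from ChemPile's actual data format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MoleculeRecord:
    """One molecule written in several string notations (hypothetical schema)."""
    iupac_name: str
    smiles: str
    selfies: str
    inchi: str


# Human-readable label for each notation field.
LABELS = {
    "iupac_name": "IUPAC name",
    "smiles": "SMILES",
    "selfies": "SELFIES",
    "inchi": "InChI",
}

# Hypothetical sentence template; ChemPile's real templates may differ.
TEMPLATE = "The molar mass of the molecule with the {label} {identifier} is {value} g/mol."


def render_all(record: MoleculeRecord, value: float) -> list[str]:
    """Render the template once per notation stored in the record."""
    return [
        TEMPLATE.format(label=LABELS[f.name], identifier=getattr(record, f.name), value=value)
        for f in fields(record)
    ]


ethanol = MoleculeRecord(
    iupac_name="ethanol",
    smiles="CCO",
    selfies="[C][C][O]",
    inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
)

sentences = render_all(ethanol, 46.07)
for sentence in sentences:
    print(sentence)
```

Each printed sentence states the same fact about the same compound, differing only in how the molecule is identified.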
Crafted with Manual Curation
The superior quality of ChemPile is a direct result of rigorous expert oversight throughout
its creation. Each component dataset was subjected to manual review by domain specialists
to confirm its scientific accuracy and relevance. For instance, the ChemPile-(M)LIFT subset
underwent a systematic verification protocol where chemical experts meticulously checked template
designs, property assignments, and molecular representations.
Every dataset within ChemPile passed through multiple validation stages to eradicate
inconsistencies, incorrect terminology, and formatting errors. This painstaking curation, reflecting
hundreds of hours of expert work, has produced a dataset that faithfully captures both fundamental
chemical principles and specialized knowledge with high fidelity.
Seamless Integration and Use
ChemPile is engineered for immediate and widespread use, offering consistent interfaces across
all its constituent datasets to welcome researchers from diverse scientific backgrounds. The entire
collection is conveniently hosted on HuggingFace, featuring uniform formatting and extensive documentation.
Each subset is accompanied by detailed metadata, illustrative usage examples, and clearly defined
training, validation, and test splits, which are carefully designed to prevent leakage of chemical
structures between partitions. The dataset's modular design empowers researchers to utilize
specific subsets independently or to combine them as required for their particular research goals.
This focus on accessibility lowers the barrier to entry for both machine learning practitioners and
chemistry experts, facilitating its direct application in training foundation models, undertaking
specialized fine-tuning, or conducting targeted research within specific chemical domains.
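One common way to keep chemical structures from leaking between partitions is to assign every record to a split based on its structure identifier rather than on the record itself. The sketch below illustrates that general technique with a deterministic hash; it is an assumption about how such splits can be built, not a description of ChemPile's actual procedure. The function name, split fractions, and example records are hypothetical, and the approach presumes each molecule has a single canonical identifier (for example a canonical SMILES or an InChIKey).

```python
import hashlib


def assign_split(structure_key: str, val_fraction: float = 0.1, test_fraction: float = 0.1) -> str:
    """Deterministically map a canonical structure identifier to a split name."""
    digest = hashlib.sha256(structure_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


# Hypothetical records; two of them describe the same molecule (ethanol).
records = [
    {"smiles": "CCO", "text": "Ethanol is miscible with water."},
    {"smiles": "CCO", "text": "Ethanol is a common solvent."},
    {"smiles": "c1ccccc1", "text": "Benzene is aromatic."},
    {"smiles": "CC(=O)O", "text": "Acetic acid is a weak acid."},
]

splits = {"train": [], "validation": [], "test": []}
for record in records:
    splits[assign_split(record["smiles"])].append(record)

# Verify that no structure appears in more than one split.
split_of_structure = {}
for name, items in splits.items():
    for record in items:
        assert split_of_structure.setdefault(record["smiles"], name) == name

for name, items in splits.items():
    print(name, len(items))
```

Because the split depends only on the structure identifier, all records about the same molecule land in the same partition, regardless of how many text or property entries refer to it.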
BibTeX
@article{mirza2025chempile0,
title = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
author = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and Mohamed Abdelalim and Jack Butler and Bethany Connolly and Tunca Dogan and Marianna Nezhurina and Bünyamin Şen and Santosh Tirunagari and Mark Worrall and Adamo Young and Philippe Schwaller and Michael Pieler and Kevin Maik Jablonka},
year = {2025},
journal = {arXiv preprint arXiv:2505.12534}
}