ChemPile

TL;DR

ChemPile is a curated, multimodal dataset of over 75 billion tokens for chemical AI, spanning educational materials, research literature, code, and molecular representations. It integrates structured notations (SMILES, SELFIES), text, images, and code across chemical disciplines, and ships with standardized splits and permissive licensing so that models can be trained on it directly. Its consistent design supports cross-domain benchmarking without sacrificing scientific rigor. By offering machine learning-ready resources through easy-to-use infrastructure, we hope that ChemPile will become a foundation for chemical AI.

Introduction

ChemPile bridges the gap between chemistry's complexity and AI's potential by curating unified, multimodal training data that mirrors human expertise. While foundation models could revolutionize drug discovery and materials science, progress is hindered by fragmented, narrow datasets limited to single modalities (e.g., SMILES strings). These resources fail to capture the interconnected reasoning and contextual fluency of real-world chemistry, resulting in models that struggle to generalize or synthesize insights. ChemPile addresses this by structuring over 75 billion tokens into a holistic learning ecosystem, designed to emulate how chemists build mastery: through foundational concepts, research exposure, and multimodal problem-solving.

The dataset integrates six synergistic components:

ChemPile-Education: textbooks/lectures for core principles.
ChemPile-Paper: cutting-edge chemical research literature.
ChemPile-(M)LIFT: multimodal chemical properties.
ChemPile-Reasoning: step-by-step problem-solving.
ChemPile-Code: executable computational scripts.
ChemPile-Caption: image-text pairs for molecular analysis.

Hosted on HuggingFace with standardized splits and open licensing, ChemPile spans chemical subfields from biochemistry through organic chemistry to materials science, to enable cross-domain generalization. By unifying high-quality, machine learning-ready data, it aims to democratize chemical AI development.

Magnitude Matters: Powering Next-Generation Models

A critical factor in developing robust foundation models is the scale of the training data. As illustrated in the bar plot below, ChemPile significantly surpasses other publicly available chemical datasets in size. For instance, ChemDFM, the largest reported chemical foundation model, was trained on 34 billion tokens, a dataset that is over 50% smaller than ChemPile, even with its inclusion of general-purpose data like Wikipedia and the WuDao Corpora.

**Figure 1: Comparison of ChemPile's token count with other chemical datasets, demonstrating ChemPile's significant size advantage.**

Other prominent chemical datasets, including LlaSMol and ChemDual, are orders of magnitude smaller. This makes ChemPile the most substantial open chemical dataset currently available, providing the necessary scale for training powerful and insightful foundation models.
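
The "over 50% smaller" comparison above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the "over 75 billion tokens" figure from the TL;DR as ChemPile's size and the 34 billion tokens quoted above for ChemDFM; the variable names are illustrative only.

```python
# Token counts as quoted in the text (in billions of tokens).
chempile_tokens_b = 75   # "over 75 billion" (TL;DR); treated here as a lower bound
chemdfm_tokens_b = 34    # ChemDFM training corpus, as stated above

# Relative size reduction of the ChemDFM corpus compared with ChemPile.
relative_reduction = 1 - chemdfm_tokens_b / chempile_tokens_b

print(f"ChemDFM corpus is about {relative_reduction:.1%} smaller than ChemPile")
```

Because 75 billion is a lower bound on ChemPile's size, the true gap is at least this large.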

A Rich Tapestry of Chemical Information

Beyond its size, ChemPile offers exceptional diversity, a crucial element for training versatile large language models (LLMs), because the interplay of data sources significantly influences model generalization. How best to mix data sources for LLMs remains an open research question, and different combinations often lead to different performance outcomes.

**Figure 2: Comparison of dataset embeddings with other available chemistry datasets showing ChemPile's superior diversity.**

ChemPile is intentionally designed to be maximally diverse to facilitate research in this domain. As Figure 2 shows, embeddings of ChemPile data occupy a broader region of embedding space than many other chemical datasets combined. This breadth is achieved by incorporating data from a wide array of sources, ranging from structured chemical databases and lecture transcripts to entirely novel, purpose-built data, and by representing chemical entities through multiple modalities and textual formats.

Unlike other large chemical datasets, ChemPile is a carefully assembled collection of distinct subsets, each curated to encapsulate specific chemical knowledge or to foster particular capabilities in models trained upon them. Furthermore, ChemPile acknowledges that chemical substances can be described in numerous ways. This includes various string-based notations such as IUPAC names, SMILES, SELFIES, and InChI, alongside visual representations like molecular drawings and images extracted from chemical textbooks.
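
To make the idea of multiple string representations concrete, the sketch below renders one molecule (ethanol) into text samples, one per notation. The record layout, the `TEMPLATE` string, and the `render_samples` helper are hypothetical and chosen for illustration; they are not ChemPile's actual schema or templates.

```python
# One molecule described in several string notations (ethanol).
molecule = {
    "iupac_name": "ethanol",
    "smiles": "CCO",
    "selfies": "[C][C][O]",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "molar_mass_g_per_mol": 46.07,
}

# Hypothetical natural-language template; real templates may differ.
TEMPLATE = "The molar mass of the molecule with the {label} {identifier} is {value} g/mol."

# Human-readable label for each notation field in the record.
LABELS = {
    "iupac_name": "IUPAC name",
    "smiles": "SMILES",
    "selfies": "SELFIES",
    "inchi": "InChI",
}


def render_samples(record):
    """Return one text sample per notation for the same molecule and property."""
    return [
        TEMPLATE.format(
            label=label,
            identifier=record[field],
            value=record["molar_mass_g_per_mol"],
        )
        for field, label in LABELS.items()
    ]


samples = render_samples(molecule)
for sample in samples:
    print(sample)
```

Each sample states the same fact about the same molecule, so a model trained on such text sees the property tied to every notation rather than to a single one.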

Crafted with Manual Curation

The superior quality of ChemPile is a direct result of rigorous expert oversight throughout its creation. Each component dataset was subjected to manual review by domain specialists to confirm its scientific accuracy and relevance. For instance, the ChemPile-(M)LIFT subset underwent a systematic verification protocol where chemical experts meticulously checked template designs, property assignments, and molecular representations.

Every dataset within ChemPile passed through multiple validation stages to eradicate inconsistencies, incorrect terminology, and formatting errors. This painstaking curation, reflecting hundreds of hours of expert work, has produced a dataset that faithfully captures both fundamental chemical principles and specialized knowledge with high fidelity.

Seamless Integration and Use

ChemPile is engineered for immediate and widespread use, offering consistent interfaces across all its constituent datasets to welcome researchers from diverse scientific backgrounds. The entire collection is conveniently hosted on HuggingFace, featuring uniform formatting and extensive documentation.

Each subset is accompanied by detailed metadata, illustrative usage examples, and clearly defined training, validation, and test splits, which are carefully designed to prevent leakage of chemical structures between partitions. The dataset's modular design empowers researchers to utilize specific subsets independently or to combine them as required for their particular research goals.
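
One common way to keep chemical structures from leaking between partitions is to assign each structure, rather than each record, to a split. The sketch below does this by hashing a canonical structure key; it illustrates the general technique and is not a description of how ChemPile's splits were actually built. The function names, the `inchikey` field, and the split fractions are assumptions, and the keys are assumed to be canonicalized already (for example as InChIKeys), since the same molecule can be written as several different SMILES strings.

```python
import hashlib
from collections import defaultdict


def assign_split(structure_key, val_fraction=0.1, test_fraction=0.1):
    """Deterministically map a canonical structure key to a split name."""
    digest = hashlib.sha256(structure_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "validation"
    return "train"


def split_records(records, key_field="inchikey"):
    """Group records by split so that all records of one structure stay together."""
    splits = defaultdict(list)
    for record in records:
        splits[assign_split(record[key_field])].append(record)
    return dict(splits)


# Toy records: two describe ethanol and must land in the same split.
records = [
    {"inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "text": "Ethanol is a primary alcohol."},
    {"inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "text": "Ethanol is miscible with water."},
    {"inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N", "text": "Benzene is an aromatic hydrocarbon."},
    {"inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "text": "Water is a polar solvent."},
]

splits = split_records(records)
for name, rows in splits.items():
    print(name, [row["inchikey"] for row in rows])
```

Because the assignment depends only on the structure key, it is reproducible across runs and stays stable when new records are added. The resulting split sizes only approximate the requested fractions, which is the usual trade-off of hash-based splitting.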

This focus on accessibility lowers the barrier to entry for both machine learning practitioners and chemistry experts, facilitating its direct application in training foundation models, undertaking specialized fine-tuning, or conducting targeted research within specific chemical domains.

BibTeX

@article{mirza2025chempile0,
  title   = {ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models},
  author  = {Adrian Mirza and Nawaf Alampara and Martiño Ríos-García and Mohamed Abdelalim and Jack Butler and Bethany Connolly and Tunca Dogan and Marianna Nezhurina and Bünyamin Şen and Santosh Tirunagari and Mark Worrall and Adamo Young and Philippe Schwaller and Michael Pieler and Kevin Maik Jablonka},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.12534}
}