1
The Fully-Open Model Builders
These organizations release weights + training data + code + checkpoints — the full stack, reproducible from scratch.
Allen Institute for AI (Ai2)
Gold Standard
Models: OLMo / Molmo / Tülu. Full training code, every intermediate checkpoint, training configs, data provenance, and evaluation pipelines are public, all under Apache 2.0. Sizes go up to 32B. The latest releases are Olmo 3 (Nov 2025) and Olmo 3.1 32B; alongside them, Ai2 released every training and fine-tuning dataset (up to 6 trillion tokens) for download without license restrictions, plus OlmoTrace for tracing outputs back to training data.
Roadmap: Ai2 has confirmed that a sparse Olmo-MoE is planned for 2026, with more model flows, toolkits, and reasoning-focused releases to follow, especially around the 32B scale they call a "sweet spot."
Hugging Face
Data Stewards
Models: SmolLM / SmolVLM. SmolLM3 (3B) is a fully open model — open weights plus full training details including the public data mixture and training configs — pretrained on 11.2T tokens. They also maintain the FineWeb / FineWeb-Edu corpora that many other open models train on.
Roadmap: continued iteration on the SmolLM/SmolVLM family and the FineWeb data releases, with intermediate checkpoints and post-training data released progressively.
EleutherAI
Original Open-Data Lab
Models: Pythia. Datasets: The Pile, Common Pile. Their newest direction is the Common Pile v0.1 (mid-2025), built entirely from openly licensed and public-domain text, after consulting legal experts on what counts as a sufficiently open license.
Roadmap: growing the pool of openly licensed data and training models on it, explicitly to sidestep the copyright problem. They argue that the common idea that unlicensed text drives performance is unjustified.
BigCode (Hugging Face + ServiceNow)
Code Models
Models: StarCoder 2 — code model with training code and the underlying dataset (The Stack v2) released publicly. A rare fully-open entry in the code-generation space.
LLM360
Community-Owned AGI
Models: Amber / Crystal / K2. A research lab built around the phrase "community-owned AGI." K2 is fully transparent — they open-source all artifacts including code, data, model checkpoints, and intermediate results — developed with MBZUAI and Petuum.
Roadmap: mission-driven rather than tightly scheduled, focused on standards and tooling for fully reproducible large-model research.
Smaller / Newer Fully-Open Entrants
- Stanford's Marin
- Apertus (70B) — from ETH Zürich / EPFL, Switzerland
- AMD's Instella
- Zyphra — Zamba models + the Zyda dataset
- BLOOM, T5 — older but fully open
2
The Partial-Transparency Tier
Open weights with detailed technical reports, but no training data release. These sit between the fully-open builders and the fully-closed labs.
DeepSeek
Open Weights + Reports
Models: DeepSeek-V3, DeepSeek-R1. Open weights with detailed technical reports on architecture and training methodology, but no dataset release. R1's reinforcement-learning pipeline is documented enough to be reproducible in principle.
Mistral AI
Open Weights + Papers
Models: Mixtral, Mistral 7B. Open weights (some Apache 2.0) with research papers describing recipes, but training data is not released. A member of the NVIDIA Nemotron Coalition.
Cohere For AI
Open Weights + Data Cards
Models: Aya family. Open weights with multilingual data cards documenting composition, but not the raw dataset itself.
3
The Frontier-Scale Open Outlier
NVIDIA — Nemotron
Frontier-Scale Open
Builds frontier-scale open models and releases the data — distinct from "frontier-class" capability. Nemotron 3 (Dec 2025) shipped with training datasets, recipes, and ~10T tokens of open data. The Nemotron Coalition (GTC, March 2026) pools data and compute with Mistral, Perplexity, Cursor, LangChain, and others.
Roadmap: a coalition-built base model, co-developed with Mistral, that will underpin the upcoming Nemotron 4 family and is intended to be open-sourced on completion.
An honest trend note: the big closed labs are moving in the opposite direction from this group. OpenAI, Anthropic, and Google DeepMind used to disclose substantial detail about their pretraining data mixtures before 2022 but have since stopped, with researchers specifically citing lawsuits as the reason. The realistic forward picture is a widening split: the closed frontier gets less transparent on data, while this open-data cohort carries the full-transparency torch, increasingly using openly licensed data specifically to stay legally durable.
Caveat on roadmaps: Only Ai2 (Olmo-MoE 2026) and NVIDIA (Nemotron 4) have given fairly concrete public timelines. The others have stated direction (more data, more reproducibility) without firm dates — so their forward plans should not be read as scheduled commitments.
4
Nuances Worth Separating
- Open weights ≠ open data. Google's Gemma and Meta's Llama come from frontier-scale labs, but they release the weights and withhold the dataset. Genuine full-data release from a frontier-scale builder is essentially limited to NVIDIA plus the dedicated research organizations.
- Indirect contribution via distillation. Frontier models' outputs do flow into open models as synthetic training data — many open models are partly trained on GPT- or Claude-generated text. But that's usually done by the open-model builders scraping or generating from APIs, often against terms of service, rather than the closed lab deliberately contributing. It's influence, not donation.
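The distinction between open weights and open data can be made concrete with a small classifier. This is a minimal sketch, assuming a hypothetical `Release` schema and tier names invented here for illustration; it treats any open-weights release that falls short of the full stack (weights, training data, training code, checkpoints) as partial transparency, which simplifies the tiers used in this overview. The example releases are placeholders, not audits of real models.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Tier names are illustrative labels, not an official taxonomy.
    FULLY_OPEN = "fully open"
    PARTIAL = "partial transparency"
    CLOSED = "closed"


@dataclass(frozen=True)
class Release:
    """Which artifacts a model release makes public (hypothetical schema)."""
    name: str
    weights: bool = False
    training_data: bool = False
    training_code: bool = False
    checkpoints: bool = False


def classify(release: Release) -> Tier:
    """Map a release onto the tiers described in this overview."""
    if not release.weights:
        return Tier.CLOSED
    full_stack = (
        release.training_data
        and release.training_code
        and release.checkpoints
    )
    return Tier.FULLY_OPEN if full_stack else Tier.PARTIAL


# Placeholder releases; the flags are made up for the example.
full = Release("example-full", weights=True, training_data=True,
               training_code=True, checkpoints=True)
weights_only = Release("example-weights-only", weights=True)
api_only = Release("example-api-only")

for r in (full, weights_only, api_only):
    print(f"{r.name}: {classify(r).value}")
```

The point of the sketch is that `weights=True` alone never reaches the top tier: the training data flag has to be set as well.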
5
Concrete Examples of Data Reuse
- OLMo's Dolma corpus was explicitly built to be reusable: Ai2 said outright that they expected it to be useful for training other language models. Zyphra's Zyda dataset was built partly on Dolma's filtering and deduplication approach. SmolLM itself reuses the books subset of Dolma for its long-context training mix.
- SmolLM's data lineage (FineWeb-Edu, DCLM, SmolTalk) flows in the other direction and gets reused widely. SmolLM2 was trained on Python-Edu from The Stack, FineWeb-Edu, DCLM, and curated datasets — all open components that show up across many small-model projects.
- The tooling travels too: OLMo's evaluation harness (Catwalk/OLMES) and its fully-published training code are used by researchers studying training dynamics, precisely because everything is reproducible.
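The reuse relationships listed above can be written down as a small lineage graph. This is only a sketch: the edge list contains nothing beyond the examples in this section, and the names `REUSE_EDGES`, `consumers_by_dataset`, and `datasets_used_by` are hypothetical helpers rather than part of any project's tooling.

```python
from collections import defaultdict

# (source dataset, downstream project) pairs taken from the examples above.
REUSE_EDGES = [
    ("Dolma", "Zyda"),           # reuse of filtering/deduplication approach
    ("Dolma", "SmolLM"),         # books subset, long-context mix
    ("The Stack", "SmolLM2"),    # via the Python-Edu subset
    ("FineWeb-Edu", "SmolLM2"),
    ("DCLM", "SmolLM2"),
]


def consumers_by_dataset(edges):
    """Group downstream projects under each source dataset."""
    index = defaultdict(set)
    for dataset, consumer in edges:
        index[dataset].add(consumer)
    return {dataset: sorted(users) for dataset, users in index.items()}


def datasets_used_by(edges, consumer):
    """List the source datasets a given project draws on."""
    return sorted(dataset for dataset, user in edges if user == consumer)


print(consumers_by_dataset(REUSE_EDGES)["Dolma"])
print(datasets_used_by(REUSE_EDGES, "SmolLM2"))
```

Keeping the lineage as plain edges makes it easy to ask the question in either direction: which projects depend on a corpus, and which corpora a project depends on.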
6
The Bar: OSAID 1.0
The Open Source Initiative published its Open Source AI Definition (OSAID 1.0) in October 2024. It requires sufficiently detailed information about the data used for training, the complete source code used to train and run the system, and the weights. Note that it does not mandate releasing the training dataset itself. Even by that bar, almost none of the popular "open" models qualify.
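The three components named above can be read as a checklist. The sketch below is an informal illustration, not an official OSI compliance tool: the keys `data_information`, `code`, and `weights` are labels chosen here, and the real definition also attaches licensing and level-of-detail conditions that a boolean cannot capture.

```python
# Components the definition asks for, as summarized in the text above.
# Key names are hypothetical labels chosen for this sketch.
OSAID_REQUIRED = ("data_information", "code", "weights")


def missing_for_osaid(artifacts: dict[str, bool]) -> list[str]:
    """Return the required components that a release does not provide."""
    return [key for key in OSAID_REQUIRED if not artifacts.get(key, False)]


def meets_bar(artifacts: dict[str, bool]) -> bool:
    """True when no required component is missing."""
    return not missing_for_osaid(artifacts)


# Placeholder releases; the flags are made up for the example.
weights_only = {"weights": True}
documented = {
    "data_information": True,
    "code": True,
    "weights": True,
    "training_dataset": False,  # releasing the dataset itself is not required
}

print(missing_for_osaid(weights_only))
print(meets_bar(documented))
```

Note that `training_dataset` is deliberately absent from `OSAID_REQUIRED`: a release can pass this checklist while still withholding the raw data, which is exactly the gap between the definition and the fully-open builders in section 1.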