Open-Source AI Model Builders — Full Transparency Report

1. The Fully-Open Model Builders

These organizations release weights + training data + code + checkpoints — the full stack, reproducible from scratch.

Allen Institute for AI (Ai2): Gold Standard

Models: OLMo / Molmo / Tülu. Full training code, every intermediate checkpoint, training configs, data provenance, and evaluation pipelines are public, all under Apache 2.0. Sizes range up to 32B. The latest releases are Olmo 3 (Nov 2025) and Olmo 3.1 32B; with them, Ai2 made every training and fine-tuning dataset (up to 6 trillion tokens) available for download without license restrictions, alongside OlmoTrace for tracing outputs back to training data.

Ai2 has confirmed a sparse Olmo-MoE is on the roadmap for 2026, with more model flows, toolkits, and reasoning-focused releases coming, especially around the 32B scale they call a "sweet spot."

Hugging Face: Data Stewards

Models: SmolLM / SmolVLM. SmolLM3 (3B) is a fully open model — open weights plus full training details including the public data mixture and training configs — pretrained on 11.2T tokens. They also maintain the FineWeb / FineWeb-Edu corpora that many other open models train on.

Hugging Face's stated direction is continued iteration on the SmolLM/SmolVLM family and the FineWeb data releases; intermediate checkpoints and post-training data are released progressively.

EleutherAI: Original Open-Data Lab

Models: Pythia. Datasets: The Pile, Common Pile. Their newest direction is the Common Pile v0.1 (mid-2025), built entirely from openly licensed and public-domain text, after consulting legal experts on what counts as a sufficiently open license.

EleutherAI's stated direction is to grow the pool of openly licensed data and train models on it, explicitly to sidestep the copyright problem. They argue that the common idea that unlicensed text drives performance is unjustified.

BigCode (Hugging Face + ServiceNow): Code Models

Models: StarCoder 2, a code model whose training code and underlying dataset (The Stack v2) are released publicly. A rare fully-open entry in the code-generation space.

LLM360: Community-Owned AGI

Models: Amber / Crystal / K2. A research lab built around the phrase "community-owned AGI." K2, developed with MBZUAI and Petuum, is fully transparent: all artifacts are open-sourced, including code, data, model checkpoints, and intermediate results.

LLM360's direction is mission-driven rather than tightly scheduled: standards and tooling for fully reproducible large-model research.

Smaller / Newer Fully-Open Entrants

Stanford's Marin
Apertus (70B) — from ETH Zürich / EPFL, Switzerland
AMD's Instella
Zyphra — Zamba models + the Zyda dataset
BLOOM, T5 — older but fully open

EleutherAI groups Ai2, Hugging Face, Zyphra, and LLM360 together as the organizations defying the industry's transparency decline.

2. The Partial-Transparency Tier

Open weights with detailed technical reports, but no training data release. These sit between the fully-open builders and the fully-closed labs.

DeepSeek: Open Weights + Reports

Models: DeepSeek-V3, DeepSeek-R1. Open weights with detailed technical reports on architecture and training methodology, but no dataset release. R1's reinforcement-learning pipeline is documented enough to be reproducible in principle.

Mistral AI: Open Weights + Papers

Models: Mixtral, Mistral 7B. Open weights (some Apache 2.0) with research papers describing recipes, but training data is not released. A member of the NVIDIA Nemotron Coalition.

Cohere For AI: Open Weights + Data Cards

Models: Aya family. Open weights with multilingual data cards documenting composition, but not the raw dataset itself.

3. The Frontier-Scale Open Outlier

NVIDIA Nemotron: Frontier-Scale Open

NVIDIA builds frontier-scale open models and releases the data; "frontier-scale" here is distinct from "frontier-class" capability. Nemotron 3 (Dec 2025) shipped with training datasets, recipes, and ~10T tokens of open data. The Nemotron Coalition (GTC, March 2026) pools data and compute with Mistral, Perplexity, Cursor, LangChain, and others.

Next up is a coalition-built base model, co-developed with Mistral, that will underpin the upcoming Nemotron 4 family and is intended to be open-sourced on completion.

Trend note: the big closed labs are moving in the opposite direction from this group. OpenAI, Anthropic, and Google DeepMind used to disclose substantial detail about their pretraining data mixtures before 2022 but have since stopped, with researchers specifically citing lawsuits as the reason. The realistic forward picture is a widening split: the closed frontier gets less transparent on data, while this open-data cohort carries the full-transparency torch, increasingly using openly licensed data specifically to stay legally durable.

Caveat on roadmaps: Only Ai2 (Olmo-MoE 2026) and NVIDIA (Nemotron 4) have given fairly concrete public timelines. The others have stated a direction (more data, more reproducibility) without firm dates, so their forward plans should not be read as scheduled commitments.

4. Nuances Worth Separating

Open weights ≠ open data. Google's Gemma and Meta's Llama come from frontier-scale labs, but they release the weights and withhold the dataset. Genuine full-data release from a frontier-scale builder is essentially limited to NVIDIA plus the dedicated research orgs.
Indirect contribution via distillation. Frontier models' outputs do flow into open models as synthetic training data — many open models are partly trained on GPT- or Claude-generated text. But that's usually done by the open-model builders scraping or generating from APIs, often against terms of service, rather than the closed lab deliberately contributing. It's influence, not donation.
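
The tiers used in this report (fully open, partial transparency, weights only) can be sketched as a small checklist over the artifacts a builder publishes. This is a minimal illustration, not an official taxonomy: the `Release` schema, the `Tier` names, the `classify` function, and the example entries are all hypothetical and are not taken from any real tool or from the builders themselves.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Tier labels mirror this report's sections; the enum itself is hypothetical.
    FULLY_OPEN = "fully open"
    PARTIAL = "partial transparency"
    WEIGHTS_ONLY = "open weights only"
    CLOSED = "closed"


@dataclass(frozen=True)
class Release:
    """Artifacts a builder publishes (hypothetical schema for illustration)."""
    name: str
    weights: bool = False
    training_data: bool = False
    training_code: bool = False
    checkpoints: bool = False
    technical_report: bool = False


def classify(release: Release) -> Tier:
    """Map a release to a tier using the criteria stated in this report."""
    if not release.weights:
        return Tier.CLOSED
    # Fully open: weights + training data + code + checkpoints.
    if release.training_data and release.training_code and release.checkpoints:
        return Tier.FULLY_OPEN
    # Partial: open weights with a detailed report, but no full stack.
    if release.technical_report:
        return Tier.PARTIAL
    return Tier.WEIGHTS_ONLY


# Placeholder entries, not claims about any specific model.
full_stack = Release("example-full-stack", weights=True, training_data=True,
                     training_code=True, checkpoints=True, technical_report=True)
weights_plus_report = Release("example-weights-plus-report",
                              weights=True, technical_report=True)

print(classify(full_stack).value)           # fully open
print(classify(weights_plus_report).value)  # partial transparency
```

The point of the sketch is that open weights are a precondition for every open tier but never sufficient for the top one: a release only counts as fully open once the data flag is also set.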

5. Concrete Examples of Data Reuse

OLMo's Dolma corpus was explicitly built to be reusable: Ai2 said outright that they expected it to be useful for training other language models. Zyphra's Zyda dataset was built partly on Dolma's filtering and deduplication approach. SmolLM itself reuses the books subset of Dolma for its long-context training mix.
SmolLM's data lineage (FineWeb-Edu, DCLM, SmolTalk) flows in the other direction and gets reused widely. SmolLM2 was trained on Python-Edu from The Stack, FineWeb-Edu, DCLM, and curated datasets — all open components that show up across many small-model projects.
The tooling travels too: OLMo's evaluation harness (Catwalk/OLMES) and its fully-published training code are used by researchers studying training dynamics, precisely because everything is reproducible.

6. The Bar: OSAID 1.0

The Open Source Initiative published its Open Source AI Definition (OSAID 1.0) in October 2024. It requires sufficiently detailed information about the data used for training, the complete source code used to train and run the system, and the weights. Note that it does not mandate releasing the training dataset itself. Even by that bar, almost none of the popular "open" models qualify.