About the Builders — Open-Source AI Model Builders

The Fully-Open Model Builders

Organizations that release weights + training data + code + checkpoints — the full stack, reproducible from scratch.

Allen Institute for AI (Ai2) · Gold Standard

Seattle, Washington · Non-profit, founded 2014 by Paul Allen

A Seattle-based non-profit AI research institute. Mission: "Building breakthrough AI to solve the world's biggest problems." Core values: openness, science, impact, collaboration. Develops foundational AI research across large-scale open models, data, robotics, and conservation. OLMo is its fully open language model and complete model flow; the latest release is Olmo 3 / 3.1 32B, with all training and fine-tuning datasets released.

Hugging Face · Data Stewards

New York, NY · For-profit, founded 2016

"The AI community building the future." Hosts the largest open repository of models, datasets, and Spaces, the hub the rest of the open ecosystem builds on. Maintains the SmolLM / SmolVLM family of fully open small models and the FineWeb / FineWeb-Edu corpora that many other open models train on. SmolLM3 (3B) was pretrained on 11.2T tokens, with its data mixture and training configs made public.

EleutherAI · Original Open-Data Lab

Remote · Non-profit, founded July 2020 by Connor Leahy, Sid Black, and Leo Gao

A non-profit AI research lab focused on interpretability and alignment of large models. Grew from a Discord server for talking about GPT‑3 into a leading research institute. Created landmark open-source foundations — GPT‑J, GPT‑NeoX, the Pythia suite, The Pile, and the new openly-licensed Common Pile v0.1. Models downloaded 70M+ times; 130+ publications at top ML venues. Operates primarily through a public Discord server with two dozen full- and part-time research staff.

BigCode (Hugging Face + ServiceNow) · Code Models

Distributed · Open scientific collaboration

An open scientific collaboration working on the responsible development and use of large language models for code (Code LLMs), empowering the ML and open-source communities through open governance. Built StarCoder 2 with training code and the underlying dataset (The Stack v2) released publicly. Projects are developed through open collaboration and released with permissive licenses. Corporate support from ServiceNow and Hugging Face for hosting and compute; all technical governance takes place in community working groups.

LLM360 · Community-Owned AGI

Research collaboration · Affiliated with CMU, USC, MBZUAI

A research lab built around the phrase "community-owned AGI": it develops standards and tools to advance the bleeding edge of LLM capability and to empower knowledge transfer, research, and development. K2 is fully transparent: all artifacts are open-sourced, including code, data, model checkpoints, and intermediate results. Also released TxT360 (a trillion-token pre-training dataset) and the Decentralized Arena evaluation platform. Twitter handle is @llm360.

Smaller / Newer Fully-Open Entrants

Stanford's Marin — open model research from Stanford CRFM.
Apertus (70B) — from ETH Zürich / EPFL, Switzerland.
AMD's Instella — AMD's fully open model family.
Zyphra — Zamba models + the Zyda dataset, built partly on Dolma's filtering pipeline.
BLOOM, T5 — older but fully open landmark models.

EleutherAI groups Ai2, Hugging Face, Zyphra, and LLM360 together as the organizations resisting the industry's decline in transparency.

The Partial-Transparency Tier

Open weights with detailed technical reports, but no training data release. These sit between the fully-open builders and the fully-closed labs.
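The tiering used on this page can be read as a simple rule over which artifacts a release makes public. The sketch below is only an illustration of that rule, not a formal standard: the `Release` schema, the `classify_tier` function, and the tier labels are hypothetical names introduced here, and real releases often need case-by-case judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Which artifacts a model release makes public (hypothetical schema)."""
    weights: bool = False
    training_data: bool = False
    training_code: bool = False
    checkpoints: bool = False
    technical_report: bool = False


def classify_tier(release: Release) -> str:
    """Map a release onto the tiers described on this page (illustrative rule)."""
    # Fully open: weights + training data + code + checkpoints.
    full_stack = (release.weights and release.training_data
                  and release.training_code and release.checkpoints)
    if full_stack:
        return "fully-open"
    # Partial transparency: open weights and a report, but no training data.
    if release.weights and release.technical_report and not release.training_data:
        return "partial-transparency"
    if not release.weights:
        return "closed"
    # Anything else (e.g. weights plus data but no checkpoints) needs judgment.
    return "unclassified"


# Hypothetical example: open weights and a technical report, no dataset.
example = Release(weights=True, technical_report=True)
print(classify_tier(example))
```

The explicit "unclassified" fallback is a deliberate choice in this sketch: releases that fit neither definition cleanly are flagged rather than forced into a tier.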

DeepSeek · Open Weights + Reports

Hangzhou, China · For-profit (High-Flyer / 深度求索)

Chinese AI lab focused on exploring AGI. Open weights with detailed technical reports on architecture and training methodology, but no dataset release. DeepSeek-R1's reinforcement-learning pipeline is documented enough to be reproducible in principle. Latest release: DeepSeek-V4 preview. "探索未至之境" — explore the uncharted.

Mistral AI · Open Weights + Papers

Paris, France · For-profit, founded April 2023

"Putting frontier AI in everyone's hands." A European AI leader combining cutting-edge innovation with openness, transparency, cost efficiency, and responsibility. Builds frontier models, developer tools, applications, and compute. Open weights (some Apache 2.0) with research papers describing recipes, but training data is not released. A member of the NVIDIA Nemotron Coalition. Latest: Mistral OCR 4, the Vibe agent platform.

Cohere For AI (Cohere Labs) · Open Weights + Data Cards

Toronto, Canada + remote · Non-profit research arm of Cohere

"Shaping the frontier of ML research." Cohere's research lab pushing the boundaries of fundamental ML research while changing where, how, and by whom breakthroughs happen. Built the Aya family of multilingual models — open weights with multilingual data cards documenting composition, but not the raw dataset itself.

The Frontier-Scale Open Outlier

NVIDIA: Nemotron · Frontier-Scale Open

Santa Clara, California · For-profit, founded 1993

"NVIDIA pioneered accelerated computing to tackle challenges no one else can solve." Builds frontier-scale open models and releases the data; "frontier-scale" here is distinct from "frontier-class" capability. Nemotron 3 (Dec 2025) shipped with training datasets, recipes, and ~10T tokens of open data. The Nemotron Coalition (GTC, March 2026) pools data and compute with Mistral, Perplexity, Cursor, LangChain, and others. The upcoming Nemotron 4 family is intended to be open-sourced on completion.