AI Models

Ornith 1.0: The Local Coding Model That Writes Its Own Scaffold

DeepReinforce shipped Ornith 1.0 on June 25, 2026, and the real story is not the SWE-bench number, it is where the agent harness lives. Ornith learned to write its own scaffold, so the orchestration is inside the weights instead of bolted on around them. It is MIT-licensed, comes in four sizes, and the two smaller ones run on a single consumer GPU, fully offline, fully private.

Jithin Kumar PalepuJune 25, 202613 min read

On June 25, 2026, DeepReinforce released Ornith 1.0, a family of open, MIT-licensed coding models with one genuinely new idea: the model does not just solve the task, it also writes the agent scaffold it solves the task inside, and it learned both at the same time. That sounds like a detail. It is the whole point, and it is why a 35-billion parameter model you can run on a used graphics card behaves less like a chatbot and more like an agent that already knows how to drive itself.

For the last two years the open coding-model recipe has been fixed: someone trains a strong solver, then a separate team hand-builds the agent harness around it, the retry logic, the planning prompts, the tool-call formatting, the “if the test fails, do this” rules. The scaffold was human engineering; the model was the engine you dropped into it. Ornith 1.0 collapses that split. It treats the scaffold as something to learn, with reinforcement learning, at the same time as the code. That is what “self-scaffolding” means, and it changes what a small, local model is capable of on its own.

Everyone else trains the solver and hand-builds the harness around it. Ornith trained the harness into the weights.

What is “self-scaffolding,” concretely?

Start with the problem it solves. A raw language model is good at producing a chunk of code. An agent is something else: it reads a task, makes a plan, edits files, runs the tests, reads the failure, and tries again, across dozens of tool calls, without losing the thread. The gap between “writes good code” and “recovers from its own mistakes over a long run” is exactly the scaffold. Traditionally you write that scaffold by hand and pray it generalizes.

Ornith trains it instead. During RL, the model runs in two stages on each task. First it reads the task and proposes a refined scaffold, the plan and orchestration strategy it intends to use. Then it runs that scaffold to generate solution rollouts, the actual edits and tool calls that try to solve the problem. The key line from the DeepReinforce writeup is that the reward from the rollout flows back to both stages. If a scaffold leads to solutions that pass, that scaffold gets reinforced. Good orchestration strategies emerge on their own, without anyone hand-engineering a harness.

The two-stage loop, in one breath

Because the reward trains the planner and the coder together, the model does not just learn to write code that passes, it learns to set up the conditions under which it writes code that passes. It learns when to replan, when a step has failed, and how not to repeat the same dead end. Those are recovery behaviors, and they are the things fixed hand-written scaffolds are worst at.

Why it matters — Read task, propose scaffold, run scaffold, score the result, push the reward back into both the scaffold and the solution.

A better solver writes better code. A self-scaffolding model learns the moves around the code: plan, detect failure, replan, don't loop on the same mistake.

How was it trained?

The recipe is where the engineering lives, and it is worth being specific because “we used RL” hides a lot. Three pieces matter.

Token-level GRPO. Ornith optimizes with a token-level GRPO objective. GRPO (Group Relative Policy Optimization) is the critic-free RL method popularized by DeepSeek: instead of training a separate value network, it scores a group of rollouts against each other and pushes the policy toward the above-average ones. Doing it at the token level, rather than scoring a whole trajectory as one blob, gives finer credit assignment across a long agent run, which is exactly the regime Ornith operates in.

Asynchronous pipeline-RL with staleness weighting. Agent rollouts are slow, a single task can be dozens of tool calls and minutes of wall-clock. Waiting for every rollout to finish before the next training step would waste most of the cluster. Ornith runs generation and training asynchronously in a pipeline, so the policy keeps updating while rollouts are still in flight. The catch is that those in-flight tokens were produced by a slightly older policy, they are off-policy, and naively training on them biases the update. Ornith corrects for this with staleness weighting: the more out-of-date a token is, the less it counts. That is the boring, essential plumbing that makes agentic RL affordable at all.

Three layers against reward hacking. The instant you reward a model for making tests pass, you have invited it to make tests pass the wrong way: delete the failing test, hard-code the expected output, patch the grader. Ornith stacks three defenses. An immutable trust boundary marks files and actions the model is not allowed to touch. Deterministic monitors flag banned actions outright. And a frozen LLM judge sits as a veto layer, catching the subtler cheats the rule-based monitors miss. Reward hacking is the single biggest failure mode of coding RL, and treating it as a first-class part of the recipe, rather than a bug you patch later, is a sign the team knew where the bodies were buried.

The model family: four sizes, two lineages

Ornith 1.0 is not one model, it is a spread, so you can put the same behavior on a laptop or on an 8-GPU node. All of it is MIT licensed, with no regional restrictions, built on top of Gemma 4 and Qwen 3.5 base models, and all of it exposes a 256K-token context window behind an OpenAI-compatible API.

9B dense, the edge and single-GPU size. About 19 GB in bf16, ~6 GB at Q4. This is the one you run to triage a failing test suite on your own machine.
31B dense, the middle dense checkpoint for when you want more accuracy than the 9B and have the VRAM.
35B MoE, a mixture-of-experts model that activates only ~3B parameters per token. The DeepReinforce team calls this the sweet spot, and the reason is counter-intuitive (more on that below).
397B MoE, the flagship, the one that goes toe-to-toe with closed frontier models and needs a real multi-GPU node to serve.

Is it actually good? The benchmark table

Yes, and specifically it is good in a way that scales cleanly with size, which is what you want to see from an honest release. The headline: Ornith-397B scores 82.4 on SWE-bench Verified, trailing only Claude Opus 4.8 at 87.6 and beating a lot of the closed field. But the more interesting rows are the smaller models, because those are the ones you can actually run yourself.

Benchmark	Ornith 9B	Ornith 35B	Ornith 397B
SWE-bench Verified	69.4	75.6	82.4
Terminal-Bench 2.1 (Terminus-2)	43.1	64.2	77.5
SWE-bench Pro	42.9	50.4	62.2
NL2Repo	27.2	34.6	48.2

DeepReinforce-reported scores, June 2026. Source: github.com/deepreinforce-ai/Ornith-1.

Put the 397B next to the frontier and the gap is one point of SWE-bench Verified to Opus 4.8. On Terminal-Bench 2.1 it posts 77.5, ahead of Opus 4.7 (70.3) and behind Opus 4.8 (85). That is a genuinely strong open model. But look at the 35B: 75.6 on SWE-bench Verified from a model that fits on a single 24 GB card. Two years ago that was frontier-closed territory. Now it is a file you download.

Which GPU runs which size?

This is the part that decides whether “local” is a slogan or a fact for you, so here are the real numbers, not vibes. The two smaller models are the ones that matter for a personal machine.

Model / quant	On-disk size	Fits on
9B · Q4	~6 GB	8 GB card (RTX 4060 Ti)
9B · bf16	~19 GB	24 GB card / one 80 GB GPU
35B MoE · Q4_K_M	21.2 GB	24 GB card (RTX 3090 / 4090)
35B MoE · Q5_K_M	24.7 GB	32 GB card (RTX 5090)
397B MoE	FP8 / bf16	multi-GPU node, tensor-parallel

GGUF sizes per the community GPU guide. Source: runaihome.com.

The short version: a used RTX 3090 (around $1,000) runs the 35B MoE at Q4 with room to spare, and an 8 GB card runs the 9B. You do not need a data center. You need a gaming PC.

Why is the 35B faster than the 9B?

Because generation speed tracks active parameters, not total parameters, and the 35B is a mixture-of-experts model that only lights up ~3B parameters per token. The 9B dense model runs all 9 billion of its weights on every token. So the 35B, despite being nearly four times bigger on disk, does less arithmetic per token and generates faster, while being meaningfully more accurate. The only price you pay is VRAM: you have to hold all 35B of weights in memory even though you only use a slice of them at a time.

The MoE trade, stated plainly

This is why the DeepReinforce team calls the 35B the sweet spot. If you have a 24 GB card, you get a model that is both faster and smarter than the 9B. The one real constraint on that hardware is context: to fit the weights and a usable KV cache on 24 GB, you run the 35B at roughly 8K to 16K context in practice, well short of its 256K ceiling. The full window needs the cloud or a multi-GPU box.

Why it matters — More VRAM to hold the weights, fewer FLOPs to run them. You buy accuracy and speed with memory.

How do you actually run it?

Ornith speaks an OpenAI-compatible API through vLLM, SGLang, and Transformers, plus GGUF builds for llama.cpp, Ollama, LM Studio, and Jan. That means it drops into an existing agent framework as a base-URL swap, no code changes. For the 9B on vLLM:

# Serve an OpenAI-compatible endpoint on :8000
vllm serve deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name Ornith-1.0 \
  --max-model-len 262144 \
  --enable-auto-tool-choice

# ...then point any agent at http://localhost:8000/v1

Prefer the laptop-friendly path? Pull a GGUF straight into Ollama and start chatting, no 80 GB GPU required:

# 9B or 35B MoE, quantized, on consumer hardware
ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF:Q4_K_M

Note the version floors if you go the Python route: Ornith needs transformers >= 5.8.1, and the 35B on vLLM wants vllm >= 0.19.1. Point a serious multi-GPU serve at the 35B or 397B with --tensor-parallel-size set to your GPU count.

Why this changes local computation

Here is the shift, and it is bigger than one model. For most of the agentic-coding era, the deal was simple: the smart agent lived behind someone else's API, and your code went to their servers to get help. Local models existed, but they were the compromise you accepted when privacy mattered more than capability. Ornith narrows that gap on the axis that actually counts. A self-scaffolding model carries its orchestration in its weights, so a 35B you run offline behaves like an agent, not just a code-completer, without you hand-building the harness it needs.

The frontier used to be a place you rented. Ornith is a step toward the frontier being a file you own, on hardware you control, with your code never leaving the building.

Three consequences fall out of that. Privacy and compliance: a regulated codebase, the kind you are contractually forbidden from sending to a closed API, can now have a capable agent run against it entirely on-premises. Cost: the marginal token is free. Once the weights are on your disk, running an agent all day costs electricity, not per-token billing, which changes the economics of the long, loopy, autonomous runs we wrote about in loop engineering. Sovereignty: MIT license, no regional restrictions, weights you can pin forever. Nobody can deprecate your model, change its behavior overnight, or price you out.

Where it sits in the 2026 landscape

Ornith is the clearest sign yet that the open coding race has stopped being a pure parameter-count contest and become a workflow contest. The differentiator is no longer raw solver quality, it is reliability across a long run: holding context through dozens of tool calls, noticing when a step failed, and replanning without a human. That is the same thesis behind orchestration-first systems like Sakana's Fugu, which commands a pool of other models rather than being one big brain. Ornith reaches the same place from the opposite direction: instead of orchestrating across models, it learned to orchestrate itself, and then shrank small enough to run on your desk.

If you care about the other half of the local-inference story, how to make these models generate faster on the hardware you have, it pairs naturally with our piece on speculative decoding and DeepSeek DSpark. And if you want the closed-frontier reference point that Ornith-397B is chasing, that is Claude Opus 4.8.