ReallyGood · Benchmark Datasets

Frontier tasks that actually separate the best models.

Three flagship families — extending Terminal-Bench, SWE-Bench and Toolathlon — of hand-calibrated, fully verifiable agentic tasks. Every task ships a deterministic oracle and is tuned to produce real score spread across frontier models, not a saturated 100%.

1,057 verifiable tasks
3 benchmark families
8 algorithmic domains
14+ tool platforms
7 frontier models scored
01
Family · Algorithmic Reasoning

Terminal-Bench Extension

CTF-grade algorithmic tasks solved inside a terminal sandbox — cryptanalysis, compiler internals, automata, coding theory and formal proof. Each task hands the agent a precise specification and a hidden checker; the answer must be computed, never guessed.

571 tasks, 8 domains, deterministic oracle per task, Python · terminal sandbox
Domain distribution
Cryptanalysis: 164
Compiler & PL: 153
Data / Query: 139
Symbolic Compute: 41
Constraint Inversion: 34
Reverse Engineering: 26
Coding Theory: 8
Formal Proof: 6
Selected samples
Approachable

Departments with the most students — ties included

Terminal-Bench · Data / Query · SPARQL over RDF

Over a knowledge graph of departments, courses and enrollments, return the name(s) of the department with the maximum number of distinct enrolled students — and every department tied for the top.

Why it discriminates: A single ORDER BY … LIMIT 1 silently drops ties; the checker builds graphs where several departments share the maximum, so only a tie-correct query passes.
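
The tie-handling requirement can be illustrated with a minimal Python sketch. This is not the task's SPARQL solution or its actual checker; the `(department, student_id)` input format and the name `top_departments` are assumptions made for illustration. The point is that the maximum is computed first and every department matching it is returned, instead of taking the first row of a sorted list.

```python
from collections import defaultdict

def top_departments(enrollments):
    """Return every department tied for the most distinct students.

    `enrollments` is an iterable of (department, student_id) pairs
    (a hypothetical flattening of the knowledge graph).
    """
    students = defaultdict(set)
    for dept, student_id in enrollments:
        students[dept].add(student_id)   # sets give DISTINCT semantics
    if not students:
        return []
    best = max(len(ids) for ids in students.values())
    # Keep all departments at the maximum, not just the first one found.
    return sorted(d for d, ids in students.items() if len(ids) == best)
```

A SPARQL query with the same behaviour would typically compute the maximum count in a subquery and filter on equality, instead of relying on ORDER BY with LIMIT 1.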
Representative

RSA — Wiener's attack

Terminal-Bench · Cryptanalysis

An RSA key was generated with a dangerously small private exponent, d < (1/3)·n^(1/4). Given only the public key (e, n), recover d from the continued-fraction convergents of e/n.

Why it discriminates: Brute force is infeasible; the agent must recall that k/d appears among the continued-fraction convergents of e/n and test each candidate convergent correctly.
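
A minimal sketch of the attack in Python, assuming the textbook formulation and not the task's reference solution; the function names `convergents` and `wiener_attack` are illustrative. Each convergent k/d of e/n gives a candidate φ(n) = (e·d − 1)/k, and the candidate is accepted only if it yields integer factors of n.

```python
from math import isqrt

def convergents(num, den):
    """Yield successive continued-fraction convergents (h, k) of num/den."""
    h_prev, h = 0, 1
    k_prev, k = 1, 0
    while den:
        a, rem = divmod(num, den)
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
        yield h, k
        num, den = den, rem

def wiener_attack(e, n):
    """Return the private exponent d, or None if no convergent passes the test."""
    for k, d in convergents(e, n):
        if k == 0 or (e * d - 1) % k:
            continue                     # e*d - 1 must be a multiple of k
        phi = (e * d - 1) // k           # candidate for (p - 1)(q - 1)
        s = n - phi + 1                  # candidate for p + q
        disc = s * s - 4 * n             # equals (p - q)^2 if the candidate is right
        if disc < 0:
            continue
        root = isqrt(disc)
        # p and q are integers only if disc is a perfect square
        # and s + root is even.
        if root * root == disc and (s + root) % 2 == 0:
            return d
    return None
```

With the small textbook key n = 90581 = 239 · 379 and e = 17993, the second convergent 1/5 passes the test and the function returns d = 5.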
Hard

SAT solver: DPLL + CDCL + watched literals

Terminal-Bench · Symbolic / Constraint

Build a CNF satisfiability solver using DPLL with unit propagation, conflict-driven clause learning, and the two-watched-literals data structure. Return SAT/UNSAT and, when SAT, a model.

Why it discriminates: Correct clause learning and watched-literal invariants are subtle; the grader runs instances where naive backtracking times out.
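
For orientation, here is a sketch of only the DPLL core with unit propagation, in Python. It deliberately omits clause learning and watched literals, so it is the kind of naive backtracking the grader is described as defeating, not a passing solution. The clause encoding (lists of non-zero integers, negative meaning negated) and the function names are assumptions.

```python
def unit_propagate(clauses, assignment):
    """Simplify clauses under assignment; return None on conflict.

    Mutates `assignment` by adding every forced (unit) literal.
    """
    changed = True
    while changed:
        changed = False
        remaining = []
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                          # clause already satisfied
            rest = [l for l in clause if abs(l) not in assignment]
            if not rest:
                return None                       # every literal is false
            if len(rest) == 1:
                assignment[abs(rest[0])] = rest[0] > 0
                changed = True                    # new fact: rescan
            else:
                remaining.append(rest)
        clauses = remaining
    return clauses

def dpll(clauses, assignment=None):
    """Return a satisfying (possibly partial) assignment, or None if UNSAT."""
    assignment = dict(assignment or {})
    clauses = unit_propagate(clauses, assignment)
    if clauses is None:
        return None
    if not clauses:
        return assignment
    var = abs(clauses[0][0])                      # naive branching choice
    for value in (True, False):
        model = dpll(clauses, {**assignment, var: value})
        if model is not None:
            return model
    return None
```

A CDCL solver replaces the chronological backtracking in `dpll` with conflict analysis and learned clauses, and replaces the full clause rescan in `unit_propagate` with two watched literals per clause.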
02
Family · Real-World Software Engineering

SWE-Bench Extension

Pull-request tasks mined from active open-source repositories. The agent works against a real codebase at its real "before" commit and must reproduce the behaviour of the merged PR, graded by the project's own hidden test suite, exactly as SWE-Bench intends.

289 tasks, 16+ repositories, Go · Python · TypeScript · Rust, real before→after commits
Task type distribution
Feature Enhancement: 104
Bug Fix: 54
Code Generation: 19
Refactoring: 12
Configuration: 10
Optimization: 6
Source repositories — a sample
litestar-org/litestar
typeorm/typeorm
tortoise/tortoise-orm
SeaQL/sea-query
kite-org/kite
iota-uz/iota-sdk
marcopiovanello/yt-dlp-web-ui
almeidapaulopt/tsdproxy
bestruirui/octopus
kiwifs/kiwifs
1backend/1backend
Selected samples
Approachable

Playlist download modifiers

marcopiovanello/yt-dlp-web-ui · Go · PR #262

A self-hosted yt-dlp web UI enqueues every entry of a playlist wholesale. Add per-playlist modifiers so a user can download only a chosen slice of a playlist.

Why it discriminates: The option must thread cleanly through the playlist-detect → enqueue path without breaking the existing single-video flow.
Representative

Default transaction isolation level

typeorm/typeorm · TypeScript · PR #12269

Today an isolation level must be passed per call. Add a data-source-level default so every transaction on a connection uses it — without threading the argument through each call site.

Why it discriminates: Touches the core transaction machinery and must preserve per-call overrides; the hidden tests assert both default and override paths.
Hard

Add SQL UNION to the QuerySet API

tortoise/tortoise-orm · Python · PR #2146

An async Python ORM can filter, order and aggregate a single model, but can't combine queries. Add first-class .union() support that stitches multiple querysets together and returns hydrated model instances.

Why it discriminates: Touches query compilation and result hydration; the panel split hard — some frontier models scored 0 while others reached 100.
03
Family · Tool-Using Agents

Toolathlon Extension

End-to-end tasks that evaluate tool-using agents across realistic multi-tool environments — Feishu, DingTalk, e-commerce and exchange-rate APIs, spreadsheet and document toolchains. Each task ships a natural-language instruction, an executable verification harness, and full trajectories capturing tool calls, observations and final answers.

197 tasks, 14+ tool platforms, NL instruction + executable harness, multi-turn tool trajectories
Category distribution
Business Data Workflows: 44
Office & Collaboration: 43
API Orchestration: 38
Document & Slides: 31
Multimedia Generation: 10
Multi-Platform SaaS: 9
In-house Toolkits: 9
Social Platforms: 5
Selected samples
Approachable

Meeting-room booking via REST API

Toolathlon · Office Ops · mock REST API

Acting as an office-admin operator, fulfil four booking requests against a meeting-room REST API — honouring room-assignment rules, training-room flows, blacklists and per-user frequency limits.

Why it discriminates: The agent must read the API docs and the business rules, then make and verify several stateful calls in the right order; skipping a single rule check silently corrupts the booking state.
Representative

Logistics carrier performance report

Toolathlon · Business Data · spreadsheet + tool I/O

Using the provided read/write tools over a shipments spreadsheet, compute per-carrier metrics — shipment count, average delivery days (delivered only), on-time rate and return rate — then surface the most reliable and the worst carrier.

Why it discriminates: Several interacting filters (delivered-only, promised-vs-actual days) across tool round-trips; weaker agents miscount the conditional aggregates.
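
The conditional aggregates can be sketched in Python as follows. The field names (`carrier`, `status`, `promised_days`, `actual_days`) and the exact metric definitions are assumptions, since the task's spreadsheet schema is not shown here: on-time rate is taken over delivered shipments only, and return rate over all shipments.

```python
from collections import defaultdict

def carrier_metrics(shipments):
    """Compute per-carrier metrics from a list of shipment dicts.

    Assumed (hypothetical) fields: carrier, status, promised_days, actual_days.
    """
    groups = defaultdict(list)
    for row in shipments:
        groups[row["carrier"]].append(row)

    report = {}
    for carrier, rows in groups.items():
        # Delivery-time metrics use delivered shipments only.
        delivered = [r for r in rows if r["status"] == "delivered"]
        on_time = [r for r in delivered
                   if r["actual_days"] <= r["promised_days"]]
        returned = [r for r in rows if r["status"] == "returned"]
        report[carrier] = {
            "shipments": len(rows),
            "avg_delivery_days": (sum(r["actual_days"] for r in delivered)
                                  / len(delivered)) if delivered else None,
            "on_time_rate": len(on_time) / len(delivered) if delivered else None,
            "return_rate": len(returned) / len(rows),
        }
    return report
```

The typical miscount is dividing by the wrong denominator, for example averaging delivery days over all shipments when only delivered ones have a delivery time.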
Hard

Cross-border restock alert across four tools

Toolathlon · Multi-Platform SaaS · 4 live tools

As an e-commerce operator, run a weekly stock-out alert: combine an overseas-store API (prices & sales), a live FX rate API, and two Feishu tables (domestic inventory & suppliers), then decide restocking within a ¥500k budget — and push the plan to DingTalk.

Why it discriminates: A genuine multi-tool orchestration — four heterogeneous APIs, a budget constraint and a priority×margin ranking; the panel split from 50 to 100 as weaker models dropped tool steps.
Reference

How every task is built

A task only ships once it clears the same four gates, so the dataset stays clean and the numbers stay honest.

1 · Verifiable oracle

A deterministic checker decides pass/fail. No LLM-as-judge, no fuzzy matching — the reference solution scores 100% and a known-broken baseline scores low.

2 · Calibrated difficulty

Each task is run across five frontier models. Tasks that everyone aces (saturated) are retired; we keep the ones that produce real score spread.
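
One way such a calibration gate could look, as a hedged Python sketch. The page does not state how "saturated" or "real score spread" are measured, so the `ceiling` and `min_spread` thresholds and the function name `keep_task` are purely illustrative assumptions.

```python
def keep_task(scores, ceiling=100.0, min_spread=25.0):
    """Decide whether a task survives calibration.

    `scores` holds one score per panel model on a 0-100 scale.
    `ceiling` and `min_spread` are hypothetical thresholds.
    """
    if not scores:
        raise ValueError("need at least one model score")
    if all(s >= ceiling for s in scores):
        return False                      # saturated: every model aces it
    # Keep only tasks whose best and worst scores are far enough apart.
    return max(scores) - min(scores) >= min_spread
```

For example, a panel result of `[0, 100, 50, 100, 100]` would be kept under these assumed thresholds, while five scores of 100 would be retired.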

3 · Real provenance

SWE-Bench tasks anchor to an actual merged PR and its hidden test suite; Terminal-Bench tasks derive from peer-reviewed algorithms with generated instances.

4 · Anti-memorization

Instances are freshly generated and the public contract is hidden from the prompt, so a model can't pattern-match its way to the answer.