Frontier tasks that actually separate the best models.
Three flagship families — extending Terminal-Bench, SWE-Bench and Toolathlon — of hand-calibrated, fully verifiable agentic tasks. Every task ships a deterministic oracle and is tuned to produce real score spread across frontier models, not a saturated 100%.
Terminal-Bench Extension
CTF-grade algorithmic tasks solved inside a terminal sandbox — cryptanalysis, compiler internals, automata, coding theory and formal proof. Each task hands the agent a precise specification and a hidden checker; the answer must be computed, never guessed.
Departments with the most students — ties included
Over a knowledge graph of departments, courses and enrollments, return the name(s) of the department with the maximum number of distinct enrolled students — and every department tied for the top.
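A tie-aware selection can be sketched in Python as follows. The flat (department, course, student) triples and the function name `top_departments` are assumptions made for illustration; the task's actual graph schema and query language are not shown here.

```python
from collections import defaultdict


def top_departments(enrollments):
    """Return every department tied for the most distinct students, sorted by name.

    `enrollments` is an iterable of (department, course, student) triples.
    This flat shape is a hypothetical stand-in for the real knowledge graph.
    """
    students = defaultdict(set)
    for department, _course, student in enrollments:
        # A set counts a student once even if enrolled in several courses.
        students[department].add(student)
    if not students:
        return []
    best = max(len(s) for s in students.values())
    # Keep every department that reaches the maximum, not just the first one.
    return sorted(d for d, s in students.items() if len(s) == best)


rows = [
    ("Math", "M101", "ana"), ("Math", "M201", "ana"), ("Math", "M101", "bo"),
    ("Physics", "P101", "cy"), ("Physics", "P101", "di"),
    ("History", "H101", "ed"),
]
print(top_departments(rows))  # ['Math', 'Physics']
```

Computing the maximum first and then filtering on it is what keeps the ties; the same two-step shape carries over to a query language.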
ORDER BY … LIMIT 1 silently drops ties; the checker builds graphs where several departments share the maximum, so only a tie-correct query passes.
RSA: Wiener's attack
An RSA key was generated with a dangerously small private exponent d < n^0.25/3. Given only the public key (e, n), recover d via the continued-fraction approximation of e/n.
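The continued-fraction step can be sketched in Python roughly as below. This is a minimal illustration of the textbook attack, not the task's reference solution; the function names and the toy key (n = 90581, e = 17993) are chosen for the example.

```python
from math import isqrt


def convergents(num, den):
    """Yield the convergents (k, d) of the continued fraction of num/den."""
    k_prev, k_cur = 0, 1  # numerator recurrence seeds
    d_prev, d_cur = 1, 0  # denominator recurrence seeds
    while den:
        q, r = divmod(num, den)
        k_prev, k_cur = k_cur, q * k_cur + k_prev
        d_prev, d_cur = d_cur, q * d_cur + d_prev
        yield k_cur, d_cur
        num, den = den, r


def wiener_attack(e, n):
    """Return the private exponent d if some convergent of e/n reveals it, else None."""
    for k, d in convergents(e, n):
        # e*d = 1 + k*phi(n), so k must divide e*d - 1.
        if k == 0 or (e * d - 1) % k:
            continue
        phi = (e * d - 1) // k
        s = n - phi + 1       # equals p + q when phi is correct
        disc = s * s - 4 * n  # equals (p - q)^2 when phi is correct
        if disc < 0:
            continue
        t = isqrt(disc)
        if t * t == disc and (s + t) % 2 == 0:
            p, q = (s + t) // 2, (s - t) // 2
            if p * q == n:  # the candidate factors n, so d is confirmed
                return d
    return None


print(wiener_attack(17993, 90581))  # 5
```

Each candidate d is confirmed by factoring n from the implied value of phi(n), so a coincidental convergent is not accepted as the key.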
SAT solver: DPLL + CDCL + watched literals
Build a CNF satisfiability solver using DPLL with unit propagation, conflict-driven clause learning, and the two-watched-literals data structure. Return SAT/UNSAT and, when SAT, a model.
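As a starting point, the DPLL core with unit propagation might look like the Python sketch below. It deliberately omits clause learning and the two-watched-literals structure, so it is a plain recursive DPLL and not the full solver the task asks for. Clauses are assumed to be lists of non-zero integers in the DIMACS convention, and all function names are illustrative.

```python
def unit_propagate(clauses, assignment):
    """Simplify clauses under `assignment`, assigning forced unit literals in place.

    Returns the remaining clauses, or None if some clause is falsified.
    """
    changed = True
    while changed:
        changed = False
        remaining = []
        for clause in clauses:
            if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
                continue  # clause already satisfied
            rest = [lit for lit in clause if abs(lit) not in assignment]
            if not rest:
                return None  # every literal is false: conflict
            if len(rest) == 1:
                assignment[abs(rest[0])] = rest[0] > 0  # forced assignment
                changed = True
            else:
                remaining.append(rest)
        clauses = remaining
    return clauses


def dpll(clauses, assignment=None):
    """Return a (possibly partial) satisfying assignment, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    clauses = unit_propagate(clauses, assignment)
    if clauses is None:
        return None
    if not clauses:
        return assignment
    var = abs(clauses[0][0])  # naive branching: first unassigned variable seen
    for value in (True, False):
        result = dpll(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None


def solve(clauses, num_vars):
    """Return ("SAT", model) or ("UNSAT", None); unconstrained variables default to False."""
    model = dpll([list(c) for c in clauses])
    if model is None:
        return "UNSAT", None
    return "SAT", {v: model.get(v, False) for v in range(1, num_vars + 1)}


def satisfies(clauses, model):
    """Deterministic check that `model` makes every clause true."""
    return all(any(model[abs(lit)] == (lit > 0) for lit in c) for c in clauses)


cnf = [[1, 2], [-1, 3], [-2, -3], [2, 3]]
status, model = solve(cnf, 3)
print(status, satisfies(cnf, model))  # SAT True
print(solve([[1], [-1]], 1)[0])       # UNSAT
```

A returned model can be verified in linear time by `satisfies`, which is the kind of check a deterministic grader can rely on.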
SWE-Bench Extension
Pull-request tasks mined from active open-source repositories. The agent works against a real codebase, checked out at the commit just before a merged PR, and must reproduce that PR's behaviour. Grading uses the project's own hidden test suite, exactly as SWE-Bench intends.
Playlist download modifiers
A self-hosted yt-dlp web UI enqueues every entry of a playlist wholesale. Add per-playlist modifiers so a user can download only a chosen slice of a playlist.
Default transaction isolation level
Today an isolation level must be passed per call. Add a data-source-level default so every transaction on a connection uses it — without threading the argument through each call site.
Add SQL UNION to the QuerySet API
An async Python ORM can filter, order and aggregate a single model, but can't combine queries. Add first-class .union() support that stitches multiple querysets together and returns hydrated model instances.
Toolathlon Extension
End-to-end tasks that evaluate tool-using agents across realistic multi-tool environments — Feishu, DingTalk, e-commerce and exchange-rate APIs, spreadsheet and document toolchains. Each task ships a natural-language instruction, an executable verification harness, and full trajectories capturing tool calls, observations and final answers.
Meeting-room booking via REST API
Acting as an office-admin operator, fulfil four booking requests against a meeting-room REST API — honouring room-assignment rules, training-room flows, blacklists and per-user frequency limits.
Logistics carrier performance report
Using the provided read/write tools over a shipments spreadsheet, compute per-carrier metrics — shipment count, average delivery days (delivered only), on-time rate and return rate — then surface the most reliable and the worst carrier.
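The metric arithmetic can be sketched in Python as below. The row fields (`carrier`, `status`, `delivery_days`, `on_time`) are hypothetical, since the spreadsheet's real columns are not given. Two further assumptions: on-time rate is taken over delivered shipments only, and carriers are ranked by on-time rate.

```python
from collections import defaultdict


def carrier_report(shipments):
    """Compute per-carrier metrics from a list of shipment dicts (hypothetical schema)."""
    groups = defaultdict(list)
    for row in shipments:
        groups[row["carrier"]].append(row)

    report = {}
    for carrier, rows in groups.items():
        delivered = [r for r in rows if r["status"] == "delivered"]
        n = len(delivered)
        report[carrier] = {
            "shipments": len(rows),
            # Average delivery days over delivered shipments only.
            "avg_delivery_days": sum(r["delivery_days"] for r in delivered) / n if n else None,
            # Assumption: on-time rate uses delivered shipments as the denominator.
            "on_time_rate": sum(bool(r["on_time"]) for r in delivered) / n if n else None,
            # Return rate uses all of the carrier's shipments as the denominator.
            "return_rate": sum(r["status"] == "returned" for r in rows) / len(rows),
        }
    return report


def best_and_worst(report):
    """Rank carriers by on-time rate (an assumed criterion); returns (best, worst)."""
    ranked = sorted(
        (c for c in report if report[c]["on_time_rate"] is not None),
        key=lambda c: report[c]["on_time_rate"],
    )
    return ranked[-1], ranked[0]


shipments = [
    {"carrier": "A", "status": "delivered", "delivery_days": 2, "on_time": True},
    {"carrier": "A", "status": "delivered", "delivery_days": 4, "on_time": False},
    {"carrier": "A", "status": "returned", "delivery_days": None, "on_time": None},
    {"carrier": "B", "status": "delivered", "delivery_days": 3, "on_time": True},
    {"carrier": "B", "status": "delivered", "delivery_days": 5, "on_time": True},
]
report = carrier_report(shipments)
print(report["A"]["avg_delivery_days"], report["B"]["on_time_rate"])  # 3.0 1.0
print(best_and_worst(report))  # ('B', 'A')
```

Choosing the denominators explicitly matters here: mixing undelivered rows into the delivery-days average is the kind of slip a deterministic checker would catch.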
Cross-border restock alert across four tools
As an e-commerce operator, run a weekly stock-out alert: combine an overseas-store API (prices & sales), a live FX rate API, and two Feishu tables (domestic inventory & suppliers), then decide restocking within a ¥500k budget — and push the plan to DingTalk.
How every task is built
A task only ships once it clears the same four gates, so the dataset stays clean and the numbers stay honest.
1 · Verifiable oracle
A deterministic checker decides pass/fail. No LLM-as-judge, no fuzzy matching — the reference solution scores 100% and a known-broken baseline scores low.
2 · Calibrated difficulty
Each task is run across five frontier models. Saturated tasks, which every model aces, are retired; we keep the ones that produce real score spread.
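One way to picture this gate is the Python sketch below. The score scale (0.0 to 1.0), the `min_spread` threshold of 0.2 and the model labels are all assumptions for illustration; the actual retirement criteria are not specified beyond saturation and score spread.

```python
def calibrate(scores, min_spread=0.2):
    """Split tasks into (kept, retired) from per-model scores in [0.0, 1.0].

    `scores` maps task name -> {model name: score}. A task is retired when
    every model scores 1.0 (saturated) or when the gap between the best and
    worst model is below `min_spread` (an assumed threshold).
    """
    kept, retired = [], []
    for task, per_model in scores.items():
        values = list(per_model.values())
        saturated = min(values) >= 1.0
        spread = max(values) - min(values)
        if saturated or spread < min_spread:
            retired.append(task)
        else:
            kept.append(task)
    return kept, retired


scores = {
    "wiener": {"m1": 1.0, "m2": 0.0, "m3": 1.0, "m4": 0.0, "m5": 1.0},
    "hello-world": {"m1": 1.0, "m2": 1.0, "m3": 1.0, "m4": 1.0, "m5": 1.0},
}
print(calibrate(scores))  # (['wiener'], ['hello-world'])
```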
3 · Real provenance
SWE-Bench tasks anchor to an actual merged PR and its hidden test suite; Terminal-Bench tasks derive from peer-reviewed algorithms with generated instances.
4 · Anti-memorization
Instances are freshly generated and the public contract is hidden from the prompt, so a model can't pattern-match its way to the answer.