Rule-based benchmark demo

Tasks

A compact collection of offline, deterministic benchmark tasks. Open a task to inspect its design, prompts, verifier checks, partial scores, resolved results, and trajectories.