The official repository for the paper [Interactive Benchmarks](https://huggingface.co/papers/2603.04737).

- `src/situation_puzzle/`: Situation-based reasoning.
- `src/math/`: Interactive math evaluation pipeline: naive solving vs. Interactive-Proof-style solving, with pass@k evaluation as a comparison baseline.
- `src/trust_game/`: Trust Game tournament (baseline + LLM agents).

Most scripts read the following environment variables (you may define them in a `.env` file inside each subdirectory, or export them directly):
- `OPENROUTER_API_KEY`: Required
- `OPENROUTER_BASE_URL`: Optional (default: `https://openrouter.ai/api/v1`)

Example:
```bash
export OPENROUTER_API_KEY="sk-..."
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
```
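For reference, the lookup a script performs is roughly the following sketch (illustrative only, not the repository's actual code; the function name `openrouter_config` is hypothetical):

```python
import os

def openrouter_config():
    """Resolve OpenRouter settings from the environment.

    Illustrative sketch: the API key is required, while the base URL
    falls back to the documented default when unset.
    """
    api_key = os.environ.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY must be set")
    base_url = os.environ.get("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")
    return api_key, base_url
```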
```bash
pip install -r requirements.txt
```
Note: Different tasks require only subsets of dependencies. Please refer to each subdirectory’s README for details.
```
InteractiveBench/
├── README.md
├── LICENSE
└── src/
    ├── trust_game/
    ├── situation_puzzle/
    ├── math/
    └── poker/
```
Each task writes its outputs to a `results/` directory (or a specified output path) within its respective folder, and includes reproducibility metadata whenever possible (e.g., model name, hyperparameters).

Contribution guidelines are in `CONTRIBUTING.md` (including requirements for adding new benchmark subdirectories, result formats, README standards, etc.). This project is distributed under the terms of the `LICENSE` file.
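As a sketch of what such reproducibility metadata might look like, the record below is purely illustrative: the field names, model identifier, and hyperparameter values are hypothetical, not a schema the repository prescribes.

```python
import json

# Hypothetical results record with reproducibility metadata.
# All field names and values here are illustrative examples.
record = {
    "task": "situation_puzzle",
    "model": "example-provider/example-model",   # model name as passed to the API
    "hyperparameters": {"temperature": 0.7, "max_tokens": 1024},
    "score": 0.0,                                # placeholder metric value
}

# Serialize in the style a run might write into results/.
print(json.dumps(record, indent=2))
```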