This repository is organized as a collection of self-contained autoresearch
problems that a coding agent (e.g. Codex or Claude Code) can tackle. Each
problem gets its own subfolder with one fixed benchmark, one editable training
script, one agent program, logs, and results.
The current problem folders are:
autoresearch_ca_life/ broad learned-architecture search
autoresearch_ca_life_transformer/ transformer-only architecture search
The autoresearch_ca_life and autoresearch_ca_life_transformer problems use the same fixed Game of Life-like cellular automata benchmark. The broad problem lets
the agent replace the baseline transformer with any learned neural architecture
that preserves the evaluator interface. The transformer-only problem keeps the
search space restricted to transformer-family architectures.
PDFs/ reference paper (AutomataGPT)
autoresearch_ca_life/ broad self-contained autoresearch problem
autoresearch_ca_life_transformer/ transformer-only autoresearch problem
...
Inside each autoresearch_* problem folder:
prepare.py fixed problem definition, data generation, and evaluator
train.py editable model and training loop
program.md agent-facing autoresearch instructions
analyze_results.py plotter/exporter for results.tsv
export_experiments.py local folder exporter for committed experiment snapshots
pyproject.toml dependencies for that problem
There is intentionally no README inside the problem folder. The convention is:
program.md = what the coding agent reads and follows
prepare.py = fixed benchmark; do not edit during experiments
train.py = editable research surface
The root README is for humans. program.md is for Codex, Claude Code, or any
other coding agent running a specific autoresearch problem.
Both current problems use the same CA benchmark. They do not model arbitrary cellular automata. They cover:
2D
binary cell states: 0/1
deterministic synchronous updates
toroidal boundary conditions
radius-1 Moore neighborhood
Life-like birth/survival rules
isotropic count-based rules
fixed 16 x 16 grids
Conway's Game of Life is the specific B3/S23 rule inside this broader
Life-like family.
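To make the constraints above concrete, here is a minimal pure-Python sketch of one synchronous update for a Life-like rule on a toroidal grid. It is an illustration only and is not the code in prepare.py; the function names parse_rule and step are hypothetical.

```python
def parse_rule(rule):
    """Parse B/S notation such as 'B3/S23' into (birth, survival) count sets."""
    birth_part, survive_part = rule.upper().split("/")
    birth = {int(ch) for ch in birth_part.lstrip("B")}
    survive = {int(ch) for ch in survive_part.lstrip("S")}
    return birth, survive


def step(grid, rule="B3/S23"):
    """Apply one deterministic synchronous update to a binary 2D grid.

    Neighbors are the 8 cells of the radius-1 Moore neighborhood, and
    indices wrap around the edges (toroidal boundary conditions).
    """
    birth, survive = parse_rule(rule)
    height, width = len(grid), len(grid[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Count live neighbors; the rule depends only on this count.
            live = sum(
                grid[(r + dr) % height][(c + dc) % width]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            allowed = survive if grid[r][c] == 1 else birth
            out[r][c] = 1 if live in allowed else 0
    return out


# Usage: a horizontal blinker on a 16 x 16 grid under Conway's rule.
grid = [[0] * 16 for _ in range(16)]
for col in (7, 8, 9):
    grid[8][col] = 1
next_grid = step(grid)  # the blinker is now vertical, in column 8
```

Because the rule is isotropic and count-based, the whole update is determined by the two count sets, which is why a single B/S string is enough to name any rule in the family.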
Use autoresearch_ca_life/ when you want the agent to explore any learned
neural architecture. Use autoresearch_ca_life_transformer/ when you want a
clean transformer-only comparison.
From the repo root:
cd ./autoresearch_ca_life
python -m pip install -e .
python prepare.py --smoke-test
python train.py > run.log 2>&1
tail -n 40 run.log
For a tiny smoke training run:
AR_TIME_BUDGET=2 \
AR_EVAL_GRID_SAMPLES=4 \
AR_EVAL_INVERSE_SAMPLES=4 \
AR_EVAL_TRUTH_TABLE_LIMIT=16 \
AR_BATCH_SIZE=4 \
AR_EVAL_BATCH_SIZE=4 \
AR_N_LAYER=1 \
AR_N_HEAD=2 \
AR_N_EMBD=32 \
python train.py
For the transformer-only problem, use the sibling folder:
cd ./autoresearch_ca_life_transformer
python -m pip install -e .
python prepare.py --smoke-test
python train.py > run.log 2>&1
tail -n 40 run.log
Codex knows which problem to run from the current directory. Start Codex inside the problem folder, not the parent repo:
cd ./autoresearch_ca_life
codex
For the broad problem, give it this prompt:
Read program.md and follow it. Set up a new autoresearch run. Run the baseline
first, then iteratively edit train.py only. Optimize for lower val_error and try
to make life_solved become 1. Do not modify prepare.py.
For the transformer-only problem, start Codex in
autoresearch_ca_life_transformer/ and use:
Read program.md and follow it. Set up a new transformer-only autoresearch run.
Run the baseline first, then iteratively edit train.py only. Optimize for lower
val_error and try to make life_solved become 1. Do not modify prepare.py and do
not use non-transformer architectures.
The agent should use local Git commits as experiment checkpoints and append all
run outcomes to results.tsv.
human chooses problem folder
|
v
start Codex inside selected autoresearch_* folder
|
v
agent reads program.md
|
v
agent runs prepare.py --smoke-test
|
v
agent runs baseline train.py
|
v
agent commits one train.py change
|
v
agent runs python train.py > run.log 2>&1
|
v
agent appends metrics to results.tsv
|
v
better than previous best?
  | yes              | no
  v                  v
keep commit     reset to best commit
  |                  |
  +----- repeat -----+
This workflow uses local Git commits as experiment checkpoints. It does not
open GitHub pull requests by itself. Training happens locally through
python train.py; the script automatically uses CUDA if available, then MPS on
Apple Silicon, then CPU. The agent tags each experiment commit before any reset
so later analysis can recover the exact train.py code snapshot.
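The keep-or-reset branch in the workflow can be sketched in a few lines of Python. This is a hedged illustration of the decision logic only, not the agent's actual procedure from program.md: the names decide and replay are hypothetical, a crashed run is represented as None, and "better" is assumed to mean a strictly lower val_error.

```python
def decide(val_error, best_error):
    """Return 'keep' if this run beats the best so far, otherwise 'reset'.

    val_error=None stands for a crashed run (an assumption of this sketch).
    """
    if val_error is None:
        return "reset"
    if best_error is None or val_error < best_error:
        return "keep"
    return "reset"


def replay(runs):
    """Replay a sequence of (commit, val_error) outcomes.

    Returns the best commit and the per-run decisions, mirroring the
    'keep commit' / 'reset to best commit' branches of the workflow.
    """
    best_commit, best_error = None, None
    decisions = []
    for commit, val_error in runs:
        action = decide(val_error, best_error)
        if action == "keep":
            best_commit, best_error = commit, val_error
        decisions.append((commit, action))
    return best_commit, decisions


# Usage with made-up commit labels and errors (illustrative values only).
best, log = replay([("a1", 0.30), ("b2", 0.35), ("c3", None), ("d4", 0.10)])
```

Every run, kept or not, still gets a row in results.tsv; only the working tree moves back to the best commit after a reset.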
After an agent has produced results.tsv, run the analysis from inside that problem folder:
cd <AUTORESEARCH DIRECTORY>
python analyze_results.py
This writes:
analysis_results/
  progress.png
  progress.svg
  progress.pdf
  architecture_summary.png
  architecture_summary.svg
  architecture_summary.pdf
  architecture_summary.tsv
  architecture_summary.csv
  architecture_report.md
  architecture_diagrams/
    index.md
    index.tsv
    exp_000_<commit>_<architecture>.png
    exp_000_<commit>_<architecture>.svg
  parameter_vs_performance.png
  parameter_vs_performance.svg
  parameter_vs_performance.pdf
  parameter_vs_performance.tsv
  parameter_vs_performance.csv
  results_clean.tsv
  results_clean.csv
  summary.json
To also materialize each committed experiment as a local folder, run this from the relevant problem directory:
python export_experiments.py
This reads results.tsv and writes:
experiment_snapshots/
  manifest.tsv
  manifest.json
  BEST_EXPERIMENT.md
  best_experiment/
    train.py
    changes.patch
    metadata.json
    metadata.tsv
    README.md
    BEST_EXPERIMENT.md
  experiment_000/
    train.py
    changes.patch
    metadata.json
    metadata.tsv
    README.md
  experiment_001/
    ...
By default it exports train.py, the intended editable experiment surface. To
archive additional problem-folder files from every commit, pass them explicitly:
python export_experiments.py --files train.py program.md
The exporter marks the best non-crash experiment by lowest val_error, writes
is_best=1 in manifest.tsv, and copies that experiment to
experiment_snapshots/best_experiment/ for quick inspection.
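The selection rule described here (lowest val_error among non-crash rows) can be sketched as follows. This is not the code in export_experiments.py. The commit, val_error, and architecture columns come from this README, while the status column and its crash value are assumptions made for illustration; the real ledger header is defined in each problem's program.md.

```python
import csv
import io
import math


def mark_best(tsv_text):
    """Parse a results ledger and flag the best non-crash row with is_best=1."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    candidates = []
    for row in rows:
        row["is_best"] = "0"
        # Assumed convention: crashed runs carry status == "crash".
        if row.get("status") == "crash":
            continue
        try:
            err = float(row["val_error"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric metric: not a candidate
        if math.isfinite(err):
            candidates.append((err, row))
    if candidates:
        best_row = min(candidates, key=lambda pair: pair[0])[1]
        best_row["is_best"] = "1"
    return rows


# Usage with a made-up ledger (commit hashes and metrics are illustrative).
SAMPLE = (
    "commit\tval_error\tstatus\tarchitecture\n"
    "a1b2c3d\t0.250\tkeep\ttransformer\n"
    "e4f5a6b\t0.000\tcrash\tcnn\n"
    "c7d8e9f\t0.120\tkeep\tcnn\n"
)
best_commits = [r["commit"] for r in mark_best(SAMPLE) if r["is_best"] == "1"]
```

Excluding crash rows before taking the minimum matters because a crashed run may record a placeholder metric that would otherwise look like the best result.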
The original results.tsv is left untouched. Architecture plots are generated
when results.tsv includes an architecture column; new runs should use the
header specified in that problem's program.md. The Markdown report uses the
commit hashes in results.tsv to recover the exact train.py code for the best
commit in each architecture family. The architecture_diagrams/ folder contains
one labeled layer-by-layer PNG and SVG network diagram for each experiment row.
parameter_vs_performance.png and .svg plot parameter count versus
validation error.
To add a future problem, create a new sibling folder. Keep the problem self-contained unless a future project truly needs shared library code.
autoresearch_my_new_problem/
Recommended process:
- Create the new folder and copy only the problem source files:
  mkdir autoresearch_my_new_problem
  cp autoresearch_ca_life/prepare.py autoresearch_my_new_problem/
  cp autoresearch_ca_life/train.py autoresearch_my_new_problem/
  cp autoresearch_ca_life/program.md autoresearch_my_new_problem/
  cp autoresearch_ca_life/analyze_results.py autoresearch_my_new_problem/
  cp autoresearch_ca_life/export_experiments.py autoresearch_my_new_problem/
  cp autoresearch_ca_life/pyproject.toml autoresearch_my_new_problem/
  analyze_results.py and export_experiments.py are problem-agnostic: they both read results.tsv and recover train.py snapshots from commit hashes, so they work unchanged in the new folder as long as program.md keeps the commit and architecture columns in the result ledger.
- Rewrite autoresearch_my_new_problem/prepare.py to define the fixed problem:
  constants
  data/rule generation
  make_batch(...)
  evaluate_model(...)
  smoke test CLI
- Reset train.py only if the new problem needs a different model interface. Otherwise keep the shared decoder-only transformer baseline.
- Rewrite program.md so the agent's objective, fixed files, metrics, and result ledger columns are problem-specific.
- Run:
  cd autoresearch_my_new_problem
  python prepare.py --smoke-test
  AR_TIME_BUDGET=2 python train.py
- Start Codex in that folder and tell it to read program.md.
- After the agent has produced results.tsv, generate figures and export the committed experiment snapshots (same tooling as the existing problems):
  cd autoresearch_my_new_problem
  python analyze_results.py     # plots, architecture report, summary.json
  python export_experiments.py  # experiment_snapshots/ + best_experiment/
This keeps each autoresearch problem isolated and easy to publish, reproduce, or archive.
Use two local clones so reusable code changes and optimization experiments do not get mixed together:
git clone https://github.com/lamm-mit/explore-and-discover.git Explore-and-Discover-main
git clone https://github.com/lamm-mit/explore-and-discover.git Explore-and-Discover-run
Use Explore-and-Discover-main for documentation, benchmark, plotting, and new
problem-folder changes:
cd Explore-and-Discover-main
git switch main
git pull --ff-only origin main
# edit reusable files here
git add README.md autoresearch_ca_life/analyze_results.py
git commit -m "Update analysis tooling"
git push origin main
Use Explore-and-Discover-run for Codex or Claude Code optimization runs:
cd Explore-and-Discover-run
git switch main
git pull --ff-only origin main
git switch -c autoresearch/may24-ca-life
cd autoresearch_ca_life
codex
Never merge an autoresearch/* run branch into main unless you explicitly
want to publish that branch's optimized train.py. If you want to preserve an
interesting optimized model, push the experiment branch separately:
git push -u origin autoresearch/may24-ca-life
The root .gitignore ignores generated caches, run logs, results.tsv, plot
exports, Python bytecode, virtual environments, and scratch directories for all
autoresearch_* folders.
If you use this repository in your work, please cite:
@misc{buehler2026exploreanddiscover,
author = {Buehler, Markus J.},
title = {Explore and Discover Research Agents Solve Scientific Problems},
year = {2026},
url = {https://github.com/lamm-mit/explore-and-discover},
note = {GitHub repository}
}