Research Plan

LocoAgente is organised into four research tracks and a phased roadmap. The tracks are independent lines of investigation that share infrastructure; the phases sequence the work across semesters.


The scaffolding hypothesis: if a 4B model with RE2 + voting can match a 70B model on a single question, can a 4B agent with RE2 + voting + constrained actions match a frontier agent on a well-scoped task?
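
To make the hypothesis concrete, the single-question half (RE2 plus self-consistency voting) can be sketched in a few lines against an OpenAI-compatible local endpoint. This is a minimal sketch, assuming a local server at the usual port, a hypothetical 4B model name, and a crude "last line is the answer" extraction rule:

from collections import Counter
from openai import OpenAI

# Assumption: a local OpenAI-compatible server (llama.cpp, vLLM, Ollama, ...).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
MODEL = "qwen3-4b"  # hypothetical local 4B model name

def re2_prompt(question: str) -> str:
    # RE2: restate the question so the model reads it twice before answering.
    return f"{question}\nRead the question again: {question}"

def ask_once(question: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": re2_prompt(question)}],
        temperature=0.7,  # sampling diversity is what makes voting useful
    )
    return resp.choices[0].message.content.strip()

def ask_with_voting(question: str, n: int = 5) -> str:
    # Self-consistency: sample n answers and keep the most common final line.
    finals = [ask_once(question).splitlines()[-1] for _ in range(n)]
    return Counter(finals).most_common(1)[0][0]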


Track A. Port Karpathy’s autoresearch loop to a local 4B model. The simplest possible agent: read a metric, change code, run, check, repeat. This is the starting point because the action space is small enough that a 4B model might handle it, and evaluation is automatic.
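
The loop itself fits in a short sketch. The assumptions here are illustrative, not project code: the task exposes one scalar metric written to a file, edits arrive as full-file rewrites, and propose_edit is whatever prompt-plus-local-model function produces the next version:

import subprocess
from pathlib import Path

TARGET = Path("train.py")         # hypothetical file the agent is allowed to edit
METRIC_FILE = Path("metric.txt")  # hypothetical: each run writes its score here

def run_and_read_metric() -> float:
    # Run the experiment and read back the scalar it reports.
    subprocess.run(["python", str(TARGET)], check=True, timeout=600)
    return float(METRIC_FILE.read_text().strip())

def agent_loop(propose_edit, steps: int = 10) -> float:
    # propose_edit(source, best_metric) -> new source, backed by the local model.
    best = run_and_read_metric()
    for _ in range(steps):
        old_source = TARGET.read_text()
        TARGET.write_text(propose_edit(old_source, best))
        try:
            metric = run_and_read_metric()
        except Exception:
            TARGET.write_text(old_source)   # broken edit: roll back and retry
            continue
        if metric > best:
            best = metric                   # keep the improvement
        else:
            TARGET.write_text(old_source)   # no gain: revert
    return best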

Track B. Extend to other well-scoped tasks: data analysis, code review, documentation. Each domain gets its own adapter and evaluation harness. Depends on Track A for the working agent loop.
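
One way to keep those domains pluggable is a small shared interface between the agent loop and each evaluation harness. The method names below are illustrative, not a fixed LocoAgente API:

from typing import Protocol

class DomainAdapter(Protocol):
    """Hypothetical per-domain plug-in: one per task family
    (data analysis, code review, documentation, ...)."""

    def build_prompt(self, task: str, context: str) -> str:
        """Wrap the task in domain-specific scaffolding."""
        ...

    def apply(self, action: str) -> None:
        """Execute the model's proposed action in the domain's sandbox."""
        ...

    def score(self) -> float:
        """Return the automatic evaluation metric for the current state."""
        ...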

Track C. A systematic study of which scaffolding strategies most improve small-model agent reliability. This is the most publishable track: it connects to the Cognitive Strategy Transfer research and the keep-asking nudge work.

Track D. Does framework choice meaningfully affect SLM agent performance? The hypothesis: minimalist orchestrators outperform full-featured frameworks with local models because of lower prompt overhead. Results feed into LocoBench.
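
A cheap first pass on the overhead claim, before anything moves to LocoBench, is to tokenise the system prompt each framework injects before the task even starts. The tokenizer choice and the captured prompt files below are assumptions:

from transformers import AutoTokenizer

# Assumed tokenizer; swap in whichever local model is actually under test.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Hypothetical: system prompts captured from each framework's first request.
framework_prompts = {
    "minimal_loop": open("prompts/minimal_loop.txt").read(),
    "full_framework": open("prompts/full_framework.txt").read(),
}

for name, prompt in framework_prompts.items():
    print(f"{name}: {len(tok.encode(prompt))} prompt tokens of pure overhead")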


Track A (autoresearch port)
├── Track B (new task domains — needs the working loop from A)
├── Track C (scaffolding — can start alongside A, uses A as test harness)
└── Track D (framework eval — can start alongside A, compares loop implementations)

Track A is the foundation. Tracks B, C, and D build on it but are otherwise independent of each other.


Tool calling is a controlled variable. LocoAgente assumes tool calling works (guided decoding + flat schemas + Pydantic). The research questions are about reasoning in loops and framework overhead, not tool calling reliability. Improvements from LocoLLM or the broader community are imported, not built here.
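
For reference, the assumed recipe is a flat Pydantic model whose JSON schema is handed to the serving stack's guided-decoding backend, with the model's raw output validated back through the same class. The tool below is illustrative, not part of the project:

from typing import Literal
from pydantic import BaseModel

class RunExperiment(BaseModel):
    """Flat tool schema: no nesting, no unions, only small enums."""
    action: Literal["edit", "run", "inspect"]  # constrained action space
    file: str
    rationale: str

# Hand the schema to the guided-decoding backend (vLLM guided JSON,
# llama.cpp grammars, Outlines, ...) so the model can only emit this shape.
schema = RunExperiment.model_json_schema()

# Validate whatever the model emitted back through the same class.
call = RunExperiment.model_validate_json(
    '{"action": "run", "file": "train.py", "rationale": "baseline first"}'
)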

Local first. Everything runs on consumer hardware. Cloud APIs are baseline comparisons, not the system under study.

Constrained over capable. Start with the smallest viable action space and expand.

Measurably better. Every strategy must demonstrate quantifiable improvement. “It feels smarter” doesn’t count.

Comparable experiments. Fixed time budgets, fixed evaluation metrics, fixed hardware.
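
In practice that means pinning the constraints in one config that every run records alongside its score. Field names and defaults below are illustrative, not the project's actual settings:

import json
import platform
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ExperimentConfig:
    time_budget_s: int = 3600        # fixed wall-clock budget per run
    metric: str = "val_accuracy"     # fixed evaluation metric
    hardware: str = platform.node()  # record which machine produced the run
    model: str = "qwen3-4b"          # hypothetical local model under test
    scaffold: str = "re2+voting"     # scaffolding strategy being measured

def log_run(cfg: ExperimentConfig, score: float, path: str = "runs.jsonl") -> None:
    # Append one JSON line per run so runs stay directly comparable.
    with open(path, "a") as f:
        f.write(json.dumps({**asdict(cfg), "score": score}) + "\n")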


  • Port autoresearch loop to work with local models
  • Implement baseline experiment: frontier model vs bare local model
  • Build experiment harness (run, log, compare)
  • First scaffolding experiments (CoT, RE2)
  • Document findings
  • Systematic scaffolding strategy comparison (Track C)
  • Self-consistency voting in agent loops
  • Constrained action space experiments
  • Train first agent adapter via LocoLLM pipeline
  • First publication: scaffolding strategies for small-model agents
  • Extend to task-specific agents (Track B)
  • Data analysis agent, code review agent
  • Human-in-the-loop checkpoint experiments
  • Cross-task scaffolding comparison
  • Agent adapters contributed to LocoLLM ecosystem
  • Multi-agent coordination on consumer hardware
  • Integration with LocoConvoy for multi-GPU agent swarms

LocoAgente generates experiments and qualitative findings. When results need rigorous, reproducible measurement — particularly from Track D’s framework comparisons — the benchmarking methodology and runs move to LocoBench. LocoAgente asks the questions; LocoBench produces the numbers.