Benchmarking Greplica: Significant uplift on planning tasks on open-source repositories
Overview
Greplica improves coding-agent performance on complex engineering tasks by giving agents access to relevant memory from prior development sessions.
We benchmarked Greplica using the SWE-chat dataset on 10 selected high-context tasks across open-source repositories, and found that agents with Greplica memory consistently reached plans with less exploration than baseline agents that started from scratch.
Agents using Greplica performed better on all counts:
- 43% lower estimated cost
- 49% fewer tokens consumed
- 36% fewer tool calls
- 26% less time taken
When relevant context is saved in memory and surfaced during a related task, the agent understands the task better, finds the right subsystems sooner, and accounts for prior decisions. The gains are concentrated in producing an implementation plan.
In this post we walk through how Greplica helps agents, how we designed the benchmark, what we measured, and what the pilot results show.
Why Coding Agents Need Memory
Coding agents are reasoning systems built around LLMs. At the start of a new session, their context window contains only the user prompt, global skills and AGENTS.md. From there they must rebuild their understanding of the codebase through tool calls: grep, glob, read, shell commands, and file inspection. Large repositories contain millions of lines of code, so meaningful time and tokens are lost reconstructing context that may already have been learned in previous sessions.
A larger context window does not automatically solve this. Too much irrelevant context can make the agent slower, more expensive, and less accurate. When the window fills up, harnesses compact the conversation and useful intermediate reasoning can be lost.
Developers compensate by giving project instructions in prompts, or writing them into AGENTS.md or other repo-level documentation. These are useful, but difficult to maintain, hard to keep current, and not designed for task-specific retrieval. As the project grows they either become too sparse or too large to trust.
What coding agents need is not just more context. They need persistent, queryable engineering memory.
What Greplica Does
Greplica works in the background, looking out for important bits of context to capture. It uses your coding session transcripts and fresh code changes to extract useful facts like architectural decisions, learnings from prior attempts, gotchas and edge cases. These are stored in a persistent SQLite-backed graph, automatically at the end of each session.
When an agent receives a new task, it can query Greplica before broad manual exploration. Instead of rediscovering the repository from scratch, it retrieves relevant prior context and uses that to produce a better plan.
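To make the idea of a persistent, queryable graph concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema, the function names (open_memory, add_claim, link, query) and the keyword-match retrieval are all assumptions made for illustration; this post does not describe Greplica's actual tables or retrieval logic.

```python
import sqlite3

# Hypothetical schema: Greplica's real tables and columns are not described here.
SCHEMA = """
CREATE TABLE claims (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,   -- e.g. 'decision', 'gotcha', 'learning'
    text TEXT NOT NULL
);
CREATE TABLE edges (
    src      INTEGER NOT NULL REFERENCES claims(id),
    dst      INTEGER NOT NULL REFERENCES claims(id),
    relation TEXT NOT NULL   -- e.g. 'relates_to', 'supersedes'
);
"""


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Create a fresh store (in-memory by default, for illustration)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_claim(conn: sqlite3.Connection, kind: str, text: str) -> int:
    """Insert one claim node and return its id."""
    cur = conn.execute("INSERT INTO claims (kind, text) VALUES (?, ?)", (kind, text))
    return cur.lastrowid


def link(conn: sqlite3.Connection, src: int, dst: int, relation: str) -> None:
    """Add a directed edge between two claims."""
    conn.execute(
        "INSERT INTO edges (src, dst, relation) VALUES (?, ?, ?)",
        (src, dst, relation),
    )


def query(conn: sqlite3.Connection, keyword: str) -> list[str]:
    """Return claims that match the keyword, plus claims one edge away from a match."""
    rows = conn.execute(
        """
        SELECT c.id, c.text FROM claims c
        WHERE c.text LIKE :pat
           OR c.id IN (
               SELECT e.dst FROM edges e
               JOIN claims m ON m.id = e.src
               WHERE m.text LIKE :pat
           )
        ORDER BY c.id
        """,
        {"pat": f"%{keyword}%"},
    ).fetchall()
    return [text for _, text in rows]


# Made-up claims, purely for illustration.
conn = open_memory()
decision = add_claim(conn, "decision", "Token refresh lives in the background worker")
gotcha = add_claim(conn, "gotcha", "Restart the worker after config changes")
link(conn, decision, gotcha, "relates_to")
print(query(conn, "token"))
```

Following one edge from each match is only one possible design; it lets a query about a decision also surface the gotchas attached to it without returning the whole graph.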
We designed this benchmark to test whether that works on realistic, temporally valid session sequences.
Benchmark Design
We started with a specific question:
If a coding agent has access to memory built from prior related sessions on the same repository, does it produce a better plan for a later task — faster and with less exploration?
Why planning, not implementation
We chose the planning phase because most of an agent's initial exploration is spent understanding the repo, locating the right subsystem, and turning that context into a plan.
Data source
Cases are built from the SALT-NLP/SWE-chat dataset: real developer sessions with transcripts, checkpoints, and edit patches across many open-source repos.
Each case is a sequence of coding sessions:
- Prior (memory-building) sessions (2-4) — chronologically before the session chosen for testing. Memory is built only from these.
- Held-out (test) session — a later session on the same repo. Its main engineering task becomes the benchmark prompt. The agent never sees this transcript during memory build.
We built memory only from prior sessions and ensured that no later session leaked into it.
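The temporal split can be sketched as follows. This is an illustration under stated assumptions: the Session fields and the build_case helper are hypothetical names, not SWE-chat's schema or our harness code, and picking the most recent earlier sessions is a simplification of the manual selection described in this post.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Session:
    session_id: str        # hypothetical field names, not SWE-chat's schema
    repo: str
    started_at: datetime


def build_case(sessions, held_out_id, max_prior=4):
    """Split one repo's sessions into (prior, held_out) with no temporal leakage."""
    held_out = next(s for s in sessions if s.session_id == held_out_id)
    earlier = sorted(
        (
            s for s in sessions
            if s.repo == held_out.repo and s.started_at < held_out.started_at
        ),
        key=lambda s: s.started_at,
    )
    prior = earlier[-max_prior:]   # assumption: keep the most recent earlier sessions
    if len(prior) < 2:
        raise ValueError("need at least two prior sessions to build memory")
    # Guard: nothing used for memory may start at or after the held-out session.
    assert all(s.started_at < held_out.started_at for s in prior)
    return prior, held_out
```

The final assertion is redundant with the filter above it, but keeping it makes the no-leak rule an explicit, checkable invariant rather than a side effect of the selection logic.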
Repository and task selection
We first shortlisted repositories by credibility (number of GitHub stars), history (number of past commits), and continuity (multiple contiguous sessions on related work).
From those, we chose 10 sessions where the user was doing highly contextual work: work related to prior sessions, or tasks requiring subsystem understanding rather than a one-file fix.
These tasks mimic real-world development tasks in large, complex repositories.
Task Construction
For each chosen session, we inspected the work that happened in it and constructed a prompt for a planning task, mimicking what a real engineer might ask. In parallel, for verification, we had the LLM capture gold facts in a hidden judge.md file containing the expected components of a good plan, based on what the user actually had the LLM do.
We then materialized the repo at the pre-task base commit and started two arms: baseline and Greplica. The Greplica arm uses memory built from prior sessions (i.e. transcripts and edit artifacts).
Memory is built the way a user would actually use Greplica:
- Bootstrap Greplica on the repo at the prior session's start checkpoint
- Reconstruct the session's code diff from SWE-chat edit artifacts
- Invoke greplica-update-memory with human/assistant transcript text and repo context
- Save the updated memory and repeat for the next session, until we reach the held-out test session
For reference, one high-context memory build produced 37 claims across bootstrap plus three update sessions (21 + 6 + 5 + 5).
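The build loop described above can be sketched as follows. The three callables stand in for the real steps and are assumptions: bootstrap, reconstruct_diff and update_memory are hypothetical Python names, and the actual interface of greplica-update-memory is not specified in this post.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class PriorSession(NamedTuple):
    start_checkpoint: str   # checkpoint/commit id at the start of the session
    transcript: str         # human/assistant transcript text
    edits: list             # SWE-chat edit artifacts for this session


def build_memory(
    sessions: Iterable[PriorSession],
    bootstrap: Callable[[str], dict],
    reconstruct_diff: Callable[[list], str],
    update_memory: Callable[[dict, str, str], dict],
) -> dict:
    """Bootstrap once, then fold each prior session into memory in order."""
    memory: Optional[dict] = None
    for session in sessions:
        if memory is None:
            # Bootstrap on the repo as it was when the first prior session began.
            memory = bootstrap(session.start_checkpoint)
        diff = reconstruct_diff(session.edits)
        # Stand-in for invoking greplica-update-memory with transcript + repo context.
        memory = update_memory(memory, session.transcript, diff)
    if memory is None:
        raise ValueError("no prior sessions supplied")
    return memory
```

Passing the steps in as callables keeps the loop itself testable with stubs, and makes the ordering explicit: one bootstrap followed by one update per prior session, which matches the "bootstrap plus three update sessions" shape mentioned above.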
Evaluation & Results
The LLM judge reads the user-facing prompt, hidden gold guidance (judge.md) and the created final-plan.md.
We measure plan quality (LLM-judge's boolean scores across multiple dimensions in judge.md), tokens consumed, tool calls and elapsed time.
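One way to hold these measurements per arm is sketched below. The ArmResult record, its field names, and the choice to summarize the boolean checks as a pass fraction are assumptions for illustration, not the harness's actual scorer.

```python
from dataclasses import dataclass


@dataclass
class ArmResult:
    arm: str                  # "baseline" or "greplica"
    checks: dict[str, bool]   # one boolean verdict per judge.md dimension
    tokens: int
    tool_calls: int
    elapsed_s: float

    @property
    def plan_quality(self) -> float:
        """Fraction of judge dimensions the plan satisfied (0.0 if none were scored)."""
        if not self.checks:
            return 0.0
        return sum(self.checks.values()) / len(self.checks)

    def as_row(self) -> dict:
        """Flatten to one report row, so both arms can be laid side by side."""
        return {
            "arm": self.arm,
            "plan_quality": round(self.plan_quality, 2),
            "tokens": self.tokens,
            "tool_calls": self.tool_calls,
            "elapsed_s": self.elapsed_s,
        }
```

Keeping the individual boolean checks alongside the summary fraction means a later report can still show which dimension a plan missed, not only how many.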
Pilot runs used gpt-5.4 for planning and judging. Results are single-run per arm unless noted; baseline trajectories can be noisy on identical prompts.
Across the selected top 10 tasks:
Per task:
| Task | Cost (Baseline) | Cost (Greplica) | Cost % Delta | Time (Baseline) | Time (Greplica) | Time % Delta | Tool calls (Baseline) | Tool calls (Greplica) |
|---|---|---|---|---|---|---|---|---|
| Moltis onboarding provider feedback | $2.42 | $0.74 | 70% | 408s | 228s | 44% | 90 | 47 |
| Gemini Voyager sync auth bug | $1.06 | $0.50 | 53% | 373s | 233s | 38% | 70 | 33 |
| Gemini Voyager AI folder organize | $1.74 | $0.83 | 53% | 436s | 275s | 37% | 95 | 48 |
| Gemini Voyager cross browser fork | $1.19 | $0.69 | 42% | 406s | 267s | 34% | 63 | 49 |
| Gemini Voyager chrome store restored | $1.00 | $0.65 | 35% | 292s | 280s | 4% | 49 | 35 |
| IPTVnator playback layout | $1.71 | $1.14 | 33% | 398s | 366s | 8% | 93 | 80 |
| Gemini Voyager quote reply IME | $0.41 | $0.29 | 30% | 187s | 183s | 2% | 21 | 14 |
| Gemini Voyager changelog badge | $0.61 | $0.44 | 27% | 310s | 203s | 35% | 54 | 37 |
| Gemini Voyager i18n bundle | $0.63 | $0.50 | 21% | 259s | 228s | 12% | 42 | 29 |
| IPTVnator add playlist entrypoint | $1.57 | $1.32 | 16% | 513s | 373s | 27% | 117 | 75 |
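
As a quick arithmetic check, the per-task time and tool-call columns above can be pooled into overall reductions. This sketch assumes the headline percentages are computed from summed totals rather than from an average of per-task percentages; the helper name pooled_reduction is ours.

```python
# (baseline, greplica) pairs copied from the per-task table, in row order.
TIME_S = [(408, 228), (373, 233), (436, 275), (406, 267), (292, 280),
          (398, 366), (187, 183), (310, 203), (259, 228), (513, 373)]
TOOL_CALLS = [(90, 47), (70, 33), (95, 48), (63, 49), (49, 35),
              (93, 80), (21, 14), (54, 37), (42, 29), (117, 75)]


def pooled_reduction(pairs):
    """Percent reduction of the summed Greplica values relative to the summed baseline."""
    baseline = sum(b for b, _ in pairs)
    treated = sum(g for _, g in pairs)
    return 100.0 * (baseline - treated) / baseline


print(f"time: {pooled_reduction(TIME_S):.1f}% less")             # time: 26.4% less
print(f"tool calls: {pooled_reduction(TOOL_CALLS):.1f}% fewer")  # tool calls: 35.6% fewer
```

Under that pooling assumption, both figures round to the headline numbers: 26% less time and 36% fewer tool calls.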
Readout
Greplica saved cost, tokens, tool calls, and time across the selected top 10 tasks. The strongest wins came from tasks where the missing context lived in prior sessions: onboarding/provider behavior in moltis, conversation and release behavior in gemini-voyager, and playlist/playback architecture in iptvnator.
Conclusion
The planning phase is one of the highest-touch parts of agentic software development. When the initial plan is wrong, incomplete, or based on missing context, the rest of the run compounds the error.
Human developers use their own memory to give useful nudges to coding agents in prompts. However, these nudges are often insufficient, and coding agents either rediscover context through expensive exploration or miss it entirely.
Greplica gives agents a way to retrieve that memory directly, and the benefits are stark.
Our SWE-chat plan benchmark pilot shows that when agents have access to temporally valid, task-relevant persistent memory, they can plan complex coding tasks with lower cost, fewer tokens, fewer tool calls, and less time — especially on tasks where prior sessions contain the missing subsystem context.
Why AGENTS.md Is Not Enough
Repo-level instruction files are useful, but not a scalable memory layer.
They require manual maintenance, do not support task-specific retrieval, and do not preserve the history of engineering decisions — failed attempts, migrations, design tradeoffs, and subsystem-specific gotchas.
Greplica continuously captures context from development work, stores it in a structured graph, and retrieves the relevant subset when an agent needs it.
Future Work
- Expand from ten pilot tasks to fifty-plus high-context cases with 3–5 repeated runs per arm and median reporting
- Wire cost estimation for gpt-5.4 and other agent models in the harness scorer
- Explore LLM-based retrieval methods beyond the current semantic-score and keyword-based retrieval
- Include other sources of information (GitHub issues, PRs, PRDs) to add context
If you find this work interesting or have feedback, please find us on Discord.