Benchmarking Greplica: Significant uplift on planning tasks on open-source repositories

43%

Less cost

average reduction across the selected top 10 tasks

49%

Fewer tokens

less context spent versus baseline exploration

36%

Fewer tool calls

fewer repository exploration steps before planning

26%

Time saved

less elapsed planning time on held-out tasks

Contents

01 Overview
02 Why agents need memory
03 What Greplica does
04 Benchmark design
05 Results
06 Conclusion

Overview

Greplica improves coding-agent performance on complex engineering tasks by giving agents access to relevant memory from prior development sessions.

We benchmarked Greplica using the SWE-chat dataset on 10 selected high-context tasks across open-source repositories, and found that agents with Greplica memory consistently reached plans with less exploration than baseline agents that started from scratch.

Agents using Greplica performed better on all counts:

43% lower estimated cost
49% fewer tokens consumed
36% fewer tool calls
26% less time taken

When relevant context is saved in memory and surfaced to the agent during a related task, the agent understands the task better, finds the right subsystems sooner, and accounts for prior decisions. The gains are concentrated in producing an implementation plan.

In this post we walk through how Greplica helps agents, how we designed the benchmark, what we measured, and what the pilot results show.

Why Coding Agents Need Memory

Coding agents are reasoning systems built around LLMs. When a new session starts, their context window contains only the user prompt, global skills, and AGENTS.md. From there they must rebuild their understanding of the codebase through tool calls: grep, glob, read, shell commands, and file inspection. Large repositories can contain millions of lines of code, so agents lose meaningful time and tokens reconstructing context that may already have been learned in previous sessions.

A larger context window does not automatically solve this. Too much irrelevant context can make the agent slower, more expensive, and less accurate. When the window fills up, harnesses compact the conversation and useful intermediate reasoning can be lost.

Developers compensate by giving project instructions in prompts, or writing them into AGENTS.md or other repo-level documentation. These are useful, but difficult to maintain, hard to keep current, and not designed for task-specific retrieval. As the project grows they either become too sparse or too large to trust.

What coding agents need is not just more context. They need persistent, queryable engineering memory.

What Greplica Does

Greplica works in the background, looking out for important bits of context to capture. It uses your coding session transcripts and fresh code changes to extract useful facts like architectural decisions, learnings from prior attempts, gotchas and edge cases. These are stored in a persistent SQLite-backed graph, automatically at the end of each session.
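To make the idea of a persistent, SQLite-backed graph of extracted facts more concrete, here is a minimal sketch. The table names, columns, relation labels, and helper functions (`open_memory`, `add_claim`, `link_claims`, `neighbors`) are all assumptions made for illustration; they are not Greplica's actual schema or API.

```python
import sqlite3

# Hypothetical schema: claims are nodes, typed links between claims are edges.
SCHEMA = """
CREATE TABLE IF NOT EXISTS claims (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,        -- e.g. 'decision', 'learning', 'gotcha'
    text TEXT NOT NULL,
    session_id TEXT NOT NULL   -- session the claim was extracted from
);
CREATE TABLE IF NOT EXISTS edges (
    src INTEGER NOT NULL REFERENCES claims(id),
    dst INTEGER NOT NULL REFERENCES claims(id),
    relation TEXT NOT NULL     -- e.g. 'relates_to', 'supersedes'
);
"""


def open_memory(path=":memory:"):
    """Open (or create) the memory database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_claim(conn, kind, text, session_id):
    """Insert one extracted fact and return its row id."""
    cur = conn.execute(
        "INSERT INTO claims (kind, text, session_id) VALUES (?, ?, ?)",
        (kind, text, session_id),
    )
    conn.commit()
    return cur.lastrowid


def link_claims(conn, src, dst, relation):
    """Add a directed, typed edge between two existing claims."""
    conn.execute(
        "INSERT INTO edges (src, dst, relation) VALUES (?, ?, ?)",
        (src, dst, relation),
    )
    conn.commit()


def neighbors(conn, claim_id):
    """Return (text, relation) pairs for claims reachable in one hop."""
    rows = conn.execute(
        "SELECT c.text, e.relation FROM edges e "
        "JOIN claims c ON c.id = e.dst WHERE e.src = ?",
        (claim_id,),
    )
    return rows.fetchall()
```

Passing a file path instead of `":memory:"` would keep the graph on disk between sessions, which is the property the paragraph above relies on.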

When an agent receives a new task, it can query Greplica before broad manual exploration. Instead of rediscovering the repository from scratch, it retrieves relevant prior context and uses that to produce a better plan.
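As a simplified illustration of querying memory before exploring, the sketch below ranks stored claims by keyword overlap with the task prompt. It is a stand-in only: this post mentions semantic-score and keyword-based retrieval as the current methods, while the code here covers a plain keyword variant (Jaccard similarity), and the function names `tokenize` and `retrieve` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(task_prompt, claims, top_k=3):
    """Rank claims by Jaccard overlap with the prompt; drop zero-overlap claims."""
    query = tokenize(task_prompt)
    scored = []
    for claim in claims:
        words = tokenize(claim)
        union = query | words
        score = len(query & words) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, claim))
    # Stable sort: claims with equal scores keep their stored order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [claim for _, claim in scored[:top_k]]
```

An agent harness could prepend the returned claims to the planning prompt, so the first tool calls go to the subsystems the claims point at rather than to a broad search.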

We designed this benchmark to test whether that works on realistic, temporally valid session sequences.

Benchmark Design

We started with a specific question:

If a coding agent has access to memory built from prior related sessions on the same repository, does it produce a better plan for a later task — faster and with less exploration?

Why planning, not implementation

We chose the planning phase because most of an agent's initial exploration is spent understanding the repo, locating the right subsystem, and turning that context into a plan.

Data source

Cases are built from the SALT-NLP/SWE-chat dataset: real developer sessions with transcripts, checkpoints, and edit patches across many open-source repos.

Each case is a sequence of coding sessions:

Prior (memory-building) sessions (2-4) — chronologically before the session chosen for testing. Memory is built only from these.
Held-out (test) session — a later session on the same repo. Its main engineering task becomes the benchmark prompt. The agent never sees this transcript during memory build.

We built memory only from prior sessions and ensured that no future session leaked into it.
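The temporal constraint can be sketched as a small selection function. The `Session` fields and the `split_case` helper are assumptions for this example and do not reflect the SWE-chat dataset's actual field names; the cap of four prior sessions follows the 2-4 range stated above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Session:
    session_id: str
    repo: str
    started_at: datetime


def split_case(sessions, held_out_id, max_prior=4):
    """Return (prior_sessions, held_out) with prior sessions strictly earlier."""
    held_out = next(s for s in sessions if s.session_id == held_out_id)
    earlier = [
        s for s in sessions
        if s.repo == held_out.repo and s.started_at < held_out.started_at
    ]
    # Keep the most recent `max_prior` sessions, in chronological order.
    prior = sorted(earlier, key=lambda s: s.started_at)[-max_prior:]
    # Leak check: nothing at or after the held-out session may feed memory.
    assert all(s.started_at < held_out.started_at for s in prior)
    assert held_out not in prior
    return prior, held_out
```

Filtering on the timestamp rather than on list position means a mis-sorted input cannot place a later session into the memory build.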

Repository and task selection

We first shortlisted repositories by credibility (number of GitHub stars), history (number of past commits), and continuity (multiple contiguous sessions on related work).

From those, we chose 10 sessions where the user was doing highly contextual work: related to prior sessions or tasks requiring subsystem understanding rather than a one-file fix.

These tasks mimic real-world development tasks in large, complex repositories.

Task Construction

For each chosen session, we inspected the work that happened in it and constructed a prompt for a planning task, mimicking what a real engineer might ask. In parallel, for verification, we had the LLM capture gold facts in a hidden judge.md file containing the expected components of a good plan, based on what the user actually had the LLM do.

We then materialized the repo at the pre-task base commit and started two arms: baseline and Greplica. The Greplica arm uses memory built from prior sessions (i.e. transcripts and edit artifacts).

Memory is built the way a user would actually use Greplica:

Bootstrap Greplica on the repo at the prior session's start checkpoint
Reconstruct the session's code diff from SWE-chat edit artifacts
Invoke greplica-update-memory with human/assistant transcript text and repo context
Save the updated memory and repeat for the next session, until we reach the held-out test session
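The steps above can be sketched as a replay loop. Everything here is hypothetical scaffolding: the three callables stand in for the bootstrap step, the diff reconstruction, and the greplica-update-memory invocation, whose real interfaces are not described in this post. The dictionary keys are also made up, memory is modelled as a plain list of claims, and the sketch assumes bootstrap runs once, at the first prior session's start checkpoint.

```python
def build_memory(prior_sessions, bootstrap, reconstruct_diff, update_memory):
    """Replay prior sessions in order.

    Returns the final list of claims and the number of claims added at each
    step (bootstrap first, then one entry per session).
    """
    if not prior_sessions:
        raise ValueError("need at least one prior session")

    # Bootstrap on the repo as it looked when the first prior session began.
    memory = list(bootstrap(prior_sessions[0]["start_checkpoint"]))
    counts = [len(memory)]

    for session in prior_sessions:
        # Rebuild the session's code diff from its recorded edit artifacts.
        diff = reconstruct_diff(session["edits"])
        # Extract new claims from the transcript plus the diff.
        new_claims = list(update_memory(memory, session["transcript"], diff))
        # Save the updated memory before moving to the next session.
        memory = memory + new_claims
        counts.append(len(new_claims))

    return memory, counts
```

The per-step counts returned here correspond to the kind of breakdown quoted below for a single build (bootstrap plus one number per update session).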

For reference, one high-context memory build produced 37 claims across bootstrap plus three update sessions (21 + 6 + 5 + 5).

Evaluation & Results

The LLM judge reads the user-facing prompt, hidden gold guidance (judge.md) and the created final-plan.md.

We measure plan quality (LLM-judge's boolean scores across multiple dimensions in judge.md), tokens consumed, tool calls and elapsed time.
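One simple way to turn per-dimension boolean judgments into a single plan-quality number is the fraction of dimensions passed. This is an assumption for illustration: the post does not say how the judge's booleans are aggregated, and the `plan_quality` function and the dimension names in the usage example are hypothetical.

```python
def plan_quality(judgments):
    """Return the fraction of judge dimensions marked True.

    `judgments` maps a dimension name to the judge's boolean verdict.
    """
    if not judgments:
        raise ValueError("no judge dimensions provided")
    for name, verdict in judgments.items():
        if not isinstance(verdict, bool):
            raise TypeError(f"dimension {name!r} must be a bool")
    return sum(judgments.values()) / len(judgments)


# Hypothetical dimension names, for illustration only.
example = {
    "identifies_correct_subsystem": True,
    "accounts_for_prior_decision": True,
    "covers_edge_cases": False,
    "lists_files_to_change": True,
}
print(plan_quality(example))  # 0.75
```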

Pilot runs used gpt-5.4 for planning and judging. Results are single-run per arm unless noted; baseline trajectories can be noisy on identical prompts.

Across the selected top 10 tasks:

Cost 43% less

Baseline $12.34

Greplica $7.09

Time taken 26% less

Baseline 59.7 min

Greplica 43.9 min

Tool calls 36% fewer

Baseline 694

Greplica 447
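The headline percentages follow directly from the totals above. The short check below recomputes them; the function and variable names are ours, and the only inputs are the baseline and Greplica totals quoted in this section.

```python
def reduction_pct(baseline, treatment):
    """Relative reduction of `treatment` versus `baseline`, in percent."""
    return (baseline - treatment) / baseline * 100


# (baseline, Greplica) totals across the 10 selected tasks.
totals = {
    "cost_usd": (12.34, 7.09),
    "time_min": (59.7, 43.9),
    "tool_calls": (694, 447),
}

reductions = {
    name: round(reduction_pct(base, grep))
    for name, (base, grep) in totals.items()
}
print(reductions)  # {'cost_usd': 43, 'time_min': 26, 'tool_calls': 36}
```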

Per task:

Task	Cost (Baseline)	Cost (Greplica)	Cost reduction	Time (Baseline)	Time (Greplica)	Time reduction	Tool calls (Baseline)	Tool calls (Greplica)
Moltis onboarding provider feedback	$2.42	$0.74	70%	408s	228s	44%	90	47
Gemini Voyager sync auth bug	$1.06	$0.50	53%	373s	233s	38%	70	33
Gemini Voyager AI folder organize	$1.74	$0.83	53%	436s	275s	37%	95	48
Gemini Voyager cross browser fork	$1.19	$0.69	42%	406s	267s	34%	63	49
Gemini Voyager chrome store restored	$1.00	$0.65	35%	292s	280s	4%	49	35
IPTVnator playback layout	$1.71	$1.14	33%	398s	366s	8%	93	80
Gemini Voyager quote reply IME	$0.41	$0.29	30%	187s	183s	2%	21	14
Gemini Voyager changelog badge	$0.61	$0.44	27%	310s	203s	35%	54	37
Gemini Voyager i18n bundle	$0.63	$0.50	21%	259s	228s	12%	42	29
IPTVnator add playlist entrypoint	$1.57	$1.32	16%	513s	373s	27%	117	75
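The table reports percentage reductions for cost and time but not for tool calls. The sketch below derives the per-task tool-call reduction from the table's own numbers and checks that the columns sum to the aggregate totals; variable names are ours, and no data beyond the table is used.

```python
# (baseline, Greplica) tool calls per task, copied from the table above.
tool_calls = {
    "Moltis onboarding provider feedback": (90, 47),
    "Gemini Voyager sync auth bug": (70, 33),
    "Gemini Voyager AI folder organize": (95, 48),
    "Gemini Voyager cross browser fork": (63, 49),
    "Gemini Voyager chrome store restored": (49, 35),
    "IPTVnator playback layout": (93, 80),
    "Gemini Voyager quote reply IME": (21, 14),
    "Gemini Voyager changelog badge": (54, 37),
    "Gemini Voyager i18n bundle": (42, 29),
    "IPTVnator add playlist entrypoint": (117, 75),
}


def reduction_pct(baseline, treatment):
    """Relative reduction of `treatment` versus `baseline`, in percent."""
    return (baseline - treatment) / baseline * 100


per_task = {
    task: round(reduction_pct(base, grep), 1)
    for task, (base, grep) in tool_calls.items()
}
total_baseline = sum(base for base, _ in tool_calls.values())
total_greplica = sum(grep for _, grep in tool_calls.values())

# Print tasks from largest to smallest tool-call reduction.
for task, pct in sorted(per_task.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task}: {pct}% fewer tool calls")
print(f"Total: {total_baseline} -> {total_greplica}")
```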

Readout

Greplica saved cost, tokens, tool calls, and time across the selected top 10 tasks. The strongest wins came from tasks where the missing context lived in prior sessions: onboarding/provider behavior in moltis, conversation and release behavior in gemini-voyager, and playlist/playback architecture in iptvnator.

Conclusion

The planning phase is one of the highest-touch parts of agentic software development. When the initial plan is wrong, incomplete, or based on missing context, the rest of the run compounds the error.

Human developers draw on their own memory to nudge coding agents in prompts. However, this is often insufficient, and coding agents either rediscover context through expensive exploration or miss it entirely.

Greplica gives agents a way to retrieve that memory directly, and the benefits are stark.

Our SWE-chat plan benchmark pilot shows that when agents have access to temporally valid, task-relevant persistent memory, they can plan complex coding tasks with lower cost, fewer tokens, fewer tool calls, and less time — especially on tasks where prior sessions contain the missing subsystem context.

Why `AGENTS.md` Is Not Enough

Repo-level instruction files are useful, but not a scalable memory layer.

They require manual maintenance, do not support task-specific retrieval, and do not preserve the history of engineering decisions — failed attempts, migrations, design tradeoffs, and subsystem-specific gotchas.

Greplica continuously captures context from development work, stores it in a structured graph, and retrieves the relevant subset when an agent needs it.

Future Work

Expand from ten pilot tasks to fifty-plus high-context cases with 3–5 repeated runs per arm and median reporting
Wire cost estimation for gpt-5.4 and other agent models in the harness scorer
Explore LLM-based retrieval methods beyond the current semantic-score and keyword-based retrieval
Include other sources of information (GitHub issues, PRs, PRDs) to add context

If you find this work interesting or have feedback, please find us on Discord.