dp_006 q03 (Claude Code) | Evaluation Runs

dp_006 q03 100 files multi_hop Claude Code

PR #84 (concurrent file processing) reported benchmark numbers for the 60 MB Wikipedia corpus. What hardware was used for those benchmarks in the PR, and how does the reported 60 MB wall-clock time compare with the number published in the v0.5.0 release notes for the same corpus? Are the figures consistent, and if not, what explains the difference?

Gold answer

PR #84 states the benchmarks were run on an Apple M2 Max, 12-core, 32 GB RAM, and reports 278.3 ms (mean ± 6.4 ms) from hyperfine for the 60 MB corpus with --max-concurrency=12. The v0.5.0 release notes state '280 ms' for the same corpus with --max-concurrency=8. The figures are consistent (both round to ~280 ms) but were taken with different concurrency ceilings (12 vs 8) and reported at different precision levels; the release notes round to the nearest 10 ms. The release notes benchmark table was run on an M2 Max as well (macOS 14.5, --max-concurrency=8), explaining the slight difference from the PR's 12-core run.
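
The rounding claim in the gold answer can be checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are taken from the gold answer, while the variable names and the `round_to_nearest` helper are hypothetical and not part of the evaluated project.

```python
# Figures quoted in the gold answer above.
pr_mean_ms = 278.3        # hyperfine mean reported in PR #84
pr_stddev_ms = 6.4        # hyperfine spread reported in PR #84
release_notes_ms = 280    # figure stated in the v0.5.0 release notes


def round_to_nearest(value: float, step: int) -> int:
    """Round half up to a multiple of `step` (hypothetical helper)."""
    return int(value / step + 0.5) * step


# Does the PR mean round to the release-notes figure at 10 ms precision?
rounded_pr_ms = round_to_nearest(pr_mean_ms, 10)

# Is the release-notes figure within the spread reported by the PR?
within_spread = abs(release_notes_ms - pr_mean_ms) <= pr_stddev_ms

print(rounded_pr_ms, within_spread)
```

Under these assumptions the PR mean rounds to the release-notes value, and the 1.7 ms gap sits inside the reported ± 6.4 ms spread, which is the sense in which the gold answer calls the figures consistent.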

Filesystem

Input: 270k
Output: 399
Latency: 40.2s
Cost: $4.08

#2 Ran ls /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/ 32 chars

Command Bash

ls /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/

{
  "description": "List dp_006 workspace contents"
}

Result 3 lines

agent_memory
AGENTS.md
CLAUDE.md

#3 Ran ls /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/agent_memory/ 50 chars

Command Bash

ls /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/agent_memory/

{
  "description": "List agent_memory contents"
}

Result 7 lines

code
email
issues
memory
notes
pr_threads
releases

#4 Ran ls /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/agent_memory/pr_threads/ /Users/prasanna/Doc… 1,053 chars

Command Bash

ls /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/agent_memory/pr_threads/ /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/agent_memory/releases/

{
  "description": "List PR threads and releases"
}

Result 11 lines

/Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/agent_memory/pr_threads/:
pr-007-readme-grammar-fixes.md
pr-015-first-external-typo-fix.md
pr-022-org-todo-keywords.md
pr-029-asciidoc-tables.md
pr-038-tokio-panic-fix.md
pr-044-tree-sitter-upgrade.md
pr-052-terminal-a11y.md
pr-058-error-messages-cleanup.md
pr-061-windows-line-endings-fix.md
pr-067-path-tr…

#5 Read agent_memory/pr_threads/pr-084-concurrent-processing.md 18,600 chars

Command Read

{
  "file_path": "/Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/agent_memory/pr_threads/pr-084-concurrent-processing.md"
}

Result 15 lines

# PR #84: perf(cli): concurrent file processing for batched conversions (closes #41)

**Branch:** `hiroshi-n/concurrent-processing` → `main`
**Opened:** 2026-05-30 by @hiroshi-n
**Merged:** 2026-06-08 by @lior-bz
**Diff:** 14 files changed, +423 −158
**Status:** Merged ✅ — v0.5.0

---

## Description

### Problem

Batched conversions on corpora larger than ~50MB…

#6 Read agent_memory/releases/release-v0.5.0-2026-06-08.md 15,908 chars

Command Read

{
  "file_path": "/Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/fs/dp_006/agent_memory/releases/release-v0.5.0-2026-06-08.md"
}

Result 9 lines

# kitabi v0.5.0 (2026-06-08) — Concurrent file processing

> **Breaking change:** plugin-API ABI version bumped from 1 → 2. Plugin authors must rebuild against `kitabi-plugin-api` 0.5.x. See [CHANGELOG-PLUGIN-API.md](../../CHANGELOG-PLUGIN-API.md) and the migration section below.

---

## Highlights 🚀

- **5× throughput improvement** on multi-file batches. A 60 MB Wikipedia subs…

#8 Reasoning 2,001 chars

## Hardware in PR #84

The PR's benchmarks ran on an **Apple M2 Max, 12 physical cores, 32 GB RAM, macOS Sequoia 15.3**, with `--max-concurrency=12`, on a 60 MB Wikipedia markdown subset of **3,142 files** (median 19 KB, largest 1.2 MB). Methodology: `hyperfine`, 10 warm runs after 3 cold discards. Reported time: **278.3 ms ± 6.4 ms** (rounded to 280 ms in the PR body), with the cold-room caveat that Hiroshi ran the final bench at 06:30 JST on a Sunday with no Spotlight/background tasks.

## v0.5.0 release-notes figure for the same corpus

The release notes (`releases/release-v0.5.0-2026-06-08.md`, "Benchmark detail") report the Large / Wikipedia-subset row as **280 ms** on v0.5.0 vs 7,400 ms on v0.4.2 — a 26.4× speedup. But the surrounding caption disagrees with the PR on several environment details:

| Field | PR #84 | v0.5.0 notes |
|---|---|---|
| Wall-clock (60 MB) | **280 ms** | **280 ms** |
| `--max-concurrency` | 12 | 8 (table caption) |
| File count in 60 MB corpus | 3,142 | 1,200 |
| macOS version | Sequoia 15.3 | 14.5 |
| Runs | 10 warm + 3 cold discards | median of 5, `--warmup 2` |
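
The 26.4× speedup quoted for the release-notes row follows directly from the two wall-clock times. A minimal check, using only the figures cited above (the variable names are illustrative):

```python
# Wall-clock times for the Large / Wikipedia-subset row, as cited above.
v042_ms = 7400   # v0.4.2
v050_ms = 280    # v0.5.0

# Speedup is the ratio of the old time to the new time.
speedup = v042_ms / v050_ms

print(f"{speedup:.1f}x")
```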

## Consistency

The **headline number is identical (280 ms)** but the **measurement conditions reported around it are not**. Same hardware family (M2 Max), but different concurrency setting, different corpus file count for the "same" 60 MB Wikipedia subset, different macOS version, and a different `hyperfine` run plan.

The most plausible explanation — given the surrounding evidence in the docs themselves — is that the release-notes table reused the 280 ms cell from the PR while pairing it with a separately-drafted environment caption. Two specific hints support this:

1. The release notes' own "Notes" line says "the 26× gain ... is partly from the AST flattening fix (#41) and partly from parallelism … the flattening fix alone gave ~3.2×; concurrency gave the remainder." The PR thread frames the same 26× as v0.4.x→PR #84 (sequential→concurrent dispatch) with the bumpalo step…

Answer

## Hardware in PR #84

The PR's benchmarks ran on an **Apple M2 Max, 12 physical cores, 32 GB RAM, macOS Sequoia 15.3**, with `--max-concurrency=12`, on a 60 MB Wikipedia markdown subset of **3,142 files** (median 19 KB, largest 1.2 MB). Methodology: `hyperfine`, 10 warm runs after 3 cold discards. Reported time: **278.3 ms ± 6.4 ms** (rounded to 280 ms in the PR body), with the cold-room caveat that Hiroshi ran the final bench at 06:30 JST on a Sunday with no Spotlight/background tasks.

## v0.…

Judge: covers all asked facts

SMFS Won

Input: 166k
Output: 104
Latency: 24.3s
Cost: $2.50

#2 Ran smfs grep "PR #84 concurrent file processing benchmark" /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/s… 6,648 chars

Command Bash

smfs grep "PR #84 concurrent file processing benchmark" /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/smfs/dp_006/agent_memory/

{
  "description": "Semantic search for PR #84 benchmarks"
}

Result 7 lines

# supermemory semantic search — 10 results for "PR #84 concurrent file processing benchmark"
# searches by meaning across files in this container. usage:
#   grep "natural language query"          search all files
#   grep "query" path/to/dir/              search within directory
# output: <filepath>:<line_start>-<line_end>:<chunk> (chunk text is verbatim from the file)

/memory/community/hiroshi-…

#3 Ran smfs grep "v0.5.0 release notes Wikipedia 60 MB" /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/smfs/dp_… 5,984 chars

Command Bash

smfs grep "v0.5.0 release notes Wikipedia 60 MB" /Users/prasanna/Documents/supermemoryai/supermemoryfs/.scratch/eval/eval/workspace/smfs/dp_006/agent_memory/

{
  "description": "Semantic search for v0.5.0 release notes"
}

Result 7 lines

# supermemory semantic search — 10 results for "v0.5.0 release notes Wikipedia 60 MB"
# searches by meaning across files in this container. usage:
#   grep "natural language query"          search all files
#   grep "query" path/to/dir/              search within directory
# output: <filepath>:<line_start>-<line_end>:<chunk> (chunk text is verbatim from the file)

/releases/release-v0.5.0-2026-06-…

#5 Reasoning 2,001 chars

## PR #84 hardware

From `pr_threads/pr-084-concurrent-processing.md`: **Apple M2 Max, 12 physical cores, 32 GB RAM, macOS Sequoia 15.3.** Corpus described as a 60 MB Wikipedia markdown subset, **3,142 files** (median 19 KB, largest 1.2 MB). Methodology: 10 warm runs after 3 cold discards via hyperfine; flame graph captured with `--max-concurrency=12`.

Reported 60 MB wall-clock with PR #84 applied: **280 ms** (vs. 3.1 s post-bumpalo, 7.4 s on v0.4.x baseline).

## v0.5.0 release notes

From `releases/release-v0.5.0-2026-06-08.md`: same M2 Max class machine, but the bench table header reads **macOS 14.5, `--max-concurrency=8`**, and the "Large (Wikipedia subset)" row lists **1,200 files / 60 MB**, with the v0.5.0 time at **280 ms** (vs. 7,400 ms on v0.4.2, 26.4× speedup). Times are described as median of 5 runs with `hyperfine --warmup 2`.

## Consistency

**The headline wall-clock figure agrees: 280 ms for the 60 MB corpus, with the same ~26× speedup vs. v0.4.x.** That's the load-bearing number and it matches.

The surrounding metadata does **not** fully line up, though:

| Field | PR #84 | v0.5.0 release notes |
|---|---|---|
| Wall-clock @ 60 MB | 280 ms | 280 ms ✅ |
| Hardware | M2 Max, 12-core, 32 GB | M2 Max (unspecified spec) |
| OS | macOS Sequoia 15.3 | macOS 14.5 |
| File count in 60 MB corpus | 3,142 | 1,200 |
| `--max-concurrency` | 12 (flame graph cmd) | 8 (bench table) |
| Run count | 10 warm + 3 cold discards | median of 5, `--warmup 2` |
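
The comparison in the table above can also be expressed as a field-by-field diff. This is a sketch only: the values are copied from the table, the dictionary keys are hypothetical names chosen for illustration, and the hardware row is left out because the release notes do not give a full spec to compare against.

```python
# Benchmark metadata as listed in the table above (keys are illustrative).
pr_84 = {
    "wall_clock_ms": 280,
    "os": "macOS Sequoia 15.3",
    "file_count": 3142,
    "max_concurrency": 12,
    "run_plan": "10 warm + 3 cold discards",
}
release_notes = {
    "wall_clock_ms": 280,
    "os": "macOS 14.5",
    "file_count": 1200,
    "max_concurrency": 8,
    "run_plan": "median of 5, --warmup 2",
}

# Collect every field whose value differs between the two documents.
mismatched = [key for key in pr_84 if pr_84[key] != release_notes[key]]

print(mismatched)
```

Only the headline wall-clock time matches; every supporting parameter compared here differs between the two documents.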

So the figures are **consistent on the headline result but inconsistent on the supporting parameters** — different OS minor version, different corpus file count (same 60 MB total, so subsampling granularity differs), different concurrency cap, different hyperfine recipe. The two documents were authored separately (PR body finalized 2026-06-02, release notes shipped 2026-06-08) and appear to reflect distinct benchmark runs against the same workload class rather than one canonical run quoted twice. There's also a sepa…

Answer

## PR #84 hardware

From `pr_threads/pr-084-concurrent-processing.md`: **Apple M2 Max, 12 physical cores, 32 GB RAM, macOS Sequoia 15.3.** Corpus described as a 60 MB Wikipedia markdown subset, **3,142 files** (median 19 KB, largest 1.2 MB). Methodology: 10 warm runs after 3 cold discards via hyperfine; flame graph captured with `--max-concurrency=12`.

Reported 60 MB wall-clock with PR #84 applied: **280 ms** (vs. 3.1 s post-bumpalo, 7.4 s on v0.4.x baseline).

## v0.5.0 release notes

From `re…

Judge: covers all asked facts