SMFS Technical Report

May 2026

SMFS: Memory as a Filesystem

Supermemory Research

LLM agents typically gain persistent memory through one of two paradigms: a filesystem of past notes the agent reads, writes, and searches with standard tools, or a vector index that returns semantically ranked chunks through a single search call. Each is effective in isolation and limited at scale. Filesystems give agents structural context but degrade as corpora grow beyond what filenames can signpost. Vector indices give semantic reach but return chunks severed from their surrounding context, leaving multi-hop retrieval without an anchor to extend from. Both share a failure mode: every retrieval inflates the agent's context window with intermediate paths, partial reads, and overlapping chunks, degrading reasoning across multi-turn tasks.

We present SMFS, a system that mounts a Supermemory container as a real, kernel-level filesystem with semantic search wired into the path operations agents already perform. Inside an SMFS mount, cat reads a memory, smfs grep runs a hybrid semantic query scoped to the current path, edits propagate back to the cloud through a bounded per-path push queue, and /profile.md exposes generated static and dynamic memories as a single filesystem read. A three-layer architecture (kernel mount, local SQLite cache, cloud) serves filesystem operations locally while four background loops reconcile remote state. The same surface ships across runtimes: FUSE on Linux, NFS on macOS, and a virtual-bash SDK for serverless environments. Alongside the system, we release xAFS, a benchmark of 110 audited questions over 13 cross-context corpora ranging from 5 to ~10,000 files, designed to measure the cost curve of agentic retrieval at scale. Across two frontier agents over n=440 trials per condition, SMFS reduces total tokens consumed by 53.8% and tokens per correct answer by 53.1% relative to a filesystem-native baseline, at near-parity pass rate.

Leaderboard

Cumulative tokens across all 110 xAFS questions, summed per agent-condition:

FS × gpt-5.5 [xhigh] (Codex): 84M ±11M
SMFS × gpt-5.5 [xhigh] (Codex): 48M ±8.6M
FS × claude-opus-4.7 [max] (Claude): 72M ±11M
SMFS × claude-opus-4.7 [max] (Claude): 24M ±1.9M

Introduction

Using a filesystem or a vector index in conjunction with an LLM agent is a common paradigm for providing persistent memory beyond the context window. The first treats memory as a directory of past notes the agent reads, writes, and searches with standard tools. The second treats memory as a vector index in the style of retrieval-augmented generation (RAG): a corpus indexed for semantic similarity, queried through a single search call returning a ranked list of chunks. Each is effective in isolation, and each is fundamentally limited.

The filesystem approach is well-suited only to small, organized corpora. When a memory directory contains a few dozen well-named documents, the agent can list it, identify the relevant file by name, and grep for the literal string it wants. As the corpus grows, this assumption breaks. Filenames stop being reliable signposts, hierarchies decay as memories accumulate, and the agent walks an ever-larger tree, opening files speculatively and consuming context at every step. The failure is not that the filesystem cannot answer the question; it is that the agent cannot ask the right question.

The vector-index approach inverts the problem. A single semantic query can recover content on meaning alone, returning passages from deep within the corpus that no grep could have located. What comes back, however, is a ranked list of chunks bounded by a top-K cutoff: excerpts that arrive without their neighbors, severed from the surrounding paragraphs and related files that would otherwise let the agent place them in context. Multi-hop retrieval, in which the next query depends on context from the previous result, suffers in particular: the agent has no anchor to extend from.

A common factor compounds both failure modes: the growth of the agent's context window over a multi-turn task. As the agent issues retrievals, its context fills with intermediate results (file paths and partial reads from a filesystem walk, overlapping chunks from a vector query), many of which are tangential or redundant. The bottleneck of agentic memory is therefore not the speed of any single retrieval. It is the number of retrievals the agent must issue before it has gathered enough context to answer. We measure this directly: across the 5- to 9,988-file range of our benchmark, per-question token cost grows ~7× for a filesystem-only agent against ~4× for SMFS.

SMFS is built on top of Supermemory, a context cloud for agents that turns raw user and organizational data into persistent, queryable memory. Supermemory ingests documents into typed containers, extracts text from PDFs, web pages, images, audio, video, and raw files, and indexes the result with hybrid lexical-and-semantic retrieval, reranking, and structured context. Unlike a conventional vector database that stores and returns isolated chunks, Supermemory is powered by a custom user-understanding model and a graph engine in which memories, documents, entities, profiles, and retrieval results live in one ontology-aware structure. The graph supports update, merge, contradiction handling, inference, forgetting, and low-latency traversal, so memory can evolve across sessions rather than being appended as disconnected embeddings. A pipeline over the ingested content distills two kinds of generated memories per container: static memories (durable facts the agent should keep across sessions) and dynamic memories (newer contextual signals that matter for the current work), summarized together into a single profile. Every Supermemory primitive (document, container, memory, profile, graph traversal, and search) gets a direct filesystem surface in the mount, which is what the rest of this paper describes.

A recent line of work converges on the same observation from the runtime side: agents already operate through files and shells, so memory should look like a filesystem too. Files-Are-All-You-Need argues that file-like abstractions give agentic systems a composable and auditable control surface; Vercel's bash-tool and just-bash ship that idea as a runtime, keeping context in a filesystem and exposing bash-like commands such as find, cat, and grep so the model can retrieve smaller slices instead of loading everything into the prompt. Turso's AgentFS is the closest systems reference: an agent runtime stored in SQLite and exposed as a POSIX-like filesystem with key-value state and tool-call history, so agent state is portable, queryable, and auditable. SMFS shares the file-shaped surface; what differs is the object on the other side of the mount: a remote memory backend with extraction, embedding, hybrid search, and a generated profile, rather than a local agent runtime.

We present SMFS, a system that exposes a Supermemory cloud container as a real, mountable filesystem with semantic search wired into the path operations the agent already performs, collapsing the retrieval count by letting a single search land directly on a path the agent can read. We also release xAFS, an open benchmark designed to make the cost curve measurable.

Dataset

No public dataset, to our knowledge, measures interactive retrieval cost over a cross-context corpus as it scales. Conversational long-term memory benchmarks (LongMemEval) measure recall across chat sessions, not file-system retrieval. Long-context comprehension benchmarks (NoLiMa, HELMET, InfiniteBench) concatenate content into a single window and measure in-window QA; no retrieval is performed. Office-workflow benchmarks (OfficeBench, TheAgentCompany) score workflow correctness on fixed-size corpora. Code-generation benchmarks like SWE-bench operate on source repositories rather than the cross-context mix of email, transcripts, and notes that a real working folder contains.

To fill this gap, we built and release xAFS: a benchmark of 110 audited questions over 13 cross-context personal-corpus snapshots ranging from 5 to 9,988 files, totaling 837 MB across 19,169 files. Each snapshot is a self-contained synthetic persona spanning a distinct domain (management consulting, clinical longitudinal records, civil litigation, open-source maintainership, investigative journalism, executive archive at Series C scale, and others) with a realistic file-type mix: email threads, Slack-style exports, meeting transcripts, structured memos, journal entries, and transcribed audio and image content.

Corpus sizes are roughly logarithmically spaced (5, 10, 20, 30, 50, 100, 200, 299, 480, 991, 1,998, 4,998, 9,988 files), making the dataset suitable for plotting retrieval cost as a function of scale. Questions cover three retrieval-shape families: single_hop (34 questions; answer derivable from a single file), multi_hop (50 questions; answer combines two or more files), and format_spanning (26 questions; answer requires reading at least two distinct content formats). Each question carries a stable id, a family, a prompt, a list of gold_file_ids pointing to files in which the answer is derivable, and a canonical gold_answer. Every gold answer is independently audited against the visible corpus; the scenario specifications and fact registries used during generation are excluded from the public release, so the file corpus is the only source of ground truth available to an evaluating agent.

The dataset is released at huggingface.co/datasets/supermemory/xAFS. Full benchmark design, the seven-stage generation pipeline, audit methodology, and known limitations are documented in a separate technical report.

Architecture

SMFS is built around a simple constraint: agent memory should behave like files without giving up the retrieval behavior of a semantic index. A normal filesystem gives agents stable paths, incremental reads, edits, renames, and familiar tools. A memory backend gives them extraction, embedding, profile generation, and search over meaning. The architecture exists to keep those two views coherent even though the two sides run at different speeds.

We call this property path-coherent memory. A path in the mount is not only a local filename; it is also the handle used for cloud storage, sync reconciliation, search scoping, transcription placement, and agent guidance. This makes the path the shared unit of coordination across the kernel, the local cache, and Supermemory's API. The rest of the system is designed to preserve that invariant under ordinary filesystem behavior: editors that save through temp-file renames, kernels that split writes into multiple flushes, agents that search before they read, and a backend that accepts a file before it has finished extraction and indexing.

Figure 1. SMFS architecture overview. SMFS keeps the kernel-facing filesystem contract local, then reconciles cloud-facing memory state through explicit queues and watermarks.

Filesystem Semantics First

The core implementation is not organized around API calls. It starts with a virtual filesystem interface: lookup, read, write, rename, unlink, readdir, stat, symlink, and related POSIX-shaped operations. Both mount backends call into this same interface, and the SQLite-backed implementation satisfies it. This separation matters because the hard contract is not "can we upload a document?" but "can this directory behave enough like a real filesystem that existing tools and agents stop treating it as special?"

SMFS keeps kernel-facing concerns at the edge. On Linux, the mount is served through FUSE. On macOS, where shipping a FUSE filesystem would require macFUSE, SMFS instead runs an NFSv3 server on localhost and asks the native operating-system NFS client to mount it. The two paths differ at the protocol boundary, but converge immediately into the same filesystem trait. The cloud client never needs to know whether a read came from FUSE, NFS, an editor, or an agent.

The Cache Is The Local Truth

Every read served to the kernel comes from a local SQLite cache. This is a deliberate choice. If ordinary file reads depended directly on the network, the mount would inherit the latency and partial-failure behavior of the backend. Instead, the cache stores inodes, directory entries, file chunks, symlink targets, remote document IDs, sync watermarks, and the persistent push queue. SQLite runs in WAL mode with a single serialized connection per mount, giving the daemon a small, durable state machine rather than a pile of in-memory bookkeeping.

The cache is passive: it does not call the API and it does not own background work. Filesystem operations mutate local state; the sync engine observes that state and reconciles it with Supermemory. A write completes once bytes are committed locally, while upload, extraction, embedding, indexing, retry, and conflict handling proceed outside the kernel path.

Two timestamps make this possible. dirty_since records when an inode was last changed locally, so a pull cannot overwrite an edit that has not finished pushing. mirrored_updated_at records the remote version the cache believes it has mirrored, so the pull side can skip work that has already been incorporated. Together they let SMFS distinguish three states that otherwise look similar: local data newer than cloud data, cloud data newer than local data, and local data already synchronized.
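The three states can be made concrete with a small sketch. It is illustrative only: the `InodeSync` record, the `classify` function, and the use of numeric timestamps are assumptions rather than the actual cache schema, and it simplifies by treating any inode with a non-empty dirty_since as protected local work.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SyncState(Enum):
    LOCAL_NEWER = "local data newer than cloud data"
    REMOTE_NEWER = "cloud data newer than local data"
    IN_SYNC = "local data already synchronized"


@dataclass
class InodeSync:
    # Set when the inode changes locally; None once the push has succeeded.
    dirty_since: Optional[float]
    # Remote version the cache believes it has mirrored (None if never mirrored).
    mirrored_updated_at: Optional[float]


def classify(inode: InodeSync, remote_updated_at: float) -> SyncState:
    """Classify an inode against the remote document's update time."""
    if inode.dirty_since is not None:
        # Unpushed local edit: a pull must not overwrite it (simplifying assumption).
        return SyncState.LOCAL_NEWER
    if inode.mirrored_updated_at is None or remote_updated_at > inode.mirrored_updated_at:
        return SyncState.REMOTE_NEWER
    return SyncState.IN_SYNC


print(classify(InodeSync(dirty_since=100.0, mirrored_updated_at=50.0), 80.0).name)
print(classify(InodeSync(dirty_since=None, mirrored_updated_at=50.0), 80.0).name)
print(classify(InodeSync(dirty_since=None, mirrored_updated_at=80.0), 80.0).name)
```

Only the second case would lead the pull side to rewrite local content; the first is left to the push queue and the third is skipped as already mirrored.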

The Daemon Owns the Boundary

A mounted container is owned by a long-running daemon. The CLI resolves credentials, chooses the platform backend, starts or foregrounds the daemon, and communicates with it through a small Unix-socket IPC protocol. Each mount has its own cache scoped by organization and container tag, plus a pid file, socket, and log file under the platform cache directory.

On startup, the daemon validates the session, opens or creates the cache, warms /profile.md, performs the initial pull, starts the sync loops, mounts the filesystem, and writes a small marker file near the mount so commands can rediscover the container tag and mount path. On shutdown, it gives the push queue time to drain before releasing the mount. If it cannot drain fully, the queue remains in SQLite and resumes on the next mount.

The result is a filesystem illusion with explicit boundaries. The kernel sees a normal directory. The agent sees files and grep. The backend sees documents, filepaths, memory extraction, profile generation, and hybrid search. The daemon's job is to keep those views close enough that each side can use its native interface without learning the others.

Bidirectional Sync

Bidirectional sync in SMFS starts from an uneven contract. To the operating system, the mount has to behave like a local filesystem: writes return, renames take effect, deletes disappear, reads see the local tree immediately. To Supermemory, the same file is a document moving through an asynchronous memory pipeline: upload, extraction, transcription, chunking, embedding, indexing, and final search visibility.

The SQLite cache is the source of truth for the mounted tree. Supermemory is the shared remote state that the cache converges with. This is eventual consistency in filesystem form: local paths stay coherent now, while cloud memory catches up through bounded background work. The work is decomposed into four background loops.

Figure 2. Bidirectional sync architecture. Local file operations enter a per-file push queue. Remote state returns through delta pull, deletion scan, and processing polling.

The Write Path: Local Commits and the Push Queue

A local write is committed to the cache before it is sent to the cloud. SqliteFile::write splits bytes into chunks, updates the inode size and timestamps, and marks the inode with dirty_since. From the kernel's point of view, the write has completed against a normal file. The dirty marker tells the pull reconciler that this inode has local work which has not safely become remote state yet; a later remote document should not overwrite it.

On flush, SMFS converts the cached mutation into cloud-facing intent. The operation becomes a push_queue row: create, update, upload binary, rename, or delete. The filesystem path is the queue key, so the queue preserves the same unit of identity the user and agent see in the mount.

The push queue is persistent, path-scoped, and latest-wins. Each path has two effective slots: one main slot that may be inflight, and one pending slot for the newest change that arrives while the first request is still being processed. If a file is saved repeatedly before the worker claims it, only the latest version remains. This bounds write amplification per file: a burst of saves becomes at most one active request and one remembered follow-up, while different files can still move in parallel.
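A minimal in-memory sketch of the two-slot, latest-wins behavior follows. The real queue is persisted in SQLite; the `Slots` and `PushQueue` classes and their method names are hypothetical and only illustrate the bounding argument.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Slots:
    main: Optional[str] = None     # change waiting to be sent, or being sent
    inflight: bool = False         # True once the worker has claimed `main`
    pending: Optional[str] = None  # newest change that arrived while inflight


class PushQueue:
    def __init__(self) -> None:
        self.paths: Dict[str, Slots] = {}

    def enqueue(self, path: str, payload: str) -> None:
        slots = self.paths.setdefault(path, Slots())
        if slots.main is not None and slots.inflight:
            slots.pending = payload  # overwrite: keep only the latest follow-up
        else:
            slots.main = payload     # not claimed yet: replace in place

    def claim(self, path: str) -> Optional[str]:
        slots = self.paths.get(path)
        if slots is None or slots.main is None or slots.inflight:
            return None
        slots.inflight = True
        return slots.main

    def complete(self, path: str) -> None:
        slots = self.paths[path]
        # Promote the pending change (if any) into the main slot.
        slots.main, slots.pending, slots.inflight = slots.pending, None, False
        if slots.main is None:
            del self.paths[path]


q = PushQueue()
q.enqueue("notes/a.md", "v1")
q.enqueue("notes/a.md", "v2")   # replaces v1 before any claim
first = q.claim("notes/a.md")   # v2 goes inflight
q.enqueue("notes/a.md", "v3")
q.enqueue("notes/a.md", "v4")   # replaces v3 in the pending slot
q.complete("notes/a.md")
second = q.claim("notes/a.md")  # only v4 is ever sent as the follow-up
```

Four saves result in two requests (v2, then v4), which is the "one active request and one remembered follow-up" bound for a single path.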

The push worker wakes on notification, falls back to a short poll, claims eligible queue rows by setting inflight_started_at, and sends the corresponding request. Failures stay local to the row: transient failures clear the inflight marker, increment attempt, and delay the next claim with backoff. A successful push clears dirty_since, records remote identity, and updates mirror state. For updates and renames, SMFS waits for the previous server-side processing step to finish before sending the next mutation to the same remote document.
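The failure handling described above can be sketched as follows. The text does not specify the backoff schedule, so the capped exponential delay, the `not_before` field, and the function names here are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueRow:
    path: str
    attempt: int = 0
    inflight_started_at: Optional[float] = None
    not_before: float = 0.0  # hypothetical: earliest time the row may be claimed again


def next_claim_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Capped exponential backoff (assumed schedule): 1s, 2s, 4s, ... up to cap_s."""
    if attempt < 1:
        raise ValueError("attempt counts start at 1")
    return min(cap_s, base_s * 2 ** (attempt - 1))


def on_transient_failure(row: QueueRow, now: float) -> None:
    """Keep the failure local to the row: release it, count it, delay the retry."""
    row.inflight_started_at = None
    row.attempt += 1
    row.not_before = now + next_claim_delay(row.attempt)


row = QueueRow(path="notes/a.md", inflight_started_at=10.0)
on_transient_failure(row, now=10.0)
```

Because the delay is stored on the row, one failing path does not hold back claims for other paths.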

Processing State

Supermemory can accept a document before it is finished becoming memory. The mounted file may already exist locally, while the remote document is still extracting, transcribing, embedding, or indexing. SMFS keeps that intermediate state visible in the cache instead of making the original filesystem call wait for it.

The inflight poller looks at local inodes whose fs_remote.last_status has not reached done, then checks the remote processing set in bulk. This matters when many files are uploaded together: the daemon does not create one polling loop per file. It asks once, compares the returned processing IDs against the locally tracked IDs, and only fetches final document state when a document leaves processing.
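The bulk check reduces to a set difference. In this sketch `list_processing` and `fetch_document` are hypothetical stand-ins for the backend calls, not real Supermemory API names.

```python
from typing import Callable, Dict, Iterable, Set


def poll_once(
    tracked_ids: Set[str],
    list_processing: Callable[[], Iterable[str]],
    fetch_document: Callable[[str], Dict],
) -> Dict[str, Dict]:
    """One polling pass: a single bulk request, then fetches only for finished docs."""
    still_processing = set(list_processing())          # asked once, for all files
    finished = sorted(tracked_ids - still_processing)  # tracked IDs that left processing
    return {doc_id: fetch_document(doc_id) for doc_id in finished}


# Fake backend for illustration: d2 is still processing, d9 belongs to someone else.
fetch_calls = []


def fake_fetch(doc_id: str) -> Dict:
    fetch_calls.append(doc_id)
    return {"id": doc_id, "status": "done"}


result = poll_once({"d1", "d2", "d3"}, lambda: ["d2", "d9"], fake_fetch)
```

The number of follow-up fetches scales with the documents that finished in this pass, not with the number of files being tracked.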

When processing completes, SMFS records mirrored_updated_at and last_status.

The Read Path: Delta Pull and Deletion Scan

The delta-pull loop is the return path. It lists Supermemory documents ordered by updatedAt descending and stops at the last watermark. Each newer document is handed to reconcile_one, which maps the remote document back into the mounted path tree. Reconciliation is careful about file type and local dirtiness: if the inode is dirty and the local write is newer than the remote update, the pull skips it. If the remote document is still processing, SMFS defers materializing partial content. If the document is done, text-like files can rewrite local chunks, while binary and media files keep their raw identity and expose derived text or transcription surfaces separately. The watermark advances only after the batch has been considered.
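The watermark logic can be sketched as below. It assumes numeric `updatedAt` values and treats a document whose timestamp equals the watermark as already incorporated; both are assumptions, and `delta_pull` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple


def delta_pull(docs_newest_first: Iterable[Dict], watermark: float) -> Tuple[List[Dict], float]:
    """Collect documents newer than the watermark from a newest-first listing."""
    newer: List[Dict] = []
    for doc in docs_newest_first:
        if doc["updatedAt"] <= watermark:
            break  # the listing is descending, so everything after this is older
        newer.append(doc)
    # The caller should persist the new watermark only after reconciling `newer`.
    new_watermark = max([watermark] + [d["updatedAt"] for d in newer])
    return newer, new_watermark


listing = [
    {"id": "c", "updatedAt": 30.0},
    {"id": "b", "updatedAt": 20.0},
    {"id": "a", "updatedAt": 10.0},
]
batch, advanced = delta_pull(listing, watermark=15.0)
```

Returning the candidate watermark instead of storing it inside the function mirrors the rule that the watermark advances only after the batch has been considered.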

Updates have timestamps; deletions are absence. A document removed from Supermemory does not appear as a newer document in the normal delta stream, so a separate deletion-scan loop builds the current remote ID set and compares it with local fs_remote rows. If a known remote document is gone, SMFS applies the local deletion as long as the inode is not carrying unsynced local work.
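A sketch of the deletion scan, assuming the local fs_remote rows are available as a path-to-document-ID mapping and dirty inodes as a set of paths (both representations are assumptions):

```python
from typing import Dict, List, Set


def deletion_scan(
    local_remote_ids: Dict[str, str],
    dirty_paths: Set[str],
    current_remote_ids: Set[str],
) -> List[str]:
    """Paths to delete locally: remote document gone, and no unsynced local work."""
    return sorted(
        path
        for path, doc_id in local_remote_ids.items()
        if doc_id not in current_remote_ids and path not in dirty_paths
    )


to_delete = deletion_scan(
    local_remote_ids={"/a.md": "d1", "/b.md": "d2", "/c.md": "d3"},
    dirty_paths={"/b.md"},        # /b.md has an edit that has not finished pushing
    current_remote_ids={"d1"},    # d2 and d3 no longer exist remotely
)
```

Here `/c.md` is removed, while `/b.md` is kept because it still carries local work.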

Cross-Mount Convergence

The same Supermemory container can be mounted in more than one sandbox. Each mount has its own SQLite cache, push queue, and four sync loops. The shared state is the container.

If Mount A edits a file, the change first becomes local cache state in A, then a queued push, then a Supermemory document update. Mount B later sees that update through delta pull and reconciles it into its own cache. Local work is accepted immediately, remote memory becomes the convergence point, and dirty local files keep their queued work protected until their own push completes.

File Types and Transcriptions

Agents already know how to search and read text files. SMFS extends that workflow to PDFs, images, audio, video, and webpages by keeping the original file at its path and materializing extracted text as a derived sibling next to it:

research-paper.pdf
research-paper.pdf.pdf-transcription.md

meeting.mp3
meeting.mp3.audio-transcription.md

diagram.png
diagram.png.image-transcription.md

This convention is deliberately filesystem-shaped. The PDF remains a PDF. The audio file remains audio. The extracted language appears beside it, with a predictable suffix. An agent can search the container semantically, then read the sibling markdown file for the surrounding text instead of loading the original binary into context or choosing a parser itself.
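The naming rule can be sketched for the three file kinds shown above. Selecting the suffix by file extension is an assumption (the backend may classify by content type), and video and webpage suffixes are omitted because the text does not state them.

```python
from pathlib import PurePosixPath
from typing import Optional

# Only the kinds that appear in the listing above; the extension keys are an assumption.
KIND_BY_EXTENSION = {
    ".pdf": "pdf-transcription",
    ".mp3": "audio-transcription",
    ".png": "image-transcription",
}


def sibling_path(source: str) -> Optional[str]:
    """Return the derived transcription path next to `source`, or None if unmapped."""
    kind = KIND_BY_EXTENSION.get(PurePosixPath(source).suffix.lower())
    return f"{source}.{kind}.md" if kind else None


print(sibling_path("research-paper.pdf"))
print(sibling_path("notes/meeting.mp3"))
```

Because the sibling name is a pure function of the source path, an agent can derive it without listing the directory.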

Sibling Lifecycle

A transcription sibling is derived from its source file. It is not treated as a separate user-authored document.

When the source is renamed, the sibling is renamed with it. When the source is deleted, the sibling is removed with it. When Supermemory produces newer extracted text, SMFS rewrites the sibling in place. The source object and its readable surface stay together as the directory changes.

This also gives processing failures a visible filesystem shape. If extraction fails, SMFS places an error sibling next to the source file, so the mount shows why the readable surface is missing instead of hiding that state in daemon logs. The model avoids a common ambiguity in memory filesystems: whether a file path names the original artifact or the processed memory produced from it. In SMFS, the source path remains canonical.

Search Across Types

The sibling model matters because agents search before they read. A semantic search can match content inside a PDF, image, audio recording, video, or webpage because Supermemory indexed the extracted content. SMFS then exposes that same content locally as markdown.

Search finds the relevant source path. The agent reads the sibling file when it needs the extracted text. The user still sees an ordinary directory, but the directory now contains the surfaces an agent needs to work across mixed file types.

Profile and Memory Scope

SMFS gives generated memory a stable file path: /profile.md. The file appears at the root of every mounted container, including an otherwise empty mount. It is virtual, read-only, and generated from Supermemory's profile endpoint. Instead of asking the agent to learn a profile tool, SMFS puts the profile where the agent already looks: a file it can open.

The rendered file has a fixed markdown shape:

# Memory Profile
# This file is auto-generated from
# your memories. It is not editable.
# To update, modify the source files
# that contain this information.

## Core Knowledge
- ...

## Recent Context
- ...

Core Knowledge comes from static memories: durable facts that should remain useful across sessions. Recent Context comes from dynamic memories: newer signals that may matter for the current work. Writes to /profile.md are rejected by the mount because the profile is a view over memories, not the source of truth; to change what appears, the user or agent changes the source files that generate memories.
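A sketch of how the fixed shape could be rendered from the two memory lists. `render_profile` is a hypothetical helper, not the mount's actual renderer, and the exact blank-line placement is an assumption.

```python
from typing import List

HEADER = [
    "# Memory Profile",
    "# This file is auto-generated from",
    "# your memories. It is not editable.",
    "# To update, modify the source files",
    "# that contain this information.",
]


def render_profile(static_memories: List[str], dynamic_memories: List[str]) -> str:
    """Render static memories under Core Knowledge, dynamic ones under Recent Context."""
    lines = list(HEADER)
    lines += ["", "## Core Knowledge"] + [f"- {m}" for m in static_memories]
    lines += ["", "## Recent Context"] + [f"- {m}" for m in dynamic_memories]
    return "\n".join(lines) + "\n"


profile = render_profile(["Prefers Rust"], ["Debugging OAuth refresh"])
print(profile)
```

Rendering is a pure function of the two lists, which fits a read-only view: there is no state in the file itself for a write to change.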

Virtual Profile

The Rust mount handles /profile.md outside the normal cache path. It assigns the file a fixed virtual inode, warms it from the profile endpoint when the daemon starts, and serves reads from an in-memory markdown rendering. Root directory reads include profile.md even when the local cache has no user files. Profile memory is not hidden behind an API call, and it is not mixed into the writable document tree. It is a stable read surface at the top of the mount: quick to inspect, safe from accidental edits, and available before the agent has decided what to search.

Memory Scope

The profile is generated from memories, and SMFS lets the container decide which filesystem paths produce those memories. On the server, filesystem-ingested documents are classified by path. The default memory paths are /memory/, /user/, /memory.md, and /user.md; a trailing slash means recursive directory, a path without it is an exact file match. A mount can override the container's memory paths with smfs mount <tag> --memory-paths "/notes/,/journal.md", and passing an empty value disables memory generation for filesystem-ingested documents. Matching paths run through the memory pipeline that creates static and dynamic memories. Other filesystem-ingested documents can still be stored and retrieved as documents, but they do not feed the profile memory view.
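The matching rule (trailing slash means recursive directory, otherwise exact file) can be sketched as below. The function names are hypothetical, and the comma-separated parsing simply follows the flag example above.

```python
from typing import List, Sequence

DEFAULT_MEMORY_PATHS = ("/memory/", "/user/", "/memory.md", "/user.md")


def parse_memory_paths(value: str) -> List[str]:
    """Split a comma-separated value; an empty value yields no memory paths."""
    return [p.strip() for p in value.split(",") if p.strip()]


def feeds_memory(path: str, memory_paths: Sequence[str] = DEFAULT_MEMORY_PATHS) -> bool:
    """True if a filesystem-ingested document at `path` runs through the memory pipeline."""
    for rule in memory_paths:
        if rule.endswith("/"):
            if path.startswith(rule):  # recursive directory match
                return True
        elif path == rule:             # exact file match
            return True
    return False


custom = parse_memory_paths("/notes/,/journal.md")
print(feeds_memory("/memory/2026/trip.md"), feeds_memory("/work/notes.md"))
print(feeds_memory("/notes/ideas.md", custom), feeds_memory("/journal.md.bak", custom))
```

With an empty override, `parse_memory_paths("")` returns an empty list and no path matches, which corresponds to disabling memory generation.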

Semantic Search

Agents do not need a new habit to search memory. They already reach for grep. SMFS keeps that interface and changes the matching function underneath it. Inside a mounted container, smfs grep "oauth refresh failure" work/ becomes a semantic search over the container, scoped to the work/ subtree. The query is still a string. The scope is still a path. The result still looks like file output.

The difference is what can match. A result may come from a generated memory, a markdown note, a source chunk, or extracted text from a PDF, image, audio file, video, or webpage. SMFS asks Supermemory for a hybrid ranked result set, then renders those hits back into the line-oriented shape agents already know how to follow.

work/debug-notes.md:42-45: refresh
  token failed after deploy ...
research-paper.pdf:118-122: the
  benchmark failed after token
  rotation ...

This is the core trade: retrieval becomes semantic, but the interface stays filesystem-shaped. The agent can search by meaning, open the path, inspect the surrounding file, and continue without learning a separate retrieval schema.

Grep as Retrieval

The mounted path gives search a natural scope. A query without a path searches the container. A query with a directory searches that subtree. A query with a file path narrows to that file. SMFS normalizes the path relative to the mount root and uses it as the retrieval boundary.
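A sketch of the scope normalization, assuming POSIX-style paths and a hypothetical `search_scope` helper; the empty string stands for "whole container".

```python
import posixpath
from typing import Optional


def search_scope(mount_root: str, cwd: str, path_arg: Optional[str] = None) -> str:
    """Return the mount-relative retrieval boundary; "" means the whole container."""
    if path_arg is None:
        return ""  # a query without a path searches the container
    root = posixpath.normpath(mount_root)
    target = posixpath.normpath(posixpath.join(cwd, path_arg))
    if target == root:
        return ""
    if not target.startswith(root.rstrip("/") + "/"):
        raise ValueError(f"{target} is outside the mount at {root}")
    return posixpath.relpath(target, root)


print(search_scope("/mnt/mem", "/mnt/mem", "work/"))
print(search_scope("/mnt/mem", "/mnt/mem/work", "debug-notes.md"))
```

The same function covers directory and file scopes; whether the scope names a subtree or a single file is decided by what exists at that path.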

Literal grep remains literal: the agent reaches for the system tool when it wants a byte-level match, and reaches for smfs grep when it wants a semantic one. Both modes stay available: natural-language lookup for memory retrieval, byte-level matching for exact inspection.

Returning to Files

Semantic search is only useful if the result can return to the filesystem. For chunk results, SMFS tries to recover a local line range by finding the returned excerpt in the local file. If the verbatim match fails, it retries with whitespace-normalized text so reflowed passages can still resolve to useful lines.
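The two-step lookup can be sketched as follows. `locate_excerpt` is a hypothetical helper; for the fallback it uses a whitespace-tolerant pattern, which has the same effect as comparing whitespace-normalized text while keeping offsets in the original file.

```python
import re
from typing import Optional, Tuple


def locate_excerpt(file_text: str, excerpt: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based (first_line, last_line) covering the excerpt, or None."""
    words = excerpt.split()
    if not words:
        return None
    start = file_text.find(excerpt)  # step 1: verbatim match
    if start != -1:
        end = start + len(excerpt)
    else:
        # Step 2: allow any run of whitespace between the excerpt's words.
        pattern = r"\s+".join(re.escape(w) for w in words)
        match = re.search(pattern, file_text)
        if match is None:
            return None
        start, end = match.span()
    first = file_text.count("\n", 0, start) + 1
    last = file_text.count("\n", 0, end - 1) + 1  # line of the last matched character
    return first, last


text = "alpha\nrefresh token\nfailed after deploy\nomega\n"
print(locate_excerpt(text, "refresh token\nfailed"))
print(locate_excerpt(text, "refresh token failed after  deploy"))
```

The second call fails the verbatim match because the excerpt was reflowed onto one line, and still resolves to the same line range through the fallback.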

For non-text files, this uses the sibling surfaces described under File Types and Transcriptions. A match inside research-paper.pdf can be resolved against research-paper.pdf.pdf-transcription.md; a match inside an audio recording can resolve against its transcript.

Generated memories are different. They are facts produced by the memory pipeline, not necessarily spans copied from a file. SMFS still prints them in the same result stream, attached to a path when the backend can attribute one. The agent sees compact memory and supporting chunks together, ranked by relevance, instead of doing one search for the fact and another for evidence.

Evaluation

Protocol

The headline metric is tokens per correct answer. An agent's per-question cost is the total of input and output tokens consumed across all turns during which it answers the question, including tool-call payloads. Pass and fail are decided by an LLM judge (Gemini 2.5 Pro at temperature 0) scoring (question, gold_answer, candidate_answer) triples for semantic equivalence with paraphrase and format tolerance. Pass rate, mean tool calls per question, and mean wall-clock per question are reported as supplementary metrics.
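As a sketch of the headline metric, the snippet below assumes it is computed as total tokens over all trials (passing or not) divided by the number of passing trials. The `Trial` record and the sample numbers are hypothetical and are not drawn from the evaluation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Trial:
    input_tokens: int   # summed across all turns, including tool-call payloads
    output_tokens: int
    passed: bool        # judge verdict


def tokens_per_correct(trials: Iterable[Trial]) -> float:
    """Total tokens spent across all trials, divided by the number of correct answers."""
    trials = list(trials)
    total = sum(t.input_tokens + t.output_tokens for t in trials)
    correct = sum(1 for t in trials if t.passed)
    if correct == 0:
        raise ValueError("no correct answers: metric is undefined")
    return total / correct


sample = [Trial(900, 100, True), Trial(1500, 500, False), Trial(700, 300, True)]
print(tokens_per_correct(sample))
```

Under this definition, tokens spent on failed questions still count against the surface, so a cheap but inaccurate retrieval strategy is not rewarded.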

The agent harness is held constant across two retrieval surfaces:

fs-only: bash with grep, find, cat over the persona's local data/ directory. The baseline contemporary agentic setup.
smfs: the persona's corpus ingested into a Supermemory container and exposed through an SMFS mount; the agent has access to smfs grep (semantic) in addition to the standard shell.

For each (surface, dp) pair we record per-question tokens, tool-call counts, wall-clock time, and judge verdict. The retrieval-cost curve plotted at corpus size n aggregates over the persona at that size and across all three question families. We run two frontier agent harnesses, Anthropic Claude Code and OpenAI Codex CLI, twice over all 110 questions across all 13 corpora, yielding n=440 trials per condition.

Tool Selection and the Path-Scoped Hint

A key design choice in the SMFS condition is that the agent retains both matching functions. Plain grep stays literal; smfs grep performs hybrid semantic retrieval. The agent must choose between them per query.

Plain grep is universally known by frontier LLMs from pre-training. smfs grep is not. There is no kernel-level mechanism that would make the agent discover a new shell command on its own, and bypassing the shell (e.g., aliasing all grep calls to semantic) would silently break literal-string lookups the agent legitimately needs. The minimum viable mechanism is a single path-scoped hint written into CLAUDE.md / AGENTS.md in the workspace directory, the file Claude Code and Codex already read on startup, describing the existence and shape of smfs grep and recommending it for semantic queries. The hint and its rationale are documented in the released eval harness (eval/conditions/smfs.ts).

We emphasize that the hint is not a thumb on the scale. Both conditions ship the same base prompt ("Your memory is in agent_memory/. The answer is in agent_memory/. Search until you find it."); the SMFS hint adds only the tool-existence and tool-shape information necessary for the agent to be able to use smfs grep at all. Withholding it would not measure "SMFS without an advantage"; it would measure "SMFS with one of its tools hidden from the model." The fs-only baseline has nothing analogous to hide.

Results

Table 1. Aggregate results, averaged across both agents and all 13 corpora (n=440 per condition). Δ is SMFS relative to fs-only.

Metric	fs-only	smfs	Δ
Total tokens (M)	156	72	-53.8%
Tokens / correct (×10³)	766	359	-53.1%
Pass rate (%)	92.7	91.4	-1.4pp
Tool calls / question	10.2	6.7	-33.8%

Table 2. Per-agent breakdown. Both agents drop tokens and tool calls under SMFS; Claude is at pass-rate parity (slightly positive), while Codex absorbs a small pass-rate hit but still wins on tokens.

Agent	Metric	fs-only	smfs	Δ
Claude	Total tokens (M)	72	24	-66%
	Pass rate (%)	90.0	90.9	+0.9pp
	Tool calls / question	7.8	3.1	-60%
Codex	Total tokens (M)	84	48	-43%
	Pass rate (%)	95.5	91.8	-3.6pp
	Tool calls / question	12.5	10.3	-17%

Table 3. Per-question results by corpus size. SMFS reduces tokens at every size; at 9,988 files, SMFS pass rate also exceeds fs-only (81% vs 69%).

Files	fs tok/q	smfs tok/q	fs pass	smfs pass
5	232k	146k	100%	100%
10	329k	145k	94%	94%
20	299k	174k	100%	100%
30	312k	188k	100%	100%
50	379k	185k	94%	89%
100	258k	170k	100%	100%
200	528k	339k	100%	100%
299	1.54M	347k	94%	88%
480	852k	571k	88%	81%
991	1.18M	698k	100%	94%
1,998	875k	307k	88%	81%
4,998	1.16M	512k	75%	75%
9,988	1.59M	605k	69%	81%

Table 4. Pass-rate trade-off by question family. The aggregate -1.4pp pass-rate dip is concentrated entirely in multi-hop questions; single-hop questions improve on both tokens and pass rate, and format-spanning questions cut tokens at unchanged pass rate.

Family	n	token Δ	fs pass	smfs pass
single_hop	68	-61%	92.6%	94.1%
multi_hop	100	-45%	93.0%	89.0%
format_spanning	52	-65%	92.3%	92.3%

Across n=440 trials per condition, SMFS reduces token consumption and tokens per correct answer at near-parity pass rate, with the per-agent breakdown showing that Claude captures the largest efficiency gains while slightly improving on pass rate, and Codex also benefits substantially in tokens while taking a small pass-rate hit.

The aggregate efficiency claim is in Table 1: SMFS uses 53.8% fewer total tokens, and produces correct answers for 53.1% fewer tokens each. The per-agent decomposition (Table 2) is the more interesting story. Claude cuts its tool-call count by 60% on SMFS and improves marginally on pass rate; it is the bigger beneficiary on every axis. Codex also reduces its tool-call count under SMFS, by 17%, indicating that semantic search shortcuts exploration even for the more exploration-heavy of the two agents; it takes a modest pass-rate hit (-3.6pp) and still ends up substantially cheaper per correct answer. Both readings of the story show up in the numbers: SMFS is a context-management win that compounds with Claude's already efficient navigation, and a net-positive backend swap for Codex that also shifts its retrieval habits.
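The two headline percentages follow directly from the rounded values in Table 1. A short check, with `relative_change` as an illustrative helper:

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# (fs-only, smfs) pairs copied from Table 1.
table_1 = {
    "Total tokens (M)": (156, 72),
    "Tokens / correct (x10^3)": (766, 359),
}

for metric, (fs_only, smfs) in table_1.items():
    print(f"{metric}: {relative_change(fs_only, smfs):+.1f}%")
```

This prints -53.8% and -53.1%, matching the Δ column.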

Figure 1. xAFS cost curve: per-question token cost vs. corpus size, 5 to 9,988 files (log scale), for fs-only and SMFS. fs-only grows ~7× over the range; SMFS grows ~4×. The 299-file corpus is a notable outlier where fs-only thrashes. n = 220 (fs) + 220 (smfs) trials · smfs.ai/research/runs

Figure 1 and Table 3 make the scaling story concrete. Per-question token cost grows roughly 7× across the file-count range for fs-only (0.23M to 1.59M tok/q) and only ~4× for SMFS (0.15M to 0.61M); the 299-file corpus is an outlier where fs-only thrashes (1.54M tok/q against SMFS's 0.35M). The mechanism is what the architecture would predict: the fs-only baseline's cost grows steeply with corpus size because the agent issues progressively more speculative reads (ls, then a handful of cats, then a grep for a literal that does not match, then more cats) to navigate an unfamiliar directory. SMFS's cost grows more slowly because the same grep habit retrieves semantically without those speculative reads.
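The growth factors quoted above can be recomputed directly from Table 3. The following is a minimal sketch: the rows are transcribed from the table in thousands of tokens per question (so 1.54M becomes 1540), and the helper names are illustrative rather than part of any released tooling.

```python
# Table 3 rows: (files, fs-only tok/q, SMFS tok/q), token counts in thousands.
TABLE_3 = [
    (5, 232, 146),
    (10, 329, 145),
    (20, 299, 174),
    (30, 312, 188),
    (50, 379, 185),
    (100, 258, 170),
    (200, 528, 339),
    (299, 1540, 347),
    (480, 852, 571),
    (991, 1180, 698),
    (1998, 875, 307),
    (4998, 1160, 512),
    (9988, 1590, 605),
]

def growth_factor(rows, column):
    """Ratio of per-question cost at the largest corpus to the smallest."""
    return rows[-1][column] / rows[0][column]

def savings_by_size(rows):
    """Fractional token reduction of SMFS relative to fs-only, per corpus size."""
    return {files: 1 - smfs / fs for files, fs, smfs in rows}

fs_growth = growth_factor(TABLE_3, 1)
smfs_growth = growth_factor(TABLE_3, 2)
savings = savings_by_size(TABLE_3)

print(f"fs-only growth: {fs_growth:.1f}x, SMFS growth: {smfs_growth:.1f}x")
for files, saving in savings.items():
    print(f"{files:>5} files: {saving:.0%} fewer tokens per question")
```

Endpoint ratios are a coarse summary because the curve is not monotonic (the 100-file corpus is cheaper than the 50-file one for both conditions), so the per-size savings are the more robust view.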

Pass rate is the second axis. As corpus size grows, fs-only holds between 94% and 100% on corpora of 200 files or fewer (Table 3) and falls to 69% at 9,988 files, a 31pp decline from its best; SMFS degrades more gradually to 81%, surpassing fs-only at the largest scale. This suggests that fs-only retrieval begins to fail outright at extreme scale rather than merely becoming more expensive. Decomposed by question family (Table 4), the aggregate -1.4pp pass-rate dip is concentrated entirely in multi-hop questions (-4pp); single-hop questions win on both tokens and pass rate, and format-spanning questions win on tokens at equal pass rate. Across the full evaluation, fs-only consumed $2,103 in API spend and SMFS $946, a savings of $1,157 (-55.0%).
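The aggregate pass-rate delta and the spend figure can both be reproduced from numbers already in the text. This sketch weights each Table 4 family by its n and attributes the aggregate change to each family; the variable and function names are illustrative.

```python
# Table 4 rows: family -> (n, fs-only pass %, SMFS pass %).
TABLE_4 = {
    "single_hop": (68, 92.6, 94.1),
    "multi_hop": (100, 93.0, 89.0),
    "format_spanning": (52, 92.3, 92.3),
}

total_n = sum(n for n, _, _ in TABLE_4.values())

def weighted_pass_rate(table, column):
    """Pass rate across all families, weighted by question count."""
    return sum(row[0] * row[column] for row in table.values()) / total_n

fs_pass = weighted_pass_rate(TABLE_4, 1)
smfs_pass = weighted_pass_rate(TABLE_4, 2)
delta_pp = smfs_pass - fs_pass

# Each family's share of the aggregate change, in percentage points.
contribution = {
    family: n * (smfs - fs) / total_n
    for family, (n, fs, smfs) in TABLE_4.items()
}

# API spend quoted in the text, in US dollars.
fs_spend, smfs_spend = 2103, 946
saved = fs_spend - smfs_spend
saved_pct = 100 * saved / fs_spend

print(f"aggregate delta: {delta_pp:+.1f}pp")
for family, pp in contribution.items():
    print(f"  {family}: {pp:+.2f}pp")
print(f"spend saved: ${saved} ({saved_pct:.1f}%)")
```

The per-family contributions show that multi-hop questions account for more than the whole dip, with single-hop questions offsetting part of it and format-spanning questions contributing nothing in either direction.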

Limitations

xAFS is synthetic, audited, and released by the authors of the system under evaluation. We address each directly. The dataset is synthetic because a publicly shareable real cross-context corpus at this scale does not exist; voices and event distributions reflect the priors of the generating model. The dataset is audited so that gold answers cannot be the failure mode: every gold has been independently re-verified against the visible corpus, and the audit log ships with the release. Scaling is measured across distinct personas at distinct sizes, which introduces inter-persona variance into cost-curve estimates; family-distribution balance partially compensates, and the curve should be read as typical-case behavior across thirteen domains rather than an exactly controlled scaling experiment. Question design was completed and frozen against the corpus before any retrieval-system measurements were taken, and the construction pipeline does not preferentially target any retrieval surface. On multi-hop questions specifically, SMFS trades roughly four points of pass rate for 45% fewer tokens (Table 4); we attribute this to top-K semantic retrieval occasionally returning a near-neighbor for the second hop rather than the file the chain requires, and leave tightening this trade-off to future work. We nonetheless encourage independent reproduction and welcome counter-examples.
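The multi-hop failure mode described above can be illustrated with a toy example. This is not SMFS's retrieval code: it uses plain cosine similarity rather than the hybrid query the system runs, and the file paths, 3-dimensional embeddings, and query vector are all invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, index, k):
    """Return the k paths whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda path: cosine(query, index[path]), reverse=True)
    return ranked[:k]

# Hypothetical index: two files about similar topics, one unrelated file.
INDEX = {
    "trips/2025-lisbon.md": [0.9, 0.1, 0.3],  # the file the chain requires
    "trips/2024-lisbon.md": [0.9, 0.2, 0.1],  # a near-neighbor on the same topic
    "work/q3-planning.md": [0.1, 0.9, 0.2],   # unrelated
}

# Hypothetical second-hop query, phrased closer to the near-neighbor.
second_hop_query = [1.0, 0.2, 0.0]

print(top_k(second_hop_query, INDEX, k=1))
print(top_k(second_hop_query, INDEX, k=2))
```

With k=1 the near-neighbor outranks the required file and the chain breaks; with k=2 both are returned, at the cost of extra tokens in context. That tension between K and context size is one way to read the trade-off reported in Table 4.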

Conclusion

We argued that the bottleneck of agentic memory is not the speed of any single retrieval but the number of speculative retrievals the agent issues before it has gathered enough context to answer, and that filesystem-only and vector-only memory each amplify that count in their own way. SMFS responds with a path-coherent design: one identifier, the path in the mount, carries local storage, cloud document identity, search scope, transcription placement, and generated memory simultaneously. The kernel sees a normal directory; the agent uses the same grep habit it already has; the backend keeps doing extraction, embedding, and profile generation. Filesystem semantics live at the edge and complete locally; remote convergence happens in four explicit background loops backed by a persistent SQLite queue.

On xAFS, the first public benchmark designed around the cost curve of agentic retrieval, SMFS reduces total tokens consumed by 53.8% and tokens per correct answer by 53.1% against a filesystem-native baseline at n=440 trials per condition. It wins on tokens at every corpus size, with absolute reductions generally largest at the upper end of the scale (the 299-file outlier aside). At the largest corpus, SMFS pass rate also exceeds the fs-only baseline, suggesting that fs-only retrieval breaks down qualitatively at extreme scale, not just quantitatively. The aggregate pass-rate trade-off is near-parity (-1.4pp), with Claude slightly improving and Codex taking a modest hit (-3.6pp). We see agent filesystems as an underexplored area of public research: existing comparisons are demo-shaped, and the benchmarks that exist measure adjacent problems. We release xAFS so that any reader skeptical of these results can rerun the evaluation, swap the judge, construct adversarial questions, or build a parallel dataset. Future work includes more personas at the top of the scale, surgically injected gold chains and trap files, native multimodal content, and languages beyond English.

Cite this work

@techreport{smfs2026,
  title   = {SMFS: Memory as a Filesystem},
  author  = {Prasanna AP and Soham Daga and Dhravya Shah},
  year    = {2026},
  month   = {May},
  institution = {Supermemory},
  url     = {https://supermemory.ai/research/memory-as-a-filesystem}
}