Evaluation Runs

xAFS benchmark · 19,169 files · 220 runs · Claude Code + Codex

220 runs · 19,169 files · 2 agents · 43% avg token reduction
Question Agent DP Files Family FS (tokens) SMFS (tokens) Δ Tokens
q01 What was Coppertide's exact Stitch invoice amount … Claude Code 001 5 single_hop 207k 160k -23%
q01 What was Coppertide's exact Stitch invoice amount … Codex 001 5 single_hop 171k 102k -41%
q02 According to the signed SOW, what is the internal … Claude Code 001 5 single_hop 205k 95k -54%
q02 According to the signed SOW, what is the internal … Codex 001 5 single_hop 114k 77k -32%
q03 Priya's engagement plan notes that Coppertide's da… Claude Code 001 5 multi_hop 535k 194k -64%
q03 Priya's engagement plan notes that Coppertide's da… Codex 001 5 multi_hop 220k 104k -53%
q04 The engagement plan lists a financial sensitivity … Claude Code 001 5 multi_hop 290k 166k -43%
q04 The engagement plan lists a financial sensitivity … Codex 001 5 multi_hop 158k 99k -37%
q05 The kickoff transcript records Quentin discovering… Claude Code 001 5 multi_hop 538k 202k -63%
q05 The kickoff transcript records Quentin discovering… Codex 001 5 multi_hop 220k 187k -15%
q06 Priya Iyer's profile states she has a severe aller… Claude Code 001 5 multi_hop 237k 162k -32%
q06 Priya Iyer's profile states she has a severe aller… Codex 001 5 multi_hop 125k 137k +10%
q07 The company overview notes that Coppertide's CFO r… Claude Code 001 5 multi_hop 328k 237k -28%
q07 The company overview notes that Coppertide's CFO r… Codex 001 5 multi_hop 175k 230k +31%
q08 According to the SOW payment schedule table, on wh… Claude Code 001 5 format_spanning 175k 95k -45%
q08 According to the SOW payment schedule table, on wh… Codex 001 5 format_spanning 98k 93k -5%
q09 The kickoff transcript's action-items table lists … Claude Code 001 5 format_spanning 252k 165k -35%
q09 The kickoff transcript's action-items table lists … Codex 001 5 format_spanning 125k 119k -4%
q01 What is Ana Sokol's Amtrak reservation number for … Claude Code 002 10 single_hop 210k 91k -57%
q01 What is Ana Sokol's Amtrak reservation number for … Codex 002 10 single_hop 237k 60k -75%
q02 What is the VAULT confirmation number assigned to … Claude Code 002 10 single_hop 182k 92k -49%
q02 What is the VAULT confirmation number assigned to … Codex 002 10 single_hop 123k 99k -20%
q03 Mira warned Ana about specific logistical risks wh… Claude Code 002 10 multi_hop 391k 168k -57%
q03 Mira warned Ana about specific logistical risks wh… Codex 002 10 multi_hop 422k 144k -66%
q04 What is the OpenTable confirmation reference for t… Claude Code 002 10 multi_hop 481k 207k -57%
q04 What is the OpenTable confirmation reference for t… Codex 002 10 multi_hop 348k 200k -42%
q05 Jordan planned to visit Great Island Common in New… Claude Code 002 10 multi_hop 787k 163k -79%
q05 Jordan planned to visit Great Island Common in New… Codex 002 10 multi_hop 319k 290k -9%
q06 Why did Mira say she could not join Ana and Jordan… Claude Code 002 10 multi_hop 442k 96k -78%
q06 Why did Mira say she could not join Ana and Jordan… Codex 002 10 multi_hop 275k 132k -52%
q07 Carolyn Foley's reply to Ana's pre-arrival email m… Claude Code 002 10 multi_hop 561k 254k -55%
q07 Carolyn Foley's reply to Ana's pre-arrival email m… Codex 002 10 multi_hop 117k 167k +43%
q08 According to the Amtrak confirmation email, what i… Claude Code 002 10 format_spanning 182k 96k -47%
q08 According to the Amtrak confirmation email, what i… Codex 002 10 format_spanning 121k 90k -26%
q09 What is the revised final total for Ana's Martin H… Claude Code 002 10 format_spanning 526k 129k -75%
q09 What is the revised final total for Ana's Martin H… Codex 002 10 format_spanning 202k 132k -35%
q01 According to the cardiac catheterization report fo… Claude Code 003 20 single_hop 215k 120k -44%
q01 According to the cardiac catheterization report fo… Codex 003 20 single_hop 233k 121k -48%
q02 What volume of air was used to inflate the TR Band… Claude Code 003 20 single_hop 217k 96k -56%
q02 What volume of air was used to inflate the TR Band… Codex 003 20 single_hop 281k 99k -65%
q03 Hugo Marchetti was discharged after his NSTEMI on … Claude Code 003 20 multi_hop 295k 125k -57%
q03 Hugo Marchetti was discharged after his NSTEMI on … Codex 003 20 multi_hop 223k 345k +55%
q04 What was Hugo Marchetti's baseline hemoglobin on t… Claude Code 003 20 multi_hop 329k 147k -55%
q04 What was Hugo Marchetti's baseline hemoglobin on t… Codex 003 20 multi_hop 320k 150k -53%
q05 What stent was deployed in Hugo Marchetti's mid-LA… Claude Code 003 20 multi_hop 267k 344k +29%
q05 What stent was deployed in Hugo Marchetti's mid-LA… Codex 003 20 multi_hop 300k 169k -44%
q06 A cards fellow made a remark during Hugo Marchetti… Claude Code 003 20 multi_hop 325k 297k -9%
q06 A cards fellow made a remark during Hugo Marchetti… Codex 003 20 multi_hop 264k 223k -16%
q07 Hugo Marchetti asked a nurse whether he could take… Claude Code 003 20 multi_hop 780k 169k -78%
q07 Hugo Marchetti asked a nurse whether he could take… Codex 003 20 multi_hop 563k 274k -51%
q08 Using the HEART score component table in Hugo Marc… Claude Code 003 20 format_spanning 208k 93k -56%
q08 Using the HEART score component table in Hugo Marc… Codex 003 20 format_spanning 259k 117k -55%
q09 The Day-2 transthoracic echocardiogram report for … Claude Code 003 20 format_spanning 185k 92k -50%
q09 The Day-2 transthoracic echocardiogram report for … Codex 003 20 format_spanning 120k 145k +20%
q01 Carmen Ostrowski's demand letter of February 19, 2… Claude Code 004 30 single_hop 336k 161k -52%
q01 Carmen Ostrowski's demand letter of February 19, 2… Codex 004 30 single_hop 178k 57k -68%
q02 The pre-counsel correspondence file (correspondenc… Claude Code 004 30 single_hop 187k 92k -51%
q02 The pre-counsel correspondence file (correspondenc… Codex 004 30 single_hop 187k 75k -60%
q03 Cross-referencing the defense's discovery response… Claude Code 004 30 multi_hop 358k 409k +14%
q03 Cross-referencing the defense's discovery response… Codex 004 30 multi_hop 314k 214k -32%
q04 Karras's original Answer (pleadings/answer-2026-03… Claude Code 004 30 multi_hop 415k 166k -60%
q04 Karras's original Answer (pleadings/answer-2026-03… Codex 004 30 multi_hop 386k 57k -85%
q05 Karras claims a $4,500 change order was verbally a… Claude Code 004 30 multi_hop 457k 418k -9%
q05 Karras claims a $4,500 change order was verbally a… Codex 004 30 multi_hop 538k 377k -30%
q06 The corpus contains a discrepancy about the date W… Claude Code 004 30 multi_hop 356k 239k -33%
q06 The corpus contains a discrepancy about the date W… Codex 004 30 multi_hop 294k 98k -67%
q07 Carmen planned an adverse-inference request target… Claude Code 004 30 multi_hop 621k 356k -43%
q07 Carmen planned an adverse-inference request target… Codex 004 30 multi_hop 255k 134k -48%
q08 The filed complaint (pleadings/complaint-filed-202… Claude Code 004 30 format_spanning 270k 227k -16%
q08 The filed complaint (pleadings/complaint-filed-202… Codex 004 30 format_spanning 213k 64k -70%
q09 The court docket (Part B of correspondence/court/f… Claude Code 004 30 format_spanning 125k 132k +5%
q09 The court docket (Part B of correspondence/court/f… Codex 004 30 format_spanning 135k 103k -23%
q01 What was the Zelle confirmation number on Yael Str… Claude Code 005 50 single_hop 341k 93k -73%
q01 What was the Zelle confirmation number on Yael Str… Codex 005 50 single_hop 301k 105k -65%
q02 According to the apartment's shared appliances inv… Claude Code 005 50 single_hop 186k 96k -48%
q02 According to the apartment's shared appliances inv… Codex 005 50 single_hop 196k 76k -61%
q03 At the September 28 dinner party, Olu Adebayo brok… Claude Code 005 50 multi_hop 419k 131k -69%
q03 At the September 28 dinner party, Olu Adebayo brok… Codex 005 50 multi_hop 608k 258k -58%
q04 The September 22, 2025 bathroom ceiling leak in Ap… Claude Code 005 50 multi_hop 520k 266k -49%
q04 The September 22, 2025 bathroom ceiling leak in Ap… Codex 005 50 multi_hop 734k 199k -73%
q05 On September 30, 2025, Wren's payroll failed to po… Claude Code 005 50 multi_hop 280k 240k -15%
q05 On September 30, 2025, Wren's payroll failed to po… Codex 005 50 multi_hop 653k 273k -58%
q06 In the October 8, 2025 voice memo, Wren states tha… Claude Code 005 50 multi_hop 320k 212k -34%
q06 In the October 8, 2025 voice memo, Wren states tha… Codex 005 50 multi_hop 486k 247k -49%
q07 Mr. Aleksandar Nikolajević in Apt 2B had a brief f… Claude Code 005 50 multi_hop 490k 171k -65%
q07 Mr. Aleksandar Nikolajević in Apt 2B had a brief f… Codex 005 50 multi_hop 319k 334k +5%
q08 The image transcription of the September 22 ceilin… Claude Code 005 50 format_spanning 154k 94k -39%
q08 The image transcription of the September 22 ceilin… Codex 005 50 format_spanning 259k 101k -61%
q09 In the October 8, 2025 voice memo transcription, O… Claude Code 005 50 format_spanning 316k 294k -7%
q09 In the October 8, 2025 voice memo transcription, O… Codex 005 50 format_spanning 246k 145k -41%
q01 In PR #67 (the CVE-2026-31418 patch), exactly how … Claude Code 006 100 single_hop 150k 93k -38%
q01 In PR #67 (the CVE-2026-31418 patch), exactly how … Codex 006 100 single_hop 166k 95k -43%
q02 What exact CVSS 3.1 score and full vector string d… Claude Code 006 100 single_hop 183k 164k -10%
q02 What exact CVSS 3.1 score and full vector string d… Codex 006 100 single_hop 182k 148k -19%
q03 PR #84 (concurrent file processing) reported bench… Claude Code 006 100 multi_hop 270k 166k -38%
q03 PR #84 (concurrent file processing) reported bench… Codex 006 100 multi_hop 524k 283k -46%
q04 The scratch-plugin-design-brainstorm.md contains a… Claude Code 006 100 multi_hop 331k 201k -39%
q04 The scratch-plugin-design-brainstorm.md contains a… Codex 006 100 multi_hop 224k 214k -4%
q05 Lior's outreach email to Charlie Marsh describes h… Claude Code 006 100 multi_hop 246k 271k +10%
q05 Lior's outreach email to Charlie Marsh describes h… Codex 006 100 multi_hop 177k 146k -18%
q06 The v0.5.0 release notes state that v0.4.2 was yan… Claude Code 006 100 multi_hop 422k 169k -60%
q06 The v0.5.0 release notes state that v0.4.2 was yan… Codex 006 100 multi_hop 303k 193k -36%
q07 Bytebase is kitabi's second sponsor. What is Byteb… Claude Code 006 100 multi_hop 268k 202k -25%
q07 Bytebase is kitabi's second sponsor. What is Byteb… Codex 006 100 multi_hop 288k 108k -62%
q08 Using the benchmark table in the v0.5.0 release no… Claude Code 006 100 format_spanning 178k 95k -46%
q08 Using the benchmark table in the v0.5.0 release no… Codex 006 100 format_spanning 131k 113k -14%
q09 Reproduce the ADR-003 Section 9 (Status and timeli… Claude Code 006 100 format_spanning 311k 235k -24%
q09 Reproduce the ADR-003 Section 9 (Status and timeli… Codex 006 100 format_spanning 285k 158k -45%
q01 What is the lab protocol ID for general maintenanc… Claude Code 007 200 single_hop 272k 94k -66%
q01 What is the lab protocol ID for general maintenanc… Codex 007 200 single_hop 184k 86k -53%
q02 What version of R was used for the BIO-510 class? Claude Code 007 200 single_hop 283k 96k -66%
q02 What version of R was used for the BIO-510 class? Codex 007 200 single_hop 470k 275k -41%
q03 What is the PubMed ID associated with the Nature M… Claude Code 007 200 single_hop 234k 166k -29%
q03 What is the PubMed ID associated with the Nature M… Codex 007 200 single_hop 403k 137k -66%
q04 What is the cell line Lena used for her first form… Claude Code 007 200 multi_hop 267k 210k -21%
q04 What is the cell line Lena used for her first form… Codex 007 200 multi_hop 234k 217k -7%
q05 What was the subject of the GEN-600 paper critique… Claude Code 007 200 multi_hop 262k 96k -63%
q05 What was the subject of the GEN-600 paper critique… Codex 007 200 multi_hop 531k 336k -37%
q06 In which courses did Lena first encounter the gene… Claude Code 007 200 multi_hop 2657k 1434k -46%
q06 In which courses did Lena first encounter the gene… Codex 007 200 multi_hop 1575k 1590k +1%
q07 What was the date of Lena's first 1-on-1 meeting w… Claude Code 007 200 format_spanning 273k 168k -38%
q07 What was the date of Lena's first 1-on-1 meeting w… Codex 007 200 format_spanning 222k 169k -24%
q08 What date was the BIO-510 midterm exam (per the an… Claude Code 007 200 format_spanning 291k 223k -23%
q08 What date was the BIO-510 midterm exam (per the an… Codex 007 200 format_spanning 294k 122k -59%
q01 What was the codename for CogniSynth's Minimum Via… Claude Code 008 299 single_hop 309k 156k -49%
q01 What was the codename for CogniSynth's Minimum Via… Codex 008 299 single_hop 230k 190k -17%
q02 When did CogniSynth officially file for incorporat… Claude Code 008 299 single_hop 238k 194k -18%
q02 When did CogniSynth officially file for incorporat… Codex 008 299 single_hop 532k 239k -55%
q03 Who is CogniSynth's co-founder and CTO? Claude Code 008 299 single_hop 167k 113k -32%
q03 Who is CogniSynth's co-founder and CTO? Codex 008 299 single_hop 122k 76k -38%
q04 What was the total estimated H1 2023 operating bur… Claude Code 008 299 multi_hop 749k 452k -40%
q04 What was the total estimated H1 2023 operating bur… Codex 008 299 multi_hop 980k 938k -4%
q05 What was the date of the pivotal customer intervie… Claude Code 008 299 format_spanning 1133k 361k -68%
q05 What was the date of the pivotal customer intervie… Codex 008 299 format_spanning 1860k 1044k -44%
q06 When did Foundry Ventures provide a verbal commitm… Claude Code 008 299 format_spanning 8743k 1006k -88%
q06 When did Foundry Ventures provide a verbal commitm… Codex 008 299 format_spanning 7438k 0k -100%
q07 According to Sam Chen's signed CogniSynth employme… Claude Code 008 299 single_hop 1016k 193k -81%
q07 According to Sam Chen's signed CogniSynth employme… Codex 008 299 single_hop 536k 166k -69%
q08 What were the four components of the initial techn… Claude Code 008 299 multi_hop 310k 259k -17%
q08 What were the four components of the initial techn… Codex 008 299 multi_hop 306k 166k -46%
q01 What is the standard session rate for therapy at C… Claude Code 009 480 single_hop 285k 91k -68%
q01 What is the standard session rate for therapy at C… Codex 009 480 single_hop 1189k 98k -92%
q02 What is the CPT code most commonly billed by the p… Claude Code 009 480 single_hop 241k 95k -61%
q02 What is the CPT code most commonly billed by the p… Codex 009 480 single_hop 268k 127k -53%
q03 Who is the Claims Adjuster at Pacific Health Allia… Claude Code 009 480 single_hop 175k 92k -48%
q03 Who is the Claims Adjuster at Pacific Health Allia… Codex 009 480 single_hop 227k 68k -70%
q04 What was the total amount of money Pacific Health … Claude Code 009 480 multi_hop 4763k 812k -83%
q04 What was the total amount of money Pacific Health … Codex 009 480 multi_hop 1862k 5834k +213%
q05 Which client's claim was denied on December 18, 20… Claude Code 009 480 multi_hop 575k 96k -83%
q05 Which client's claim was denied on December 18, 20… Codex 009 480 multi_hop 724k 148k -80%
q06 What was the primary diagnosis for client AB-101, … Claude Code 009 480 multi_hop 547k 130k -76%
q06 What was the primary diagnosis for client AB-101, … Codex 009 480 multi_hop 519k 157k -70%
q07 What was the date of the first group supervision s… Claude Code 009 480 format_spanning 241k 234k -3%
q07 What was the date of the first group supervision s… Codex 009 480 format_spanning 1394k 655k -53%
q08 What is the Cascadia Behavioral Health reimburseme… Claude Code 009 480 format_spanning 255k 210k -18%
q08 What is the Cascadia Behavioral Health reimburseme… Codex 009 480 format_spanning 368k 284k -23%
q01 What was the start date for Project Nova? Claude Code 010 991 single_hop 802k 158k -80%
q01 What was the start date for Project Nova? Codex 010 991 single_hop 652k 818k +26%
q02 What is the name of the primary backend service fo… Claude Code 010 991 single_hop 1290k 94k -93%
q02 What is the name of the primary backend service fo… Codex 010 991 single_hop 391k 129k -67%
q03 What was the final version number for the Project … Claude Code 010 991 single_hop 501k 400k -20%
q03 What was the final version number for the Project … Codex 010 991 single_hop 355k 186k -47%
q04 What was the reported root cause of 'The Great Slo… Claude Code 010 991 multi_hop 260k 132k -49%
q04 What was the reported root cause of 'The Great Slo… Codex 010 991 multi_hop 1202k 351k -71%
q05 What was the contractual deadline for the Project … Claude Code 010 991 multi_hop 722k 536k -26%
q05 What was the contractual deadline for the Project … Codex 010 991 multi_hop 3615k 5013k +39%
q06 What was the total R&D budget allocated for Projec… Claude Code 010 991 multi_hop 1714k 444k -74%
q06 What was the total R&D budget allocated for Projec… Codex 010 991 multi_hop 5343k 2188k -59%
q07 What was the specific bug ticket ID for the critic… Claude Code 010 991 format_spanning 201k 124k -38%
q07 What was the specific bug ticket ID for the critic… Codex 010 991 format_spanning 637k 354k -44%
q08 What was the total amount of the September CloudPr… Claude Code 010 991 format_spanning 573k 123k -79%
q08 What was the total amount of the September CloudPr… Codex 010 991 format_spanning 672k 121k -82%
q01 What was the approved budget for 'Project Nighting… Claude Code 011 1,998 single_hop 201k 164k -18%
q01 What was the approved budget for 'Project Nighting… Codex 011 1,998 single_hop 1341k 477k -64%
q02 What internal project code was used for Veridian's… Claude Code 011 1,998 single_hop 327k 95k -71%
q02 What internal project code was used for Veridian's… Codex 011 1,998 single_hop 416k 87k -79%
q03 According to the published sidebar explainer on pr… Claude Code 011 1,998 single_hop 157k 241k +54%
q03 According to the published sidebar explainer on pr… Codex 011 1,998 single_hop 867k 108k -88%
q04 What was the date of the first significant intervi… Claude Code 011 1,998 multi_hop 377k 274k -27%
q04 What was the date of the first significant intervi… Codex 011 1,998 multi_hop 976k 385k -61%
q05 Which two I-Team members were primarily responsibl… Claude Code 011 1,998 multi_hop 1172k 165k -86%
q05 Which two I-Team members were primarily responsibl… Codex 011 1,998 multi_hop 2930k 1260k -57%
q06 What was the internal project code for Veridian's … Claude Code 011 1,998 multi_hop 970k 132k -86%
q06 What was the internal project code for Veridian's … Codex 011 1,998 multi_hop 640k 186k -71%
q07 What was the filename of the first encrypted email… Claude Code 011 1,998 format_spanning 723k 293k -59%
q07 What was the filename of the first encrypted email… Codex 011 1,998 format_spanning 1118k 645k -42%
q08 What specific Illinois regulation regarding staffi… Claude Code 011 1,998 format_spanning 348k 160k -54%
q08 What specific Illinois regulation regarding staffi… Codex 011 1,998 format_spanning 1432k 238k -83%
q01 What was the total estimated cost of the Kasnian G… Claude Code 012 4,998 single_hop 377k 208k -45%
q01 What was the total estimated cost of the Kasnian G… Codex 012 4,998 single_hop 1657k 1250k -25%
q02 What is the physical street address of the U.S. Em… Claude Code 012 4,998 single_hop 822k 204k -75%
q02 What is the physical street address of the U.S. Em… Codex 012 4,998 single_hop 805k 1528k +90%
q03 Which U.S. government bureau is responsible for ov… Claude Code 012 4,998 single_hop 271k 112k -59%
q03 Which U.S. government bureau is responsible for ov… Codex 012 4,998 single_hop 437k 201k -54%
q04 What was Omni Energy Corp.'s investment share in t… Claude Code 012 4,998 multi_hop 1528k 245k -84%
q04 What was Omni Energy Corp.'s investment share in t… Codex 012 4,998 multi_hop 3865k 1753k -55%
q05 When was Frank Miller arrested and what was the ca… Claude Code 012 4,998 multi_hop 286k 184k -36%
q05 When was Frank Miller arrested and what was the ca… Codex 012 4,998 multi_hop 2933k 568k -81%
q06 Who is the Kasnia Desk Officer in Washington (per … Claude Code 012 4,998 multi_hop 1077k 127k -88%
q06 Who is the Kasnia Desk Officer in Washington (per … Codex 012 4,998 multi_hop 575k 127k -78%
q07 According to Ambassador Jones's 2021-06-16 informa… Claude Code 012 4,998 format_spanning 372k 268k -28%
q07 According to Ambassador Jones's 2021-06-16 informa… Codex 012 4,998 format_spanning 671k 452k -33%
q08 In the email thread regarding Frank Miller's arres… Claude Code 012 4,998 format_spanning 1602k 196k -88%
q08 In the email thread regarding Frank Miller's arres… Codex 012 4,998 format_spanning 1265k 767k -39%
q01 What is the aggregate financing amount specified i… Claude Code 013 9,988 single_hop 5063k 206k -96%
q01 What is the aggregate financing amount specified i… Codex 013 9,988 single_hop 711k 324k -54%
q02 What is the version number associated with the 'Od… Claude Code 013 9,988 single_hop 1440k 291k -80%
q02 What is the version number associated with the 'Od… Codex 013 9,988 single_hop 2177k 451k -79%
q03 Which customer provided Nexus Innovations with a f… Claude Code 013 9,988 single_hop 274k 88k -68%
q03 Which customer provided Nexus Innovations with a f… Codex 013 9,988 single_hop 774k 315k -59%
q04 What was the annual contract value (ACV) for the G… Claude Code 013 9,988 multi_hop 2582k 186k -93%
q04 What was the annual contract value (ACV) for the G… Codex 013 9,988 multi_hop 4017k 590k -85%
q05 Who is the CTO of Nexus Innovations (per his self-… Claude Code 013 9,988 multi_hop 956k 344k -64%
q05 Who is the CTO of Nexus Innovations (per his self-… Codex 013 9,988 multi_hop 1096k 276k -75%
q06 According to Alex Miller's Q1 planning data synthe… Claude Code 013 9,988 multi_hop 784k 582k -26%
q06 According to Alex Miller's Q1 planning data synthe… Codex 013 9,988 multi_hop 888k 2502k +182%
q07 Maya Reyes's 2023 personal goals and principles do… Claude Code 013 9,988 format_spanning 851k 453k -47%
q07 Maya Reyes's 2023 personal goals and principles do… Codex 013 9,988 format_spanning 1312k 1013k -23%
q08 What was the date of the Starlight Shipping Odysse… Claude Code 013 9,988 format_spanning 1238k 267k -78%
q08 What was the date of the Starlight Shipping Odysse… Codex 013 9,988 format_spanning 1330k 1796k +35%
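
The Δ Tokens column appears to be the relative change from the FS token count to the SMFS token count, that is, (SMFS − FS) / FS expressed as a percentage. The sketch below parses one flattened row and recomputes that value. The regular expression, the function names `parse_row` and `delta_percent`, and the field names are illustrative assumptions inferred from the header line, not part of the benchmark's own tooling. Recomputed values can differ from the printed ones by a point, presumably because the table shows token counts rounded to the nearest thousand.

```python
import re

# Assumed row layout, inferred from the header line:
# question id, question text, agent, DP, files, family, FS, SMFS, delta.
ROW_PATTERN = re.compile(
    r"^(?P<qid>q\d+) (?P<question>.*?) (?P<agent>Claude Code|Codex) "
    r"(?P<dp>\d{3}) (?P<files>[\d,]+) "
    r"(?P<family>single_hop|multi_hop|format_spanning) "
    r"(?P<fs>\d+)k (?P<smfs>\d+)k (?P<delta>[+-]\d+)%$"
)


def parse_row(line):
    """Split one flattened table row into a dict of typed fields."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    row = match.groupdict()
    # File counts use thousands separators, e.g. "1,998".
    row["files"] = int(row["files"].replace(",", ""))
    # Token counts are in thousands; delta is a signed percentage.
    for key in ("fs", "smfs", "delta"):
        row[key] = int(row[key])
    return row


def delta_percent(fs_k, smfs_k):
    """Relative token change from FS to SMFS, as a rounded percentage."""
    return round((smfs_k - fs_k) / fs_k * 100)


row = parse_row(
    "q02 What version of R was used for the BIO-510 class? "
    "Claude Code 007 200 single_hop 283k 96k -66%"
)
print(row["agent"], row["fs"], row["smfs"], delta_percent(row["fs"], row["smfs"]))
```

Negative values mean the SMFS run used fewer tokens than the FS run; positive values mean it used more.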