| q01 What was Coppertide's exact Stitch invoice amount … | Claude Code | 001 | 5 | single_hop | ✓ 207k | ✓ 160k | -23% |
| q01 What was Coppertide's exact Stitch invoice amount … | Codex | 001 | 5 | single_hop | ✓ 171k | ✓ 102k | -41% |
| q02 According to the signed SOW, what is the internal … | Claude Code | 001 | 5 | single_hop | ✓ 205k | ✓ 95k | -54% |
| q02 According to the signed SOW, what is the internal … | Codex | 001 | 5 | single_hop | ✓ 114k | ✓ 77k | -32% |
| q03 Priya's engagement plan notes that Coppertide's da… | Claude Code | 001 | 5 | multi_hop | ✓ 535k | ✓ 194k | -64% |
| q03 Priya's engagement plan notes that Coppertide's da… | Codex | 001 | 5 | multi_hop | ✓ 220k | ✓ 104k | -53% |
| q04 The engagement plan lists a financial sensitivity … | Claude Code | 001 | 5 | multi_hop | ✓ 290k | ✓ 166k | -43% |
| q04 The engagement plan lists a financial sensitivity … | Codex | 001 | 5 | multi_hop | ✓ 158k | ✓ 99k | -37% |
| q05 The kickoff transcript records Quentin discovering… | Claude Code | 001 | 5 | multi_hop | ✓ 538k | ✓ 202k | -63% |
| q05 The kickoff transcript records Quentin discovering… | Codex | 001 | 5 | multi_hop | ✓ 220k | ✓ 187k | -15% |
| q06 Priya Iyer's profile states she has a severe aller… | Claude Code | 001 | 5 | multi_hop | ✓ 237k | ✓ 162k | -32% |
| q06 Priya Iyer's profile states she has a severe aller… | Codex | 001 | 5 | multi_hop | ✓ 125k | ✓ 137k | +10% |
| q07 The company overview notes that Coppertide's CFO r… | Claude Code | 001 | 5 | multi_hop | ✓ 328k | ✓ 237k | -28% |
| q07 The company overview notes that Coppertide's CFO r… | Codex | 001 | 5 | multi_hop | ✓ 175k | ✓ 230k | +31% |
| q08 According to the SOW payment schedule table, on wh… | Claude Code | 001 | 5 | format_spanning | ✓ 175k | ✓ 95k | -45% |
| q08 According to the SOW payment schedule table, on wh… | Codex | 001 | 5 | format_spanning | ✓ 98k | ✓ 93k | -5% |
| q09 The kickoff transcript's action-items table lists … | Claude Code | 001 | 5 | format_spanning | ✓ 252k | ✓ 165k | -35% |
| q09 The kickoff transcript's action-items table lists … | Codex | 001 | 5 | format_spanning | ✓ 125k | ✓ 119k | -4% |
| q01 What is Ana Sokol's Amtrak reservation number for … | Claude Code | 002 | 10 | single_hop | ✓ 210k | ✓ 91k | -57% |
| q01 What is Ana Sokol's Amtrak reservation number for … | Codex | 002 | 10 | single_hop | ✓ 237k | ✓ 60k | -75% |
| q02 What is the VAULT confirmation number assigned to … | Claude Code | 002 | 10 | single_hop | ✓ 182k | ✓ 92k | -49% |
| q02 What is the VAULT confirmation number assigned to … | Codex | 002 | 10 | single_hop | ✓ 123k | ✓ 99k | -20% |
| q03 Mira warned Ana about specific logistical risks wh… | Claude Code | 002 | 10 | multi_hop | ✓ 391k | ✓ 168k | -57% |
| q03 Mira warned Ana about specific logistical risks wh… | Codex | 002 | 10 | multi_hop | ✓ 422k | ✓ 144k | -66% |
| q04 What is the OpenTable confirmation reference for t… | Claude Code | 002 | 10 | multi_hop | ✓ 481k | ✓ 207k | -57% |
| q04 What is the OpenTable confirmation reference for t… | Codex | 002 | 10 | multi_hop | ✓ 348k | ✓ 200k | -42% |
| q05 Jordan planned to visit Great Island Common in New… | Claude Code | 002 | 10 | multi_hop | ✓ 787k | ✓ 163k | -79% |
| q05 Jordan planned to visit Great Island Common in New… | Codex | 002 | 10 | multi_hop | ✓ 319k | ✓ 290k | -9% |
| q06 Why did Mira say she could not join Ana and Jordan… | Claude Code | 002 | 10 | multi_hop | ✓ 442k | ✓ 96k | -78% |
| q06 Why did Mira say she could not join Ana and Jordan… | Codex | 002 | 10 | multi_hop | ✓ 275k | ✓ 132k | -52% |
| q07 Carolyn Foley's reply to Ana's pre-arrival email m… | Claude Code | 002 | 10 | multi_hop | ✗ 561k | ✓ 254k | -55% |
| q07 Carolyn Foley's reply to Ana's pre-arrival email m… | Codex | 002 | 10 | multi_hop | ✓ 117k | ✗ 167k | +43% |
| q08 According to the Amtrak confirmation email, what i… | Claude Code | 002 | 10 | format_spanning | ✓ 182k | ✓ 96k | -47% |
| q08 According to the Amtrak confirmation email, what i… | Codex | 002 | 10 | format_spanning | ✓ 121k | ✓ 90k | -26% |
| q09 What is the revised final total for Ana's Martin H… | Claude Code | 002 | 10 | format_spanning | ✓ 526k | ✓ 129k | -75% |
| q09 What is the revised final total for Ana's Martin H… | Codex | 002 | 10 | format_spanning | ✓ 202k | ✓ 132k | -35% |
| q01 According to the cardiac catheterization report fo… | Claude Code | 003 | 20 | single_hop | ✓ 215k | ✓ 120k | -44% |
| q01 According to the cardiac catheterization report fo… | Codex | 003 | 20 | single_hop | ✓ 233k | ✓ 121k | -48% |
| q02 What volume of air was used to inflate the TR Band… | Claude Code | 003 | 20 | single_hop | ✓ 217k | ✓ 96k | -56% |
| q02 What volume of air was used to inflate the TR Band… | Codex | 003 | 20 | single_hop | ✓ 281k | ✓ 99k | -65% |
| q03 Hugo Marchetti was discharged after his NSTEMI on … | Claude Code | 003 | 20 | multi_hop | ✓ 295k | ✓ 125k | -57% |
| q03 Hugo Marchetti was discharged after his NSTEMI on … | Codex | 003 | 20 | multi_hop | ✓ 223k | ✓ 345k | +55% |
| q04 What was Hugo Marchetti's baseline hemoglobin on t… | Claude Code | 003 | 20 | multi_hop | ✓ 329k | ✓ 147k | -55% |
| q04 What was Hugo Marchetti's baseline hemoglobin on t… | Codex | 003 | 20 | multi_hop | ✓ 320k | ✓ 150k | -53% |
| q05 What stent was deployed in Hugo Marchetti's mid-LA… | Claude Code | 003 | 20 | multi_hop | ✓ 267k | ✓ 344k | +29% |
| q05 What stent was deployed in Hugo Marchetti's mid-LA… | Codex | 003 | 20 | multi_hop | ✓ 300k | ✓ 169k | -44% |
| q06 A cards fellow made a remark during Hugo Marchetti… | Claude Code | 003 | 20 | multi_hop | ✓ 325k | ✓ 297k | -9% |
| q06 A cards fellow made a remark during Hugo Marchetti… | Codex | 003 | 20 | multi_hop | ✓ 264k | ✓ 223k | -16% |
| q07 Hugo Marchetti asked a nurse whether he could take… | Claude Code | 003 | 20 | multi_hop | ✓ 780k | ✓ 169k | -78% |
| q07 Hugo Marchetti asked a nurse whether he could take… | Codex | 003 | 20 | multi_hop | ✓ 563k | ✓ 274k | -51% |
| q08 Using the HEART score component table in Hugo Marc… | Claude Code | 003 | 20 | format_spanning | ✓ 208k | ✓ 93k | -56% |
| q08 Using the HEART score component table in Hugo Marc… | Codex | 003 | 20 | format_spanning | ✓ 259k | ✓ 117k | -55% |
| q09 The Day-2 transthoracic echocardiogram report for … | Claude Code | 003 | 20 | format_spanning | ✓ 185k | ✓ 92k | -50% |
| q09 The Day-2 transthoracic echocardiogram report for … | Codex | 003 | 20 | format_spanning | ✓ 120k | ✓ 145k | +20% |
| q01 Carmen Ostrowski's demand letter of February 19, 2… | Claude Code | 004 | 30 | single_hop | ✓ 336k | ✓ 161k | -52% |
| q01 Carmen Ostrowski's demand letter of February 19, 2… | Codex | 004 | 30 | single_hop | ✓ 178k | ✓ 57k | -68% |
| q02 The pre-counsel correspondence file (correspondenc… | Claude Code | 004 | 30 | single_hop | ✓ 187k | ✓ 92k | -51% |
| q02 The pre-counsel correspondence file (correspondenc… | Codex | 004 | 30 | single_hop | ✓ 187k | ✓ 75k | -60% |
| q03 Cross-referencing the defense's discovery response… | Claude Code | 004 | 30 | multi_hop | ✓ 358k | ✓ 409k | +14% |
| q03 Cross-referencing the defense's discovery response… | Codex | 004 | 30 | multi_hop | ✓ 314k | ✓ 214k | -32% |
| q04 Karras's original Answer (pleadings/answer-2026-03… | Claude Code | 004 | 30 | multi_hop | ✓ 415k | ✓ 166k | -60% |
| q04 Karras's original Answer (pleadings/answer-2026-03… | Codex | 004 | 30 | multi_hop | ✓ 386k | ✓ 57k | -85% |
| q05 Karras claims a $4,500 change order was verbally a… | Claude Code | 004 | 30 | multi_hop | ✓ 457k | ✓ 418k | -9% |
| q05 Karras claims a $4,500 change order was verbally a… | Codex | 004 | 30 | multi_hop | ✓ 538k | ✓ 377k | -30% |
| q06 The corpus contains a discrepancy about the date W… | Claude Code | 004 | 30 | multi_hop | ✓ 356k | ✓ 239k | -33% |
| q06 The corpus contains a discrepancy about the date W… | Codex | 004 | 30 | multi_hop | ✓ 294k | ✓ 98k | -67% |
| q07 Carmen planned an adverse-inference request target… | Claude Code | 004 | 30 | multi_hop | ✓ 621k | ✓ 356k | -43% |
| q07 Carmen planned an adverse-inference request target… | Codex | 004 | 30 | multi_hop | ✓ 255k | ✓ 134k | -48% |
| q08 The filed complaint (pleadings/complaint-filed-202… | Claude Code | 004 | 30 | format_spanning | ✓ 270k | ✓ 227k | -16% |
| q08 The filed complaint (pleadings/complaint-filed-202… | Codex | 004 | 30 | format_spanning | ✓ 213k | ✓ 64k | -70% |
| q09 The court docket (Part B of correspondence/court/f… | Claude Code | 004 | 30 | format_spanning | ✓ 125k | ✓ 132k | +5% |
| q09 The court docket (Part B of correspondence/court/f… | Codex | 004 | 30 | format_spanning | ✓ 135k | ✓ 103k | -23% |
| q01 What was the Zelle confirmation number on Yael Str… | Claude Code | 005 | 50 | single_hop | ✓ 341k | ✓ 93k | -73% |
| q01 What was the Zelle confirmation number on Yael Str… | Codex | 005 | 50 | single_hop | ✓ 301k | ✓ 105k | -65% |
| q02 According to the apartment's shared appliances inv… | Claude Code | 005 | 50 | single_hop | ✓ 186k | ✓ 96k | -48% |
| q02 According to the apartment's shared appliances inv… | Codex | 005 | 50 | single_hop | ✓ 196k | ✓ 76k | -61% |
| q03 At the September 28 dinner party, Olu Adebayo brok… | Claude Code | 005 | 50 | multi_hop | ✓ 419k | ✓ 131k | -69% |
| q03 At the September 28 dinner party, Olu Adebayo brok… | Codex | 005 | 50 | multi_hop | ✓ 608k | ✓ 258k | -58% |
| q04 The September 22, 2025 bathroom ceiling leak in Ap… | Claude Code | 005 | 50 | multi_hop | ✓ 520k | ✓ 266k | -49% |
| q04 The September 22, 2025 bathroom ceiling leak in Ap… | Codex | 005 | 50 | multi_hop | ✓ 734k | ✓ 199k | -73% |
| q05 On September 30, 2025, Wren's payroll failed to po… | Claude Code | 005 | 50 | multi_hop | ✓ 280k | ✓ 240k | -15% |
| q05 On September 30, 2025, Wren's payroll failed to po… | Codex | 005 | 50 | multi_hop | ✓ 653k | ✓ 273k | -58% |
| q06 In the October 8, 2025 voice memo, Wren states tha… | Claude Code | 005 | 50 | multi_hop | ✓ 320k | ✓ 212k | -34% |
| q06 In the October 8, 2025 voice memo, Wren states tha… | Codex | 005 | 50 | multi_hop | ✓ 486k | ✓ 247k | -49% |
| q07 Mr. Aleksandar Nikolajević in Apt 2B had a brief f… | Claude Code | 005 | 50 | multi_hop | ✗ 490k | ✗ 171k | -65% |
| q07 Mr. Aleksandar Nikolajević in Apt 2B had a brief f… | Codex | 005 | 50 | multi_hop | ✓ 319k | ✗ 334k | +5% |
| q08 The image transcription of the September 22 ceilin… | Claude Code | 005 | 50 | format_spanning | ✓ 154k | ✓ 94k | -39% |
| q08 The image transcription of the September 22 ceilin… | Codex | 005 | 50 | format_spanning | ✓ 259k | ✓ 101k | -61% |
| q09 In the October 8, 2025 voice memo transcription, O… | Claude Code | 005 | 50 | format_spanning | ✓ 316k | ✓ 294k | -7% |
| q09 In the October 8, 2025 voice memo transcription, O… | Codex | 005 | 50 | format_spanning | ✓ 246k | ✓ 145k | -41% |
| q01 In PR #67 (the CVE-2026-31418 patch), exactly how … | Claude Code | 006 | 100 | single_hop | ✓ 150k | ✓ 93k | -38% |
| q01 In PR #67 (the CVE-2026-31418 patch), exactly how … | Codex | 006 | 100 | single_hop | ✓ 166k | ✓ 95k | -43% |
| q02 What exact CVSS 3.1 score and full vector string d… | Claude Code | 006 | 100 | single_hop | ✓ 183k | ✓ 164k | -10% |
| q02 What exact CVSS 3.1 score and full vector string d… | Codex | 006 | 100 | single_hop | ✓ 182k | ✓ 148k | -19% |
| q03 PR #84 (concurrent file processing) reported bench… | Claude Code | 006 | 100 | multi_hop | ✓ 270k | ✓ 166k | -38% |
| q03 PR #84 (concurrent file processing) reported bench… | Codex | 006 | 100 | multi_hop | ✓ 524k | ✓ 283k | -46% |
| q04 The scratch-plugin-design-brainstorm.md contains a… | Claude Code | 006 | 100 | multi_hop | ✓ 331k | ✓ 201k | -39% |
| q04 The scratch-plugin-design-brainstorm.md contains a… | Codex | 006 | 100 | multi_hop | ✓ 224k | ✓ 214k | -4% |
| q05 Lior's outreach email to Charlie Marsh describes h… | Claude Code | 006 | 100 | multi_hop | ✓ 246k | ✓ 271k | +10% |
| q05 Lior's outreach email to Charlie Marsh describes h… | Codex | 006 | 100 | multi_hop | ✓ 177k | ✓ 146k | -18% |
| q06 The v0.5.0 release notes state that v0.4.2 was yan… | Claude Code | 006 | 100 | multi_hop | ✓ 422k | ✓ 169k | -60% |
| q06 The v0.5.0 release notes state that v0.4.2 was yan… | Codex | 006 | 100 | multi_hop | ✓ 303k | ✓ 193k | -36% |
| q07 Bytebase is kitabi's second sponsor. What is Byteb… | Claude Code | 006 | 100 | multi_hop | ✓ 268k | ✓ 202k | -25% |
| q07 Bytebase is kitabi's second sponsor. What is Byteb… | Codex | 006 | 100 | multi_hop | ✓ 288k | ✓ 108k | -62% |
| q08 Using the benchmark table in the v0.5.0 release no… | Claude Code | 006 | 100 | format_spanning | ✓ 178k | ✓ 95k | -46% |
| q08 Using the benchmark table in the v0.5.0 release no… | Codex | 006 | 100 | format_spanning | ✓ 131k | ✓ 113k | -14% |
| q09 Reproduce the ADR-003 Section 9 (Status and timeli… | Claude Code | 006 | 100 | format_spanning | ✓ 311k | ✓ 235k | -24% |
| q09 Reproduce the ADR-003 Section 9 (Status and timeli… | Codex | 006 | 100 | format_spanning | ✓ 285k | ✓ 158k | -45% |
| q01 What is the lab protocol ID for general maintenanc… | Claude Code | 007 | 200 | single_hop | ✓ 272k | ✓ 94k | -66% |
| q01 What is the lab protocol ID for general maintenanc… | Codex | 007 | 200 | single_hop | ✓ 184k | ✓ 86k | -53% |
| q02 What version of R was used for the BIO-510 class? | Claude Code | 007 | 200 | single_hop | ✓ 283k | ✓ 96k | -66% |
| q02 What version of R was used for the BIO-510 class? | Codex | 007 | 200 | single_hop | ✓ 470k | ✓ 275k | -41% |
| q03 What is the PubMed ID associated with the Nature M… | Claude Code | 007 | 200 | single_hop | ✓ 234k | ✓ 166k | -29% |
| q03 What is the PubMed ID associated with the Nature M… | Codex | 007 | 200 | single_hop | ✓ 403k | ✓ 137k | -66% |
| q04 What is the cell line Lena used for her first form… | Claude Code | 007 | 200 | multi_hop | ✓ 267k | ✓ 210k | -21% |
| q04 What is the cell line Lena used for her first form… | Codex | 007 | 200 | multi_hop | ✓ 234k | ✓ 217k | -7% |
| q05 What was the subject of the GEN-600 paper critique… | Claude Code | 007 | 200 | multi_hop | ✓ 262k | ✓ 96k | -63% |
| q05 What was the subject of the GEN-600 paper critique… | Codex | 007 | 200 | multi_hop | ✓ 531k | ✓ 336k | -37% |
| q06 In which courses did Lena first encounter the gene… | Claude Code | 007 | 200 | multi_hop | ✓ 2657k | ✓ 1434k | -46% |
| q06 In which courses did Lena first encounter the gene… | Codex | 007 | 200 | multi_hop | ✓ 1575k | ✓ 1590k | +1% |
| q07 What was the date of Lena's first 1-on-1 meeting w… | Claude Code | 007 | 200 | format_spanning | ✓ 273k | ✓ 168k | -38% |
| q07 What was the date of Lena's first 1-on-1 meeting w… | Codex | 007 | 200 | format_spanning | ✓ 222k | ✓ 169k | -24% |
| q08 What date was the BIO-510 midterm exam (per the an… | Claude Code | 007 | 200 | format_spanning | ✓ 291k | ✓ 223k | -23% |
| q08 What date was the BIO-510 midterm exam (per the an… | Codex | 007 | 200 | format_spanning | ✓ 294k | ✓ 122k | -59% |
| q01 What was the codename for CogniSynth's Minimum Via… | Claude Code | 008 | 299 | single_hop | ✓ 309k | ✓ 156k | -49% |
| q01 What was the codename for CogniSynth's Minimum Via… | Codex | 008 | 299 | single_hop | ✓ 230k | ✓ 190k | -17% |
| q02 When did CogniSynth officially file for incorporat… | Claude Code | 008 | 299 | single_hop | ✓ 238k | ✓ 194k | -18% |
| q02 When did CogniSynth officially file for incorporat… | Codex | 008 | 299 | single_hop | ✓ 532k | ✗ 239k | -55% |
| q03 Who is CogniSynth's co-founder and CTO? | Claude Code | 008 | 299 | single_hop | ✓ 167k | ✓ 113k | -32% |
| q03 Who is CogniSynth's co-founder and CTO? | Codex | 008 | 299 | single_hop | ✓ 122k | ✓ 76k | -38% |
| q04 What was the total estimated H1 2023 operating bur… | Claude Code | 008 | 299 | multi_hop | ✓ 749k | ✓ 452k | -40% |
| q04 What was the total estimated H1 2023 operating bur… | Codex | 008 | 299 | multi_hop | ✓ 980k | ✓ 938k | -4% |
| q05 What was the date of the pivotal customer intervie… | Claude Code | 008 | 299 | format_spanning | ✓ 1133k | ✓ 361k | -68% |
| q05 What was the date of the pivotal customer intervie… | Codex | 008 | 299 | format_spanning | ✓ 1860k | ✓ 1044k | -44% |
| q06 When did Foundry Ventures provide a verbal commitm… | Claude Code | 008 | 299 | format_spanning | ✗ 8743k | ✓ 1006k | -88% |
| q06 When did Foundry Ventures provide a verbal commitm… | Codex | 008 | 299 | format_spanning | ✓ 7438k | ✗ 0k | -100% |
| q07 According to Sam Chen's signed CogniSynth employme… | Claude Code | 008 | 299 | single_hop | ✓ 1016k | ✓ 193k | -81% |
| q07 According to Sam Chen's signed CogniSynth employme… | Codex | 008 | 299 | single_hop | ✓ 536k | ✓ 166k | -69% |
| q08 What were the four components of the initial techn… | Claude Code | 008 | 299 | multi_hop | ✓ 310k | ✓ 259k | -17% |
| q08 What were the four components of the initial techn… | Codex | 008 | 299 | multi_hop | ✓ 306k | ✓ 166k | -46% |
| q01 What is the standard session rate for therapy at C… | Claude Code | 009 | 480 | single_hop | ✗ 285k | ✗ 91k | -68% |
| q01 What is the standard session rate for therapy at C… | Codex | 009 | 480 | single_hop | ✗ 1189k | ✗ 98k | -92% |
| q02 What is the CPT code most commonly billed by the p… | Claude Code | 009 | 480 | single_hop | ✓ 241k | ✓ 95k | -61% |
| q02 What is the CPT code most commonly billed by the p… | Codex | 009 | 480 | single_hop | ✓ 268k | ✓ 127k | -53% |
| q03 Who is the Claims Adjuster at Pacific Health Allia… | Claude Code | 009 | 480 | single_hop | ✓ 175k | ✓ 92k | -48% |
| q03 Who is the Claims Adjuster at Pacific Health Allia… | Codex | 009 | 480 | single_hop | ✓ 227k | ✓ 68k | -70% |
| q04 What was the total amount of money Pacific Health … | Claude Code | 009 | 480 | multi_hop | ✓ 4763k | ✗ 812k | -83% |
| q04 What was the total amount of money Pacific Health … | Codex | 009 | 480 | multi_hop | ✓ 1862k | ✓ 5834k | +213% |
| q05 Which client's claim was denied on December 18, 20… | Claude Code | 009 | 480 | multi_hop | ✓ 575k | ✓ 96k | -83% |
| q05 Which client's claim was denied on December 18, 20… | Codex | 009 | 480 | multi_hop | ✓ 724k | ✓ 148k | -80% |
| q06 What was the primary diagnosis for client AB-101, … | Claude Code | 009 | 480 | multi_hop | ✓ 547k | ✓ 130k | -76% |
| q06 What was the primary diagnosis for client AB-101, … | Codex | 009 | 480 | multi_hop | ✓ 519k | ✓ 157k | -70% |
| q07 What was the date of the first group supervision s… | Claude Code | 009 | 480 | format_spanning | ✓ 241k | ✓ 234k | -3% |
| q07 What was the date of the first group supervision s… | Codex | 009 | 480 | format_spanning | ✓ 1394k | ✓ 655k | -53% |
| q08 What is the Cascadia Behavioral Health reimburseme… | Claude Code | 009 | 480 | format_spanning | ✓ 255k | ✓ 210k | -18% |
| q08 What is the Cascadia Behavioral Health reimburseme… | Codex | 009 | 480 | format_spanning | ✓ 368k | ✓ 284k | -23% |
| q01 What was the start date for Project Nova? | Claude Code | 010 | 991 | single_hop | ✓ 802k | ✓ 158k | -80% |
| q01 What was the start date for Project Nova? | Codex | 010 | 991 | single_hop | ✓ 652k | ✓ 818k | +26% |
| q02 What is the name of the primary backend service fo… | Claude Code | 010 | 991 | single_hop | ✓ 1290k | ✓ 94k | -93% |
| q02 What is the name of the primary backend service fo… | Codex | 010 | 991 | single_hop | ✓ 391k | ✓ 129k | -67% |
| q03 What was the final version number for the Project … | Claude Code | 010 | 991 | single_hop | ✓ 501k | ✓ 400k | -20% |
| q03 What was the final version number for the Project … | Codex | 010 | 991 | single_hop | ✓ 355k | ✓ 186k | -47% |
| q04 What was the reported root cause of 'The Great Slo… | Claude Code | 010 | 991 | multi_hop | ✓ 260k | ✓ 132k | -49% |
| q04 What was the reported root cause of 'The Great Slo… | Codex | 010 | 991 | multi_hop | ✓ 1202k | ✓ 351k | -71% |
| q05 What was the contractual deadline for the Project … | Claude Code | 010 | 991 | multi_hop | ✓ 722k | ✓ 536k | -26% |
| q05 What was the contractual deadline for the Project … | Codex | 010 | 991 | multi_hop | ✓ 3615k | ✓ 5013k | +39% |
| q06 What was the total R&D budget allocated for Projec… | Claude Code | 010 | 991 | multi_hop | ✓ 1714k | ✓ 444k | -74% |
| q06 What was the total R&D budget allocated for Projec… | Codex | 010 | 991 | multi_hop | ✓ 5343k | ✓ 2188k | -59% |
| q07 What was the specific bug ticket ID for the critic… | Claude Code | 010 | 991 | format_spanning | ✓ 201k | ✓ 124k | -38% |
| q07 What was the specific bug ticket ID for the critic… | Codex | 010 | 991 | format_spanning | ✓ 637k | ✗ 354k | -44% |
| q08 What was the total amount of the September CloudPr… | Claude Code | 010 | 991 | format_spanning | ✓ 573k | ✓ 123k | -79% |
| q08 What was the total amount of the September CloudPr… | Codex | 010 | 991 | format_spanning | ✓ 672k | ✓ 121k | -82% |
| q01 What was the approved budget for 'Project Nighting… | Claude Code | 011 | 1,998 | single_hop | ✗ 201k | ✗ 164k | -18% |
| q01 What was the approved budget for 'Project Nighting… | Codex | 011 | 1,998 | single_hop | ✓ 1341k | ✓ 477k | -64% |
| q02 What internal project code was used for Veridian's… | Claude Code | 011 | 1,998 | single_hop | ✓ 327k | ✓ 95k | -71% |
| q02 What internal project code was used for Veridian's… | Codex | 011 | 1,998 | single_hop | ✓ 416k | ✓ 87k | -79% |
| q03 According to the published sidebar explainer on pr… | Claude Code | 011 | 1,998 | single_hop | ✓ 157k | ✓ 241k | +54% |
| q03 According to the published sidebar explainer on pr… | Codex | 011 | 1,998 | single_hop | ✓ 867k | ✓ 108k | -88% |
| q04 What was the date of the first significant intervi… | Claude Code | 011 | 1,998 | multi_hop | ✓ 377k | ✓ 274k | -27% |
| q04 What was the date of the first significant intervi… | Codex | 011 | 1,998 | multi_hop | ✓ 976k | ✓ 385k | -61% |
| q05 Which two I-Team members were primarily responsibl… | Claude Code | 011 | 1,998 | multi_hop | ✓ 1172k | ✗ 165k | -86% |
| q05 Which two I-Team members were primarily responsibl… | Codex | 011 | 1,998 | multi_hop | ✓ 2930k | ✓ 1260k | -57% |
| q06 What was the internal project code for Veridian's … | Claude Code | 011 | 1,998 | multi_hop | ✓ 970k | ✓ 132k | -86% |
| q06 What was the internal project code for Veridian's … | Codex | 011 | 1,998 | multi_hop | ✓ 640k | ✓ 186k | -71% |
| q07 What was the filename of the first encrypted email… | Claude Code | 011 | 1,998 | format_spanning | ✓ 723k | ✓ 293k | -59% |
| q07 What was the filename of the first encrypted email… | Codex | 011 | 1,998 | format_spanning | ✓ 1118k | ✓ 645k | -42% |
| q08 What specific Illinois regulation regarding staffi… | Claude Code | 011 | 1,998 | format_spanning | ✓ 348k | ✓ 160k | -54% |
| q08 What specific Illinois regulation regarding staffi… | Codex | 011 | 1,998 | format_spanning | ✗ 1432k | ✗ 238k | -83% |
| q01 What was the total estimated cost of the Kasnian G… | Claude Code | 012 | 4,998 | single_hop | ✓ 377k | ✓ 208k | -45% |
| q01 What was the total estimated cost of the Kasnian G… | Codex | 012 | 4,998 | single_hop | ✗ 1657k | ✓ 1250k | -25% |
| q02 What is the physical street address of the U.S. Em… | Claude Code | 012 | 4,998 | single_hop | ✓ 822k | ✓ 204k | -75% |
| q02 What is the physical street address of the U.S. Em… | Codex | 012 | 4,998 | single_hop | ✓ 805k | ✓ 1528k | +90% |
| q03 Which U.S. government bureau is responsible for ov… | Claude Code | 012 | 4,998 | single_hop | ✓ 271k | ✓ 112k | -59% |
| q03 Which U.S. government bureau is responsible for ov… | Codex | 012 | 4,998 | single_hop | ✓ 437k | ✓ 201k | -54% |
| q04 What was Omni Energy Corp.'s investment share in t… | Claude Code | 012 | 4,998 | multi_hop | ✗ 1528k | ✗ 245k | -84% |
| q04 What was Omni Energy Corp.'s investment share in t… | Codex | 012 | 4,998 | multi_hop | ✗ 3865k | ✗ 1753k | -55% |
| q05 When was Frank Miller arrested and what was the ca… | Claude Code | 012 | 4,998 | multi_hop | ✓ 286k | ✗ 184k | -36% |
| q05 When was Frank Miller arrested and what was the ca… | Codex | 012 | 4,998 | multi_hop | ✓ 2933k | ✓ 568k | -81% |
| q06 Who is the Kasnia Desk Officer in Washington (per … | Claude Code | 012 | 4,998 | multi_hop | ✓ 1077k | ✓ 127k | -88% |
| q06 Who is the Kasnia Desk Officer in Washington (per … | Codex | 012 | 4,998 | multi_hop | ✓ 575k | ✓ 127k | -78% |
| q07 According to Ambassador Jones's 2021-06-16 informa… | Claude Code | 012 | 4,998 | format_spanning | ✓ 372k | ✓ 268k | -28% |
| q07 According to Ambassador Jones's 2021-06-16 informa… | Codex | 012 | 4,998 | format_spanning | ✓ 671k | ✓ 452k | -33% |
| q08 In the email thread regarding Frank Miller's arres… | Claude Code | 012 | 4,998 | format_spanning | ✗ 1602k | ✗ 196k | -88% |
| q08 In the email thread regarding Frank Miller's arres… | Codex | 012 | 4,998 | format_spanning | ✓ 1265k | ✓ 767k | -39% |
| q01 What is the aggregate financing amount specified i… | Claude Code | 013 | 9,988 | single_hop | ✓ 5063k | ✓ 206k | -96% |
| q01 What is the aggregate financing amount specified i… | Codex | 013 | 9,988 | single_hop | ✓ 711k | ✓ 324k | -54% |
| q02 What is the version number associated with the 'Od… | Claude Code | 013 | 9,988 | single_hop | ✓ 1440k | ✓ 291k | -80% |
| q02 What is the version number associated with the 'Od… | Codex | 013 | 9,988 | single_hop | ✓ 2177k | ✓ 451k | -79% |
| q03 Which customer provided Nexus Innovations with a f… | Claude Code | 013 | 9,988 | single_hop | ✗ 274k | ✓ 88k | -68% |
| q03 Which customer provided Nexus Innovations with a f… | Codex | 013 | 9,988 | single_hop | ✓ 774k | ✓ 315k | -59% |
| q04 What was the annual contract value (ACV) for the G… | Claude Code | 013 | 9,988 | multi_hop | ✗ 2582k | ✗ 186k | -93% |
| q04 What was the annual contract value (ACV) for the G… | Codex | 013 | 9,988 | multi_hop | ✗ 4017k | ✗ 590k | -85% |
| q05 Who is the CTO of Nexus Innovations (per his self-… | Claude Code | 013 | 9,988 | multi_hop | ✓ 956k | ✓ 344k | -64% |
| q05 Who is the CTO of Nexus Innovations (per his self-… | Codex | 013 | 9,988 | multi_hop | ✓ 1096k | ✓ 276k | -75% |
| q06 According to Alex Miller's Q1 planning data synthe… | Claude Code | 013 | 9,988 | multi_hop | ✗ 784k | ✗ 582k | -26% |
| q06 According to Alex Miller's Q1 planning data synthe… | Codex | 013 | 9,988 | multi_hop | ✓ 888k | ✓ 2502k | +182% |
| q07 Maya Reyes's 2023 personal goals and principles do… | Claude Code | 013 | 9,988 | format_spanning | ✓ 851k | ✓ 453k | -47% |
| q07 Maya Reyes's 2023 personal goals and principles do… | Codex | 013 | 9,988 | format_spanning | ✓ 1312k | ✓ 1013k | -23% |
| q08 What was the date of the Starlight Shipping Odysse… | Claude Code | 013 | 9,988 | format_spanning | ✗ 1238k | ✓ 267k | -78% |
| q08 What was the date of the Starlight Shipping Odysse… | Codex | 013 | 9,988 | format_spanning | ✓ 1330k | ✓ 1796k | +35% |