Under research · Results reproducible · hardware validation pending

The findings.

Every number here comes from the program's results and is reproducible from the open release. The headline: a systematic search found heavy-hex-native codes that beat IBM's Gross code on the k·d²/n figure of merit, and we state exactly which distances are certified and which are only upper bounds.

The search · heavy-hex bivariate-bicycle codes

60,000 candidates searched · 3 seeds, random + evolutionary

14 of 50 beat Gross on k·d²/n · after rigorous re-verification

1.8–4.4× figure-of-merit gain · seven headline codes

2.00× exact-certified · [[192,8,24]], MIP solver

Finding 01 · the codes

Seven codes that beat the benchmark.

Code [[n,k,d]]	Torus	k·d²/n	vs Gross	Distance status	Qubits per logical	Novelty
[[144, 12, 12]]	12×6	12.0	1.00×	baseline	12.0	IBM Gross
[[196, 18, 24]]	14×7	52.9	4.41×	bound	10.9	novel
[[192, 16, 20]]	12×8	33.3	2.78×	bound	12.0	novel
[[196, 8, 26]]	14×7	27.6	2.30×	bound	24.5	novel
[[192, 8, 24]]	12×8	24.0	2.00×	exact	24.0	novel
[[168, 8, 22]]	12×7	23.0	1.92×	bound	21.0	cyclic risk
[[180, 6, 26]]	10×9	22.5	1.88×	bound	30.0	cyclic risk
[[198, 6, 27]]	11×9	22.1	1.84×	bound	33.0	cyclic risk

Distances: exact = certified by a MIP solver over every coset; bound = BP+OSD upper bound at partial coverage. Cyclic-torus codes carry residual prior-art risk pending a live-literature check.
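The k·d²/n and "vs Gross" columns follow directly from the [[n, k, d]] parameters, so they can be rechecked by hand or with a few lines of code. A minimal sketch; the helper names `figure_of_merit` and `gain_vs_gross` are illustrative and are not taken from the open release:

```python
# Illustrative helpers (names are assumptions, not the release's API).
GROSS = (144, 12, 12)  # IBM Gross code [[n, k, d]], the baseline row


def figure_of_merit(n, k, d):
    """Return k * d^2 / n for an [[n, k, d]] code."""
    return k * d**2 / n


def gain_vs_gross(n, k, d):
    """Return the figure of merit relative to the Gross baseline."""
    return figure_of_merit(n, k, d) / figure_of_merit(*GROSS)


# Three rows from the table above.
for n, k, d in [(196, 18, 24), (192, 16, 20), (192, 8, 24)]:
    score = figure_of_merit(n, k, d)
    gain = gain_vs_gross(n, k, d)
    print(f"[[{n}, {k}, {d}]]  k*d^2/n = {score:.1f}  vs Gross = {gain:.2f}x")
```

For rows marked "bound", d is an upper bound, so the score computed from it is an upper bound as well.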

Finding 02 · the certified one

Proven, not just estimated.

For the lead code, we didn't settle for a fast-decoder estimate. An exact solver checked every logical coset — and revised the distance down.

[[192, 8, 24]]

12×8 torus · weight-8 checks · asymmetric d_X=30, d_Z=24

fast-decoder estimate: d ≤ 26 → MIP exact: d = 24

k·d²/n = 24.0, exactly 2.00× Gross. An integer-programming solver solved every one of the 255 cosets. The asymmetric distance (d_X = 30, d_Z = 24) makes the code especially strong under biased noise. This is the result we'd stake the paper on.
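To make "exact" concrete: for a CSS code, the distance is the minimum weight of an operator that commutes with every check yet is not itself a product of checks. The sketch below finds it by brute force on the 7-qubit Steane code. This is a toy stand-in chosen for illustration, not the MIP formulation used in the study, and the function names are hypothetical.

```python
from itertools import product

# Parity-check matrix of the [7,4,3] Hamming code. The Steane code uses it
# for both its X-type and Z-type checks.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def in_kernel(v, checks):
    """True if v has even overlap with every check (commutes with all)."""
    return all(sum(a * b for a, b in zip(row, v)) % 2 == 0 for row in checks)


def row_space(checks):
    """All GF(2) linear combinations of the check rows, as a set of tuples."""
    n = len(checks[0])
    span = set()
    for coeffs in product([0, 1], repeat=len(checks)):
        vec = tuple(
            sum(c * row[i] for c, row in zip(coeffs, checks)) % 2
            for i in range(n)
        )
        span.add(vec)
    return span


def exact_distance(h_commute, h_stab):
    """Min weight of a vector in ker(h_commute) outside rowspace(h_stab)."""
    n = len(h_commute[0])
    stabilizers = row_space(h_stab)  # includes the zero vector
    best = None
    for v in product([0, 1], repeat=n):
        if v in stabilizers or not in_kernel(v, h_commute):
            continue
        weight = sum(v)
        if best is None or weight < best:
            best = weight
    return best


d_z = exact_distance(H, H)  # Z-type logicals: commute with the X checks
d_x = exact_distance(H, H)  # X-type logicals: same matrix for this code
print("Steane code distance:", min(d_x, d_z))  # 3
```

Enumerating all 2^n vectors is only feasible for toy sizes. At n = 192 it is out of reach, which is presumably why the certification here is posed as an integer program per coset instead.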

Finding 03 · the honest tradeoff

More merit, less threshold — and we say so.

The figure-of-merit gain comes at a cost: the highest-scoring codes currently have a circuit-level threshold below the gate-error rates of today's hardware. This is the central result of the study, not a caveat buried in an appendix.

What we gained

▲ Up to 4.4× Gross on k·d²/n.
▲ As few as 10.9 physical qubits per logical (Gross: 12).
▲ Heavy-hex-local checks — a constraint Gross doesn't meet.
▲ One distance certified exactly.

What it costs (for now)

▼ Circuit-level threshold below IBM gate-error rates.
▼ Highest-scoring code uses heavier weight-12 checks.
▼ Threshold measured with a naive, non-FT schedule.
▼ No hardware run yet — the demo is designed, not done.

Finding 04 · the ledger

Proven, bounded, pending, null.

The whole credibility of a QEC result rests on this distinction. So we publish it line by line — including the negative results and the work still to do.

Proven

Codes beating Gross on k·d²/n

14 of the top 50 candidates beat Gross after rigorous BP+OSD re-verification; reproducible from the open release.

Certified

[[192,8,24]] distance d = 24

Exact MIP solver, every one of 255 cosets solved. It corrected the fast-decoder estimate downward, from d ≤ 26 to d = 24, which is the value of certification.

Bounded

High-k distances (e.g. [[196,18,24]], [[192,16,20]])

BP+OSD upper bounds at partial coverage (≈12–40% of logical classes). Strong, but not yet exactly certified — k is too large for the MIP solver.

Bounded

Heavy-hex implementability

Enforced via a toroidal L1 ≤ 4 locality surrogate; an explicit Eagle/Heron graph embedding has not yet been produced.

Pending

Circuit-level threshold vs Gross

Currently below IBM gate-error rates — the figure-of-merit gain trades against threshold. Full-scale threshold runs and a fault-tolerant schedule are the next phase.

Pending

Novelty vs the literature

Smith-Normal-Form analysis rules out equivalence to IBM's published codes; a live arXiv / code-tables dedupe is still required before any novelty claim is final.

Null result

LLM-guided discovery

Tested honestly across three rounds: language models are good constraint-respecting proposers but did not beat random search at finding frontier codes. Reported as a negative result.

Not yet

Hardware demonstration

An IBM memory-experiment protocol is designed and preflighted, but has not been run. No experimental logical-error claim is made.
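On the heavy-hex implementability entry in the ledger above: the source does not define the toroidal L1 ≤ 4 surrogate precisely. One plausible reading is that every offset a check reaches must lie within wrap-around Manhattan distance 4 on the l×m torus. The sketch below encodes that reading only; the function names and the offset representation are assumptions, not the program's actual constraint.

```python
def toroidal_l1(dx, dy, l, m):
    """Wrap-around Manhattan distance of offset (dx, dy) on an l x m torus."""
    dx %= l
    dy %= m
    return min(dx, l - dx) + min(dy, m - dy)


def is_local(offsets, l, m, radius=4):
    """True if every offset lies within the given toroidal L1 radius."""
    return all(toroidal_l1(dx, dy, l, m) <= radius for dx, dy in offsets)


# Hypothetical check offsets on the 12x8 torus (not taken from any real code).
print(is_local([(1, 0), (0, 2), (11, 7)], 12, 8))  # (11, 7) wraps to distance 2
print(is_local([(6, 4)], 12, 8))                   # far corner: distance 10
```

A surrogate of this kind constrains geometry on the torus only. It does not by itself produce the explicit Eagle/Heron graph embedding that the ledger lists as still outstanding.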

The discipline, in one number

[[192, 20, 20]] topped the raw leaderboard at a claimed score of 41.67 (d=20). Rigorous verification dropped it to d=8, score 6.67 (Δd=-12). We publish failures like this on purpose.
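The size of that drop follows from the figure of merit itself, using the quoted parameters n = 192, k = 20 and Gross's score of 12.0:

```latex
\[
\left.\frac{k\,d^{2}}{n}\right|_{d=20} = \frac{20 \cdot 20^{2}}{192} \approx 41.67,
\qquad
\left.\frac{k\,d^{2}}{n}\right|_{d=8} = \frac{20 \cdot 8^{2}}{192} \approx 6.67 .
\]
```

Because the score scales with d², cutting the distance by a factor of 2.5 cuts the score by a factor of 6.25. The verified value of 6.67 is about 0.56× Gross, so the code no longer beats the benchmark.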

What stands between this and a hardware result is the next experiment.

A fault-tolerant schedule, full threshold runs, and a memory demo on real hardware — that is where backing makes the difference.
