Evidence and validation

Internal runtime evidence for the imLayer core.

The strongest current signal is an internal runtime evaluation of 2521 cases on a bounded live-model workflow surface. It shows higher next-action correctness (1.0000 versus 0.3455), 61.25% lower input token load, and lower latency at the average, median, and p95.

Internal evidence only. Not production proof. Not customer proof.

Current primary runtime evidence

Compact decision state outperformed raw workflow history on next-action correctness while materially lowering downstream model input cost.

Evaluated cases: 2521 / 2521
Packet accuracy: 1.0000
Raw accuracy: 0.3455
Input token reduction: 61.25%

Key metrics

Coverage

Evaluated cases: 2521 / 2521
Failed cases: 0

Correctness

Raw accuracy: 0.3455
Packet accuracy: 1.0000
Raw-only wins: 0
Packet-only wins: 1650
Both wrong: 0
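
The correctness figures can be cross-checked against each other. The sketch below assumes every evaluated case falls into exactly one of four outcomes (both correct, packet-only win, raw-only win, both wrong). The both-correct count is not reported in the source and is derived here; the function name `reconcile` is illustrative only.

```python
# Reported figures from the correctness block.
TOTAL_CASES = 2521
PACKET_ONLY_WINS = 1650
RAW_ONLY_WINS = 0
BOTH_WRONG = 0


def reconcile(total, packet_only, raw_only, both_wrong):
    """Derive the implied both-correct count and the two accuracies.

    Assumes the four outcome buckets are mutually exclusive and exhaustive.
    """
    both_correct = total - packet_only - raw_only - both_wrong
    if both_correct < 0:
        raise ValueError("outcome counts exceed the number of evaluated cases")
    raw_accuracy = (both_correct + raw_only) / total
    packet_accuracy = (both_correct + packet_only) / total
    return both_correct, raw_accuracy, packet_accuracy


both_correct, raw_acc, packet_acc = reconcile(
    TOTAL_CASES, PACKET_ONLY_WINS, RAW_ONLY_WINS, BOTH_WRONG
)
print(f"both correct: {both_correct}")
print(f"raw accuracy: {raw_acc:.4f}")
print(f"packet accuracy: {packet_acc:.4f}")
```

Under that assumption the derived accuracies round to the reported 0.3455 and 1.0000, so the win counts and the accuracy figures are mutually consistent.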

Runtime economics

Average raw input tokens: 319.4597
Average packet input tokens: 118.0079
Input token reduction: 61.25%
Average raw output tokens: 6.8401
Average packet output tokens: 2.3328
Average raw latency: 715.95 ms
Average packet latency: 640.032 ms
Average latency delta: -75.918 ms
Median raw latency: 638.228 ms
Median packet latency: 551.257 ms
Median latency delta: -86.971 ms
p95 raw latency: 1119.712 ms
p95 packet latency: 1090.07 ms
p95 latency delta: -29.642 ms
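
The deltas in this block follow directly from the paired raw and packet figures. The sketch below recomputes them, assuming the token figures are per-case averages and that each latency delta is packet minus raw; the helper names are illustrative. Computed from the two averages, the input reduction comes out near 63.06% rather than the reported 61.25%. One possible explanation, which the source does not confirm, is that the reported figure is a mean of per-case reductions rather than a ratio of averages.

```python
# Reported latency figures in milliseconds: (raw, packet).
LATENCY_MS = {
    "average": (715.95, 640.032),
    "median": (638.228, 551.257),
    "p95": (1119.712, 1090.07),
}

# Reported input token figures, assumed here to be per-case averages.
RAW_INPUT_TOKENS = 319.4597
PACKET_INPUT_TOKENS = 118.0079


def latency_delta(raw_ms, packet_ms):
    """Packet minus raw; a negative value means the packet condition was faster."""
    return round(packet_ms - raw_ms, 3)


def relative_reduction(raw, packet):
    """Fraction by which the packet value is smaller than the raw value."""
    return (raw - packet) / raw


deltas = {
    name: latency_delta(raw, packet) for name, (raw, packet) in LATENCY_MS.items()
}
input_reduction = relative_reduction(RAW_INPUT_TOKENS, PACKET_INPUT_TOKENS)

for name, delta in deltas.items():
    print(f"{name} latency delta: {delta} ms")
print(f"input reduction from averages: {input_reduction:.2%}")
```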

Methodology

What was measured

The evaluation measured next-action correctness, input/output token usage, and latency on a bounded workflow decision surface.

What raw means

Raw means the downstream model received raw workflow history directly, without imLayer compression into bounded decision state.

What packet means

Packet means the downstream model received the compact decision-ready state produced by imLayer instead of full raw history.

What expected means

Expected refers to the bounded target next action used as the evaluation reference for correctness comparison.
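
The raw, packet, and expected definitions above suggest a simple per-case scoring scheme. The sketch below is a minimal illustration, not the evaluation harness itself: it assumes correctness is an exact match between the chosen action and the expected action, and the `Case` type, the `classify` and `score` functions, and the action names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    expected: str       # bounded target next action (the evaluation reference)
    raw_action: str     # action chosen when the model received raw history
    packet_action: str  # action chosen when the model received the packet


def classify(case: Case) -> str:
    """Place one case into one of four mutually exclusive outcome buckets."""
    raw_ok = case.raw_action == case.expected
    packet_ok = case.packet_action == case.expected
    if raw_ok and packet_ok:
        return "both_correct"
    if packet_ok:
        return "packet_only_win"
    if raw_ok:
        return "raw_only_win"
    return "both_wrong"


def score(cases):
    """Aggregate outcome counts and accuracies over a list of cases."""
    if not cases:
        raise ValueError("no cases to score")
    outcomes = Counter(classify(c) for c in cases)
    total = len(cases)
    return {
        "outcomes": dict(outcomes),
        "raw_accuracy": (outcomes["both_correct"] + outcomes["raw_only_win"]) / total,
        "packet_accuracy": (outcomes["both_correct"] + outcomes["packet_only_win"]) / total,
    }


# Hypothetical toy cases, not taken from the evaluation.
toy = [
    Case("escalate", "escalate", "escalate"),
    Case("approve", "wait", "approve"),
    Case("wait", "approve", "wait"),
]
result = score(toy)
print(result)
```

Keeping the four buckets separate, rather than reporting only two accuracies, makes it visible whether one condition wins on cases the other loses.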

What surface was evaluated

The surface was a controlled live-model workflow slice designed to test runtime decision quality, not a broad production environment.

Qualification

Internal runtime evidence
Controlled evaluated surface
Not customer proof
Not production proof
Not blind external adjudication
Not broad prevalence proof

Historical benchmark archive

Archived internal benchmarks below show progression and strategic development. The current runtime evidence above remains the primary proof layer.

Historical internal snapshot

Heldout benchmark snapshot

An archived heldout benchmark snapshot in which the packet condition scored 0.6381 against 0.1590 for the structured non-memory baseline on a bounded internal evaluation surface.

Historical. Internal. Not current primary evidence.

Total examples: 573
Heldout subset: 347
Packet vs baseline: 0.6381 vs 0.1590
Delta: +0.4791

Next layer

See how this becomes a first commercial wedge.