Evidence and validation

Internal runtime evidence for the imLayer core.

The strongest current signal is an internal runtime evaluation of 2521 cases on a bounded live-model workflow surface. It shows higher next-action correctness (1.0000 with the imLayer packet vs 0.3455 with raw history), 61.25% lower input tokens, and lower average, median, and p95 latency.

Internal evidence only. Not production proof. Not customer proof.

Current signal

Evaluated cases: 2521 / 2521

Packet accuracy: 1.0000

Input token reduction: 61.25%

Current primary runtime evidence

Compact decision state outperformed raw workflow history on next-action correctness while materially lowering downstream model input cost.

Key metrics

Coverage

Evaluated cases: 2521 / 2521

Failed cases: 0

Correctness

Raw accuracy: 0.3455

Packet accuracy: 1.0000

Raw-only wins: 0

Packet-only wins: 1650

Both wrong: 0
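
The counts above are enough to re-derive both accuracy figures. The sketch below does that arithmetic in Python. It assumes a fourth "both correct" outcome, which the list does not show and which is taken here as the remainder of the evaluated cases. The function and variable names are illustrative and not part of imLayer.

```python
# Counts taken from the Correctness figures above.
TOTAL_CASES = 2521
PACKET_ONLY_WINS = 1650
RAW_ONLY_WINS = 0
BOTH_WRONG = 0


def derive_accuracies(total, packet_only, raw_only, both_wrong):
    """Return (raw_accuracy, packet_accuracy) from the outcome counts.

    Assumption: every case falls into exactly one of four outcomes, so
    "both correct" is whatever remains after the three listed counts.
    """
    both_correct = total - packet_only - raw_only - both_wrong
    raw_correct = both_correct + raw_only
    packet_correct = both_correct + packet_only
    return raw_correct / total, packet_correct / total


raw_acc, packet_acc = derive_accuracies(
    TOTAL_CASES, PACKET_ONLY_WINS, RAW_ONLY_WINS, BOTH_WRONG
)
print(f"raw accuracy:    {raw_acc:.4f}")     # 0.3455
print(f"packet accuracy: {packet_acc:.4f}")  # 1.0000
```

Under that assumption, 871 cases were answered correctly by both paths, which matches the reported raw accuracy of 0.3455.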

Runtime economics

Raw input tokens: 319.4597

Packet input tokens: 118.0079

Input token reduction: 61.25%

Raw output tokens: 6.8401

Packet output tokens: 2.3328

Average raw latency: 715.95 ms

Average packet latency: 640.032 ms

Average latency delta: -75.918 ms

Median raw latency: 638.228 ms

Median packet latency: 551.257 ms

Median latency delta: -86.971 ms

p95 raw latency: 1119.712 ms

p95 packet latency: 1090.07 ms

p95 latency delta: -29.642 ms
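
The latency deltas above can be recomputed directly from the raw and packet figures, as in the Python sketch below. It also computes a token reduction from the two listed input token figures, assuming they are per-case averages (the source does not say so). That ratio comes out near 63.06%, not the reported 61.25%. One possible reading is that the reported figure averages the reduction per case rather than dividing the two averages. That is a hypothesis only, and all names in the code are illustrative.

```python
# Reported figures from the Runtime economics list above.
# Assumption: the token figures are per-case averages.
RAW_INPUT_TOKENS = 319.4597
PACKET_INPUT_TOKENS = 118.0079

# (raw, packet) latency in milliseconds
LATENCY_MS = {
    "average": (715.95, 640.032),
    "median": (638.228, 551.257),
    "p95": (1119.712, 1090.07),
}


def latency_deltas(latency_ms):
    """Packet minus raw, in ms; a negative value means the packet path was faster."""
    return {
        name: round(packet - raw, 3)
        for name, (raw, packet) in latency_ms.items()
    }


def reduction_of_means(raw_tokens, packet_tokens):
    """Percentage reduction computed from two aggregate token figures."""
    return 100.0 * (1.0 - packet_tokens / raw_tokens)


deltas = latency_deltas(LATENCY_MS)
token_reduction = reduction_of_means(RAW_INPUT_TOKENS, PACKET_INPUT_TOKENS)

for name, delta in deltas.items():
    print(f"{name} latency delta: {delta} ms")
print(f"reduction from listed token figures: {token_reduction:.2f}%")
```

The three deltas reproduce the listed values (-75.918, -86.971, and -29.642 ms).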

Methodology

What was measured

The evaluation measured next-action correctness, input/output token usage, and latency on a bounded workflow decision surface.

What raw means

Raw means the downstream model received raw workflow history directly, without imLayer compression into bounded decision state.

What packet means

Packet means the downstream model received the compact decision-ready state produced by imLayer instead of full raw history.

What expected means

Expected refers to the bounded target next action used as the evaluation reference for correctness comparison.

What surface was evaluated

The surface was a controlled live-model workflow slice designed to test runtime decision quality, not a broad production environment.
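
The definitions above imply a per-case comparison: the raw answer and the packet answer are each checked against the expected next action, and the case is placed in one of four outcomes. The Python sketch below illustrates that scoring. It assumes correctness is an exact match on an action label, and the action names in the sample are made up. The evaluation's real matching rule and action set are not described in this document.

```python
from collections import Counter


def classify_case(raw_action, packet_action, expected_action):
    """Place one evaluated case into an outcome bucket.

    Assumption: an answer is correct when it equals the expected
    next action exactly.
    """
    raw_ok = raw_action == expected_action
    packet_ok = packet_action == expected_action
    if raw_ok and packet_ok:
        return "both_correct"
    if packet_ok:
        return "packet_only_win"
    if raw_ok:
        return "raw_only_win"
    return "both_wrong"


def tally(cases):
    """Count outcomes over an iterable of (raw, packet, expected) tuples."""
    return Counter(classify_case(*case) for case in cases)


# Hypothetical cases: (raw answer, packet answer, expected next action)
sample_cases = [
    ("escalate", "refund", "refund"),
    ("refund", "refund", "refund"),
    ("wait", "close", "close"),
]
counts = tally(sample_cases)
print(dict(counts))  # {'packet_only_win': 2, 'both_correct': 1}
```

Raw accuracy is then the share of cases in "both_correct" or "raw_only_win", and packet accuracy is the share in "both_correct" or "packet_only_win".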

Qualification

Internal runtime evidence

Controlled evaluated surface

Not customer proof

Not production proof

Not blind external adjudication

Not broad prevalence proof

Historical benchmark archive

The archived internal benchmarks below show how the approach progressed over time. The current runtime evidence above remains the primary evidence.

Historical internal snapshot

Heldout benchmark snapshot

A heldout benchmark snapshot in which the packet result (0.6381) was materially stronger than the structured non-memory baseline (0.1590) on a bounded internal evaluation surface.

Historical. Internal. Not current primary evidence.

Total examples: 573

Heldout subset: 347

Packet vs baseline: 0.6381 vs 0.1590

Delta: +0.4791
