Observability

Membrane exposes behavioral metrics via GetMetrics and ships a comprehensive evaluation suite covering retrieval quality, revision semantics, decay curves, trust gating, and vector-aware recall.

GetMetrics

GetMetrics returns a point-in-time *metrics.Snapshot collected by scanning all records in the store.

snap, err := m.GetMetrics(ctx)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Total: %d, Usefulness: %.2f\n", snap.TotalRecords, snap.RetrievalUsefulness)

Example snapshot

{
  "collected_at": "2026-02-05T14:23:10Z",
  "total_records": 160,
  "records_by_type": {
    "episodic": 80,
    "entity": 18,
    "semantic": 35,
    "competence": 15,
    "plan_graph": 7,
    "working": 5
  },
  "avg_salience": 0.62,
  "avg_confidence": 0.78,
  "salience_distribution": {
    "0.0-0.2": 14,
    "0.2-0.4": 22,
    "0.4-0.6": 34,
    "0.6-0.8": 50,
    "0.8-1.0": 40
  },
  "active_records": 148,
  "pinned_records": 3,
  "total_audit_entries": 890,
  "memory_growth_rate": 0.15,
  "retrieval_usefulness": 0.42,
  "competence_success_rate": 0.85,
  "plan_reuse_frequency": 2.3,
  "revision_rate": 0.08
}

Metrics reference

| Metric | Type | Description |
| --- | --- | --- |
| collected_at | string | RFC 3339 timestamp when the snapshot was collected |
| total_records | int | Total number of records in the store |
| records_by_type | map[string]int | Count of records per memory type |
| avg_salience | float64 | Mean salience across all records |
| avg_confidence | float64 | Mean confidence across all records |
| salience_distribution | map[string]int | Record counts in 0.2-wide salience buckets |
| active_records | int | Records with salience > 0 |
| pinned_records | int | Records with lifecycle.pinned = true |
| total_audit_entries | int | Total audit log entries across all records |
| memory_growth_rate | float64 | Fraction of records created in the last 24 hours |
| retrieval_usefulness | float64 | Ratio of reinforce audit actions to total audit entries |
| competence_success_rate | float64 | Average SuccessRate across all competence records |
| plan_reuse_frequency | float64 | Average ExecutionCount across all plan_graph records |
| revision_rate | float64 | Fraction of audit entries that are supersede, fork, or merge operations |

Interpreting key metrics

retrieval_usefulness — A high value (near 1.0) means retrieved records are frequently reinforced after use, indicating good retrieval quality. A low value may indicate that retrieval is surfacing records that aren't actually useful.

memory_growth_rate — Tracks how fast the substrate is growing. A rate near 1.0 means almost all records were created in the last 24 hours, which may indicate runaway ingestion or a newly started agent.

revision_rate — Measures how often the knowledge base is being revised. A very low rate may indicate the agent isn't learning from feedback; a very high rate may indicate instability in the knowledge base.


Evaluation suite

The eval suite covers functional correctness across all major subsystems.

Run everything

make eval-all

This runs all Go-based eval tests and the vector end-to-end evaluation script.

Targeted capability evals

make eval-typed          # Memory type handling
make eval-revision       # Revision semantics
make eval-decay          # Decay curves and pruning
make eval-trust          # Trust-gated retrieval
make eval-competence     # Competence learning
make eval-plan           # Plan graph operations
make eval-consolidation  # Episodic consolidation
make eval-metrics        # Observability metrics
make eval-invariants     # System invariants
make eval-grpc           # gRPC endpoint coverage

Each target maps to a go test run against the ./tests package with a specific -run filter.


Recall regression tests

The recall regression test validates that the retrieval layer returns expected records given a known corpus:

go test ./tests -run TestRetrievalRecallAtK

This test checks recall@k for a fixed set of records and queries, failing if recall drops below the configured threshold. Use it as a canary to detect regressions in retrieval ordering or trust filtering.


Vector end-to-end metrics

The vector E2E evaluation requires Python and measures recall, precision, MRR, and NDCG over a synthetic corpus with pgvector-backed retrieval.

Install Python dependencies

python3 -m pip install -r tools/eval/requirements.txt

Run the eval

make eval

This invokes tools/eval/run.sh, which spins up the eval corpus and measures retrieval metrics at multiple k values.

Check the results

The script reports recall@k, precision@k, MRR@k, and NDCG@k and fails with a non-zero exit code if any metric falls below its threshold.

Metric definitions

| Metric | Description |
| --- | --- |
| recall@k | Fraction of relevant records found in the top-k results |
| precision@k | Fraction of top-k results that are relevant |
| MRR@k | Mean reciprocal rank of the first relevant result in top-k |
| NDCG@k | Normalized discounted cumulative gain; measures ranking quality |

Environment variable overrides

Thresholds can be overridden via environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| MEMBRANE_EVAL_MIN_RECALL | 0.90 | Minimum acceptable recall@k |
| MEMBRANE_EVAL_MIN_PRECISION | 0.20 | Minimum acceptable precision@k |
| MEMBRANE_EVAL_MIN_MRR | 0.90 | Minimum acceptable MRR@k |
| MEMBRANE_EVAL_MIN_NDCG | 0.90 | Minimum acceptable NDCG@k |

Latest benchmark results

Local run (Feb 5, 2026):

| Suite | Result |
| --- | --- |
| Unit/Integration | 22 top-level eval tests + 7 subtests = 29 test cases, 0 failures (~0.40s) |
| Vector E2E | 35 records, 18 queries — recall@k 1.000, precision@k 0.267, MRR@k 0.956, NDCG@k 0.955 |
Note

End-to-end recall depends on ingestion quality, trust filters, and reinforcement behavior. Treat recall tests as scenario-level regression guards rather than universal benchmarks.