Making Sense of Unstructured Data in Hospitals

Kialan Pillay*, Dylan Sandfelder, Nikita Trojanskis, Calum Braham, Edgars Labsvirs
Amorphous AI, London, the United Kingdom
* Corresponding author: kialan@amorphous.health
Summary

Striata turns hospital managers and clinicians into a data team of one. It answers high-complexity questions about pathways, costs, outcomes, revenue, and operations in minutes — powered by a Data Engine that standardizes unstructured clinical notes into analysis-ready data, enabling large-scale observational research and regulatory-grade RWE (Real World Evidence).

Benchmarks show 97.8% recall on clinical entity extraction from unstructured EHRs and 100% recall on lab (name, value, unit) triplets. Striata completes 36/36 end-to-end analytics benchmark questions in ~3–4 minutes per question.

For a hospital, this means the notes you already have can finally answer the questions you’ve been carrying for years — so you can improve workflows and redesign patient pathways that have been stuck for decades.

Hospitals run on decisions: which pathways are working, where the bottlenecks are, what care actually costs, and which interventions improve outcomes. Yet in most systems, answering even “simple” operational or clinical questions still requires weeks of analyst time, multiple IT tickets, and brittle dashboards that break the moment the question changes.

Amorphous AI is building the analytics layer hospitals have been missing: software that can ingest messy, real-world healthcare data and reliably turn it into answers decision-makers can act on.

Striata, our AI data analyst, turns every hospital manager and clinician into a data team of one.

Any question about patient pathways, costs, outcomes, revenue, or operational performance that used to take weeks can now be answered in minutes.

Concretely, here are the kinds of questions Striata is designed to answer — and who inside the hospital typically needs them:

This matters because hospitals already have the data — they just can’t use it fast enough. A typical hospital generates around 50 petabytes of data annually, yet estimates suggest 97% of this valuable information remains unused.

Consequently, only a small fraction of the data drives evidence-based decisions in healthcare facilities. Over time, this compounds into avoidable cost, preventable harm, and missed opportunity. There are two main challenges that make it difficult to use health data to drive decision-making:

  1. More than 80% of health data is unstructured.
  2. Healthcare providers lack resources to analyze this data.

Amorphous AI offers clients access to two key technological components:

  1. A data engine that ingests all health data — regardless of format or structure — and standardizes it using established clinical terminologies like SNOMED, LOINC, and ICD-10.
  2. An AI data analyst that sits atop millions of rows of historical and real-time data, extracting insights to drive managerial decisions across diverse use cases including cost reduction, improved patient outcomes, regulatory compliance, and process efficiency.

In this white paper, we explain both technological components and our early results from benchmarking their accuracy and consistency.

EMR Data (raw input) → AI Data Engine (standardization) → Data Lake (structured data) → Striata (AI data analyst) → Insights (for managers)

Figure 1: High-level overview of the Amorphous Data Engine and Striata pipeline.

Next, we explain how Striata works.

Striata: AI Data Analyst for Hospitals

Once clinical notes have been converted into clean, standardized data, the main challenge — and opportunity — is turning that data into answers to high-complexity questions. Striata is built to do this reliably: translating ambiguous, cross-cutting operational and clinical questions into auditable analyses over millions of rows of historical and real-time hospital data.

Striata is our deep analytics platform built for this task.

How Striata Works

Think of Striata as a team of data scientists with clinical training. It is composed of specialized agents for planning, exploration, code generation, validation, visualization, and communication. A central orchestrator coordinates this workflow, iteratively selecting the next agent based on the evolving state of the analysis until the task is complete.

Orchestrator: chooses the next agent from state + memory.
Planning (strategy): Clarify objective, constraints, and approach.
Data Exploration (data slice): Select tables, generate SQL, extract data sample.
Code Generation (python): Write analysis code for given task.
Validation (verify): Check consistency of results.
Visualization (plots): Generate visuals to improve interpretability of results.
Reporter (report): Consolidate the findings into a cohesive report.
Patient Journey Graph Reconstruction (graph): Analyse bottlenecks, treatment pathways, inconsistencies.
Debugging

Figure 2: Striata behaves like an orchestrated “agent swarm.” The orchestrator selects the next agent based on current state.

Route 1: planning → patient_journey_reconstruction → code_generation → debugging → code_regeneration → bottleneck_analysis → report_preparation
Route 2: planning → web_search → data_exploration → visualization → report_preparation

Figure 3: Two plausible routes for different tasks. Striata’s orchestrator will take different paths depending on what data and type of analysis is required.
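
The orchestration loop described above can be sketched in a few lines. This is a minimal illustration, not Striata's implementation: the `State` class, the four agent functions, and the rule-based `choose_next` policy are hypothetical stand-ins, and a real orchestrator would presumably choose agents from much richer state and memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class State:
    """Hypothetical shared state that agents read from and write to."""
    question: str
    artifacts: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


# Each toy agent records exactly one artifact in the shared state.
def planning(state: State) -> None:
    state.artifacts["strategy"] = f"plan for: {state.question}"


def data_exploration(state: State) -> None:
    state.artifacts["data_slice"] = "rows selected via SQL"


def code_generation(state: State) -> None:
    state.artifacts["analysis"] = "python analysis output"


def report_preparation(state: State) -> None:
    state.artifacts["report"] = "final written summary"


AGENTS: Dict[str, Callable[[State], None]] = {
    "planning": planning,
    "data_exploration": data_exploration,
    "code_generation": code_generation,
    "report_preparation": report_preparation,
}


def choose_next(state: State) -> Optional[str]:
    """Toy policy: pick the first agent whose artifact is still missing."""
    required = [
        ("strategy", "planning"),
        ("data_slice", "data_exploration"),
        ("analysis", "code_generation"),
        ("report", "report_preparation"),
    ]
    for artifact, agent in required:
        if artifact not in state.artifacts:
            return agent
    return None  # every artifact exists, so the task is complete


def run(question: str, max_steps: int = 10) -> State:
    """Repeatedly select and run the next agent until nothing is left to do."""
    state = State(question=question)
    for _ in range(max_steps):
        agent = choose_next(state)
        if agent is None:
            break
        AGENTS[agent](state)
        state.history.append(agent)
    return state


result = run("What is the mean length of stay for pneumonia admissions?")
print(" -> ".join(result.history))
```

The `max_steps` cap is a design choice worth keeping in any such loop: it bounds the run even if the selection policy never reports completion.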

Step 1: Question Interpretation

Before Striata touches any data, it nails down what the question really asks: the cohort of patients, the time window, the metric analysed, and what counts as basic units of analysis (per patient, per encounter, per observation, etc.).

User prompt: "What is the association between proton pump inhibitor use and C. difficile infection?"
The prompt contains under-specified umbrella clinical terms. Graph traversal of clinical ontologies resolves umbrella terms to specific concepts via descendant-of relationships:
Proton Pump Inhibitor → Omeprazole, Esomeprazole, Pantoprazole, Lansoprazole

Resolved Clinical Codes
Code | Concept | Ontology
186431008 | Clostridioides difficile infection | SNOMED CT
7646 | Omeprazole | RxNorm
283742 | Esomeprazole | RxNorm
40790 | Pantoprazole | RxNorm
17128 | Lansoprazole | RxNorm

Figure 4: Umbrella clinical terms in the user's prompt are resolved to precise codes via graph traversal of standard medical ontologies.
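
As a rough sketch of this resolution step, the traversal can be written as a breadth-first walk over descendant-of edges. The `CHILDREN` graph below is a hand-written toy (a real system would load the hierarchy from a terminology release), the codes are those listed in Figure 4, and the function name `resolve` is an assumption made for illustration.

```python
from collections import deque
from typing import Dict, List, Tuple

# Toy descendant-of edges (parent -> children). Hand-written for illustration.
CHILDREN: Dict[str, List[str]] = {
    "Proton Pump Inhibitor": [
        "Omeprazole", "Esomeprazole", "Pantoprazole", "Lansoprazole",
    ],
}

# Concept -> (code, ontology), copied from Figure 4.
CODES: Dict[str, Tuple[str, str]] = {
    "Clostridioides difficile infection": ("186431008", "SNOMED CT"),
    "Omeprazole": ("7646", "RxNorm"),
    "Esomeprazole": ("283742", "RxNorm"),
    "Pantoprazole": ("40790", "RxNorm"),
    "Lansoprazole": ("17128", "RxNorm"),
}


def resolve(term: str) -> List[str]:
    """Return every coded concept at or below `term`, in breadth-first order."""
    resolved: List[str] = []
    seen = {term}
    queue = deque([term])
    while queue:
        concept = queue.popleft()
        if concept in CODES:
            resolved.append(concept)
        for child in CHILDREN.get(concept, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return resolved


for concept in resolve("Proton Pump Inhibitor"):
    code, ontology = CODES[concept]
    print(code, concept, ontology)
```

A term that is already specific, such as "Clostridioides difficile infection", resolves to itself, so the same function handles both umbrella and leaf concepts.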

Step 2: Data Selection (Tables → SQL → Reduced Dataframe)

Striata works against real industry-grade schemas. A key design choice is to reduce first: generate queries that extract only the rows and columns needed to answer the question, producing a salient dataset that is fast to analyze.

A typical FHIR database can hold millions of rows across dozens of tables (FHIR defines 126+ resource types) even for a few thousand patients. Identifying the right data is akin to finding a needle in a haystack.

Table | Contents
patients | Demographics
encounters | Admissions, timing
conditions | Diagnoses (SNOMED)
observations | Labs/vitals (LOINC)
procedures | Procedures (SNOMED)
medications | Meds (RxNorm)
immunizations | Vaccines (SNOMED)
allergies | Allergy history
organizations | Facilities / providers

Figure 5: Striata identifies the tables needed for the current question, then generates SQL queries to extract a dataset for analysis.
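
The "reduce first" idea can be illustrated with an in-memory SQLite database. Everything here is an assumption made for the sketch: the two-table schema, the column names, the sample rows, and the placeholder code `'TROPONIN'` (not a real LOINC code). The point is only that the query returns the few rows and columns the question needs rather than whole tables.

```python
import sqlite3

# Build a tiny in-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id TEXT PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE observations (
        patient_id TEXT, lab_code TEXT, value REAL, unit TEXT
    );
    INSERT INTO patients VALUES ('p1', 1950), ('p2', 1982), ('p3', 1975);
    INSERT INTO observations VALUES
        ('p1', 'TROPONIN', 0.8, 'ng/mL'),
        ('p1', 'SODIUM', 139.0, 'mmol/L'),
        ('p2', 'TROPONIN', 0.1, 'ng/mL'),
        ('p3', 'SODIUM', 141.0, 'mmol/L');
""")

# Reduce first: select only the columns and rows relevant to the question.
query = """
    SELECT o.patient_id, o.value AS troponin_value
    FROM observations AS o
    JOIN patients AS p ON p.patient_id = o.patient_id
    WHERE o.lab_code = ?
    ORDER BY o.patient_id
"""
reduced = conn.execute(query, ("TROPONIN",)).fetchall()
print(reduced)
```

The reduced result can then be loaded into a dataframe for analysis, which keeps the downstream code small and fast regardless of how large the source tables are.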

Step 3: Statistical Analysis (Generate Code → Execute → Debug → Summarise)

Once the reduced dataframe exists, Striata generates analysis code focused on the question and executes it in a controlled environment. The output is a structured report: an Answer Summary (output of the analysis) and a Method Summary (how Striata computed it).

import pandas as pd

# df is the reduced analysis dataframe
df['troponin_value'] = pd.to_numeric(df['troponin_value'], errors='coerce')
elevated = df[df['troponin_value'] >= 0.5]
patients = elevated['patient_id'].nunique()
print(f"Patients with elevated troponin: {patients}")
Example Output (Benchmark Run)
Patients with elevated troponin: 777
Dataset: Hospital discharge summaries • Question: Troponin elevation analysis
The full run writes artifacts: SQL queries, Python execution logs, and summary markdown documents.

Figure 6: Striata produces executable analysis code and a structured written summary, backed by saved artifacts.

Data Engine: Turning Clinical Notes into Reliable Data

For Striata to be useful, the underlying data lake it runs on needs to contain sufficient clinical and operational signal. Much of that signal is locked in unstructured (or semi-structured) EHRs. These documents are rich in detail but difficult to search, analyze, and report at scale.

The data engine exists to unlock it: we convert that information into standardized, interoperable data without losing clinical nuance.

Consider an EHR note that says: "Patient has been recommended to discontinue metformin treatment after 3 weeks."

Capturing that in structured form is hard: you need to capture "metformin" as a medication, the recommending tone of the statement, discontinuation as an action, and the duration.

Doing it once is easy; doing it reliably over millions of documents with varying formats, abbreviations, languages and typos is very hard.

The engine is built to be robust to imperfect inputs: to capture meaning first and to standardize second. That design choice protects against data loss and makes the output useful for clinical analytics, operations, research, and downstream AI agents.

Raw Input (PDFs, scans, notes) → Stage 1: Extracting Clinically Relevant Entities (polymorphic schema, 14 types) → Stage 2: Clinical Coding (multi-ontology vector search) → Structured Output (FHIR R4 + OMOP CDM)

Figure 7: High-level data flow from unstructured input to standardized output.

How the Data Engine Works (Two-Stage Pipeline)

Stage 1: Document Processing and Entity Extraction

We begin by normalizing source documents (PDFs, scans, text), extracting basic metadata (patient, note, dates), and detecting language. The engine builds a note-specific abbreviation glossary so short forms like "HTN" (i.e. Hypertension) can be resolved safely.

Agent 1 (Pre-Processing): Normalize text layout, extract administrative metadata (patient_id, dates), and detect language.
Agent 2 (Glossary Builder): Create note-specific abbreviation glossary to disambiguate terms like "MS" (Mitral Stenosis vs Multiple Sclerosis).
Agent 3 (Entity Extraction): Extracts 14 entity types using a proprietary data model and "capture first" strategy.
Agent 4 (Review Pass): Safety net scan for missed items (Labs, Normals, History) to ensure high recall.
Agent 5 (Normalization): Expand abbreviations using the glossary and translate terms to English.
Final Output (Parquet Dataset): Validated, structured data ready for Stage 2 analysis.
Example record: { "entity_type": "Condition", "surface_text": "Hypertension", "temporal_context": "CHRONIC", "certainty": "CONFIRMED", "negated": false }

Figure 8: Detailed agent flow for Stage 1 processing.
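
To make the glossary step concrete, here is a minimal sketch of expanding abbreviations from a note-specific glossary. The function name `expand_abbreviations` and the hand-written glossary are assumptions; in the pipeline described above, the glossary is built per note by a dedicated agent rather than hard-coded.

```python
import re
from typing import Dict


def expand_abbreviations(text: str, glossary: Dict[str, str]) -> str:
    """Replace whole-word abbreviations with their note-specific expansions."""
    if not glossary:
        return text
    # Longest keys first so overlapping abbreviations resolve predictably.
    keys = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: glossary[match.group(1)], text)


# In this hypothetical note, "MS" means mitral stenosis (not multiple sclerosis).
glossary = {"HTN": "hypertension", "MS": "mitral stenosis"}
note = "History of HTN. MS noted on echo."
expanded = expand_abbreviations(note, glossary)
print(expanded)
```

Because the glossary is scoped to a single note, the same short form can expand differently in another document, which is the disambiguation benefit the Glossary Builder is meant to provide.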

The engine then uses a large language model to extract clinical entities (conditions, medications, labs, procedures, etc.) according to a proprietary data model with 14 focused entity types. This "capture first" step prioritizes completeness over early standardization.

Sample of the Proprietary Data Model
I. Common Attributes (Inherited by All Entities)

Every extracted entity—regardless of type—inherits a core set of attributes to ensure consistent metadata, temporal grounding, and provenance tracking.

surface_text: str (Exact snippet from source text)
normalized_phrase: str (Canonical English term)
temporal_context: Enum: ACTIVE | RECENT | CHRONIC | HISTORICAL | PLANNED
certainty: Enum: CONFIRMED | SUSPECTED | RULED_OUT
negated: bool
clinical_relationship: Enum: PRIMARY | SECONDARY | RISK_FACTOR | COMPLICATION
II. Entity Linkages (The Knowledge Graph)

Entities are not isolated; they are linked to form a semantic graph of clinical concepts.

body_site_entity_ref: Links finding to anatomy (e.g., "Rash" → "Left Arm")
caused_by_entity_ref: Causal link (e.g., "Anemia" → "Bleeding")
reason_for_entity_ref: Justification link (e.g., "Ibuprofen" → "Pain")
evidence_entity_refs: Supportive evidence (e.g., "Pneumonia" → ["Cough", "Fever"])
III. Specialized Entity Types (Example: Medication)

Each of the 14 entity types extends the common schema with domain-specific attributes. For example, the Medication entity captures detailed prescribing information:

Medication Entity Schema
medication_status: Enum: PRESCRIBED | ADMINISTERED | DISCONTINUED
quantity_value: float | str (Strength/Dosage)
quantity_unit: str (e.g., "mg", "tablets")
medication_form: str (e.g., "patch", "suspension")
administration_route: str (e.g., "oral", "intravenous")

... plus 13 other specialized schemas for Condition, Procedure, Observation, Immunization, Allergy/intolerance, Device, Social history, Family history, Specimen, BodyStructure, ClinicalEvent, and more.
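
One way to picture the "common attributes plus specialized schema" design is with inherited dataclasses. The sketch below mirrors a subset of the field names and enum values listed above. The class layout itself, the use of dataclasses, and the example values chosen for the medication are assumptions, not the actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class TemporalContext(Enum):
    ACTIVE = "ACTIVE"
    RECENT = "RECENT"
    CHRONIC = "CHRONIC"
    HISTORICAL = "HISTORICAL"
    PLANNED = "PLANNED"


class Certainty(Enum):
    CONFIRMED = "CONFIRMED"
    SUSPECTED = "SUSPECTED"
    RULED_OUT = "RULED_OUT"


class MedicationStatus(Enum):
    PRESCRIBED = "PRESCRIBED"
    ADMINISTERED = "ADMINISTERED"
    DISCONTINUED = "DISCONTINUED"


@dataclass
class Entity:
    """Common attributes inherited by every entity type (subset shown)."""
    surface_text: str
    normalized_phrase: str
    temporal_context: TemporalContext
    certainty: Certainty
    negated: bool = False


@dataclass
class Medication(Entity):
    """Specialized schema: common attributes plus prescribing details."""
    medication_status: Optional[MedicationStatus] = None
    quantity_value: Union[float, str, None] = None
    quantity_unit: Optional[str] = None
    medication_form: Optional[str] = None
    administration_route: Optional[str] = None


# Hypothetical encoding of "Currently on Lisinopril 10mg daily".
med = Medication(
    surface_text="Lisinopril 10mg daily",
    normalized_phrase="lisinopril",
    temporal_context=TemporalContext.ACTIVE,
    certainty=Certainty.CONFIRMED,
    medication_status=MedicationStatus.PRESCRIBED,
    quantity_value=10.0,
    quantity_unit="mg",
)
print(med.medication_status.value, med.quantity_value, med.quantity_unit)
```

Keeping the specialized fields optional fits the "capture first" strategy: an entity can be recorded even when the note does not state a dose, form, or route.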

Stage 2: Multi-Ontology Clinical Coding

During Stage 2, the engine matches each entity against a pre-generated text embedding of the appropriate ontology to retrieve candidate codes, before running an LLM-based quality control step to ensure the most clinically relevant code(s) are selected.

Entity Type → Ontology Routing
Entity types | Ontology
Procedure, Device, Allergy/intolerance, Immunization, Family history, Specimen, BodyStructure, ClinicalEvent | SNOMED CT
Observation, Social history | LOINC
Condition | ICD-10
Medication, MedicationClass | RxNorm
Condition (rare_disease_flag) | ORDO
Falls back to SNOMED CT if no match.

Figure 9: Stage 2 ontology routing logic. Each clinical entity type extracted in Stage 1 is routed to the appropriate terminology system.

For each entity, the engine generates embeddings, retrieves candidate codes from a vector store, then runs an LLM-based quality control routine to select the best code(s).
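
The routing and candidate-retrieval steps can be sketched with a toy vector store and cosine similarity. The routing table follows a subset of Figure 9, but the three-dimensional embeddings are made up for illustration, the function names are hypothetical, and the LLM-based quality control step is not shown.

```python
import math
from typing import Dict, List, Tuple

# Entity type -> ontology, following a subset of Figure 9.
ROUTING: Dict[str, str] = {
    "Procedure": "SNOMED CT",
    "Observation": "LOINC",
    "Condition": "ICD-10",
    "Medication": "RxNorm",
}

# Toy vector store: ontology -> [(code, label, embedding)].
# The embeddings are invented numbers, not output of a real model.
VECTOR_STORE: Dict[str, List[Tuple[str, str, List[float]]]] = {
    "RxNorm": [
        ("7646", "Omeprazole", [0.9, 0.1, 0.0]),
        ("40790", "Pantoprazole", [0.1, 0.9, 0.0]),
        ("17128", "Lansoprazole", [0.0, 0.2, 0.9]),
    ],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_candidates(
    entity_type: str, query_embedding: List[float], k: int = 2
) -> List[Tuple[str, str]]:
    """Route the entity to its ontology, then return the k nearest codes."""
    ontology = ROUTING[entity_type]
    scored = [
        (cosine(query_embedding, embedding), code, label)
        for code, label, embedding in VECTOR_STORE.get(ontology, [])
    ]
    scored.sort(reverse=True)
    return [(code, label) for _, code, label in scored[:k]]


candidates = retrieve_candidates("Medication", [0.8, 0.2, 0.1])
print(candidates)
```

Retrieving several candidates rather than a single nearest neighbour is what leaves room for a later selection step to pick the most clinically relevant code.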

Benchmarking: Ensuring Reliability at Scale

Our core dogma at Amorphous AI is benchmarking-driven development. Testing AI in healthcare for accuracy and consistency is a major bottleneck — demos on five examples are easy; reliable systems at scale are hard. We do not ship improvements unless we can quantify impact on accuracy and latency. Below, we show some of the results from benchmarking our data engine and Striata.

The Synthetic Data Generation Pipeline

To rigorously test the Data Engine, we built a dedicated benchmarking pipeline. The fundamental challenge in healthcare NLP is the absence of labelled ground truth — real clinical notes cannot easily be annotated at scale. Our in-house pipeline generates realistic synthetic data where we control the ground truth from the start. Reach out if you would like to give it a try.

The generation process has three distinct phases (exemplified for an experiment where we generated 500 synthetic EHRs):

Phase 1 (Entity Bank & Sampling): A curated bank of 1000+ entities per category (condition, medication, etc.) ensures clinically diverse coverage.
Phase 2 (Narrative Generation, GPT-5.2): High-fidelity model weaves entities into realistic clinical prose.
Phase 3 (Text Modification): Simulate real-world challenges with abbreviated forms and typo injection.
Output (Ground Truth Dataset): 500 test documents total, each with known ground truth.

Figure 10: The synthetic data generation pipeline creates labeled test data with known ground truth.

Metrics and Evaluation Strategy

We evaluate extraction quality using Recall (the fraction of ground truth entities found) and Intersection over Union (IoU), the character-level overlap between predicted and ground truth spans. We use a relaxed matching strategy with a ±5 character tolerance at span boundaries.

Since all extracted entities are subsequently mapped to standardized clinical codes (SNOMED CT, LOINC, ICD-10, RxNorm, etc.) in Stage 2, minor span boundary differences have no clinical consequence—e.g., "Type 2 diabetes" and "Type 2 diabetes mellitus" both map to the same SNOMED CT concept. This approach better reflects what matters in practice: did we correctly identify the clinical concept?
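
The two metrics can be sketched as follows. The matching rule used here (a predicted span counts as a match when both its start and end fall within the tolerance of the ground truth boundaries) is one plausible reading of "relaxed matching" and is an assumption, as are the function names and the toy spans.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def iou(a: Span, b: Span) -> float:
    """Character-level Intersection over Union of two spans."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0


def is_match(truth: Span, pred: Span, tolerance: int = 5) -> bool:
    """Assumed rule: both boundaries must lie within the tolerance."""
    return (abs(truth[0] - pred[0]) <= tolerance
            and abs(truth[1] - pred[1]) <= tolerance)


def evaluate(truths: List[Span], preds: List[Span],
             tolerance: int = 5) -> Tuple[float, float]:
    """Return (recall, mean IoU over matched pairs)."""
    unused = list(preds)
    matched_ious: List[float] = []
    for truth in truths:
        hit = next((p for p in unused if is_match(truth, p, tolerance)), None)
        if hit is not None:
            unused.remove(hit)  # each prediction can match only once
            matched_ious.append(iou(truth, hit))
    recall = len(matched_ious) / len(truths) if truths else 0.0
    mean_iou = sum(matched_ious) / len(matched_ious) if matched_ious else 0.0
    return recall, mean_iou


# Three ground truth spans; the third is missed by the toy predictions.
truths = [(0, 15), (30, 40), (50, 60)]
preds = [(0, 18), (30, 40)]
recall, mean_iou = evaluate(truths, preds)
print(f"recall={recall:.3f} mean_iou={mean_iou:.3f}")
```

Note that mean IoU is computed only over matched entities, which is why it can stay high even when recall drops: it measures boundary precision, not coverage.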

Visualizing Performance: The OVA View

Our benchmarking dashboard includes "OVA" (Observed vs. Actual) visualizations that allow developers to inspect individual documents. Ground truth entities are shown with colored underlines; extracted entities appear as highlighted overlays. This makes it immediately obvious where the engine succeeds and fails.

Patient presents with (Condition) shortness of breath and (Missed) mild chest pain.

History of (Condition) HTN. Currently on (Medication) Lisinopril 10mg daily.

Labs show (Observation) elevated troponin.

Figure 11: OVA visualization example. Green highlight = True Positive (correctly extracted). Red underline only = False Negative (missed by the engine).

Benchmark Results: Dataset Overview

Our most comprehensive benchmark used 100 synthetic EHRs with approximately 2,300 ground truth entities, tested across 5 text variants. The table below summarizes the overall performance.

Text Variant | Recall | Mean IoU
Original (Clean) | 97.8% | 98.2%
Abbreviated (Condensed) | 95.0% | 95.8%
Typo - Low | 86.4% | 96.2%
Typo - Medium | 78.9% | 95.5%
Typo - High | 71.1% | 96.0%

Figure 12: Summary metrics across all text variants.

Key Observations from the Summary

For a hospital, this means the engine is accurate enough for most high‑value analytics use cases (cohorting, registries, reporting) — and when it isn’t (OCR/typos), we can quantify it.

Per-Entity Type Performance

Aggregate metrics hide important variation across entity types. Some categories are inherently easier to extract than others. The chart below shows recall by entity type on unmodified text.

Entity Type | Recall
Allergy/intolerance | 100%
Condition | 99.1%
Medication | 99.1%
Immunization | 96.6%
Device | 95.9%
Observation | 95.7%
Social history | 87.6%
Procedure | 86.4%
Specimen | 84.4%
Family history | 73.6%

Figure 13: Recall by entity type (unmodified text). Core clinical entities achieve >99% recall.

Why Some Entity Types Are Harder

Impact of Text Degradation

Real clinical notes contain typos, abbreviations, and OCR errors. We systematically tested how noise affects extraction accuracy. The following table shows per-entity recall across noise levels.

Entity Type | Original | Condensed | Typo Low | Typo Med | Typo High
Allergy/intolerance | 100% | 100% | 86.6% | 73.2% | 69.3%
Condition | 99.1% | 89.7% | 79.7% | 69.8% | 66.5%
Medication | 99.1% | 99.7% | 88.7% | 79.6% | 76.5%
Observation | 95.7% | 91.5% | 80.7% | 71.9% | 63.1%
Procedure | 86.4% | 72.1% | 64.3% | 61.6% | 55.5%
Family history | 73.6% | 57.8% | 63.9% | 45.8% | 42.3%
Social history | 87.6% | 72.5% | 57.1% | 51.0% | 34.7%

Figure 14: Recall degradation by entity type as text quality decreases. Medications are most resilient; social history degrades fastest.

Key Insights from Degradation Analysis

Span Accuracy: IoU Analysis

When the model finds an entity, how precisely does it capture the boundaries? We measure this with Intersection over Union (IoU)—the character-level overlap between predicted and ground truth spans.

IoU Scores by Entity Type (Original Text)
Entity Type | IoU
Allergy/intolerance | 100.0% (perfect boundaries)
Condition | 99.7%
Medication | 99.6%
Immunization | 99.0%
Device | 98.8%
Observation | 98.8%
Social history | 96.9%
Procedure | 94.8%
Specimen | 94.3%
Family history | 92.2%

All IoU scores exceed 92%, meaning when we find an entity, we capture it accurately. The slightly lower scores for procedure and family history reflect boundary ambiguity (e.g., should "laparoscopic cholecystectomy" include "laparoscopic"?).

Review Pass: Impact and Cost-Benefit

We evaluated the optional "Review Pass" agent—a second-pass extractor that re-reads the document to catch missed entities:

Text Variant | Recall (No Review) | Recall (With Review) | Recall Gain (percentage points)
Original | 94.59% | 97.81% | +3.22
Condensed | 87.60% | 95.00% | +7.40
Typo Low | 76.83% | 86.39% | +9.56
Typo Medium | 68.89% | 78.92% | +10.03
Typo High | 63.01% | 71.08% | +8.07

Figure 15: Impact of the Review Pass agent across text variants.

The Review Pass adds up to 10 percentage points of recall on noisy text, but nearly doubles both latency and inference cost. For clean text, the gain of about 3 percentage points does not justify roughly doubling latency, so we disable it by default. For noisy or high-stakes documents, the gains may justify the latency increase.

Consistency Evaluation

We measured reproducibility across multiple independent runs using Multiple Intersection over Union (mIoU), where 1.0 indicates perfect reproducibility. The Data Engine shows exceptional consistency on clean data: Original text achieves mIoU 0.9856 and Condensed text 0.9189, confirming highly deterministic extraction.

For a hospital, this means the same note yields the same extracted data run after run—critical for audit trail and repeatable reporting.
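
The text does not spell out how mIoU is computed, so the following is only one plausible reading: treat each run's output as a set of extracted entities and average the pairwise set IoU across runs. The function names and the toy runs are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Set, Tuple

ExtractedEntity = Tuple[str, str]  # (entity_type, normalized_phrase)


def set_iou(a: Set[ExtractedEntity], b: Set[ExtractedEntity]) -> float:
    """Intersection over Union of two sets of extracted entities."""
    if not a and not b:
        return 1.0  # two empty outputs agree perfectly
    return len(a & b) / len(a | b)


def consistency(runs: List[Set[ExtractedEntity]]) -> float:
    """Average pairwise IoU across runs; 1.0 means identical outputs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(set_iou(a, b) for a, b in pairs) / len(pairs)


# Three toy runs over the same note; the third run misses one entity.
run_1 = {("Condition", "hypertension"), ("Medication", "lisinopril"),
         ("Observation", "elevated troponin")}
run_2 = set(run_1)
run_3 = {("Condition", "hypertension"), ("Medication", "lisinopril")}

score = consistency([run_1, run_2, run_3])
print(f"consistency={score:.4f}")
```

Under this reading, a single dropped entity in one of three runs already pulls the score down to about 0.78, which gives a sense of how strict values above 0.98 are.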

Figure 16: Consistency Comparison between five EHR variants. The model shows near-perfect reproducibility on Original and Condensed datasets.

Typo noise reduces consistency gradually—from 0.8326 (Low) to 0.7731 (Medium) to 0.7005 (High)—underscoring the value of upstream OCR correction for degraded scans.

Figure 17: Consistency of clinical extraction per Clinical Category.

Lessons Learned

Benchmark 2: Laboratory Results Extraction

Beyond general entity recognition, the Data Engine must excel at extracting quantitative laboratory observations—capturing not just a concept name, but a precise triplet of (name, value, unit). We built a dedicated benchmark with heterogeneous formatting styles (bullet lists, inline prose, markdown tables) for lab result processing.

The Lab Results Generation Pipeline

Following the same philosophy as our entity recognition benchmark, we generate synthetic EHRs with known ground truth lab values. The generation pipeline has four distinct phases:

Phase 1 (Lab Bank Generation): Generate a diverse bank of lab tests with realistic reference ranges (100+ unique labs across common categories).
Phase 2 (Coverage-First Allocation): Allocate labs across EHRs to maximize diversity. Constraint: each lab appears in at most 2 EHRs (coverage-first sampling).
Phase 3 (Value Generation): Generate realistic values and units for each allocated lab, including a controlled fraction of abnormal results.
Phase 4 (EHR Text Generation): Generate clinical notes that include those labs in varied real-world formats (prose, panels, or tables).
Output (Verified Ground Truth): Automated verification confirms every planned lab appears with the correct value and unit.

Figure 18: Lab results benchmark data generation pipeline
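
The coverage-first constraint in Phase 2 can be sketched as a greedy allocation that always prefers the least-used labs. This is an illustrative guess at the approach, not the pipeline's code: the function name, the greedy strategy, and the synthetic lab names are all assumptions.

```python
import random
from collections import Counter
from typing import List


def allocate_labs(lab_bank: List[str], n_ehrs: int, labs_per_ehr: int,
                  max_uses: int = 2, seed: int = 0) -> List[List[str]]:
    """Assign labs to EHRs so that no lab is used more than `max_uses` times."""
    rng = random.Random(seed)
    uses: Counter = Counter()
    allocation: List[List[str]] = []
    for _ in range(n_ehrs):
        available = [lab for lab in lab_bank if uses[lab] < max_uses]
        rng.shuffle(available)
        # Stable sort: least-used labs first, shuffled order kept within ties.
        available.sort(key=lambda lab: uses[lab])
        chosen = available[:labs_per_ehr]
        if len(chosen) < labs_per_ehr:
            raise ValueError("lab bank too small for this allocation")
        uses.update(chosen)
        allocation.append(chosen)
    return allocation


# Sizes follow the benchmark description: 10 EHRs with 20 labs each.
bank = [f"lab_{i}" for i in range(100)]
plan = allocate_labs(bank, n_ehrs=10, labs_per_ehr=20)
usage = Counter(lab for ehr in plan for lab in ehr)
print(len(plan), max(usage.values()))
```

Preferring unused labs first means every lab in the bank is covered once before any lab is repeated, which is the sense in which the sampling is "coverage-first".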

Evaluation Methodology

Lab extraction requires a more nuanced evaluation than general entity recognition. We must verify that the extraction captures the complete clinical observation—not just the lab name, but also the value and unit.
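
A simplified version of this triplet check might look like the sketch below. The three status labels mirror the legend used in the examples (exact match, value+unit match, missed), but the normalization rule and function names are assumptions. In particular, this sketch does not treat an abbreviation and its expansion as the same name, which a production matcher would need to handle.

```python
from typing import List, NamedTuple


class Lab(NamedTuple):
    name: str
    value: str
    unit: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def classify(truth: Lab, extracted: List[Lab]) -> str:
    """Return 'exact', 'value+unit', or 'missed' for one ground truth lab."""
    same_measurement = [
        lab for lab in extracted
        if normalize(lab.value) == normalize(truth.value)
        and normalize(lab.unit) == normalize(truth.unit)
    ]
    if not same_measurement:
        return "missed"
    if any(normalize(lab.name) == normalize(truth.name)
           for lab in same_measurement):
        return "exact"
    return "value+unit"


extracted = [
    Lab("Complement C3", "156.8", "mg/dL"),
    Lab("Prealbumin", "26.4", "mg/dL"),
]
ground_truth = [
    Lab("Complement C3", "156.8", "mg/dL"),
    Lab("Prealbumin (Transthyretin)", "26.4", "mg/dL"),
    Lab("Sodium", "140", "mmol/L"),  # hypothetical lab absent from the output
]
statuses = [classify(lab, extracted) for lab in ground_truth]
print(statuses)
```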

Our lab results benchmark achieved 100% recall across 200 ground truth lab observations. The table below summarizes the results:

Metric | Value | Description
Total EHRs | 10 | Synthetic clinical notes with heterogeneous lab formatting
Total Labs | 200 | Ground truth lab observations (20 per EHR)
Fully Matched | 200 | Labs where value+unit were correctly extracted

For a hospital, this suggests that even the most information-dense lab sections can be structured reliably.
Figure 19: Lab results benchmark summary. 100% recall demonstrates reliable lab extraction.

EHR Examples

The following example shows a real benchmark output. Each lab result is listed with its ground truth (GT), the extracted value, and the match status.

Exact Match (name+value+unit)
Value+Unit Match (name differs slightly)
Missed (FN)
EHR #1 — Bullet List Format (20/20 labs matched: 17 exact, 3 value+unit)
Chief Complaint / HPI: Patient presents for routine follow-up after hospitalization for dehydration; symptoms resolved and patient reports stable appetite and energy.
Labs / Studies:
Note text | GT | Extracted | Status
Complement C3: 156.8 mg/dL | Complement C3: 156.8 mg/dL | Complement C3: 156.8 mg/dL | Exact Match
Complement C4: 28.7 mg/dL | Complement C4: 28.7 mg/dL | Complement C4: 28.7 mg/dL | Exact Match
Butyrylcholinesterase: 8447.3 U/L | Butyrylcholinesterase: 8447.3 U/L | Butyrylcholinesterase: 8447.3 U/L | Exact Match
Prealbumin (Transthyretin): 26.4 mg/dL | Prealbumin (Transthyretin): 26.4 mg/dL | Prealbumin: 26.4 mg/dL | Value+Unit Match (name differs)
HDL Cholesterol: 32.1 mg/dL | HDL Cholesterol: 32.1 mg/dL | HDL Cholesterol: 32.1 mg/dL | Exact Match
Erythrocyte Sedimentation Rate: 8.6 mm/hr | ESR: 8.6 mm/hr | Erythrocyte Sedimentation Rate: 8.6 mm/hr | Exact Match
Thyroid Stimulating Immunoglobulin: NEG | TSI: NEG | Thyroid Stimulating Immunoglobulin: NEG | Exact Match
Absolute Lymphocyte Count: 2.90 10^9/L | ALC: 2.90 10^9/L | Absolute Lymphocyte Count: 2.90 10^9/L | Exact Match
Ceruloplasmin: 31.9 mg/dL | Ceruloplasmin: 31.9 mg/dL | Ceruloplasmin: 31.9 mg/dL | Exact Match
Urinalysis Protein: NEG | Urinalysis Protein: NEG | Urinalysis Protein: NEG | Exact Match
Potassium: 4.9 mEq/L | K: 4.9 mEq/L | Potassium: 4.9 mEq/L | Exact Match
Urine Red Blood Cell Count (Microscopic): 2.9 cells/hpf | Urine RBC (Microscopic): 2.9 cells/hpf | Urine Red Blood Cell Count: 2.9 cells/hpf | Value+Unit Match
Chromogranin A: 34.0 ng/mL | Chromogranin A: 34.0 ng/mL | Chromogranin A: 34.0 ng/mL | Exact Match
Lipoprotein(a): 18.1 mg/dL | Lp(a): 18.1 mg/dL | Lipoprotein(a): 18.1 mg/dL | Exact Match
Calcitonin: 4.6 pg/mL | Calcitonin: 4.6 pg/mL | Calcitonin: 4.6 pg/mL | Exact Match
White Blood Cell Count: 9.0 x10^3/uL | WBC: 9.0 x10^3/uL | White Blood Cell Count: 9.0 x10^3/uL | Exact Match
Blood Urea Nitrogen: 10.1 mg/dL | BUN: 10.1 mg/dL | Blood Urea Nitrogen: 10.1 mg/dL | Exact Match
Direct (Conjugated) Bilirubin: 0.24 mg/dL | Direct (Conjugated) Bilirubin: 0.24 mg/dL | Direct Bilirubin: 0.24 mg/dL | Value+Unit Match
Free Triiodothyronine: 2.33 pg/mL | Free T3: 2.33 pg/mL | Free Triiodothyronine: 2.33 pg/mL | Exact Match
D-dimer: 474.3 ng/mL FEU | D-dimer: 474.3 ng/mL FEU | D-dimer: 474.3 ng/mL FEU | Exact Match
Assessment and Plan: Patient clinically stable with labs largely within expected ranges. Continue current outpatient meds, encourage diet and activity, follow up in 4 weeks or sooner if new symptoms.

Figure 20: EHR #1 with bullet-list formatting. Green = exact match, Yellow = value+unit match.

Benchmark 3: Attribute Extraction Reliability

Beyond identifying which clinical entities appear in text, the Data Engine must correctly determine contextual attributes—such as whether a condition is negated, historical, or hypothetical. We evaluated this through a focused unit-style benchmark of 180 test cases.

The Unit Benchmark Pipeline

The testing process is fully automated: a Teacher LLM generates clinical snippets with specific attribute assertions (Phase 1), the Data Engine processes them (Phase 2), and the system automatically verifies the output against the ground truth (Phase 3).
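
The verification phase of such a unit benchmark reduces to comparing expected attributes against engine output and aggregating a pass rate per category. In the sketch below, the test inputs are taken from the examples in this section, while `stub_engine` is a deliberately naive, hypothetical stand-in for the Data Engine and the attribute keys are assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Test cases in the style of the benchmark: a snippet plus asserted attributes.
CASES: List[dict] = [
    {"category": "Negation", "text": "Patient denies hypertension",
     "expected": {"negated": True}},
    {"category": "Certainty", "text": "Possible pneumonia",
     "expected": {"certainty": "SUSPECTED"}},
    {"category": "Medication Status", "text": "Discontinued lisinopril",
     "expected": {"status": "DISCONTINUED"}},
]


def stub_engine(text: str) -> Dict[str, object]:
    """Naive keyword rules standing in for the real extraction engine."""
    lowered = text.lower()
    attributes: Dict[str, object] = {"negated": "denies" in lowered}
    if lowered.startswith("possible"):
        attributes["certainty"] = "SUSPECTED"
    return attributes  # note: medication status is never populated


def pass_rates(cases: List[dict],
               engine: Callable[[str], Dict[str, object]]) -> Dict[str, float]:
    """Fraction of cases per category where every expected attribute matches."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for case in cases:
        output = engine(case["text"])
        ok = all(output.get(key) == value
                 for key, value in case["expected"].items())
        totals[case["category"]] += 1
        passed[case["category"]] += int(ok)
    return {category: passed[category] / totals[category]
            for category in totals}


rates = pass_rates(CASES, stub_engine)
print(rates)
```

The stub fails the medication status case on purpose, to show how a per-category breakdown localizes a weakness that an aggregate pass rate would blur.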

Tested Capabilities

Category | Example Input | Extracted Attribute
Negation | "Patient denies hypertension" | negated: true
Temporal Context | "Appendectomy in 2015" | context: HISTORICAL
Certainty | "Possible pneumonia" | certainty: SUSPECTED
Body Site | "Left wrist pain" | laterality: left
Recommendations | "Advised to start walking" | is_recommendation: true
Family History | "Mother has diabetes" | family_member: mother
Medication Status | "Discontinued lisinopril" | status: DISCONTINUED

Figure 21: Examples of attribute extraction capabilities tested in the benchmark.

Benchmark Results

The benchmark demonstrated robust performance across key categories, with fundamental attributes achieving perfect scores.

Category | Pass Rate
Body Site Laterality | 100%
Negation | 100%
Temporal Context | 100%
Certainty | 100%
Recommendations | 100%
Family History | 80%
Medication Status | 75%

Figure 22: Pass rates for key attribute categories.

Negation, Temporal Context, Certainty, Body Site, and Recommendations all achieved 100% accuracy. These are the most critical attributes for correct clinical interpretation, ensuring that past history is not mistaken for active conditions and that ruled-out diagnoses are not coded as present.

For a hospital, this means the engine preserves clinical meaning (e.g., “rule out” vs. “diagnosed”), reducing the risk of incorrect coding, quality flags, and downstream analytics errors.

Moderate Performance & Analysis

Family History (80%) and Medication Status (75%) showed strong performance with minor deviations. A detailed analysis revealed that the majority of these "failures" were due to valid simplifications or vocabulary mismatches in the ground truth rather than extraction errors.

Overall, the results confirm that the Data Engine is highly reliable for the core attributes required for downstream clinical reasoning.

Benchmarking Striata

To validate that Striata can reliably answer real hospital analytics questions, we built a comprehensive benchmark suite. The suite contains 36 questions spanning the kinds of analyses that hospital administrators, quality teams, and clinical researchers routinely need—run against a synthetic EHR dataset of 60,000 patients.

How the benchmark works

Each benchmark question is given to Striata with no human assistance. The system must autonomously: (1) write SQL to extract the right patient cohort from the database, (2) generate and execute Python code to perform the analysis, and (3) produce a written insight summarizing the findings. We then automatically validate whether the correct data was extracted and the correct conclusions were reached.

Progressive complexity

Questions are organised into families, where each family explores a single clinical topic at four levels of increasing difficulty. This tests whether Striata can handle not just simple lookups, but also multi-layered stratifications and statistical modelling. Here is an example from one family:

Level i: How many patients were hospitalized for pneumonia? (Simple count and filtering)
Level ii: What is the mean length of stay for pneumonia admissions? (Requires date arithmetic and aggregation)
Level iii: What are the mean length of stay and mortality rate, stratified by presence of hypoxemia? (Multi-metric analysis with clinical subgroup stratification)
Level iv: ...stratified by age group and presence of hypoxemia? (Multi-level stratification across two dimensions simultaneously)

Figure 23: One question family (pneumonia hospitalisations) at four complexity levels. Striata must handle all four autonomously.

The full benchmark spans nine such families across two categories:

What we validate

For each question, we automatically check two things:

Walkthrough: one question end-to-end

To make this concrete, here is exactly what Striata produced for a single Level ii question — with no human guidance at any step.

Benchmark question Q1B-3ii
"Is the mean change in HbA1c over 12 months different between patients with diabetes treated with insulin versus oral hypoglycemic agents?"

1. Striata writes SQL to extract the right data
patients × conditions × medications × observations

SELECT patient, diabetes_type, medication, HbA1c_values
FROM patients
JOIN conditions    -- diabetes diagnosis (Type 1 or Type 2)
JOIN medications   -- insulin or oral agents (metformin)
JOIN observations  -- HbA1c lab results over time
WHERE condition IN (Type 1 diabetes, Type 2 diabetes)
  AND medication IN (Insulin, Metformin)

Result: 1,466 rows × 11 columns, covering every HbA1c reading for every diabetic patient on either medication.

2. Striata writes Python to build the analysis cohort

The generated code pairs each patient's baseline HbA1c (near medication start) with their 12-month HbA1c, then computes the change.

patient | diabetes | treatment | baseline HbA1c | 12-month HbA1c | change
df12…cb3a | Type 2 | insulin | 7.8% | 7.1% | −0.7%
7938…660b | Type 2 | metformin | 7.0% | 7.2% | +0.2%
a5ef…84bd | Type 1 | insulin | 11.3% | 7.2% | −4.1%
… 247 patients total (128 insulin, 119 metformin)

3. Striata chooses and runs the right statistical test

Without being told which method to use, Striata selected a Welch's t-test to compare the two groups and ran a sensitivity check with a non-parametric alternative.

Insulin group (n=128): −0.84% mean HbA1c change
Oral agent group (n=119): −0.18% mean HbA1c change

4. Striata produces the written finding
"Yes, the mean 12-month change in HbA1c differs significantly between groups. Insulin-treated patients achieved a mean reduction of 0.84%, compared to 0.18% in the oral agent group — a statistically significant difference of 0.66 percentage points favouring insulin (p = 0.005)."
Total time: 3.7 minutes — from question to written insight, fully autonomous.

Figure 24: End-to-end walkthrough of a single benchmark question. Striata autonomously selects the right tables, extracts a patient cohort, chooses a statistical method, and produces a written conclusion.
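
For readers unfamiliar with the test chosen in step 3, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The HbA1c changes are made-up illustrative values, not benchmark data, and this is not the code Striata generated.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and degrees of freedom for two independent samples."""
    # Squared standard error of each group mean (sample variance / n).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation; no equal-variance assumption.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof


# Hypothetical 12-month HbA1c changes (percentage points) for two groups.
insulin_change = [-0.7, -4.1, -0.5, -1.2]
oral_change = [0.2, -0.3, -0.1, -0.4]

t_stat, dof = welch_t(insulin_change, oral_change)
print(f"t={t_stat:.3f}, dof={dof:.2f}")
```

Welch's variant is a reasonable default when group variances may differ, as they plausibly do between insulin and oral-agent cohorts. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.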

Results

Across both categories, Striata successfully completed every question—producing a valid data extraction, statistical analysis, and written summary for all 36 questions without any human intervention.

For a hospital, this means managers can finally ask the questions that have been bothering them for years - and improve pathways and operations based on evidence, not gut feel.

Benchmark run | Questions | Success rate | Median time per question
Category 1A (descriptive) | 12 | 100% | ~3.3 minutes
Category 1B (inferential) | 24 | 100% | ~4.2 minutes

Figure 25: Striata benchmark results across 36 questions on a 60,000-patient synthetic EHR dataset. Every question completed successfully.

Currently, our focus is reducing the median time per question to under one minute and expanding the benchmark to additional use cases, including:

Conclusion

The combined product story is simple: the Data Engine standardizes messy clinical text into interoperable data, and Striata turns that data into answers. A question asked in plain language becomes a clear cohort and explicit analysis that can be repeated and audited.

We believe progress in healthcare depends on shortening the distance between question and evidence without lowering the standard of proof. That is why benchmarking is a constraint: improvements only count when reliability, latency, and correctness move in measurable ways.

Striata is designed to be inspectable, not a black box: for each answer, users can review the reduced dataset used for the computation and the analysis code that produced the result—and iterate conversationally with follow-ups to refine criteria and dig deeper without restarting from scratch.