Standing up a patient chart viewer from zero to MVP
Real-world evidence (RWE) platform leveraging EMR, claims-linked longitudinal data, and health registries to power pharmaceutical research across a portfolio of liver-disease indications.
The setup
TargetRWE builds real-world evidence infrastructure for pharmaceutical sponsors running studies across a portfolio of liver-disease indications. The data estate spans EMR charts and health registries across a research network of tier-1 academic medical centers, with patient records tokenized and linked to external claims datasets via industry-standard vendors (Datavant-style DOD linking) for longitudinal cohort studies and NLP model validation at scale. Standards alignment: FHIR, OMOP, DICOM.
When I joined, the Patient Chart Viewer didn't exist. Data scientists and clinical analysts worked unstructured EMR notes through ad-hoc SQL and spreadsheets, with no unified workflow for ground truthing, annotation review, or cohort definition. Sigma dashboards on Snowflake (ODS + EDW) required manual SAS refreshes before a cohort could be materialized.
My role sat at the intersection of engineering, data science, and medical affairs: translating clinical hypotheses into productized data pipelines and acting as the risk manager of credibility for what reached pharma sponsors. Most of my contribution was upstream of code: scoping schemas, writing validation specs, and defining the acceptance criteria that engineering and DS implemented.
What was broken
Clinical NLP models needed structured ground truth at scale, specifically for liver-disease endpoints (biopsy findings, alcohol-use mentions, fibrosis progression, medication history), but there was no operator-friendly interface to label unstructured EMR records.
Cohort identification was hypothesis-driven but the tooling was batch. A data scientist filed a SAS request, waited hours, reviewed results in a spreadsheet, then iterated, turning what should be a 10-minute hypothesis test into a multi-day loop.
Linked EMR + claims data needed data-quality contracts and an AI model build/validate process before pharma sponsors would trust downstream analytics.
A distributed engineering + data science team across US time zones needed clear requirements, acceptance criteria, and a rhythm for standups, grooming, and retrospectives.
What I did
- 0→1 RWE product definition. Drove requirements gathering across clinical informatics, biostatistics, data science, and engineering to scope the MVP. Owned the technical architecture, the data model for linked EMR + claims, viewer components, and annotation interactions, and presented the design to pharma-facing stakeholders for buy-in.
- LLM operations for clinical data extraction. Architected LLM-powered pipelines for extracting structured data from unstructured EMR notes (liver biopsy findings, alcohol-use mentions, medication history, lab values). Cut manual review time ~45% and accelerated trial startup by ~3 weeks on average across studies.
- AI/ML ground truthing workflows. Designed annotation guidelines and labeling workflows (Encord) for clinical NLP models processing unstructured liver-disease EMR data. Defined inter-annotator agreement thresholds and review loops for biopsy, alcohol-use, and medication-history extraction feeding downstream NLP validation.
- Multi-site ingestion + PII-safe landing zone. Owned the S3 → ODS → Snowflake EDW pipeline with Presidio-based PII redaction, automated QC, duplicate-patient quarantine logic, and cross-site person-domain matching. Aligned the pipeline with healthcare standards: FHIR, OMOP, and DICOM.
- Snowflake + Sigma + SAS cohort pipeline. Built Sigma BI dashboards on Snowflake ODS and EDW and automated the SAS refreshes, so analysts could run hypothesis-driven cohort queries in real time instead of waiting on batch runs. Tokenized patient records and linked them to external claims datasets (Datavant-style DOD linking) for longitudinal views.
- Data quality + AI model validation. Codified data quality contracts for the EMR + claims pipeline and the AI model build/validate process, with traceability from ingest → curation → model output, plus QC sanity checks and unit-test gating before release.
- Agile rhythm for a distributed team. Led cross-functional standups, retrospectives, and backlog grooming for engineering + data science, translating clinical hypotheses into user stories, acceptance criteria, and release trains.
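To make the inter-annotator agreement threshold from the ground-truthing item concrete, here is a minimal sketch using Cohen's kappa for two annotators. It is an illustration, not the project's actual gate: the function names, the fibrosis-stage labels, and the 0.8 threshold are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical threshold; the real cut-off used on the project is not stated.
KAPPA_THRESHOLD = 0.8


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1 - expected)


def needs_adjudication(labels_a, labels_b, threshold=KAPPA_THRESHOLD):
    """True when agreement falls below the threshold and the batch should go to review."""
    return cohens_kappa(labels_a, labels_b) < threshold


# Illustrative labels only (fibrosis stages assigned by two annotators).
annotator_1 = ["F0", "F1", "F1", "F2"]
annotator_2 = ["F0", "F1", "F2", "F2"]
kappa = cohens_kappa(annotator_1, annotator_2)
flagged = needs_adjudication(annotator_1, annotator_2)
```

A batch that falls below the threshold would be routed back through the review loop rather than promoted as ground truth. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the natural swap.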
What moved
Manual clinical-review time cut ~45% via LLM-powered extraction, accelerating trial startup by ~3 weeks on average.
Feasibility studies and prospective trials supported with 100% on-time delivery of pharma packages.
Patient Chart Viewer live with stakeholder buy-in; cohort identification moved from batch SAS refreshes to real-time Sigma dashboards.
Annotation guidelines adopted as the ground-truth standard for downstream clinical NLP model validation.
What I learned
- The risks in RWE don't shout, they whisper. Protocol drift (eligibility quietly stretched), incomplete provenance (data pulled from multiple systems without clear traceability), PHI creep (identifiers slipping back into datasets despite de-identification), unchecked assumptions ("everyone knows how this variable is collected" until you find out they don't), stakeholder misalignment (sponsors expecting trial-like rigor while operational teams adapt to real-world flexibility). Most of my job here was making those whispers visible early, turning each into a validation rule, schema decision, or phased delivery plan.
- The trade-off I wrestled with most. Speed versus rigor on clinical-variable standardization. Engineering wanted raw processor outputs in the warehouse fast; medical affairs wanted full standardization first. I usually landed on phased delivery: promote the safe categorical layer now, standardize the noisier quantitative layer next cycle, and flag low-confidence outputs explicitly. That kept sponsor timelines intact without spending data scientists' trust.
- The detective work that mattered. When sponsor analysts flagged missing values on a standard clinical lab, the kind that should appear on every chemistry panel, the investigation meant comparing ODS to EDW patient-by-patient, breaking discrepancies out by site and by cycle date, and finding a promotion-logic bug long before the sponsor escalation curve. The pattern was clear enough that I started treating "lab missingness investigations" as a recurring product surface with its own query schema, sampling protocol, and known root-cause taxonomy.
- What I'd do differently. I underestimated how much of the job was translation, between engineering's idea of "done," a data scientist's idea of "usable," and a sponsor's idea of "defensible." Doing this again from day one, I'd build the validation-criteria template before the pipeline, not after.
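The ODS-to-EDW comparison described in the detective-work item can be sketched roughly as follows. This is a simplified illustration over in-memory rows, not the actual investigation query: the field names (`patient_id`, `site`, `cycle_date`, `lab_code`, `value`) and the placeholder lab code `LAB_X` are assumptions.

```python
from collections import defaultdict


def missing_after_promotion(ods_rows, edw_rows, lab_code):
    """Find patients whose lab value exists in ODS but is absent from EDW.

    Returns a dict keyed by (site, cycle_date) with the sorted patient IDs
    dropped between the two layers, so discrepancies cluster by site and cycle.
    """
    # Every (patient, cycle) pair for which EDW holds a non-null value.
    edw_present = {
        (row["patient_id"], row["cycle_date"])
        for row in edw_rows
        if row["lab_code"] == lab_code and row["value"] is not None
    }
    dropped = defaultdict(list)
    for row in ods_rows:
        if row["lab_code"] != lab_code or row["value"] is None:
            continue  # only count values that actually existed upstream
        if (row["patient_id"], row["cycle_date"]) not in edw_present:
            dropped[(row["site"], row["cycle_date"])].append(row["patient_id"])
    return {key: sorted(ids) for key, ids in dropped.items()}


# Hypothetical sample data: site B loses its values in the 2024-02 cycle.
ods = [
    {"patient_id": "p1", "site": "A", "cycle_date": "2024-01", "lab_code": "LAB_X", "value": 31.0},
    {"patient_id": "p2", "site": "A", "cycle_date": "2024-02", "lab_code": "LAB_X", "value": 44.0},
    {"patient_id": "p3", "site": "B", "cycle_date": "2024-02", "lab_code": "LAB_X", "value": 52.0},
    {"patient_id": "p4", "site": "B", "cycle_date": "2024-02", "lab_code": "LAB_X", "value": 28.0},
]
edw = [
    {"patient_id": "p1", "site": "A", "cycle_date": "2024-01", "lab_code": "LAB_X", "value": 31.0},
    {"patient_id": "p2", "site": "A", "cycle_date": "2024-02", "lab_code": "LAB_X", "value": 44.0},
    {"patient_id": "p3", "site": "B", "cycle_date": "2024-02", "lab_code": "LAB_X", "value": None},
]
report = missing_after_promotion(ods, edw, "LAB_X")
```

When the drops concentrate in a single site and cycle, as in this sample, the likelier cause is promotion logic rather than source data. That is the kind of signal the breakdown by site and cycle date is meant to surface.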
Also worth a look
- FDA CRL Analyzer
Personal project: 202 FDA Complete Response Letters analyzed with DuckDB + Polars to extract regulatory deficiency patterns, therapeutic area trends, and sentiment. Same pattern as RWE analytics: unstructured regulatory docs → structured cohort analysis.
- Drug Name Normalizer
Personal project: a client-side web app that normalizes messy medication names to standardized generics via the NIH RxNorm API. This is the entity-resolution backbone of any medication-history pipeline in claims or EMR work.