The speed thesis
Speed of data infrastructure deployment is itself a clinical outcome measure. The Nurses' Health Study took 15 to 25 years to produce its consequential findings. Each year of delay in starting the equivalent for rare disease pushes back every subsequent finding by the same gap.
A child with classic galactosemia born today will, in expected developmental terms, encounter the cognitive and motor consequences of the disease over the next two decades. Galactosemia is on the newborn screening panel and treated from the first weeks of life with dietary restriction. Despite that, long-term outcomes in galactosemia cohorts include speech apraxia, motor coordination problems, and learning differences in substantial fractions. The pattern is well-documented and not well-understood.
The reason it is not well-understood is that the longitudinal data that would explain it has not been collected. Most galactosemia patients are followed at the metabolic clinic where they were diagnosed. The clinic captures clinical encounters in unstructured notes and routine laboratory values. The data that would correlate dietary patterns with cognitive outcomes, identify the dietary or metabolic features that predict better trajectories, or characterize the residual biochemical perturbation that current dietary management does not eliminate, is the data that does not exist at the scale needed to answer the questions.
If a longitudinal study of galactosemia patients started today and ran for fifteen years, it could detect treatment-response patterns by 2041. If the same study started in 2028 instead, the same patterns would be detected in 2043. The two-year delay costs the field two years of insight that subsequent generations of galactosemia children would have benefited from.
The cost is borne by the children, in cognitive outcomes that improved earlier diagnostic understanding could have improved. The data infrastructure question, in this framing, has a clinical outcome attached to it. Speed of infrastructure deployment is itself a clinical outcome measure.
What gets lost in delay
Three categories of loss accumulate during years of infrastructure delay.
The first is the loss of patients who age out of the relevant cohort. The galactosemia patient who is 35 today, with 30 years of post-diagnostic life history, holds the natural history data for galactosemia in adulthood. The patient at 35 in 2026 contains data that no clinical record at any institution holds. If the patient never participates in structured longitudinal data collection, the data is lost when the patient transitions out of metabolic care, ages, and dies. The data does not transfer to the next patient because the data was never captured outside the patient's own life history.
The second is the loss of pre-treatment baseline. The cohort of patients diagnosed today, who could provide the natural history baseline against which future therapies are compared, has a baseline only if the data is captured during their pre-treatment period. A future gene therapy for galactosemia evaluated in 2040 will be compared against natural history controls. The natural history controls have to come from somewhere. They have to come from patients whose pre-treatment trajectory was systematically captured. If the systematic capture starts in 2030 instead of 2026, the cohort of pre-treatment patients available in 2040 is four years smaller and four years less mature.
The third is the loss of cumulative analytic value. Each year of accumulated longitudinal data multiplies the analytic power of the dataset more than additively. A five-year dataset supports cross-sectional comparisons and limited time-series analyses. A ten-year dataset supports trajectory analysis, treatment-effect detection over decade-scale windows, and the kind of cohort comparisons that can isolate population-specific effects. A twenty-year dataset, the timescale at which the Nurses' Health Study produced its most consequential findings, supports analyses that the ten-year dataset cannot. Each year of delay pushes back the date the dataset crosses the analytic thresholds that allow the next generation of findings.
What the math says about urgency
The acceleration math is not abstract. The Nurses' Health Study began in 1976. The cardiovascular findings that defined preventive cardiology emerged in the 1990s and 2000s. The interval between data collection start and the consequential finding is approximately 15 to 25 years. The interval cannot be compressed by adding more participants in the present; the temporal dimension is what produces the insight, and the temporal dimension cannot be back-filled.
A rare disease longitudinal data infrastructure that begins in 2026 produces its first consequential findings in approximately 2041. A version that begins in 2030 produces its first consequential findings in approximately 2045. The four-year delay translates directly into four years of children with rare disease growing up with management that the subsequent findings could have improved.
The compounding extends. The first consequential finding informs the next round of clinical questions. The next round of analyses, conducted on the dataset that has continued to grow, produces the second consequential finding several years later. Each finding delays. The infrastructure that started later runs every subsequent finding later by the same gap.
What the urgency does not justify
The speed thesis does not justify low-quality data. A poorly collected dataset that runs for fifteen years produces noise rather than signal. The standards have to be met from the beginning, because the longitudinal value of the data depends on consistency of measurement over time. Changing instruments mid-study, restructuring the data model to fix architectural choices made early, or relaxing the consent and provenance standards under time pressure all undermine the analytic value the patience was meant to produce.
The speed thesis also does not justify ignoring the affected community in the design of the infrastructure. The infrastructure that the affected community does not trust does not get used. Building the infrastructure with community governance from the beginning takes longer than building it without, and produces an infrastructure that supports the long-term goal. Skipping the governance step to ship faster produces an infrastructure that ships and then fails to populate because contributors do not contribute to a structure they have no role in.
What the urgency does justify is starting now with the infrastructure that current technology supports, with the consent frameworks that current law supports, and with the governance structures that the affected community can establish quickly. The infrastructure improves over time. The data quality improves over time. The longitudinal dimension does not.
The mission statement made concrete. The first cure depends on data that has not been collected yet. The data has to start collecting before the cure can be developed. The interval between data collection and cure is real, and it is the interval that the rare disease community is racing against.