
What the Nurses' Health Study Built, and What Rare Disease Lacks

What 121,700 women and 50 years of follow-up did for cardiovascular disease, cancer, and diabetes. What rare disease has never had. What patient-controlled data infrastructure could build that institutional infrastructure has not.

In 1976, Frank Speizer at Harvard mailed a questionnaire to registered nurses across 11 U.S. states. 121,700 women returned it. They agreed to fill out a follow-up survey every two years and to allow their medical records to be accessed for verification. Most of them continued participating for decades. Many are still contributing data in their late seventies and beyond.

The Nurses' Health Study (NHS) became one of the most productive longitudinal epidemiological studies in history. It has generated findings on cardiovascular disease, cancer, diabetes, and dozens of other conditions, and on exposures such as diet, hormone replacement, and physical activity. Its results have shaped U.S. Surgeon General recommendations, World Health Organization guidelines, and the 2008 Physical Activity Guidelines for Americans. It has produced hundreds of peer-reviewed papers across scores of diseases.

The power of the NHS comes from three properties: a large, well-characterized cohort; decades of continuous follow-up; and structured, standardized data collection from the beginning. The same people, answering the same questions, on the same schedule, for 50 years. The longitudinal dimension is what makes the findings possible. A snapshot of 121,700 women at one point in time would tell you what their health looked like at that moment. Five decades of follow-up tells you what causes what.

Rare disease has nothing comparable.

The Structural Gap

The largest rare disease registries are typically disease-specific, meaning they cover one condition. They are sponsor-controlled, meaning a pharmaceutical company or academic institution manages the data. They are short-duration, meaning they last as long as the grant or the drug development program that funds them. And they use inconsistent data standards, meaning data from one registry cannot be easily combined with data from another.

When a pharmaceutical company exits a rare disease market, the registry it funded may be shut down. When the principal investigator who runs a natural history study retires or loses funding, data collection stops. When a contract research organization that held trial data is acquired, the data may be absorbed into proprietary systems or simply lost.

The phenylketonuria (PKU) community has the longest-running treatment experience of any metabolic condition on the newborn screening panel. Dietary treatment began in the 1950s. Adults with PKU who were identified through screening in the 1960s are now in their late fifties and sixties. Their life histories, treatment trajectories, cognitive outcomes, and quality of life over six decades constitute an extraordinarily valuable longitudinal dataset.

That dataset does not exist in any centralized, structured, accessible form. It is scattered across dozens of metabolic clinics, hundreds of medical records systems, and thousands of individual lives. The data was generated. It was never aggregated.

What the Nurses' Health Study Did That Registries Cannot

The NHS was designed as infrastructure. It was not a clinical trial with a start and end date. It was not tied to a specific drug or intervention. It was a commitment to follow a defined cohort over time, collecting structured data on a regular schedule, and making that data available for research across conditions.

The multi-condition design was critical. The NHS did not set out to study only cardiovascular disease or only cancer. It collected baseline data on demographics, lifestyle, diet, physical activity, medication use, and medical history, then tracked outcomes across every condition that developed. The same cohort data supported analyses of hormone replacement and heart disease, diet and colon cancer, physical activity and diabetes, smoking and stroke, simultaneously. Discoveries emerged that no single-disease study would have found, because the connections between conditions were visible only in a dataset that covered all of them.

The rare disease equivalent would be a longitudinal study that enrolls people with PKU, galactosemia, Ehlers-Danlos syndromes (EDS), medium-chain acyl-CoA dehydrogenase deficiency (MCADD), and dozens of other conditions into a single data infrastructure, using shared data standards, with follow-up measured in decades. People with PKU who experience early cognitive decline at age 40 may share patterns with people with galactosemia who follow similar cognitive trajectories despite different dietary treatments. Adults with hypermobile EDS (hEDS) who develop cardiovascular autonomic dysfunction may share biomarker patterns with adults who have postural orthostatic tachycardia syndrome (POTS) from other causes. The cross-condition signals are invisible in disease-specific registries because the data never occupies the same space.
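
One way to picture "the same space" is a single record shape that every condition shares. The sketch below is only an illustration under stated assumptions: the `Observation` fields, the `trajectories_by_condition` helper, and the measure name `cognitive_score` are hypothetical, and the values are synthetic placeholders, not real data. It shows how one shared measure could be lined up by age across conditions once the records share a structure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One structured data point. All field names are hypothetical."""
    participant_id: str  # pseudonymous identifier
    condition: str       # e.g. "PKU", "galactosemia"
    age_years: float     # participant's age when the value was recorded
    measure: str         # name of a shared, standardized instrument
    value: float


def trajectories_by_condition(observations, measure):
    """Group one shared measure into age-ordered series per condition and participant."""
    grouped = defaultdict(lambda: defaultdict(list))
    for obs in observations:
        if obs.measure == measure:
            grouped[obs.condition][obs.participant_id].append(
                (obs.age_years, obs.value)
            )
    # Sort each participant's series by age so it reads as a trajectory.
    return {
        condition: {pid: sorted(series) for pid, series in by_pid.items()}
        for condition, by_pid in grouped.items()
    }


# Synthetic placeholder records, deliberately out of age order.
records = [
    Observation("a1", "PKU", 41.0, "cognitive_score", 2.0),
    Observation("a1", "PKU", 40.0, "cognitive_score", 1.0),
    Observation("b1", "galactosemia", 40.0, "cognitive_score", 3.0),
    Observation("b1", "galactosemia", 40.0, "sleep_score", 9.0),
]

result = trajectories_by_condition(records, "cognitive_score")
for condition, by_participant in result.items():
    print(condition, by_participant)
```

The point of the design is that the comparison across conditions becomes a simple grouping step. In disease-specific registries with different field names and schedules, the same comparison would first require a mapping effort for every pair of registries.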

What It Took to Start

The NHS launched in 1976 with the data collection tools available in 1976: mailed paper questionnaires. The instruments were imperfect. The cohort was not representative of the general population (overwhelmingly white, middle-class, American nurses). The early surveys asked simpler questions than the sophisticated instruments developed later.

None of that mattered for the most important thing the NHS did, which was to start. The longitudinal value of the study compounds over time. Data collected with imperfect instruments in 1976 turned out to be more valuable than no data at all, because the trajectory is what generates the insight, and you cannot go back in time to collect the trajectory you missed.

A person with PKU who begins contributing structured data today and continues for 10 years generates a 10-year longitudinal record by 2036. If the infrastructure is not built until 2029, and data collection begins then, the same person generates a 7-year record by 2036. Those three years of data are gone permanently. The cognitive decline that happened between ages 35 and 38, the medication change at age 36, the blood phenylalanine trend during a difficult period at age 37: all lost.
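
The arithmetic in this example can be written out directly. The sketch below assumes that "today" means 2026 (inferred from a 10-year record ending in 2036) and that follow-up is continuous once it starts. The function name `record_years` is illustrative only.

```python
def record_years(start_year: int, end_year: int) -> int:
    """Length in years of a continuous longitudinal record."""
    if end_year < start_year:
        raise ValueError("end_year must not precede start_year")
    return end_year - start_year


TODAY = 2026          # assumption: inferred from "10 years ... by 2036"
DELAYED_START = 2029  # the later start date used in the text
HORIZON = 2036

start_now = record_years(TODAY, HORIZON)           # record if collection begins today
start_late = record_years(DELAYED_START, HORIZON)  # record if collection begins in 2029
lost = start_now - start_late                      # years that can never be recovered

print(f"start now: {start_now} years, start late: {start_late} years, lost: {lost} years")
```

However far the horizon is extended, the gap stays fixed at the length of the delay: the late record never catches up.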

The adults with PKU who are 55 today are the last generation whose treatment began in the earliest years of newborn screening. Their outcomes over nearly six decades represent irreplaceable natural history data for a treated population. That data is vanishing into fragmented medical records and clinical notes that no one is systematically collecting.

The Design Difference

The NHS is institutionally held. Harvard administers it. NIH funds it. The data is managed by academic researchers under institutional review board oversight. The participants consented broadly to research use of their data. The model is institutional custody with broad consent.

The rare disease equivalent does not need to copy the NHS model. A data trust, where the people who contribute data retain ownership and governance rights, can achieve the same longitudinal power with a different custody structure. The critical design elements are the same: structured data collection, standardized instruments, regular follow-up, multi-condition scope, and a commitment to persistence that outlasts any single funder or investigator.

The NHS proved that a large, long-term, structured data collection effort can transform medical knowledge across dozens of conditions simultaneously. It also proved that starting with imperfect tools is better than waiting for perfect ones. The nurses who filled out paper questionnaires in 1976 could not have imagined the discoveries their data would generate four decades later. They did not need to. They needed to start.

Rare disease needs to start.