Delphi Group Uses Data To Forecast the Flu and Other Epidemics
Working to help officials manage future public health emergencies, Carnegie Mellon University researchers want to forecast infectious disease outbreaks the way meteorologists predict the weather.
Outbreaks of diseases like COVID-19, or a resurgence of one like monkeypox, can happen at any time of year, said Rosenfeld, University Professor of machine learning, language technologies, computer science and computational biology at Carnegie Mellon.
He co-founded the Delphi Research Group in 2012 with a colleague who is now a professor in the University of California, Berkeley’s Department of Statistics, to use data to create epidemic forecasts during normal times as well as during public health emergencies. The forecasts can then help people take preventive measures to avoid catching and spreading illnesses, including influenza, RSV and COVID-19, Rosenfeld said.
“Delphi tries to provide early warning to public health authorities by scanning our indicators for unexplained upward trends,” he said. “Delphi's indicators can provide a real-time geographically detailed view of the trend’s dynamics and spread, and Delphi's short-term forecasts can provide geographically detailed risk estimates for a few weeks' horizon.”
Using data to track and predict outbreaks
Some who catch a respiratory illness may only suffer minor symptoms, but because the risk to vulnerable groups, such as infants and those who are immunocompromised, is so much greater, the forecast can better inform and influence their personal decision-making, as well as decisions by public health officials and healthcare organizations.
For example, during the peak week of flu season, which can vary by a few weeks from place to place and by a few months across seasons, the risk for people in those groups can be up to 40 times higher than the risk off-season, Rosenfeld said.
“It should be possible to make people aware of when the wave is coming to their city, at different times of the year in different seasons,” he said. “I believe we're not far from a time when people will be able to look on their phone and see what is the current level of circulation of any major pathogen in their city and what is the current prediction of when a wave will arrive.”
Members of the Delphi Research Group realized that to make these predictions meaningful, they needed to aggregate and curate as much reliable, real-time data as possible. The group has expanded to include Will Townes, assistant professor in Carnegie Mellon’s Statistics & Data Science Department in the Dietrich College of Humanities and Social Sciences; an assistant professor in Carnegie Mellon’s School of Computer Science; a statistics professor at the University of British Columbia; and staff members, graduate students and undergraduates.
Now, the repository they built lists more than 1,600 distinct indicators for a variety of pathogens, with a total of over 5 billion de-identified records. Millions of records are collected, cleaned up, categorized and added daily. These include traditional government statistics; indicators derived from insurance claims, laboratory test results and electronic medical records; and statistics on nighttime coughing and search trends.
More data means better prediction accuracy, with the diversity and volume of sources allowing researchers to confirm suspected trends and tell them apart from random fluctuations, Rosenfeld said.
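As a rough illustration of how several independent sources can help separate a real trend from noise, here is a minimal Python sketch. It is not Delphi's method: the function names, the seven-point window and the 60% agreement threshold are all assumptions chosen for the example, and the indicator values are made up.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def confirmed_upward_trend(indicators, window=7, min_share=0.6):
    """True when at least min_share of the indicators rise over the last `window` points.

    `indicators` maps a source name to a list of recent values (oldest first).
    Both `window` and `min_share` are illustrative assumptions.
    """
    rising = 0
    usable = 0
    for series in indicators.values():
        recent = series[-window:]
        if len(recent) < 2:
            continue  # not enough points to estimate a slope
        usable += 1
        if slope(recent) > 0:
            rising += 1
    return usable > 0 and rising / usable >= min_share


# Made-up example: two of three sources are rising, one is flat noise.
example = {
    "claims": [10, 11, 13, 14, 17, 19, 22],
    "lab_tests": [0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.05],
    "search_trends": [50, 48, 51, 49, 50, 48, 49],
}
agrees = confirmed_upward_trend(example)
```

Because only the sign of each slope is compared, sources measured on very different scales can be combined without rescaling. A single noisy source rising on its own would not be enough to trigger the flag.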
“We learned that perhaps the biggest obstacle to useful forecasts is the lack of data,” he said. “We initially focused on improving our projection of the future of epidemics, but soon realized that if we improve our situational awareness about the present that will automatically translate into improved forecasts for the future.”
Sleep Cycle, a sleep-tracking technology company, recently partnered with Delphi to provide the research group with privacy-protected aggregated sleep data, including information about coughing and breathing patterns from wearable and sleep-monitoring devices.
Since symptoms like coughing and congestion often appear days before someone seeks medical care, this data offers earlier warning of outbreaks than hospital records, Rosenfeld said.
“I envision a future where epidemic forecasting is everywhere, properly understood and useful,” he said.
Why partnerships and revisions matter
Roughly half of Delphi’s indicators now come directly from nongovernment partners, according to Rosenfeld, who emphasized the importance of building partnerships for data access.
These include healthcare companies’ electronic health record summaries and laboratory testing results that are not publicly released, as well as nonadjudicated insurance claims, which arrive faster than finalized billing data. These relationships are carefully negotiated to ensure all data is first de-identified.
“We are constantly reaching out to launch these collaborations with organizations who hold data that is of value,” Rosenfeld said.
The professor who directs the Machine Learning Department said Delphi’s work represents one way the department applies research to create broader societal impact.
“Delphi reflects what MLD is about: combining strong statistical foundations with modern machine learning to tackle urgent, real-world problems,” he said. “Delphi will continue to play a leading role in shaping epidemic forecasting in the U.S. and stands as a powerful example of MLD's innovative ecosystem.”
Public health data evolves. Early reports are often incomplete and get revised over time.
Instead of relying only on finalized numbers, Delphi preserves each version of the data as it was originally reported. This allows the group to test forecasts under real-world conditions, using the same provisional information decision-makers must rely on in real time.
“If you train a forecasting model on finalized and cleaned-up data, you’re cheating,” Rosenfeld said. “In real life, forecasters only have access to messy, preliminary data.”
Training on provisional data produces more accurate models, which in turn lead to more accurate forecasts, better decisions and improved public health.
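The idea of preserving every version of a reported number can be sketched in a few lines of Python. This is only an illustration under assumed conventions, not Delphi's actual storage format: the record layout (reference date, issue date, value), the function name `as_of` and the sample numbers are all hypothetical.

```python
from datetime import date


def as_of(records, cutoff):
    """Return {reference_date: value} as the data looked on `cutoff`.

    Each record is a (reference_date, issue_date, value) tuple, where
    reference_date is the day the value describes and issue_date is the day
    that version was published. Versions issued after `cutoff` are ignored,
    and for each reference date the most recently issued remaining version wins.
    """
    latest = {}
    for ref, issued, value in records:
        if issued > cutoff:
            continue  # this revision was not yet available on the cutoff day
        if ref not in latest or issued > latest[ref][0]:
            latest[ref] = (issued, value)
    return {ref: value for ref, (issued, value) in latest.items()}


# Made-up example: one day's count is first reported low, then revised twice.
records = [
    (date(2024, 1, 1), date(2024, 1, 3), 80),    # first, incomplete report
    (date(2024, 1, 1), date(2024, 1, 10), 100),  # first revision
    (date(2024, 1, 1), date(2024, 1, 20), 105),  # finalized value
]

early_view = as_of(records, date(2024, 1, 5))   # what a forecaster saw on Jan. 5
final_view = as_of(records, date(2024, 2, 1))   # what the record shows later
```

Evaluating a forecast made on Jan. 5 against `early_view` rather than `final_view` reproduces the information that was actually available at the time, which is the point Rosenfeld makes about finalized data.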
From pandemic patchwork to lasting infrastructure
Delphi began with a focus on influenza and shifted to COVID-19 during the pandemic. The group rapidly scaled up thanks to volunteers and temporary collaborators, including dozens of engineers from Google and elsewhere outside the university, who helped build data pipelines quickly.
Starting in April 2020, Delphi collected real-time data on self-reported COVID-19 symptoms and other disease indicators nationwide. County-level information about the coronavirus pandemic was updated continuously and shared with both the public and health researchers.
In September 2020, Google.org donated $1 million to Carnegie Mellon to support COVIDcast, the Delphi group’s effort to track and forecast localized COVID-19 activity nationwide.
During the height of the pandemic, Delphi’s Epidata database, which includes COVIDcast, received an average of 100,000 queries per day. At that time, Delphi began producing COVID-19 forecasts, then sharing them with the CDC.
In 2023, Delphi became one of 13 national Centers for Outbreak Analytics and Disease Modeling at the CDC. Since then, Delphi has re-engineered its entire data ingestion system using modern tools, creating a more uniform, scalable platform that can bring new data sources online every few weeks, Rosenfeld said. The overhaul has already allowed the group to expand dramatically, with hundreds of additional indicators added in recent years.
“The five-year funding horizon gave us the depth and the confidence to revamp our systems to make them faster and more responsive,” said Adam Johns, Delphi’s engineering manager. “This agreement with the CDC allowed us to think more about the future and to redesign our pipelines to be much more uniform, robust and easy to maintain, with the ability to scale at need to more and larger data sources.”
Public health officials use the records to inform their actions and communications, and to support their own forecasting activities. Health care systems can use them to inform decisions on purchasing and equipment positioning, scheduling of elective procedures, vacations and other short-term staffing decisions. Individuals can use the forecasts to assess current and near-term risk, and influence personal decision-making.
Data sources that were available only during the pandemic are still accessible for retrospective analysis, and are also configured for rapid resumption during the next emerging event, Rosenfeld said.
“We believe that with the next public health emergency, some of them will be reopened, so we want to be ready for that,” he said.
Making public health data more useful to the public
Delphi’s public-facing Epidata platform allows registered and unregistered users to browse, visualize and download this data without needing advanced programming skills (registration is required for large-volume downloads). Users can filter by disease, geography, data source or time period, then plot trends or export the information for further analysis. Beyond public health, registered users access the data for research, education, forecasting, analysis and incorporation into reports.
Delphi also helps users discover data that exists elsewhere, such as local and state public health agencies, documenting where the data lives and how it might be accessed, said Peter Jhon, Delphi’s executive director and strategic coordinator of public health research initiatives, adding that most of the available data can be repurposed for noncommercial use through a Creative Commons license.
Ultimately, Delphi wants to quantify infectious disease risks and make them local, timely, understandable and actionable, especially when a new epidemic is on the horizon.
“One of our core values is to make our data as maximally accessible as possible,” Jhon said. “We want to give the public better insight into the information that we have.”