A team of Harvard statisticians has come up with a new way to track the flu virus using Internet search data. The system relies on a method that has been used before, tracking searches for key words and phrases, but couples it with additional data to improve its accuracy.
The new model, ARGO (AutoRegression with Google search data), was described in the Proceedings of the National Academy of Sciences. It builds on Google Flu Trends, Google’s flu-tracking model, which launched in 2008 but was discontinued last August. Google Flu Trends’ predictions often diverged from the retrospective analysis of the U.S. Centers for Disease Control and Prevention (CDC); its demise ultimately resulted from an underestimate of the 2009 H1N1 swine flu outbreak and a massive overestimate of cases in the 2012-13 flu season. One of the major issues with Google Flu Trends was that it did not account for changes in people’s search behavior.
ARGO reportedly accounts for changes in Google’s search engine and shifts in search behavior. The new model also combines Google data with historical records from the CDC and information on the seasonality of the flu. The result, according to Samuel Kou, a professor of statistics, is the most precise method yet. “If you see a spike in search volume, it probably indicates that something is going on,” Kou said. The team of researchers is reportedly working to make ARGO widely available.
People learn as they search for information, Kou said, changing their queries and becoming better searchers. “If I want to search for something, I do it better now than I did two years ago. Besides, Google’s search engine evolves, and so does the interaction between people and the engine,” Kou said.
For ARGO, Kou and colleagues took Google’s search trend data and designed a model that could self-correct for changes in how people search. The model has a two-year sliding window in which it recalibrates current search-term trends against the CDC’s historical flu data (the gold standard for flu data). They also made sure to exclude wintertime search terms unrelated to flu, such as March Madness and the Oscars, so that these would not be accidentally correlated with seasonal flu trends. Finally, they incorporated data on the historical seasonality of flu.
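To make the sliding-window idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the researchers' implementation: the function names, the three-week lag count, the ridge penalty, and the synthetic data are all hypothetical, and a plain ridge-penalized least-squares fit stands in for whatever estimator the published model uses. Each week, the model is refit on only the previous 104 weeks (two years) of flu levels and search volumes, so older search habits drop out of the calibration.

```python
import numpy as np

def build_features(flu, searches, t, n_lags):
    """Features for week t: intercept, the n_lags previous flu values, week t's search volumes."""
    lags = flu[t - n_lags:t][::-1]
    return np.concatenate(([1.0], lags, searches[t]))

def sliding_window_forecast(flu, searches, window=104, n_lags=3, ridge=1e-3):
    """Estimate flu[t] with a model refit each week on the previous `window` weeks only."""
    estimates = {}
    for t in range(window + n_lags, len(flu)):
        X = np.array([build_features(flu, searches, s, n_lags)
                      for s in range(t - window, t)])
        y = flu[t - window:t]
        # Ridge-penalized least squares: an assumption made here to keep the fit stable.
        A = X.T @ X + ridge * np.eye(X.shape[1])
        coef = np.linalg.solve(A, X.T @ y)
        estimates[t] = float(build_features(flu, searches, t, n_lags) @ coef)
    return estimates

# Synthetic, made-up data: five years of weekly values with a yearly cycle.
rng = np.random.default_rng(0)
n_weeks = 260
weeks = np.arange(n_weeks)
flu = 2.0 + np.sin(2 * np.pi * weeks / 52) + 0.05 * rng.standard_normal(n_weeks)
searches = np.column_stack([
    flu + 0.1 * rng.standard_normal(n_weeks),  # a query that tracks flu activity
    rng.standard_normal(n_weeks),              # an unrelated query
])

est = sliding_window_forecast(flu, searches)
errors = [abs(est[t] - flu[t]) for t in est]
print(f"weeks estimated: {len(est)}, mean absolute error: {np.mean(errors):.3f}")
```

Because the coefficients are re-estimated every week from a fixed-length window, a search term whose relationship to flu activity drifts over time gets reweighted automatically, which is the self-correcting behavior described above.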
The result was a model that significantly outperformed the Google Flu Trends estimates for the period from March 29, 2009, to July 11, 2015. ARGO also beat other models, including one based on current and historical CDC data. It’s a move in the right direction, David Lazer, an expert in computational social science at Northeastern University, told Ars Technica. In the future, Lazer sees such models being improved by adding more sources of data from Twitter, Facebook, and other sites. “There’s tremendous value in big data,” he said. “But we have to think carefully about the distinctive types of noise that comes in.”