Mining Wikipedia for health data

Published: April 16, 2015 12:27:51 AM

The use of internet media to track the wavefronts of fast-moving infections such as the flu is a successful strategy.

The rise of communicable diseases such as Ebola, swine flu and MERS, which have been developing into sudden public health catastrophes because they cannot be combated by common drugs and clinical techniques, has made it a priority to track diseases and contain their spread by preventive and social strategies before they reach epidemic proportions. Now, a paper by researchers at the Los Alamos National Laboratory and the universities of Utah and Iowa suggests that Wikipedia is a public repository from which data can be vacuumed to track both infections and public health responses almost in real time. The paper, by Geoffrey Fairchild, Sara Y Del Valle, Lalindra De Silva and Alberto M Segre, is available online, hosted by Cornell University.

The use of internet media to track the wavefronts of fast-moving infections such as the flu is a successful strategy. About a decade ago, it was noticed that search engine queries concerning a disease followed its progress on the ground much faster and more accurately than data reported by the public health system. Hospitals take days, if not weeks, to report patient data up the hierarchy, and the resulting database fails to capture afflicted people who do not visit public health facilities. Google, on the other hand, picked up every query about sniffles made by everyone with sniffles, and standard Big Data techniques were used to resolve these queries into a disease waveform which had a speed and a direction, a very useful advance warning for public health managers.

Social media had been harnessed to the cause by 2010, and last year, Wikipedia began to be mined for health data. The paper adds a fresh dimension, noting that the editing history of a Wikipedia article, such as the one on the Ebola outbreak, offers an accurate timeline of the spread of disease. Named-entity recognition can be used to extract the following data, the authors report: “dates, locations, case counts, death counts, case fatality rates, demographics, and hospitalisation counts in the text. These data are, in general, swiftly updated as new data become available.”
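To make the idea of pulling counts out of article text concrete, here is a minimal sketch in Python. It is not the authors' method: the paper uses named-entity recognition, while this stand-in uses a simple regular expression, and the function name, the pattern and the sample sentence are all invented for illustration.

```python
import re

# Hypothetical pattern: a number (with optional thousands separators),
# an optional qualifier word such as "suspected", then "cases" or "deaths".
# A regex stand-in for illustration only, not real named-entity recognition.
COUNT_PATTERN = re.compile(
    r"(\d[\d,]*)\s+(?:\w+\s+)?(cases|deaths)", re.IGNORECASE
)


def extract_counts(text):
    """Return a dict such as {'cases': 1200, 'deaths': 450} from free text."""
    counts = {}
    for number, kind in COUNT_PATTERN.findall(text):
        # Strip thousands separators before converting to an integer.
        counts[kind.lower()] = int(number.replace(",", ""))
    return counts


# Made-up sentence in the style of an outbreak article, not real data.
sample = "Officials reported 1,200 suspected cases and 450 deaths."
print(extract_counts(sample))
```

Running the same function over successive revisions of an article would give one set of counts per edit, which is the kind of timeline the paper describes.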

Along with external links (useful for data validation), images, imaged data and the geographical location of editors, the authors captured these entities in near-real time through Wikipedia’s official web application programming interface. Perhaps the biggest advantage of their proposal is that Wikipedia data is completely open.
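As a rough illustration of what reading an article's editing history through the web API can look like, the Python sketch below builds a revision-history query for the public MediaWiki API and turns the JSON response into a timeline. The helper names, the choice of article title and the User-Agent string are assumptions made for this example; the paper's own pipeline may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_revision_query(title, limit=50):
    """Build a MediaWiki API URL asking for an article's oldest revisions first."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "rvdir": "newer",       # oldest first, so the list reads as a timeline
        "format": "json",
        "formatversion": 2,     # 'pages' is returned as a list
    }
    return API_URL + "?" + urlencode(params)


def revision_timeline(response):
    """Flatten an API response into sorted (timestamp, revision id) pairs."""
    timeline = []
    for page in response.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            timeline.append((rev["timestamp"], rev["revid"]))
    return sorted(timeline)


def fetch_timeline(title, limit=50):
    """Fetch and parse the revision timeline (needs network access)."""
    # Illustrative User-Agent; a real client should identify itself properly.
    request = Request(
        build_revision_query(title, limit),
        headers={"User-Agent": "disease-timeline-sketch/0.1 (example)"},
    )
    with urlopen(request) as reply:
        return revision_timeline(json.load(reply))
```

Calling `fetch_timeline("Ebola virus disease", limit=10)` would return the timestamps and revision ids of that article's first ten edits, assuming the article exists under that title.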

While numerous services like HealthMap and Google’s Flu Trends and Dengue Trends have been used to show public health professionals accurate snapshots of the progress of communicable diseases, these are either closed source or require intervention by proprietary software. The public user has no choice over data sources and program functionalities, except those baked in by the developers. By way of example, with closed source projects, if someone wants to riff on the code for tracking dengue to bootstrap an encephalitis tracker, they just can’t.

The use of internet technologies to speed up, deepen and standardise disease tracking in India has significant potential. Health was made a state subject in recognition of the diversity of public health goals across the country. A state with districts in the Terai is likely to prioritise encephalitis, while a heavily industrialised state is more likely to be interested in cardiac and pulmonary disease. While this targets treatment delivery better, the downside is that surveillance, which is also undertaken by the states, varies from state to state in how reliably it generates data on the outbreak and management of disease.

The Integrated Disease Surveillance Programme was initiated only a decade ago and, prior to that, central oversight and intervention were restricted to specific problems which were deemed to be national in scope. Of these, while the oral polio drive has been spectacularly successful, malaria control has turned out to be an oxymoron which enriches the manufacturers of mosquito repellent. Old scourges like cholera, typhoid, smallpox, leprosy and elephantiasis were controlled by better drugs and vaccines rather than better epidemic management. But the communication-driven control of HIV, which was expected to reach pandemic proportions in India, is a great success story.

This is a chequered record, and the health machinery is faltering so badly on new scourges such as swine flu that some doctors have begun to describe them as “seasonal fevers”. Advance warning and targeted preventive communication may give health professionals the confidence to engage more purposively with such diseases, which are producing an unacceptably high mortality rate and public anxiety. Since it is doubtful whether the health authorities in any state have access to information on disease outbreaks in anything approaching real time, the authorities may wish to augment traditional forms of disease reporting with data from internet sources such as Wikipedia.
