Roxana Geambasu and Augustin Chaintreau, both assistant professors of computer science at Columbia Engineering, along with PhD student Mathias Lecuyer, created a tool called XRay to understand how personal data is being used on web services like Google, Amazon, Facebook and YouTube.
"Today we have a problem: the web is not transparent. We see XRay as an important first step in exposing how websites are using your personal data," said Geambasu.
We live in a "big data" world, where staggering amounts of personal data - our locations, search histories, emails, posts, photos, and more - are constantly being collected and analysed by many web services.
While harnessing big data can certainly improve our daily lives, these beneficial uses have also generated a big data frenzy, with web services aggressively pursuing new ways to acquire and commercialise the information.
"It's critical, now more than ever, to reconcile our privacy needs with the exponential progress in leveraging this big data," said Chaintreau.
"If we leave it unchecked, big data's exciting potential could become a breeding ground for data abuses, privacy vulnerabilities, and unfair or deceptive business practices," Geambasu added.
Determined to provide checks and balances on data abuse, the researchers designed XRay to be the first fine-grained, scalable personal data tracking system for the web.
For example, one can use the XRay prototype to study why a user might be shown a specific ad in Gmail. Geambasu and Chaintreau found that a Gmail user who sees ads about various forms of spiritualism might have received them because he or she sent an email message about depression.
"The theoretical results were encouraging, but seemed too good to be true. So we tested XRay in actual situations, learning from experiments we ran on Gmail, Amazon, and YouTube, and refining the design multiple times.
"The final design surprised us: XRay succeeded in all the experiments we ran, and it matched our theoretical predictions in increasingly complex cases," the researchers said.
The current XRay system works with Gmail, Amazon, and YouTube. However, XRay's core functions are service-agnostic and easy to instantiate for new services, and they can track data within and across services.
The key idea in XRay is to use black-box correlation of data inputs and outputs to detect data use.
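To make the idea of black-box correlation concrete, here is a minimal Python sketch. It is not XRay's actual algorithm, only an illustration under stated assumptions: several test accounts each hold a different subset of inputs (emails), we record whether a given ad appears in each account, and we score every input by how much more often the ad shows up when that input is present. The function names, the sample emails, and the simulated ad service are all hypothetical.

```python
from itertools import combinations


def all_subsets(inputs):
    """Build one hypothetical test account per subset of the inputs."""
    items = list(inputs)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


def score_inputs(accounts, ad_seen):
    """Score each input by (ad rate with input) - (ad rate without input).

    accounts: list of frozensets, the inputs placed in each account.
    ad_seen:  list of booleans, whether the ad appeared in that account.
    """
    every_input = set().union(*accounts)
    scores = {}
    for item in every_input:
        with_item = [seen for acc, seen in zip(accounts, ad_seen) if item in acc]
        without = [seen for acc, seen in zip(accounts, ad_seen) if item not in acc]
        rate_with = sum(with_item) / len(with_item) if with_item else 0.0
        rate_without = sum(without) / len(without) if without else 0.0
        scores[item] = rate_with - rate_without
    return scores


def simulated_service(account, trigger="email about depression"):
    """Stand-in for the black box: shows the ad only if the trigger is present."""
    return trigger in account


emails = ["email about depression", "email about travel", "email about cooking"]
accounts = all_subsets(emails)
ad_seen = [simulated_service(acc) for acc in accounts]
scores = score_inputs(accounts, ad_seen)

# The input with the highest score is the most likely cause of the ad.
likely_cause = max(scores, key=scores.get)
print(likely_cause, scores[likely_cause])
```

In this toy setup the service is deterministic, so the triggering email stands out clearly. A real service is noisy and far larger, so trying every subset of inputs would not scale, and telling real targeting from coincidence would need many more observations.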
To assess XRay's practical value, the researchers created an XRay-based demo service that continuously collects and diagnoses Gmail ads related to a set of topics, including various diseases, pregnancy, race, sexual orientation, divorce, and debt.