A sample of the ‘Lorem Ipsum’ placeholder text seen on a mobile phone. tolgart (Getty Images / iStockphoto)
In the graphic design industry, text that has yet to be written is filled in with what is known as Lorem Ipsum, Latin gibberish derived from a text by Cicero from which syllables and characters have been removed. If we enter these two words into PrivaSeer, a privacy policy search engine created by three researchers at Pennsylvania State University, we get more than two thousand results: specifically, 2,462 pages that should explain how a company uses its customers’ data but that, when they were indexed, displayed a nonsensical string of words in at least some of their sections. “There are pages out there that have not yet published their privacy policy. And in many cases it is illegal. But it happens,” explains Shomir Wilson, assistant professor at Pennsylvania State University, who developed PrivaSeer in collaboration with Lee Giles, professor at the same institution, and Mukund Srinath, a PhD student.
The Latin texts are one of the unexpected findings that this search engine can yield. It was designed to bring greater transparency to privacy policies, documents already notorious for being hard to digest. Other problems were already known: “The most significant problem is the time it takes to read these policies and how complicated they are,” sums up Srinath. And many others are yet to be discovered: “There is a growing community of researchers interested in studying the privacy policies of applications and pages, and most of the collections available so far have been relatively small,” continues Wilson.
For now, PrivaSeer has indexed more than a million privacy policies collected by a web crawler capable of identifying these documents based on a series of keywords. Once the texts have been identified, a natural language processing system automatically extracts their characteristics, so that each search not only shows the texts that contain specific words, but also allows additional information to be collected about those results: which industries those policies correspond to, what tracking technologies are mentioned, what regulations are taken into account… “As the filters become richer and more informative, we will be able to display more information,” promises Srinath.
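To make the idea concrete, here is a minimal sketch of keyword-based identification followed by simple feature extraction. It is not PrivaSeer's actual pipeline, which the article does not describe in detail: the keyword lists, the threshold and the function names below are all hypothetical, and plain string matching stands in for the natural language processing system mentioned above.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
POLICY_KEYWORDS = (
    "privacy policy",
    "personal data",
    "personal information",
    "cookies",
    "third parties",
)
TRACKING_TERMS = ("cookies", "web beacons", "pixel tags", "local storage")
REGULATION_TERMS = ("GDPR", "CCPA", "COPPA")
PLACEHOLDER = "lorem ipsum"


def normalize(text):
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def looks_like_privacy_policy(text, min_hits=2):
    """Return True if at least `min_hits` distinct policy keywords appear."""
    lowered = normalize(text)
    hits = sum(1 for keyword in POLICY_KEYWORDS if keyword in lowered)
    return hits >= min_hits


def extract_features(text):
    """Collect simple, filterable facts about a candidate policy text."""
    lowered = normalize(text)
    return {
        "tracking": [term for term in TRACKING_TERMS if term in lowered],
        # Regulation acronyms are matched case-sensitively as whole words.
        "regulations": [
            reg for reg in REGULATION_TERMS if re.search(rf"\b{reg}\b", text)
        ],
        "has_placeholder_text": PLACEHOLDER in lowered,
    }


if __name__ == "__main__":
    sample = (
        "Privacy Policy. We use cookies and web beacons to collect personal "
        "data. We comply with the GDPR. Lorem ipsum dolor sit amet."
    )
    if looks_like_privacy_policy(sample):
        print(extract_features(sample))
```

A real system would need far more robust classification than a keyword count, but the sketch shows how extracted attributes such as tracking technologies, regulations or leftover placeholder text could become search filters.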
Why do we need a search engine of this type? “On the one hand, we are gossips,” sums up Giles, who during his career has already created several specialized search engines that share the suffix “Seer” (CiteSeer, ChemSeer, BotSeer…). “In addition, the search engine allows us to see large-scale trends in consumer privacy, details that we cannot always detect in the news. And we can gain visibility into how privacy changes over time,” Wilson continues.
Unexpected variety
Although the researchers initially expected to find quite a few similarities between the indexed texts, the reality is that there is less copy-paste in the sector than one might imagine. “Very few companies use privacy policy generators. And the ones that do borrow the original structure, but make a considerable number of changes,” confirms Srinath. Is it good that there is so much diversity, or would it be better if privacy policies were more standardized? “I think it would be worrying if companies were just copying and pasting without articulating what each part means to their business,” Wilson reasons.
In the near future, the researchers hope to develop automated processes that allow new privacy policies to be indexed and existing ones updated, and to apply more sophisticated analysis methods to extract more information. Will we see other languages on PrivaSeer? The plan is for them to arrive. “At least in the European Union, the most common practice is for a company to publish its policies in a single language, usually English, and if it adds a second, it is the dominant language in the country where the business is located,” says Wilson. “An open question that I am working on with another research group is: How often do policies written in different languages contradict each other? We do not know yet, but we have found cases in which they do not have the same content.”
In the long term, the researchers hope that initiatives like PrivaSeer will lead to a new approach to these privacy policies: a format that truly informs people about what is happening with their data and allows them to make effective decisions about it. “We want to reveal more about how the consumer privacy landscape works on the internet and we hope that that information will be used by regulators to influence what comes next,” Wilson concludes.
The goal is not an easy one. To do their work, these researchers need, among other things, to obtain funding that allows them to carefully study texts that the rest of society, as a general rule, ignores. However, Giles is optimistic. “It’s easier to find funding for things that people know about. But now people are starting to worry about privacy. So I think it’s a good time.”