Yahoo Releases Largest Cache of Internet Data – Wall Street Journal
In the race among tech companies to attract top talent in artificial intelligence, Yahoo Inc. is making a dramatic move: giving away a huge amount of data about how users interact with its services.
On Thursday, the embattled Internet company said it would release the largest cache of Internet behavior data—the clicks, hovers and scrolls of some 20 million anonymous users on Yahoo’s sports, finance, news, real estate and other pages. The trove, which will be available only to universities, is expected to give researchers a rare, real-world look at how large numbers of people behave online.
Yahoo, which is facing a brain drain after years of stagnant growth, is looking to attract academic researchers in the fast-growing and highly competitive field of artificial intelligence.
The Yahoo data dump comes at a time when technology companies are racing to strengthen their ties with academia, particularly in areas of artificial intelligence known as machine learning and deep learning, which involve training machines to mine massive data sets so they can respond to complex queries or make predictions. Facebook Inc. and Google have recruited top researchers; for instance, Yann LeCun, who joined Facebook in 2013, continues to run New York University’s Center for Data Science.
“No matter how much talent you have, there is always more on a manager’s bucket list,” said Andrew Moore, Dean of the School of Computer Science at Carnegie Mellon University. “No one in these big technology companies feels like they have enough people to do the things they want to do.”
Large quantities of data are necessary for machine learning, in which computers spot complex patterns and figure out, in Yahoo’s case, say, what kinds of headlines or design features attract teenage girls living in Rapid City, S.D., at 7:30 p.m. Such data sets are rare outside major Internet companies, and they’re closely held for what they can reveal about the business. The Yahoo data set weighs in at 13.5 terabytes, about two-thirds the size of the Library of Congress.
That is larger than anything available to the vast majority of academic computer scientists, and so big that it likely will have to be stored outside a university system, possibly in a cloud computing center run by Amazon.com Inc. or Alphabet Inc.’s Google, said Carnegie Mellon’s Mr. Moore, a former Google executive. The university signed a five-year, $10 million partnership with Yahoo last year to develop personalized apps based on user data.
“Data is not easy to come by for folks not inside companies,” said Gert Lanckriet, a professor in the Department of Electrical and Computer Engineering, University of California, San Diego, who spoke at an event announcing the data release.
The Yahoo cache’s sheer size makes it valuable, experts said. Algorithms capable of analyzing large amounts of data differ fundamentally from those designed for less data. Yahoo’s generous release can help researchers learn how to build large-scale algorithms, which are especially useful to corporations. Yahoo has released over 50 data sets since 2006, including a cache of 100 million Flickr photos in 2014. Its largest past release was 413 gigabytes, a fraction of the current set. Google and Amazon have released relatively little data.
Tension is higher than ever between the need to attract talent and generate new ideas, on one hand, and the need to protect privacy and competitive advantage, on the other, said Hilary Mason, the founder of Fast Forward Labs, a data science startup. Many of the large technology companies are trying to create the same sorts of capabilities, she said, such as self-driving cars, image recognition, and personalized services. Yahoo runs a small risk of exposing trade secrets by releasing user data, but it has decided that the reward of attracting talent could be greater.
While several companies have released data aimed at researchers, the practice has a fraught history. AOL, in the process of releasing data to researchers in 2006, accidentally revealed search queries. Netflix released the movie recommendations and logs of hundreds of thousands of customers in 2009, offering a $1 million prize to anyone who improved its recommendation algorithm. In both instances, outsiders used the data to deduce users’ identities, leading to class action lawsuits over violations of privacy laws. Netflix cancelled its prize.
Facebook in 2014 landed in hot water when it worked with researchers at Cornell University and elsewhere to study and manipulate users’ emotions. The study, which adjusted the content of users’ news feeds to generate emotional responses, set off a huge privacy backlash. Facebook since has limited the proprietary data it makes available to outsiders.
“Ever since the AOL privacy debacle in 2006, companies have been afraid to release data,” Ms. Mason said.
Yahoo’s cache appears to be less sensitive. It includes only basic demographic information such as city, age, and gender, along with clicks and other interactions with Yahoo Web properties. The data set was scrubbed to provide a strong barrier to tracing information to an individual, said Ricardo Baeza-Yates, Yahoo Labs’ Chief Research Scientist. For example, data emanating from rural areas with small populations was excluded.
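The kind of scrubbing described above, dropping data from places too small to hide an individual, can be illustrated with a minimal sketch. It is not Yahoo’s actual method, which the company did not detail. The record fields, the city names, the threshold of three, and the function name `suppress_small_groups` are all assumptions made for illustration, and the number of records per city stands in for population size.

```python
from collections import Counter


def suppress_small_groups(records, key, min_count):
    """Drop every record whose `key` value appears fewer than `min_count` times.

    Illustrative only: record count per group is used as a rough proxy
    for the population of the place the data came from.
    """
    counts = Counter(record[key] for record in records)
    return [record for record in records if counts[record[key]] >= min_count]


# Hypothetical toy records; these field names and values are invented.
records = [
    {"city": "Springfield", "age": 34, "gender": "F", "clicks": 12},
    {"city": "Springfield", "age": 51, "gender": "M", "clicks": 3},
    {"city": "Springfield", "age": 27, "gender": "F", "clicks": 8},
    {"city": "Smallville", "age": 45, "gender": "M", "clicks": 5},
]

# Keep only cities that contribute at least three records.
kept = suppress_small_groups(records, key="city", min_count=3)
```

In this toy example the lone record from the smaller city is removed and the three others survive. Suppression of this sort lowers, but does not eliminate, the chance that remaining attributes in combination point to one person.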
Write to Elizabeth Dwoskin at email@example.com