Whether you love or hate their paywall, the Times successfully balances competing business pressures by taking a deep view of its data.
Since our initial DataEngConf in 2015, The New York Times has been a key supporter of the conference. The very first DataEngConf talk was a keynote given by Chris Wiggins, the Times' Chief Data Scientist, who presented a broad yet fascinating perspective on "Data Science at The New York Times" (video here).
In the years since, we've had deeply technical talks from both data engineers and data scientists at the Times, and I'm excited that their involvement in DataEngConf this year is as large as it's ever been.
What Value Does Data Have to the Times' Business?
In an interview, Nick Rockwell, the Times' CTO, explained to me the significance of data in the organization and why he has doubled down on his company's efforts to reshape the way it uses data over the past several years.
Nick also shared some of their recent technical learnings, and explained ways that they have made data management and analysis much easier from a data infrastructure and tooling perspective.
One of the changes they made over the past couple of years was to move their data infrastructure off Hadoop and into Google's BigQuery. This has simplified and improved data management and consistency, and also allows them quicker access to the data they need without the typical batch processing delays that are required in a Hadoop-based system.
Nick said that while it was challenging to switch platforms, they now have deeper visibility into their business metrics and are excited to continue to expand the breadth of their schema and data collection based on initial results.
With these thoughts as a backdrop on how the Times approaches data, here are the great talks you can expect to hear from their team this year at DataEngConf NYC '17.
Keynote Panel: The Future of Data Science in the Media
Chris Wiggins is back as a panelist on our opening day keynote panel of data scientists in media. There are an ever-increasing number of ways that data is shaping the media business, and the quantification of both media and advertising is advancing quickly, providing new opportunities for measurement and analysis. On the panel, Chris is joined by Claudia Perlich (Dstillery), Jonathan Roberts (Dotdash), Haile Owusu (Mashable) and moderator Adam Kelleher (BuzzFeed). Expect Chris to pull no punches and for a lively conversation to ensue.
In a talk sure to be fascinating, Anne Bauer, Sr. Data Scientist, will explain a project that applies data science to increase the efficiency of single copy (newsstand) sales, a significant source of revenue for the company.
Predicting sales demand is a quintessential time series modeling problem, and when Anne's team began the project they expected that their new algorithms would easily outperform the existing methods. Up until that time, the company had predicted the volume of single copy sales using heuristics and rules of thumb that had been hard-coded in COBOL.
But to their surprise, those hard-coded rules worked quite well; they were the result of years of work by people with a lot of experience in the business. Under randomized controlled trials, their first statistical models significantly underperformed compared to that baseline. This was a reminder of the potential for "data science hubris" in thinking that one can simply throw modern tools at a complex business problem and expect to surpass a proven solution easily.
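The lesson above can be illustrated with a minimal backtest. The Times' actual models and COBOL rules are not public, so everything here is invented for illustration: a weekday-aware rule of thumb ("order what sold on this weekday last week") versus a naive moving-average model, scored by mean absolute error on synthetic sales data with a strong weekly cycle — exactly the regime where a seasoned heuristic can beat a generic model.

```python
# Hypothetical sketch: scoring a hard-coded heuristic against a simple
# "modern" model. All rules, names, and data here are invented.

def heuristic_forecast(history):
    """Rule-of-thumb baseline: order what sold on the same weekday last week."""
    return history[-7]

def moving_average_forecast(history, window=28):
    """A naive model: mean of the last `window` days, ignoring seasonality."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def mean_absolute_error(forecasts, actuals):
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)

def backtest(series, model, warmup=28):
    """Walk forward through the series, forecasting one day at a time."""
    forecasts = [model(series[:t]) for t in range(warmup, len(series))]
    return mean_absolute_error(forecasts, series[warmup:])

# Synthetic daily sales with a strong weekly cycle.
weekly = [120, 80, 75, 70, 90, 150, 200]
series = [weekly[d % 7] for d in range(120)]

baseline_err = backtest(series, heuristic_forecast)
model_err = backtest(series, moving_average_forecast)
print(baseline_err <= model_err)  # → True: the heuristic wins here
```

On this toy data the heuristic is exact while the moving average smears out the weekday signal; a model only wins once it captures the structure the old rules already encode.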
Make sure to stop by Anne's talk at the conference to hear her story of how they improved the algorithm, and discover other lessons the data science team learned in applying modern data science techniques to a long-standing, and offline, business problem.
Data Engineering: The Crossword's Puzzle - Building a Recursive BigQuery Mapper
Ever play the Times' legendary Crossword? Their digital version is something you can conveniently play anywhere on any platform. Due to the success of the application and its large number of users, the engineering team was presented with some challenges as they considered how to move the application to the cloud.
Namely, as they gave up their MySQL database in exchange for Google Cloud Datastore, they lost the ability to execute certain kinds of queries representing questions their users often asked. Questions like "Was today's puzzle harder than last week's?", "What quantile does my time fall in?", and "What word did solvers struggle with the most?"
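Each of these questions is an aggregation over all solvers, which is trivial in an analytical store but awkward in Datastore. As a hedged illustration of the quantile question (the Times actually answers these in BigQuery over the replicated data, not in application code; the function and data below are invented):

```python
from bisect import bisect_right

def quantile_of(my_time, all_times):
    """Fraction of solvers whose solve time was at or below `my_time`
    (lower time = faster solve)."""
    times = sorted(all_times)
    # Count of solvers at or below my time, as a share of all solvers.
    return bisect_right(times, my_time) / len(times)

# Invented solve times in seconds for one day's puzzle.
solve_times = [95, 140, 180, 210, 260, 300, 345, 400, 520, 610]
print(quantile_of(210, solve_times))  # → 0.4
```

In an analytical warehouse this is a one-line percentile aggregation; without one, every such question needs a full scan of the entity group, which is exactly the capability the team set out to restore.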
So how did they solve this problem? They wrote a Go application that replicates the data from Datastore into Google's BigQuery orders of magnitude faster than other solutions. With this system they were able to achieve over 1.5 million streaming inserts per second, a significant number by any standard.
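The recursive idea behind that kind of throughput can be sketched in a few lines. This is an assumption-laden toy, not the Times' Go implementation: split a range of rows in half until each piece is small enough for one worker to stream, then deliver it. In production, each recursive call would fan out to a new serverless worker and `stream_rows` would call BigQuery's streaming-insert API; here both are stand-ins.

```python
# Hypothetical sketch of recursive fan-out replication (the Times' actual
# Go system is not public; names and limits here are invented).

MAX_CHUNK = 1000  # rows one worker streams directly

def stream_rows(rows, sink):
    """Stand-in for a BigQuery streaming insert by one worker."""
    sink.extend(rows)

def map_range(rows, sink, max_chunk=MAX_CHUNK):
    """Recursively halve the work until a chunk is small enough to stream."""
    if len(rows) <= max_chunk:
        stream_rows(rows, sink)
        return
    mid = len(rows) // 2
    # In the real system these two calls would fan out to parallel workers.
    map_range(rows[:mid], sink, max_chunk)
    map_range(rows[mid:], sink, max_chunk)

sink = []
rows = list(range(10_000))
map_range(rows, sink)
print(len(sink) == 10_000)  # → True: every row delivered exactly once
```

The appeal of the recursive shape is that no coordinator needs to know the total size up front; each worker only decides whether to stream or to split again, which is what lets the fan-out scale with the data.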
At his DataEngConf talk, Darren McCleary, the software engineer who built this system, will present their complete solution and walk us through some of the challenges his team faced. Darren had to think differently about this particular problem, and came away pleased with the creative, recursive way they used their serverless infrastructure. It demanded a new style of thinking about modern application architecture, and it offers a valuable model for other engineers in how microservices can communicate across a distributed platform.
We're thrilled to have The New York Times join us for deeply technical talks at our upcoming conference. Don't miss this unique opportunity to chat with their engineering and data science teams at DataEngConf NYC at Columbia University.