Daniel J. Willis on Data Journalism, Public Records, Salaries and Statistics
Daniel J. Willis is a “data journalism mercenary, guild freelancer,” and “truth-seeking malcontent” according to his Twitter bio. He’s a big guy – not Andre the Giant big, but big – who is also one of the nicest people you’re likely to meet, with a contagious enthusiasm for journalism, data, and truth. The kind of person who makes hard, verifiable facts something to get excited about. He knows his way around a newsroom, having worked for Bay Area News Group – publisher of the Oakland Tribune and a chain of other Bay Area newspapers – for nearly a decade. My conversation with Danny is the first of a series I hope to do with people working in a variety of interesting domains at the crossroads of technology and something else. We caught up in Oakland, California in late January 2017.
The following interview has been edited for brevity and style. For the full interview, listen to the audio version below, or on SoundCloud.
Justin Allen: To start off, how would you define data journalism to someone who is not familiar with the term?
Daniel Willis: Well, I would immediately be snarky and say journalism that involves data, because I’m kind of a jerk like that. It’s basically journalism using a large dataset as a source rather than a person. Rather than meeting a hooded figure in the shadowy parking garage, you get a massive amount of data from some public repository, FOIA (Freedom of Information Act) request, PRA (Public Records Act), scraping it together yourself out of public records on paper, whatever, then mining that for the story tip.
Well, that’s one side of it. That’s the side I’m more focused on. The other side would be presenting that data for easy consumption by the public online, which I’ve done but it’s not my wheelhouse.
JA: How did you wind up working at this crossroads of technology and journalism?
DW: I got a degree in economics with a focus on econometrics and statistical analytics and then discovered that I did not want to work at a Wall St. firm, whatsoever. It’s like a circle of hell. Which means my degree was completely useless. Then I got a part-time job at the local paper because I had been between jobs for a year, because I graduated right around the time Arthur Andersen went under, which means there were no jobs in accounting or finance anyway. That part-time job, due to someone just not showing up one day, turned into a full-time job answering phones, being a clerk.
Then, all of a sudden the opportunity came up: a large dataset came in due to a successful public records lawsuit. I was sick of answering the phones, so I showed up to a meeting uninvited and the reporter who had just spent $100,000 of the company’s money suing for this information, who had no idea what to do with it, who was open to literally anything, was even open to listening to a clerk who said, “Oh, no, I could totally just put that online and then mine it for stories and data and stuff. I actually have a degree in this, believe it or not, even though I’ve been delivering your mail for the past couple of years.”
And then that project was very well received, won a whole bunch of awards, everyone loved it, and I’m still doing basically the same thing though I’m a little bit better at it.
JA: Is that still in a form people can go check out online?
DW: No, it’s not. About four months after I got laid off the whole project just kind of went away in a big website refresh.
JA: That’s painful.
DW: It is painful, because now I need to have clips because I’m a freelancer, and they took that away four months after I was laid off. It was about three-quarters of a trillion dollars of local government spending, itemized to the penny, that they just didn’t feel like putting up when they refreshed their website. Not that I’m bitter, obviously.
JA: Some basic facts about you. Where are you from? How old are you? Where do you live?
DW: I’m from Concord, California. I still live there because it’s the only place in the Bay Area where housing is still affordable. I’m 34. I went to UC Santa Cruz. I’m a Banana Slug, which is how I managed to get a degree in economics without falling under the sway of Wall St.
JA: Any interesting takeaways from your time at Santa Cruz?
DW: One thing that did lead to this was econometrics, which at the time was tools used for money, exclusively. Of course they’re the Banana Slugs, obviously they were not beholden to tradition, so they were all about this whole broadening-the-econometric-tools-to-things-other-than-money thing before it was cool.
They actually hosted a lecture by then-Baseball America editor Nate Silver about sabermetrics and how he used econometric tools and how Bill James used econometric tools to predict baseball stats rather than stock market movement. He kept dropping a lot of hints about how you could also use it for, say, poll numbers, but no one knew that was him yet.
JA: That’s amazing because it completely transitioned into my next question: I know you’re a sports fan.
DW: Yes.
JA: Is there a relationship between that and your love of hard data?
DW: Oh, yeah. I grew up as a huge baseball fan and I owned Baseball Prospectus. Like, I saved up allowances for Baseball Prospectus as a kid. I was so into that, because I’m the least athletic person in the entire world and that is where being a nerd and being a sports fan combine.
I could not play, but by God I could know everyone’s on-base percentage. Plus, I’m an A’s fan, and they kind of…
JA: Moneyball.
DW: Yeah, they kind of invented that. That was basically how I ended up a math nerd.
“I have a file box of people threatening lawsuits in writing.”
JA: What do you think was the most, to you, exciting and impactful data journalism project you’ve worked on?
DW: Actually it was probably that first one, which I continued working on until literally the day I was laid off. That was the one work thing I did that day before I got the call asking me to come in for my exit interview. Which was a public employee salary database. Every public employee, all their compensation, all their monetary compensation, their cost of employment…
JA: In the State of California.
DW: Yeah. Not everyone in the State of California because there were only two of us on the project and there’s a whole lot of municipalities, and each one required its own public records request every year, but like I said, three-quarters of a trillion dollars over eight years. It was a very large database.
JA: Where did you get all the data from?
DW: Individual Public Records Act requests. The California PRA is a fantastic thing. The lawsuit that was won said that salaries fall under the California Public Records Act. We would send a request under the Public Records Act to the titular head of the agency, be it the city manager or board president of the Water District, or whatever, for everyone’s salaries broken down the way we needed it for the previous calendar year. The data would come in in all sorts of very exciting formats, which I would then clean up and process.
It was pretty much three months out of every year, but the web traffic was consistently, year-round, in the millions per year. It would blow up the servers of the company we were using to host it every year, no matter what safeguards they put in place. It was extremely popular, extremely impactful. State legislators would mention it on the floor of the state senate over the state’s own version, to make their point. That was impactful, to say the least. That shifted discussion all over the place.
When BART went on strike a couple of years ago, both the union and BART cited us for the records, even though BART itself gave us the information. They could have very well cited themselves, but people still looked it up on our database.
JA: That’s a big project, over the years, so what would be some of the more interesting anecdotes to come out of that project?
DW: Every year we would find something and we had to dig deeper and deeper every year, but my personal favorite was actually just before BART went on strike. I’m not saying there’s a connection. I actually kind of hope there wasn’t, because it was very inconvenient to me personally. The highest paid person at BART the year before had been fired for incompetence two years prior, but due to the settlement agreement she was still drawing her full salary and she was the highest paid person at the agency even though she hadn’t worked there. She had been straight-up fired 18 months before.
JA: What was the salary? Do you remember?
DW: It was over half a million dollars, total compensation. Her replacement didn’t make that much.
JA: Publishing people’s salary information, that’s a fairly sensitive thing. I could see people taking that personally. Did you get people mad at you?
DW: Oh boy, yes, but strangely enough, when I went into it I expected the rank and file to be furious. My dad, when I started, he actually was a city employee. I expected him and his friends to come after me because they knew where I lived.
That wasn’t the case at all. Most of our traffic on the first day was from employees in the database. I got thank-yous from public works employees around the Bay Area saying, “Thank you for telling people objectively that I am not some rich government fat cat living on some huge salary. I make like 50 grand, I’m in it for the benefits.”
And no one would believe it because of the whole “government employees make a fortune” rhetoric. Really on the lower end, the maintenance, the blue-collar, the rank and file – teachers especially were grateful. You would see on social media them posting their own salaries on there saying, “See! See! Search for me! I told you!”
Meanwhile, the people at the top end of the spectrum, the managers, the city managers, the elected and the appointed officials? Furious.
I have a file box of people threatening lawsuits in writing. Some of them would actually have their lawyer write up the lawsuit and send it to me and say, “Take it down or else I’ll file this for the courts.” And I never took it down and they never actually followed through. I actually have my “threats” box just from that project.
JA: Trophies.
DW: Yeah, and I don’t think a single person made less than six figures. Cash. Not even including benefits. And that’s not what I expected. I expected the people who were making more wouldn’t care and the people making less would be upset, but it was the opposite.
The less someone made, the happier they were it was online. The more someone made, the more likely they were to threaten me somehow, either physically or through the courts. But I’m very large so the physical threats never went anywhere once they saw me. But they would occasionally show up in the lobby. They were always very small.
JA: What are some of the data journalism projects that you most admire? What are some of your favorites that are just out there in the world?
DW: Out there in the world, usually it’s stuff involving police. Arrests, police use of force, stuff like that. Most of them come out of Chicago for obvious reasons. There’s a lot of them and I like them all. They all come at it from different angles. What Chicago media has done has just been amazing.
JA: I’m sure you’ve seen Settling for Misconduct by the Chicago Reporter on police lawsuits and settlements. Amazing work.
DW: Yes – right. It’s very fertile ground. It does a public service and professionally, I look at how they display something that’s this complicated and controversial, and they do it in a very simple and very objective way.
JA: You just said a couple things like simplicity, creating this transparency. What are the things that make a data journalism project a success on the presentation side?
DW: On the presentation side… that’s the hardest part of the whole thing. For me, the math is difficult in that you need to know the math, but once you know it it’s like riding a bike. It’s just something you know.
The part that’s actually really difficult, and the part that has to be done every time individually, is to take this ridiculously complicated information, and put just what you need to to tell the story, so you don’t overwhelm people with numbers. How you sort it right off the bat is crucial to people’s understanding.
There are psychological, subconscious factors. You need to see something you recognize up top. It needs to be the newsiest and also most impactful. This is just tables of data. This isn’t even getting into graphical visualizations.
JA: What about on the data analysis side? What do you think makes a data journalism project a success? Maybe it’s data analysis or maybe just how the project starts? Where does the data come from? Is it getting the data? Is it then munging it? Is it combining different datasets?
DW: Really the first step is, it should be a dataset that one way or another has not really been done to death. Public datasets exist, and there’s a lot of them, and most of them have been excessively scrutinized. Like you said, if you can join two of them, if you can find a common field and merge some stuff, if you can make the dataset novel in some way, there’s going to be a lot more there for you to find.
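As a rough illustration of the join Willis describes, here is a minimal Python sketch that merges two small datasets on a common field. The agency names, column names, and figures are invented for the example and are not from his project; the normalization step is an assumption about the kind of cleanup a shared field usually needs.

```python
import csv
import io

# Hypothetical sample data: two small datasets that share an "agency" field.
SALARIES_CSV = """agency,total_payroll
Water District,1200000
City of Exampleville,5400000
Transit Agency,9800000
"""

HEADCOUNT_CSV = """agency,employees
water district,15
City of Exampleville ,60
Parks Department,22
"""


def normalize_key(value):
    """Collapse whitespace and case so trivial formatting differences still match."""
    return " ".join(value.split()).lower()


def load_by_key(text, key_field):
    """Read CSV text into a dict keyed by the normalized join field."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        rows[normalize_key(row[key_field])] = row
    return rows


def inner_join(left, right):
    """Keep only keys present in both datasets and merge their columns."""
    return {key: {**left[key], **right[key]} for key in left.keys() & right.keys()}


salaries = load_by_key(SALARIES_CSV, "agency")
headcount = load_by_key(HEADCOUNT_CSV, "agency")
joined = inner_join(salaries, headcount)

# A derived figure that neither dataset contains on its own.
payroll_per_employee = {
    key: int(row["total_payroll"]) / int(row["employees"])
    for key, row in joined.items()
}

# Rows that failed to match are worth a look too: they may be typos or real gaps.
unmatched = sorted((salaries.keys() | headcount.keys()) - joined.keys())
```

The unmatched list is kept deliberately. In practice a failed match is as likely to be a spelling difference as a genuinely missing record, so it is usually checked by hand.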
The other thing that is absolutely crucial, which seems a little bit counterintuitive, is coming into it with an open mind. If you go into a dataset looking to prove a point, you’re going to be able to prove that point, because: lies, damned lies, and statistics. You can use numbers to prove anything if you want to.
The hard part is coming into it blind, letting the numbers talk to you. Can you tell I went to Santa Cruz? This is how I learned. My professor was named Dr. Carlos. You should meet him. You let the numbers talk to you. When I’m working with salaries, I’ll strip off the names. I’ll assign numbers to different job titles because I don’t want to know when I’m searching. I want it to be cold, hard, objective numbers.
I can come up with the analysis later. If you’re looking at test scores at a school, leave the race, socioeconomics in numerical form, just so you don’t see what you’re expecting to see. You run all the analytics, you run all the numbers, you run all the sorts, you find the patterns in raw, cold numbers, then you go back and figure out what those patterns are in English and you translate them into words and you let that tell you the story.
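The blinding step described above might look something like the following Python sketch. It is one possible approach under stated assumptions, not Willis’s actual workflow: the records are invented, and the function name `blind` and the layout of the lookup key are hypothetical.

```python
import random

# Hypothetical records, invented for illustration only.
records = [
    {"name": "A. Example", "title": "City Manager", "pay": 310000},
    {"name": "B. Sample", "title": "Maintenance Worker", "pay": 52000},
    {"name": "C. Placeholder", "title": "Teacher", "pay": 61000},
    {"name": "D. Specimen", "title": "Maintenance Worker", "pay": 49000},
]


def blind(rows, seed=None):
    """Return (blinded_rows, key); the key maps codes back to the real labels."""
    rng = random.Random(seed)
    titles = sorted({row["title"] for row in rows})
    rng.shuffle(titles)  # shuffled so the code order gives no hint about the title
    title_codes = {title: code for code, title in enumerate(titles, start=1)}

    blinded = []
    key = {"people": {}, "titles": {}}
    for person_id, row in enumerate(rows, start=1):
        blinded.append({
            "person_id": person_id,
            "title_code": title_codes[row["title"]],
            "pay": row["pay"],
        })
        key["people"][person_id] = row["name"]
    key["titles"] = {code: title for title, code in title_codes.items()}
    return blinded, key


blinded_rows, key = blind(records, seed=0)

# The analysis runs on codes only: average pay per title code.
pay_by_code = {}
for row in blinded_rows:
    pay_by_code.setdefault(row["title_code"], []).append(row["pay"])
average_by_code = {code: sum(pays) / len(pays) for code, pays in pay_by_code.items()}

# Only afterwards is the key consulted to translate the codes back into words.
average_by_title = {key["titles"][code]: avg for code, avg in average_by_code.items()}
```

Keeping the key in a separate object, or a separate file, is what makes this work: the analysis pass never touches a name or a title.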
JA: Interesting. You’re actually talking about removing your own biases basically by reducing it to the raw numbers so you can see whatever patterns exist.
DW: Right. If you’re using R, even if you’re just sorting it in Excel, most people will see patterns in numbers. That’s not to say you should always do that. I’m talking more about the large-scale analysis, like I said, R, regressions, sorting, breaking things down into tiers, whatever you want to do to try to find some pattern within the numbers, just find it within the numbers themselves.
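Willis mentions R for this kind of work. As a hedged stand-in, here is a small Python sketch of two of the techniques he names: a simple least-squares regression and breaking values into tiers. The numbers are invented, and the helper names `least_squares` and `tier_of` are hypothetical.

```python
from statistics import mean, quantiles

# Hypothetical paired observations, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]


def least_squares(x, y):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def tier_of(value, cut_points):
    """Return a 1-based tier: one more than the number of cut points below value."""
    return 1 + sum(value > cut for cut in cut_points)


slope, intercept = least_squares(xs, ys)

# Quartile cut points split the values into four tiers.
cuts = quantiles(ys, n=4)
tiers = [tier_of(y, cuts) for y in ys]
```

A slope near 2 here only says the invented `ys` rise by about two units per unit of `xs`. Whether a pattern like that means anything is the reporting question he goes on to describe.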
One of three things will happen. Either common sense and common knowledge is proven true: cities with a higher African American population have more incidents of police use of force. Everyone knows that but if the numbers can prove it without you actually knowing what the races are first, that adds some weight to it. Stories about confirming assumptions always play very well with people.
The other thing is that the common assumption is completely wrong, which is interesting for the same reason: the numbers, coming in completely cold and objective, say that what everyone assumes doesn’t exist whatsoever. Sometimes they say the opposite is true. That’s super interesting.
And third, which is my personal favorite, is you find a pattern that no one had even thought to look for before. It’s like a whistleblower coming to you. It’s really rare that this happens, but when it does you get Pulitzers in your eyes.
Those are the really fun stories because then you get to do detective work and you get to do old-fashioned request for records, boots on the ground, do some interviews, journalism to try to figure out what’s going on there. Because once you see the pattern, and it exists, that is a given, and then you just have to explain it.
“Today’s corrupt city councilman is tomorrow’s corrupt state senator, is eventually a corrupt senator. If you can’t get anything from the Feds, at least make sure your city’s clean.”
JA: So this is all interesting, and exciting at least to me. But how does snooping for patterns in data – which in itself I think most people would say sounds cool – how does that translate to practical tools and techniques, math, statistics, programming languages, that a person needs to know to do data journalism?
DW: If you really want to get into it, Google “econometrics,” which is not a word most people know. It’s been around forever. It’s a field of economics that’s basically high-level statistics. It’s not scary like the statistics everyone failed in high school or community college. It is basically math-defined patterns in numbers. Predictive, it’s super basic algorithm design.
It’s this whole field basically for stock markets and businesses. It’s what you use to project the profits five, ten years down the line. It’s what mutual funds and hedge fund managers used to predict stock performance – tools used to predict the flow of money in an economy. A number is a number, whether it’s a dollar or a test score, the math doesn’t care. You learn that and then apply it to anything you can quantify numerically.
Once you get into that you’re also going to learn R, which is a piece of software that replaced an expensive standard from way back in my day. I wish R existed when I was in college; it would have saved me a boatload of money getting my own license for Stata. Then from there, data journalism is increasingly broad. We generate so much data these days that everyone’s personal experience and interests lead them into one field or another.
JA: Yeah, one of the evocative phrases I’ve heard to describe that is “digital exhaust.” Basically all of us with our cell phones, each individual person creates a trail of data as we move through the world.
DW: Yeah, and a lot of this stuff, even from a government standpoint, a lot of the stuff that used to just be on paper, for convenience’s sake they put in a database and then all of a sudden that database falls under FOIA or the Public Records Act, and that’s data we can analyze now. It’s a massive amount that you can delve into, and each arena requires its own skills, its own set of knowledge. Once you learn the basics of how to look at numbers you’ll start to figure out there’s certain things you need to know to be able to do stuff within that area specifically.
JA: Taking a critical eye for a second, what do you think are some of the limitations of data journalism? Where does good old fashioned reporting do better than this new-fangled data stuff, or are there blind spots that data journalism can introduce?
DW: Data journalism, traditional journalism, I don’t think those are two different things. I think those are very complementary things. Neither one is useful without the other, in my opinion, in today’s world, because there’s been a crackdown on whistleblowers, there’s been a clampdown on leaks for years now, and that’s only going to get more so. You’re not going to get sources anymore in these high-level agencies. As of yesterday, the EPA and the FDA can’t release scientific studies anymore, so you can’t rely on the old ways of getting sources within things. Fortunately, data still exists. Either scraped data, requested data, whatever.
Data analytics is the new sourcing. It’s the new boots on the ground, it’s the new poring through records at a library, looking at microfiche, leafing through police reports. It’s where the stories come from. It’s the news mine that you get this ore out of. Once you find that story, though, once you find that pattern, once you find that anomaly, it takes journalists, traditional journalists, to do the reporting. Just because Deep Throat in a parking garage says they broke in at Watergate, that’s not news in itself. That’s a tip. There still needed to be reporting done, writing done, in order for that to become news.
You find a pattern in data, that’s not the news. That’s the news tip. If you give this to an editor who assigns it, one way or another, that traditional journalist needs to take that and run with it and confirm it and interview and contextualize and do everything that journalists do.
Then you take it to data journalists in the presentation sense who will make a graph, an infographic, a chart, whatever, to show your work.
They have to work together. One or the other, just data, is not interesting.
JA: Right, it’s not a substitute for storytelling, for interpreting.
DW: Right. And storytelling without the proof, no one is going to believe you. Fake news, it either is fake news or it’s accused of it because it’s something you don’t agree with. You need the data to prove what you’re saying. We need each other, data journalists and traditional journalists.
JA: I’m going to wind this up with one more question. What do you think are some of the most urgent areas for data journalism that have impact in the current political environment? Where is it most needed?
DW: Literally right now? Like today? Is scraping and archiving all public data that currently exists, because it’s already coming down. This is the third business day of the administration and they’re already pulling down information. We need to have the information in order to analyze it, at least historically, to provide context for anything we do get in the future. Assuming anyone hears this and there’s still some data up, download it. Make as many mirrors as possible. Hoard everything.
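For readers who want to act on that advice, a mirror can be as simple as saving each file under its checksum and logging where it came from. The Python sketch below is one way to do that, not a method Willis describes; the function names, the `manifest.jsonl` file, and the example URL are all hypothetical.

```python
import hashlib
import json
import tempfile
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def fetch(url, timeout=30):
    """Download the raw bytes at url (defined for completeness; not called here)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def archive_bytes(data, source_url, archive_dir):
    """Store data in a file named by its SHA-256 and append a manifest entry."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(data).hexdigest()
    (archive_dir / digest).write_bytes(data)

    entry = {
        "source_url": source_url,
        "sha256": digest,
        "bytes": len(data),
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
    }
    with open(archive_dir / "manifest.jsonl", "a", encoding="utf-8") as manifest:
        manifest.write(json.dumps(entry) + "\n")
    return entry


# Usage with stand-in bytes and a placeholder URL instead of a live download.
demo_dir = Path(tempfile.mkdtemp())
sample = b"agency,total\nExample,1\n"
entry = archive_bytes(sample, "https://example.org/data.csv", demo_dir)
```

Naming files by their hash means an identical download overwrites itself rather than piling up duplicates, and the manifest records when and from where each copy was taken.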
From there, objective impact of things that are happening. It’s hard in this climate because they’re not going to be releasing a lot of data that you can use, but FOIA still does exist, and stuff is going to come out, and there’s still sources for this stuff. There’s always use for contextualization because there are always people who want to know how many people are affected by policy and how much money something is costing, accountability.
Data is trickier these days. If I were still at a daily paper I would redouble my efforts. I would focus locally a lot more just because as much as everyone wants to cover what’s happening nationally, there’s not a lot you can do when they’re already clamping down on the flow of information.
That said, today’s corrupt city councilman is tomorrow’s corrupt state senator, is eventually a corrupt senator. If you can’t get anything from the feds, at least make sure your city’s clean.
You can find Daniel J. Willis on Twitter at @BayAreaData