Bright Lights, Big Data

Until about six months ago, when I finally fled the sinking ship of my academic career for the precarious lifeboat of freelance writing, I worked on the top floor of a sleek, contemporary building in the center of Dublin called the Long Room Hub. High and airy, it overlooked the venerable panorama of Trinity College. The building was intended as a home for innovative research across the various disciplines of the arts and humanities, and one of its priorities was to facilitate work in a relatively recent academic enthusiasm known as the digital humanities. My desk in the building came as part of a postdoctoral fellowship I was doing; my project had no connection to anything that could be considered digital, but I was happy to have a place to sit and put my books.

Occasionally, I would be cc’ed on an e-mail asking everyone in the building to provide brief outlines of our research projects so that they could be included in promotional materials for the Long Room Hub, but I consistently managed, without consequence, to avoid answering these. (A lot of the other postgraduate and postdoctoral researchers were working on forbiddingly technical-sounding projects involving things like the “systematic evaluation of archeological digital epistemology” and “digital genetic dossiers.” I was basically just trying to think of clever things to say about the work of John Banville.) When I mentioned to fellow literary academic types where I happened to work—or work from—they tended to suspect that this work of mine had something to do with the digital humanities, and to ask me what the mysterious business was supposed to be about. To this, I usually replied that I wasn’t totally sure, but that I thought it had something to do with using computers to read books. As far as I could tell, there was a general skepticism about the digital humanities, combined with a certain measure of unease—arising, perhaps, from the vague aura of utility, even of outright science, emanating from the discipline, and the sense that this aura was attracting funding that might otherwise have gone to more low-tech humanities projects.

Having read “Uncharted: Big Data as a Lens on Human Culture,” a new book by the scientists Erez Aiden and Jean-Baptiste Michel, I am now experiencing a minor uptick in my understanding of this discipline. And it turns out that it does, on a very basic level, have to do with using computers to read books. Aiden and Michel are the founders of a field they call “culturomics,” in which quantitative analysis is performed on digitized texts to generate empirical data about historical, cultural, and linguistic trends. In 2010, with the patronage of Google, they created something called the Ngram Viewer. If you’re not familiar with it—and bear with me here a moment before you start playing around with it, because it’s addictive enough that you probably won’t be coming back anytime soon—it’s essentially a graphing application that measures, over a set period, the occurrences of a particular word or phrase (in the terminology of computational linguistics, an n-gram) in the thirty million or so volumes that have so far been scanned by Google in the company’s effort to digitize the world’s books. Both Aiden and Michel have backgrounds in biology, and “Uncharted” reveals the extent to which that disciplinary sensibility fed into the creation of culturomics. It began with a curiosity about why the ten most common verbs in the English language are irregular, even though the vast majority of verbs are regular. Their discovery, arrived at through data-mining several centuries’ worth of texts, amounts to a sort of linguistic natural selection: the more frequently an irregular verb is used, the less likely it is to be regularized over time. It was this discovery, and the sheer labor of arriving at it, that led the pair to Google’s vast library of digitized books, and eventually to the creation of the Ngram Viewer.
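The arithmetic behind such a chart is simple enough to sketch. The toy Python below is my own illustration, not the authors’ code: it counts a phrase’s occurrences in a dated corpus, year by year, and divides by the total number of n-grams published that year, which is essentially the relative frequency the Viewer plots.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n across the token stream.
    return zip(*(tokens[i:] for i in range(n)))

def yearly_frequency(corpus, phrase):
    """corpus: iterable of (year, text) pairs; phrase: the n-gram to track."""
    target = tuple(phrase.lower().split())
    n = len(target)
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            totals[year] += 1
            if gram == target:
                hits[year] += 1
    # Relative frequency per year: the quantity the Viewer charts.
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# A two-"book" corpus, purely for demonstration.
books = [(1989, "the square the protest"), (1990, "the quiet year")]
print(yearly_frequency(books, "the"))
```

Google’s version differs in scale, not in kind: the same counting, run over a few hundred billion words.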

Aiden and Michel provide plenty of examples of how well their system works; a disproportionate amount of the book is given over to lengthy explication of various text-mining exercises that the authors have undertaken. One of the more interesting chapters deals with the issue of censorship; they run n-gram searches on figures such as Marc Chagall, Paul Klee, and Max Beckmann—condemned as “degenerate” artists by the Nazis—and provide graphs to illustrate how effectively their fame was suppressed during the years of the Third Reich. (“Between 1936 and 1943,” they write, “Marc Chagall’s full name appears only once in our German book records.”) The graph comparing the frequency of the word “Tiananmen” in English-language books between 1950 and 2000 with its frequency in Chinese books from the same period is a genuinely chilling thing to behold. “In the West,” as they put it, “mentions of Tiananmen soar after the 1989 massacre. In China, there is a transient blip of interest—hardly approaching even 1976 levels—after which things go back to normal.”

As striking as these infographics are in their encapsulations of historical truths, they don’t typically tell us anything that we didn’t already know. And this is true of the book as a whole. The data on censorship, for instance, is embedded deep in a luxuriance of padding. We get stuff about how Helen Keller was “a hero to millions, a symbol of the triumph of the human spirit over adversity” and how “Marcel Proust became famous for writing good books,” which is one of those facts so incontrovertibly true that stating it sounds a mysteriously false note. And a data-mining examination of the history of fame, whereby we learn that Adolf Hitler is the most famous person born in the past two centuries (i.e., mentioned in the most books), leads to the insight that “darkness, too, lurks among the n-grams, and no secret darker than this: Nothing creates fame more efficiently than acts of extreme evil. We live in a world in which the surest route to fame is killing people, and we owe it to one another to think about what that means.” After a while, you begin to suspect that this sort of wan reflection might be compensating for the fact that the data itself reveals little that is new.

The book is mostly entertaining, and its authors are an amiable presence. But the claims that they make about the impact of their work—and the larger impact of big data on the humanities—are imposingly serious. “At its core,” they write, “this big data revolution is about how humans create and preserve a historical record of their activities. Its consequences will transform how we look at ourselves. It will enable the creation of new scopes that make it possible for our society to more effectively probe its own nature. Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower.” We are, in other words, deep in TED territory here, where no innovation can ever be merely useful or profitable, and must always mark something like a turning point in human history. (Unsurprisingly, the authors have presented their work as a TED talk.) And, like a TED talk, “Uncharted” approaches its subject as a personal story, a softball dialectic of problem and solution. There are moments when you might fancy that you can almost hear the sound of a slide clicker. (“To see why we were stuck, we need to take a trip through time to 1930, to a little town in Norway called Kristiansand….”) And there’s more material than I felt I needed about how they managed to get their project funded by Google. (“So we asked [Steven] Pinker: Look, we’ve generated these two billion ngrams—could you help us liberate them? Pinker thought our work had potential to be useful and agreed to come. So Clancy agreed to come, too. We had thirty minutes to make our case.”)

Aiden and Michel are far more entrepreneurial than scholarly in their tone, and their book has about it the breezy self-assurance of salesmanship. They’re selling an invention, bundled with the idea of this invention as an important part of a new era in the study of human culture. And you can see why they would make such grand claims on its behalf; addictiveness aside, it’s a valuable source of information. In mere seconds, for instance, I was able to find out via the Google Ngram Viewer that the term “digital humanities” first started to assert itself in print around 2000, and has been on a steep upward curve ever since. (At least, it was up to 2008, beyond which point the data remain unmined.) And that the terms “modernism,” “modernist,” “postmodernism,” and “postmodernist” all experienced a giddy ascent in the early nineteen-eighties, and have been in a pattern of smoothly inexorable decline since the late nineties. Aiden and Michel’s insistence, however, on portraying the Ngram Viewer, and innovations like it, as part of a Copernican turn in the humanities overstates the extent to which it is anything more than a very useful tool for quantifying cultural and intellectual trends. It’s a new way of gathering information about culture, rather than a new way of thinking about it or of understanding it—things for which we continue to rely on the analog humanities. This, of course, is a fairly standard conclusion when it comes to this topic, and, despite some of their more extravagant claims, it’s one that Aiden and Michel don’t disagree with. But innovations like culturomics continue to gain prominence because they extend the prospect of something like scientific progress in the study of human culture. They offer, in other words, what had up to now been outside the purview of the humanities: results.
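For anyone who wants to make lookups like these programmatically rather than through the charting interface, the Viewer will answer the same queries in JSON. A caveat: that endpoint is unofficial and undocumented, so the sketch below, down to the corpus label, rests on assumptions that Google could invalidate tomorrow.

```python
import requests

def ngram_series(phrase, year_start=1800, year_end=2008, corpus="en-2019"):
    # The corpus label and smoothing parameter mirror the public viewer's
    # query string; both are assumptions, since the endpoint is undocumented.
    resp = requests.get(
        "https://books.google.com/ngrams/json",
        params={
            "content": phrase,
            "year_start": year_start,
            "year_end": year_end,
            "corpus": corpus,
            "smoothing": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Each result pairs an n-gram with a year-by-year relative-frequency list.
    return {r["ngram"]: r["timeseries"] for r in resp.json()}

print(ngram_series("digital humanities", year_start=1990, year_end=2008))
```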

Mark O’Connell is Slate’s books columnist, and a staff writer for The Millions. You can follow him on Twitter @mrkocnnll.


Illustration by Roman Muradov.