Filling in the blanks on DocumentCloud

Back in November, some folks from The New York Times and ProPublica filed an ambitious grant proposal in the Knight News Challenge competition. It asks for $1 million to fund DocumentCloud, a solution that would apply the wisdom of the crowd to the problem of organizing and examining documents.

The much-buzzed-about idea aims to develop open standards and APIs to make source documents “easy to find, share, read and collaborate on.” (You can find the full text of the proposal here.)

I asked three of the proposal’s authors, Aron Pilhofer of the Times and Scott Klein and Eric Umansky of ProPublica, to elaborate on their vision for document nirvana.


Can anyone add documents to the repository, or is it necessary to be a news organization? Any concerns over the possibility of forged documents being uploaded?

Aron Pilhofer: The repository will be open for anyone to read from, but not to contribute to. It will be limited to news organizations, bloggers and watchdog groups whose mission includes publishing source documents as a means of better informing the public about issues of the day. That said, the software that makes DocumentCloud go will itself be open source, and available for anyone to use. So, if others want to create DocumentClouds of their own, they can certainly do that.

Scott Klein: We don’t want DocumentCloud to become a generic repository for all documents, or a quick-and-dirty way to host PDFs. We want somebody to have found these documents to be of news value.

Presumably, DocumentCloud will not be branded with the NYT and ProPublica logos front and center. Would it be staffed and maintained as a separate entity?

AP: There is so much misinformation out there on this question, so I’m glad you asked. In fact, that is what we are asking Knight to fund: the creation of a completely independent entity called DocumentCloud. So the answer, of course, is: It won’t have any NYT or ProPublica branding.

Though we’ve just started to talk about structure and such, it’s entirely possible the only connection the Times, at least, has to DocumentCloud once it’s up and running is as a user and contributor.

SK: Same with ProPublica. Although I suspect somebody from both the Times and ProPublica will be part of the board for DocumentCloud, it’s important to note that this is going to be completely separate from both organizations and shouldn’t monetarily benefit either.

What is the nature of the collaboration between the Times and ProPublica? How will the work on this project be divided?

AP: TBD, but probably I will focus more on the technology side because the Times is contributing a large amount of the software and I understand that part best.

SK: I think we’ll each do a bit of everything but the plan is for the grant to fund developers, so the bulk of the development work won’t need dividing.

The Knight grants come with strings attached (namely, the requirement that projects be open-source) that might turn off some for-profit companies. Aron, how did you sell your bosses on the idea of applying for this? And, as a for-profit company, how would the NYT benefit from this grant?

AP: There’s a bit of misinformation out there about the role of The New York Times in this project, so maybe I should clarify this a bit more.

The grant is not for The New York Times, so the question of strings and for-profits just isn’t relevant. The Times won’t be involved in any way except as a founding participant and donor to the project (contributing my time and a significant chunk of software).

The grant would be used to create an independent, non-profit organization called DocumentCloud, which would manage the grant, build and maintain the software and so forth. Given the intensely competitive nature of the news business, we reckoned that this project had to be in the hands of an independent, impartial broker in order for a consortium like this to work.

DocumentCloud hasn’t been a hard sell because we’re not asking anyone to do anything they aren’t already doing. We (like most media organizations) are already posting source documents online, just not in a way they can be easily searched, cataloged or shared.

If things go well, everyone will benefit because, finally, there will be open standards and open-source technologies available to make that happen. And even if it fails utterly and completely, DocumentCloud will still provide new tools to make publishing documents online easier, faster and more accessible for everyone.

If the proposal is approved, will DocumentCloud be developed in-house, or will you hire outside developers (or both)?

AP: Development will be done entirely by DocumentCloud developers (see above). Part of the grant funding is to support a dedicated development team.

SK: One tidbit that I don’t think we’ve shared widely is that DocumentCloud is designed to live in the cloud (get it?), so we plan to use Amazon’s EC2 and S3 infrastructure very extensively. I know Aron’s toying with releasing the DocumentViewer as an EC2 AMI to make it really easy for news orgs to use it without worrying about their content management system or IT people at all.

Seems to me that one of the biggest differences between the DocumentCloud idea and existing document-viewing systems (Docstoc, Scribd, etc.) is the provision to OCR each document, which will allow people to search within documents and to link to and annotate specific passages. Any thoughts on how the OCR part will work?

AP: We outline some of the differences in our latest grant application, but this is really quite a bit more of an apples/oranges comparison than you may realize.

DocumentCloud isn’t a viewer; it’s a standard, and a web service. It’s a system that allows anyone to make documents sharable regardless of what platform they’re on or where they’re hosted.

Scribd is similar in that users can upload documents and make them public. Within Scribd, registered users can comment on those documents, link to them, search them, etc. But everything has to happen within the Scribd environment.

DocumentCloud takes that idea a step further and removes the barriers. It allows users to search, link to and comment on documents regardless of where they are housed, or what platform they are sitting on. All we will ask is that those who are contributing documents do so in a standardized format.

So, Scribd or Docstoc could, in theory, adopt the standard and enable their users to contribute to DocumentCloud, and we hope they do.

I think some of the confusion on this point is of our own making because of the DocumentViewer portion of the project. The viewer is (or will be) nothing more than an off-the-shelf, completely open-source implementation of that standard. But DocumentCloud will be completely agnostic in this regard. If Scribd or Docstoc (or The Smoking Gun) want to create their own compatible viewer, they are completely welcome to do so.

The reason we included the viewer in the grant application (and there was a lot of discussion internally about this) is that a key part of this project is lowering the barriers to participation. Many organizations don’t have the capability of developing their own software for viewing documents or integrating them with DocumentCloud, so we felt the viewer was an important part of the project and kept it in.

SK: Aron’s making a key point here: This isn’t competitive with Docstoc or Scribd, and isn’t even meant to replace a simple list of PDFs if that’s what you want to use. DocumentCloud is a way to organize all of these disparate ways of storing digitized source documents in a way that makes them maximally useful to “reporters” (counting, of course, traditional newsroom reporters as well as bloggers, academic researchers, etc.). Frankly, DocumentViewer is, for a news organization presenting complex document collections, a really great user experience, but it’s not required to be part of DocumentCloud.

Will DocumentViewer be released to the public even if the DocumentCloud proposal isn’t funded? Is there a timeline for that?

AP: Yes, but there’s no specific timeline right now. We’re working on it in between other, more deadline-specific projects. My best guess right now is that we’ll have something releasable in the late spring. That’s about as specific as I can get right now.

What organizations are you soliciting source documents from? I think Eric mentioned the National Security Archive; anywhere else?

AP: None yet. We have talked to a limited number of groups (Gotham Gazette and, yes, the National Security Archive and possibly others) to partner with us on the development of the project. But we’re not actively soliciting documents at this point.

SK: We’ve got a fairly extensive wish list of news organizations and nonprofit groups we want to bring in on the project (none of whom would surprise you I think), and we’ve talked with some folks very informally but all of our discussions have been like “save the date” cards as opposed to wedding invitations, if you get my meaning.

Eric Umansky: As Aron and Scott have said, we’re just at the beginning of this and have just had initial discussions with a few groups. Having said that, we have been in touch with the NSA (the private, non-profit one) and are particularly excited about working with them since they are really among the best in the biz at cataloging and archiving government source documents.

Are there certain kinds of documents that you think will be particularly well-suited to perusal and annotation using DocumentCloud?

EU: Honestly, I’m not really sure. Like the best parts of the Web, what we’re trying to do is build an infrastructure that will support and encourage intelligent contributions. So, not to get all web doe-eyed about it, but the very utility of it is that people will have the ability and interest to submit documents beyond the ones we’re already aware of.

How do you plan to surface the most interesting stuff from within this potentially vast database? Will there be a blog or a recent highlights list of some kind? Will you take some pop-culture cues from The Smoking Gun?

AP: We’re hopeful that users will surface this stuff, and we won’t have to. We have not talked about whether we’ll have a blog or highlights — or even if DocumentCloud itself will have a web presence outside the APIs. It’s just not something we’ve decided yet.

SK: We’re laying the foundation for the great work of others, and have very little interest in applying our own editorial judgment on what people post, assuming two things: 1) people follow whatever rules we come up with (like don’t post inappropriate things, etc.), and 2) they themselves apply editorial judgment to what they upload. I think it’s impossible to predict what kinds of stories this will help tell, and I find that really exciting.

EU: I agree with Scott and Aron. We’re really at too early a stage to have a concrete sense of this. And I’m the farthest one here from the software side of this, but one thing we would like to do is build a kind of reader loop into the system. So, not only could you sort by the “most read” documents but you could also sort specific pages that way. For example, if you had a 500-page report that had juicy bits buried on pg. 432, the “crowd” would eventually point you there since it would be flagged and become the most popular page.
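The “reader loop” Umansky describes can be pictured as a simple tally of page views. The sketch below is only an illustration under stated assumptions: the view log, the document IDs and the function names are hypothetical and are not drawn from the DocumentCloud proposal or any DocumentCloud software.

```python
from collections import Counter

# Hypothetical view log: one (document_id, page_number) pair per page view.
# None of these names come from DocumentCloud itself.
views = [
    ("report-a", 432), ("report-a", 432), ("report-a", 1),
    ("memo-b", 2), ("report-a", 432), ("memo-b", 2),
]


def rank_pages(view_log, top_n=3):
    """Return the most-viewed (document, page) pairs with their view counts."""
    return Counter(view_log).most_common(top_n)


def rank_documents(view_log, top_n=3):
    """Return the most-viewed documents, counting views of any page."""
    doc_counts = Counter(doc_id for doc_id, _page in view_log)
    return doc_counts.most_common(top_n)


print(rank_pages(views, top_n=2))
print(rank_documents(views, top_n=1))
```

With this sample log, page 432 of the hypothetical “report-a” rises to the top of the page ranking, which is the effect Umansky has in mind for a juicy passage buried deep in a long report.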

Any updates on the News Challenge judging process? Do you know if you’re in the “top 50”?

AP: No idea.

SK: All we know is that we’ve passed the first of four rounds of scrutiny, as have some other really great ideas.

Winners will be announced in the fall, according to the Knight News Challenge site.

Using games to help readers understand the news

With more journalistic sites using games as an interactive way to package content, a $250,000 grant from the Knight Foundation’s News Challenge contest will help one nonprofit news site take these games to the next level.

A pioneer in this format, The Gotham Gazette has featured games about New York City policy issues that are an effective and entertaining way for users to weigh decisions and deal with consequences.

Online Journalism Review spoke to Gotham Gazette Editor-in-Chief Gail Robinson about what makes a successful game and why they work well for journalistic sites. Proving good games can be built on a modest budget, Robinson discussed why simplicity works but dumbing down doesn’t.

Online Journalism Review: How did you first become interested in utilizing games at the Gotham Gazette?

Robinson: In 2002 there were a lot of discussions about what to do with the World Trade Center site, so we created a game [Ground Zero Planner] to let people try to envision what they wanted the site to look like, and we got quite a good response.

We’re very focused on New York City policy, and we try to make the material accessible and interesting to people, not just to policy wonks or people who work for city government or bureaus. So our games [become] almost a story set to a game.

OJR: How do you actually conceptualize and build these games?

Robinson: As the editor-in-chief, I’ll be involved and we have a technical director and a design director. We don’t have an illustrator on staff and we’ll probably get [a freelancer] to do the technical work. But probably the writing and content will all be done in house.

OJR: How involved are the journalists on staff in the creative process?

Robinson: In the past we were very involved. [For example] The Budget Game sort of jumped out at us. The city was having a lot of problems after 9/11, so we thought it would be good to dramatize that by letting people make choices with the caveat that because the city was legally required to balance the budget, you couldn’t play the game unless you balanced it.

There were other similar games, so we did a lot of research and played a lot of other games. And then we came up with assignments and writers were assigned to various aspects. I’ve written a lot about education so [I researched] how much would it cost for x number of teachers.

OJR: What kind of content works well when it’s incorporated in this game format?

Robinson: Almost anything can work with a game if you have an intelligent way of fleshing it out. I think it’s important to not be too complicated. That doesn’t mean you can’t have people making lots of choices, or you can’t have graphics and animation. But I look at some games where I feel like they’re asking me to do too many things, to play too many roles.

OJR: You do have a consistent thread of simplicity that runs throughout your games.

Robinson: What we tried to do was create something simple that would show people the story but would still be fun to play. I think you get a lot of that enjoyment partly through the animation and the way you present material.

The infrastructure game called Breakdown is basically a glorified quiz. But we had a wonderful clip of animation showing ways that New York was going to crumble under its own weight. And my son who was then 11 (who I don’t think has a lot of interest in New York City infrastructure) loved that animation and played the game several times and then he showed it to his friends. I think that indicates how you can build something straightforward and still make it a lot of fun.

OJR: Can games stand alone as a good storytelling technique or are they best purposed as part of a package?

Robinson: I think they can stand alone. For example, someone can make a decision about something like how to build an affordable housing project in New York. Just by playing the game, the user would probably learn about some of the tradeoffs and then could click on things for more information.

In our case the story is sort of behind the game, and it can be incorporated into the game itself or it could [stem from] a separate article. We’ve actually done both here. The Judges Game [was inspired by] the big probe of whether the bench is basically bought and sold. It had actually started out as an article and then we built the game.

OJR: The games on your site are effective because they help users to understand the consequences of their decisions.

Robinson: Right, that’s what we’re hoping for. That was a big thing with the budget game. People say I don’t have a cop on my corner and why is my child in a class with 20 students and why are my taxes so high? And this is a really good way [to illustrate that] because you see the money go up or down. You see what things cost to make it clear that you couldn’t have both really low taxes and pay for really tiny classes.
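The rule Robinson describes for the budget game (the player watches the money go up or down and cannot finish until the budget balances) comes down to a small piece of arithmetic. The sketch below is a rough illustration, not Gotham Gazette’s actual code: the category names and dollar figures are invented, and “balanced” is assumed to mean that spending does not exceed revenue.

```python
# Hypothetical figures in millions of dollars; not real New York City numbers.
revenues = {"property_tax": 500, "income_tax": 300}
spending = {"teachers": 450, "police": 400}


def budget_gap(revenues, spending):
    """Return revenue minus spending; a negative value is a deficit."""
    return sum(revenues.values()) - sum(spending.values())


def is_balanced(revenues, spending):
    """Assumed rule: the budget balances when spending does not exceed revenue."""
    return budget_gap(revenues, spending) >= 0


print(budget_gap(revenues, spending), is_balanced(revenues, spending))

# One way a player might close the gap: trim teacher spending by 50.
trimmed = dict(spending, teachers=400)
print(budget_gap(revenues, trimmed), is_balanced(revenues, trimmed))
```

The second scenario balances only because something was cut, which is the tradeoff between low taxes and small classes that the game is meant to make visible.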

OJR: Do users expect to win when they play games? What kind of reward do they expect aside from obtaining information?

Robinson: We haven’t had winning in these games. For example there’s obviously not a right way to plan Ground Zero, and if there is one the city still hasn’t discovered it. As for winners and losers, my sense is we would like to try both models and determine what people prefer. Part of the Knight project (in general) is to get information out there that other people can use.

On games where people don’t win we hope we’re offering an educational tool. We’re also hoping to get answers back from the readers that we will share with decision makers in the city and [incorporate the responses] into articles.

OJR: From your standpoint what are the technical challenges of building a news game?

Robinson: Knight wants everything to be open source here and that’s probably our biggest challenge. Most games are done in Flash and we can’t use Flash.

OJR: What are some of the games you’re considering now?

Robinson: All the games are pretty tentative at this point because we’ve always let the news dictate the games to some extent. We’ve always had a news peg on the games.

One of the games we’re considering is related to garbage in New York. It’s an endless issue here and it’s one of those situations where there’s no ideal wonderful solution.

In the course of this grant there will be two important political campaigns, one being the presidential and congressional race. Then as the grant ends in 2009 we’ll be right in the middle of electing a mayor, so we imagine we’d somehow want to address that.

OJR: Have you learned anything about what doesn’t work with these games?

Robinson: I think they do have to be clear. I think we have one game that didn’t work–The NYC Preservation Game–although I’m not sure all my colleagues agree with me. I think we could never really decide what exactly we wanted to do with it. We could never figure out if it was a quiz where you’re trying to decide what makes a building a landmark or if you’re playing landmark commissioner.

So it just seems to be that the game has to be well designed and have a clear purpose, whether you’re playing a role or making decisions.

OJR: How do you strike the balance between entertaining and delivering the news?

Robinson: I think you can do both [if] you keep information very solid. Don’t talk down to someone just because it’s a game. You can put people in interesting, genuinely challenging situations.

Also I think the visuals on these games are enormously important. You’re not debasing the information if you have really clever animation. You’re just engaging people in another way. If you put a really ripping, entertaining lead on a news feature you’re going to pull people into the news feature who might not normally want to read about that subject, and it certainly doesn’t downgrade or dumb down the information that follows.

OJR: How can indie web publishers add a game element to their site if they lack the budget and have technical constraints?

Robinson: That’s one thing I think that Knight is hoping we’ll come up with ways to do. [All the grant winners] are going to be writing, blogging and sharing ideas with each other about that. I assume the plan is to make those ideas available to people. I hope people can learn from what we did right and also learn from our mistakes.

What makes a winning news website?

To an online video game that recreates a once-vibrant jazz scene in Oakland, California and an MIT think tank project designed to facilitate widespread community news online, the John S. and James L. Knight Foundation recently awarded almost $5 million (and pledged a total of $12 million) to fund digital news projects. The winners of this Knight News Challenge met the following criteria: their projects incorporated digital media, involved community news experiments and used open source software.

Describing itself as a national foundation with local roots, the Knight Foundation has pledged an additional $5 million for the next four years to continue to award such community-based digital innovations.

OJR talked with Gary Kebbel, journalism program officer for the Knight Foundation, to find out what distinguished the winners, the implications for the future of digital media and why, if you have an idea or project and 20 minutes to complete an application, you could be among the next series of winners.

Online Journalism Review: How did this News Challenge come about?

Gary Kebbel: It came about because the president of the foundation, Alberto Ibargüen, is the former publisher of the Miami Herald, and in that role, he tried to put the paper on the Web. He analogizes that to trying to make a movie out of a book. Unless you’re native and taking full advantage of what the medium has to offer, trying to transfer information from one medium to another doesn’t always work.

He also thought that declining circulation and advertising in newspapers carried implications far beyond just lack of readership. Newspapers helped to identify what it meant to be a Miamian or a Philadelphian, and they helped identify problems and brought the community together. So [how] can this community organizing be done in cyberspace?

OJR: How is the amount of each grant determined?

Kebbel: Every individual determines what they feel their project would need. As the process went farther along, we would ask more specific questions and eventually the [finalists] created a line item budget.

We didn’t think it was necessary to make people go through that in the early stages. We wanted it to be easy to apply. If you know what idea you’re proposing, the application process takes about 20 minutes.

OJR: The main winner, the Center for Future Civic Media, was awarded $5 million. Their goal seems to be broader than any of the others, so what will be the tangible result of their project?

Kebbel: The MIT Media Lab will come together with the studies of sociology, psychology, political and cultural science to develop new processes [for gathering community news] and [assess] what mediums and technologies they can bring to solve those problems.

OJR: So it’s basically trying to catch up these communities with new technology?

Kebbel: Yes, to bring new technology to the community or old technology to the community in ways that people hadn’t thought about before. They’re going to be working with all the communities individually to find out what issues they should be solving.

The [Center for Future Civic Media] will also be hosting all of the other News Challenge winners at MIT for education, discussion and conferences that everyone will attend.

OJR: So these winners will be checking in with each other throughout the course of their project development?

Kebbel: Yes, that’s very important to us–that these winners develop a community of their own. Just because they’re experts doesn’t mean they’re only experts in their particular fields. They’re all experts in digital media so if one of them has a problem in one area, they’ll be able to talk to four others who might have a solution.

OJR: Adrian Holovaty won $1.1 million for his open-source software idea. What stood out about his project?

Kebbel: First of all, it’s an extension of his current [project], but it’s on steroids. It’s going to take every possible public database that makes sense–whether that’s global or regional or national–and combine it in a way where you type in your address and you find out everything going on on your street or in your neighborhood. You can find out where there’s a new school proposal, or where a restaurant is going to be shut down, or if the city has decided to change trash pickup regulations.

OJR: Like Holovaty, the winners already have some momentum behind their projects. Is it crucial for the winners to have already established themselves in some way?

Kebbel: The competition had various categories. One category was for ideas and those are represented in the blog entries. They didn’t have projects underway but they had a great idea that might get underway someday.

OJR: What about these blog entries stood out?

Kebbel: These winners … wanted to share and educate in a particular area, and to create a sense of community in a specific geographic area. The project had to have these elements and anything that was developed as a result of that project had to be open source. That’s one reason why we probably did not have applications from newspapers.

OJR: Rich Gordon (Medill School of Journalism, Northwestern University) has created a project designed to teach students both technology and journalism. Why is it important for the next wave of journalists to be technologically proficient?

Kebbel: We’re not saying that every graduating journalism student should also be an expert in programming. What we are saying is that it would benefit any organization to have people, and in particular computer scientists, who love and understand journalism.

So if an online newspaper wants to create a product, a lot of times they have a great idea, but they don’t have the technical expertise to carry it out. What we’re thinking is that if there were more people with both the journalistic knowledge to understand what makes a good story and the expertise to carry it out then [news] organizations would benefit.

OJR: MTV’s project won $700,000 to fund cub reporters [Knight Digital Youth Journalists] who will cover the election with video spots designed specifically for distribution on cell phones. Is packaging what made this project stand out?

Kebbel: Actually, what made it stand out is that you have this organization that traditionally knows how to reach the youth audience and … with packaging including cell phones, we want them to create a story for themselves and by themselves centered around issues that are important during an election year. I think the project will be enormously important in defining what is of interest to [this audience] and how best to reach them. So one of our goals for them is to learn a lot about the production and how to package stories on mobile phones and media.

OJR: Other projects, including Geoff Dougherty’s, are designed specifically to advance citizen journalism.

Kebbel: With the ChiTownDailyNews, one of the things that’s important is adding to the recruitment and training of journalists. It’s interesting because we haven’t seen that sort of blanket application focused on [so many specific communities].

OJR: Are there any other winning projects that stand out to you?

Kebbel: Yes, one area is games. Three different grants [Gail Robinson/Gotham Gazette, Paul Grabowicz/UC Berkeley and Nora Paul/University of Minnesota] will approach storytelling through games in different ways.

Another is the “incubator centers” [created by Dianne Lynch], which are based out of Ithaca College and include six other academic institutions working together to try to solve journalists’ problems in digital newsrooms. They will bring together [a cross-section that includes] engineering, marketing and journalism students to try to solve a real problem [occurring in a digital newsroom].

OJR: Is there anything missing from this year’s winners that you’d like to see next year?

Kebbel: Our goal for next year is to have more quality applications from young people. And as a result, we are setting aside $500,000 for a special category to award ideas and projects created by young people.

Our other goal is to get more international applications, and as such we are advertising the News Challenge in nine different languages this year.

OJR: Will anything change about the application process?

Kebbel: The application process will be much more open in the coming year. Applicants will have the opportunity to choose to go the open route or a closed route.

If you go the open route, you post your application on the Web and anyone can post comments about your project, and they can rate it. Let’s say your application gets 27 comments, and you decide some are good ideas and incorporate something that would strengthen your application. You can then resubmit an application that incorporates those comments.