Filling in the blanks on DocumentCloud

Back in November, some folks from The New York Times and ProPublica filed an ambitious grant proposal in the Knight News Challenge competition. It asks for $1 million to fund DocumentCloud, a solution that would apply the wisdom of the crowd to the problem of organizing and examining documents.

The much-buzzed-about idea aims to develop open standards and APIs to make source documents “easy to find, share, read and collaborate on.” (You can find the full text of the proposal here.)

I asked three of the proposal’s authors, Aron Pilhofer of the Times and Scott Klein and Eric Umansky of ProPublica, to elaborate on their vision for document nirvana.

+++

Can anyone add documents to the repository, or is it necessary to be a news organization? Any concerns over the possibility of forged documents being uploaded?

Aron Pilhofer: The repository will be open for anyone to read from, but not to contribute to. It will be limited to news organizations, bloggers and watchdog groups whose mission includes publishing source documents as a means of better informing the public about issues of the day. That said, the software that makes DocumentCloud go will itself be open source, and available for anyone to use. So, if others want to create DocumentClouds of their own, they can certainly do that.

Scott Klein: We don’t want DocumentCloud to become a generic repository for all documents, or a quick-and-dirty way to host PDFs. We want somebody to have found these documents to be of news value.

Presumably, DocumentCloud will not be branded with the NYT and ProPublica logos front and center. Would it be staffed and maintained as a separate entity?

AP: There is so much misinformation out there on this question, so I’m glad you asked. In fact, that is what we are asking Knight to fund: the creation of a completely independent entity called DocumentCloud. So the answer, of course, is: It won’t have any NYT or ProPublica branding.

Though we’ve just started to talk about structure and such, it’s entirely possible the only connection the Times, at least, has to DocumentCloud once it’s up and running is as a user and contributor.

SK: Same with ProPublica. Although I suspect somebody from both the Times and ProPublica will be part of the board for DocumentCloud, it’s important to note that this is going to be completely separate from both organizations and shouldn’t monetarily benefit either.

What is the nature of the collaboration between the Times and ProPublica? How will the work on this project be divided?

AP: TBD, but probably I will focus more on the technology side because the Times is contributing a large amount of the software and I understand that part best.

SK: I think we’ll each do a bit of everything but the plan is for the grant to fund developers, so the bulk of the development work won’t need dividing.

The Knight grants come with strings attached (namely, the requirement that projects be open-source) that might turn off some for-profit companies. Aron, how did you sell your bosses on the idea of applying for this? And, as a for-profit company, how would the NYT benefit from this grant?

AP: There’s a bit of misinformation out there about the role of The New York Times in this project, so maybe I should clarify this a bit more.

The grant is not for The New York Times, so the question of strings and for-profits just isn’t relevant. The Times won’t be involved in any way except as a founding participant and donor to the project (contributing my time and a significant chunk of software).

The grant would be used to create an independent, non-profit organization called DocumentCloud, which would manage the grant, build and maintain the software and so forth. Given the intensely competitive nature of the news business, we reckoned that this project had to be in the hands of an independent, impartial broker in order for a consortium like this to work.

DocumentCloud hasn’t been a hard sell because we’re not asking anyone to do anything they aren’t already doing. We (like most media organizations) are already posting source documents online, just not in a way that lets them be easily searched, cataloged or shared.

If things go well, everyone will benefit because, finally, there will be open standards and open-source technologies available to make that happen. And even if it fails utterly and completely, DocumentCloud will still provide new tools to make publishing documents online easier, faster and more accessible for everyone.

If the proposal is approved, will DocumentCloud be developed in-house, or will you hire outside developers (or both)?

AP: Development will be done entirely by DocumentCloud developers (see above). Part of the grant funding is to support a dedicated development team.

SK: One tidbit that I don’t think we’ve shared widely is that DocumentCloud is designed to live in the cloud (get it?), so we plan to use Amazon’s EC2 and S3 infrastructure very extensively. I know Aron’s toying with releasing the DocumentViewer as an EC2 AMI to make it really easy for news orgs to use it without worrying about their content management system or IT people at all.

Seems to me that one of the biggest differences between the DocumentCloud idea and existing document-viewing systems (Docstoc, Scribd, etc.) is the provision to OCR each document, which will allow people to search within documents and to link to and annotate specific passages. Any thoughts on how the OCR part will work?

AP: We outline some of the differences in our latest grant application, but this is really quite a bit more of an apples/oranges comparison than you may realize.

DocumentCloud isn’t a viewer; it’s a standard, and a web service. It’s a system that allows anyone to make documents sharable regardless of what platform they’re on or where they’re hosted.

Scribd is similar in that users can upload documents and make them public. Within Scribd, registered users can comment on those documents, link to them, search them, etc. But everything has to happen within the Scribd environment.

DocumentCloud takes that idea a step further and removes the barriers. It allows users to search, link to and comment on documents regardless of where they are housed, or what platform they are sitting on. All we will ask is that those who are contributing documents do so in a standardized format.

So, Scribd or Docstoc could, in theory, adopt the standard and enable their users to contribute to DocumentCloud, and we hope they do.

I think some of the confusion on this point is of our own making because of the DocumentViewer portion of the project. The viewer is (or will be) nothing more than an off-the-shelf, completely open-source implementation of that standard. But DocumentCloud will be completely agnostic in this regard. If Scribd or Docstoc (or GovernmentDocs.org or The Smoking Gun) want to create their own compatible viewer, they are completely welcome to do so.

The reason we included the viewer in the grant application (and there was a lot of discussion internally about this) is that a key part of this project is lowering the barriers to participation. Many organizations don’t have the capability to develop their own software for viewing documents or integrating them with DocumentCloud, so we felt the viewer was an important part of the project and kept it in.

SK: Aron’s making a key point here: This isn’t competitive with Docstoc or Scribd, and isn’t even meant to replace a simple list of PDFs if that’s what you want to use. DocumentCloud is a way to organize all of these disparate ways of storing digitized source documents in a way that makes them maximally useful to “reporters” (counting, of course, traditional newsroom reporters as well as bloggers, academic researchers, etc.). Frankly, DocumentViewer is, for a news organization presenting complex document collections, a really great user experience, but it’s not required to be part of DocumentCloud.

Will DocumentViewer be released to the public even if the DocumentCloud proposal isn’t funded? Is there a timeline for that?

AP: Yes, but there’s no specific timeline right now. We’re working on it in between other, more deadline-specific projects. My best guess right now is that we’ll have something releasable in the late spring. That’s about as specific as I can get right now.

What organizations are you soliciting source documents from? I think Eric mentioned the National Security Archive; anywhere else?

AP: None yet. We have talked to a limited number of groups (Gotham Gazette and, yes, the National Security Archive and possibly others) to partner with us on the development of the project. But we’re not actively soliciting documents at this point.

SK: We’ve got a fairly extensive wish list of news organizations and nonprofit groups we want to bring in on the project (none of whom would surprise you, I think), and we’ve talked with some folks very informally, but all of our discussions have been like “save the date” cards as opposed to wedding invitations, if you get my meaning.

Eric Umansky: As Aron and Scott have said, we’re just at the beginning of this and have just had initial discussions with a few groups. Having said that, we have been in touch with the NSA (the private, non-profit one) and are particularly excited about working with them since they are really among the best in the biz at cataloging and archiving government source documents.

Are there certain kinds of documents that you think will be particularly well-suited to perusal and annotation using DocumentCloud?

EU: Honestly, I’m not really sure. Like the best parts of the Web, what we’re trying to do is build an infrastructure that will support and encourage intelligent contributions. So, not to get all web doe-eyed about it, but the very utility of it is that people will have the ability and interest to submit documents beyond the ones we’re already aware of.

How do you plan to surface the most interesting stuff from within this potentially vast database? Will there be a blog or a recent highlights list of some kind? Will you take some pop-culture cues from The Smoking Gun?

AP: We’re hopeful that users will surface this stuff, and we won’t have to. We have not talked about whether we’ll have a blog or highlights — or even if DocumentCloud itself will have a web presence outside the APIs. It’s just not something we’ve decided yet.

SK: We’re laying the foundation for the great work of others, and have very little interest in applying our own editorial judgment on what people post, assuming two things: 1) people follow whatever rules we come up with (like don’t post inappropriate things, etc.), and 2) they themselves apply editorial judgment to what they upload. I think it’s impossible to predict what kinds of stories this will help tell, and I find that really exciting.

EU: I agree with Scott and Aron. We’re really at too early a stage to have a concrete sense of this. And I’m the farthest one here from the software side of this, but one thing we would like to do is build a kind of reader loop into the system. So, not only could you sort by the “most read” documents but you could also sort specific pages that way. For example, if you had a 500-page report that had juicy bits buried on pg. 432, the “crowd” would eventually point you there since it would be flagged and become the most popular page.
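(For the technically curious, here is a rough sketch of how the "reader loop" Umansky describes might work. Nothing in it comes from the DocumentCloud proposal: the `rank_pages` function, the document names and the view log are all hypothetical, and a real system would presumably draw on server logs or an analytics store rather than a hard-coded list. The idea is simply to count views per page and sort.)

```python
from collections import Counter


def rank_pages(page_views):
    """Rank (document, page) pairs by how often readers opened them.

    page_views is an iterable of (document_id, page_number) tuples,
    one per page view. This is a hypothetical input format chosen
    for illustration only.
    """
    counts = Counter(page_views)
    # Most-viewed pages first; ties are broken by document and page number.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(doc, page, views) for (doc, page), views in ranked]


# Hypothetical view log: each entry is one reader opening one page.
views = [
    ("report", 432),
    ("report", 1),
    ("report", 432),
    ("memo", 2),
    ("report", 432),
]

for doc, page, count in rank_pages(views):
    print(f"{doc} p. {page}: {count} views")
```

Under these assumptions, page 432 of the long report rises to the top of the list once enough readers land on it, which is the "crowd" effect described above.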

Any updates on the News Challenge judging process? Do you know if you’re in the “top 50”?

AP: No idea.

SK: All we know is that we’ve passed the first of four rounds of scrutiny, as have some other really great ideas.

Winners will be announced in the fall, according to the Knight News Challenge site.

About Eric Ulken

Eric Ulken left his job as editor for interactive technology at the Los Angeles Times in November 2008 to travel and report on trends and best practices in online journalism. He is a 2005 graduate of the communication management M.A. program at USC's Annenberg School for Communication, where he was an editor and producer for OJR and Japan Media Review. He has been a web monkey at newsrooms in six states, including his native Louisiana.

Comments

  1. 1 Million sounds like a lot doesn’t it?