NewsML aims for the mainstream

The protocol news agencies use to transmit stories to newspapers and news portals like MSN and Yahoo will get its version 2.0 by year end. Developers of the standard — called NewsML — hope improvements will take it beyond its typical old media sponsors. Critics argue there are better tools to do the job.

XML-based NewsML bundles all story elements — like photos, audio, video and text — together in a virtual “envelope,” including a ton of information that describes the content in a way that a content management system (CMS) can understand.

The practical upshot is that all elements of a story are linked together and a CMS can automatically render, for example, the headline, byline, dateline, photo, intro and hyperlink on a news portal’s front page, and all elements of the story on a separate webpage accessed by the hyperlink. A CMS can even render stories based on priority, or automatically update breaking news stories. It’s got a ton of other useful features too (see below).

But outside its core constituency NewsML remains a little known format languishing in a small tributary of the Web standards mainstream. “NewsML is a niche standard,” one of NewsML’s chief architects, Laurent Le Meur, concedes. “But it is an active one.”

Le Meur, Chairman of the NewsML Architecture (NAR) working group at the International Press Telecommunications Council (IPTC) and head of the Media Lab at Agence France-Presse, is hoping the standard will become a lot more active in version 2.0 and will move from an isolated backwater into the main current of Web standards.

It could be quite a paddle. Some observers remain skeptical about NewsML and its relevance. “To me NewsML is essentially a tagging system,” says Robin Miller, aka Roblimo, editor in chief of the Open Source Technology Group and author of several books about open source applications. “We’re experimenting with an open source CMS called Xaraya that seems to do pretty much the same thing. And I think most open source CMSs are moving toward similar functionality.”

Still, at the moment, NewsML is the standard of news agencies. It is used by almost all the big international agencies, like Reuters, AFP and UPI, as well as about 40 national agencies, like Italy’s ANSA. In Japan it has even become an official Japanese industry standard (JIS X7201), which works like the codes of ISO, the International Organization for Standardization, the official body that decrees the size of threads on a screw or the dimensions of a freight container.

Of the big agencies, only AP doesn’t use it and, according to Le Meur, that’s because there’s no demand for it. “They’ll move to it when there’s a demand among their customers,” says Le Meur.

He adds that it’s not just for agencies, because news portals like MSN and Yahoo use it. “News aggregators are very interested in it, too. I’ve been told that it can cut the time it takes to integrate stories into their databases from weeks to days,” says Le Meur.

NewsML 101

NewsML’s role as a news agency standard began in 1999 when Reuters handed over development to the IPTC, an association that began life as an industry lobby group working towards newspaper access to telecommunications.

Since the 1970s, the IPTC has worked on a series of technical standards for news exchange, such as the IPTC Core, the Information Interchange Model (IIM), the News Interchange Text Format (NITF) and NewsCodes, in addition to NewsML, SportsML, EventML and ProgramGuideML, various XML standards for handling specific types of content. NewsML, however, is the leading standard for news handling.

“In many ways, NewsML was a way to bring XML to the newsroom, where newspapers were often locked into proprietary editorial management systems,” says Michael Steidl, managing director of the IPTC. “You have to remember, many newsrooms were living with technology from the ’80s and early ’90s.”

Slugs in cyberspace

So what can NewsML do?

In its current version, 1.2, NewsML is a model and standard to represent and manage news throughout its lifecycle, including production and interchange. It handles every type of media in the same way. All the elements that make up a story are packaged in a NewsML envelope, so they are clearly associated on the receiving end.

A NewsML envelope might include pictures, text, video and audio in different formats — for example thumbnail and main photo — or text in different languages, or various video or audio formats. Stories can be automatically received and rendered on a Web page, for example, without any human intervention. NewsML tracks versions of a story and enables automatic updates.
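The update behavior can be pictured with a small Python sketch. It is not the NewsML schema: the `story_id` and `revision` fields are hypothetical stand-ins for whatever identifiers a real feed carries, and the rule "keep the highest revision" is an assumption about how a receiving system might act.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoryVersion:
    story_id: str  # hypothetical identifier shared by all versions of one story
    revision: int  # hypothetical counter, higher means newer
    headline: str


def apply_update(store: dict, incoming: StoryVersion) -> bool:
    """Store `incoming` if it is newer than the held version; return True if stored."""
    current = store.get(incoming.story_id)
    if current is not None and current.revision >= incoming.revision:
        return False  # stale or duplicate delivery, keep what we have
    store[incoming.story_id] = incoming
    return True


store = {}
apply_update(store, StoryVersion("quake-001", 1, "Earthquake hits coast"))
apply_update(store, StoryVersion("quake-001", 2, "Earthquake hits coast, dozens hurt"))
apply_update(store, StoryVersion("quake-001", 1, "Earthquake hits coast"))  # late duplicate
print(store["quake-001"].headline)
```

Because the store is keyed by story rather than by delivery, a late or repeated transmission cannot overwrite a newer version.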

All this is achieved by the cloud of metadata that surrounds the story elements themselves. Metadata, or data about data, is essentially a description of all the story elements. So metadata would describe the format of the text, but it would also differentiate the headline from the byline, or the intro from the main body of the text, or the photo caption from other text.

That last part is important. When an editor looks at a story, all the elements are immediately apparent. But to a machine it’s all just text, so it needs to be told that the byline is a byline. With this information it can automatically put the appropriate caption under the right photograph. It can put a teaser or intro on the portal’s main page, but exclude it from the main body of the story. It can drop the byline on the front page, if that’s the website’s policy, but include it with the main story. All this is made possible by metadata.
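A rough Python sketch of that idea follows. The role names and the two page policies are illustrative assumptions, not the official NewsML vocabulary; the point is only that once every element carries a role, page layout becomes a lookup.

```python
# Each story element carries a role label (illustrative names, not NewsML's own).
story = [
    {"role": "headline", "text": "City council approves new budget"},
    {"role": "byline", "text": "By A. Reporter"},
    {"role": "intro", "text": "The council voted 7-2 on Tuesday."},
    {"role": "body", "text": "Full text of the story..."},
    {"role": "caption", "text": "Council members during the vote."},
]

# Hypothetical site policy: no byline on the front page, no intro on the story page.
FRONT_PAGE_ROLES = ("headline", "intro")
STORY_PAGE_ROLES = ("headline", "byline", "body", "caption")


def render(components, wanted_roles):
    """Return the text of the wanted elements, in the order the policy lists them."""
    by_role = {c["role"]: c["text"] for c in components}
    return [by_role[role] for role in wanted_roles if role in by_role]


print(render(story, FRONT_PAGE_ROLES))
print(render(story, STORY_PAGE_ROLES))
```

Changing the site's policy means editing a tuple of roles, not touching the stories themselves.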

In NewsML the metadata itself comes in bewildering variety. There are specific terms to describe a story’s genre (analysis, obituary, feature, opinion), its location, a NewsItem’s role (caption, main, sidebar and so on) and the subject codes, to mention only four types out of a total of 28.

The subject codes are a world unto themselves: “There’s three levels, within NewsCodes, of increasing detail,” says the IPTC’s Steidl. Each level drills down to finer granularity.

Steidl offers the example of a political story. The top layer would identify it as politics, the middle layer would offer terms like local politics, diplomacy, defense and so on. Drill down further through, say, diplomacy, and you get terms like treaties, alliances, and summits and so on. And on and on. All the codes are in a machine-readable format and language independent, which means the same codes work in English or Urdu, Spanish or Swahili.
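Steidl's example can be sketched in Python. The code numbers below are made up for illustration, and the layout (two digits for the top level, three for the middle, three for the detail) is an assumption of this sketch. The labels are a tiny hypothetical table, not the real NewsCodes list.

```python
# Hypothetical codes: 2 digits top level + 3 digits middle level + 3 digits detail.
LABELS = {
    "11000000": {"en": "politics", "es": "política"},
    "11002000": {"en": "diplomacy", "es": "diplomacia"},
    "11002004": {"en": "treaties", "es": "tratados"},
}


def ancestors(code: str) -> list:
    """Return the chain of codes from the top level down to `code` itself."""
    top = code[:2] + "000000"
    middle = code[:5] + "000"
    chain = [top]
    if middle != top:
        chain.append(middle)
    if code != middle:
        chain.append(code)
    return chain


def describe(code: str, lang: str) -> str:
    """Spell out a code's full path in the requested language."""
    return " > ".join(LABELS[c][lang] for c in ancestors(code))


print(describe("11002004", "en"))
print(describe("11002004", "es"))
```

The story itself carries only the number. The words shown to a reader come from a per-language table, which is what makes the codes language independent.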

NewsML’s metadata also provides status details, like “publishable” or “embargoed,” and administrative details, such as acknowledgements or copyright details.
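How a receiving system might act on those status labels can be sketched in Python. The `status` values come from the examples above; the `embargo_until` field and the rule that an embargoed item becomes showable once that time has passed are assumptions made for illustration.

```python
from datetime import datetime, timezone


def may_publish(item: dict, now: datetime) -> bool:
    """Decide whether a CMS may show `item` at time `now`."""
    status = item["status"]
    if status == "publishable":
        return True
    if status == "embargoed":
        release = item.get("embargo_until")  # hypothetical field, may be absent
        return release is not None and now >= release
    return False  # unknown status: stay on the safe side


release_time = datetime(2005, 6, 1, 12, 0, tzinfo=timezone.utc)
wire = [
    {"headline": "Markets close higher", "status": "publishable"},
    {"headline": "Quarterly results", "status": "embargoed", "embargo_until": release_time},
]

before = datetime(2005, 6, 1, 9, 0, tzinfo=timezone.utc)
print([item["headline"] for item in wire if may_publish(item, before)])
```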

“There’s also a Unique ID [UID] for each story, so that when a story disappears from a Web page you could still search for it using its unique number,” says Le Meur, who considers this function a type of ISBN for news.

All of these standards can be used freely. “Essentially, the policy of the IPTC is only to claim the intellectual property for these standards. You can’t claim that NewsCodes, for example, are your work. But we don’t charge any fees or royalties,” says Steidl.

The metadata works like the digital equivalent of slugs, referencing all material related to the story, the story’s evolution and origin. It’s very precise. It’s also very, very elaborate: There are 1,300 subject codes alone, for example, and with other codes like genre and location, NewsML contains a blizzard of information for each story. On the plus side, stories get a common nomenclature to a very fine degree. And the NewsCodes can be adopted independently of the NewsML model, so they can be used with almost any tagging system. On the minus side, it uses too many terms to be practically applied by deadline-panicked reporters.

“The big agencies already have software that can semi-automatically apply the NewsCodes to individual stories. These then simply need to be checked by an editor,” says Le Meur. “In four or five years’ time, I could see a situation where an Application Service Provider [ASP] could offer the same functionality for bloggers, for example.”

Le Meur hopes that, if NewsML adoption becomes widespread, one day search engines will be able to index this metadata to provide very exact results for particular search terms. And, if adopted, he believes NewsML could be very useful to bloggers. “A lot of relevant blog posts, for example, don’t get returned as a search engine result. I hope that NewsML metadata could change that,” he says.

But the skeptics say …

The IPTC’s core constituency is convinced and actively working on NewsML 2.0, but convincing others of the standard’s relevance will be a tougher task.

“I think it’s very cute that these large old-fashioned content producers have discovered what we call tagging, and that they have managed to formalize it in a very cumbersome and rigid way,” says OSTG’s Miller.

This is a common complaint about NewsML. Last year, OJR editor Robert Niles described the NewsML standard as “overkill” in a blog post about the need for new standards to govern distributed online reporting.

“I can do pretty much the same thing with WordPress — a popular, open source blogging program that I use to run my little personal website,” continues Miller. “All I’d have to do is add the exact forms. Because that’s all they’ve done, is come up with standardized tags for photographs, for video, for text and ways to link them together and to update these files as they change. I can do that now, without NewsML and without buying a special program to run NewsML.”

“It is tagging,” concedes NewsML’s Le Meur. “But it’s also a model for representing news. WordPress can do the tagging, but can you open a NewsML document in WordPress, maintain all the metadata and edit it? Right now you can’t.”

But Le Meur believes one day you could. NewsML is a model that can be added to content management systems, and Le Meur hopes it will become a sort of .pdf, a universal standard, for semantically tagged news stories. In the same way Word can open .txt or .rtf as well as .doc files, CMS software can add the NewsML model to handle files in that standard.

One of the primary goals of version 2.0 is to tackle the biggest criticism: excessive complexity. “Right now NewsML is too verbose and too complex a standard. In version 2.0, we’re going to simplify the syntax. We’re also going to offer two levels of compliance: simplified core compliance, and then the power model. It should then be much easier to use,” says Le Meur.

The IPTC also hopes to add news “concepts” to version 2.0 of the standard, so that searches for the term “euthanasia,” for example, would return stories about Terri Schiavo or Dr. Jack Kevorkian, even if the word euthanasia isn’t used in those stories.
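The difference between plain text search and concept search can be shown in a few lines of Python. This is only a sketch: the `concepts` field and the sample stories are hypothetical, and nothing here reflects how version 2.0 will actually encode concepts.

```python
# Hypothetical stories; "concepts" stands in for concept metadata added by an editor or tool.
stories = [
    {
        "headline": "Court rules in Terri Schiavo case",
        "text": "A federal judge declined to order the feeding tube reinserted.",
        "concepts": {"euthanasia", "courts"},
    },
    {
        "headline": "Local team wins final",
        "text": "The home side scored twice in extra time.",
        "concepts": {"sport"},
    },
]


def text_search(term: str) -> list:
    """Match only stories whose text contains the term."""
    term = term.lower()
    return [s["headline"] for s in stories if term in s["text"].lower()]


def concept_search(term: str) -> list:
    """Match stories whose text or concept metadata contains the term."""
    term = term.lower()
    return [
        s["headline"]
        for s in stories
        if term in s["text"].lower() or term in s["concepts"]
    ]


print(text_search("euthanasia"))
print(concept_search("euthanasia"))
```

The first search comes back empty because the word never appears in the story text. The second finds the story through its metadata.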

The hidden standard

But NewsML also suffers from another weakness: Very few people outside those directly developing it know much about it. Robin Miller is an exception. Of about 15 people contacted to comment on NewsML who were not involved in its development, only two agreed to speak. Two contacts were unavailable; the rest didn’t feel qualified to comment.

“I’m no expert on NewsML,” says Andrew Nachison, director of The Media Center of the American Press Institute, echoing a common sentiment. “Is it important? The best I can say is that I don’t know. Standards probably have a place, and I can see how a standard set of codes might help.”

“NewsML made a lot of sense when everybody was buying million dollar content management systems,” says Nachison. “Frankly, companies that are still buying those systems I would say are making some critical IT mistakes.”

It’s a sentiment echoed by Miller: “I think it’s an improvement or advance for Reuters or BusinessWire. It’s an improvement for those who had nothing. But what we’re really seeing with NewsML, its real significance, is that for the first time the old publishing businesses are saying, ‘Here we’re going to use XML, and we’re going to publish all our stories as Web content and peel some of it off as print.’ ”

Nachison also believes NewsML may have a role. “NewsML was always an open architecture, but one that’s not easy to implement. If they can get flexible enough and open enough that it could be incorporated into lots of low-end publishing systems and open source publishing systems, there’s a chance it could be seamlessly integrated into existing open source programs,” he says.

There are a lot of ifs, buts and maybes in Nachison’s prognosis for NewsML. But, despite criticisms of this news standard, agencies across the world will continue with its development. Whether the standard can move outside its industrial tributary and into the mainstream remains to be seen.