Low-key Topix.net tries to recreate a journalist's brain with computers

When you arrive at Topix.net on a small side street in Palo Alto, Calif., you look up at a nondescript, low-slung office building and wonder: Is this the right place? There’s a sign for the Palo Alto Trophy & Shirt Shop. A sign for a business called The Maids. After coming through the door, there’s finally a cubicle for mail labeled Topix.net. Could this possibly be the world headquarters for a site that gets nearly 2 million unique visitors per month and was just valued at $64 million by three huge media conglomerates?

It is. Their humble work space is a cross between a dorm room and a home office, with no conference rooms and no receptionist. Likewise, the low-key site was launched in January 2004 with little fanfare and derided by journalists as ugly, drab and irrelevant despite the promise of Topix.net to deliver news headlines on hundreds of thousands of topics — from doughnuts to Bar Nunn, Wyoming.

And along the way, the site slowly gained traction by providing these niche pages with links and summaries and photos from stories elsewhere and serving up Google AdSense ads that were much more relevant to the subject matter of the stories. How did they pull it off? With a constantly evolving computer algorithm that scans stories and categorizes them into the right silos. It’s certainly imperfect technology, but it’s perfect enough to have brought in partnerships with America Online, Ask Jeeves and Citysearch, which all run Topix feeds on their sites. Plus, sites like the New York Times and USA Today have bought up ‘Featured Placement’ slots.

You would expect Topix to get under the skin of grizzled news veterans. There are no trained editors, and no advertising sales force. When I visited the Topix.net office, I asked co-founder and CEO Rich Skrenta, 38, about this problem, and he pointed at a journalism textbook on the table.

“We might be lay people, but we can study the field,” Skrenta said. “That’s what programmers do. The people who program systems at Blue Cross are not experts in HIPAA compliance. They have to learn all this to implement the thing. We’re so far away from some of the thorny issues of journalism, we’re not the ones going to jail for not revealing our sources. We’re down at the level of the person taking the press release. If we do that right, then maybe we can move to the next level up.”

Skrenta’s relevant past experience comes from the Open Directory Project, which was set up to categorize Web pages for Netscape and later AOL and Time Warner. That experience with tens of thousands of volunteer editors made Skrenta decide to avoid editors in this categorization project and stick to algorithms that are adjusted by the Topix team over and over again. In this case, the technology must scan 10,000-plus news sources that are constantly being updated. Plans are in the works to add in Weblogs as well.

Skrenta doesn’t have any illusions about making the computer responsible for any tough editorial decisions.

“Somebody wrote the algorithm to make it happen,” he said. “[At Google] they pretend they’re not making editorial decisions, but they really are. You go across the street to Yahoo, and they have no problem making editorial decisions. They’re not doing the 100-percent engineer thing trying to hide behind the algorithm. They take responsibility for it.”

The Topix team focused on getting more traffic to its site but also on retaining visitors. Because the startup was self-funded, with no venture capital, the team put advertisements on the site from the very start, which set visitors’ expectations from the beginning.

Though Topix found that pop-under advertising was valuable for advertisers and had a high rate of clickthroughs, it hurt the retention of readers, turning them off to the site. That type of experimentation led Topix to optimize its Google AdSense ads on the site, using artificial intelligence to make the strangely juxtaposed ads much more useful — improving clickthroughs and making the site look better.

That, in turn, brought them attention from media companies who have wanted to serve contextual text ads from AdSense but were put off by the awkward relevancy issue. A routine meeting between Topix and Knight Ridder in San Jose led to an eventual investment from Knight Ridder, Gannett and Tribune, who bought a 75 percent stake that valued the company at around $64 million even though Topix only had $1 million in revenues in 2004.

“The market has priced the deal, and it is not up to me to say if it is fair or expensive,” said Jeff Clavier, a venture consultant and angel investor. “What I can tell is that from a multiple standpoint, it is out of whack with standard content deals. It looks like a strategic investment/buyout — the excuse to throw multiples out the window. These three media companies have experienced merger and acquisition and executive staff. If they paid such a valuation — that I can only rationalize as a Google premium on a forward projection — it is because they believe they can really leverage it. Topix is very strong on the local side, and we all know that’s where a lot of focused advertising resides that can be tied to content.”

I spoke at length with Skrenta, along with Chris Tolles, 37, vice president of sales and marketing for the site. The following is an edited transcript of our wide-ranging conversation in their funky office. (They do have plans to move to a new location across town with 10,000 square feet of office space.)

Online Journalism Review: What were your earliest plans for the site when you launched it?

Rich Skrenta: We started off with a high-level idea of creating a Web page for every person, place or thing in the world. When you go to Google and do a search, you get 10 Web sites and they’re all different. You type in “Madonna” and you get 10 Madonna pages. And if you click and go to each one, each one is authored by someone else. You might go to a GeoCities site, you might go to all these official sites.

We had the idea that by using new ideas in the artificial intelligence community, you could actually read the entire Web and parse it and understand it and produce a regular page. So rather than have 10 different pages, you could produce a summary page about whatever you typed. Sort of a Google News per topic, whether it was Madonna, or Billings, Montana, or presidential elections or whatever it was.

So we decided to figure out where to get data to drive this. For localities, we got digital map data from the U.S. Geological Survey. For celebrities we have lists of all the movies and all the music CDs that have ever been put out. You can detect references that way. So we made a prototype of it, fleshed it out, and put up a prototype in January of 2004. We had a pretty good first year. We did deals with America Online, Ask Jeeves, InfoSpace, Citysearch.

OJR: What did you do for them?

Skrenta: One of the most unique things we provide is the local news aggregation. The Citysearch deal was providing local news for their city guides, and AOL and Ask Jeeves are local news. There really isn’t anywhere else you can get local aggregation down to the ZIP code. The other interesting thing we hadn’t foreseen when we started this, we hadn’t taken any venture funding and said, ‘Let’s see if we can bootstrap the company and make some money.’ We had ads on the site from Day 1. A lot of times when a little Web company takes money from a VC, they’ll deliberately not put up advertising and say, ‘We’ll monetize it later.’ And we had people telling us that’s what we should be doing. But we said screw that. We’re funding it with our own money, so let’s put up advertising.

It turns out that the hygiene of doing that was really good, because maybe we’ll make five bucks the first day, but we’ll figure out how to make six bucks the next day. Because we were working on the advertising with the content from Day 1, we realized that 50 percent of the content that people want in the newspaper is commercial content. I get the Sunday paper, and I look at the real estate section, and it’s actually 100 percent ads with a couple fake stories, you know, from the Knight Ridder network on title insurance and mold and stuff. It’s fascinating and I want to know. It’s 100 percent commercial, but it’s news to me.

My wife, when we get the paper, sees a Macy’s ad, a big event, and that’s commercial news. That is all missing from the online news sites. You can go to the San Jose Mercury News site, and you can see a giant PDF of [the ad], but it hasn’t been Web-ified. It’s not part of the integrated stream. What we found was we started to pour in our categorization technology to the Google [AdSense] ads that were on our site, and it started working a lot better. We doubled the clickthrough rate on them. But beyond that, it made our site look better. Improving the quality of the advertising improved the quality of the entire product.

OJR: You were improving the relevance of the AdSense ads?

Skrenta: Right. Google works great on a static Web page, but it’s a disaster on most news sites and blogs. It doesn’t perform very well. We’re not media people, we’re technology people, so we got books on how to be a news editor. And we learned that the local news mantra is, ‘If it bleeds, it leads.’ So we said, ‘That’s great. We’re going to have car crashes and fires and muggings every day.’ But the more we were successful at that, the worse the ads got from Google. If it was a story about a fire, the ad would be for a fire extinguisher or something like that. The famous case was when the New York Times’ site had a story on a suitcase of body parts that washed ashore in New Jersey, and Google was showing luggage ads beside it.

Google will see ‘Janet Jackson’ 25 times in a story and put up an ad to buy a Janet Jackson CD next to the story. Our technology will see the same story and figure out it’s actually a story about the FCC and indecency issues on the airwaves, and it’s not actually an entertainment story. When you get that right, the ‘buy her latest CD’ ads go away, and ads for [FCC compliance] go up, and the clickthrough rate goes up, and the money you make goes up.

OJR: How do you automate that? Is it categorizing the stories?

Skrenta: We have a bunch of artificial intelligence algorithms and a big knowledgebase, which basically is reading the words in every story. It’s trying to figure out what the story is about, and [finding out if it’s] about something relevant to a location — and it’s all 100 percent automated.

Chris Tolles: What we’re doing is categorization and creating all these news channels, and if you look at the incremental things coming out every hour, there’s enough stuff coming out of the fire hose that you can’t drink it, you have to segment it somehow. If you segment it in terms of interest, in category and topic, and put ads on it from the beginning, you miss the trap of people saying here’s my product, and advertising pollutes it and makes my product worse. The sad thing is saying, ‘Let me separate those things out; we’ll make sure there’s a Chinese Wall there between all advertising and all content.’

Skrenta: If you take the ads out of the paper, you’re missing half of the product.

Tolles: The ads are a measure of the things you care about. In our case, we measure both how much a page makes in a day, but you can’t just look at that. Are you also growing your user retention? If you’re doing both of those things, optimizing for getting the most people and making the most money, those are the two things as a product team that you want to put together.

We actually have our own measure, what we’re calling the marginal eCPM of a site. One of the problems here is you have the Wall Street Journal approach, where you have to buy a subscription, but you’re not going to get any traffic that way. But the people who are winning the Web war — Google, Yahoo and Microsoft — what do they do? They try to get as many users as possible. They try to make a product that says, ‘come to me.’ And we look and say, ‘What percentage of our incoming traffic will make for an incremental hit? Where can you make one more page view and make another cent?’ Just thinking that way forces you into thinking that you want more traffic as opposed to saying, ‘Hold on, let me get people to do a registration of some sort.’ No, no, no. The more traffic, the better.

OJR: When I was going through the site, I noticed there was some kind of behavioral thing, where it was serving ads based on what parts of the site I’d been to before.

Skrenta: We’ve expanded it. Originally, we had it on the front page. What we’ve found is that the ads on the front page, nobody clicked on them. They didn’t make any money. We talked to somebody, a business development guy at MSN Newsbot, and he tested advertising on MSN Newsbot and nobody clicked on it. And nobody bought anything when they clicked.

When you’re looking at Top 10 U.S. and world headlines, what do we know about people to show relevant ads? We really don’t know anything about them, so you can just show untargeted advertising, which isn’t very good. The most random ad in the newspaper is still targeted at least by your geography. But on the Web it didn’t work very well. But what we did was look at the cookie and see what pages they visited on Topix. And we could serve ads from that page rather than the worst-performing default ad that might be on the home page. And it worked pretty well, and we expanded that, so half or a third of the ads you see are relevant to something else you’ve looked at and not to what you’re looking at.
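The cookie-based fallback Skrenta describes can be sketched roughly as follows. This is a minimal illustration and not Topix's actual code: the function name `pick_ad_topic`, the page identifiers, and the `DEFAULT_AD_TOPIC` value are all hypothetical, and the sketch covers only the original front-page case, not the later expansion he mentions.

```python
from __future__ import annotations

# Hypothetical topic used when nothing is known about the visitor.
DEFAULT_AD_TOPIC = "general"

# Hypothetical identifiers for pages that carry no useful targeting signal.
UNTARGETED_PAGES = frozenset({"home", "top-headlines"})


def pick_ad_topic(current_page: str, visited_topics: list[str]) -> str:
    """Choose which topic's ads to show on the current page.

    visited_topics is the browsing history read from the visitor's
    cookie, oldest first.
    """
    # A topic page already tells us what ads are relevant.
    if current_page not in UNTARGETED_PAGES:
        return current_page
    # On an untargeted page, borrow ads from the most recent topic visited.
    for topic in reversed(visited_topics):
        if topic not in UNTARGETED_PAGES:
            return topic
    # No history at all: fall back to the poorly performing default.
    return DEFAULT_AD_TOPIC


print(pick_ad_topic("home", ["palo-alto", "doughnuts"]))
```

Preferring the most recently visited topic is one plausible choice; the interview does not say how Topix picked among several previously visited pages.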

OJR: I’m wondering about the Chinese Wall problem, because when I go onto your site I’m not always sure what’s been paid for and what hasn’t.

Skrenta: The ads are visually segmented in the standard Google tower. We have included some ads in the RSS feeds or in-line with a story, but it’s always labeled with the word ‘advertising.’ We worked on Netscape search at AOL, and that was right around the time that the FTC was serving search engines with notices about appropriately disclosing, saying ‘this is a paid thing.’ They had in their letter guidelines which said, if you say ‘Featured Partner,’ it’s not sufficient. It should say ‘Sponsored Listing’ or ‘Paid Advertisement’ or something. But we think that Topix is a different experience because you’re browsing, but we tried to do the same kind of thing.

OJR: Do your users ever ask you about that or get confused about what’s paid for?

Tolles: No. The average consumer doesn’t mention it. The people who are vocal about that are publishers, are journalists, are people in the industry. We do get feedback every day, and usually it has to do with the quality of the stories on the page or missing stories. Or they say, ‘I have an article I want to give to you. Will you publish it?’ Or they say, ‘I would like you to do investigative reporting in Altoona, Pennsylvania,’ and I say, ‘Uh, OK.’ I think people are pretty savvy, for the most part. In the ‘Featured Placement,’ for the New York Times, there’s a link labeled ‘What’s This’. And we took that directly from Rotten Tomatoes, which does that for featured reviews.

It says ‘What’s This’ as opposed to ‘Advertisement.’ Here’s something that doesn’t happen in newspaper-land, which is we actually have our algorithm look at the stories on our featured partner’s site and figure out what would be on the page. It’s not that they wouldn’t be on the page, it’s that they’re featured with a brand around that. It’s kind of a different problem than the average magazine publisher has because it doesn’t exist in that space. So we highlight it and explain what they pay for. It’s not like they’re paying to show up. It’s a performance-based thing, so if people didn’t like it, it wouldn’t go there.

That’s a big part of this. Because you can measure everything that goes on the site, you can measure what people like and don’t. If they think we’re screwing something up, then we can change it. It’s sort of an evolving thing.

OJR: Tell me about your financial situation. You’re profitable now?

Skrenta: We were profitable, but I don’t think we’re profitable anymore.

Tolles: We were profitable in December, and as part of the investment [by the three media companies], there was some investment, and there was money in the bank we were using to grow the business. If someone invests in you, and you’re not using the investment, you’re not getting the investment.

Skrenta: You can either grow with your revenue, or you can accelerate that growth by taking investment and getting unprofitable for a while and hiring people and getting lots of servers. We’re not very far off from where we were. We were eight people and now we’re 13. We’re not exactly building a marble palace by the Bay. A lot of times a startup model is to get a ton of money [from investors] and try to jump, and sometimes you clear the wall and other times you smack into it. I don’t think it’s going to be like that here. We’re making use of the investment, but we’re pretty low-key about it.

OJR: And your revenues are what? I had heard around $1 million last year.

Tolles: On the order of $1 million in revenues, I think that’s a fair statement. We’re on track to grow those revenues. It’s not like we’re going to go public in six months. We have disclosed the number of visitors that have gone to our site, which is 3.1 million per month.

[Note: Nielsen//NetRatings puts that number at about 2 million unique visitors in June 2005.]

Tolles: Half of our traffic is off-site, because we give our feeds away. We’ve been doing story clicks, because you can’t really do feed clicks. But we have 3.1 million unique visitors per month and they generate about 3.5 million story clicks per month. We get another 3.5 million story clicks per month from people who put our feeds on their sites. So we’re generating 7 million story clicks per month. The half that’s off our site is growing faster than the traffic on our site. And that includes AOL and Yahoo and feeds and RSS, all that stuff.

OJR: How do you monetize that?

Skrenta: We did a test ad in an RSS feed with an Overstock.com ad under a movie review that they had written. If the user clicks they get an advertorial, and then they can click on Overstock to buy it. That was really successful, and they were happy with it and wanted more from us. But our goal isn’t to crank it up and monetize it. RSS advertising is like where Web advertising was in 1994. It’s goofy stuff [on the technical side] and people are just feeling their way through this. We want to be careful with that and not get tarred and feathered over that and just hang back on it.

OJR: So why did you make the deal with the three media companies?

Skrenta: It’s interesting because we never did take venture capital money. And our buddies in Silicon Valley were always saying, ‘You’re going to sell out to Yahoo or Google.’ And we said, ‘No, we aren’t going to sell out.’ We never thought about the newspaper companies in that sense, we thought of Amazon or eBay or the standard big Net companies.

We went to Knight Ridder to pitch them the standard publisher deal, and we started to talk to them, and they got increasingly interested in us. The more we talked to them, we started to see the different opportunities we might have with them. We’ve been a small technology team stuck in the innards of Time Warner before. The newspapers actually have tons of revenue, and tons of advertiser contacts, and hundreds of thousands of local advertisers. They have the number of advertisers that any [venture funder] would find very valuable in starting an ad network. They also have a ton of traffic.

What they didn’t have a lot of was technology, and they were concerned about My Yahoo and Google and automated aggregation. No matter what you think about it today, if you’re in the content business and want to be online, it’s going to require technology to be effective.

OJR: Are you trying to get into the AdSense business?

Tolles: Google has 400,000 advertisers. We have zero salespeople.

Skrenta: We’re leveraging other people’s salesforces as much as possible.

Tolles: The more ad networks that are out there that we can graft onto the page … for example, one of the aspects of this is that you can measure everything down to the click. So if someone has a great group of advertisers on a topic that AdSense isn’t performing well on, then that’s great if we can help them out. But I don’t think we want to create it on our own.

OJR: So you guys aren’t really a media company. You don’t have editors, and you don’t have ad salespeople. You’re really a technology company.

Skrenta: Well, what’s Google? Well, if you look at what Sidewalk tried to do in 1997, they started up editorial and sales forces in 50 cities. But we don’t author any content currently, and we don’t have any sales force. We cover 30,000 cities. When we started, people said, ‘It’s a great idea to make the content for each city, but how can you sell ads to plumbers in St. Louis? You’ll never be able to build a salesforce to do that.’

What Google has created with AdSense is astonishing. When it works, it works really well. You’ve got real estate agents here in Palo Alto bidding six bucks a click for Palo Alto real estate clicks, and they pay for it because it works. It’s very potent advertising compared to the alternative ways to sell it. Whether we’re a media company or not is just an argument over the semantics of the word. Are Google and Yahoo media companies? I don’t know. What’s a media company?

OJR: But you look at your site and you see news there, there are news stories and ads there. But it’s a new way of thinking for people in the media business to think here’s a site doing a decent business, and you’ve got a small staff, and they’re feeling threatened by that.

Skrenta: There are print publications that are 90 percent aggregated content, too. They add a little bit themselves, but a lot of it comes from wire stories.

Tolles: We’re kind of like radio. I didn’t know about The Killers until I heard them on the radio. I didn’t know about any of the bands I’ve heard in the last five years until I heard them on the radio. Likewise, the San Francisco Examiner is a great example. The Examiner used to be a broadsheet newspaper delivered to a home. Then it became a tabloid newspaper in boxes and wasn’t delivered to the home. And I never looked at it because it was in the same news box that I never opened. And I live in its primary area, San Francisco. But on Topix, I see these hyperlocal stories from the Examiner. And I had no idea that the Examiner was doing local news, because it had pretty much been AP stories and a few editorials. Now I see they’re doing a lot of local, really good journalism.

We connect people to sources that users wouldn’t normally see. If you’re a hardcore reader of a local newspaper or radio or TV site, we’re not going to replace you by going to Topix, because we don’t have all their stories. But we might show you stories you wouldn’t have seen; we have a value add by the categorization. We’re sending out 7 million clicks into the mediasphere, and those are clicks we are sending to them; we’re not taking it from them. We are giving them stuff back from the search engine universe. There are content producers who have a problem with it…

Skrenta: In a year and a half, there have been only four publications that have said, ‘We’d rather you not crawl our site.’ They’ve all been really small sites, a TV station in Pennsylvania, a tiny newspaper in Pennsylvania. We’ve had several thousand people come to us and say, ‘Can you crawl us? We want to be on your site.’

OJR: What do you do about scrapers of your own site — people who take content from your site and put it on their site and try to sell ads off of it?

Skrenta: Nothing at the moment. We shipped them all our doughnut page. We have this page on doughnuts, and we served the doughnut feed to them. We were serving them all doughnut news. There’s a lot of junk polluting the blogosphere. I have a theory that what happened to your e-mail box is going to happen to the Web as a whole — it’s going to be 90 percent spam.

Our point of view is that if you write a story, you kind of want the story in as few places as possible. You should put that up on your site. You want the headline and summary to go out to as many places as possible — Google News, the Google index, Feedster, Topix, My Yahoo — to put the traffic back to your site where you can monetize it.

OJR: How does your algorithm figure out what stories are interesting or what hasn’t been done before? Or credible or not credible?

Skrenta: There are a lot of ways that an algorithm can judge a story. Obviously we can’t do as good a job as a human could do. We had Lincoln Millstein [formerly with the New York Times Digital and now with Hearst New Media] visit us, and he said, ‘You people suck. You need to put human reviewers on all the channels, and then you’ll be good.’

Yes, we have to clean it up. And even though we’re not as good as Lincoln Millstein, making the algorithm as good as he is, is a 50- or 100-year job, that’s HAL 9000 stuff, and it’s kind of a ridiculous goal to get there. But we can make it better each day. So it measures the tonal heat of the article, what kind of extraneous information is there. We have an alternative energy section in technology, but we found that a lot of stories were rants about the Kyoto treaty and President Bush — which have a place but not in that section. But we can create detectors for that and pull it off the page.

We actually have an algorithm here called the Lincoln Millstein Algorithm. He was from a town in Connecticut, and he said, ‘The thing that sucks about your service is that when you can’t find news about my town, you go to the neighboring town. My town is an affluent town, and no one in my town ever wants to hear about that town.’ Like Palo Alto should import stories from Atherton and Menlo Park and not from East Palo Alto. So in our system, we have demographic data for these towns. So if there’s a slow news day in Palo Alto, it shouldn’t jump a big demographic barrier. So that’s the Lincoln Millstein Algorithm. It never occurred to us that we could codify it into an algorithm.
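The neighbor-town rule described here can be sketched as a simple filter. This is a rough illustration under stated assumptions, not the actual Lincoln Millstein Algorithm: the interview does not say which demographic data Topix used or where the barrier sits, so the `median_income` field, the `max_ratio` threshold, and the town names and figures below are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Town:
    name: str
    # Assumed stand-in for whatever demographic data is actually used.
    median_income: float


def fallback_towns(home: Town, neighbors: list[Town],
                   max_ratio: float = 2.0) -> list[Town]:
    """Return the neighbors a slow-news-day page may borrow stories from.

    A neighbor is allowed only if the higher of the two incomes is at
    most max_ratio times the lower, so the fallback never jumps a large
    demographic gap in either direction.
    """
    allowed = []
    for town in neighbors:
        high = max(home.median_income, town.median_income)
        low = min(home.median_income, town.median_income)
        if low > 0 and high / low <= max_ratio:
            allowed.append(town)
    return allowed


# Made-up towns and figures, for illustration only.
home = Town("Town A", 100_000)
nearby = [Town("Town B", 120_000), Town("Town C", 40_000)]
print([t.name for t in fallback_towns(home, nearby)])
```

A single ratio on one variable is the simplest possible barrier; a real system would presumably combine several demographic signals.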

Our Gay & Lesbian channel is our No. 1 feed on My Yahoo. It’s extremely popular, and we have a whole bunch of semantic filters on that to make sure it’s great content. And it’s a great application for our technology because the reporting is not constrained to a few sources. There’s tons of good reporting all over the world on gay and lesbian issues, but there’s also a lot of material that’s not appropriate. It’s not every story on the Net that has the word ‘gay’ in it. We have an Islam page, and it had a lot of inappropriate material for what we were looking for in there. It’s important to see if it’s a blog with excessive first-person references, or if it’s riddled with misspellings.
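The two signals mentioned at the end of this answer, first-person references and misspellings, could be measured along these lines. This is only a hedged sketch: the word list, the thresholds, and the function names are assumptions made for illustration, and the caller is assumed to supply its own dictionary of known words. Nothing here reflects how Topix's semantic filters were actually built.

```python
from __future__ import annotations

import re

# Assumed list of first-person singular words.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def _words(text: str) -> list[str]:
    # Lowercase the text and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def first_person_ratio(text: str) -> float:
    """Fraction of words that are first-person singular references."""
    words = _words(text)
    if not words:
        return 0.0
    return sum(w in FIRST_PERSON for w in words) / len(words)


def misspelling_ratio(text: str, dictionary: set[str]) -> float:
    """Fraction of words not found in the supplied dictionary."""
    words = _words(text)
    if not words:
        return 0.0
    return sum(w not in dictionary for w in words) / len(words)


def fails_quality_filter(text: str, dictionary: set[str],
                         max_first_person: float = 0.08,
                         max_misspelled: float = 0.2) -> bool:
    """Flag text that exceeds either hypothetical threshold."""
    return (first_person_ratio(text) > max_first_person
            or misspelling_ratio(text, dictionary) > max_misspelled)
```

Proper nouns would count as misspellings against a plain dictionary, which is one reason a news site would need something more careful than this.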

OJR: So are you ever going to add human editors, or are you categorically rejecting them?

Skrenta: We could. But as technology people, it’s kind of an admission of failure.

OJR: What about the problems with miscategorized news and having news stories with bias?

Tolles: Our editors don’t decide whether anything is good or bad. They decide where it should go.

OJR: But even the decision about where a story should go — there’s a political aspect to that. You were talking about taking out stories from the Gay & Lesbian page, but there are people who would say there should be a rant against gays on that page.

Skrenta: But we wouldn’t publish them.

Tolles: I agree with you. You can politicize anything.

Skrenta: Yes, we’re going to take those stories out of the Gay & Lesbian section. We’re a news site; we have to make these kinds of calls. Our goal is to look at someone like Lincoln Millstein and say we want to build an algorithm to mimic his decisions.