Are full RSS feeds now more trouble than they are worth?
I wondered that last week as the umpteenth Google alert hit my email in-box with a link to another “blog” that had scraped the full content of my posts. Curious this time, I clicked through and found something interesting at the bottom of the post.
It was the same list of social media links that I’ve opted to append to the bottom of my posts in my Feedburner RSS feed. Their inclusion confirmed something I’d long suspected but had shoved to the back of my mind: scrapers are using the convenient XML formatting of RSS feeds to populate their spam webpages.
(Let’s continue down the stream of consciousness, shall we?) That prompted me to wonder how many actual human beings are reading my site via RSS feeds today, versus spam bots harvesting those feeds to steal my content for their websites. With the rise of Twitter and Facebook, those services have become my go-to channels for pushing new posts to my readers. Does anyone still use RSS?
It’s tough to answer that by looking at Feedburner stats. Perhaps an OJR reader with this information might inform us in the comments, but I don’t know of a good way to parse that data to separate human readers from scraper bots.
But the presence of so many scraper sites on the Internet, even after Google’s much-hyped Panda update, inspires me to consider cutting off their source of content. What if I killed my RSS feeds? Would the scrapers leave me alone? Would Google and Bing still find my content? Would my readership suffer?
Sitemaps provide a superior way to use XML to alert search engines and legitimate aggregators to new posts and content on a website, so I don’t believe the loss of an RSS feed would hurt you there. As I mentioned before, Facebook and Twitter provide new, more popular avenues for pushing new URLs to your readers and fans. But without an RSS feed breaking down your site’s content into easy-to-parse XML, scrapers likely would have a harder time extracting readable content from your website to put on theirs.
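To make the contrast concrete, here is a minimal sketch in Python of generating a sitemap. It is an illustration under assumptions, not the setup of any particular site: the function name `build_sitemap` and the example URLs are hypothetical. The point to notice is that a sitemap entry carries only a URL and a last-modified date, with none of the article text that a full RSS feed hands over.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(posts):
    """Build sitemap XML from (url, last_modified) pairs.

    `posts` is a hypothetical input shape: an iterable of tuples such as
    ("https://example.com/post-1", "2011-06-01").
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in posts:
        entry = ET.SubElement(urlset, "url")
        # Only the location and date are published, never the post body.
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/post-1", "2011-06-01"),
        ("https://example.com/post-2", "2011-06-08"),
    ]))
```

A scraper reading this file learns where the posts live, but it still has to fetch each page and pick the article out of the surrounding HTML.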
One interesting fact about the way that scrapers mine RSS feeds: They take only the headline and content, never the link. So as an interim step before killing off my RSS feeds, I’ve tried modifying them instead. I’ve rewritten the script that generates my feed to add the following line to each post in the feed:
“The article originally appeared at HYPERLINKED_URL_HERE. If you are not reading this post on a personal RSS reader (such as Feedburner) or on HYPERLINKED_WEBSITE_NAME, you are reading on a “scraper” site that has illegally copied our content. Please visit HYPERLINKED_URL_HERE for the original version, which includes all the reader comments.”
This places the original URL, and links it, within the copy of each post. Not only should that help search engines identify the canonical URL when the piece is scraped, it should also help drive some of the scraper sites’ traffic back to my website. Ultimately, I don’t care about scraper sites if they drive their traffic back to me. What offends me is when they take my content without returning any traffic.
I just made this switch, and I’ll report back if I see any change in traffic, search engine placement or scraper abuse of my websites as a result. In the meantime, I’d like to hear what you’re doing (or not) with your RSS feeds to fight scraping.
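For anyone who wants to try the same thing, here is a minimal sketch of the idea in Python. It is not the script that generates my feed: the function names `attribution_footer` and `add_footer`, and the example URLs, are hypothetical, and the notice wording is adapted from the one quoted above. It assumes each post body is already an HTML string before it is written into the feed.

```python
from html import escape


def attribution_footer(post_url, site_name, site_url):
    """Return an HTML notice that links back to the original post."""
    # Escape values so a URL containing "&" or quotes cannot break the markup.
    post_link = '<a href="{0}">{0}</a>'.format(escape(post_url))
    site_link = '<a href="{0}">{1}</a>'.format(escape(site_url), escape(site_name))
    return (
        "<p>This article originally appeared at " + post_link + ". "
        "If you are not reading this post in a personal RSS reader or on "
        + site_link + ", you are reading on a scraper site that has copied "
        "our content. Please visit " + post_link + " for the original "
        "version, which includes all the reader comments.</p>"
    )


def add_footer(body_html, post_url, site_name, site_url):
    """Append the attribution notice to one post body before it enters the feed."""
    return body_html + "\n" + attribution_footer(post_url, site_name, site_url)


if __name__ == "__main__":
    print(add_footer(
        "<p>Post text goes here.</p>",
        "https://example.com/2011/06/full-rss-feeds",
        "Example Site",
        "https://example.com/",
    ))
```

Because the notice sits inside the post body rather than in the feed's separate link field, a scraper that copies only the headline and content carries the link along with it.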
Comments welcomed!