At NewsGator and Sepia Labs I worked with Brian Reischl, one of the server-side guys. Among other things, he worked on NewsGator’s RSS content service, which reads n million feeds once an hour.
(I don’t know if I can say what n is. It surprised me when I heard it. The system is still running, by the way.)
Brian is intimately acquainted with the different ways feeds can be screwed up. So he posted Stupid Feed Tricks on Google Docs...
Stupid HTTP Tricks
- When the feed is gone or broken, the publisher may still return a 200 OK but send an HTML error page instead of the feed.
- Using permanent redirects for temporary errors. In one instance, all the Microsoft blogs had a temporary system error. All the feeds did a permanent redirect to the same system error page, and we updated all 40,000 feeds to point to that one URL. Whoops.
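Two defensive checks fall out of the above. This sketch is mine, not NewsGator's actual code — the HTML heuristic and the redirect threshold are assumptions:

```python
def looks_like_html(content_type, body):
    """Heuristic guard: a 200 OK whose payload is HTML is probably an
    error page masquerading as the feed."""
    if "html" in (content_type or "").lower():
        return True
    head = body.lstrip()[:15].lower()
    return head.startswith(b"<!doctype") or head.startswith(b"<html")

def should_persist_redirect(history, new_target, threshold=3):
    """Only rewrite a stored feed URL after the same permanent-redirect
    target has shown up on `threshold` consecutive polls; this guards
    against mass mis-redirects like the Microsoft incident above.
    (The threshold and bookkeeping are illustrative assumptions.)"""
    history.append(new_target)
    recent = history[-threshold:]
    return len(recent) == threshold and len(set(recent)) == 1

print(looks_like_html("text/html", b"<html><body>Oops</body></html>"))   # True
print(looks_like_html("application/rss+xml", b'<?xml version="1.0"?>'))  # False

polls = []
print(should_persist_redirect(polls, "http://example.com/new.xml"))  # False (1 sighting)
print(should_persist_redirect(polls, "http://example.com/new.xml"))  # False (2 sightings)
print(should_persist_redirect(polls, "http://example.com/new.xml"))  # True  (3 in a row)
```

The point of the second function: treat even a 301 as a rumor until it has repeated itself a few times.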
Stupid XML Tricks
- Any sort of XML well-formedness error you can think of: missing closing tags, mismatched tags, bad escaping, unquoted attributes, missing root elements.
- Including unescaped HTML content inside a tag, which sort of works, except that most HTML isn’t XML-compliant.
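A strict XML parser rejects every one of these outright — which is why aggregators end up needing much more forgiving parsers. Python's standard ElementTree, used here, shows the strict behavior:

```python
import xml.etree.ElementTree as ET

# Two of the failure modes above: a bare ampersand and a missing </rss>.
bad = '<rss version="2.0"><channel><title>News & Views</title></channel>'

try:
    ET.fromstring(bad)
except ET.ParseError as err:
    print("strict parser rejects it:", err)

# The well-formed version parses fine, and entities come back decoded.
good = ('<rss version="2.0"><channel>'
        '<title>News &amp; Views</title>'
        '</channel></rss>')
tree = ET.fromstring(good)
print(tree.find("channel/title").text)  # News & Views
```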
Stupid RSS/Atom Tricks
- Missing any element you can think of.
- Adding custom elements without namespaces.
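In practice that means treating every element as optional. A minimal sketch of a defensive accessor — the fallback order here is my assumption, not anything the specs require:

```python
import xml.etree.ElementTree as ET

def item_title(item):
    """Never assume an element exists: fall back from title to a
    truncated description to a placeholder."""
    for tag in ("title", "description"):
        el = item.find(tag)
        if el is not None and el.text and el.text.strip():
            return el.text.strip()[:80]
    return "(untitled)"

feed = ET.fromstring(
    "<rss><channel>"
    "<item><description>No title on this one</description></item>"
    "<item/>"
    "</channel></rss>"
)
for item in feed.iter("item"):
    print(item_title(item))
# No title on this one
# (untitled)
```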
Other Stupid Tricks
- Updating posts very frequently. Newspapers are very fond of this: in four hours they might change a post 12 times, and by the end it may have nothing in common with the original article (completely different title, completely different body). Sometimes this is combined with not using lastUpdated, or simply not changing it when the content changes.
- Publishing updated posts as new posts, so you have 12 versions of the same post in the feed.
- You should think hard about canonicalization of URLs. Some parts of a URL can be case-sensitive (the path and query); other parts cannot (the protocol, host, and port). Users (and webmasters) will absolutely use different upper/lower casing in different places.
- If you build a database index on FeedUrl, consider that 99% of them start with “http://”, which makes for a shitty index. Consider separating the protocol into its own column, and then indexing on the remainder of the URL. Alternatively, you could index on a hashed value of the URL. Theoretically you could have collisions, but in practice there are not that many feeds.
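Since timestamps can't be trusted, one workable approach to spotting real edits (my sketch, not what NewsGator shipped) is to fingerprint the content itself and compare fingerprints across polls:

```python
import hashlib

def content_fingerprint(title, body):
    """Hash the fields that matter; which fields to include is a
    judgment call."""
    h = hashlib.sha256()
    for part in (title, body):
        h.update((part or "").encode("utf-8"))
        h.update(b"\x00")  # field separator so ("ab","") != ("a","b")
    return h.hexdigest()

seen = {}  # post guid -> last fingerprint

def is_real_update(guid, title, body):
    fp = content_fingerprint(title, body)
    changed = seen.get(guid) != fp
    seen[guid] = fp
    return changed

print(is_real_update("post-1", "Headline", "Body text"))      # True: first sighting
print(is_real_update("post-1", "Headline", "Body text"))      # False: republished unchanged
print(is_real_update("post-1", "New headline", "Body text"))  # True: actually edited
```

The same fingerprint, keyed by guid, also collapses the "12 versions of the same post" case down to one entry.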
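The last two points combine naturally: canonicalize first, then hash the result into a fixed-width key. A sketch, assuming a SHA-256-based 64-bit key (this is not a full canonicalizer — no percent-encoding normalization, trailing-slash rules, etc.):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Lowercase only the case-insensitive parts (scheme, host) and
    drop a default port; leave path and query untouched."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default:
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path, parts.query, parts.fragment))

def url_hash64(url):
    """A fixed-width index key: the first 8 bytes of SHA-256 as a
    signed 64-bit integer, which fits a BIGINT column. Query as
    WHERE url_hash = ? AND feed_url = ? so a (vanishingly rare)
    collision is filtered out by the full-URL comparison."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

a = canonicalize("HTTP://Example.COM:80/Feeds/Main.xml?Tag=News")
b = canonicalize("http://example.com/Feeds/Main.xml?Tag=News")
print(a)                               # http://example.com/Feeds/Main.xml?Tag=News
print(url_hash64(a) == url_hash64(b))  # True: one index entry, not two
```

Note that the path (`/Feeds/Main.xml`) and query keep their casing — lowercasing those would merge URLs that a case-sensitive server treats as different.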
Since we're all about RSS this past week'ish, and since many might again play in the RSS space, I thought this document was worth passing along: it comes from someone who's really been there, done that...