Creating a Generic Site-To-Rss Tool

Making an RSS Feed Out of It

The first step in creating a valid RSS feed is to know how the RSS schema looks. There are several RSS standards available today. I've chosen to implement this using the RSS 2.0 standard. I won't bore you with the entire schema definition here, but a standard RSS feed using the RSS 2.0 schema should look something like this:

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:blogChannel="http://backend.userland.com/blogChannelModule">
    <channel>
        <title></title>
        <link></link>
        <description></description>
        <copyright></copyright>
        <generator></generator>
        <item>
            <title></title>
            <link></link>
            <description></description>
            <category></category>
            <pubDate></pubDate>
        </item>
    </channel>
</rss>

The easiest way to write XML with the .NET Framework is to use the XmlTextWriter class. This class removes the need to build XML strings by hand, and it can write directly to a file, a TextWriter, or a System.IO.Stream object. That stream can be a file stream, a memory stream, a response stream, or anything else that derives from System.IO.Stream.
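As a minimal sketch of that idea, the snippet below points an XmlTextWriter at a MemoryStream and reads the result back. The module name, the WriteSample helper, and the sample title are made up for illustration; they are not part of the article's tool.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports System.Xml

Module XmlWriterSketch
    ' Hypothetical helper: writes a tiny XML document to any Stream.
    Sub WriteSample(ByVal target As Stream)
        Dim writer As New XmlTextWriter(target, Encoding.UTF8)
        writer.Formatting = Formatting.Indented
        writer.WriteStartDocument()
        writer.WriteStartElement("item")
        ' Reserved characters are escaped for us: & is written as &amp;
        writer.WriteElementString("title", "News & updates")
        writer.WriteEndElement()
        writer.WriteEndDocument()
        ' Flush rather than Close, so the caller keeps ownership of the stream
        writer.Flush()
    End Sub

    Sub Main()
        ' A MemoryStream here; a FileStream or response stream would work the same way
        Dim buffer As New MemoryStream()
        WriteSample(buffer)

        ' Rewind and print what was written
        buffer.Position = 0
        Dim reader As New StreamReader(buffer)
        Console.WriteLine(reader.ReadToEnd())
    End Sub
End Module
```

Because WriteSample only knows about the Stream base class, swapping the target does not require touching the XML-writing code.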

Here's a small method that gets all the matches from a site's HTML, loops through them, and uses an XmlTextWriter to write the XML representing the RSS feed:

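'Requires these Imports at the top of the file:
'System.IO, System.Xml and System.Text.RegularExpressions
Public Sub WriteRSSToStream(ByVal txWriter As TextWriter)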

'our pattern to parse the page
'(one single pattern, split across lines only to fit the page)
Const REGEX_PATTERN As String = _
    "<p\s*class=""clsNormalText""><a\shref=""(?<link>.*)?" & _
    "(""\s*target=""newwindow"")(.|\n)*?>(?<title>.*\n?.*)" & _
    "?(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)" & _
    "(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span>)"

'Get the HTML to parse
Dim DownloadedHtml As String = GetHtml()
'Get the matches using our regular expression
Dim found As MatchCollection = Regex.Matches(DownloadedHtml, REGEX_PATTERN)
Dim writer As New XmlTextWriter(txWriter)

With writer
    'make the resulting xml human readable
    .Formatting = Formatting.Indented

    'write the document header declaring rss version
    'and channel info
    .WriteStartDocument()
    .WriteComment("RSS generated by SiteToRSS generator at " & _
        DateTime.Now.ToString("r"))
    .WriteStartElement("rss")
    .WriteAttributeString("version", "2.0")
    .WriteAttributeString("xmlns:blogChannel", _
"http://backend.userland.com/blogChannelModule")

    .WriteStartElement("channel", "")
    .WriteElementString("title", RSSFeedName)
    .WriteElementString("link", RssFeedLink)
    .WriteElementString("description", RssFeedDescription)
    .WriteElementString("copyright", RssFeedCopyright)
    .WriteElementString("generator", "SiteParser RSS engine 1.0 by Roy Osherove")

    'write out the individual posts
    For Each aMatch As Match In found
        Dim link As String = aMatch.Groups("link").Value
        Dim title As String = aMatch.Groups("title").Value
        Dim description As String = aMatch.Groups("description").Value

        'format the date as RFC1123 date string ("Tue, 10 Dec 2002 22:11:29 GMT")
        Dim pubDate As String = _
            DateTime.Parse(aMatch.Groups("pubDate").Value).ToString("r")
        Dim subject As String = aMatch.Groups("category").Value

        .WriteStartElement("item")

        .WriteElementString("title", title)
        .WriteElementString("link", link)

        'The description may contain illegal chars
        'so write it out as CDATA
        .WriteStartElement("description")
        .WriteCData(description)
        .WriteEndElement()
     
        .WriteElementString("category", subject)
        .WriteElementString("pubDate", pubDate)

        .WriteEndElement()

    Next

    'close all open tags and finish up
    '(the leading dots are required inside the With block)
    .WriteEndDocument()
    .Flush()
    .Close()
End With

End Sub

The code to generate an RSS feed is surprisingly simple. Notice that the method accepts a TextWriter, which can write to a file, a string, or many other targets, so this implementation isn't bound to any particular output. I still haven't shown how to get the actual HTML from the Web, but I'll explain that shortly.
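To illustrate what accepting a TextWriter buys us, here is a rough sketch that sends the same output to three different targets. WriteFeed is a hypothetical stand-in with the same signature as WriteRSSToStream, and feed.xml is just an example file name.

```vbnet
Imports System
Imports System.IO

Module TextWriterTargets
    ' Hypothetical stand-in with the same signature as WriteRSSToStream
    Sub WriteFeed(ByVal txWriter As TextWriter)
        txWriter.Write("<rss version=""2.0""><channel /></rss>")
        txWriter.Flush()
    End Sub

    Sub Main()
        ' Target 1: an in-memory string
        Dim inMemory As New StringWriter()
        WriteFeed(inMemory)
        Console.WriteLine(inMemory.ToString())

        ' Target 2: a file on disk (feed.xml is an example name)
        Dim fileWriter As New StreamWriter("feed.xml")
        Try
            WriteFeed(fileWriter)
        Finally
            fileWriter.Close()
        End Try

        ' Target 3: the console itself
        WriteFeed(Console.Out)
    End Sub
End Module
```

One difference to keep in mind: the real method closes its XmlTextWriter when it finishes, which also closes the TextWriter passed in, so the caller should not expect to keep writing to it afterwards.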

Validating our Feed

To check that the feed is valid RSS, you can use one of the free RSS validation sites (www.FeedValidator.org comes to mind). The site makes sure your feed lives up to the standard it claims to support and tells you if you missed anything important.

It's very helpful to test against such a site to make sure you don't screw up people's aggregators that subscribe to your new feed.
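Before submitting the feed to an online validator, a quick local sanity check can catch the most obvious breakage. The sketch below is an assumption of mine rather than part of the tool: it only confirms that the text is well-formed XML with an rss/channel/title element, which is far less than a real validator checks.

```vbnet
Imports System
Imports System.Xml

Module FeedSanityCheck
    ' Returns True if the text is well-formed XML containing rss/channel/title.
    Function LooksLikeRss(ByVal feedXml As String) As Boolean
        Dim doc As New XmlDocument()
        Try
            doc.LoadXml(feedXml)
        Catch ex As XmlException
            ' Not well-formed: unclosed tags, illegal characters and so on
            Return False
        End Try
        Return Not (doc.SelectSingleNode("/rss/channel/title") Is Nothing)
    End Function

    Sub Main()
        Dim good As String = _
            "<rss version=""2.0""><channel><title>News</title></channel></rss>"
        ' The title element is never closed here
        Dim bad As String = "<rss><channel><title>News</channel></rss>"

        Console.WriteLine(LooksLikeRss(good)) ' True
        Console.WriteLine(LooksLikeRss(bad))  ' False
    End Sub
End Module
```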

Subscribing to our Feed

Now that we have a ready-made XML file, we can test it using a real aggregator. I used SharpReader and simply subscribed to a feed located at the path leading to the XML file. In SharpReader, I made sure that the number of posts exactly matched the number of news items on the site and that the titles were correct. I also made sure that the “subject” column correctly showed the “category” of each news item.

Approaches for a Generic Tool

Now that we have the basic mechanics of the thing working, we need to understand the power that comes from such a simple technique. What we've seen here demonstrates that given a simple regular expression and text to parse, we are basically able to parse any site we want.
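The core of the technique, stripped of the RSS writing, can be sketched as follows. The markup and the pattern are invented for the example and are much simpler than a real news page would need; the point is only that named groups turn each match into a set of labeled fields.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module NamedGroupsSketch
    ' Each (?<name>...) group captures one field of a news item
    Const ITEM_PATTERN As String = _
        "<a href=""(?<link>[^""]*)"">(?<title>[^<]*)</a>"

    ' Returns one "title -> link" line per match found in the markup
    Function ListItems(ByVal html As String) As String
        Dim result As New StringBuilder()
        For Each aMatch As Match In Regex.Matches(html, ITEM_PATTERN)
            result.AppendLine(aMatch.Groups("title").Value & " -> " & _
                aMatch.Groups("link").Value)
        Next
        Return result.ToString()
    End Function

    Sub Main()
        ' Made-up markup standing in for a downloaded page
        Dim html As String = _
            "<a href=""/news/1"">First story</a>" & _
            "<a href=""/news/2"">Second story</a>"

        ' Prints:
        ' First story -> /news/1
        ' Second story -> /news/2
        Console.Write(ListItems(html))
    End Sub
End Module
```

Pointing the same loop at a different site then comes down to supplying different text and a different pattern.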

It comes to mind that we can build a simple class that receives these parameters and outputs RSS feeds appropriately. Such a class can later be used to build a much more generic website or Web service, to which sites and expressions can be added dynamically, and that returns valid RSS feeds given a site ID. But let's start small.

About the author

Roy Osherove Israel

Roy Osherove has spent the past 6+ years developing data driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topic...
