Creating a Generic Site-To-Rss Tool

Creating our Scraping Regular Expression

For our RSS feed, we need only a few pieces of data from the HTML for every “post” we intend to create in the feed:

  • Link: What the reader clicks to get to the full item. In .netWire it's the link to the news item.
  • Title: What the user's RSS reader displays for each post. In .netWire it's the title of the news item.
  • Description: The actual text of an individual post. In .netWire it's the text of the news item.

These items are buried deep inside the HTML of our website. Our job now is to find a regular expression that retrieves them and lets us reference them easily in code. Using our knowledge of groups in Regex, we want a group in the resulting Regex for every item we want to retrieve. We name the three groups above “link,” “title,” and “description”; further on we add two more, “category” and “pubDate,” for the item's category and publishing date.
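As a quick refresher on named groups, here is a minimal Python sketch. Python is our own choice for these illustrations, and it spells a named group “(?P<name>…)” where .NET writes “(?<name>…)”; the anchor tag below is made up for the example and is not .netWire markup.

```python
import re

# Hypothetical one-line anchor tag, used only to show how named groups work.
html = '<a href="/news/1">First item</a>'

# Python writes (?P<name>...); the .NET equivalent is (?<name>...).
pattern = re.compile(r'<a href="(?P<link>[^"]*)">(?P<title>[^<]*)</a>')
match = pattern.search(html)

# Each captured piece is read back by group name rather than by position.
link = match.group("link")
title = match.group("title")
print(link, "|", title)
```

Reading groups by name keeps the calling code stable even if more groups are added to the expression later.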

In developing our Regex, I decided to use Expresso, a tool designed to help with regular expression testing. I'll rely on this piece of HTML, taken from the HTML of .netWire:

<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"
  target="newwindow" class="clsNewsHead">Globalizing and Localizing
  Windows Applications, Part 1</a><br>
  With the explosive growth of the Internet and rapid
  globalization of the world's economies, the earth is getting
  smaller and smaller. The applications that you develop for
  a local market may soon be used in another country. If the
  world used a common language, that would make the life of
  developers much easier. However, reality is far from perfect.
  The author shows you how to make your applications ready for
  the global marketplace.<br>
<span class="clsSubText">Article. Sep 16, 2003.</span></p>

This HTML represents one news item on .netWire, and it is the snippet we need to focus on.

Our first item of business is getting the link of the news item. Why the link first? Because it's the first item in order of appearance, which makes it the least complicated to find.

Getting our Link

Here is the piece from which we want to extract the link:

<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" target="newwindow"
class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a><br>
With the explosive…

We can easily see that each link (and title) is encapsulated between two items:

<p class="clsNormalText"><a href="->Our link" target="newwindow" class="clsNewsHead">

Simply enough, the following regular expression catches every such link in our HTML file and gives us a group named “link” that holds the actual redirection string of the link:

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")

I've used “\s” so that we don't have to declare exactly how many spaces or tabs sit between the tag name and its attributes. Also notice the “?” right after the “.*” inside the “link” group, just before the “"\s*target="newwindow"” section. It makes the quantifier lazy, so the group stops at the first occurrence of that section and not the last one (otherwise it could run on to the last link it can reach instead of closing on the first one).
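To see what a lazy “?” buys us, here is a small Python sketch (Python writes the group as “(?P<link>…)”). It runs a lazy and a greedy version of the link expression over two made-up news items squeezed onto a single line; the item markup and the “newsid” values are assumptions for illustration.

```python
import re

# Lazy link group: .*? stops at the first '" target="newwindow"'.
LAZY = r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)("\s*target="newwindow")'
# Greedy variant for comparison: .* runs to the last one on the line.
GREEDY = r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)("\s*target="newwindow")'

# Two hypothetical items on one line (not real .netWire output).
one_line = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=1" '
    'target="newwindow" class="clsNewsHead">First</a></p>'
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=2" '
    'target="newwindow" class="clsNewsHead">Second</a></p>'
)

lazy_links = [m.group("link") for m in re.finditer(LAZY, one_line)]
greedy_links = [m.group("link") for m in re.finditer(GREEDY, one_line)]

print(lazy_links)
print(len(greedy_links))
```

The lazy version yields one link per item, while the greedy version returns a single match that swallows everything up to the last “target” attribute on the line.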

Getting our Title

Now that we have the link, we need the title that goes with it. This one is also relatively easy. The title sits between the “>” that closes the opening <a> tag and the link's closing tag (“</a>”). We also need to allow for new lines and spaces along the way, so we build these into our regular expression as well.

Here's the full expression so far:

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)(</a><br>\s*\n*)

We now have a second group in there, called “title,” so we can refer to it later in code. Notice that the title is made of any number of characters, followed by an optional new line and more characters.
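Because the captured title can span a line break, it usually needs tidying before it goes into a feed. Here is a minimal Python sketch, assuming a port of the expression with “(?P<name>…)” groups and lazy quantifiers; the whitespace clean-up step is our own addition.

```python
import re

# Python port of the link + title expression (an assumption: the article
# targets .NET, where the groups are written (?<link>...) and (?<title>...)).
PATTERN = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*?\n?.*?)(</a><br>\s*\n*)'
)

# Start of the sample news item, including its line breaks.
snippet = '''<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"
  target="newwindow" class="clsNewsHead">Globalizing and Localizing
  Windows Applications, Part 1</a><br>
  With the explosive'''

match = PATTERN.search(snippet)
raw_title = match.group("title")

# The raw title still contains the line break and indentation from the HTML,
# so collapse every run of whitespace into a single space.
title = " ".join(raw_title.split())
print(title)
```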

Getting our Description

The description is a block of text that can contain new lines and is terminated by a “<br>” (the expression is wrapped here to fit the page; the line breaks are not part of it):

<p\s*class="clsNormalText">
<a\shref="(?<link>.*?)("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)
(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)

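The expression ends with a section that skips ahead to the start of the next item we want to find: it consumes the “<br>” and everything up to the next “>”, which closes the opening Span tag right before the category.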
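One way to check where the expression stops is to look at what is left once it has matched. The Python sketch below is a port using “(?P<name>…)” groups and lazy quantifiers, with the description text abridged; it shows that the match ends exactly where the category text begins.

```python
import re

# Python port of the expression up to and including the description
# (an assumption: the article itself uses .NET group syntax).
PATTERN = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*?\n?.*?)(</a><br>\s*\n*)'
    r'(?P<description>(.|\n)*?)(<br>(.|\n)*?>)'
)

# The sample news item with its description shortened.
SAMPLE = '''<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"
  target="newwindow" class="clsNewsHead">Globalizing and Localizing
  Windows Applications, Part 1</a><br>
  The author shows you how to make your applications ready for
  the global marketplace.<br>
<span class="clsSubText">Article. Sep 16, 2003.</span></p>'''

match = PATTERN.search(SAMPLE)
description = " ".join(match.group("description").split())

# Whatever follows the match is the part the expression has not consumed yet.
rest = SAMPLE[match.end():]
print(description)
print(rest)
```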

Getting our Category

The category of the current news item is usually “Article” or “Product Release”. It always starts right after the “>” that closes the opening Span tag and ends with a period (“.”):

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)
("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)
(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*?)\.

Getting our Publishing Date

The news date follows right after the category's ending period (with zero or more spaces between them) and finishes with another period, followed by the closing Span and P tags.

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)
("\s*target="newwindow")(.|\n)*?>
(?<title>.*?\n?.*?)(</a><br>\s*\n*)(?<description>
(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*?)\.\s*(?<pubDate>.*?)(\.</span></p>)
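RSS expects its dates in RFC 822 form, so the captured date text will eventually need converting. The sketch below is one way to do that in Python and is not part of the original tool: it runs only the tail of the expression (written with Python's “(?P<name>…)” groups and lazy quantifiers) against the last line of the sample, and it assumes that every date looks like “Sep 16, 2003” and that month names are parsed in an English locale.

```python
import re
from datetime import datetime
from email.utils import format_datetime

# Tail of the expression only: category, then the publishing date.
TAIL = re.compile(r'>(?P<category>.*?)\.\s*(?P<pubDate>.*?)(\.</span></p>)')

line = '<span class="clsSubText">Article. Sep 16, 2003.</span></p>'
match = TAIL.search(line)

category = match.group("category")

# Assumption: dates always follow the "Mon DD, YYYY" shape seen in the sample.
published = datetime.strptime(match.group("pubDate"), "%b %d, %Y")

# The page gives no time of day or time zone, so the datetime stays naive
# and format_datetime() renders it as midnight with an unspecified zone.
rss_date = format_datetime(published)
print(category, "|", rss_date)
```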

Our Final Regex

We end up with a regular expression that lets you scan .netWire's HTML and retrieve a list of matches, each of which contains groups named “link,” “title,” and so on that we can use in our code. Our next step is to transform this pile of data into useful, readable information.
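To give a feel for what that list of matches looks like, here is a hedged Python sketch of the scan. It assumes a port of the final expression with “(?P<name>…)” groups and lazy quantifiers; the helper name “scrape”, the abridged description, and the second news item are made up for illustration.

```python
import re

# Python port of the final expression (the article itself targets .NET).
ITEM = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*?\n?.*?)(</a><br>\s*\n*)'
    r'(?P<description>(.|\n)*?)(<br>(.|\n)*?>)'
    r'(?P<category>.*?)\.\s*(?P<pubDate>.*?)(\.</span></p>)'
)

# First item: the sample, abridged. Second item: hypothetical.
PAGE = '''<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"
  target="newwindow" class="clsNewsHead">Globalizing and Localizing
  Windows Applications, Part 1</a><br>
  The author shows you how to make your applications ready for
  the global marketplace.<br>
<span class="clsSubText">Article. Sep 16, 2003.</span></p>
<p class="clsNormalText"><a href="/redirect.asp?newsid=4975"
  target="newwindow" class="clsNewsHead">Example Product 1.0</a><br>
  A made-up second item.<br>
<span class="clsSubText">Product Release. Sep 17, 2003.</span></p>'''


def scrape(html):
    """Return one dict per news item, with whitespace runs collapsed."""
    return [
        {name: " ".join(text.split()) for name, text in m.groupdict().items()}
        for m in ITEM.finditer(html)
    ]


items = scrape(PAGE)
for item in items:
    print(item["pubDate"], "|", item["category"], "|", item["title"])
```

Each dictionary carries only the five named groups, which is the data the feed needs.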

About the author

Roy Osherove Israel

Roy Osherove has spent the past 6+ years developing data driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topic...
