For our RSS feed, we need only a few pieces of data from the HTML for every “post” we intend to create in the feed:
- Link: The URL a reader follows to get to the full item. In .netWire it's the link to the news item.
- Title: The headline shown in the RSS reader that the user reads the posts with. In .netWire it's the title of the news item.
- Description: The actual text of an individual post. In .netWire it's the text of the news item.
- Publication date: When the post was published. In .netWire it's the date of the news item.
These items are buried deep inside the HTML of our website. It is now our job to build a regular expression that retrieves them and lets us reference them easily in code. Using our knowledge of groups in Regex, we want a named group in the resulting expression for every item we want to retrieve. We name them “link,” “title,” “description,” and “pubDate,” respectively. Along the way we also capture the item's category in a group named “category.”
In developing our Regex, I decided to use Expresso, a tool designed to help with regular expression testing. I'll rely on this piece of HTML, taken from the HTML of .netWire:
<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"
target="newwindow" class="clsNewsHead">Globalizing and Localizing
Windows Applications, Part 1</a><br>
With the explosive growth of the Internet and rapid
globalization of the world's economies, the earth is getting
smaller and smaller. The applications that you develop for
a local market may soon be used in another country. If the
world used a common language, that would make the life of
developers much easier. However, reality is far from perfect.
The author shows you how to make your applications ready for
the global marketplace.<br>
<span class="clsSubText">Article. Sep 16, 2003.</span></p>
This HTML represents one news item on .netWire, and it is the one we'll focus on.
Our first item of business is getting the link of the news item. Why the link first? Because it's the first item in order of appearance, which makes it the least complicated to find.
Getting our Link
Here is the piece from which we want to extract the link:
<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" target="newwindow"
class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a><br>
With the explosive…
We can easily see that each link (and title) is encapsulated between two items:
<p class="clsNormalText"><a href="->Our link" target="newwindow" class="clsNewsHead">
The following regular expression catches every instance of such a link within our HTML file and gives us a group named “link” that holds the actual redirection string of the link:
<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")
I've used “\s*” (and “\s” for a single whitespace character) so that we don't have to declare exactly how many spaces, tabs, or line breaks sit between the tag name and its attributes. Also notice the “?” right after the “.*” inside the “link” group, just before the “"\s*target="newwindow"” section. It makes the “.*” lazy, so the group closes on the first occurrence of that section instead of running on to a later link. The “?” has to sit inside the group: placed after the closing parenthesis, it would only make the whole group optional.
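To see the effect of the lazy quantifier in isolation, here is a minimal Python sketch. It is an illustration, not the article's own code: Python's `re` module writes named groups as `(?P<name>...)` rather than `(?<name>...)`, and the one-line test string and the names `two_links`, `GREEDY`, and `LAZY` are made up for the demonstration.

```python
import re

# Hypothetical one-line input holding two news links (not taken from .netWire).
two_links = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=1" target="newwindow">A</a></p>'
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=2" target="newwindow">B</a></p>'
)

PREFIX = r'<p\s*class="clsNormalText"><a\shref="'
SUFFIX = r'("\s*target="newwindow")'

# Greedy: .* takes as much as it can, then backs off to the LAST closing section.
GREEDY = re.compile(PREFIX + r'(?P<link>.*)' + SUFFIX)
# Lazy: .*? grows one character at a time and stops at the FIRST closing section.
LAZY = re.compile(PREFIX + r'(?P<link>.*?)' + SUFFIX)

greedy_link = GREEDY.search(two_links).group("link")
lazy_links = [m.group("link") for m in LAZY.finditer(two_links)]

print(greedy_link)  # starts at newsid=1 and runs on into the second link
print(lazy_links)   # ['/redirect.asp?newsid=1', '/redirect.asp?newsid=2']
```

The greedy version returns a single oversized "link", while the lazy version yields one clean link per news item.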
Getting our Title
Now that we have the link, we need the title that goes with it. This one is also relatively easy. The title resides between the closing bracket of the opening <a> tag (“>”) and the link's closing tag (“</a>”). We also need to allow for new lines and spaces along the way, so we build these into our regular expression as well.
Here's the full expression so far:
<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)(</a><br>\s*\n*)
We now have a second group in there, called “title,” so we can refer to it later in code. Notice that the title is made of any number of characters, followed by an optional new line and more characters.
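As a rough check, the link-and-title expression can be tried against the start of the sample item. This is a sketch in Python, so the groups use `(?P<name>...)` and each `.*` is written lazily inside its group. The names `sample` and `ITEM_HEAD`, and the whitespace-collapsing step, are assumptions added for illustration.

```python
import re

# The beginning of the sample news item from .netWire shown earlier.
sample = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    'target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    'Windows Applications, Part 1</a><br>\n'
    'With the explosive growth of the Internet'
)

ITEM_HEAD = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*?\n?.*?)(</a><br>\s*\n*)'
)

match = ITEM_HEAD.search(sample)
link = match.group("link")
# The title is split over two lines in the HTML, so collapse whitespace runs.
title = " ".join(match.group("title").split())

print(link)   # /redirect.asp?newsid=4974
print(title)  # Globalizing and Localizing Windows Applications, Part 1
```

The raw “title” group still contains the line break from the HTML, which is why the sketch normalizes whitespace before using it.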
Getting our Description
The description is a block of text that can contain new lines and is terminated by a “<br>”:
<p\s*class="clsNormalText">
<a\shref="(?<link>.*?)("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)
(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)
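The expression ends with “(<br>(.|\n)*?>)”, which matches the “<br>” that terminates the description and then everything up to the next “>”. That “>” closes the opening span tag, so the match now ends exactly where the next item we want to find begins.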
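The description step can be sketched the same way. The Python below is illustrative only: it uses `(?P<name>...)` groups with lazy quantifiers inside them, the sample's description is shortened to three lines, and the names `sample` and `ITEM` are assumptions.

```python
import re

# The sample news item from .netWire, with the description text shortened.
sample = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    'target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    'Windows Applications, Part 1</a><br>\n'
    'With the explosive growth of the Internet and rapid\n'
    "globalization of the world's economies, the earth is getting\n"
    'smaller and smaller.<br>\n'
    '<span class="clsSubText">Article. Sep 16, 2003.</span></p>'
)

ITEM = re.compile(
    r'<p\s*class="clsNormalText">'
    r'<a\shref="(?P<link>.*?)("\s*target="newwindow")(.|\n)*?>(?P<title>.*?\n?.*?)'
    r'(</a><br>\s*\n*)(?P<description>(.|\n)*?)(<br>(.|\n)*?>)'
)

match = ITEM.search(sample)
description = match.group("description")

print(description)
# The overall match stops right after the opening <span ...> tag,
# which is where the category text begins.
print(repr(match.group(0)[-26:]))
```

Because “(.|\n)*?” is lazy, the description stops at the first “<br>” it meets instead of swallowing the rest of the page.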
Getting our Category
The category of the current news item is usually “Article” or “Product Release”. It always starts right after a “>” sign and ends with a period (“.”):
<p\s*class="clsNormalText"><a\shref="(?<link>.*?)
("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)
(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*?)\.
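The last line of a news item contains two periods, so it matters where the “category” group stops. The short Python sketch below isolates just that tail of the expression. The names `tail`, `GREEDY`, and `LAZY` are made up for the demonstration, and the groups use Python's `(?P<name>...)` syntax.

```python
import re

# The last line of the sample news item.
tail = '<span class="clsSubText">Article. Sep 16, 2003.</span></p>'

# Greedy: .* backs off only as far as the LAST period on the line.
GREEDY = re.compile(r'>(?P<category>.*)\.')
# Lazy: .*? stops at the FIRST period after the ">".
LAZY = re.compile(r'>(?P<category>.*?)\.')

greedy_category = GREEDY.search(tail).group("category")
lazy_category = LAZY.search(tail).group("category")

print(greedy_category)  # Article. Sep 16, 2003
print(lazy_category)    # Article
```

Only the lazy form isolates the category on its own; the greedy form also drags in the date until the rest of the expression is added to constrain it.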
Getting our Publishing Date
The news date follows right after the category's closing period (with zero or more spaces between them) and ends with another period, which is followed by the closing span and p tags:
<p\s*class="clsNormalText"><a\shref="(?<link>.*?)
("\s*target="newwindow")(.|\n)*?>
(?<title>.*?\n?.*?)(</a><br>\s*\n*)(?<description>
(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*?)\.\s*(?<pubDate>.*?)(\.</span></p>)
Our Final Regex
We finish with an expression that lets us scan .netWire's HTML and retrieve a list of matches, each of which contains groups named “link,” “title,” and so on that we can use in our code. Our next step is to transform this pile of data into useful, readable information.
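To make the idea of a list of matches with named groups concrete, here is a minimal Python sketch of the scanning step. It is not the article's implementation: the expression is transcribed into Python's `(?P<name>...)` syntax with each “.*” written lazily inside its group, `extract_items`, `FIELDS`, and `page` are hypothetical names, and the second news item in `page` is invented so that there is more than one match.

```python
import re

ITEM = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)'
    r'("\s*target="newwindow")(.|\n)*?>'
    r'(?P<title>.*?\n?.*?)(</a><br>\s*\n*)(?P<description>'
    r'(.|\n)*?)(<br>(.|\n)*?>)(?P<category>.*?)\.\s*(?P<pubDate>.*?)(\.</span></p>)'
)

FIELDS = ("link", "title", "description", "category", "pubDate")


def extract_items(html):
    """Return one dict per news item, with whitespace runs collapsed."""
    return [
        {name: " ".join(m.group(name).split()) for name in FIELDS}
        for m in ITEM.finditer(html)
    ]


# Hypothetical page fragment: the sample item (description shortened)
# followed by a second, made-up item.
page = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    'target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    'Windows Applications, Part 1</a><br>\n'
    'The author shows you how to make your applications ready for\n'
    'the global marketplace.<br>\n'
    '<span class="clsSubText">Article. Sep 16, 2003.</span></p>\n'
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=1"\n'
    'target="newwindow" class="clsNewsHead">Example Tool 1.0</a><br>\n'
    'A made-up second item.<br>\n'
    '<span class="clsSubText">Product Release. Sep 15, 2003.</span></p>\n'
)

items = extract_items(page)
for item in items:
    print(item["pubDate"], "|", item["category"], "|", item["title"])
```

Each dictionary carries the same five names as the groups in the expression, which keeps the later step of writing RSS elements straightforward.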