Creating a Generic Site-To-Rss Tool

Creating our Scraping Regular Expression

For our RSS feed, we need only a few pieces of data from the HTML of every “post” we intend to include in the feed:

  • Link: The reader clicks it to go to the specific information. In .netWire it's the link to the news item.
  • Title: Appears in the RSS reader that the user reading the posts will use. In .netWire it's the title of the news item.
  • Description: Actual text of an individual post. In .netWire it's the text of the news item.

These items are buried deep inside the HTML of our website. Our job now is to find a regular expression that retrieves them and lets us reference them easily in code. Using our knowledge of groups in Regex, we want a group in the resulting Regex for every item we want to retrieve. We name them “link,” “title,” and “description,” respectively, and later add “category” and “pubDate” groups for the item's type and publishing date.

In developing our Regex, I decided to use Expresso, a tool designed to help with regular expression testing. I'll rely on this piece of HTML, taken from the HTML of .netWire:

<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"
  target="newwindow" class="clsNewsHead">Globalizing and Localizing
  Windows Applications, Part 1</a><br>
  With the explosive growth of the Internet and rapid
  globalization of the world's economies, the earth is getting
  smaller and smaller. The applications that you develop for
  a local market may soon be used in another country. If the
  world used a common language, that would make the life of
  developers much easier. However, reality is far from perfect.
  The author shows you how to make your applications ready for
  the global marketplace.<br>
<span class="clsSubText">Article. Sep 16, 2003.</span></p>

This HTML represents one news item on .netWire, and it is the fragment we need to focus on.

Our first item of business is getting the link of the news item. Why the link first? Because it's the first item in order of appearance, which makes it the least complicated to find.

Getting our Link

Here is the piece from which we want to extract the link:

<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" target="newwindow"
class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a><br>
With the explosive…

We can easily see that each link (and title) is encapsulated between two items:

<p class="clsNormalText"><a href="->Our link" target="newwindow" class="clsNewsHead">

Simply enough, the following regular expression catches all instances of such a link within our HTML file and presents us with a group named “link” that gives us the actual redirection string of the link:

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")

I've used “\s” so that I don't have to declare exactly how many spaces or tabs sit between the tag name and its attributes. Also notice the “?” just before the “"\s*target="newwindow"” section. Its purpose is to make the expression close the match on the first occurrence of that section rather than the last one. Note that a “?” placed after the group's closing parenthesis only makes the group optional; writing “.*?” inside the group is what makes the match lazy. The expression still works on this HTML because “.” does not match a new line by default, so “.*” cannot run past the line that holds the link.
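To see how the link expression behaves, here is a minimal sketch in Python rather than .NET. It assumes Python's `re` module, where a named group is written `(?P<link>...)` instead of `(?<link>...)`. The `html` string is the start of the sample item above; the `one_line` string is a made-up input that puts two links on a single line to compare the greedy group with a lazy `.*?` variant.

```python
import re

# .NET spells a named group (?<link>...); Python's re module spells it (?P<link>...).
LINK_RE = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)?("\s*target="newwindow")'
)

# Opening lines of the sample news item shown earlier.
html = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    '  target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    '  Windows Applications, Part 1</a><br>\n'
)

link = LINK_RE.search(html).group("link")
print(link)

# Hypothetical input with two links on ONE line, to compare greedy and lazy matching.
one_line = (
    '<a href="/a.html" target="newwindow">A</a> '
    '<a href="/b.html" target="newwindow">B</a>'
)
greedy = re.search(r'<a\shref="(?P<link>.*)?("\s*target="newwindow")', one_line)
lazy = re.search(r'<a\shref="(?P<link>.*?)("\s*target="newwindow")', one_line)
print(greedy.group("link"))  # runs on into the second link
print(lazy.group("link"))    # stops at the first closing quote
```

On the sample item the group holds only the redirection string. On the one-line input, the greedy form swallows everything up to the second link, while the lazy form returns just the first one.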

Getting our Title

Now that we have the link, we need to get the title for the link. This one is also relatively easy. The title resides between the “>” that closes the opening link tag and the link's closing tag (“</a>”). We also need to allow for new lines and spaces along the way, so we account for these in our regular expression as well.

Here's the full expression so far:

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)

The expression now also has a group called “title,” so we can refer to it later in code. Notice that the title is made of any number of characters, optionally followed by one new line and more characters, which lets it span two lines.
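The title group can be tried out the same way. This is a sketch under the same assumptions as before: Python's `re` with `(?P<name>...)` groups, and the expression written with no space between the “>” and the title group, as in the sample HTML. Collapsing the captured whitespace with `split` and `join` is my own addition, since the raw capture keeps the new line and the indentation.

```python
import re

TITLE_RE = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)?("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*\n?.*)?(</a><br>\s*\n*)'
)

# Opening lines of the sample news item shown earlier.
html = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    '  target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    '  Windows Applications, Part 1</a><br>\n'
)

match = TITLE_RE.search(html)
raw_title = match.group("title")   # still contains the new line and indentation
title = " ".join(raw_title.split())  # collapse runs of whitespace to one space
print(title)
```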

Getting our Description

The description is a block of text that can contain new lines and is terminated by a “<br>”:

<p\s*class="clsNormalText">
<a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?
(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)

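The end of the expression, “(<br>(.|\n)*?>)”, consumes the closing “<br>” and everything up to the next “>”, which is the end of the opening Span tag. The match therefore stops right where the next item we want to find begins.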
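A short Python sketch of the description step follows, again assuming `re` with `(?P<name>...)` groups. To keep it compact, the sample's description is trimmed to its last sentence; the rest of the HTML is taken from the sample item.

```python
import re

DESC_RE = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)?("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*\n?.*)?(</a><br>\s*\n*)'
    r'(?P<description>(.|\n)*?)(<br>(.|\n)*?>)'
)

# Sample item with the description shortened to its final sentence.
html = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    '  target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    '  Windows Applications, Part 1</a><br>\n'
    '  The author shows you how to make your applications ready for\n'
    '  the global marketplace.<br>\n'
    '<span class="clsSubText">Article. Sep 16, 2003.</span></p>\n'
)

match = DESC_RE.search(html)
description = " ".join(match.group("description").split())
print(description)

# The match ends just after the opening <span ...> tag, where the category begins.
rest = html[match.end():]
print(rest)
```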

Getting our Category

The category of the current news item is usually “Article” or “Product Release”. It starts right after the “>” sign that closes the opening Span tag and ends with a period (“.”):

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?
("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?
(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.
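One detail is worth checking at this stage: the category group uses a greedy “.*”, so on its own it extends to the last period on the line. The Python sketch below uses only the tail of the expression to stay short, and compares it with a stricter `[^.]*` variant. That variant is my own suggestion, not part of the original expression.

```python
import re

# Tail of the sample item: end of the description, then the Span with category and date.
tail = (
    '  the global marketplace.<br>\n'
    '<span class="clsSubText">Article. Sep 16, 2003.</span></p>\n'
)

# Tail of the expression as written (Python spelling of the named group).
greedy_re = re.compile(r'(<br>(.|\n)*?>)(?P<category>.*)?\.')
# Assumed alternative: stop at the first period by excluding periods from the group.
strict_re = re.compile(r'(<br>(.|\n)*?>)(?P<category>[^.]*)\.')

greedy_category = greedy_re.search(tail).group("category")
strict_category = strict_re.search(tail).group("category")

print(greedy_category)  # includes the date as well
print(strict_category)  # only the category word
```

Once more of the pattern follows the category group, backtracking has less room, so the greedy form can still end up with the intended value in the complete expression.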

Getting our Publishing Date

The news date follows right after the category's ending period (with zero or more spaces between them) and ends with another period, which is followed by the closing Span and P tags.

<p\s*class="clsNormalText"><a\shref="(?<link>.*)?
("\s*target="newwindow")(.|\n)*?>
(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>
(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span></p>)
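The complete expression can be checked against the sample item with the Python sketch below (same assumptions: `re` with `(?P<name>...)` groups, description trimmed to its last sentence). Turning the captured text into a date with `strptime` is an extra step of mine, and the `"%b %d, %Y"` layout is an assumption based on the single date visible in the sample.

```python
import re
from datetime import datetime

ITEM_RE = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)?("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*\n?.*)?(</a><br>\s*\n*)'
    r'(?P<description>(.|\n)*?)(<br>(.|\n)*?>)'
    r'(?P<category>.*)?\.\s*(?P<pubDate>.*)?(\.</span></p>)'
)

# Sample item with the description shortened to its final sentence.
html = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    '  target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    '  Windows Applications, Part 1</a><br>\n'
    '  The author shows you how to make your applications ready for\n'
    '  the global marketplace.<br>\n'
    '<span class="clsSubText">Article. Sep 16, 2003.</span></p>\n'
)

match = ITEM_RE.search(html)
category = match.group("category")
pub_date_text = match.group("pubDate")

# Assumed date layout: abbreviated month, day, comma, four-digit year.
pub_date = datetime.strptime(pub_date_text, "%b %d, %Y")

print(category)
print(pub_date_text)
print(pub_date.date().isoformat())
```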

Our Final Regex

We end up with an expression that lets us scan .netWire's HTML and retrieve a list of matches, each of which contains groups named “link,” “title,” “description,” “category,” and “pubDate” that we can use in our code. Our next step is to transform this pile of data into useful, readable information.
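As a rough illustration of scanning a whole page, the Python sketch below runs the final expression over two items and collects the named groups of each match into a dictionary. It is not the article's .NET code. The second item is invented for the example, and the `clean` and `scrape` helpers are hypothetical names.

```python
import re

ITEM_RE = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)?("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*\n?.*)?(</a><br>\s*\n*)'
    r'(?P<description>(.|\n)*?)(<br>(.|\n)*?>)'
    r'(?P<category>.*)?\.\s*(?P<pubDate>.*)?(\.</span></p>)'
)


def clean(text):
    """Collapse new lines and indentation left over from the HTML layout."""
    return " ".join((text or "").split())


def scrape(html):
    """Return one dict per news item, keyed by the expression's group names."""
    return [
        {name: clean(value) for name, value in match.groupdict().items()}
        for match in ITEM_RE.finditer(html)
    ]


# First item: the article's sample (description trimmed). Second item: made up.
page = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    '  target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    '  Windows Applications, Part 1</a><br>\n'
    '  The author shows you how to make your applications ready for\n'
    '  the global marketplace.<br>\n'
    '<span class="clsSubText">Article. Sep 16, 2003.</span></p>\n'
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=1"\n'
    '  target="newwindow" class="clsNewsHead">Example Tool 1.0</a><br>\n'
    '  A made-up second item.<br>\n'
    '<span class="clsSubText">Product Release. Sep 15, 2003.</span></p>\n'
)

items = scrape(page)
for item in items:
    print(item["pubDate"], "|", item["category"], "|", item["title"])
```

Each dictionary carries the five group names as keys, which is a convenient shape for whatever code later writes the feed.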

Comments

  1. 08 Apr 2005 at 10:08

    I'm having trouble changing:
        http://www.dotnetwire.com  (the default URL in MakeRSS)
    to a different URL.



    1)  Uploaded the MakeRSS \bin files to the ASP.net server's BIN folder.
    2)  Uploaded the "dnw.aspx" and "dnw.aspx.vb" files.
    3)  Pointed the browser to:  "http://...../dnw.aspx"


    OK! It works "out of the box".  It shows an RSS feed for    http://www.dotnetwire.com.


    4) Changed the default URL:   http://www.dotnetwire.com
       in file: "dnw.aspx.vb"
       to point to a different URL. Uploaded the changed "dnw.aspx.vb" file.


    5) Reload "http://...../dnw.aspx".
       It works...but it still shows the same RSS feed as before the change.
       It still points to the default URL: http://www.dotnetwire.com


    The URL is "burned-in". How do I change it? Heeeeeeeelp!



  2. 13 Aug 2004 at 01:56

    Well, I have just posted a topic and saw your complication. I am new at RegEx, but I think it has something to do with greediness. RegEx tries to match as much as possible, unless you specify otherwise.
    The trouble is whenever the text is in one single line, which happens a lot.


    In my search for a pattern to match the feeds I wanted to generate, I found this in Mastering Regular Expressions by Jeffrey E. F. Friedl.


    I have just altered it to give the groups names. It should work on every link, and has so far for me:
    <a\b(?<link>[^>]+)>(?<title>.*?)</a>

  3. 13 Aug 2004 at 01:45

    When you grab the html from another homepage it converts the text to UTF8, which is fine.
    However, in Denmark and a lot of other countries we use special characters like Æ Ø Å, which are replaced by meaningless characters, thereby messing up the spelling.


    I know that you should replace special characters by the correct value in your own homepage, so that every browser renders it correctly, but nobody does that.


    So one should either use a Regex replace or a Replace function when it writes the HTML to the file. Well so far so good, but I have enough difficulty just adjusting the code to my means.


    So does anyone know how to do that????


    I hope someone does, since I have just been able to get this working, investing long hours in installing and understanding the class, Regex and the whole .net framework. But then I discovered that most of the homepages had Æ Ø Å in them, which is perfectly natural.

  4. 29 Feb 2004 at 18:23

    You mentioned that the ? before the “"\s*target="newwindow"” section is used to catch the first occurrence, so why am I having the following problem?


    Pattern = <a\shref="(?<link>.*)?(">)
    String = <a href="/news_1.html">News 1</a>  <a href="myPage.html">My Page</a>



    Why does it return "/news_1.html">News 1</a>  <a href="myPage.html" ?
    and Not 2 groups of link.



    Please help..
    Thanks



Roy Osherove has spent the past 6+ years developing data driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topics, ...