Creating a Generic Site-To-Rss Tool

Introduction

I'll show how to use regular expressions to parse a Web page's HTML text into manageable chunks of data. That data will be converted and written as an RSS feed for the whole world to consume. Finally, I'll show how to create a generic tool that enables you to automatically generate an RSS feed from any website, given a small group of parameters. At the end of the day we will have a working RSS feed for www.dotnetwire.com.

Ah, the joys of RSS. You can get the data you need, as soon as it's available, with no nagging browsers or pop-ups along the way. If only all sites had RSS feeds. What I'd really like is the ability to generate an RSS feed from any site I want. For example, .netWire is a very interesting site with lots of useful information. However, the folks maintaining the site haven't provided it with an RSS feed, which it so sorely needs.

So I got to thinking: “All the data on the site that's important to me seems to be arranged in an orderly and predictable manner. I should be able to parse it in a fairly easy manner and make it into an RSS feed.” So I started trying. It worked out pretty well. So well that I've come up with a way to let you do your own site scraping using a generic tool, providing it with only simple rules expressed as a single regular expression.

Planning a Site Scrape

“Site scraping” means going over a site's HTML and mining it for any relevant data. All other text is discarded. For this article, I've chosen .netWire as the site I'll be scraping, since the outcome will be useful to a great many people. In planning the scraping I'll ignore the specifics of how I actually get the text to parse and leave that topic for the end of the article.

The first thing I did was open my Web browser on the .netWire site, right-click and select “view source.” Notepad shows me the site as my future parser sees it. This raw text is the juice I need to parse in order to get the data I need. To be honest, it looks quite scary. How on earth am I going to come up with an easy way to parse such an enormous amount of information without losing my head? Scrolling through the text, however, I could start to see patterns in which important text, text that was relevant to me, appeared.

There were links inside paragraphs, followed by SPANs and many more attributes. It was a nightmare to parse. Just writing all the rules for finding a specific link or title for the RSS feed that I wanted to create was hard enough, but I also had lots more with which to contend. I had to find text inside of where I found text inside of where I found text. It was hardly a job for a few hours on the weekend. So the next thing I decided to check was whether I could do the job with regular expressions.

Regex–A Powerful Scraping Tool

If you don't know what regular expressions are, there are loads of articles on the subject. You need to understand them before reading how to use them for scraping a site.

Regular expressions enable us to easily extract necessary information from text. Through complex expressions provided as plain text, they let us recover strings that match lots and lots of rules that we supply. The data we receive back after running our expressions on a string can be as complex and as detailed as we'd like. We can even divide it into groups of matched text, with group names attached to them, enabling us to easily program against the regular expression (Regex) interface (see “Practical Parsing Using Groups” for more info).
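As a rough illustration of named groups, here is a minimal sketch in Python rather than .NET. The HTML fragment and the group names link and title are made up for the example, and Python spells a named group (?P<name>...) where .NET writes (?<name>...).

```python
import re

# Hypothetical HTML fragment, invented for this illustration.
html = '<a href="/news_1.html">News 1</a> <a href="myPage.html">My Page</a>'

# Two named groups: "link" captures the href value, "title" the anchor text.
# Python writes named groups as (?P<name>...); .NET writes (?<name>...).
pattern = re.compile(r'<a\s+href="(?P<link>[^"]*)"[^>]*>(?P<title>.*?)</a>')

# Each match exposes its groups by name, so no index counting is needed.
items = [m.groupdict() for m in pattern.finditer(html)]

for item in items:
    print(item["title"], "->", item["link"])
```

Because the code asks for groups by name, the pattern can be reshuffled later without touching the code that reads the results.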

Since a site is ultimately represented as plain text (be it HTML, JavaScript, or anything else), we can apply regular expressions to that text as well, enabling us to find the data we want and filter out any irrelevant information quickly and easily.
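To make the idea concrete, the following is a minimal Python sketch of the overall flow: one regular expression with link and title groups is run over a page's text, and every match becomes an RSS item. It is not the tool described in this article. The function name build_rss, the sample page, and the example.com URLs are all assumptions made up for the illustration.

```python
import re
import xml.etree.ElementTree as ET


def build_rss(html, pattern, feed_title, feed_link):
    """Turn every match of `pattern` in `html` into an RSS 2.0 <item>.

    `pattern` is assumed to define two named groups: link and title.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped from " + feed_link

    # Everything the pattern does not match is simply ignored.
    for m in re.finditer(pattern, html, re.IGNORECASE | re.DOTALL):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = m.group("title").strip()
        ET.SubElement(item, "link").text = m.group("link")

    return ET.tostring(rss, encoding="unicode")


# Hypothetical page text standing in for a downloaded site.
sample_page = """
<html><body>
<p><a href="http://example.com/a">First story</a> <span>noise</span></p>
<p><a href="http://example.com/b">Second story</a></p>
</body></html>
"""

# The single "rule" that drives the scrape (Python named-group syntax).
LINK_RULE = r'<a\s+href="(?P<link>[^"]*)"[^>]*>(?P<title>.*?)</a>'

feed = build_rss(sample_page, LINK_RULE, "Example feed", "http://example.com/")
print(feed)
```

Keeping the rule as a plain string parameter is what makes the approach generic: pointing the sketch at a different site should only require a different page text and a different pattern.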

Comments

  1. 08 Apr 2005 at 10:08

    I'm having trouble changing:
        http://www.dotnetwire.com  (the default URL in MakeRSS)
    to a different URL.



    1)  Uploaded the MakeRSS \bin files to the ASP.net server's BIN folder.
    2)  Uploaded the "dnw.aspx" and "dnw.aspx.vb" files.
    3)  Pointed the browser to:  "http://...../dnw.aspx"


    OK! It works "out of the box".  It shows an RSS feed for    http://www.dotnetwire.com.


    4) Changed the default URL:   http://www.dotnetwire.com
       in file: "dnw.aspx.vb"
       to point to a different URL. Uploaded the changed "dnw.aspx.vb" file.


    5) Reload "http://...../dnw.aspx".
       It works...but it still shows the same RSS feed as before the change.
       It still points to the default URL: http://www.dotnetwire.com


    The URL is "burned-in". How do I change it? Heeeeeeeelp!



  2. 13 Aug 2004 at 01:56

    Well, I have just posted a topic and saw your complication. I am new at RegEx, but I think it has something to do with greediness. RegEx tries to match as much as possible, unless you specify otherwise.
    The trouble comes whenever the text is on one single line, which happens a lot.


    In my search for a way to match the feeds I wanted to generate, I found this in Mastering Regular Expressions by Jeffrey E. F. Friedl.


    I have just altered it to give the groups names. It should work on every link, and has so far for me:
    <a\b(?<link>[^>]+)>(?<title>.*?)</a>

  3. 13 Aug 2004 at 01:45

    When you grab the html from another homepage it converts the text to UTF8, which is fine.
    However, in Denmark and a lot of other countries we use special characters like Æ Ø Å, which are replaced by meaningless characters, thereby messing up the spelling.


    I know that you should replace special characters with the correct value in your own homepage, so that every browser renders it correctly, but nobody does that.


    So one should either use a Regex replace or a Replace function when it writes the HTML to the file. Well so far so good, but I have enough difficulty just adjusting the code to my means.


    So does anyone know how to do that????


    I hope someone does, since I have just been able to get this working, investing long hours in installing and understanding the class, Regex and the whole .NET Framework. But then I discovered that most of the homepages had Æ Ø Å in them, which is perfectly natural.

  4. 29 Feb 2004 at 18:23

    You mentioned that the ? before the “"\s*target="newwindow"” section is used to catch the first occurrence, so why am I having the following problem?


    Pattern = <a\shref="(?<link>.*)?(">)
    String = <a href="/news_1.html">News 1</a>  <a href="myPage.html">My Page</a>



    Why does it return "/news_1.html">News 1</a>  <a href="myPage.html"
    and not two separate link groups?



    Please help..
    Thanks

  5. 01 Jan 1999 at 00:00

    This thread is for discussions of Creating a Generic Site-To-Rss Tool.
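
The greediness question raised in the comments above can be reproduced with a short sketch. It is written in Python, so named groups use (?P<name>...) instead of the .NET (?<name>...) form; the pattern and the test string are the ones quoted in the question. The likely cause is the position of the ?: placed after the closing parenthesis it makes the whole group optional and leaves .* greedy, while placed directly after .* it makes the quantifier lazy.

```python
import re

# Test string quoted in the comment thread.
text = '<a href="/news_1.html">News 1</a>  <a href="myPage.html">My Page</a>'

# Pattern as posted: the ? follows the group, so the group is optional
# but .* is still greedy and runs to the LAST "> on the line.
greedy = re.compile(r'<a\shref="(?P<link>.*)?(">)')

# The ? directly after .* makes it lazy: it stops at the FIRST ">.
lazy = re.compile(r'<a\shref="(?P<link>.*?)">')

greedy_links = [m.group("link") for m in greedy.finditer(text)]
lazy_links = [m.group("link") for m in lazy.finditer(text)]

print(greedy_links)  # one long match spanning both anchors
print(lazy_links)    # one entry per anchor
```

An alternative that avoids the greedy/lazy question altogether is a negated character class such as [^"]* for the href value, which cannot run past the closing quote.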

Roy Osherove has spent the past 6+ years developing data driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topics, ...
