Creating a Generic Site-To-Rss Tool

Wrapping Up

Link Prefix

One property in the class needs explaining. LinksPrefix contains the prefix prepended to each news item link that is discovered. When harvesting the HTML, notice that most links on a site are not “full” (absolute) links but “partial” (relative) links that point somewhere within the same site. In such cases (and .netWire is one of them), we set LinksPrefix to http://www.dotnetwire.com to make the news item links “full” again.
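The prefixing idea can be sketched in a few lines. This is a Python illustration rather than the RSSCreator source; the helper name make_full_link is hypothetical, and leaving already-absolute links untouched is an assumption that goes beyond what the article describes.

```python
def make_full_link(links_prefix, link):
    """Return an absolute URL for a link harvested from the page."""
    # Assumption: links that already carry a scheme are left untouched.
    if link.startswith(("http://", "https://")):
        return link
    # Join with exactly one slash between the prefix and the partial link.
    return links_prefix.rstrip("/") + "/" + link.lstrip("/")


full = make_full_link("http://www.dotnetwire.com", "/news_1.html")
```

Stripping the slashes on both sides avoids a doubled or missing "/" regardless of how the prefix and the partial link are written.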

RFC Date Formats

RSS 2.0 requires the publish date to be formatted as an RFC 822 date. We use the RFC 1123 format, which is the RFC 822 date format with a four-digit year; the RSS 2.0 specification allows two- or four-digit years and prefers four.
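To show what such a date looks like, here is a minimal Python sketch using the standard library's email.utils.format_datetime. The function name rss_pub_date is hypothetical, and treating timestamps without a time zone as UTC is an assumption made only for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def rss_pub_date(moment):
    """Format a datetime as an RFC 1123 style date suitable for <pubDate>."""
    if moment.tzinfo is None:
        # Assumption: naive timestamps are taken to be UTC already.
        moment = moment.replace(tzinfo=timezone.utc)
    # Normalise to UTC so the result ends in "GMT".
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


pub_date = rss_pub_date(datetime(2003, 4, 7, 12, 30, 0, tzinfo=timezone.utc))
# pub_date is "Mon, 07 Apr 2003 12:30:00 GMT"
```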

Using the Generic Class with .netWire

It's ready. Let's use it! Here's a short snippet that uses the class to parse .netWire and return an RSS XML feed from it:

' Create the generator for the site we want to parse.
Dim rss As RSSCreator.RSSCreator = _
    New RSSCreator.RSSCreator("http://www.dotnetwire.com")

With rss
    ' Partial links are made full again by prepending the site URL.
    .LinksPrefix = rss.UrlToParse

    ' The pattern is split with "& _" only for readability;
    ' it is one regular expression with no embedded line breaks.
    .RegexPattern = "<p\s*class=""clsNormalText""><a\shref=""(?<link>.*)?" & _
        "(""\s*target=""newwindow"")(.|\n)*?>(?<title>.*\n?.*)?" & _
        "(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)" & _
        "(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span>)"

    .RSSFeedName = "The unofficial .NetWire RSS feed"
    .RssFeedLink = "http://www.DotNetWire.com"
    .RssFeedDescription = "A basic feed that parses the .NetWire site"
    .RssFeedCopyright = "Copyright 2003 Roy Osherove"

    Return .GetRss()
End With
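For readers who want to see how named groups can turn harvested HTML into feed items, here is a rough Python sketch. It is not the RSSCreator implementation: the sample markup, the simplified pattern, and the function name items_to_rss are all hypothetical, and Python writes named groups as (?P<name>...) where .NET uses (?<name>...).

```python
import re
from xml.sax.saxutils import escape

# Hypothetical, simplified markup; the real .netWire HTML is more involved.
SAMPLE_HTML = (
    '<p class="item"><a href="/news_1.html">First headline</a><br>'
    'Short summary of the first story.</p>\n'
    '<p class="item"><a href="/news_2.html">Second headline</a><br>'
    'Short summary of the second story.</p>\n'
)

# One named group per RSS element we want to fill in.
ITEM_PATTERN = re.compile(
    r'<p class="item"><a href="(?P<link>[^"]*)">(?P<title>.*?)</a><br>'
    r'(?P<description>.*?)</p>',
    re.DOTALL,
)


def items_to_rss(html, links_prefix):
    """Build one <item> element per regex match found in the page."""
    parts = []
    for match in ITEM_PATTERN.finditer(html):
        parts.append(
            "<item>"
            f"<title>{escape(match.group('title'))}</title>"
            f"<link>{escape(links_prefix + match.group('link'))}</link>"
            f"<description>{escape(match.group('description'))}</description>"
            "</item>"
        )
    return "".join(parts)


rss_items = items_to_rss(SAMPLE_HTML, "http://www.dotnetwire.com")
```

Escaping each captured value keeps characters such as & and < from breaking the generated XML.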

What's in the Download?

The download contains several projects:

  • RssCreator: Library with the source for the RSSCreator class.
  • MakeRss: Simple ASP.NET project that retrieves a feed for the .netWire site.
  • SiteToRss: Simple WinForms utility for testing regular expressions against various sites.

Comments

  1. 08 Apr 2005 at 10:08

    I'm having trouble changing:
        http://www.dotnetwire.com  (the default URL in MakeRSS)
    to a different URL.



    1)  Uploaded the MakeRSS \bin files to the ASP.net server's BIN folder.
    2)  Uploaded the "dnw.aspx" and "dnw.aspx.vb" files.
    3)  Pointed the browser to:  "http://...../dnw.aspx"


    OK! It works "out of the box".  It shows an RSS feed for    http://www.dotnetwire.com.


    4) Changed the default URL:   http://www.dotnetwire.com
       in file: "dnw.aspx.vb"
       to point to a different URL. Uploaded the changed "dnw.aspx.vb" file.


    5) Reload "http://...../dnw.aspx".
       It works...but it still shows the same RSS feed as before the change.
       It still points to the default URL: http://www.dotnetwire.com


    The URL is "burned-in". How do I change it? Heeeeeeeelp!



  2. 13 Aug 2004 at 01:56

    Well, I have just posted a topic and saw your complication. I am new to RegEx, but I think it has something to do with greediness. RegEx tries to match as much as possible, unless you specify otherwise.
    The trouble shows up whenever the text is on one single line, which happens a lot.


    In my search for a pattern to match the feeds I wanted to generate, I found this in Mastering Regular Expressions by Jeffrey E. F. Friedl.


    I have just altered it to give the groups names. It should work on every link, and has so far for me:
    <a\b(?<link>[^>]+)>(?<title>.*?)</a>

  3. 13 Aug 2004 at 01:45

    When you grab the HTML from another homepage it converts the text to UTF8, which is fine.
    However, in Denmark and a lot of other countries we use special characters like Æ Ø Å, which are replaced by meaningless characters, thereby messing up the spelling.


    I know that you should replace special characters with the correct value in your own homepage, so that every browser renders it correctly, but nobody does that.


    So one should either use a Regex replace or a Replace function when it writes the HTML to the file. Well so far so good, but I have enough difficulty just adjusting the code to my means.


    So does anyone know how to do that????


    I hope someone does, since I have only just been able to get this working, investing long hours in installing and understanding the class, Regex and the whole .net framework. But then I discovered that most of the homepages had Æ Ø Å in them, which is perfectly natural.

  4. 29 Feb 2004 at 18:23

    As you mentioned, the ? before the “"\s*target="newwindow"” section is used to catch the first occurrence, so why am I having the following problem?


    Pattern = <a\shref="(?<link>.*)?(">)
    String = <a href="/news_1.html">News 1</a>  <a href="myPage.html">My Page</a>



    Why does it return "/news_1.html">News 1</a>  <a href="myPage.html" ?
    and not two separate link groups?



    Please help..
    Thanks

  5. 01 Jan 1999 at 00:00

    This thread is for discussions of Creating a Generic Site-To-Rss Tool.
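The greedy-matching behaviour discussed in the comments above can be reproduced in a few lines. The following is a Python sketch (Python writes named groups as (?P<name>...) instead of .NET's (?<name>...)), using the pattern and test string from the question. In (?<link>.*)? the trailing ? only makes the whole group optional; the .* inside it stays greedy. Placing the ? directly after the * makes the quantifier lazy.

```python
import re

html = '<a href="/news_1.html">News 1</a>  <a href="myPage.html">My Page</a>'

# "?" after the group: the group is optional, but ".*" is still greedy.
greedy = re.compile(r'<a\shref="(?P<link>.*)?(">)')

# "?" directly after "*": the quantifier is lazy and stops at the first '">'.
lazy = re.compile(r'<a\shref="(?P<link>.*?)(">)')

greedy_links = [m.group("link") for m in greedy.finditer(html)]
lazy_links = [m.group("link") for m in lazy.finditer(html)]

# greedy_links holds a single string that runs from the first href
# through to the end of the second one.
# lazy_links is ['/news_1.html', 'myPage.html']
```

An alternative that avoids backtracking altogether is to match [^"]* inside the quotes, since a link cannot contain the closing quote.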

Roy Osherove Roy Osherove has spent the past 6+ years developing data driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topics, ...