
Creating a Generic Site-To-Rss Tool

10 Feb 2004 | by Roy Osherove


Introduction
Creating our Scraping Regular Expression
Making an RSS feed Out of It
Building the Generic SiteToRSS Class
Writing the XML
Wrapping Up

Creating our Scraping Regular Expression

For our RSS feed, we need only a few pieces of data from the HTML for every “post” we intend to create in the feed:

Link: The URL the reader clicks in a post to reach the full item. In .netWire it's the link to the news item.
Title: The headline shown in the RSS reader that the user reads the posts with. In .netWire it's the title of the news item.
Description: The actual text of an individual post. In .netWire it's the text of the news item.

These items are buried deep inside the HTML of our website. Our job now is to find a regular expression that retrieves them and lets us reference them easily from code. Using our knowledge of groups in Regex, we want a named group in the resulting expression for every item we retrieve. The three items above become the groups “link,” “title,” and “description.” Along the way we also capture each item's category and publishing date, in groups named “category” and “pubDate.”

To develop our Regex, I used Expresso, a tool designed to help with regular expression testing. I'll work from this piece of HTML, taken from .netWire:

<a href="/redirect.asp?newsid=4974" target="newwindow" class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a> With the explosive growth of the Internet and rapid globalization of the world's economies, the earth is getting smaller and smaller. The applications that you develop for a local market may soon be used in another country. If the world used a common language, that would make the life of developers much easier. However, reality is far from perfect. The author shows you how to make your applications ready for the global marketplace. Article. Sep 16, 2003.

This HTML represents one news item on .netWire, and it is the fragment we will focus on.

Our first item of business is getting the link of the news item. Why the link first? Because it's the first item in order of appearance, which makes it the least complicated to find.

Getting our Link

Here is the piece from which we want to extract the link:

<a href="/redirect.asp?newsid=4974" target="newwindow" class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a> With the explosive…

We can easily see that each link is enclosed between two fixed pieces of markup, the opening “<a href="” and the “" target="newwindow"” that follows it:

<a href="->Our link" target="newwindow" class="clsNewsHead">

Simply enough, the following regular expression catches all instances of such a link within our HTML file and presents us with a group named “link” that gives us the actual redirection string of the link:

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")

I've used “\s” and “\s*” so that I don't have to declare exactly how many spaces or tabs sit between the tag name and its attributes. Also notice the “?” right after the “.*” inside the “link” group. It makes the quantifier lazy, so the group closes on the first “"\s*target="newwindow"” that follows instead of the last one. A greedy “.*” would run on to the last such occurrence it can reach and swallow everything in between.
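To see the first-versus-last behaviour in isolation, here is a minimal sketch in Python. It is only an illustration, not the article's .NET code: Python spells named groups “(?P<name>...)” where .NET uses “(?<name>...)”, and the two-link sample string is made up for the demonstration.

```python
import re

# Hypothetical one-line sample holding two links (not taken from .netWire).
html = (
    '<a href="/redirect.asp?newsid=1" target="newwindow">First</a> '
    '<a href="/redirect.asp?newsid=2" target="newwindow">Second</a>'
)

# Greedy: ".*" runs to the end of the line, then backs up to the LAST terminator.
greedy = re.search(r'<a\shref="(?P<link>.*)"\s*target="newwindow"', html)

# Lazy: ".*?" grows one character at a time and stops at the FIRST terminator.
lazy = re.search(r'<a\shref="(?P<link>.*?)"\s*target="newwindow"', html)

print("greedy:", greedy.group("link"))
print("lazy:  ", lazy.group("link"))
```

The greedy capture spans both anchors, while the lazy one holds only the first redirection string.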

Getting our Title

Now that we have the link, we need the title that goes with it. This one is also relatively easy: the title sits between the “>” that closes the opening anchor tag and the link's closing tag (“</a>”). We also need to allow for new lines and spaces along the way, so we build these into our regular expression as well.

Here's the full expression so far:

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)(</a> \s*\n*)

We now have a second group in there, called “title,” so we can refer to it later in code. Notice that the title is made of any number of characters, followed by an optional new line and more characters. Both “.*” parts are lazy (“.*?”), so the title ends at the first “</a>” rather than running on to a later one.
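As a quick check of the expression so far, the sketch below runs it over the sample item in Python. Treat it as an illustration under assumptions: the “<p class="clsNormalText">” wrapper is inferred from the regular expression itself, the description is cut short, lazy quantifiers are used inside both groups, and Python writes named groups as “(?P<name>...)”.

```python
import re

# Sample item; the <p> wrapper is an assumption inferred from the pattern.
html = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" '
    'target="newwindow" class="clsNewsHead">Globalizing and Localizing '
    'Windows Applications, Part 1</a> With the explosive growth...'
)

LINK_TITLE_RE = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)"\s*target="newwindow"'
    r'(.|\n)*?>(?P<title>.*?\n?.*?)</a> \s*\n*'
)

match = LINK_TITLE_RE.search(html)
link = match.group("link")    # text inside href="..."
title = match.group("title")  # text between the anchor's ">" and "</a>"
print(link)
print(title)
```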

Getting our Description

The description is a block of text that can contain new lines and is terminated by a “<br>”:

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)(</a> \s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)

The end of the expression already matches the beginning of the next piece we want to find: everything from the “<br>” up to the “>” that precedes the category.
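The reason for writing “(.|\n)” instead of a plain “.” inside the description group can be shown with a small Python sketch. The fragment is a made-up example in which the description wraps onto a second line, and the named-group syntax is Python's “(?P<name>...)” rather than .NET's.

```python
import re

# Hypothetical fragment: the description wraps onto a second line before <br>.
fragment = 'Part 1</a> With the explosive growth\nof the Internet.<br><span>'

# "." never matches "\n" by default, so this cannot reach the <br> on line two.
single_line = re.search(r'</a> (?P<description>.*?)<br>', fragment)

# "(.|\n)" accepts any character including "\n"; "*?" stops at the first <br>.
multi_line = re.search(r'</a> (?P<description>(.|\n)*?)<br>', fragment)

print(single_line)                            # no match object
print(repr(multi_line.group("description")))  # both lines of the description
```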

Getting our Category

The category of the current news item is usually “Article” or “Product Release”. It always starts with the “>” sign and ends with a period (“.”):

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)(</a> \s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*?)\.

Getting our Publishing Date

The news date follows right after the category's ending period (with zero or more spaces between them) and ends with another period, which is followed by the closing Span and P tags.

<p\s*class="clsNormalText"><a\shref="(?<link>.*?)("\s*target="newwindow")(.|\n)*?>(?<title>.*?\n?.*?)(</a> \s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*?)\.\s*(?<pubDate>.*?)(\.)
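The category and date both end in a period, so it matters which period each group stops at. The Python sketch below isolates that tail of the expression. The span's class name is hypothetical (the article only says the item ends with closing Span and P tags), and named groups use Python's “(?P<name>...)” form.

```python
import re

# Hypothetical tail of one news item; the class name is made up for the example.
tail = '<span class="clsNewsDate">Article. Sep 16, 2003.</span></p>'

# Lazy groups: each one stops at the first period it meets.
lazy = re.search(r'>(?P<category>.*?)\.\s*(?P<pubDate>.*?)\.', tail)

# Greedy category on its own: ".*" backs up only to the LAST period on the line.
greedy = re.search(r'>(?P<category>.*)\.', tail)

print(lazy.group("category"), "|", lazy.group("pubDate"))
print(greedy.group("category"))
```

With the lazy groups the category and the date come out separately. The greedy variant folds the date into the category.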

Our Final Regex

We finish with an expression that lets us scan .netWire's HTML and retrieve a list of matches, each of which contains groups named “link,” “title,” “description,” “category,” and “pubDate” that we can use in our code. Our next step is to transform this pile of data into useful, readable information.
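Putting the pieces together, a scan over a page could look like the following Python sketch. It is not the article's .NET implementation, and several things in it are assumptions: the “<br>” and “<span>” markup and the class name “clsNewsDate” are guesses at the page structure, the second item is placeholder data, every capture is lazy, and named groups are written the Python way.

```python
import re

ITEM_RE = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)"\s*target="newwindow"'
    r'(.|\n)*?>(?P<title>.*?\n?.*?)</a> \s*\n*'
    r'(?P<description>(.|\n)*?)<br>(.|\n)*?>'
    r'(?P<category>.*?)\.\s*(?P<pubDate>.*?)\.'
)

# Two items: the first is the article's sample (shortened), the second is
# placeholder data. Tag layout and class names are assumptions.
page = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" '
    'target="newwindow" class="clsNewsHead">Globalizing and Localizing '
    'Windows Applications, Part 1</a> The author shows you how to make your '
    'applications ready for the global marketplace.<br>'
    '<span class="clsNewsDate">Article. Sep 16, 2003.</span></p>\n'
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=0000" '
    'target="newwindow" class="clsNewsHead">Placeholder Title</a> '
    'Placeholder description.<br>'
    '<span class="clsNewsDate">Product Release. Sep 17, 2003.</span></p>\n'
)

# groupdict() keeps only the named groups, one dictionary per news item.
items = [m.groupdict() for m in ITEM_RE.finditer(page)]
for item in items:
    print(item["pubDate"], "|", item["category"], "|", item["title"])
```

Each dictionary then carries everything one feed entry needs, which is the shape the next step can consume.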

About the author

Roy Osherove

Roy Osherove has spent the past 6+ years developing data driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topic...

www.iserializable.com
