Creating a Generic Site-To-Rss Tool
- Introduction
- Creating our Scraping Regular Expression
- Making an RSS feed Out of It
- Building the Generic SiteToRSS Class
- Writing the XML
- Wrapping Up
Creating our Scraping Regular Expression
For our RSS feed, we need only a few pieces of data from the HTML for every “post” we intend to create in the feed:
- Link: What the reader clicks to go to the full item. In .netWire, it's the link to the news item.
- Title: The headline shown in the user's RSS reader. In .netWire, it's the title of the news item.
- Description: The actual text of an individual post. In .netWire, it's the text of the news item.
These items are buried deep inside the HTML of our website. Our job now is to write a regular expression that retrieves them and lets us reference them easily in code. Using our knowledge of groups in Regex, we want a named group in the resulting Regex for every item we retrieve. We name the three groups above “link,” “title,” and “description”; along the way we also capture the item's category and publishing date in groups named “category” and “pubDate.”
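As a rough illustration of what named groups buy us, here is a minimal sketch in Python. It is not the article's .NET code: Python writes a named group as (?P<name>...) where .NET writes (?<name>...), and the simplified pattern and one-line HTML string below are assumptions made up for the example.

```python
import re

# Simplified, hypothetical pattern: one named group per item we want to read back.
# Python syntax is (?P<name>...); the .NET form used in this article is (?<name>...).
pattern = re.compile(r'<a\s+href="(?P<link>[^"]*)"[^>]*>(?P<title>[^<]*)</a>')

# One-line stand-in for a news item link (assumed layout, not the real page).
html = (
    '<a href="/redirect.asp?newsid=4974" class="clsNewsHead">'
    'Globalizing and Localizing Windows Applications, Part 1</a>'
)

match = pattern.search(html)

# Groups are read back by name instead of by position.
link = match.group("link")
title = match.group("title")
print(link)
print(title)
```

Reading groups by name keeps the calling code stable even when more groups are added to the expression later.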
In developing our Regex, I decided to use Expresso, a tool designed to help with regular expression testing. I'll rely on this piece of HTML, taken from the HTML of .netWire:
<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"
target="newwindow" class="clsNewsHead">Globalizing and Localizing
Windows Applications, Part 1</a><br>
With the explosive growth of the Internet and rapid
globalization of the world's economies, the earth is getting
smaller and smaller. The applications that you develop for
a local market may soon be used in another country. If the
world used a common language, that would make the life of
developers much easier. However, reality is far from perfect.
The author shows you how to make your applications ready for
the global marketplace.<br>
<span class="clsSubText">Article. Sep 16, 2003.</span></p>
This HTML represents one news item on .netWire, and it is the piece we need to focus on.
Our first item of business is getting the link of the news item. Why the link first? Because it's the first item in order of appearance, which makes it the least complicated to find.
Getting our Link
Here is the piece from which we want to extract the link:
<p class="clsNormalText"><a href="/redirect.asp?newsid=4974" target="newwindow"
class="clsNewsHead">Globalizing and Localizing Windows Applications, Part 1</a><br>
With the explosive…
We can easily see that each link (and title) is encapsulated between two items:
<p class="clsNormalText"><a href="->Our link" target="newwindow" class="clsNewsHead">
Simply enough, the following regular expression catches all instances of such a link within our HTML file and gives us a group named “link” that holds the actual redirection string of the link:
<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")
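I've added “\s” so that I don't have to declare exactly how many spaces, tabs, or line breaks sit between the tag name and its attributes. Also notice the “?” before the “"\s*target="newwindow"” section. It is there so the expression closes on the first occurrence and not the last one (otherwise it could match everything up to the last link at the end of the file). Strictly speaking, a “?” placed after the group's closing parenthesis makes the whole “link” group optional rather than lazy. The match still stops at the first occurrence in the sample HTML because “.” does not match a new line, so “.*” never runs past the current line. If several links can share one line, the quantifier itself has to be lazy: “(?<link>.*?)”.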
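To check the link expression against the sample, here is a small Python sketch. It is a port, not the author's code: the only change to the first pattern is Python's (?P<link>...) spelling, and the second pattern with a lazy “.*?” is a variant added for comparison.

```python
import re

# First lines of the sample item, with href and target on separate lines.
html = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    'target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    'Windows Applications, Part 1</a><br>\n'
)

# The pattern as written in the article, ported to Python's named-group syntax.
as_written = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)?("\s*target="newwindow")'
)

# Variant with a lazy quantifier inside the group (an assumption for comparison).
lazy = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*?)("\s*target="newwindow")'
)

link_as_written = as_written.search(html).group("link")
link_lazy = lazy.search(html).group("link")
print(link_as_written)
print(link_lazy)
```

On this multi-line sample both patterns capture the same link; they only start to differ when two links sit on the same line.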
Getting our Title
Now that we have the link, we need to get the title for the link. This one is also relatively easy. The title resides between the “>” that closes the opening anchor tag and the link's closing tag (“</a>”). We also have to allow for new lines and spaces along the way, so we account for these in our regular expression as well.
Here's the full expression so far:
<p\s*class="clsNormalText"><a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?(</a><br>\s*\n*)
And we now have a group called “title” as well, so we can refer to it later in code. Notice that the title is made of any number of characters, optionally followed by a single new line and more characters.
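Here is a hedged Python sketch of the expression so far, run against the start of the sample item. The pattern is a straight port (only the named-group spelling changes); the squash helper that folds the captured line break into a space is a hypothetical addition, not something the article defines.

```python
import re

# Start of the sample item; the title wraps onto a second line.
html = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    'target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    'Windows Applications, Part 1</a><br>\n'
)

# Link-and-title expression from the article, ported to (?P<name>...) syntax.
pattern = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)?("\s*target="newwindow")'
    r'(.|\n)*?>(?P<title>.*\n?.*)?(</a><br>\s*\n*)'
)

def squash(text):
    """Collapse line breaks and runs of spaces into single spaces (hypothetical helper)."""
    return " ".join(text.split())

match = pattern.search(html)
raw_title = match.group("title")   # still contains the line break from the page
title = squash(raw_title)
print(title)
```

The raw group keeps the line break from the page, so some normalizing step like this is usually wanted before the text goes into a feed.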
Getting our Description
The description is a block of text that can contain new lines and is terminated by a “<br>”:
<p\s*class="clsNormalText">
<a\shref="(?<link>.*)?("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?
(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)
The end of the expression already matches the start of the next piece we want to find: the “<br>” and everything up to the “>” that closes the opening Span tag.
Getting our Category
The category of the current news item is usually “Article” or “Product Release”. It starts right after the “>” sign that closes the opening Span tag and ends with a period (“.”):
<p\s*class="clsNormalText"><a\shref="(?<link>.*)?
("\s*target="newwindow")(.|\n)*?>(?<title>.*\n?.*)?
(</a><br>\s*\n*)(?<description>(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.
Getting our Publishing Date
The news date follows right after the category's ending period (with zero or more spaces between them) and finishes with another period, which is followed by the closing Span and P tags:
<p\s*class="clsNormalText"><a\shref="(?<link>.*)?
("\s*target="newwindow")(.|\n)*?>
(?<title>.*\n?.*)?(</a><br>\s*\n*)(?<description>
(.|\n)*?)(<br>(.|\n)*?>)(?<category>.*)?\.\s*(?<pubDate>.*)?(\.</span></p>)
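The “pubDate” group returns the date as the page prints it (“Sep 16, 2003” in the sample), while RSS 2.0 expects dates in RFC 822 form. A minimal Python sketch of that conversion follows; the midnight-UTC timestamp is an assumption, since the page gives neither a time nor a time zone.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

raw_pub_date = "Sep 16, 2003"  # value of the "pubDate" group for the sample item

# %b expects an English month abbreviation under the default C locale.
parsed = datetime.strptime(raw_pub_date, "%b %d, %Y")

# The page carries no time of day or zone; midnight UTC is an assumption.
rss_pub_date = format_datetime(parsed.replace(tzinfo=timezone.utc))
print(rss_pub_date)
```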
Our Final Regex
We finish with a regular expression that lets you scan .netWire's HTML and retrieve a list of matches, each of which contains groups named “link,” “title,” etc. that we can use in our code. Our next step is to transform this pile of data into useful, readable information.
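To see the whole expression at work, here is a sketch in Python rather than the article's .NET code. The pattern is the final expression with the wrapped lines joined and the named groups respelled as (?P<name>...); the sample description is shortened for space, and the squash helper is a hypothetical addition.

```python
import re

# One news item from the article, with the description shortened for space.
html = (
    '<p class="clsNormalText"><a href="/redirect.asp?newsid=4974"\n'
    'target="newwindow" class="clsNewsHead">Globalizing and Localizing\n'
    'Windows Applications, Part 1</a><br>\n'
    'With the explosive growth of the Internet and rapid\n'
    "globalization of the world's economies, the earth is getting\n"
    'smaller and smaller.<br>\n'
    '<span class="clsSubText">Article. Sep 16, 2003.</span></p>\n'
)

# Final expression, joined into one pattern and ported to Python syntax.
pattern = re.compile(
    r'<p\s*class="clsNormalText"><a\shref="(?P<link>.*)?'
    r'("\s*target="newwindow")(.|\n)*?>'
    r'(?P<title>.*\n?.*)?(</a><br>\s*\n*)(?P<description>'
    r'(.|\n)*?)(<br>(.|\n)*?>)(?P<category>.*)?\.\s*(?P<pubDate>.*)?(\.</span></p>)'
)

def squash(text):
    """Collapse line breaks and runs of spaces into single spaces (hypothetical helper)."""
    return " ".join(text.split())

# One dictionary per news item, keyed by group name.
items = [
    {name: squash(value or "") for name, value in m.groupdict().items()}
    for m in pattern.finditer(html)
]
for item in items:
    print(item)
```

Each dictionary holds only the named groups, which is the shape the later feed-writing step needs.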
I'm having trouble changing:
http://www.dotnetwire.com (the default URL in MakeRSS)
to a different URL.
1) Uploaded the MakeRSS \bin files to the ASP.net server's BIN folder.
2) Uploaded the "dnw.aspx" and "dnw.aspx.vb" files.
3) Pointed the browser to: "http://...../dnw.aspx"
OK! It works "out of the box". It shows an RSS feed for http://www.dotnetwire.com.
4) Changed the default URL: http://www.dotnetwire.com
in file: "dnw.aspx.vb"
to point to a different URL. Uploaded the changed "dnw.aspx.vb" file.
5) Reload "http://...../dnw.aspx".
It works...but it still shows the same RSS feed as before the change.
It still points to the default URL: http://www.dotnetwire.com
The URL is "burned-in". How do I change it? Heeeeeeeelp!
Well, I have just posted a topic and saw your complication. I am new to RegEx, but I think it has something to do with greediness. RegEx tries to match as much as possible, unless you specify otherwise.
The trouble comes whenever the text is on one single line, which happens a lot.
In my search for a pattern to match the feeds I wanted to generate, I found this in Mastering Regular Expressions by Jeffrey E. F. Friedl.
I have just altered it to add the group names. It should work on every link, and has so far for me:
<a\b(?<link>[^>]+)>(?<title>.*?)</a>
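A quick way to try this pattern is the Python sketch below (Python spells the groups (?P<name>...)). The test string with two links on one line is just sample input. Note that the “link” group captures the whole attribute text of the tag, so the second href pattern is an extra step assumed here to pull out the URL itself.

```python
import re

# Two links on a single line (sample input).
html = '<a href="/news_1.html">News 1</a> <a href="myPage.html">My Page</a>'

# The pattern from the post, ported to Python's named-group syntax.
anchor = re.compile(r'<a\b(?P<link>[^>]+)>(?P<title>.*?)</a>')

# Second step, not part of the post: pick the URL out of the attribute text.
href = re.compile(r'href="(?P<url>[^"]*)"')

results = []
for m in anchor.finditer(html):
    url_match = href.search(m.group("link"))
    if url_match:  # skip anchors without a double-quoted href
        results.append((url_match.group("url"), m.group("title")))
print(results)
```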
When you grab the HTML from another homepage, it converts the text to UTF8, which is fine.
However, in Denmark and a lot of other countries we use special characters like Æ Ø Å, which are replaced by meaningless characters, thereby messing up the spelling.
I know that you should replace special characters with the correct values in your own homepage, so that every browser renders them correctly, but nobody does that.
So one should either use a Regex replace or a Replace function when it writes the HTML to the file. Well, so far so good, but I have enough difficulty just adjusting the code to my needs.
So does anyone know how to do that?
I hope someone does, since I have only just been able to get this working, after investing long hours in installing and understanding the class, Regex and the whole .NET framework. But then I discovered that most of the homepages had Æ Ø Å in them, which is perfectly natural.
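One possible cause, offered as a guess rather than a diagnosis: the page is served in a single-byte charset such as ISO-8859-1 but its bytes are decoded as UTF-8. The Python sketch below only illustrates that effect; the charset is an assumption, and the real fix would be to decode the download with whatever charset the page declares.

```python
# "Æ Ø Å" written with escapes so the example does not depend on file encoding.
text = "\u00c6 \u00d8 \u00c5"

# Bytes as a server might send them for a page encoded in ISO-8859-1 (assumed).
raw = text.encode("iso-8859-1")

# Decoding with the wrong charset turns each letter into a replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the page's own charset restores the letters.
right = raw.decode("iso-8859-1")

print(repr(wrong))
print(right)
```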
You mentioned that the ? before the “"\s*target="newwindow"” section is used to catch the first occurrence, but why am I having the following problem?
Pattern = <a\shref="(?<link>.*)?(">)
String = <a href="/news_1.html">News 1</a> <a href="myPage.html">My Page</a>
Why does it return "/news_1.html">News 1</a> <a href="myPage.html" and not 2 groups of link?
Please help..
Thanks
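The behaviour can be reproduced with the Python sketch below (named groups respelled as (?P<link>...)). The “?” after the group makes the group optional; it does not make “.*” lazy, so the match runs to the last “">” on the line. The lazy and negated-class variants are suggested alternatives, not patterns from the article.

```python
import re

text = '<a href="/news_1.html">News 1</a> <a href="myPage.html">My Page</a>'

# As posted: the trailing ? makes the group optional, and .* stays greedy.
greedy = re.compile(r'<a\shref="(?P<link>.*)?(">)')

# Lazy quantifier inside the group: stop at the first "> that follows.
lazy = re.compile(r'<a\shref="(?P<link>.*?)(">)')

# Negated character class: the link can never run past a double quote.
negated = re.compile(r'<a\shref="(?P<link>[^"]*)">')

greedy_links = [m.group("link") for m in greedy.finditer(text)]
lazy_links = [m.group("link") for m in lazy.finditer(text)]
negated_links = [m.group("link") for m in negated.finditer(text)]

print(greedy_links)
print(lazy_links)
print(negated_links)
```

The negated class is often the sturdier choice for attribute values, since it cannot cross the closing quote no matter what follows on the line.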
This thread is for discussions of Creating a Generic Site-To-Rss Tool.