.NET and the WebCrawler class

This article was originally published in VSJ, which is now part of Developer Fusion.
In the November issue, I showed how to develop a general-purpose WebCrawler class. This month we will look at a practical application that puts that class to work: a Web Indexer that builds a database of words cross-referenced by the web pages containing them. This project requires the WebCrawler class (Charlotte.dll), which appeared in the November 2004 issue, as well as the HTML Container class (WebWagon.dll) from the October issue. Both of these projects, including VB.NET and C# versions of the source code, are available at www.skycoder.com/downloads.

A Quick Review

Let’s take a moment to review the main classes used in the project. The WebCrawler class loads a document from the Internet and then calls itself recursively for each link, or HRef, on that page. Once started, the process continues until every possible path has been exhausted. A MaxLevel property determines how deep the recursion is allowed to go, although all in-range paths will eventually be followed regardless. An optional list of Include paths may be specified; if there is no Include list, all URLs are considered within range. An optional Exclude list works the other way round, ruling out any URL that matches an entry. Each list accepts full or partial path names. The class raises several events that pass feedback and data back to the caller. The HTML Container class provides a method to load an HTML document from the Internet, along with some simple properties and methods for parsing the HTML. The WebCrawler class uses the HTML Container class, and this project will use them both.

The Presentation Layer

Running this type of application is frustrating without knowing what it is doing and where. The WebCrawler class was designed with this in mind, and provides ample feedback to the presentation layer. We will use a TreeView control to diagram the process. Each node on the TreeView control will represent a web page, with each sub-node representing a page pointed to by that page and so on. Thus we will have a view of the pages to be visited and the recursion level where they will be found. Colour coding will indicate the state of each page – green for complete, red for error and blue for pages known but not yet visited. As each page is visited, we will display the text in that page as well – just to keep an eye on what’s being indexed.

Start a new Windows project, and add a Label, a TreeView control, two TextBoxes and a Button. At the bottom, place a StatusBar and a Progress Bar. Set the properties as shown in the box, “Property settings”.

Property Settings

Form1
Name frmMain
Size 960, 640
Text WebLex – Web Indexer
Label1
Name lblOpeningURL
AutoSize True
Location 8, 11
Size 76, 13
Text Opening URL:
TextBox1
Name txtOpeningURL
Location 88, 8
Size 400, 20
Text http://www.bbc.co.uk
TreeView1
Name tvProgress
Location 8, 40
Size 480, 496
TextBox2
Name txtText
Location 496, 40
Multiline True
Size 448, 496
Text  
Button1
Name btnStartIndexing
Location 16, 544
Size 96, 32
Text &Start Indexing
StatusBar1
Name stsFooter
Dock Bottom
ShowPanels True
Text  
ProgressBar1
Name pbLoadProgress
Location 843, 590
Size 88, 11

Click on the StatusBar and press F4 to bring up the properties window. Click the Panels ellipsis to bring up the StatusBarPanel Collection Editor. Press the Add button. StatusBarPanel1 is added. Set the AutoSize property to True and clear the Text property. Click the Add button again to add a second panel. Clear the Text property for this panel as well and press the OK button. From the menu, choose Project|Properties. Change the Startup object to frmMain. Add code to serialize the Opening URL as well as code for the Form_Resize event (included with the downloadable project but not listed). Your form should look something like the one shown in Figure 1.

Figure 1
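
The code to persist the Opening URL and to handle Form_Resize is included with the downloadable project but not listed in the article. A minimal sketch is shown below; the settings file name (WebLex.url) and the layout arithmetic are assumptions, so treat it as a starting point rather than the code from the download. Wire the three handlers to the form’s Load, Closing and Resize events:
private string m_strSettingsFile =
	System.IO.Path.Combine(Application.StartupPath, "WebLex.url");

private void frmMain_Load(object sender, System.EventArgs e)
{
	// Restore the last opening URL, if one was saved.
	if (System.IO.File.Exists(m_strSettingsFile))
	{
		using (System.IO.StreamReader sr =
			new System.IO.StreamReader(m_strSettingsFile))
		{
			txtOpeningURL.Text = sr.ReadLine();
		}
	}
}

private void frmMain_Closing(object sender,
	System.ComponentModel.CancelEventArgs e)
{
	// Save the opening URL for next time.
	using (System.IO.StreamWriter sw =
		new System.IO.StreamWriter(m_strSettingsFile))
	{
		sw.WriteLine(txtOpeningURL.Text);
	}
}

private void frmMain_Resize(object sender, System.EventArgs e)
{
	// Keep the TreeView and TextBox sized to the form (assumed layout values).
	tvProgress.Height = this.ClientSize.Height - tvProgress.Top - 104;
	txtText.Height = tvProgress.Height;
	txtText.Width = this.ClientSize.Width - txtText.Left - 8;
}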

Cranking up the Engine

With the Presentation Layer in place, let’s write some code to check out the WebCrawler class. The CrawlURL method accepts an opening URL and begins recursively crawling at that point. The code to begin this process is extremely simple and almost identical in both languages:
Private Sub btnStartIndexing_Click( _
	ByVal sender As System.Object, _
	ByVal e As System.EventArgs) _
	Handles btnStartIndexing.Click
	m_wcSpider.CrawlURL(txtOpeningURL.Text)
End Sub
We could have set the MaxLevel property, but the default is generally adequate. There are two additional properties that set the Include and Exclude lists, each of which accepts a string array; we can forget about those for now. Once the recursion has started, all the action takes place in the event handlers.

If you have not already done so, add a reference to both the HTML Container class (WebWagon.dll) and the Web Crawler class (Charlotte.dll). With .NET, it is not necessary to register the components. Simply choose Project|Add Reference from the menu. Click on the Projects tab and navigate to the folder with the DLLs – For example: c:\components\ if the DLLs were copied to that folder. Select both DLLs and click the OK button.

Declare a module-level WebCrawler object (just below the Windows Form Designer generated code) enabled for event handling. This is easily done in VB.NET using the WithEvents keyword, but C# does not support WithEvents, so we need to attach the event handlers at run time:

private Charlotte.WebCrawler m_wcSpider =
	new Charlotte.WebCrawler();

public frmMain()
{
	InitializeComponent();
	m_wcSpider.LoadProgress += new
		Charlotte.WebCrawler.LoadProgressHandler(m_wcSpider_LoadProgress);
	m_wcSpider.LoadStatus += new
		Charlotte.WebCrawler.LoadStatusHandler(m_wcSpider_LoadStatus);
	m_wcSpider.NewPage += new
		Charlotte.WebCrawler.NewPageHandler(m_wcSpider_NewPage);
	m_wcSpider.PageComplete += new
		Charlotte.WebCrawler.PageCompleteHandler(m_wcSpider_PageComplete);
	m_wcSpider.Queuing += new
		Charlotte.WebCrawler.QueuingHandler(m_wcSpider_Queueing);
}
The first events we will look at are LoadStatus, LoadProgress and NewPage. The first two are related: each provides feedback as a page is loaded from the Web. We will update the StatusBar message whenever LoadStatus fires and the ProgressBar whenever LoadProgress fires. These two handlers are trivial and not listed here.

The NewPage event fires when a page has been successfully loaded and passes the page back as an HTML Container object. It also provides a Boolean parameter named NoFollow, which is always False when the event fires; setting it to True tells the WebCrawler class not to follow any links on this page. The NewPage event is where we call a stored procedure to index (or re-index) the words on the page. For feedback purposes, we will also populate the TextBox control with the document’s text. This routine makes use of a DataServices object, m_dsWebLex, which wraps the interaction with the database:
private void m_wcSpider_NewPage(string URL,
	Charlotte.HTMLPage Page, int Level, ref bool NoFollow)
{
	try
	{
		txtText.Text = Page.Text;
		if (m_dsWebLex.WasIndexed(URL, DateTime.Parse(Page.LastModified)))
			NoFollow = true;
		else if (!(Page.NoIndex))
			m_dsWebLex.IndexContent(Page.Title, Page.URL, Page.Text);
	}
	catch {}
}
The Try Catch blocks are especially important here. Otherwise, closing the form during a run will often generate an error as the dying process futilely attempts to update a closed form.
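
For completeness, here is one way the two trivial handlers might look. The actual delegate signatures are defined in Charlotte.dll, so the parameter lists below (a URL and status message for LoadStatus, bytes loaded and total bytes for LoadProgress) are assumptions; the point is simply to push the values into the StatusBar and ProgressBar:
private void m_wcSpider_LoadStatus(string URL, string Status)
{
	try
	{
		// Assumed signature - see Charlotte.dll for the real LoadStatusHandler.
		stsFooter.Panels[0].Text = Status;
		stsFooter.Panels[1].Text = URL;
	}
	catch {}
}

private void m_wcSpider_LoadProgress(int BytesLoaded, int TotalBytes)
{
	try
	{
		// Assumed signature - see Charlotte.dll for the real LoadProgressHandler.
		pbLoadProgress.Maximum = Math.Max(TotalBytes, 1);
		pbLoadProgress.Value = Math.Min(BytesLoaded, pbLoadProgress.Maximum);
	}
	catch {}
}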

If everything is in place, the project will now have the ability to crawl the Net, beginning at the specified page and continuing until every possible path has been exhausted. As each page is loaded, the status and progress will be shown. When complete, the TextBox will be populated with text from the page. We have not restricted the range of allowable URLs yet, so the program will have a tendency to run forever and a day when started from any site which has links to another site. This of course describes most of the Internet.
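
As noted in the review, the WebCrawler class exposes an Include-list property that accepts a string array of full or partial paths. The property name used below (IncludeURLs) is an assumption, so check Charlotte.dll for the actual member, but restricting the crawl to the opening site is as simple as setting it before calling CrawlURL:
// Hypothetical property name; the real Include-list property is in Charlotte.dll.
m_wcSpider.IncludeURLs = new string[] { txtOpeningURL.Text };
m_wcSpider.CrawlURL(txtOpeningURL.Text);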

A Tree of URLs

Even a restricted range of URLs is bound to involve hundreds or thousands of pages. Let’s finish up the presentation layer by looking at the Queuing event, where we will update the TreeView control. This event is fired after the NewPage event, passing the URL of the current page, and an array of the URLs on that page that are to be crawled. To update the Treeview control, we first locate the Node that represents the current page, and then add a sub-node for each URL in the array. If there is no node for the current page, one is added to the end of the tree.

The Queuing event will fire even if the maximum recursion level has been exceeded. Nothing will actually break if we ignore this, but the TreeView control will end up with nodes that appear to be orphans: the parent node will be green, indicating completion, while the child node stays blue, indicating pending. The event’s Level parameter indicates the current level, and this can be compared with the MaxLevel property. If the maximum has been exceeded, we skip the check for the parent node and just add these URLs to the end of the tree as if there were no parent. They won’t be processed until all other known URLs have been processed anyway, so the end of the tree is exactly where they belong, rather than as sub-nodes of the overflowed parent.

To make things a little more manageable, we will write two helper functions. FindNodeByValue will search the TreeView control for a node whose Text matches the given value and return this node if found or null if not. SetParentNode will either return the results of FindNodeByValue, or create a new node and return that. Using the two helper functions, the TreeView control can now be updated in the Queuing event:

private void m_wcSpider_Queueing(string URL, int Level, string[] HRefs)
{
	try
	{
		TreeNode nodParent;
		if (Level > m_wcSpider.MaxLevel)
			nodParent = tvProgress.Nodes.Add(URL);
		else
			nodParent = (TreeNode)SetParentNode(tvProgress, URL);

		for (int i = 0; i < HRefs.Length; i++)
		{
			TreeNode nodChild = nodParent.Nodes.Add(HRefs[i]);
			nodChild.ForeColor = System.Drawing.Color.Blue;
		}
	}
	catch {}
}

private TreeNode SetParentNode(
		TreeView tvSet, string strURL)
{
	TreeNode nodParent=
		FindNodeByValue(tvSet, strURL);
	if (nodParent == null)
	{
		nodParent =
			tvSet.Nodes.Add(strURL);
		nodParent.ForeColor =
			System.Drawing.Color.Blue;
	}
	return(nodParent);
}

private TreeNode FindNodeByValue(
	System.Windows.Forms.TreeView tvParent, string strValue)
{
	foreach (TreeNode nodChild in tvParent.Nodes)
	{
		if (nodChild.Text == strValue)
			return(nodChild);
		else
		{
			TreeNode nodResult = FindNodeByValue(
				nodChild, strValue);
			if (nodResult != null)
				return(nodResult);
		}
	}
	return(null);
}
private TreeNode FindNodeByValue(
	TreeNode nodParent, string strValue)
{
	foreach (TreeNode nodChild in nodParent.Nodes)
	{
		if (nodChild.Text == strValue)
		{
			return (nodChild);
		}
		else
		{
			TreeNode nodResult = FindNodeByValue(
				nodChild, strValue);
			if (nodResult != null)
				return(nodResult);
		}
	}
	return(null);
}
That’s it… almost. The last remaining event, PageComplete, fires after all possible paths on a page have been travelled, or when the processing has ended due to an error. We complete the TreeView code in this event by setting the appropriate node to Green or Red depending on the ReturnCode parameter, which will indicate success or error:
private void m_wcSpider_PageComplete(string URL, int ReturnCode)
{
	try
	{
		TreeNode nodComplete = FindNodeByValue(this.tvProgress, URL);
		if (nodComplete != null)
		{
			if (ReturnCode == 0)
				nodComplete.ForeColor = System.Drawing.Color.Green;
			else
				nodComplete.ForeColor = System.Drawing.Color.Red;
		}
		txtOpeningURL.Text = URL;
	}
	catch {}
}

The DataServices Class

As it stands, the program is pretty useless but nonetheless fun to watch. The TreeView control will soon populate with dozens of nodes at all levels. You can see each page as it is processed, turning that node green or populating it with child nodes. Try setting a breakpoint on one of the events, and try changing the MaxLevel property (see the snippet below) to see how this affects the depth of the tree; as you increase the level, the TreeView control will grow deeper levels of sub-nodes. The infrastructure is now in place, so let’s add some functionality.
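
For experimenting with the depth of the crawl, the C# version of the click handler only needs one extra line (the exact default for MaxLevel is defined in Charlotte.dll):
private void btnStartIndexing_Click(object sender, System.EventArgs e)
{
	// Limit the recursion depth, then start crawling as before.
	// Try different values and watch how deep the tree grows.
	m_wcSpider.MaxLevel = 2;
	m_wcSpider.CrawlURL(txtOpeningURL.Text);
}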

We could write code to interact with the database here in the presentation layer, but a better solution is to create a DataServices class that abstracts this functionality. This has advantages for both scalability and flexibility. The project here assumes you are using SQL Server (note that a Desktop version of SQL Server comes with Visual Studio .NET), but you may wish to use MySQL or some other database. If so, only the DataServices class would need to be modified. Keeping these details hidden from the other layers ensures that any changes to the back end will not require modification of any of the other classes.

This class will have two public methods, both Boolean functions. WasIndexed accepts a URL and a ContentDate. If the URL has never been indexed, or if the ContentDate is more recent than the IndexDate, WasIndexed returns False; otherwise it returns True. This function is very important to the indexing program’s efficiency: if the page has already been indexed, we won’t re-index it, and more importantly, we won’t follow any links on that page either, thus stopping the recursion at this point.

The second function, IndexContent, accepts a Title, URL and Text. The page is indexed if new, or re-indexed if already known. This function returns True if successful and False if not.

Creating the Database

The DataServices class will be a very simple class that just executes the appropriate stored procedure and returns the results.

Let’s take a look at where the real work takes place. We need to design the database and stored procedures that the DataServices class will use. I will assume that you are using the Desktop Database Engine that comes with Visual Studio .NET. If you are using the enterprise version, use the appropriate Server and Security settings as per your configuration.

Once you are satisfied that SQL Server is up and running properly, create the database by right-clicking on your server and selecting New Database from the popup menu. Name the database dbWebLex. Right-click on the Tables node and add the four tables listed in the box “dbWebLex tables”.

dbWebLex tables

tblKeywords
Column Name     Data Type   Length   Allow Nulls
ID              int         4
Keyword         varchar     30
KeywordCount    int         4

tblKeywordURL
Column Name     Data Type   Length   Allow Nulls
ID              int         4
KeywordID       int         4
URLID           int         4

tblURLs
Column Name     Data Type   Length   Allow Nulls
ID              int         4
Title           varchar     150      Yes
URL             varchar     250
IndexDate       datetime    8

tblStopWords
Column Name     Data Type   Length   Allow Nulls
ID              int         4
Keyword         varchar     30       Yes

Indicate each of the ID fields as the Primary Key by selecting the field and then pressing the small key icon at the top left of the Table toolbar, or by selecting Set Primary Key from the Diagram menu. Make each ID field auto-increment by setting Identity to Yes on the Columns tab (see Figure 2). It is very important that you do this; otherwise you will get an error about inserting NULL into the ID column later, when records are added.

Figure 2

Set the indexes by selecting View|Indexes/Keys from the menu. From the Property Pages dialog, click the New button and then select the appropriate column name from the dropdown. You will want to index both Keyword fields, as well as KeywordID, URLID and URL.

The Power of Stored Procedures

An application such as this is a no-brainer as far as using stored procedures is concerned. We are going to extract and index every word of each document processed. We could write a loop in VB.NET or C# that iterates over each word in the document, updating the appropriate tables along the way.

However, if you benchmark this compared with writing a stored procedure that accepts the entire document and does the same thing, you will be amazed at the difference in performance – even when using the Desktop version of SQL Server. This would be especially pronounced when going across a network.

We will begin by writing three simple stored procedures: Add1Word, Add1URL and AddWordURL. These procedures are responsible for adding an item to the tblKeywords, tblURLs or tblKeywordURL table, as the names indicate. Each has a slightly misleading name, for a record will only be added if the entity in question is not already present.

Additionally, the first two are also tasked with providing the ID of the corresponding record for this item. For example, given the word ‘horse,’ Add1Word will return the ID for ‘horse’ if found, otherwise add ‘horse’ to the database and return the ID of the newly inserted record. Note that it is AddWordURL where KeywordCount is incremented:

CREATE PROCEDURE dbo.Add1Word
(
	@Word VarChar(30),
	@WordID Integer = -1 OUTPUT
)
AS
	SET NOCOUNT ON
	Set @WordID = (Select ID from tblKeywords
		where Keyword = @Word)
	IF @WordID is NULL Begin
		Insert Into tblKeywords (
			Keyword, KeywordCount) Values(@Word,0)
		Set @WordID = @@IDENTITY
	End
RETURN

CREATE PROCEDURE dbo.Add1URL
(
	@Title Varchar(150),
	@URL Varchar(255),
	@URLID Integer = -1 OUTPUT,
	@IndexDate DateTime = NULL
)
AS
	SET NOCOUNT ON
	Set @URLID = (Select ID from tblURLs
		where URL = @URL)
	IF @URLID is null Begin
		If @IndexDate is Null
			Insert Into tblURLs (Title, URL, IndexDate)
			Values(@Title, @URL, GetDate())
		Else
			Insert Into tblURLs (Title, URL, IndexDate)
			Values(@Title, @URL, @IndexDate)
		Set @URLID = @@IDENTITY
	End
RETURN

CREATE PROCEDURE dbo.AddWordURL
(
	@WordID Integer,
	@URLID Integer
)
AS
	SET NOCOUNT ON
	Declare @ID Integer
	Set @ID = (Select ID from tblKeywordURL
		Where KeywordID = @WordID AND
			URLID = @URLID)
	If @ID is NULL Begin
		Insert Into tblKeywordURL (KeywordID, URLID)
		Values(@WordID,@URLID)
		Update tblKeywords Set KeywordCount =
			KeywordCount + 1
		Where ID = @WordID
	End
RETURN
What goes up must come down, and what is added must be deleted. The procedure DeleteURL accepts a URL and removes all related records from tblURLs and tblKeywordURL.

Additionally, the KeywordCount in tblKeywords is decremented as appropriate. This routine begins by obtaining the ID of the URL to be deleted. A cursor is used to iterate through tblKeywordURL for each matching entry; each time through, the KeywordCount for the corresponding record in tblKeywords is decremented. Finally, the records are removed from tblKeywordURL and tblURLs:

CREATE PROCEDURE dbo.DeleteURL
(
	@URL Varchar(255)
)
With recompile
AS
	SET NOCOUNT ON
	Declare @URLID Integer
	Declare @KeywordID Integer
	Set @URLID = (Select ID From tblURLs
		Where URL = @URL)

	Declare curKeywordURL Cursor For
		Select KeywordID
		From tblKeywordURL Where URLID = @URLID
	OPEN curKeywordURL
	FETCH NEXT FROM curKeywordURL INTO @KeywordID
	While @@FETCH_STATUS = 0 Begin
		Update tblKeywords Set KeywordCount = KeywordCount - 1
			Where ID = @KeywordID
		FETCH NEXT FROM curKeywordURL INTO @KeywordID
	End

	CLOSE curKeywordURL
	DEALLOCATE curKeywordURL

	IF @URLID is Not NULL Begin
		Delete From tblKeywordURL
			Where URLID = @URLID
		Delete From tblURLs
			where ID = @URLID
	End

RETURN

Stop Words

It is a good idea to ignore certain words, such as “a” or “the”, which occur many times in every document and provide no useful information. As its name implies, tblStopWords is populated with the entries from tblKeywords that are to be ignored during indexing. You could populate this table manually, by first adding a list of common words to tblKeywords and then adding the appropriate entries to tblStopWords – but we have a computer, so let’s use it. A much better way is to run the application for a few hours with no stop words defined, allowing the common words to clutter up the database as they wish. Then use a query to populate tblStopWords from any word whose KeywordCount exceeds a chosen threshold (one possible approach is sketched below). Finally, clear all of the other tables and begin again. Of course, you only need to do this once.
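
One hedged way to do that one-off population from C# is shown below. It assumes stop words are matched on the keyword text itself (tblStopWords does include a Keyword column) and that anything appearing more than 1000 times is noise; adjust the threshold, or insert IDs instead, to match however your IsaStopWord procedure does its lookup:
// One-off utility: copy over-used keywords into tblStopWords.
// The threshold (1000) and the text-based match are assumptions.
using System;
using System.Data.SqlClient;

class PopulateStopWords
{
	static void Main()
	{
		string strConnection =
			"Server=(local);Database=dbWebLex;Integrated Security=SSPI;";
		string strSql =
			"INSERT INTO tblStopWords (Keyword) " +
			"SELECT Keyword FROM tblKeywords WHERE KeywordCount > 1000";

		using (SqlConnection cn = new SqlConnection(strConnection))
		using (SqlCommand cmd = new SqlCommand(strSql, cn))
		{
			cn.Open();
			Console.WriteLine("{0} stop words added.", cmd.ExecuteNonQuery());
		}
	}
}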

The procedure IsaStopWord (not listed) accepts a keyword, returning True if it is a stop word and False otherwise.

Indexing the Document

SQL Server will get upset about a string parameter longer than 8000 characters, so we will write a stored procedure named AddWords that accepts a Title, a URL and up to 8000 characters of text, indexing each of the words contained within. A fourth parameter indicates whether or not to replace the existing records for this URL. The caller sets this to True unless the document is longer than 8000 characters; in that case the HLL program is responsible for breaking the document up into chunks, and the subsequent calls set the flag to False so that the remaining chunks do not replace what was indexed by the previous call. The procedure expects each word to be delimited by one or more blanks, so the HLL program is also responsible for converting any special characters that should be treated as blanks.
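
The chunking itself belongs in the DataServices class, inside IndexContent. The sketch below is an assumption about how that wrapper might drive AddWords; ExecAddWords is a hypothetical helper that executes the dbo.AddWords stored procedure with the given ReplaceURL flag, and a more careful version would split on a blank so that no word straddles two chunks:
// A hedged sketch of the chunking logic inside DataServices.IndexContent.
public bool IndexContent(string Title, string URL, string Text)
{
	try
	{
		// AddWords expects words separated by blanks, so convert the usual
		// punctuation and whitespace characters into spaces first.
		foreach (char c in ".,;:!?\"()<>[]\r\n\t".ToCharArray())
			Text = Text.Replace(c, ' ');

		bool bReplace = true;	// only the first chunk replaces the old entries
		for (int i = 0; i < Text.Length; i += 8000)
		{
			int nLength = Math.Min(8000, Text.Length - i);
			// ExecAddWords is a hypothetical helper that calls dbo.AddWords.
			ExecAddWords(Title, URL, Text.Substring(i, nLength), bReplace);
			bReplace = false;
		}
		return true;
	}
	catch
	{
		return false;
	}
}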

The main loop continues as long as a word is found. Any word longer than 30 characters is considered garbage and ignored. When something is found that is neither empty, garbage nor a stop word, the three add procedures are called:

CREATE PROCEDURE dbo.AddWords
(
	@Title varchar(150),
	@URL varchar(255),
	@Words varchar(8000),
	@ReplaceURL Bit
)
AS
	SET NOCOUNT ON
	Declare @Word varChar(30)
	Declare @s Integer
	Declare @t Integer
	Declare @l Integer
	Declare @Result Integer
	Declare @WordID Integer
	Declare @URLID Integer

	If @ReplaceURL <> 0
		Execute DeleteURL @URL

	Set @s = 1
	Set @t = 1
	While (@t > 0) Begin
		Set @t = CharIndex(' ', @Words, @s)
		If @t > 0 Begin
			Set @l = @t - @s
			If @l <= 30
				Set @Word = Substring(@Words, @s, (@t - @s))
			Else
				Set @Word = ''
			Set @s = @t + 1
		End /* @t > 0 */
		Else Begin
			Set @Word = Substring(@Words, @s, 30)
		End /* Else */

		Set @Word = RTrim(@Word)
		If @Word <> '' Begin
			Execute IsaStopWord @Word, @Result OUTPUT
			If @Result = 0 Begin
				Execute Add1Word @Word, @WordID OUTPUT
				Execute Add1URL @Title, @URL, @URLID OUTPUT
				Execute AddWordURL @WordID, @URLID
			End
		End
	End

RETURN
The last stored procedure, WasIndexed, accepts a URL and a date – the ContentDate of the document. This function returns True only if the URL is on file, and the IndexDate is greater than the ContentDate:
CREATE PROCEDURE dbo.WasIndexed
	(
		@URL varchar(250),
		@ContentDate DateTime,
		@Result Bit = 0 OUTPUT
	)
AS
	Declare @IndexDate DateTime

	Set @IndexDate =
		(select IndexDate From tblURLs
			Where URL = @URL)
	If @IndexDate is NULL
		Set @Result = 0
	Else
		If DateDiff(minute, @IndexDate,
				@ContentDate) < 0
			Set @Result = -1
		Else
			Set @Result = 0
RETURN @Result

Completing the Project

With the tables created and the procedures written, we can finish up with the DataServices class. The complete listing for the project is available in both VB.NET and C# versions from www.skycoder.com/downloads.
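
For example, the WasIndexed wrapper can be little more than a call to the dbo.WasIndexed procedure just listed. The sketch below assumes a connection string field named m_strConnection and is a starting point rather than the code from the download:
// A minimal sketch of the DataServices class showing the WasIndexed wrapper.
using System;
using System.Data;
using System.Data.SqlClient;

public class DataServices
{
	// Assumed connection string; adjust server and security to your configuration.
	private string m_strConnection =
		"Server=(local);Database=dbWebLex;Integrated Security=SSPI;";

	public bool WasIndexed(string URL, DateTime ContentDate)
	{
		try
		{
			using (SqlConnection cn = new SqlConnection(m_strConnection))
			using (SqlCommand cmd = new SqlCommand("dbo.WasIndexed", cn))
			{
				cmd.CommandType = CommandType.StoredProcedure;
				cmd.Parameters.Add("@URL", SqlDbType.VarChar, 250).Value = URL;
				cmd.Parameters.Add("@ContentDate", SqlDbType.DateTime).Value = ContentDate;
				SqlParameter prmResult = cmd.Parameters.Add("@Result", SqlDbType.Bit);
				prmResult.Direction = ParameterDirection.Output;

				cn.Open();
				cmd.ExecuteNonQuery();
				return Convert.ToBoolean(prmResult.Value);
			}
		}
		catch
		{
			return false;	// when in doubt, index the page again
		}
	}
}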


Jon Vote is an independent consultant based on the west coast of the USA. He is a Microsoft Certified Solution Developer (MCSD), with a degree in Computer Science from Southern Oregon University. He can be reached at [email protected].
