Building a web crawler

This article was originally published in VSJ, which is now part of Developer Fusion.
In October’s issue I showed how to develop an HTML Container class. This month, we will use that class to develop a general purpose Web Crawler class. The HTML Container project, including a VB.NET version, can be downloaded from the VSJ web site.

Before getting started you will need to add the HTML Container class (WebWagon.dll) to your project. From the menu, choose Projects|Add Reference. Click the Projects tab and then the Browse button. Navigate to the location of WebWagon.dll and click OK.

A Web Crawler – sometimes referred to as a spider or robot – is a process that visits a number of web pages programmatically, usually to extract some sort of information. For example, the popular search engine Google has a robot called googlebot that sooner or later visits virtually every page on the Internet for the purpose of indexing the words on that page. We are going to develop a general-purpose class that can be used as a basis for writing any type of robot. This class will be simple yet powerful. The heart of the class is a method called CrawlURL, which accepts a beginning URL. The contents of this URL will be loaded into the HTML Container class. For each link found on the page, the method calls itself recursively, repeating the process for each of those pages, and so on.

The basic process is pretty simple, but we must add a few more features in order to avoid spiralling into an infinite loop. First of all, we want to keep track of pages that we’ve already seen. Many sites are such that Page A points to Page B which points back again to Page A – and the routine will soon be chasing its own tail if we don’t prevent it from doing so. A related problem has to do with the allowable recursion level. If left unbounded, many sites will cause the crawler to dig itself into a hopelessly deep hole, using large amounts of stack space and memory. It is also possible to encounter a submission form that points back to itself using a parameterized URL, fooling the check for already encountered links. For the sake of usability, we want to restrict the range of the crawl so we can target a specific site or group of sites while ignoring everything else. Finally, as we will see later, robot etiquette requires that we maintain an Exclude list, so we might as well expose this function to the object caller as well.

The first problem could be solved in a number of ways. We could have an array or collection representing the known URLs and a routine that would check to see if a given URL was already in the list. Actually, we can save a bit of work by using the .NET Queue class. This class has two methods that will be useful for this purpose – the Enqueue method, which adds an item to a queue, and the Contains method, which returns True if a given item is in the queue and False if not – exactly the functions needed for the task at hand.
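
As a quick illustration of the idea (this fragment is not part of the class itself – the real check appears later inside WeWannaCrawl), the two calls combine into a simple “seen it before?” test:
Queue qKnown = new Queue();	// System.Collections – the pre-generic Queue
string strURL = "http://www.bbc.co.uk/";
if (!qKnown.Contains(strURL))	// not seen before...
	qKnown.Enqueue(strURL);		// ...so remember it for next time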

A property will be added to set or indicate the maximum recursion level. We will write a private version of CrawlURL, which accepts a URL as does the public method, but also expects a recursion level parameter. The public interface will simply invoke the private method, passing a zero, and the process will continue from there. When the maximum level has been reached we will stop recursing, but we don’t want to simply throw away the links found there – doing so could drop pages that are not reachable by any other path. Instead, we will use another queue, populated with the links that would have been visited had we not exceeded the maximum level. These links will in turn be visited once the initial recursion has completed, and the process will start again. In this way a lower recursion level saves memory without causing any links to be missed completely.
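
For reference, the listings that follow assume a handful of module-level fields which are not shown explicitly. Here is one possible set of declarations, together with a simple maximum-level property (the names m_qKnown, m_qToDo, m_pgeCurrent and m_intMaxLevel are the ones used later; the property name and the default value are arbitrary, and the downloadable project may differ):
private Queue m_qKnown = new Queue();			// URLs already encountered
private Queue m_qToDo = new Queue();			// links deferred until the current pass completes
private HTMLPage m_pgeCurrent = new HTMLPage();	// the extended container defined below
private int m_intMaxLevel = 3;					// maximum recursion depth (arbitrary default)

public int MaxLevel
{
	get { return(m_intMaxLevel); }
	set { m_intMaxLevel = value; }
}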

The Include list will be optional. If none is specified, any URL will be allowed unless found on the Exclude list. If one or more entries are present, the URL in question will be required to match one of them. A partial path will be allowed, so the path:

http://www.bbc.co.uk/sports
…will include or exclude (depending on which list it appears in) anything that matches up to that point – so http://www.bbc.co.uk/sports/football would match, while http://www.bbc.co.uk/news would not.

A well-mannered robot

As with just about everything in life, there are certain rules that should be followed when writing a robot. There are two important bits of information that every robot should check before visiting a specific link. The first of these is the robots.txt file. This file may or may not be present – but if it is, it will be found in the root directory of the server. For example, the BBC’s robots.txt file can be found at:
http://www.bbc.co.uk/robots.txt
Each entry in a robots.txt file consists of two parts – the robot name (User-agent) and the list of excluded paths meant for that specific robot. An asterisk (*) is used to indicate any robot not explicitly named – us, for example. Here is a sample entry from the BBC robots.txt file:
User-agent: *
Disallow: /cgi-bin
This tells us that we should not index any files found in the /cgi-bin path, which for this site means everything that begins with:
http://www.bbc.co.uk/cgi-bin/
Two important things should be noted about the robots.txt file. First of all, it has nothing to do with security. You can, if you wish, ignore the robots.txt file and traverse to your heart’s content – but it is not considered polite to do so. Secondly, it is often the case that the excluded paths contain duplicated or otherwise uninteresting information that you probably didn’t want to visit anyway. Often the robots.txt file is maintained by the webmaster as a courtesy to the robot writer rather than a hindrance.

While the robots.txt file applies to the entire site, a given page can also have meta-tags that contain robot information. There are two of these that we should be concerned with here – NoIndex and NoFollow, which may appear together, separately or not at all. Here is an example of a meta-tag that contains both values:

<meta name="robots" content="noindex, nofollow">
The NoIndex value requests that the text on that page not be indexed. The NoFollow value requests that none of the links on that page be crawled.

We are just about ready to write the CrawlURL method, but first let’s take a look at the HTML Container class. There are two methods that we will find particularly useful: LoadSource and GetHRefs. The first of these will grab a document given a URL. The second will extract every HRef from the anchor tags on that page. Perfect – except that it would have been nice if the HTML Container class had included NoIndex and NoFollow properties to save us the trouble of checking them ourselves.
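
To give a feel for how these two methods are used (the signatures shown are the ones relied upon in the listings below – LoadSource returns a bool, GetHRefs returns a string array that may be null), a caller might do something like this:
WebWagon.HTMLPage pge = new WebWagon.HTMLPage();
if (pge.LoadSource("http://www.bbc.co.uk/"))
{
	string[] strHRefs = pge.GetHRefs();	// null if the page has no links
	if (strHRefs != null)
		foreach (string strHRef in strHRefs)
			Console.WriteLine(strHRef);
}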

Extending the HTML container class

Well, of course there is an easy solution – inheritance. We will create a new HTML Container class that will inherit the original, while adding two additional properties: NoIndex, True if a NoIndex meta-tag is present, and NoFollow, True if a NoFollow meta-tag is present. Looking at the HTML Container class again, we find the LoadStatus event, which is raised with a Description of either "Complete" or "Error" when the page has completed loading or failed to load. This sounds like a good place to check the document’s meta-tags. We will write a private routine SetRobotFlags, which will set module-level Boolean variables corresponding to the NoIndex and NoFollow properties, depending on the presence or absence of their corresponding meta-tags. When a page has been loaded successfully, this routine will be called to set the flags. In the event of an error, we will set both of these flags to True.

Defining the classes and events

To create a new class based on an existing class we declare the class and then use the Inherits keyword in VB.NET, or in C# suffix the class name with a colon followed by the base class. The modified HTML Container will be a stand-alone class – but contained in the same namespace as the WebCrawler class:
public class HTMLPage : WebWagon.HTMLPage
{
	private bool m_bNoIndex;
	private bool m_bNoFollow;

	public HTMLPage()
	{
		this.LoadStatus += new HTMLPage.LoadStatusHandler(HTMLPage_LoadStatus);
	}

	public bool NoIndex
	{
		get
		{
			return(m_bNoIndex);
		}
	}

	public bool NoFollow
	{
		get
		{
			return(m_bNoFollow);
		}
	}

	private void HTMLPage_LoadStatus(string URL, string Description)
	{
		if (String.Compare(Description, "Complete", true) == 0)
			SetRobotFlags();
		else if (String.Compare(Description, "Error", true) == 0)
		{
			// a page that failed to load should be neither indexed nor followed
			m_bNoIndex = true;
			m_bNoFollow = true;
		}
	}

	private void SetRobotFlags()
	{
		string[] strMetaTags = GetTagsByName("Meta");
		if (strMetaTags == null)
		{
			m_bNoIndex = false;
			m_bNoFollow = false;
		}
		else
		{
			const string ROBOTS = @"name\s*=\s*""?robots";
			const string NO_INDEX = "noindex";
			const string NO_FOLLOW = "nofollow";
			int i;
			for (i = 0; i < strMetaTags.Length; i++)
			{
				if (Regex.IsMatch(strMetaTags[i], ROBOTS,
					RegexOptions.IgnoreCase))
				{
					if (Regex.IsMatch(strMetaTags[i], NO_INDEX,
						RegexOptions.IgnoreCase))
						m_bNoIndex = true;
					if (Regex.IsMatch(strMetaTags[i], NO_FOLLOW,
						RegexOptions.IgnoreCase))
						m_bNoFollow = true;
				}
			}
		}
	}
}
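Used on its own, the extended container behaves just like the original, with the two extra flags available once a page has been loaded (this sketch assumes the base class raises LoadStatus during the call to LoadSource, so the flags are already set when it returns):
HTMLPage pge = new HTMLPage();	// the derived class above
if (pge.LoadSource("http://www.bbc.co.uk/"))
{
	if (!pge.NoIndex)
	{
		// safe to index the text on this page
	}
	if (!pge.NoFollow)
	{
		string[] strHRefs = pge.GetHRefs();
		// safe to queue these links for crawling
	}
}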

Passing data back from the Crawler

The events raised by the HTML Container class are mainly for the purpose of providing feedback to the user – so we will go ahead and raise the same events in our WebCrawler class. The events from the HTML Container class, LoadStatus and LoadProgress, indicate the status of a page as it is loaded. Feedback is nice, but in the case of the WebCrawler class it won’t be much use at all unless we return some information to the parent object. Once started, the WebCrawler class will keep itself busy until all possible paths have been exhausted, which could be hours, weeks or even longer. So, we will raise a NewPage event every time a new page is encountered and pass back the HTML Container object itself. A Queuing event will fire when we have built a list of HRefs on that page, passing back an array of the HRefs about to be crawled. Finally, a PageComplete event will fire when the page has been processed and all its possible paths followed:
public delegate void LoadStatusHandler(string URL, string Description);
public event LoadStatusHandler LoadStatus;

public delegate void LoadProgressHandler(string URL, long Max, long intValue);
public event LoadProgressHandler LoadProgress;

public delegate void NewPageHandler(string URL, HTMLPage Page,
	int Level, ref bool NoFollow);
public event NewPageHandler NewPage;

public delegate void QueuingHandler(string URL, int Level, string[] HRefs);
public event QueuingHandler Queuing;

public delegate void PageCompleteHandler(string URL, int ReturnCode);
public event PageCompleteHandler PageComplete;
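From the caller’s point of view, hooking these events up looks something like the following sketch (the handler names are arbitrary, and it assumes the delegates above are declared inside the WebCrawler class):
private WebCrawler m_crawler = new WebCrawler();

private void WireEvents()
{
	m_crawler.NewPage += new WebCrawler.NewPageHandler(OnNewPage);
	m_crawler.Queuing += new WebCrawler.QueuingHandler(OnQueuing);
	m_crawler.PageComplete += new WebCrawler.PageCompleteHandler(OnPageComplete);
}

private void OnNewPage(string URL, HTMLPage Page, int Level, ref bool NoFollow)
{
	// index or store the page here; setting NoFollow to true has the
	// same effect as a nofollow meta-tag on the page
	Console.WriteLine("Visited {0} at level {1}", URL, Level);
}

private void OnQueuing(string URL, int Level, string[] HRefs)
{
	Console.WriteLine("{0} links queued from {1}", HRefs.Length, URL);
}

private void OnPageComplete(string URL, int ReturnCode)
{
	// a ReturnCode of -1 indicates the page failed to load
}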

Developing the main engine

We are finally ready for the fun part. We begin by providing the public interface for CrawlURL, which accepts the initial URL supplied by the caller. In the opening section we discussed the need to keep track of the recursion level, and to save any links on a page that exceeds the limit. Both will be significant here, as we shall soon see. Since we don’t want the user to worry about passing level information, we will create a private interface that accepts a URL and also a Level. The private routine will increment this each time and call itself back – thus keeping track of the current level. The initial call to the CrawlURL function will only complete after every possible path has been followed to the maximum level – which may take some time. Theoretically, though, it will eventually complete, at which point any links that were skipped because the level was too great will be waiting in the queue that was populated when that condition was detected. We need to begin the entire process again for each of these items. To iterate through the items in the queue, we will use the Peek method to get the value of the item, and the Dequeue method to remove the item from the queue. We are deferring most of the work to the private member, so the public version of CrawlURL is fairly simple and straightforward:
public void CrawlURL(string URL)
{
	CrawlURL(URL, 0);
	while (m_qToDo.Count > 0)
	{
		URL = (string) m_qToDo.Peek();
		m_qToDo.Dequeue();
		CrawlURL(URL, 0);
	}
}
The private member is a bit more involved, but thanks to all our prep work, it’s really a piece of cake. Since there is no reason to continue without a page, we will begin by invoking the LoadSource method of the HTML Container class. Remember that although we redefined this class, the method will be executing in the base class – so we don’t need to worry about how it works – it just works. If the load fails, we will raise a PageComplete event indicating an error and exit. If the page has loaded correctly, we will raise a NewPage event, passing back the URL and the HTMLPage object that was just loaded. This will allow the caller to do whatever processing they wish with that page. We will also pass back the Level, in case the caller wants to know how deeply the process has dug itself. Finally, a Boolean parameter passed by reference – NoFollow – can be set to True by the event handler, which will have the same effect as if a NoFollow meta-tag had been found in the document.

Assuming the NoFollow parameter from the NewPage event has not been set to True, we look at the NoFollow property determined by the document’s meta-tags. We will only continue if this is False, which it normally is. The GetHRefs method will return an array of the links on this page. This array is passed to a routine called ShuffleURLs, which will weed the list. Any URLs that are already known, malformed, matched by the Exclude list or not matched by the Include list are rejected. The Include and Exclude lists are set using the methods SetIncludes and SetExcludes, which simply accept an array of strings and initialize the private lists m_strIncludes and m_strExcludes; a possible shape for these setters is sketched below.
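
These two setters are not part of the printed listings, so treat the following as a sketch. The DataServices listings later in the article use m_strIncludes and m_strExcludes as ArrayLists (Count, Add and so on), so that is how they are declared here; the downloadable project may arrange this slightly differently:
private ArrayList m_strIncludes = new ArrayList();
private ArrayList m_strExcludes = new ArrayList();

public void SetIncludes(string[] strIncludes)
{
	m_strIncludes.Clear();
	if (strIncludes != null)
		m_strIncludes.AddRange(strIncludes);
}

public void SetExcludes(string[] strExcludes)
{
	m_strExcludes.Clear();
	if (strExcludes != null)
		m_strExcludes.AddRange(strExcludes);
}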

A poor man’s inter-process queuing algorithm

You may be wondering why the weed routine is named ShuffleURLs rather than something like WeedURLs. Well, it is so named because once the list has been weeded, it is shuffled into random order! It is probably not apparent why it should be necessary to shuffle the list into random order – and the answer is that, strictly speaking, it isn’t. However, if you run multiple instances of a program like this at the same time, you will find that the sessions soon begin playing ‘leapfrog’ with each other. Even if you start them using different URLs, they will soon sync themselves in a seemingly magic fashion. The first program will process a given page only to have the second one visit the same page immediately afterward, and so on. Of course the ‘right’ way would be to create some sort of inter-process queue to ensure that each instance had its own slice of the total, but randomly shuffling the list only takes a few additional lines of code and is tremendously effective at reducing this problem. It’s a classic ten-cent solution with a $100 payoff.

The main recursion loop

At this point, the HRefs array is populated with the links on this page. We are ready to repeat the process again for each item in the array. The array will be null if there were no links on this page. Otherwise, we will raise the Queuing event back to the caller, passing the HRefs array. Next we iterate through each item in the array. We will recurse again if the maximum level has not been reached; otherwise we will simply add each item to the ToDo queue for processing later. Here is the complete listing for the private version of CrawlURL:
private void CrawlURL(string URL, int Level)
{
	if (! m_pgeCurrent.LoadSource(URL))
	{
		if (PageComplete != null)
			PageComplete(URL, -1);
		return;
	}
	bool bNoFollow = false;
	if (NewPage != null)
		NewPage(URL, m_pgeCurrent, Level, ref bNoFollow);
	Level++;
	if (! bNoFollow)
		if (! m_pgeCurrent.NoFollow)
		{
			string[] strHRefs = m_pgeCurrent.GetHRefs();
			if (!(strHRefs == null))
			{
				strHRefs = ShuffleURLs(m_pgeCurrent.Host, strHRefs);
				if (!(strHRefs == null))
				{
					if (Queuing != null)
						Queuing(URL, Level, strHRefs);
					int i;
					for (i = 0; i < strHRefs.Length; i++)
					{
						if (Level <= m_intMaxLevel)
							CrawlURL(strHRefs[i], Level);
						else
							m_qToDo.Enqueue(strHRefs[i]);
					}
				}
			}
		}
	Level--;

	if (PageComplete != null)
		PageComplete(URL, 0);
}

The DataServices class

The basic engine and much of the infrastructure is now ready, but there is still some unfinished business to complete. We still need to add the code to filter URLs based on the values in the Include/Exclude lists as well as the entries in the robots.txt file, if present. We can combine all of these functions into a third class, which will be responsible for persisting this sort of information – the DataServices class. Let’s start with the code needed to decipher the robots.txt file. When a new domain name is encountered, we will request the robots.txt file for that server. If it is present, we will add any excluded paths to our Exclude list after normalizing them by prefixing the domain name. Whether the robots.txt file is found or not, we will add the domain name to a queue so that when it is encountered next time, we will know it has already been processed. The routine to do this will be a private routine named LoadRobotExcludes in the DataServices class, which will accept a domain name and load the Exclude list if applicable. The AddDomainName and KnownDomainName functions (listed later with the other DataServices members) are trivial routines that use the familiar Queue object to add an item to the queue or to check whether it is already there.
private void LoadRobotExcludes(string strDomainName)
{
	if (!(KnownDomainName(strDomainName)))
	{
		WebWagon.HTMLPage pgeRobots = new WebWagon.HTMLPage();
		if (pgeRobots.LoadSource("http://" + strDomainName + "/robots.txt"))
		{
			string strSource = pgeRobots.Source;
			if (strSource != "")
			{
				string[] strUserAgents = Regex.Split(strSource, "User-agent:",
					RegexOptions.IgnoreCase | RegexOptions.Singleline);
				int i;
				for (i = 0; i < strUserAgents.Length; i++)
				{
					const string OUR_BOT = "*";
					if (Left(strUserAgents[i].Trim(), 1) == OUR_BOT)
					{
						// strip the leading "*" and surrounding whitespace, and
						// append a newline so the comment pattern below also
						// catches a comment on the final line
						const string TRIM_RIGHT = @"\s*\*\s*";
						string strExcludes = Regex.Replace(strUserAgents[i],
							TRIM_RIGHT, "") + "\r\n";
						const string COMMENTS = "#[^\n]*\n";
						strExcludes = Regex.Replace(strExcludes, COMMENTS, "");

						string[] strExcludePaths = Regex.Split(strExcludes,
							"Disallow:", RegexOptions.IgnoreCase);
						int j;
						for (j = 0; j < strExcludePaths.Length; j++)
						{
							const string TRIM_WHITE_SPACE = @"\s";
							strExcludePaths[j] = Regex.Replace(strExcludePaths[j],
								TRIM_WHITE_SPACE, "");
							if (strExcludePaths[j] != "")
							{
								m_strExcludes.Add("http://" + strDomainName
									+ strExcludePaths[j]);
							}
						}
					}
				}
			}
		}
		// record the domain whether or not a robots.txt file was found
		AddDomainName(strDomainName);
	}
}
The DataServices class will also be responsible for telling us if a given URL is present in either the Include list or the Exclude list. The functions IncludedURL and ExcludedURL check the Include or Exclude list for a given entry and return True or False depending on whether a match is found. IncludedURL first checks to make sure there is an Include list; if there is none, every URL results in True. ExcludedURL first calls LoadRobotExcludes. Both routines make use of a private function PathMatches, which accepts an entry and a list of paths:
public bool IncludedURL(string strURL)
{
	if (m_strIncludes.Count == 0)
		return(true);
	else
		return(PathMatches(strURL,
			m_strIncludes));
}

public bool ExcludedURL(
		string strDomainName,
		string strURL)
{
	LoadRobotExcludes(strDomainName);
	return(PathMatches(strURL,
		m_strExcludes));
}

private bool KnownDomainName(
		string strDomainName)
{
	return(m_qDomainNames.Contains(
		strDomainName));
}

public void AddDomainName(
	string strDomainName)
{
	m_qDomainNames.Enqueue(
		strDomainName);
}

private bool PathMatches(string strItem, ArrayList raList)
{
	string[] strList = (string[]) raList.ToArray(typeof(string));
	if (strList == null)
		return(false);
	else
	{
		int i;
		for (i = 0; i < strList.Length; i++)
			if (System.String.Compare(strList[i],
				Left(strItem, strList[i].Length), true) == 0)
				return(true);
	}
	return(false);
}

Weeding the list

With the DataServices class in place, we are ready to look at the weeding process. The private function WeWannaCrawl will accept a domain name and URL, check to see if it is suitable, and return True or False accordingly. The URL must be well formed, not already encountered, match the Include list and not match the Exclude list. This routine makes use of another trivial function – WellFormedURL, which simply checks whether the URL begins with “http://” and returns True if so. IncludedPath and ExcludedPath simply wrap the IncludedURL and ExcludedURL methods already mentioned; sketches of all three helpers appear after the listing below:
private bool WeWannaCrawl(
	string strDomainName,
	string strURL)
{
	if (WellFormedURL(strURL))
		if (! m_qKnown.Contains(
			strURL))
			if (IncludedPath(strURL))
				if (! ExcludedPath(
					strDomainName,
					strURL))
				{
					m_qKnown.Enqueue(
						strURL);
					return(true);
				}
	return(false);
}
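For completeness, here are sketches of the three helpers referred to above. WellFormedURL performs the simple prefix test described in the text (using Substring rather than the Left helper used elsewhere), while IncludedPath and ExcludedPath simply forward to the DataServices object – assumed here to be held in a field named m_dsData, which is a name of my choosing rather than one from the downloadable project:
private DataServices m_dsData = new DataServices();	// hypothetical field name

private bool WellFormedURL(string strURL)
{
	if (strURL == null || strURL.Length < 7)
		return(false);
	return(String.Compare(strURL.Substring(0, 7), "http://", true) == 0);
}

private bool IncludedPath(string strURL)
{
	return(m_dsData.IncludedURL(strURL));
}

private bool ExcludedPath(string strDomainName, string strURL)
{
	return(m_dsData.ExcludedURL(strDomainName, strURL));
}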
Only one major routine remains now – ShuffleURLs, which accepts the domain name and the HRefs array, returning a weeded and randomly shuffled list. This routine consists of two main For loops. The first of these weeds the list, only including items that have passed the WeWannaCrawl function’s careful eye. The second loop uses the random number generator to mix everything up into random order:
private string[] ShuffleURLs(string strDomainName, string[] strHRefs)
{
	ArrayList raReturn = new ArrayList();
	string[] strReturn;
	int i;

	// first pass: weed the list
	for (i = 0; i < strHRefs.Length; i++)
		if (WeWannaCrawl(strDomainName, strHRefs[i]))
			raReturn.Add(strHRefs[i]);
	strReturn = (string[]) raReturn.ToArray(typeof(string));

	// second pass: shuffle the survivors into random order
	if (!(strReturn == null))
	{
		Random rnd = new Random();
		bool[] bAnchorsVisited = new bool[strReturn.Length];
		for (i = 0; i < strReturn.Length; i++)
		{
			int intAnchorIndex = rnd.Next(0, strReturn.Length);
			while (bAnchorsVisited[intAnchorIndex])
				intAnchorIndex = rnd.Next(0, strReturn.Length);
			bAnchorsVisited[intAnchorIndex] = true;
			string strTemp = strReturn[i];
			strReturn[i] = strReturn[intAnchorIndex];
			strReturn[intAnchorIndex] = strTemp;
		}
	}
	return(strReturn);
}

Crawl on

The simple classes presented here provide a valuable tool for anyone who wishes to write a Web Crawler of any type.
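
To round things off, here is a rough sketch of how a caller might put the pieces together – restrict the crawl to one section of a site, limit the depth, and print each page as it arrives. MaxLevel and SetIncludes follow the sketches given earlier in this article; the downloadable project may name them differently:
class CrawlDemo
{
	static WebCrawler m_crawler = new WebCrawler();

	static void Main()
	{
		m_crawler.MaxLevel = 2;
		m_crawler.SetIncludes(new string[] { "http://www.bbc.co.uk/sport" });
		m_crawler.NewPage += new WebCrawler.NewPageHandler(OnNewPage);
		m_crawler.CrawlURL("http://www.bbc.co.uk/sport");
	}

	static void OnNewPage(string URL, HTMLPage Page, int Level, ref bool NoFollow)
	{
		Console.WriteLine("[{0}] {1}", Level, URL);
	}
}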

Both VB.NET and C# versions of the source code, as well as a GUI demo project, are available for download from the VSJ web site.


Jon Vote is an independent consultant based on the west coast of the USA. He is a Microsoft Certified Solution Developer (MCSD), with a degree in Computer Science from Southern Oregon University. He can be reached at [email protected].
