Search engines with ASP.NET

This article was originally published in VSJ, which is now part of Developer Fusion.
In the November 2004 issue of VSJ, I showed how to write a Web Indexer, that crawls the web and builds a cross-index of words by URL. This month, we will develop an ASP.NET-based Search Engine, which will retrieve the results of the Web Indexer. Both VB.NET and C# versions of the Web Indexer project (as well as this project) can be downloaded. The Web Indexer generates the data that will be used by this project.

Restoring and configuring the database

If you are using the desktop version of SQL Server (MSDE) you will have to restore the database using the command line, as there is no GUI interface. Start the SQL Server command processor from a DOS prompt as follows, substituting YourServer for the name of your SQL Server:
osql –E –SYourServer
If you aren’t sure what to enter as your SQL Server name, start up Visual Studio .NET and open the Server Explorer. Under the Servers node, you should see your machine listed. Open this node and you will see an entry for SQL Servers at the bottom. Your SQL Server instance should appear as the first sub-node. The name listed here is the one you want.

The –E switch tells SQL Server to use integrated security – something we will discuss a bit later on. Assuming SQL Server is properly installed and running you should now be presented with the SQL Server prompt. Let’s say you copied the database file to c:\temp\dsWebLex.dat. To restore the database you would enter the following:

use master
go
restore database dbWebLex
from disk = ‘c:\temp\dbWebLex.dat’
with recovery
go
Verify that the database has been properly restored by using the Server Explorer.

Searching the database

We will need to write a Stored Procedure to retrieve the entries from the database, so let’s take a quick review of the tables. tblKeywords has an entry for every word encountered on any page, along with a word count. tblURLs has the title and URL of every page visited. tblKeywordURL relates the ID of a specific word in the database to the ID of a specific URL in the database. Every word in every indexed web page has a corresponding entry in this file, with one exception. tblStopWords contains a list of any words that are to be ignored during indexing.

Locate the dbWebLex database using the .NET Server Explorer. Create a new stored procedure by Right-clicking on the Stored Procedures node and select New Stored Procedure from the popup menu. Change the name to SearchWebLex. This routine will accept 11 parameters. The first of these indicates the type of search – all of the words or any of the words. Up to 10 optional search words may be specified with the remaining parameters:

CREATE PROCEDURE dbo.SearchWebLex
	(
		/* Up to 10 words */
		@SearchType int = 0,
		/* 0=Any words, 1=All words */
		@Word0 varchar(30) = Null,
		@Word1 varchar(30) = Null,
		@Word2 varchar(30) = Null,
		@Word3 varchar(30) = Null,
		@Word4 varchar(30) = Null,
		@Word5 varchar(30) = Null,
		@Word6 varchar(30) = Null,
		@Word7 varchar(30) = Null,
		@Word8 varchar(30) = Null,
		@Word9 varchar(30) = Null
	)
AS
	.
	.
	.
RETURN
The trick to an efficient keyword search like this is the word count. Consider a search for pages containing both the words “web” and “antidisestablishmentarianism.” You would expect the first word to appear many times and the second one few if any. Let’s say that the first word appears 40,000 times in your database, and the second only 3 times. We want to find a page that contains both of these words, so we can either check the 40,000 entries for “web,” hoping to find a match or check the 3 instances of “antidisestablishmentarianism.” The first solution requires 40,000 database fetches while the second requires only 3. So, we will arrange multiple keywords in order from least occurrence to most. The procedure begins by converting each string parameter to its corresponding ID in the database. A cursor is declared that will select records from tblKeywords ordered by KeywordCount:
SET NOCOUNT ON
	Declare @NumParms Integer
	Declare @ID Integer
	Declare @ID0 Integer
	Declare @ID1 Integer
	Declare @ID2 Integer
	Declare @ID3 Integer
	Declare @ID4 Integer
	Declare @ID5 Integer
	Declare @ID6 Integer
	Declare @ID7 Integer
	Declare @ID8 Integer
	Declare @ID9 Integer
	Declare curKeywords Cursor For
		Select ID from Keywords Where Keyword
			In(@Word0,@Word1,@Word2,@Word3,@Word4,
			@Word5,@Word6,@Word7,@Word8,@Word9)
			Order By KeywordCount
We now loop for each record, which will be equal to the number of keywords entered, and populate the appropriate ID variable – @ID0 the first time through the loop, @ID1 the next, and so on:
	Open curKeywords
	Fetch Next From curKeywords Into @ID
	Set @NumParms = 0
	While @@Fetch_Status = 0 Begin
		If @ID Is Not NULL Begin
		If @NumParms = 0
			Set @ID0 = @ID
		Else
		If @NumParms = 1
			Set @ID1 = @ID
		Else
		If @NumParms = 2
			Set @ID2 = @ID
		Else
		If @NumParms = 3
			Set @ID3 = @ID
		Else
		If @NumParms = 4
			Set @ID4 = @ID
		Else
		If @NumParms = 5
			Set @ID5 = @ID
		Else
		If @NumParms = 6
			Set @ID6 = @ID
		Else
		If @NumParms = 7
			Set @ID7 = @ID
		Else
		If @NumParms = 8
			Set @ID8 = @ID
		Else
		If @NumParms = 9
			Set @ID9 = @ID
			Set @NumParms = @NumParms + 1
			Fetch Next From curKeywords Into @ID
		End
	End /* While */
	Close curKeywords
	Deallocate curKeywords
So, for the example search of “web antidisestablishmentarianism” we will have @ID0 set to the ID for “antidisestablishmentarianism” and @ID1 set to the ID for “web.” The remaining parameters will be null. If ten words had been entered, all variables @ID0 to @ID9 would be non null.

The select statement depends on the type of search, so we check this and branch accordingly:

if @SearchType = 0
	Select Distinct Title, URL From tblURLs a
		Left Join tblKeywordURL b On a.ID = b.URLID
		Where KeywordID In(@ID0,@ID1,@ID2,@ID3,@ID4,@ID5,
			@ID6,@ID7,@ID8,@ID9)
		Group By Title, URL
Else
	Select Distinct Title, URL From tblKeywordURL
		Left Join tblURLs
			On tblKeywordURL.URLID = tblURLs.ID
		Where KeywordID In(@ID0,@ID1,@ID2,@ID3,
			@ID4,@ID5,@ID6,@ID7,@ID8,@ID9)
		And KeywordID Is Not NULL
		Group By Title, URL
		Having Count(*) = @NumParms

Database security

You should now be able to run the procedure within the Visual Studio .NET Server Explorer environment, but you are probably still not ready to access it with ASP.NET. Given the numerous postings on the various news groups and the problems I have had personally, a few words about security with ASP.NET seem to be in order. Again, this is mainly an issue if you are using MSDE. The problem is not very well documented in the first place and most of what is documented about the problem pertains to SQL Server enterprise and not to MSDE.

There are two basic kinds of security used with SQL Server: Windows Integrated security and SQL Server security. The first type allows access to SQL Server based on your Windows 2000 credentials. The second type requires entering an SQL Server UserID and Password. The problem is that the default configuration of ASP.NET will not allow Integrated Security, and the default configuration of MSDE will not allow SQL Server security. How I long for the days when you were safe taking the defaults!

The preferred solution is to enable SQL Server to use both Integrated and SQL Server security. If you have the enterprise version of SQL Server, you can do this using the Enterprise Manager. However MSDE has no Enterprise manager and so it is necessary to either re-install MSDE using custom mode, or change the LoginMode registry setting.

One must use extreme caution when changing any registry settings, as a corrupted registry can cause serious operating system problems. You should always back up the registry prior to any modification and be advised that you always change the registry at your own risk. You will find the desired entry in one of two places depending on whether MSDE was installed as a named or default instance. If you are not sure, check both places because it will be in one or the other. For a default instance, the path is:

HKEY_LOCAL_MACHINE
	\Software\Microsoft\
	MSSqlserver\MSSqlServer\LoginMode
If you don’t find such an entry, the second place to look will be based on your server’s instance name. Supposing this name was YourInstance, the registry entry would look like this:
HKEY_LOCAL_MACHINE
	\Software\Microsoft\Microsoft SQL
		Server\YourInstance\
		MSSQLServer\LoginMode
The LoginMode setting expects 1 for Integrated Security and 2 to allow both Integrated and SQL Server security. Change the value to 2.

For more information on this problem and the solution, see Microsoft Knowledge Base Article 325022. ,p>With SQL Server security enabled, you still need to add a user and give it proper authority. To do this, enter the SQL Server command prompt as described above using Integrated Security:

osql –E –SYourServer
We will add and authorize a user named “weblex” with a password of “search” authorized to the dbWebLex database. The process begins by adding an SQL Server login using sp_addlogin:
sp_addlogin “weblex”, “search”
go
Now, you need to grant access to the database, using sp_grantdbaccess:
use dbWebLex
go
sp_grantdbaccess “weblex”
go
That grants access to the database, but nothing else. We will also need to grant access to the stored procedure, SearchWebLex. The weblex user will be allowed to execute this procedure but nothing else. Access is granted using the grant command:
use dbWebLex
go
grant execute on SearchWebLex
	to weblex
go
You can double-check the security settings from the DOS prompt. Try getting into SQL Server with the following command line. Of course you will substitute your server name, as appropriate:
osql –SYourServer –Uweblex -Psearch
Remember not to use the –E option which is mutually exclusive to using a UserID and Password. If SQL Server security has been properly enabled, and the user properly authorized you should get an SQL Server prompt and no errors. Try executing the procedure:
use dbWebLex
go
SearchWebLex
go
You won’t get any records since no parameters have been passed, but you should not get any errors either. If everything checks out from the command line, you are ready to connect from ASP.NET.

Creating the WebForm

Start a new Web Project, setting the layout and background as desired. You will want to include a TextBox, Label, RadioButtonList, Button and DataGrid, which are given IDs of txtSearch, lblNoResults, rbAnyAll, btnSearch and dgResults respectively. Add two Items to the RadioButtonList by clicking on the Items ellipsis and then the Add button. The text for these should be set to “Any Words” and “All Words”. Set the Selected property to True for the first item. Set the Text property of the Button to “Search” and the Text Property of the Label to “Sorry, no matching pages where found.” The DataGrid and Label should both have their Visible property set to False.

Add an SQLDataAdapter, which will bring up the Data Adapter Configuration Wizard. Click next to continue. Click the ‘New Connection’ button. If you are using MSDE, your server will not appear in the drop down list, but you can enter it manually. Click the option to use a specific user and password. Enter “weblex” for the user and “search” for the password. Check the ‘Allow saving password box.’ Click the Test Connection button to make sure all is well. Click Ok to close the dialog. You will get a message warning that the UserID and Password will be passed as unencrypted text. We have only authorized this user to execute SearchWebLex so this should not be an issue. Click Next again to proceed. From the Choose Query Type dialog, change the option from ‘Use SQL Statements’ to ‘Use existing stored procedures.’ We will only need to bind the Select Command, which is the first dropdown. Choose SearchWebLex and click the Finish button. Click on the SQLDataAdapter and change the name to daWebLex. Click the ‘Generate Dataset’ link to generate the dataset and accept the defaults. Change the generated dataset name to dsWebLex.

Now we can set the DataGrid properties. You may wish to click the Auto Format link and choose one of the pre-selected styles from the list. Figure_1 shows the WebForm from the downloadable project.

Figure 1
Figure 1

Set the DataSource to dsWebLex and set the ShowFooter property to True. We will need to tweak the Columns collection a bit. The stored procedure will return two columns – the Title and URL. Rather than display these two items side by side, we wish instead to present a hyper-link showing the Title but pointing to the page represented by the URL. Click the ellipsis for the Columns property to bring up the dialog. Clear the CheckBox at the top left to suppress creating columns automatically at run time. Select the ‘(All Fields)’ node and click the right arrow button to add them to the Selected columns list. Un-check the ‘Visible’ box for each field. From the Available columns list, click the entry for HyperLink Column and press the right-arrow button to add it to the Selected columns list. Set the Header text to “Title.” Locate the Text field entry and select Title from the drop-down. The entry for the URL field will be found to the right, and here select URL from the drop-down.

Now click the Paging tab at the left to reveal the Paging dialog. Click the ‘Allow paging’ box.

Click Ok to accept the changes. You should now have a single column that will display as a hyper-link showing the Title and pointing to the URL.

Populating the grid will be a matter of setting the parameters for the SQLDataAdapter’s Select Command procedure and executing the Fill and Databind methods. Recall that SearchWebLex accepts 11 parameters in all. The first parameter, @SearchType indicates the type of search. A value of zero will initiate a search for any words entered. A value of one will match all of the words. The next parameters are set to the word or words to be searched – up to 10 items.

The first part is easy. We will just check the Selected property of the RadioButtonList Items collection and set the value appropriately. To set the search parameters, we will first need to massage the input a bit. We will want to parse out each word entered, while ignoring any special characters or punctuation. Ignoring the special characters requires a bit of thought. For this project we will take a very simple approach. Certain characters will be treated as null, while others will be converted to blanks. The contraction “don’t” will be changed to “dont” but “apples,pears” will yield two words: “apples” and “pears.”

Of course you should treat strings here in the same manner as the Web Indexer does and if you find yourself extending this functionality in one or the other, you would be better served to create a separate class that is used by both applications so that the filtering process is consistent. You may, for example, normally want to treat a period as a blank unless preceded by “vb” and followed by “net” in which case it would be treated differently. Providing this function in a separate class would insure that the string was treated the same way across all related applications. The string is passed to this routine (not listed) first before being sent to the parsing routine. Function ParseWords will accept this string, returning a string array containing the parsed words:

private string[] ParseWords(
	string strText)
{
	if ((strText == null) |
		(strText == “”))
		return(null);
	else
	{
		string strMatchWords = “\\w+”;
		System.Text.RegularExpressions.
			MatchCollection mc =
			System.Text.
			RegularExpressions.Regex.
			Matches(strText,
			strMatchWords);
		string[] strWords =
			new string[mc.Count];
		int i = 0;
		for (i = 0;
			i < strWords.Length; i++)
			strWords[i] = mc[i].Value;
		return(strWords);
	}
}
The parsing function is called by SetSearchParms, which is called whenever the user presses the Search button. This routine simply saves the parsed keywords to the ViewState collection:
private void SetSearchParms()
{
	int i = 0;
	string[] strWords = ParseWords
		(StripSpecialCharacters(
		txtSearch.Text));
	if (!(strWords == null))
	{
		int intUBound =
			strWords.Length;
		if (intUBound >
			MAX_UBOUND_WORDS)
			intUBound =
				MAX_UBOUND_WORDS;
		for (i = 0;
			i <= MAX_UBOUND_WORDS; i++)
			ViewState[“@Word” + i]
				= null;
			for (i = 0; i < intUBound;
				i++)
			{
				if (strWords[i] != “”)
					ViewState[“@Word”
					+ i] = strWords[i];
		}
	}
	dgResults.CurrentPageIndex = 0;
}
The DataGrid is updated in the routine PopulateGrid, which expects to find the search parameters, if any, in the ViewState collection:
private void PopulateGrid()
{
	if (rbAnyAll.Items[0].Selected)
		daWebLex.SelectCommand.
		Parameters[“@SearchType”].
		Value = 0;
	else
		daWebLex.SelectCommand.
			Parameters[“@SearchType”].
			Value = 1;
		bool bSearching = false;
		int i = 0;
		for (i = 0;
			i <= MAX_UBOUND_WORDS; i++)
			daWebLex.SelectCommand.
			Parameters[“@Word”
			+ i].Value = null;
	if (ViewState[“@Word” + i] != null)
		if ((string)ViewState[“@Word”
			+ i] != “”)
		{
			bSearching = true;
			daWebLex.SelectCommand.
			Parameters[“@Word”
			+ i].Value = (string)
			ViewState[“@Word” + i];
		}
		dsWebLex.Clear();
		daWebLex.Fill(dsWebLex);
		DataBind();
	if (dgResults.Items.Count == 0)
	{
		dgResults.Visible = false;
		if (bSearching)
			lblNoResults.Visible = true;
	}
	else
	{
		dgResults.Visible = true;
		lblNoResults.Visible = false;
		UpdateFooter();
	}
}
Since this is a web-based application, the state information will not be preserved after a post. So, we will always call the PopulateGrid routine during a Page_Load event, which will initialize the SQLDataAdapter and DataGrid to the current state – empty the first time or populated if a successful search has been previously achieved.

Of course no search can be initiated without the code for the Search button’s click event. A search is done by first setting the search parameters with a call to SetSearchParms and then updating the grid with a call to PopulateGrid:

private void btnSearch_Click(object
	sender, System.EventArgs e)
{
	SetSearchParms();
	PopulateGrid();
}
We should now be able to populate the grid – but we still need to add code to the PageIndexChanged event of the Datagrid to enable paging. The page is changed by setting the CurrentPageIndex property of the DataGrid and then invoking the DataBind() function:
private void
	dgResults_PageIndexChanged
	(object source,
		System.Web.UI.WebControls.
		DataGridPageChangedEventArgs e)
{
	dgResults.CurrentPageIndex =
		e.NewPageIndex;
	DataBind();
}

Conclusion

This project, along with the Web Indexer project, provides a basic yet useful and fully functional search engine package. While the large search engines such as Google are very good at what they do, their disadvantage is that they encompass the entire World Wide Web. With your own custom search engine, you can index your own sites or provide for a much narrower focus tailored to your own needs and interests.

The complete listing for this project is available in both VB.NET and C# versions from the codebin.


Jon Vote is an independent consultant based on the west coast of the USA. He is a Microsoft Certified Solution Developer (MCSD), with a degree in Computer Science from Southern Oregon University.

You might also like...

Comments

Contribute

Why not write for us? Or you could submit an event or a user group in your area. Alternatively just tell us what you think!

Our tools

We've got automatic conversion tools to convert C# to VB.NET, VB.NET to C#. Also you can compress javascript and compress css and generate sql connection strings.

“C++: an octopus made by nailing extra legs onto a dog.” - Steve Taylor