SAX and VB 6

This article was originally published in VSJ, which is now part of Developer Fusion.
There are a great many XML-based APIs, but the two that are concerned with parsing and accessing the data in an XML document are DOM (Document Object Model) and SAX (Simple API for XML). Which one you use depends on the nature of the document and what you are trying to achieve. DOM works by reading in the entire document and parsing it so that you can have access to any part of it. SAX, on the other hand, lets you process the document as it is read in, using an event-handling model. In VSJ July 2001 we looked at DOM in reasonable detail, and now it's time to look at the alternative, i.e. SAX.

To work with SAX under VB6 you need to install a copy of MSXML (Microsoft XML Core Services), which can be downloaded from the MSDN web site – see box Adding XML to VB 6. MSXML supports a number of different XML technologies, but SAX and DOM are the two that concern us.

DOM is a fine way to process XML data, but it isn't always a suitable way to work. DOM reads in the entire XML file and represents it as an in-memory parse tree. If you really do want to work with all of the information in an XML file, then this might well be the most efficient way to work, but what if you don't? For example, if you want to find a particular name and address in a big XML file, do you really want to read it all as a parse tree? Even if you could afford the memory to store it all, you'd have to wait until all of it was processed before you could start your search for the information. Of course this problem is even worse if the file is being read in via a slow serial stream such as a modem connection to the Internet. A much better idea is to parse the data stream as it comes in and stop the reading of the file as soon as you have found what you're looking for. This is what SAX lets you do. But be warned: because it's an event-driven API, it's more difficult to get started with, as you have to implement all of the event handlers, even if you don't want to use them all! Underneath this initial layer of complexity, it's all very simple – perhaps even simpler than DOM.

There are lots of examples of using SAX in the Microsoft documentation, but they're all so complicated that getting the basic idea of what is going on is difficult. To try and put things right, I'll show you the simplest possible example of SAX in action, and then let you move on to more interesting things.

There are two fundamental SAX objects that you have to get to know – the SAX reader and the SAX content handler (see Figure 1).

Figure 1: The SAX connection

The reader is fairly straightforward because it just reads the XML file you specify and fires off events depending on what it finds. The content handler is more complicated to implement, but not to understand, in that it just accepts the events that the reader fires off. The practical complication is that you have to create a class that implements the content handler interface and write all the event handlers that the interface specifies – even if you don't want to use them.

The easiest way to understand this is to do it. So start from a new .EXE project, and add a class module called ContentHandler (or anything you like as long as you're consistent). Load a reference to Microsoft XML so that you can make use of the COM objects it provides.

Next, include the statement:

Implements IVBSAXContentHandler
…in the general declarations section. This says that your new class will implement all of the methods defined by the IVBSAXContentHandler interface and thus it will behave as if it's a SAX content handler. In case you've never realised it, this is exactly what the Implements statement and the whole idea of "interfaces" are all about!

Now you have to do exactly what you promised – you have to implement the IVBSAXContentHandler interface and this means writing a lot of "stub", i.e. empty, procedures:

Private Sub IVBSAXContentHandler_characters( _
	strChars As String)
End Sub

Private Property Set _
	IVBSAXContentHandler_documentLocator( _
	ByVal RHS As MSXML2.IVBSAXLocator)
End Property

Private Sub IVBSAXContentHandler_endDocument()
End Sub

Private Sub IVBSAXContentHandler_endElement( _
	strNamespaceURI As String, _
	strLocalName As String, strQName As String)
End Sub

Private Sub IVBSAXContentHandler_endPrefixMapping( _
	strPrefix As String)
End Sub

Private Sub IVBSAXContentHandler_ignorableWhitespace( _
	strChars As String)
End Sub

Private Sub IVBSAXContentHandler_processingInstruction( _
	strTarget As String, strData As String)
End Sub

Private Sub IVBSAXContentHandler_skippedEntity( _
	strName As String)
End Sub

Private Sub IVBSAXContentHandler_startDocument()
End Sub

Private Sub IVBSAXContentHandler_startElement( _
	strNamespaceURI As String, _
	strLocalName As String, strQName As String, _
	ByVal oAttributes As MSXML2.IVBSAXAttributes)
End Sub

Private Sub IVBSAXContentHandler_startPrefixMapping( _
	strPrefix As String, strURI As String)
End Sub
If you enter all of these subroutines (and the one property), you'll have a working content handler – but one that doesn't do anything. In the spirit of getting something going, let's move on to using the content handler, useless though it is!

Add a button on the form and add the line:

Private Sub Command1_Click()
	Dim rdr As New SAXXMLReader40
…to the click handler. This creates an instance of the SAX reader object. You can associate a single content handler with the reader via its ContentHandler property, but first we have to create a content handler:
	Dim cnth As New ContentHandler
	Set rdr.ContentHandler = cnth
Now that the reader and content handler are created and connected, we can begin to parse an XML file. This can be done by calling the parseURL method with a file location, or by using the parse method to pass XML data in a string or even in a DOM object. For this simple example we'll just pass it the name of a suitable XML file – in this case the one created using DOM in the earlier article.
	rdr.parseURL "\test.xml"
End Sub
Now when we run the program it all works but, of course, nothing happens. Why should it, as every event handler in the content handler simply returns as soon as it's called? I suppose the only mark of success is the fact that no error messages are generated and it doesn't crash.

Adding events

To indicate that something is indeed happening we need to add some code to the event handlers. The startDocument method is called when the reader opens the document and receives the first data. To see this in action change the method to read:
Private Sub _
IVBSAXContentHandler_startDocument()
	Debug.Print "*******************"
	Debug.Print "Start of document"
End Sub
The startElement method is called whenever an opening tag of any type is encountered. To see this in action change the method to read:
Private Sub _
IVBSAXContentHandler_startElement( _
	strNamespaceURI As String, _
	strLocalName As String, _
	strQName As String, _
	ByVal oAttributes As _
	MSXML2.IVBSAXAttributes)

	Debug.Print "Start", strLocalName
End Sub
This prints the unqualified name of the opening tag that has been encountered. Similarly changing endElement to read:
Private Sub _
IVBSAXContentHandler_endElement( _
	strNamespaceURI As String, _
	strLocalName As String, _
	strQName As String)

	Debug.Print "end", strLocalName
End Sub
…will print the closing tag name as encountered. If you try out the new version of the program you should discover that the opening and closing tag structure of the XML document is printed out. Notice that this isn't at all the way that DOM works. Now the file is being parsed as it is read in and the appropriate events are being triggered before the file is completely processed.
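
The oAttributes argument has been ignored so far. As a minimal sketch, assuming the IVBSAXAttributes members length, getQName and getValue behave as described in the MSXML documentation (with indexes starting at zero), startElement could list each attribute alongside the tag name:

```vb
Private Sub _
IVBSAXContentHandler_startElement( _
	strNamespaceURI As String, _
	strLocalName As String, _
	strQName As String, _
	ByVal oAttributes As _
	MSXML2.IVBSAXAttributes)

	Dim i As Long

	Debug.Print "Start", strLocalName
	' Attributes are assumed to be indexed from 0 to length - 1
	For i = 0 To oAttributes.length - 1
		Debug.Print "  attr", oAttributes.getQName(i), _
			oAttributes.getValue(i)
	Next i
End Sub
```

For a tag such as the phone element with a type attribute used later in this article, this should print the attribute's name and value on the line after the tag name.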

Now that you understand how SAX works, you can probably work out what the rest of the event handlers do. For example, endDocument is called when the entire XML document has been read, and characters is called when there is a string of characters to pass, i.e. the text between tags. Some of the methods are a little less obvious. For example, processingInstruction is called when the reader detects a <?xml-stylesheet type of tag. Even so, by reading the documentation you should be able to follow what everything does.
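
One detail of characters is worth knowing: SAX parsers are generally allowed to deliver a single run of text in more than one call, so a handler that needs the complete string should collect the pieces. The following is only a sketch. The buffer variable is a name introduced here for illustration, and the code assumes elements that contain text only, with no child tags mixed in. These three routines replace the corresponding stubs in the ContentHandler class:

```vb
' Hypothetical module-level buffer in the ContentHandler class
Private buffer As String

Private Sub IVBSAXContentHandler_startElement( _
	strNamespaceURI As String, _
	strLocalName As String, strQName As String, _
	ByVal oAttributes As MSXML2.IVBSAXAttributes)
	' Start collecting afresh for each new element
	buffer = ""
End Sub

Private Sub IVBSAXContentHandler_characters( _
	strChars As String)
	' May be called several times for one run of text
	buffer = buffer & strChars
End Sub

Private Sub IVBSAXContentHandler_endElement( _
	strNamespaceURI As String, _
	strLocalName As String, strQName As String)
	' The text collected since the most recent opening tag
	Debug.Print strLocalName, buffer
	buffer = ""
End Sub
```

Container elements will print only whatever white space followed their last child, which is harmless for a demonstration.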

Useful SAX

One problem that SAX users have is in seeing how to make use of the events that the reader is generating to extract parts of the document. The solution to the problem is to use a flag. For example, suppose you want to extract just the text that occurs between a pair of <first_name> and </first_name> tags. You can do this by adding a Boolean flag variable to the content handler:
Private target As Boolean
In VB you can assume that this is initially set to false. Detecting the start of the tag pair is easy:
Private Sub _
IVBSAXContentHandler_startElement( _
	strNamespaceURI As String, _
	strLocalName As String, _
	strQName As String, _
	ByVal oAttributes As _
	MSXML2.IVBSAXAttributes)
	If strLocalName = "first_name" _
		Then target = True
End Sub
Once the opening tag has been detected and the flag set to true, the text characters between the tags can be printed:
Private Sub _
IVBSAXContentHandler_characters( _
	strChars As String)
	If target Then
		Debug.Print strChars
	End If
End Sub
All that is necessary now is to turn off the output by resetting the flag when the closing tag is reached:
Private Sub _
IVBSAXContentHandler_endElement( _
	strNamespaceURI As String, _
	strLocalName As String, _
	strQName As String)
	If strLocalName = "first_name" _
		Then target = False
End Sub
Now if you run the program you will see only the first names printed out – no matter how many there are.
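
Having found what you want, there is often no point in reading the rest of the file. The content handler interface has no "stop" method. The approach I understand to be the usual one from VB is to raise an error inside a handler, which the reader then passes back to the caller of parseURL. Treat what follows as a sketch under that assumption; the error number is arbitrary and chosen purely for illustration:

```vb
' In the ContentHandler class: stop after the first </first_name>
Private Sub _
IVBSAXContentHandler_endElement( _
	strNamespaceURI As String, _
	strLocalName As String, _
	strQName As String)
	If strLocalName = "first_name" Then
		target = False
		' vbObjectError + 1001 is a made-up error number
		Err.Raise vbObjectError + 1001, "ContentHandler", _
			"Stopped after first_name"
	End If
End Sub

' In the form: expect the parse to finish with an error
Private Sub Command1_Click()
	Dim rdr As New SAXXMLReader40
	Dim cnth As New ContentHandler
	Set rdr.ContentHandler = cnth

	On Error Resume Next
	rdr.parseURL "\test.xml"
	If Err.Number <> 0 Then
		Debug.Print "Parsing ended early:", Err.Number
	End If
	On Error GoTo 0
End Sub
```

When running inside the VB IDE you may need to set the error trapping option to "Break on Unhandled Errors", otherwise the IDE is likely to pause at the Err.Raise line rather than hand the error back to the form.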

SAX Writer

So far the output created by the content handler has just been written to the immediate window as a demonstration, but in practice it is usually better to store the output in some form or another. To help with this task there is another object – the SAX writer. In fact there are a number of different SAX writer objects that let you create different types of output. If you read the documentation you might well find the writer objects difficult to understand, because while the finer details are discussed there is no overview that gives you the basic ideas. The one single important idea is that a SAX writer is a content handler in the sense that it implements the content handler interface. What this means is that it has all of the routines needed to handle events generated by the reader, and these routines write correctly formatted XML to a file, to a string or to a DOM object. By default the output is written to a string and so the simplest use of the SAX writer is:
Private Sub Command1_Click()
	Dim rdr As New SAXXMLReader40
	Dim wrt As New MXXMLWriter40

	Set rdr.ContentHandler = wrt
	rdr.parseURL "\test.xml"
	Print wrt.output
End Sub
This reads the file in and passes it all to the writer which builds the XML file up in its output property. If you would like to see the output as formatted XML then change the indent property using:
wrt.indent = True
Of course copying an XML file without modification isn't where the writer's usefulness lies. The key factor is that the writer has all of the event methods that the content handler has and these generate XML output when called. This makes the SAX writer one of the easiest ways of creating an XML file. The only problem is that the SAX writer object doesn't seem to have the necessary methods when you actually try to use it. The solution to this puzzle is to cast the object to a new and more appropriate type – this is how VB6 handles objects which implement multiple interfaces:
Dim rdr As New SAXXMLReader40
Dim wrt As New MXXMLWriter40
Dim cntwrt As IVBSAXContentHandler
Set cntwrt = wrt
Now we have a SAX writer object that we can refer to either as wrt or cntwrt. When we use cntwrt all of the methods and properties appropriate to a content handler are available to us.

To manually create an XML file is fairly easy, tedium being the only real problem!

wrt.indent = True
cntwrt.startDocument
Dim atrib As New SAXAttributes40
cntwrt.startElement "", "", _
	"AddressBook", atrib
cntwrt.startElement "", "", _
	"Address", atrib
cntwrt.startElement "", "", _
	"first_name", atrib
cntwrt.characters "Mike"
cntwrt.endElement "", "", "first_name"
cntwrt.startElement "", "", _
	"street", atrib
cntwrt.characters "1 Fortran Road"
cntwrt.endElement "", "", "street"
cntwrt.startElement "", "", _
	"town", atrib
cntwrt.characters "Erehwon"
cntwrt.endElement "", "", "town"
atrib.addAttribute "", "", "type", _
	"", "home"
cntwrt.startElement "", "", _
	"phone", atrib
cntwrt.characters "1234 56789"
cntwrt.endElement "", "", "phone"
cntwrt.endElement "", "", "Address"
cntwrt.endElement "", "", _
	"AddressBook"
cntwrt.endDocument
Print wrt.output
End Sub
Notice the use of the attribute object to create the "type" attribute within the <phone> tag. The resulting file is actually rather nicer than the one created using DOM, in that it has some formatting and a full header (see Figure 2).

Figure 2: The XML file generated using SAX

However the SAX writer is used to create an XML file, it can send its output to a string, an IStream object or a DOM object. What this means is that SAX can be used to create a DOM object in memory. Why would you want to do this? Well, the SAX part of the processing manages to work with XML without reading it all into memory, but when you find a small part of the file that you want to work with, DOM is often easier and more powerful. Think of it as SAX being sequential access and DOM direct access. So use SAX to find the bit you want to work with, convert it to DOM and work with it in this form. For example, to create a DOM object corresponding to the file created by the SAX writer in the previous example, we first need to create a DOM document object:

Dim dom As New DOMDocument40
…and then set the SAX writer's output property to the DOM document:
wrt.output = dom
After the endDocument method is called the DOM document can be used as if it had been read in:
cntwrt.endDocument

Dim root As IXMLDOMElement
Set root = dom.documentElement
Dim node As IXMLDOMElement
For Each node In root.childNodes
	Print node.nodeName
Next
Print dom.xml
End Sub

More SAX

This isn't the whole story. There is the small matter of reading in a schema and checking that the XML file is correct. There is also the issue of error handling, and various other very useful tools such as the Filter interface and the HTML writer.
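
As a starting point for the error handling, MSXML defines an IVBSAXErrorHandler interface that is implemented in just the same way as the content handler. The sketch below assumes a second class module, here called ErrorHandler (a name chosen for illustration), and shows the three methods as I recall them from the MSXML documentation. Check the signatures in the Object Browser before relying on them:

```vb
' Class module: ErrorHandler (hypothetical name)
Implements IVBSAXErrorHandler

Private Sub IVBSAXErrorHandler_error( _
	ByVal oLocator As MSXML2.IVBSAXLocator, _
	strErrorMessage As String, _
	ByVal nErrorCode As Long)
	Debug.Print "Error: " & strErrorMessage
End Sub

Private Sub IVBSAXErrorHandler_fatalError( _
	ByVal oLocator As MSXML2.IVBSAXLocator, _
	strErrorMessage As String, _
	ByVal nErrorCode As Long)
	' The locator reports where the reader had got to
	Debug.Print "Fatal error, line " & _
		oLocator.lineNumber & ": " & strErrorMessage
End Sub

Private Sub IVBSAXErrorHandler_ignorableWarning( _
	ByVal oLocator As MSXML2.IVBSAXLocator, _
	strErrorMessage As String, _
	ByVal nErrorCode As Long)
End Sub

' In the form, before calling parseURL:
'	Dim errh As New ErrorHandler
'	Set rdr.errorHandler = errh
```

A fatal error still comes back to the code that called parseURL, so an On Error statement in the click handler is a sensible companion to this class.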

However, by now you should be able to make use of the documentation to figure out the details.

Adding XML to VB 6

If you want to try out this month's examples you will need to go to the Microsoft web site and download MSXML 4.0 if you haven't already got it. I'd like to quote you a URL for this, but these days the Microsoft site changes its URLs far too often. So go to msdn.microsoft.com, navigate to the download section and find MSXML or "Microsoft XML Core Services 4.0". From there, download it using the installer. You may well have to upgrade the Windows installer to make it work, in which case just follow the instructions.
