To work with SAX under VB6 you need to install a copy of MSXML (Microsoft XML Core Services), which can be downloaded from the MSDN web site (see the box "Adding XML to VB 6"). MSXML supports a number of different XML technologies, but SAX and DOM are the two that concern us.
DOM is a fine way to process XML data, but it isn't always a suitable way to work. DOM reads in the entire XML file and represents it as an in-memory parse tree. If you really do want to work with all of the information in an XML file, then this might well be the most efficient way to work, but what if you don't? For example, if you want to find a particular name and address in a big XML file, do you really want to read it all as a parse tree? Even if you could afford the memory to store it all, you'd have to wait until all of it was processed before you could start your search for the information. Of course this problem is even worse if the file is being read in via a slow serial stream such as a modem connection to the Internet.
A much better idea is to parse the data stream as it comes in and stop reading the file as soon as you have found what you're looking for. This is what SAX lets you do. But be warned: because it's an event-driven API, it's more difficult to get started with, as you have to implement all of the event handlers, even if you don't want to use them all! Underneath this initial layer of complexity, it's all very simple, perhaps even simpler than DOM.
There are lots of examples of using SAX in the Microsoft documentation, but they're all so complicated that getting the basic idea of what is going on is difficult. To try and put things right, I'll show you the simplest possible example of SAX in action, and then let you move on to more interesting things.
There are two fundamental SAX objects that you have to get to know – the SAX reader and the SAX content handler (see Figure 1).
Figure 1: The SAX connection
The reader is fairly straightforward because it just reads the XML file you specify and fires off events depending on what it finds. The content handler is more complicated to implement, but not to understand, in that it just accepts the events that the reader fires off. The practical complication is that you have to create a class that implements the content handler interface and write all the event handlers that the interface specifies – even if you don't want to use them.
The easiest way to understand this is to do it. So start from a new .EXE project, and add a class module called ContentHandler (or anything you like as long as you're consistent). Add a reference to Microsoft XML, v4.0 (using Project, References) so that you can make use of the COM objects it provides.
Next, include the statement:

    Implements IVBSAXContentHandler

…in the general declarations section. This says that your new class will implement all of the methods defined by the IVBSAXContentHandler interface and thus it will behave as if it's a SAX content handler. In case you've never realised it, this is exactly what the Implements statement and the whole idea of "interfaces" are all about!
Now you have to do exactly what you promised – you have to implement the IVBSAXContentHandler interface and this means writing a lot of "stub", i.e. empty, procedures:

    Private Sub IVBSAXContentHandler_characters( _
            strChars As String)
    End Sub

    Private Property Set IVBSAXContentHandler_documentLocator( _
            ByVal RHS As MSXML2.IVBSAXLocator)
    End Property

    Private Sub IVBSAXContentHandler_endDocument()
    End Sub

    Private Sub IVBSAXContentHandler_endElement( _
            strNamespaceURI As String, _
            strLocalName As String, _
            strQName As String)
    End Sub

    Private Sub IVBSAXContentHandler_endPrefixMapping( _
            strPrefix As String)
    End Sub

    Private Sub IVBSAXContentHandler_ignorableWhitespace( _
            strChars As String)
    End Sub

    Private Sub IVBSAXContentHandler_processingInstruction( _
            strTarget As String, _
            strData As String)
    End Sub

    Private Sub IVBSAXContentHandler_skippedEntity( _
            strName As String)
    End Sub

    Private Sub IVBSAXContentHandler_startDocument()
    End Sub

    Private Sub IVBSAXContentHandler_startElement( _
            strNamespaceURI As String, _
            strLocalName As String, _
            strQName As String, _
            ByVal oAttributes As MSXML2.IVBSAXAttributes)
    End Sub

    Private Sub IVBSAXContentHandler_startPrefixMapping( _
            strPrefix As String, _
            strURI As String)
    End Sub

If you enter all of these subroutines (and the one property), you'll have a working content handler, but one that doesn't do anything. In the spirit of getting something going, let's move on to using the content handler, useless though it is!
Add a button on the form and add the line:

    Private Sub Command1_Click()
        Dim rdr As New SAXXMLReader40

…to the click handler. This creates an instance of the SAX reader object. You can associate a single content handler with the reader via its ContentHandler property, but first we have to create a content handler:

        Dim cnth As New ContentHandler
        Set rdr.ContentHandler = cnth

Now that the reader and content handler are created and connected we can begin to parse an XML file. This can be done by calling the parseURL method with a file location, or by using the parse method to pass XML data in a string or even in a DOM object. For this simple example we'll just pass it the name of a suitable XML file, in this case the one created using DOM in the earlier article.

        rdr.parseURL "\test.xml"
    End Sub

Now when we run the program it all works but, of course, nothing happens. Why should it, as every event handler in the content handler simply returns as soon as it's called? I suppose the only mark of success is the fact that no error messages are generated and it doesn't crash.
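As a small sketch of the other route mentioned above, here is the parse method being handed the XML itself as a string rather than a file location. The second button (Command2) and the XML fragment are my own inventions for illustration; the ContentHandler class is the one built above.

```vb
' Form code: assumes a second button called Command2 (hypothetical)
' and the ContentHandler class module created earlier.
Private Sub Command2_Click()
    Dim rdr As New SAXXMLReader40
    Dim cnth As New ContentHandler
    Dim xml As String

    Set rdr.contentHandler = cnth

    ' Made-up sample data, held in a string instead of a file
    xml = "<AddressBook><Address>" & _
          "<first_name>Mike</first_name>" & _
          "</Address></AddressBook>"

    ' parse takes the XML data itself; parseURL takes a location
    rdr.parse xml
End Sub
```

With the stub handlers this is just as silent as the parseURL version, but it is a handy way of experimenting without having to keep a test file on disk.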
Adding events
To indicate that something is indeed happening we need to add some code to the event handlers. The startDocument method is called when the reader opens the document and receives the first data. To see this in action change the method to read:

    Private Sub IVBSAXContentHandler_startDocument()
        Debug.Print "*******************"
        Debug.Print "Start of document"
    End Sub

The startElement method is called whenever an opening tag of any type is encountered. To see this in action change the method to read:

    Private Sub IVBSAXContentHandler_startElement( _
            strNamespaceURI As String, _
            strLocalName As String, _
            strQName As String, _
            ByVal oAttributes As MSXML2.IVBSAXAttributes)
        Debug.Print "Start", strLocalName
    End Sub

This prints the unqualified name of the opening tag that has been encountered. Similarly changing endElement to read:

    Private Sub IVBSAXContentHandler_endElement( _
            strNamespaceURI As String, _
            strLocalName As String, _
            strQName As String)
        Debug.Print "end", strLocalName
    End Sub

…will print the closing tag name as encountered. If you try out the new version of the program you should discover that the opening and closing tag structure of the XML document is printed out. Notice that this isn't at all the way that DOM works. Now the file is being parsed as it is read in and the appropriate events are being triggered before the file is completely processed.
Now that you understand how SAX works, you can probably work out what the rest of the event handlers do. For example, endDocument is called when the entire XML document has been read, and characters is called when there is a string of characters to pass, i.e. the text between tags. Some of the methods are a little less obvious. For example, processingInstruction is called when the reader detects a <?xml-stylesheet ... ?> type of tag. Even so, by reading the documentation you should be able to follow what everything does.
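To make that a little more concrete, here is a rough sketch of how a few of the remaining handlers might be filled in, purely for tracing purposes. These go in the ContentHandler class in place of the corresponding stubs (and in place of the earlier startElement). The labels printed ("Text", "PI" and so on) are my own choice, and the attribute loop relies on the length, getQName and getValue members of IVBSAXAttributes.

```vb
' ContentHandler class module: replacements for the matching stubs.

Private Sub IVBSAXContentHandler_characters(strChars As String)
    ' Text found between tags
    Debug.Print "Text", strChars
End Sub

Private Sub IVBSAXContentHandler_endDocument()
    Debug.Print "End of document"
End Sub

Private Sub IVBSAXContentHandler_processingInstruction( _
        strTarget As String, _
        strData As String)
    ' strTarget is the name after "<?", strData is the rest
    Debug.Print "PI", strTarget, strData
End Sub

Private Sub IVBSAXContentHandler_startElement( _
        strNamespaceURI As String, _
        strLocalName As String, _
        strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    Dim i As Long

    Debug.Print "Start", strLocalName

    ' List any attributes carried by the opening tag
    For i = 0 To oAttributes.length - 1
        Debug.Print "  attribute", _
            oAttributes.getQName(i), oAttributes.getValue(i)
    Next i
End Sub
```

Running the program with these in place should give you a running commentary on the document in the immediate window, which is a good way of getting a feel for the order in which the events arrive.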
Useful SAX
One problem that SAX users have is in seeing how to make use of the events that the reader is generating to extract parts of the document. The solution to the problem is to use a flag. For example, suppose you want to extract just the text that occurs between a pair of <first_name> and </first_name> tags. You can do this by adding a Boolean flag variable to the content handler:

    Private target As Boolean

In VB you can assume that this is initially set to False. Detecting the start of the tag pair is easy:

    Private Sub IVBSAXContentHandler_startElement( _
            strNamespaceURI As String, _
            strLocalName As String, _
            strQName As String, _
            ByVal oAttributes As MSXML2.IVBSAXAttributes)
        If strLocalName = "first_name" Then target = True
    End Sub

Once the opening tag has been detected and the flag set to True, the text between the tags can be printed:

    Private Sub IVBSAXContentHandler_characters( _
            strChars As String)
        If target Then
            Debug.Print strChars
        End If
    End Sub

All that is necessary now is to turn off the output by resetting the flag when the closing tag is reached:

    Private Sub IVBSAXContentHandler_endElement( _
            strNamespaceURI As String, _
            strLocalName As String, _
            strQName As String)
        If strLocalName = "first_name" Then target = False
    End Sub

Now if you run the program you will see only the first names printed out, no matter how many there are.
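The flag idea can be taken a step further. The sketch below is a variation of my own rather than anything from the Microsoft samples. It collects the text in a buffer, because a SAX reader is allowed to deliver one run of text in several characters calls, and it tries to stop the parse after the first match by raising an error from the handler. I'm assuming here that an error raised in a handler abandons the parse and surfaces at the parseURL call; check the MSXML documentation before relying on it. The buffer variable and the error number are arbitrary choices.

```vb
' --- ContentHandler class module (changed members only) ---
Private target As Boolean   ' True while inside <first_name>
Private buffer As String    ' text collected for the current element

Private Sub IVBSAXContentHandler_startElement( _
        strNamespaceURI As String, _
        strLocalName As String, _
        strQName As String, _
        ByVal oAttributes As MSXML2.IVBSAXAttributes)
    If strLocalName = "first_name" Then
        target = True
        buffer = ""
    End If
End Sub

Private Sub IVBSAXContentHandler_characters(strChars As String)
    ' May be called more than once per element, so accumulate
    If target Then buffer = buffer & strChars
End Sub

Private Sub IVBSAXContentHandler_endElement( _
        strNamespaceURI As String, _
        strLocalName As String, _
        strQName As String)
    If strLocalName = "first_name" Then
        target = False
        Debug.Print buffer
        ' Assumption: raising an error here stops the reader.
        ' The number and text are arbitrary.
        Err.Raise vbObjectError + 1000, "ContentHandler", _
            "first_name found"
    End If
End Sub

' --- Form code ---
Private Sub Command1_Click()
    Dim rdr As New SAXXMLReader40
    Dim cnth As New ContentHandler

    Set rdr.contentHandler = cnth

    On Error Resume Next
    rdr.parseURL "\test.xml"
    ' Note: a genuine parse error would end up here as well
    If Err.Number <> 0 Then Debug.Print "Parsing stopped early"
    On Error GoTo 0
End Sub
```

If you try this from the IDE, make sure error trapping isn't set to break on all errors, otherwise VB will stop at the Err.Raise line instead of passing the error back.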
SAX Writer
So far the output created by the content handler has just been written to the immediate window as a demonstration, but in practice it is usually better to store the output in some form or another. To help with this task there is another object, the SAX writer. In fact there are a number of different SAX writer objects that let you create different types of output. If you read the documentation you might well find the writer objects difficult to understand, because while the finer details are discussed there is no overview that gives you the basic ideas.
The one important idea is that a SAX writer is a content handler in the sense that it implements the content handler interface. What this means is that it has all of the routines needed to handle events generated by the reader, and these routines write correctly formatted XML to a file, to a string or to a DOM object. By default the output is written to a string and so the simplest use of the SAX writer is:

    Private Sub Command1_Click()
        Dim rdr As New SAXXMLReader40
        Dim wrt As New MXXMLWriter40
        Set rdr.ContentHandler = wrt
        rdr.parseURL "\test.xml"
        Print wrt.output
    End Sub

This reads the file in and passes it all to the writer, which builds the XML file up in its output property. If you would like to see the output as formatted XML then set the indent property before parsing using:

    wrt.indent = True

Of course, copying an XML file without modification isn't what makes the writer useful. The key factor is that the writer has all of the event methods that the content handler has and these generate XML output when called. This makes the SAX writer one of the easiest ways of creating an XML file. The only problem is that the SAX writer object doesn't seem to have the necessary methods when you actually try and use it. The solution to this puzzle is to cast the object to a new and more appropriate type, which is how VB6 handles objects that implement multiple interfaces:

    Dim rdr As New SAXXMLReader40
    Dim wrt As New MXXMLWriter40
    Dim cntwrt As IVBSAXContentHandler
    Set cntwrt = wrt

Now we have a SAX writer object that we can refer to either as wrt or cntwrt. When we use cntwrt all of the methods and properties appropriate to a content handler are available to us.
To manually create an XML file is fairly easy, tedium being the only real problem!

    wrt.indent = True
    cntwrt.startDocument
    Dim atrib As New SAXAttributes40
    cntwrt.startElement "", "", "Addressbook", atrib
    cntwrt.startElement "", "", "Address", atrib
    cntwrt.startElement "", "", "first_name", atrib
    cntwrt.characters "Mike"
    cntwrt.endElement "", "", "first_name"
    cntwrt.startElement "", "", "street", atrib
    cntwrt.characters "1 Fortran Road"
    cntwrt.endElement "", "", "street"
    cntwrt.startElement "", "", "town", atrib
    cntwrt.characters "Erehwon"
    cntwrt.endElement "", "", "town"
    ' Add the type attribute just before writing the <phone> tag
    atrib.addAttribute "", "", "type", "", "home"
    cntwrt.startElement "", "", "phone", atrib
    cntwrt.characters "1234 56789"
    cntwrt.endElement "", "", "phone"
    cntwrt.endElement "", "", "Address"
    ' The closing name must match the opening tag exactly
    cntwrt.endElement "", "", "Addressbook"
    cntwrt.endDocument
    Print wrt.output
    End Sub

Notice the use of the attribute object to create the "type" attribute within the <phone> tag. The resulting file is actually rather nicer than the one created using DOM, in that it has some formatting and a full header (see Figure 2).
Figure 2: The XML file generated using SAX
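One thing to watch if you extend the listing is that the same attribute object is reused for every startElement call, so anything added to it stays there until it is removed. The following sketch, a cut-down variation of my own on the address example, uses the clear method of the attributes object to empty it again before writing a tag that shouldn't carry the attribute. The second button (Command2) is hypothetical.

```vb
' Form code: assumes a second button called Command2 (hypothetical).
Private Sub Command2_Click()
    Dim wrt As New MXXMLWriter40
    Dim cntwrt As IVBSAXContentHandler
    Dim atrib As New SAXAttributes40

    Set cntwrt = wrt
    wrt.indent = True

    cntwrt.startDocument
    cntwrt.startElement "", "", "Address", atrib

    ' <phone> carries a type attribute
    atrib.addAttribute "", "", "type", "", "home"
    cntwrt.startElement "", "", "phone", atrib
    cntwrt.characters "1234 56789"
    cntwrt.endElement "", "", "phone"

    ' Empty the attribute list so that <town> is written
    ' without the type attribute
    atrib.Clear
    cntwrt.startElement "", "", "town", atrib
    cntwrt.characters "Erehwon"
    cntwrt.endElement "", "", "town"

    cntwrt.endElement "", "", "Address"
    cntwrt.endDocument
    Print wrt.output
End Sub
```

An alternative is simply to keep two attribute objects, one empty and one loaded, and pass whichever is appropriate.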
However the SAX writer is used to create an XML file, it can send its output to a string, an IStream object or a DOM object. What this means is that SAX can be used to create a DOM object in memory. Why would you want to do this? Well, the SAX part of the processing manages to work with XML without reading it all into memory, but when you find a small part of the file that you want to work with, DOM is often easier and more powerful. Think of it as SAX being sequential access and DOM direct access. So use SAX to find the bit you want to work with, convert it to DOM and work with it in this form. For example, to create a DOM object corresponding to the file created by the SAX writer in the previous example, we first need to create a DOM document object:

    Dim dom As New DOMDocument40

…and then set the SAX writer's output property to the DOM document:

    wrt.output = dom

After the endDocument method is called the DOM document can be used as if it had been read in:

    cntwrt.endDocument
    Dim root As IXMLDOMElement
    Set root = dom.documentElement
    Dim node As IXMLDOMElement
    For Each node In root.childNodes
        Print node.nodeName
    Next
    Print dom.xml
    End Sub

More SAX
This isn't the whole story. There is the small matter of reading in a schema and checking that the XML file is correct. There is also the issue of error handling, and there are various other very useful tools such as the Filter interface and the HTML writer. However, by now you should be able to make use of the documentation to figure out the details.
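To give you a starting point on the error handling front, here is a minimal sketch of an error handler class. It works in just the same way as the content handler: you implement an interface, this time IVBSAXErrorHandler, and attach an instance to the reader's errorHandler property. The class name ErrorHandler and the messages printed are my own choices, and it is safest to let the VB editor generate the procedure stubs so that the signatures match your installed version of MSXML.

```vb
' --- Class module called ErrorHandler (hypothetical name) ---
Implements IVBSAXErrorHandler

Private Sub IVBSAXErrorHandler_error( _
        ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, _
        ByVal nErrorCode As Long)
    Debug.Print "Error at line"; oLocator.lineNumber; _
        ": "; strErrorMessage
End Sub

Private Sub IVBSAXErrorHandler_fatalError( _
        ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, _
        ByVal nErrorCode As Long)
    Debug.Print "Fatal error at line"; oLocator.lineNumber; _
        ", column"; oLocator.columnNumber; ": "; strErrorMessage
End Sub

Private Sub IVBSAXErrorHandler_ignorableWarning( _
        ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, _
        ByVal nErrorCode As Long)
    Debug.Print "Warning: "; strErrorMessage
End Sub

' --- Form code ---
Private Sub Command1_Click()
    Dim rdr As New SAXXMLReader40
    Dim cnth As New ContentHandler
    Dim errh As New ErrorHandler

    Set rdr.contentHandler = cnth
    Set rdr.errorHandler = errh

    ' A badly formed file also raises a VB error at the call,
    ' so trap it here rather than letting the program stop
    On Error Resume Next
    rdr.parseURL "\test.xml"
    If Err.Number <> 0 Then Debug.Print "Parse failed"
    On Error GoTo 0
End Sub
```

Try it on a file with a deliberately mismatched closing tag and you should see the position of the problem reported in the immediate window.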
Adding XML to VB 6
If you want to try out this month's examples you will need to go to the Microsoft web site and download MSXML 4.0 if you haven't already got it. I'd like to quote you a URL for this, but these days the Microsoft site changes its URLs far too often. So go to msdn.microsoft.com, navigate to the download section and find MSXML or "Microsoft XML Core Services 4.0". From there, download it using the installer. You may well have to upgrade the Windows installer to make it work, in which case just follow the instructions.