Debugging with the DIA SDK

This article was originally published in VSJ, which is now part of Developer Fusion.
The Microsoft Debug Interface Access Software Development Kit (DIA SDK) provides access to debug information stored in program database (.PDB) files generated by Microsoft post-compiler tools. Because the format of the PDB file undergoes constant revision, Microsoft has decided that exposing the format is impractical. Using the DIA SDK, you can develop applications that search for and browse debug information stored in a PDB file.

To write a debugger, it is best to start with the source code of someone else’s effort. John Robbins supplies exactly that in his excellent book Debugging Applications (Microsoft Press, 2000). His code deals with all the architectural aspects of creating a debugger, but bemoans the lack of a full symbol engine that can associate machine code and assembler back to source code. He concludes his debugger chapter, “the dbghelp symbol engine is sufficient for some excellent debugging helper utilities, but it isn’t enough for a real debugger. You can always tackle reverse engineering the PDB file format yourself, and we can all hope that Microsoft will someday release the PDB access routines”.

Well, that time has come: Microsoft has released PDB access routines. They are called the Debug Interface Access (DIA) SDK, which first shipped with Visual Studio .NET. The SDK is the official, Microsoft-endorsed method of accessing symbolic information from Microsoft code, both managed and unmanaged. It can read the new .NET PDB format, which is identified by an RSDS signature, and the older PDB formats generated by Visual Studio 6.0.

“Excellent – article over!” I hear you cry. Until, that is, I quote Matt Pietrek from Under the Hood (MSDN Magazine, March, 2002):

“The DIA APIs are COM-based and not for the faint of heart.”

For Matt to say something is not for the faint of heart is enough to deter most people. Hence, a search of the Internet for references to the DIA SDK turns up little beyond the MSDN documentation.

In this article, I develop a small utility that will expose the workings of the DIA SDK. It is a command line utility that takes a PDB file name and a class name as parameters, and reverse engineers a C++ class declaration. This serves a number of purposes. Firstly, we get an insight into how to use the DIA SDK. Secondly, we get an idea of how rich a data source the PDB files really are. Thirdly, this reveals how much potentially sensitive information you are exposing if you ever ship full PDB files with your products.

Note: for the sake of clarity, code snippets in the text of this article don’t show error checking, but the accompanying sample code does implement full error checking of all COM calls.

Getting started

To build the sample code, open the supplied project file in Visual Studio .NET. Ensure that your Visual Studio include path includes the DIA SDK (Microsoft Visual Studio .NET\Visual Studio SDKs\DIA SDK\include) and your Visual Studio library path includes the DIA SDK (Microsoft Visual Studio .NET\Visual Studio SDKs\DIA SDK\lib).

To run the application you will need to ensure that the msdia20.dll (found in Microsoft Visual Studio .NET\Visual Studio SDKs\DIA SDK\bin) is registered as a COM component, by invoking regsvr32.exe on the dll.

While the MSDN documentation has comprehensive coverage of the syntax for using the DIA SDK, it lacks overview material and contextual information, making the SDK seem impenetrable and complex. Similarly, although there is a sample application (installed into the path “Microsoft Visual Studio .NET\Visual Studio SDKs\DIA SDK\Sample”), its source code, which contains the only documentation of the alternative ways to load the DIA library, is complicated by its excessive use of recursion.

Loading the SDK

The DIA SDK is implemented in a single dll called msdia20.dll. Most of the SDK interface definitions can be found in the dia2.h header file (found in the DIA SDK include directory). The SDK interfaces are built upon COM, but there are three different options for loading the dll into your application.

The dll can be registered as a COM component using regsvr32.exe and loaded in the normal COM manner.

// CoInitialize must already have been
// called on this thread
CComPtr<IDiaDataSource>
	pIDiaDataSource;
CoCreateInstance(__uuidof( DiaSource ),
	0, CLSCTX_INPROC_SERVER,
	__uuidof( IDiaDataSource ),
	(void**) &pIDiaDataSource);
The SDK also appears to allow applications to use the dll in two other ways without COM registration. If the application is happy to call CoInitialize and build upon the OLE libraries, then the application can explicitly load the dll by calling NoRegCoCreate.
CComPtr<IDiaDataSource>
	pIDiaDataSource;
NoRegCoCreate("msdia20.dll",
	__uuidof( DiaSourceAlt ),
	__uuidof( IDiaDataSource ),
	(void **) &pIDiaDataSource );
However, if the application does not even want to build upon the OLE libraries, the dll can be loaded by calling NoOleCoCreate. This option means that all strings need to be freed using LocalFree instead of SysFreeString. This implies that use of ATL classes such as CComBSTR would be incorrect in this situation.
CComPtr<IDiaDataSource>
	pIDiaDataSource;
NoOleCoCreate(__uuidof( DiaSourceAlt ),
	__uuidof( IDiaDataSource ),
	(void **) &pIDiaDataSource );
Both of the alternate functions for loading the dll are prototyped in the diacreate.h header file (found in the DIA SDK include directory), which ships with the SDK, but the functions are undocumented by MSDN. They both rely upon the dll being visible in the PATH environment variable, so that LoadLibrary can find it. The only use of these functions is found in the SDK sample. This article will not mention these issues again and will assume the CoCreateInstance loading option in all discussions that follow.

The joy of COM

Two strange worlds have collided in the DIA SDK, those of debugger writers and COM. I have been an avid reader of John Robbins and Matt Pietrek for years now, and I get the impression that neither of them is a great COM aficionado! The use of COM in the DIA SDK is limited to interface definitions and reference counting, so the use of the Active Template Library to manage object lifetimes and resources seems to make a lot of sense. By simply including atlbase.h in the precompiled header, all DIA interfaces can be wrapped in the CComPtr smart pointer template, avoiding all need to remember to call Release on a pointer.

The only other technique required is to always check the HRESULT of every call made to a COM interface. The simplest way to achieve this is by use of a simple inline function:

inline void checkResult(HRESULT hr)
{
	if(FAILED(hr))
		throw hr;
}
All calls to any COM function are wrapped in a call to checkResult. Thus, exceptions are thrown whenever an invariant has been broken, and learning about a new interface becomes easier.

COM interfaces are built upon the Unicode standard. Therefore, the DIA SDK passes all its strings as wide character strings, and our sample application declares a wmain entry point so that all command line parameters are passed in as Unicode. The sample for this article makes use of CComBSTR to manage COM strings. However, I find that it is best to convert these to C++ STL wide strings (std::wstring) as soon as possible.

Opening a program database

Our command line utility takes two parameters, the name of a PDB file and the name of the class that we want to reverse engineer. The class name must be presented exactly as it was compiled as part of generating the PDB. Having successfully loaded the DIA dll, the next stage in any use of the SDK is to load a PDB file. Once the PDB is open, a session must be opened to provide a query context for debug symbols. If all this goes well, we are ready to get the global scope. This gives us our first instance of the most important interface in the DIA SDK: IDiaSymbol.
std::wstring pdbName = wargv[1];
pIDiaDataSource->loadDataFromPdb(
	pdbName.c_str());
CComPtr<IDiaSession> pIDiaSession;
pIDiaDataSource->openSession(
	&pIDiaSession );
CComPtr<IDiaSymbol> pGlobal;
pIDiaSession->get_globalScope(
	&pGlobal );

The IDiaSymbol Interface

In Object Oriented literature, IDiaSymbol is an example of a “fat interface”. The documentation of IDiaSymbol is intimidating, and it is not at all clear when its member functions are eligible to be called. On using it, it becomes clear that any one instance of the interface will return a failed HRESULT for most of its member functions, and this can easily frustrate the novice user.

If you imagine an inheritance hierarchy where all the derived class interfaces have had any new member functions rolled into their base class, then you are thinking of a fat interface. The problem is that any one instance of a derived class will not be able to implement the member functions specific to other derived classes. Most of the implementations for the functions will be defaulted to NOPs.

The design of the IDiaSymbol interface has put all the functions to do with any type of symbol in one place. Therefore, you must know what type of symbol you are dealing with to know which member functions of IDiaSymbol you can call on this instance. Once you have realised this, the documentation supplied in the MSDN Library starts to make a lot more sense. If you happen to call a member function that is invalid for that particular type of symbol, you tend to get an HRESULT of S_FALSE. Be aware that S_FALSE is technically a success code, so the FAILED macro does not flag it; the check has to treat anything other than S_OK as an error for an exception to be thrown at this point. With such a check in place, as soon as a mistake is made during the development and learning process, it is found and fixed straight away.
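The difference between a FAILED-style check and a strict check can be seen in a small sketch. This is not the sample's code: so that it compiles without the Windows headers, it uses stand-in definitions, and every name in the strict_check_sketch namespace (HResult, kOk, kFalse, kFail, failed, checkLenient, checkStrict, throwsFor) is hypothetical. In a real build, HRESULT, S_OK, S_FALSE, E_FAIL and FAILED come from the Windows headers.

```cpp
#include <cassert>
#include <cstdint>

namespace strict_check_sketch {

// Stand-ins for the Windows definitions; assumptions for illustration only.
using HResult = std::int32_t;
constexpr HResult kOk    = 0;                                   // plays the role of S_OK
constexpr HResult kFalse = 1;                                   // plays the role of S_FALSE
constexpr HResult kFail  = static_cast<HResult>(0x80004005u);   // plays the role of E_FAIL

// Same test the FAILED macro performs: failure codes are negative.
constexpr bool failed(HResult hr) { return hr < 0; }

// Lenient check: only genuine failure codes throw, so kFalse passes through.
inline void checkLenient(HResult hr) { if (failed(hr)) throw hr; }

// Strict check: anything other than kOk throws, including kFalse.
inline void checkStrict(HResult hr) { if (hr != kOk) throw hr; }

// Helper for experiments: reports whether a checker threw for a given code.
template <typename Check>
bool throwsFor(Check check, HResult hr)
{
    try { check(hr); } catch (HResult) { return true; }
    return false;
}

} // namespace strict_check_sketch
```

Calling throwsFor with checkLenient and kFalse reports that nothing was thrown, whereas checkStrict throws for the same code. The strict form is the one that surfaces a call to a property the symbol does not support.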

A key member function for working all this out is IDiaSymbol::get_symTag. This function always allows you to enquire of an IDiaSymbol instance what type of symbol it purports to be. The function returns a value from the SymTagEnum enumeration (declared in the cvconst.h header file, found in the DIA SDK include directory).

Types of symbols

The SymTagEnum enumeration has over thirty members, which have been split into two groups by the MSDN documentation: lexical symbols and class symbols. For our purposes, the symbol type of most interest is the user-defined type (UDT). Every class, structure, or union in C++ source code is represented as a UDT in the DIA SDK. Our command line utility is looking for C++ classes, so needs to query the SDK for a symbol type of SymTagUDT.

The query can be achieved using IDiaSymbol::findChildren. We can use the IDiaSymbol instance that has global scope that we acquired earlier for this. The findChildren function can be queried for many different data relationships, as it takes a member from the SymTagEnum enumeration as a parameter. As we develop our utility, we will find that it is essential to navigating the rich data source in a PDB file.

std::wstring className = wargv[2];
CComPtr<IDiaEnumSymbols>
	pIDiaEnumSymbols;
pGlobal->findChildren(SymTagUDT,
	className.c_str(), nsCaseSensitive,
	&pIDiaEnumSymbols);
The use of findChildren introduces another COM element to the story, the COM enumeration. When COM interfaces need to return a range of values in a result, a common idiom is to return an iterator style interface. In COM these all tend to have a similar set of functions, and would be represented in C++ as a template (for example think of the iterators exposed by the C++ STL). In addition, these iterators also manage the lifetime of the underlying data range, so that when we release the iterator, they also release the data values that were returned to us. As usual, as long as we use the ATL CComPtr template we do not have to concern ourselves with the workings of this.
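The shape of the enumeration idiom can be sketched without COM at all. The class below is a hypothetical stand-in, not a DIA interface: NameEnumerator, collect, kOk and kFalse are invented for illustration, and the items are plain strings rather than interface pointers. It mimics the convention that a Next call returns a success code other than S_OK once the range is exhausted, which is why the loop tests for the OK value explicitly rather than for failure.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace enum_sketch {

constexpr long kOk    = 0;  // stands in for S_OK
constexpr long kFalse = 1;  // stands in for S_FALSE

// Hypothetical stand-in for a COM enumerator: hands out one item per call.
class NameEnumerator {
public:
    explicit NameEnumerator(std::vector<std::wstring> names)
        : names_(std::move(names)), pos_(0) {}

    // Mimics the shape of a COM Next call: fills `out`, reports how many
    // items were fetched, and returns kFalse once the range is exhausted.
    long Next(std::wstring* out, unsigned long* fetched)
    {
        if (pos_ >= names_.size()) { *fetched = 0; return kFalse; }
        *out = names_[pos_++];
        *fetched = 1;
        return kOk;
    }

private:
    std::vector<std::wstring> names_;
    std::size_t pos_;
};

// Drains the enumerator. The end-of-range result is a success code, so the
// loop condition compares against kOk instead of testing for failure.
inline std::vector<std::wstring> collect(NameEnumerator& e)
{
    std::vector<std::wstring> result;
    std::wstring item;
    unsigned long fetched = 0;
    while (e.Next(&item, &fetched) == kOk && fetched == 1)
        result.push_back(item);
    return result;
}

} // namespace enum_sketch
```

One consequence for the error-checking style used in this article: a check that throws on anything other than S_OK should not be wrapped around the Next call itself, because reaching the end of the range is a normal outcome.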

Analysing the UDTs

Now we are in a position to walk the resultant list, extracting and processing each return value in turn. The question arises as to what to do with the IDiaSymbol instances when we extract them. The sample application has created a class (called UDT, see the UDT.h header file), which is constructed with a pointer to an IDiaSymbol instance. The constructor initially calls IDiaSymbol::get_symTag and verifies that the IDiaSymbol instance refers to a symbol of type SymTagUDT.

The second thing the constructor does is to get an ID for the IDiaSymbol instance. The DIA SDK supports numeric identifiers that can be used to represent an IDiaSymbol instance. The functions in the SDK often have two versions, one of which has ID appended to it. The ID version takes a numeric ordinal instead of the equivalent IDiaSymbol instance. The UDT class in the utility makes use of this feature, and so only stores the ID as a member variable. The UDT class also has a private function sym that can return an IDiaSymbol instance when required. This uses a global pointer to the IDiaSession that was acquired at the beginning of the program. It was found by experimentation during the development of the utility that this technique increases the speed of the system when responding to larger queries.

Finally, the constructor keeps pointers to all UDT instances in a global list that is available for debugging purposes.

The UDT class can report the name of the symbol. It also offers four main functions to get information about the type. It can supply base classes, data members, member functions and nested UDT types. This list is far from complete in terms of what data the PDB format could supply, but the intention here is to demonstrate the power of the DIA SDK, rather than aim for completeness.

Getting base classes

The UDT::getBases function can supply all the base classes for a type. It returns a complex structure defined as follows:
typedef std::vector< std::pair<
	UDT, std::pair< int, int > > >
	Bases;
The Bases type can represent the entire base class hierarchy for a type. The two integers in the inner pair represent depth and protection respectively (see printBaseList in DIA3.cpp for example usage).

UDT::getBases is built upon a call to IDiaSymbol::findChildren. This time the SymTagEnum has been set to SymTagBaseClass. As before this returns a pointer to IDiaEnumSymbols. The IDiaSymbol instances available through the COM enumeration represent the base class relationship. This allows the function to find out the protection relationship between the derived and the base class (i.e. whether it is using public, protected or private inheritance). The function could also determine whether virtual inheritance is in use at this point.

Having found the IDiaSymbol instance that represents the base class relationship, the function then calls IDiaSymbol::get_type to get a second IDiaSymbol instance that represents the base class type (UDT).

The COM enumeration returned by findChildren returned a list of IDiaSymbol instances that represented the relationship between derived and base classes. There is then a further indirection to get access to the underlying type of the base class. This is a common idiom in the DIA SDK. We are being exposed to the true complexity of the compiler writer’s life. However, this richness of data makes the DIA SDK potentially very powerful.

UDT::getBases goes on to build a UDT class instance for each element found, and then uses recursion to find any base classes of these UDTs. Hence, the function takes an integer parameter that represents the depth of inheritance (as depth of recursion). This allows the function to build up an entire inheritance hierarchy for a class, including the depth and protection, for every base type.
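The recursion with a depth parameter can be illustrated on a toy model. This sketch does not touch the DIA SDK and is not the sample's getBases or printBaseList: the Model map, the Protection enum and formatBases are hypothetical, and a class name stands in for the UDT instance. Real code would take protection values from the enumeration in cvconst.h rather than the stand-in values used here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

namespace bases_sketch {

// Stand-in protection values, chosen only for this illustration.
enum Protection { Private, Protected, Public };

// Toy model: class name -> direct bases with their inheritance protection.
typedef std::vector<std::pair<std::string, Protection> > DirectBases;
typedef std::map<std::string, DirectBases> Model;

// Same shape as the article's Bases typedef, with a string in place of UDT.
typedef std::vector<std::pair<std::string, std::pair<int, int> > > Bases;

// Depth-first walk: record each direct base at the current depth, then
// recurse into that base with depth + 1.
inline void getBases(const Model& model, const std::string& name,
                     int depth, Bases& out)
{
    const Model::const_iterator it = model.find(name);
    if (it == model.end()) return;  // no recorded bases for this class
    for (std::size_t i = 0; i < it->second.size(); ++i) {
        const std::pair<std::string, Protection>& base = it->second[i];
        out.push_back(std::make_pair(
            base.first, std::make_pair(depth, static_cast<int>(base.second))));
        getBases(model, base.first, depth + 1, out);
    }
}

// Renders one line per base, indented by two spaces per level of depth.
inline std::string formatBases(const Bases& bases)
{
    static const char* const names[] = { "private", "protected", "public" };
    std::string text;
    for (std::size_t i = 0; i < bases.size(); ++i) {
        text += std::string(static_cast<std::size_t>(2 * bases[i].second.first), ' ');
        text += names[bases[i].second.second];
        text += ' ';
        text += bases[i].first;
        text += '\n';
    }
    return text;
}

} // namespace bases_sketch
```

Because each entry carries its own depth, the flat vector is enough to reconstruct the tree when printing, which is presumably why the article's Bases type stores depth alongside protection.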

Getting data members

UDT::getDataMembers has a simpler implementation than the getBases function. This is because the function is returning the IDiaSymbol instances that represent the relationship between the class and the member type, rather than the member type itself. There are two reasons for this design decision. Firstly, this makes the function more flexible and hence more powerful. Secondly, member types are not always UDTs.

The getBases function can assume that all base classes are UDTs because the C++ language decrees this to be the case. However, member types may well be built-in C++ types such as ints or floats. In the DIA SDK, these are represented as usual by IDiaSymbols, with a symbol tag of SymTagBaseType. Member types can also be references or pointers, which are again represented as IDiaSymbols, this time with a symbol tag of SymTagPointerType.

The code in UDT::getDataMembers creates a DataMember instance (declared in the DataMember.h header file) for each data member found. Like the UDT constructor, the DataMember constructor checks that the symbol tag of the IDiaSymbol instance matches SymTagData. It then goes on to store the ID of the symbol and add itself to a global set of all DataMember instances useful for debugging.

The DataMember::type member function is worth some exploration. Having found the IDiaSymbol that represents the type of the data member, it calls expandType (found in useful.cpp). The expandType function returns a string that represents the C++ statements that could have generated the expression. expandType deals with CV qualification, pointer & reference versus value types, built in types, and the names of complex types such as structs or classes. This again shows the double indirection exhibited by the return values from the findChildren call. In this case, we queried the UDT for all its data members. We were returned a list of IDiaSymbols that represented the relationship between the UDT and the types it is built from. The relationship symbol allows us to find such things as the name of the variable and its protection.
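To make the idea of expanding a type into a string concrete, here is a minimal sketch over a toy type tree. It is not the expandType from the sample and it does not call the DIA SDK: TypeNode, Kind, named and indirect are hypothetical, and the four kinds only loosely model the base type, UDT and pointer symbols described above. The function walks down through pointers and references to the named type and adds the decorations on the way back out.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

namespace type_sketch {

// Toy stand-in for the handful of symbol kinds the article mentions.
enum class Kind { Base, Udt, Pointer, Reference };

struct TypeNode {
    Kind kind;
    std::string name;                        // used by Base and Udt
    bool isConst;                            // const qualification at this level
    std::shared_ptr<const TypeNode> target;  // used by Pointer and Reference
};

typedef std::shared_ptr<const TypeNode> TypePtr;

// Builds a named (built-in or user-defined) type.
inline TypePtr named(Kind kind, const std::string& name, bool isConst = false)
{
    return std::make_shared<TypeNode>(TypeNode{kind, name, isConst, nullptr});
}

// Builds a pointer or reference to an existing type.
inline TypePtr indirect(Kind kind, TypePtr target, bool isConst = false)
{
    return std::make_shared<TypeNode>(TypeNode{kind, "", isConst, std::move(target)});
}

// Recurses to the named type, then appends "*", "&" and const qualifiers.
inline std::string expandType(const TypePtr& type)
{
    switch (type->kind) {
    case Kind::Base:
    case Kind::Udt:
        return (type->isConst ? "const " : "") + type->name;
    case Kind::Pointer:
        return expandType(type->target) + "*" + (type->isConst ? " const" : "");
    case Kind::Reference:
        return expandType(type->target) + "&";
    }
    return "";
}

} // namespace type_sketch
```

A reference to a const UDT named derived expands to "const derived&", and a const pointer to int expands to "int* const". Arrays, function pointers and volatile are deliberately left out of this sketch.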

Getting functions

UDT::getFunctions takes the same approach as getDataMembers, in that it returns the IDiaSymbol instances that represent the relationship between the UDT and the member function. It is also built upon the results of calling findChildren, this time with the SymTagFunction tag.

Member functions have many aspects worth exploring. They have return types, calling conventions, parameter names and parameter types. They may be const, static, virtual, pure (abstract), operators, constructors or destructors. Interestingly they may also be compiler generated. All this information can be inferred from the PDB files generated by the Microsoft post-compiler tools, and is made available fully for the first time by the DIA SDK.

The Function class can expose all this information as well. Each IDiaSymbol instance returned from the findChildren call on the UDT represents the member function relationship, exposing the function name and protection. Calling get_type on this returns an IDiaSymbol instance that represents the function type. Calling get_type on that in turn returns an IDiaSymbol instance that represents the return type of the function. This time we have three levels of indirection to contend with!

To get the function parameters, we can call findChildren on the function type with the symbol tag SymTagFunctionArgType. From this call we can extract the parameter name, and use the expandType function to extract a string that can represent the parameter type.
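Once the pieces have been extracted, turning them back into a declaration is plain string assembly. The sketch below assumes the pieces have already been gathered into a hypothetical FunctionInfo structure; neither that structure nor the declaration function comes from the sample, and no DIA calls are involved. The ordering of the parts follows the style of the output shown later in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace function_sketch {

// Hypothetical container for facts already extracted about one function.
struct FunctionInfo {
    std::string name;
    std::string returnType;         // empty for constructors and destructors
    std::string callingConvention;  // empty when not reported
    std::vector<std::pair<std::string, std::string> > params;  // (type, name)
    bool isStatic = false;
    bool isVirtual = false;
    bool isConst = false;
    bool isPure = false;
};

// Assembles a C++ declaration string from the extracted pieces.
inline std::string declaration(const FunctionInfo& f)
{
    std::string text;
    if (f.isStatic)  text += "static ";
    if (f.isVirtual) text += "virtual ";
    if (!f.returnType.empty()) text += f.returnType + " ";
    if (!f.callingConvention.empty()) text += f.callingConvention + " ";
    text += f.name + "(";
    for (std::size_t i = 0; i < f.params.size(); ++i) {
        if (i != 0) text += ", ";
        text += f.params[i].first;
        // Parameter names are optional, as in any C++ declaration.
        if (!f.params[i].second.empty()) text += " " + f.params[i].second;
    }
    text += ")";
    if (f.isConst) text += " const";
    if (f.isPure)  text += " = 0";
    return text + ";";
}

} // namespace function_sketch
```

Keeping extraction and formatting separate in this way means the formatting can be exercised without a PDB file to hand.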

Surprisingly, when I first expanded the parameters to some STL classes, with all the syntactic obfuscation that Visual C++ offers with template parameters, all seemed to work as one would expect. All the template arguments are fully expanded, and all names are complete and correct.

Nested Types

Having got this far, I wanted to explore other relationships that might be stored in the PDB files. For example, was there a way to find all the derived classes that inherit from a UDT? Accordingly, I called findChildren with the SymTagUDT enum tag. This did return some UDTs, but these turned out to be the member types of the class!

On pausing for reflection, the PDB file cannot know all the derived types that inherit from a particular class. Debug information can only report what is available to the compiler from the site of the code during the parse. For example, the compiler has to have all the base classes available to know how to form the binary layout of a class. However, the compiler cannot predict what classes may derive from a particular class, only the linker could calculate this for a particular binary.

Running the utility

Listing One below shows some very odd classes that I have supplied to test the utility. The classes have all sorts of member variables, member types, member functions and base classes. If you build the sample app test (supplied in the zip file), and supply the PDB file generated to the DIA utility, output with remarkably similar content (shown in Listing Two) is written to stdout. I find it best to redirect this to a text file for analysis.
struct base {
	char char_by_value;
	virtual ~base(){}
private:
	int a;
};

class derived : public base {
	base base_by_value;
	derived* pointer_to_derived;
	int int_by_value;
	unsigned long u_long_by_value;
	static base* static_derived_pointer;
	class member{int not_a_lot;};

public:
	derived() {}
	derived(const derived& other) {}
	virtual ~derived() {}
	derived& operator = (const derived& other) {return *this;}
	static void static_func(){}
	virtual int const_func() const {return 0;}
	virtual int __cdecl t1(){return 0;}
	virtual int __fastcall t2(){return 0;}
	virtual int __stdcall t3(){return 0;}
};

base* derived::static_derived_pointer;

int main()
{
	derived a;
}
Note that in this listing, main had to instantiate the derived class; otherwise the PDB information did not reliably include the class definitions. This happened even in debug builds. I have yet to determine what would cause this omission. Maybe the compiler tools can infer that if a class is completely unused by a binary, there is no need to include the class information in the PDB. This also leads us to infer that the type information is laid out in the PDB file during the second phase of the compiler (code generation; the first phase is parsing). This is because when the compiler is parsing it cannot know what is to come, and must submit all type information into its abstract syntax tree. It is only during the code generation phase that it can omit types that have no references.

Listing Two below shows the results of running the utility on the test PDB file generated from the code in Listing One.

struct base
{
	char char_by_value;
private:
	int a;
public:
	virtual /*dtor*/ base::~base();
	/*ctor*/ base::base();
	// void __local_vftable_ctor_closure();
	// virtual void* __vecDelDtor(unsigned int);
};
class derived : public base
{
	class derived::member
	{
		int not_a_lot;
	};
	base base_by_value;
	derived* pointer_to_derived;
	int int_by_value;
	unsigned long u_long_by_value;
	static base* derived::static_derived_pointer;
public:
	/*ctor*/ derived::derived();
	virtual /*dtor*/ derived::~derived();
	static void __cdecl static_func();
	virtual int derived::const_func() const;
	virtual int __cdecl derived::t1();
	virtual int __fastcall derived::t2();
	virtual int __stdcall derived::t3();
	// void __local_vftable_ctor_closure();
	// virtual void* __vecDelDtor(unsigned int);
};
There are a few things worth discussing. Some of the member functions are qualified by their class name. This is probably due to the mangling algorithm that the compiler is using. The utility adds comments to member functions when it has inferred that they are special (such as constructors or destructors).

The utility reports all member functions in a class, even the ones that weren’t in the original C++. Any symbol names that start with a double underscore are reserved in C++. The utility prints them, but comments them out. Thus, the utility can show us what functions are being added by the compiler when we compile.
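The commenting-out rule is simple enough to sketch in a few lines. The function names here (startsWithDoubleUnderscore, memberLine) are hypothetical and are not taken from the sample; the sketch only shows the string test and the prefixing.

```cpp
#include <cassert>
#include <string>

namespace reserved_sketch {

// True when the name begins with two underscores.
inline bool startsWithDoubleUnderscore(const std::string& name)
{
    return name.compare(0, 2, "__") == 0;
}

// Prefixes the declaration with a comment marker when the function name
// looks compiler generated; otherwise returns the declaration unchanged.
inline std::string memberLine(const std::string& declaration,
                              const std::string& name)
{
    return (startsWithDoubleUnderscore(name) ? "// " : "") + declaration;
}

} // namespace reserved_sketch
```

The test is applied to the function name rather than to the whole declaration, so that a calling convention keyword such as __cdecl in the declaration text does not trigger it.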

The generated functions explain some of the switches the Microsoft compiler offers. For example __declspec(novtable) will prevent the compiler from generating __local_vftable_ctor_closure. This means that any constructor for the class does not need to put the vtable pointer behind the class’s this pointer. For a full explanation of this, see Paul DiLascia’s C++ Q&A Column (MSDN Magazine, March 2000).

It is also interesting that calling conventions can be applied to virtual functions, which are documented by MSDN as always having a special calling convention, where the ECX register is set up as the class’s this pointer during the call.

Astute readers will have already noticed that the derived class in Listing Two has some member functions missing. In order to get the copy constructor to appear in the PDB generated information, I had to use the copy constructor somewhere in the code. In the test app, add the following line to the end of the main function:

derived b = a;
//cause the copy constructor
//to be called
This causes the compiler to require the copy constructor for the class, and therefore emit the debug information in the PDB.

It is very interesting to edit this example code, by adding namespaces, virtual base classes, parameters to the function calls etc. and watch how the PDB files report all this detail. It has exposed a lot about the inner workings of the Microsoft compiler, and increased my admiration for all the clever folks at Redmond!

Unresolved issues

As this is a small sample utility, many areas could be expanded or improved. For example, I wanted to determine if a member function was a constructor. The only way that I could work out how to do this was to scan the function name string, stripping everything prior to any double colon and comparing it to the name of the UDT. As an idea this isn’t too bad, though we are veering perilously close to parsing by approaching the problem this way.

I discovered the flaw in this approach when I tried to expand some templated UDTs. These were STL classes, and so were in the std namespace. The strings reported by the SDK were fully qualified at all points, so every template typename was preceded by its namespace. Furthermore, static member functions seem to be qualified by the class name, so there were streams of double colons and angle brackets. The code as it stands can mark constructors and destructors for non-templated classes. The bodged string stripping code can be found in removeLeading in usefull.cpp. It could be made to work if one matched angle brackets before stripping the preceding text, but I have left this as an exercise for the reader.
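One possible approach to that exercise is sketched below. It is not the sample's removeLeading: unqualifiedName, isConstructor and isDestructor are hypothetical names, and the code works on plain narrow strings for brevity. It tracks the nesting depth of angle brackets and only treats a double colon as a qualifier separator when the depth is zero. A known limitation is that operator names containing angle brackets (such as operator< or operator->) would upset the depth count and need special handling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace strip_sketch {

// Returns the part of a qualified name that follows the last "::" found
// outside any angle brackets. Colons inside template arguments are ignored.
inline std::string unqualifiedName(const std::string& qualified)
{
    int depth = 0;          // current nesting depth of '<' ... '>'
    std::size_t start = 0;  // index just past the last top-level "::"
    for (std::size_t i = 0; i < qualified.size(); ++i) {
        const char c = qualified[i];
        if (c == '<') {
            ++depth;
        } else if (c == '>') {
            if (depth > 0) --depth;
        } else if (c == ':' && depth == 0 &&
                   i + 1 < qualified.size() && qualified[i + 1] == ':') {
            start = i + 2;
            ++i;  // skip the second colon of the pair
        }
    }
    return qualified.substr(start);
}

// A member function is taken to be a constructor when its unqualified name
// matches the unqualified name of its class.
inline bool isConstructor(const std::string& className,
                          const std::string& functionName)
{
    return unqualifiedName(functionName) == unqualifiedName(className);
}

// A destructor has the same name with a leading tilde.
inline bool isDestructor(const std::string& className,
                         const std::string& functionName)
{
    return unqualifiedName(functionName) == "~" + unqualifiedName(className);
}

} // namespace strip_sketch
```

This still compares strings rather than asking the SDK directly, so it inherits the fragility discussed above; it merely stops template arguments from being mistaken for qualifiers.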

I have also yet to discover how to extract the assignment operator in the test app. This may well be a special call. If anyone does find out how to get this, I would love to know!

Another thing I have yet to determine is how to find the protection of a member type. The tool does not expand member enumerations or constants. There is no extraction of namespaces, and I doubt I could extract an exception specification for a member function! So there is a huge amount of scope for expanding the project.

Conclusion

The DIA SDK has many features for interpreting memory locations in a running system; stack walking and parameter expansion are the type of features required to construct a commercial grade symbolic debugger. The DIA SDK is supported by Microsoft, deals with managed and unmanaged code and has the power to deliver the detail necessary to make some extremely powerful debugging applications.

This article has attempted to show that the static analysis of PDB files can extract some incredibly useful information. The utility provided is capable of reverse engineering class definitions for a number of complex C++ constructs. This information could be used to build some very interesting tools.


James Westland Cain is a Senior R&D Engineer at Quantel, where he designs and builds high-end editing systems for the TV and film markets. He was recently awarded a Ph.D. from Reading University for visualising large-scale C++ development using reverse engineering techniques. Find him at www.blunder1.demon.co.uk.
