It is no secret that we at Developer Fusion enjoy the Kinect, both for its gameplay and for its hacking potential. Over the weekend, Microsoft Research offered a fascinating insight into one of the software components built to leverage the Kinect hardware: the human body tracking component.
The algorithm they describe is implemented in software that runs on the Xbox itself. It is built into the Kinect drivers there, and its output is made available to games and developers (you can also see its results if you have a development Xbox and go into the Kinect console). It provides an extremely reliable map of 3D joint orientations and locations that fully describes a human's pose in Cartesian space, for at least one person in the Kinect's view. Developers can build gesture and action support on top of this map, letting gamers use the Kinect to interact with the environment.
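To make the idea of building gestures on top of a joint map concrete, here is a minimal sketch in Python. It is not the Kinect API: the `Joint` type, the joint names, the metre units, the "y points up" convention and the `is_hand_raised` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Joint:
    """A tracked joint position in Cartesian space (assumed: metres, y points up)."""
    x: float
    y: float
    z: float


# Hypothetical pose representation: joint name -> position.
Pose = Dict[str, Joint]


def is_hand_raised(pose: Pose, hand: str = "right_hand",
                   head: str = "head", margin: float = 0.10) -> bool:
    """Return True when the hand is more than `margin` metres above the head."""
    return pose[hand].y > pose[head].y + margin


# Made-up sample data standing in for one frame of tracking output.
example_pose: Pose = {
    "head": Joint(0.0, 1.60, 2.5),
    "right_hand": Joint(0.3, 1.85, 2.4),
    "left_hand": Joint(-0.3, 0.90, 2.5),
}

print(is_hand_raised(example_pose))                    # True
print(is_hand_raised(example_pose, hand="left_hand"))  # False
```

A real gesture would normally be checked over several consecutive frames rather than a single pose, so that one noisy frame does not trigger an action.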
The algorithm, as described in the paper to be published at an upcoming conference, uses what's called a decision forest (a collection of decision trees, of course!) trained on thousands of sample datasets over even more thousands of hours of cluster compute time. This approach brings several advantages: it does not require a calibration pose prior to use; it copes when the pose varies greatly over a short period of time; and it works efficiently with the type of visual data that the Kinect provides. Developers who have studied the paper describe it as a fairly standard and well-understood, but well-implemented, approach to solving the problem - the added bonus being that it works in the real world in all kinds of varied environments.
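For readers who have not met decision forests before, the following Python sketch shows the general technique: each tree routes an input down to a leaf by comparing features against thresholds, and the forest averages the leaf distributions from all of its trees. This is a toy illustration rather than the paper's implementation; the trees, features, thresholds and the "hand"/"head" labels below are invented, and a real forest would learn them from training data.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Leaf:
    # Probability of each label for inputs that reach this leaf.
    distribution: Dict[str, float]


@dataclass
class Split:
    feature: int      # index into the input's feature vector
    threshold: float  # go left when the feature value is below this
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def classify_tree(node: Node, features: List[float]) -> Dict[str, float]:
    """Walk one tree from the root to a leaf and return that leaf's distribution."""
    while isinstance(node, Split):
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.distribution


def classify_forest(trees: List[Node], features: List[float]) -> Dict[str, float]:
    """Average the per-tree distributions to get the forest's prediction."""
    totals: Dict[str, float] = {}
    for tree in trees:
        for label, p in classify_tree(tree, features).items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: total / len(trees) for label, total in totals.items()}


# Two hand-built toy trees (hypothetical values, not learned from data).
tree_a = Split(0, 0.5, Leaf({"hand": 0.8, "head": 0.2}), Leaf({"hand": 0.1, "head": 0.9}))
tree_b = Split(1, 1.0, Leaf({"hand": 0.6, "head": 0.4}), Leaf({"hand": 0.3, "head": 0.7}))

result = classify_forest([tree_a, tree_b], [0.2, 1.5])
print(max(result, key=result.get))  # hand
```

Averaging over many trees is what makes a forest more robust than a single tree: an individual tree can be wrong on a given input while the combined vote is still right.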
The algorithm is not only a very smooth implementation - it is also reliable and efficient. Because most of the Cartesian space (x, y, z) manipulations can easily be handed off to the GPU for quicker computation, 200 frames per second can be achieved on fairly standard hardware. Production games do not need that rate, as the Kinect hardware itself only runs at 30 frames per second, but the speed leaves plenty of time between frames for other processing.
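The headroom implied by those two figures can be worked out with a little arithmetic. The Python sketch below assumes that "200 frames per second" means roughly 5 ms of processing per frame with the hardware fully dedicated to tracking, which is a simplification; the variable names are illustrative.

```python
SENSOR_FPS = 30    # rate at which the Kinect hardware delivers frames
TRACKER_FPS = 200  # rate the tracking algorithm can reach

frame_interval_ms = 1000 / SENSOR_FPS  # time between sensor frames
tracking_cost_ms = 1000 / TRACKER_FPS  # time to track one frame (assumed)
headroom_ms = frame_interval_ms - tracking_cost_ms

print(f"{frame_interval_ms:.1f} ms between sensor frames")  # 33.3 ms
print(f"{tracking_cost_ms:.1f} ms spent on tracking")       # 5.0 ms
print(f"{headroom_ms:.1f} ms left over "
      f"({headroom_ms / frame_interval_ms:.0%} of each frame)")  # 28.3 ms (85%)
```

Under these assumptions, tracking uses about 15% of each frame interval and leaves the rest for the game itself.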