The Microsoft Research Cambridge team receiving their gold medal and MacRobert Award winners’ certificate from Sir Garth Morrison KT CBE in June 2011. From left to right: Professor Andrew Blake FREng FRS, Dr Jamie Shotton, Mat Cook, Dr Andrew Fitzgibbon and Toby Sharp
Late last year, Microsoft launched Kinect for Xbox 360, a new type of video game controller in which a sensor tracks the motion of the player’s body and maps body movements into actions in the game world. The team of engineers from Microsoft Research Cambridge who pioneered the machine-learning based component of the computer vision software for the Kinect went on to win the 2011 MacRobert Award. Ingenia asked two of these researchers, Dr Andrew Fitzgibbon and Dr Jamie Shotton, to explain how they helped to develop controller-free computing.
In November 2010, Microsoft unveiled Kinect for Xbox 360, a recently developed motion sensor that was said to bring games and entertainment to life in ‘extraordinary new ways’. Just two months later, eight million devices had been sold, making Kinect the fastest selling consumer electronics device in history.
The motion sensing input device, developed for the Xbox 360 video game console, is based around a webcam-like peripheral, but with a key difference: the camera outputs 3D rather than 2D images. It enables a user to control and interact with the Xbox 360 without the need of a game controller; instead, the game player simply moves or speaks. Films and music can be controlled with the wave of a hand or the sound of the voice while video gamers can control their in-game avatar by simply kicking or jumping, for example.
A key component of the system is the computer vision software which converts the raw images from the camera into a few dozen numbers representing the 3D location of the body joints. A core part of that software was developed at Microsoft Research Cambridge (MSRC) by Dr Jamie Shotton, Dr Andrew Fitzgibbon, Mat Cook, Toby Sharp, and Professor Andrew Blake FREng.
Looking for solutions
Computer vision has been a research topic since the mid-1960s. Legend has it that MIT’s Professor Seymour Papert set “the vision processor” as a summer project for an undergraduate, only to discover by autumn that the problem was harder than it seemed: hard enough, in fact, to become the subject of doctoral research.
Several hundred PhDs later, aspects of the problem still remain unsolved, but significant progress has been made and the fruits of researchers’ labours are now increasingly impacting on our lives. Today, vision algorithms are used daily by tens of millions of consumers.
Early commercial applications included industrial inspection and number-plate recognition. Computer vision was then snapped up by the entertainment industry for use in film special effects and computer games. In the early ’90s, British company Oxford Metrics developed a vision-based human motion capture system. This could recover 3D body positions from multiple cameras placed in a studio – provided the user wore a dark bodysuit with retro-reflective spherical markers. Following this, the race was on to develop motion capture systems that use only a single camera and have no need of special markers. One of the leading research groups was led by Blake, at Oxford University and later at Microsoft; in 2000 and 2001 it published some of the seminal papers on motion capture from a 2D image sequence.
Progress in image analysis from a single 2D camera surged in the late 1990s, when researchers began to devise algorithms that could recover 3D information from video sequences. This allowed the film industry to integrate real-world footage with computer-generated special effects.
For example, Boujou, an automatic camera tracker developed by Fitzgibbon and colleagues from the Oxford Metrics Group and Oxford University’s Visual Geometry Group, was designed to insert computer graphics into live-action footage in 3D. Used in myriad films and television shows, including Charlotte’s Web, Buffy the Vampire Slayer and the Harry Potter series, the program won an Emmy award in 2002.
But while 3D measurement was one focus of computer vision research, the new millennium saw many researchers tackle a seemingly harder problem: general object recognition. Could algorithms be devised to label every single object within a digital image (see figure 1)? To solve this problem, software models of each object category would have to be automatically derived from data, rather than by programmers.
The MSRC research group, then working with Dr Jamie Shotton, focused on high-speed machine learning algorithms such as ‘decision trees’ and ‘randomised forests’. These performed ‘object segmentation’, which involved computing a label for every pixel within an image, indicating the class of object to which that pixel belonged.
Software aside, Kinect for Xbox 360 was only possible with the development of real-time 3D cameras (see figures 2 and 3). Prior to Microsoft Research’s involvement in Kinect, several start-up companies had been building these cameras, some of which had attracted the attention of Alex Kipman, a visionary designer working in Xbox Incubation at Microsoft.
Kipman and his team had taken a 3D camera, and built a human body tracking system that, while functioning impressively well, had one fatal flaw. It could capture slow movements but would cease to work if the subject moved too fast.
This was because Kipman and his team’s algorithm was based on the assumption that, given the body position (the ‘pose’) in the previous video frame just a thirtieth of a second ago, one could find the position in the current frame by trying several “nearby” poses and evaluating which best matched the current frame. However, with fast motion, the new position was not necessarily close to the previous frame’s body position, and this could lead to a compounding error and long-term failure.
Computer vision researchers had been aware of this problem for some time, and avoided compounding errors by constructing a detection algorithm. This would analyse a single 3D image, take hundreds of thousands of raw depth measurements and convert these into a few dozen numbers that would represent the body pose.
The detection algorithm had drawn inspiration from a face detection algorithm proposed by Viola and Jones in 2001, which now underpins the face detection software widely used in digital cameras. That algorithm was developed using machine learning, and its success in finding a face in a general scene from a single image indicated that a single-image human pose recognition algorithm might be possible (see figure 3).
In 2008, the Xbox Incubation group approached the group at MSRC for help in developing the necessary software. A single-image human pose recognition algorithm still had not been written but, as researchers, the group understood that machine learning was key and huge amounts of training data would be necessary.
With this in mind, the Kinect team gathered many examples of human motion capture poses from actors wearing the motion capture suits, and wrote software to convert those poses to depth images. The same pose data were applied to a wide range of body shapes and sizes, so that the training set was more representative of the population.
At the same time, the Xbox team visited houses across the globe carrying a prototype camera and asked real families to act in front of the camera as if controlling a yet-to-be-invented computer game. The data were stored for later use as test data and also to ensure the algorithm generalised well rather than simply ‘remembering’ every training image and then failing to work on new examples.
But throughout this period of rapid data gathering, the problem of successfully capturing fast movements remained. The group had estimated that all combinations of poses, shapes, and sizes would yield approximately 10¹³ different images, far too many for ‘brute force’ analysis.
To reduce the complexity, they revisited earlier blue-skies research on object recognition. During this work, Shotton and co-workers had developed an algorithm that recognised natural object categories such as “cow”, “water”, and “grass”. From this a new algorithm was constructed to recognise body parts such as “head”, “left hand” and “right shoulder”.
Despite the complexity of the problem, the algorithm is remarkably simple (see figure 4). Every 33 milliseconds a new image is produced by the camera. At every pixel of this new image, the computer enacts a game of “20 questions”, with questions of the form “Is the point 30 cm to the northwest more than 12 cm further away than the point under this pixel?” Based on the answer to this question, another is posed, with a different offset (10 cm south, 17 cm west, as another example) and threshold distance. The pattern of yes/no answers to these questions provides a precise description of the shape of the object under that pixel, which can be converted to a list of probabilities of body parts. The final part of the puzzle is how the “questions” are chosen, which is where the large volume of training data the team had gathered came into play.
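The per-pixel game of questions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Kinect code: the focal length, the background depth, the hand-written two-leaf tree and the toy depth image are all hypothetical, and offsets are given as (rows, columns) in metres, with negative rows meaning “up the image”.

```python
FOCAL_PX = 500.0     # hypothetical focal length, in pixels
BACKGROUND_M = 10.0  # hypothetical depth used when the probe falls off the image


def question(depth, row, col, offset_m, threshold_m):
    """Is the probed point more than threshold_m further away than this pixel?"""
    here = depth[row][col]
    # A fixed metric offset covers fewer pixels the further away the surface is.
    r = row + int(round(FOCAL_PX * offset_m[0] / here))
    c = col + int(round(FOCAL_PX * offset_m[1] / here))
    inside = 0 <= r < len(depth) and 0 <= c < len(depth[0])
    there = depth[r][c] if inside else BACKGROUND_M
    return there - here > threshold_m


def classify_pixel(depth, row, col, node):
    """Follow yes/no answers down the tree until a leaf of probabilities is reached."""
    while "probs" not in node:
        yes = question(depth, row, col, node["offset_m"], node["threshold_m"])
        node = node["yes"] if yes else node["no"]
    return node["probs"]


# Hand-written toy tree with a single question (a real tree would be learned).
TOY_TREE = {
    "offset_m": (-0.30, 0.0),   # probe 30 cm above the pixel
    "threshold_m": 0.12,        # "more than 12 cm further away?"
    "yes": {"probs": {"head": 0.8, "torso": 0.2}},
    "no": {"probs": {"head": 0.1, "torso": 0.9}},
}


def toy_depth_image():
    """A 240x160 scene: wall at 4 m, a box-shaped 'person' at 2 m in the lower part."""
    depth = [[4.0] * 160 for _ in range(240)]
    for r in range(100, 240):
        for c in range(60, 100):
            depth[r][c] = 2.0
    return depth


depth = toy_depth_image()
near_top = classify_pixel(depth, 110, 80, TOY_TREE)   # probe lands on the wall
mid_body = classify_pixel(depth, 200, 80, TOY_TREE)   # probe lands on the body
```

In this toy scene a pixel near the top of the figure sees the distant wall 30 cm above it and is steered towards “head”, while a pixel lower down sees more body and is steered towards “torso”.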
Essentially the strategy is again simple: at every step a question is chosen which provides the most information given the answers to the questions previously asked. The concept of “information” used is that proposed by Claude Shannon in his 1948 paper A Mathematical Theory of Communication, and is based on the entropy of the distributions of body part probabilities. For early tests with just a few hundred training images, it was quite feasible to perform this process in a day on a single machine, but the team knew that the key to success would be to use hundreds of thousands of training images, which could take months – far too long for the schedule they had to work with.
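The selection rule can be illustrated with a short Python sketch. It assumes each training pixel carries a body-part label and that each candidate question has already been answered (yes or no) for every training pixel; the function names and the tiny data set are hypothetical, chosen only to show the entropy calculation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the body-part labels in a set of pixels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(labels, answers):
    """Reduction in entropy obtained by splitting the pixels on one question."""
    yes = [part for part, a in zip(labels, answers) if a]
    no = [part for part, a in zip(labels, answers) if not a]
    n = len(labels)
    remaining = len(yes) / n * entropy(yes) + len(no) / n * entropy(no)
    return entropy(labels) - remaining


def best_question(labels, answers_by_question):
    """Pick the candidate question whose answers are the most informative."""
    return max(
        answers_by_question,
        key=lambda q: information_gain(labels, answers_by_question[q]),
    )


# Toy data: four training pixels and two candidate questions.
labels = ["head", "head", "hand", "hand"]
candidates = {
    "q1": [True, True, False, False],   # separates the two parts perfectly
    "q2": [True, False, True, False],   # tells us nothing about the part
}
chosen = best_question(labels, candidates)
```

Here the first question removes all uncertainty about the body part (a gain of one bit), while the second leaves the two groups as mixed as before, so the first is chosen.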
So help was enlisted from colleagues at Microsoft Research in Silicon Valley who had been developing an engine for efficient and reliable distributed computation. Together, a distributed learning algorithm was built that divided up the millions of training images into smaller batches and processed each batch in parallel on a networked cluster of computers. Using about 100 powerful machines, the team were able to bring the training time down to under a day.
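The idea of splitting the work into batches can be sketched as follows. This is an assumption-laden illustration rather than the team’s system: a local thread pool stands in for the networked cluster, and each “training example” is reduced to a hypothetical pair of (answer to one question, body-part label). Each batch produces a small table of counts, and the tables are simply added together, which is why the batches can be processed independently.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def batch_histogram(batch):
    """Count how often each (answer, body part) pair occurs in one batch."""
    return Counter(batch)


def split(examples, batch_size):
    """Divide the training examples into consecutive batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def distributed_histogram(examples, batch_size=2, workers=4):
    """Process batches in parallel, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(batch_histogram, split(examples, batch_size)):
            total += partial
    return total


# Toy data: five labelled pixels, each with its answer to one candidate question.
examples = [
    (True, "head"), (True, "head"), (False, "hand"),
    (True, "hand"), (False, "hand"),
]
merged = distributed_histogram(examples)
```

The merged counts are identical to those obtained by processing every example on one machine, and they are all that is needed to compute the entropy of the “yes” and “no” groups for that question.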
All the pieces of the jigsaw had now been gathered and, working with the Xbox team, everything was pieced together. The part recognition algorithm could give fast and accurate proposals about the 3D locations of several body joints. These proposals could then be stitched together by the Xbox group’s tracking algorithm to ensure a smooth, seamless, multi-player experience (another story in itself, and a fantastic engineering effort on their part). Today, Kinect’s skeletal tracking, alongside technologies such as continuous voice recognition, gives designers the platform on which to build innovative games such as Kinect Sports and Dance Central.
Now that human motion capture is widely available in a video game system, where might we see ‘natural user interfaces’ next? While the keyboard will not be supplanted by touch-free interaction any time soon, video gaming is just the first of many possible applications.
The technology is already being used in the WorldWide Telescope, developed by Microsoft Research. The WWT software enables PCs to function as a virtual telescope, bringing together terabytes of imagery from ground- and space-based telescopes so that the universe can be explored over the Internet.
Other possible applications include medicine: surgeons could interact with 3D models of the body over a computer system, without touching anything, when planning surgery or even during operations. Kinect could also be useful to academic researchers who want to, say, explore 3D views of atomic structures as part of their scientific studies.
Microsoft has also recently created the Kinect for Windows Software Development Kit, so that users can develop their own applications. The package gives developers using Windows PCs access to many of the capabilities of the Kinect system, including human motion capture.
User interface technology is considered to have developed across three generations to date: first through keyboard and text-based interaction; second with the mouse and windows, menus and pointers; and third via multi-touch, including touch-screen displays. The team hopes that the natural user interface will herald a “fourth generation” of touch-free human-machine interaction.