Design like Iron Man

What is the next step in natural user interfaces? And what do comic book heroes have to say about it?

Iron Man 2008

Iron-Man is one cool hero and Robert Downey Jr.'s 2008 portrayal made it even cooler than the paper concept.

But what did a geek notice while watching the movie and sighed wanting the same thing? No, it's not a supersonic costume with limitless energy, but the 3d direct-manipulation interface that RDJ used to build said costume.

Though present only for a very limited time on screen (after all, there are baddies to kill, damsels to save), it was an "oh wow" moment for me. Two-dimensional direct manipulation is already here, thanks to the likes of Surface, iPhone and iPad and multitouch technology. But what about accurate, responsive 3d manipulation?



A discussion I've had some time ago made me think again about the interface envisioned in Iron Man two years ago. It is, I believe, the holy grail in computer interfaces. Direct, accurate manipulation of digital matter in three dimensions. The software used in the movie was representative of an Autocad package, used in modeling houses, cars, furniture, electronics. Let's analyze the specifics in the video and see how further along we are in building this.
No interface elements. There aren't any buttons, sliders or window elements. The interface simply disappears and you are working directly on the "document" or "material". When designing any software product, the biggest credit an UI designer can get is that the user doesn't notice the interface and the he or she just gets things done.
Kinetic/physical properties. Movement of digital items has speed and acceleration (rotation of model in first video), mass, impact and reactivity (throwing items in recycle bin unbalances it for a moment). All these are simulated for one purpose: making the user believe he is manipulating the digital entities in the exact same way one would act with physical objects.

Advanced (3d) gestures for non-physical manipulation. There are, of course, a large number of actions a computer can perform which don't map to a physical event such as the extrusion of an engine.

Accurate 3D holographic projections. The digital elements take a physical shape and are projected in 3d space.

Accurate response to direct manipulation. There are no styluses, controllers, or other input devices used.

So how far along are we? If you think this is all silly cinema stuff, prepare to be pleasantly surprised.

Learn to g-speak

G-speak is a "spatial operating environment" developed by Oblong Industries. If the above demo didn't impress you, check out the introduction of g-speak at TED 2010 and some in the field videos as this digital pottery session at the Rhode Island School of Design. The g-speak environment speaks to me as Jeff Han's multitouch video did back in 2006. It took 4 years for that technology to appear in a general-purpose computing device as is the iPad. The g-speak folks hope to bring their technology to mass commercialization within 5 years.  That sounds pretty ambitious, even coming from the consultants of the 2002 Minority Report movie. Right?

Bad name, awesome tech: enter Kinect

The newly released Kinect is an addon for Microsoft's XBox gaming console. Whereas g-speak is still some years away from commercialization, you can get a taste of it right now with the Kinect.

Combining a company purchase (3DV), tech licensing (PrimeSense) and some magic Microsoft software dust, Kinect was born. Here are a few promotional videos, if you can stomach them. And here's Arstechnica's balanced review. According to PrimeSense,

The PrimeSensor™ Reference Design is an end-to-end solution that enables a computer to perceive the world in three-dimensions and to translate these perceptions into a synchronized depth image, in the same way that humans do.


The human brain has some marvelous capabilities for viewing objects in 3D. It is helped by an enormous parallelism capacity. It only needs the two inputs from our eyes to tell distance. Not disposing of the brain's firepower, Kinect uses a neat trick: projecting infrared rays into the room with a IR light source that are picked up by the second camera (a CMOS sensor).  Here's how the Kinect "sees" our 3D environment.

kinect-shell.jpg Wall-E, is that you? © iFixit

Besides the IR projector and IR Receiver, Kinect also comes equipped with a VGA camera and no less than 6 microphones.  Microsoft took the PrimeSense design and added the VGA camera for face recognition and help with the 3d tracking algorithm. The microphones are used for speech recognition; you can now yell at your gaming console and it might actually do something. All this in a 150$ dollars package that you can buy today.

From some reports I read on the net it appears Microsoft spent a lot of money in R&D on Kinect. The advertising campaign alone is estimated at something like 200 million $. It is only natural to assume that they have bigger plans for Kinect than just having it remain a gaming accessory.  I believe Microsoft is betting on Kinect to represent the next leap in natural user interaction. Steve Ballmer was recently asked what was the riskiest bet Microsoft was taking and replied with "Windows 8". The optimist in me says Windows 8 will be able to use Kinect and have a revised interface to suit 3d manipulation. The cynic in me tells me that he was talking about a new color scheme.

So what do we get? Almost no interface elements, kinetic/physical properties, advanced 3d gestures from the original list. They added some really cool stuff via software, such as, in a multiplayer game, if a person comes into a room and he/she has an XBox Live account, they are signed in automatically into the game, simply via face recognition. Natural language commands bring another source of input for a tiny machine that knows much more about it's surroundings than previous efforts.

What do we miss? Holographic projections, accurate response and, something missing also from Iron Man's laboratory, tactile feedback. The early reviews for Kinect all mention this in one way or another... Kinect's technology, when it works, is an amazing way to interface with a computer. When it breaks down, it reminds us that there is still a lot of ground to cover. Microsoft's push for profitability (understandable, remember this is a mass-consumer product) removed an image processor from the device. This means that it needs an external processing power. The computing power reserved for Kinect is at the moment up to 15% of XBox's capability. The small sized cameras and their proximity requires a distance of 2-3 meters from the device in order to operate it successfully. Because of the small amount of processing power reserved for it, Kinect's developers have supplied the software with a library of 200 poses which can be mapped to your body faster and easier than a full body scan. You cannot operate it sitting down; it's my opinion that this is a side-effect of the 200 pre-inputted poses. You can also notice in the g-speak video above that their system reacts to their tiniest change, even when moving just their fingers. How do they do that? By using 6 or more HD cameras (and tons of processing) per second. The 340p IR receiver and 640p video camera just doesn't cut it for such fine detections. This is , again, an understandable means of reducing the cost.

On the other hand, Microsoft made a great move by placing Kinect 1 next to a gaming platform. Games are by their nature experimental, innovative processes. This gives everyone huge amounts of freedom to experiment. Made a gestural based interface and no one likes it? You can scrap it and the next game will try something different. This will give Microsoft valuable data for improving Kinect and filtering out bad interaction paradigms.

Kinect has a chance to evolve and become the next natural way to interface with computers. With increases in processing power, accuracy will increase. If you want to play like Iron Man, you can do so now with Kinect.

In the next installment, I'll talk about the accuracy, feedback, Playstation Move and the (sorry) state of holographic projections.