Robot helpers: MIT researchers want robots to perceive physical environments like humans do

While purpose-specific robots like the Roomba have already made some chores easier, engineers at the Massachusetts Institute of Technology (MIT) are looking to build robots that can provide more general assistance, helping with household chores in response to high-level, Alexa-like commands. They are envisioning robots that can follow instructions like "Go to the kitchen and fetch me a cup of coffee". For that, the team of researchers believes, a robot would have to perceive its physical environment the way humans do.
In a statement, MIT Aeronautics and Astronautics assistant professor Luca Carlone said that in order to carry out any task or make any decision, a mental image of one's environment is necessary. He added that while this is effortless for humans, for robots it entails the painful problem of transforming the pixel values their cameras see into an understanding of what the world is like.
To solve this problem, Carlone, along with his students, has developed a representation of the physical environment for robots, modeled after the way humans perceive their surroundings and navigate around them.
The model, called 3D Dynamic Scene Graphs, is designed to let robots quickly build a 3D map of their surroundings. This map includes people, walls, rooms and objects, along with their semantic labels (a chair versus a table, for instance), as well as any other structures the robot is likely to see in its environment. With the help of this model, the robot can extract information from the map to work out where rooms and objects are, and to perceive the movement of people in its path.
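The kind of semantic map described above can be pictured as a collection of labelled entities that a robot can query. The sketch below is a loose illustration of that idea; the class names and fields are assumptions for demonstration, not the actual 3D Dynamic Scene Graphs data structures.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a semantic 3D map entry; the field names are
# illustrative assumptions, not the MIT team's actual representation.
@dataclass
class MapNode:
    label: str          # semantic class, e.g. "chair" or "table"
    position: tuple     # (x, y, z) centroid in the robot's map frame
    dynamic: bool = False  # True for moving entities such as people

@dataclass
class SceneMap:
    nodes: list = field(default_factory=list)

    def add(self, label, position, dynamic=False):
        self.nodes.append(MapNode(label, position, dynamic))

    def query(self, label):
        # Extract every node with a given semantic label,
        # e.g. all chairs, or all people currently in the map.
        return [n for n in self.nodes if n.label == label]

room = SceneMap()
room.add("chair", (1.0, 2.0, 0.0))
room.add("table", (1.5, 2.0, 0.0))
room.add("person", (3.0, 0.5, 0.0), dynamic=True)
print([n.label for n in room.query("person")])  # → ['person']
```

A planner could use such queries to locate a target object, or to check for moving people along a planned path.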
Carlone said that the compressed representation is important because it allows the robot to make quick decisions and also plan its path. It is not much different from what humans do, he added.
He further said that besides making great domestic helpers, robots running on such a model would also be suited to other high-level jobs, such as working alongside people on a factory floor or scouting a disaster site for survivors.
Robot helpers: Mapping undertaken so far
Until now, advances in robot vision and navigation have proceeded along two separate routes: 3D mapping, which lets robots reconstruct their surroundings in three dimensions in real time as they explore, and semantic segmentation, which lets them distinguish between classes of objects, such as a car versus a bicycle, but so far only in 2D images.
According to MIT, the model developed by Carlone and MIT graduate student and lead author of the study Antoni Rosinol is the first to generate the robot’s environment in real-time in 3D, while at the same time labelling objects, structures and people, both dynamic and stationary, within that 3D map itself.
How does the spatial recognition model work?
A major component of the new model is an open-source library called Kimera, which the team had developed earlier to build a 3D geometric model of an environment while simultaneously encoding the likelihood that an object is, say, a chair versus a table. Carlone said the team wanted Kimera to combine mapping and semantic understanding in 3D.
So how does Kimera work, you ask? Kimera runs entirely in real time, relying on streams of images captured by the robot's camera and inertial measurements from onboard sensors. Using this data, Kimera estimates the trajectory of the robot or the camera while simultaneously reconstructing the scene as a 3D mesh.
Kimera generates the semantic 3D mesh using an existing neural network that has been trained on millions of real-world images. The network predicts a label for each pixel, and Kimera then projects these labels into 3D using ray-casting, a technique commonly used in computer graphics.
As a result, the robot has a 3D map of its surroundings, where each face has been colour-coded as part of objects, structures, or people.
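In simplified form, this projection step amounts to letting each labelled pixel cast a vote for the mesh face its camera ray hits, then colouring each face by the winning label. The sketch below assumes the pixel-to-face ray intersections have already been computed; it illustrates the voting idea only, not Kimera's actual implementation.

```python
from collections import Counter

def label_faces(pixel_labels, pixel_to_face):
    """Assign each mesh face the majority label among the pixels whose
    camera rays hit it (a simplified stand-in for ray-casting)."""
    votes = {}  # face id -> Counter of semantic labels
    for pixel, face in pixel_to_face.items():
        votes.setdefault(face, Counter())[pixel_labels[pixel]] += 1
    return {face: counts.most_common(1)[0][0] for face, counts in votes.items()}

# Per-pixel predictions from the 2D segmentation network (illustrative).
labels = {(0, 0): "chair", (0, 1): "chair", (1, 0): "table", (1, 1): "chair"}
# Which mesh face each pixel's ray intersects (precomputed, illustrative).
hits = {(0, 0): 7, (0, 1): 7, (1, 0): 9, (1, 1): 7}
print(label_faces(labels, hits))  # → {7: 'chair', 9: 'table'}
```

Majority voting of this kind makes the face labels robust to occasional misclassified pixels from the 2D network.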
Can robots work on Kimera-generated 3D mesh alone?
According to MIT, if a robot were to rely on such a mesh alone to navigate its environment, the process would be both time consuming and computationally expensive. To solve this problem, Carlone and his team built on Kimera and developed an algorithm that converts Kimera's dense mesh into a dynamic 3D scene graph, a representation popular in computer graphics and typically used in video game engines to render 3D environments.
The algorithm breaks Kimera's mesh down into distinct 3D semantic layers, so that the robot can "see" the environment through a particular layer, or lens. The layers form a hierarchy, progressing from objects and people, to open spaces and structures such as walls and ceilings, to rooms, corridors and halls, and ultimately to entire buildings.
According to Carlone, with this layered representation, robots do not have to make sense of the millions of points and faces in the original mesh. Within these layers, the team has also developed algorithms that allow robots to track human shapes and movements in the environment in real time.
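The layering described above can be sketched as a parent-child hierarchy in which a query at one level of abstraction never has to touch the dense geometry below it. The layer names here follow the article, but the structure itself is an illustrative assumption, not the actual scene-graph implementation.

```python
# Hypothetical sketch of a layered 3D scene graph; layer names follow
# the article, but the structure is an illustrative assumption.
class SceneGraphNode:
    def __init__(self, name, layer):
        self.name, self.layer, self.children = name, layer, []

    def add(self, child):
        self.children.append(child)
        return child

    def at_layer(self, layer):
        # Collect all nodes at one level of abstraction, so a planner can
        # reason about, say, rooms without touching mesh-level geometry.
        found = [self] if self.layer == layer else []
        for child in self.children:
            found.extend(child.at_layer(layer))
        return found

building = SceneGraphNode("office", "building")
kitchen = building.add(SceneGraphNode("kitchen", "rooms"))
hallway = building.add(SceneGraphNode("hallway", "rooms"))
kitchen.add(SceneGraphNode("table", "objects_and_people"))
print([n.name for n in building.at_layer("rooms")])  # → ['kitchen', 'hallway']
```

A command like "go to the kitchen" could then be resolved at the rooms layer, descending to the objects layer only once the robot arrives.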
The new model was tested in a photo-realistic simulator that simulated a robot in a dynamic office environment which was filled with moving people.
Carlone said that his team was essentially giving robots mental models similar to those of humans, which would have an impact on applications ranging from self-driving cars and collaborative manufacturing to domestic robotics and search and rescue.