From autonomous cars to hospitals: how robots can see what’s around them
Neo robot. Photo: handout/1X Technologies
For a long time, seeing seemed like an exclusively biological ability. Humans and animals observe the environment, recognize faces, avoid obstacles and make decisions in fractions of a second almost without realizing it.
Today, however, machines are also learning to do something similar. Thanks to advances in artificial intelligence and computer vision, robots are now able to interpret visual information in an increasingly sophisticated way.
Computer vision is the area of technology that allows computers and robots to interpret images and videos.
Instead of just recording what's in front, as a common camera does, these systems analyze visual content to identify people, objects, movements, distances and even behaviors.
Although it is still far from human perception, this technology has been transforming robots into machines capable of perceiving the environment, reacting to changes and making decisions in real time.
It is already present in autonomous cars, agricultural drones, security systems, environmental monitoring, hospitals and industrial production lines.
Vision starts with sensors
The process begins with cameras and sensors installed on the robot. These devices capture images of the environment in real time, working in a similar way to human eyes. Depending on the application, different types of sensors can be used.
Among the most common are traditional RGB cameras, which record colors like a conventional camera.
Infrared sensors, capable of detecting heat or operating in dark environments, are also widely used, in addition to thermal cameras, used to visualize temperature differences.
But seeing is not enough. The robot also needs to understand depth and spatial position, using depth sensors.
The simplest models, which estimate the distance to surrounding objects, are already widespread. They appear, for example, in domestic robot vacuum cleaners, which avoid furniture and stairs on their own.
The most advanced models use LiDAR systems, a technology based on laser beams that creates three-dimensional maps of the environment with greater precision.
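To make the laser-based measurement concrete: many LiDAR units time how long a laser pulse takes to bounce back from a surface. The sketch below shows only that basic arithmetic. The function name and the 100-nanosecond reading are illustrative assumptions, not taken from any specific device.

```python
# Speed of light in metres per second (exact, by definition of the metre).
SPEED_OF_LIGHT_M_S = 299_792_458.0


def tof_distance_m(round_trip_seconds: float) -> float:
    """Distance to a surface from the round-trip time of a laser pulse.

    The pulse travels to the surface and back, hence the division by two.
    """
    if round_trip_seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    return SPEED_OF_LIGHT_M_S * round_trip_seconds / 2.0


# Hypothetical reading: the echo arrives 100 nanoseconds after the pulse leaves.
distance = tof_distance_m(100e-9)
print(f"{distance:.2f} m")  # about 15 metres
```

Repeating this measurement in many directions per second is what produces the cloud of points from which a three-dimensional map is built.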
Another technique is stereo vision, which combines two cameras simultaneously to calculate depth in a way similar to human vision.
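The geometry behind stereo vision can be sketched in a few lines. For two identical, parallel cameras, the standard relation is depth = focal length × baseline / disparity, where the disparity is how far the same point shifts between the left and right images. The numbers below are hypothetical, and the sketch assumes the matching of points between the two images has already been done.

```python
def stereo_depth_m(focal_length_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by two parallel cameras.

    focal_length_px: focal length expressed in pixels
    baseline_m:      distance between the two cameras, in metres
    disparity_px:    horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means 'infinitely far')")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, cameras 12 cm apart.
near = stereo_depth_m(700, 0.12, 42)  # large shift, so the object is close
far = stereo_depth_m(700, 0.12, 7)    # small shift, so the object is far
print(f"near object: {near:.1f} m, far object: {far:.1f} m")
```

The same effect can be observed by alternately closing each eye: nearby objects appear to jump sideways more than distant ones.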
AI interpretation
After capturing the images, artificial intelligence comes into action. Algorithms process each camera frame looking for visual patterns.
Deep artificial neural networks, inspired by the human brain, are trained with millions of images.
Thus, they can recognize that certain combinations of shapes, colors and textures correspond to people, animals, cars, furniture, signs, tools, trees or roads.
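One way to picture "looking for visual patterns" is a small filter sliding across the image and responding strongly wherever the pattern appears. The sketch below uses a hand-written filter that reacts to a dark-to-bright vertical edge in a tiny grayscale image. This is a simplification: in a trained network the filter values are learned from data rather than written by hand, and many such filters are stacked in layers. The image and filter values are made up for illustration.

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the response map.

    This is the operation deep-learning libraries call "convolution"
    (strictly speaking, cross-correlation).
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    response = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        response.append(row)
    return response


# Tiny grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-written filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
response = slide_filter(image, kernel)
print(response)  # [[0, 18, 0], [0, 18, 0]]
```

The response is zero over the uniform regions and large exactly where the edge sits, which is the kind of signal later layers combine into shapes and objects.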
Besides locating the elements of a scene, the system also classifies what they represent. In many artificial intelligence videos, colored boxes appear around people and objects; these markings are generated automatically by the algorithms.
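Behind each colored box there is typically a small record: a label, a confidence score and the box coordinates. The sketch below assumes a hypothetical detector has already produced such records and shows only the last step, deciding which boxes are confident enough to draw. The `Detection` class, the 0.5 threshold and the sample values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # what the system thinks the object is
    confidence: float  # between 0 and 1
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels


def boxes_to_draw(detections, min_confidence=0.5):
    """Keep only confident detections, most confident first."""
    kept = [d for d in detections if d.confidence >= min_confidence]
    return sorted(kept, key=lambda d: d.confidence, reverse=True)


# Hypothetical output of a detector for one video frame.
detections = [
    Detection("person", 0.94, (40, 30, 120, 260)),
    Detection("dog", 0.31, (200, 180, 290, 260)),
    Detection("car", 0.78, (300, 90, 520, 240)),
]
for d in boxes_to_draw(detections):
    print(f"{d.label} {d.confidence:.0%} at {d.box}")
```

Here the low-confidence "dog" is dropped, so only the person and the car would receive a box on screen.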
Chinese robot runs 100 meters in 10 seconds and approaches Usain Bolt's record. Photo: handout/Unitree
It is worth distinguishing this type of AI from so-called LLMs (Large Language Models), such as ChatGPT, which are focused on processing and generating human language.
Both use deep neural networks, but with completely different data and objectives: while LLMs analyze text, computer vision specializes in interpreting pixels and shapes for navigation in physical space.
Many systems go beyond recognition and perform 3D reconstruction and mapping of the environment. Some robots are able to create complete maps of the places they pass through, in real time.
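To give a sense of what building a map involves, the sketch below fills in a small grid from distance readings. It is deliberately simplified and covers only the mapping half of the problem: it assumes the robot's position is known exactly and that it always looks to the right, whereas real systems must also estimate where the robot is while the map is being built. All names and readings are hypothetical.

```python
UNKNOWN, FREE, OCCUPIED = "?", ".", "#"


def integrate_reading(grid, row, col, range_cells):
    """Update the map with one distance reading taken while facing right.

    Assumes the robot's cell (row, col) is known exactly and that the sensor
    reports the obstacle distance as a whole number of cells (at least 1).
    """
    width = len(grid[row])
    grid[row][col] = FREE  # the robot is standing here, so the cell is free
    for step in range(1, range_cells):
        if col + step >= width:
            return  # the beam left the mapped area
        grid[row][col + step] = FREE  # the beam passed through this cell
    if col + range_cells < width:
        grid[row][col + range_cells] = OCCUPIED  # the beam stopped here


# Start with a completely unknown 3 x 6 map.
grid = [[UNKNOWN] * 6 for _ in range(3)]
# Hypothetical readings: (row, column, distance to obstacle in cells).
readings = [(0, 0, 4), (1, 0, 2), (2, 0, 5)]
for row, col, range_cells in readings:
    integrate_reading(grid, row, col, range_cells)

map_rows = ["".join(cells) for cells in grid]
for line in map_rows:
    print(line)
```

Each reading turns unknown cells ("?") into free space (".") or obstacles ("#"), so the map sharpens as the robot moves.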
This process is known as SLAM, an acronym for Simultaneous Localization and Mapping, one of the most important technologies in modern robotics.
Applications, advances and limitations
Despite impressive advances, robots still see the world very differently from humans.
We have an extraordinary capacity for contextual interpretation, something that artificial intelligence is still learning. A simple partially hidden object or an unexpected change in lighting can confuse automatic systems.
There is also a huge computational challenge: to see in real time, a robot needs to perform millions or even billions of calculations per second, which requires sophisticated sensors, optimized algorithms and powerful hardware.
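A rough back-of-the-envelope calculation gives a sense of the scale. The sketch below counts only the raw color values arriving from a single camera, before any analysis is done. The Full HD resolution and 30 frames per second are assumed, typical figures, not measurements from a particular robot.

```python
def pixel_values_per_second(width, height, channels, frames_per_second):
    """Number of raw values a camera delivers every second."""
    return width * height * channels * frames_per_second


# Assumed camera: Full HD (1920 x 1080), 3 color channels, 30 frames per second.
full_hd = pixel_values_per_second(1920, 1080, 3, 30)
print(f"{full_hd:,} color values per second")  # 186,624,000 color values per second
```

Since a neural network performs many calculations on each of these values, the total workload grows well beyond the size of the input itself.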
An important advance came from Graphics Processing Units (GPUs), processors specialized in graphics and originally created for video games, which can carry out many calculations in parallel.
Robot demonstration with the Isaac Gr00t N1 model, from Nvidia. Photo: handout/Nvidia
Another bottleneck is that labeling a large amount of data is often a costly and time-consuming process. Researchers are constantly looking for new approaches.
A recent publication by our team at PUC-Rio, in the Journal of Imaging Informatics in Medicine, proposes a methodology inspired by constructivist teaching to identify uncertain cases and efficiently trigger human interventions during training.
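The general idea of flagging uncertain cases for a human can be illustrated with a generic technique. The sketch below is not the methodology of the publication mentioned above; it is a common, simple approach assumed here for illustration, which measures how spread out a model's predicted probabilities are (their entropy) and asks for human review when the spread exceeds a threshold. The threshold and the sample predictions are hypothetical.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy of a probability distribution, in bits.

    Close to 0 when the model is sure of one class; larger when it hesitates.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def needs_human_review(probabilities, max_entropy_bits=0.8):
    """Flag a prediction as uncertain (illustrative threshold)."""
    return entropy_bits(probabilities) > max_entropy_bits


# Hypothetical model outputs over three possible classes.
confident = [0.97, 0.02, 0.01]
uncertain = [0.40, 0.35, 0.25]
print(needs_human_review(confident))  # False
print(needs_human_review(uncertain))  # True
```

Only the second image would be sent to a person for labeling, which is how such schemes aim to reduce the amount of manual annotation.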
In practice, the results are already remarkable. In autonomous vehicles, for example, computer vision works in extremely complex situations. It recognizes traffic signs, lanes, pedestrians and obstacles ahead.
These systems also need to detect weather conditions and the movement of other vehicles, all within a few milliseconds, while the car is moving.
In industry, robots equipped with computer vision already carry out quality inspections capable of identifying defects imperceptible to the human eye.
In hospitals, intelligent systems analyze medical exams for early signs of illness. In agriculture, drones monitor crops and detect failures, pests and irrigation problems.
The trend is for machines with artificial vision to be increasingly present in everyday life.
The ability to see has transformed robots from simple automated machines into systems capable of perceiving and interacting with the world around them. And this visual revolution is just beginning.
Alberto Barbosa Raposo receives funding from FAPERJ and CNPq.
Alexandre Soares does not consult, work with, own shares in or receive funding from any company or organization that could benefit from the publication of this article and has not disclosed any relevant links beyond his academic position.