The machine-learning programs that underpin their ability to “see” still have blind spots—but not for much longer
Kaia Glickman, Knowable Magazine
November 3, 2025 11:03 a.m.
Some computer vision programs have been thrown off by tricks such as manipulating the pixels in an image. MODIFIED FROM ISTOCK.COM / EYEEM MOBILE GMBH
Anyone with a computer has been asked to “select every image containing a traffic light” or “type the letters shown below” to prove that they are human. While these log-in hurdles—called reCAPTCHA tests—may prompt some head-scratching (does the corner of that red light count?), they reflect that vision is considered a clear metric for differentiating computers from humans. But computers are catching up.
The quest to create computers that can “see” has made huge progress in recent years. Fifteen years ago, computers could correctly identify what an image contains about 60 percent of the time. Now, it’s common to see success rates near 90 percent. But many computer systems still fail some of the simplest vision tests—thus reCAPTCHA’s continued usefulness.
Newer approaches aim to more closely resemble the human visual system by training computers to see images as they are—made up of actual objects—rather than as just a collection of pixels. These efforts are already yielding successes; for example, they’re being used to develop robots that can “see” and grab objects.
Better neural networks
Computer vision models employ what are called visual neural networks. These networks use interconnected units called artificial neurons that, akin to neurons in the brain, forge connections with each other as the system learns. Typically, these networks are trained on a set of images with descriptions, and eventually they can correctly guess what is in a new image they haven’t encountered before.
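For readers curious about the mechanics, here is a minimal sketch, in PyTorch, of that training process: a small network of artificial neurons is shown batches of labeled images, and its connection weights are nudged so its next guesses are a little less wrong. The dataset (CIFAR-10) and network size are illustrative assumptions, not a description of any specific system in this story.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CIFAR-10: a small standard set of labeled images (10 categories).
train_data = datasets.CIFAR10(
    root="data", train=True, download=True,
    transform=transforms.ToTensor(),
)
loader = DataLoader(train_data, batch_size=64, shuffle=True)

# A tiny convolutional network: layers of artificial "neurons" whose
# connection weights are adjusted as the system learns.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),  # one output score per category
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)  # how wrong were the guesses?
    loss.backward()                        # trace blame back through the connections
    optimizer.step()                       # strengthen or weaken connections
    break  # one step shown; real training loops over the data many times
```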
A major leap forward in this technology came in 2012 when, using a powerful version of what’s called a convolutional neural network, a model called AlexNet was able to correctly label images it hadn’t encountered before after teaching itself to recognize images on a training set. It won, by a large margin, the ImageNet Large Scale Visual Recognition Challenge, a contest that’s considered a benchmark for evaluating computer vision tasks. (AlexNet was developed by two students of computer scientist Geoffrey Hinton, the “Godfather of A.I.” who shared the Nobel Prize in physics in 2024.)
Despite this vastly improved performance, visual neural networks still make puzzling mistakes. In a classic example from 2017, a student-run A.I. research group at MIT tricked a neural network into labeling a picture of a cat as guacamole. An imperceptible amount of pixel “noise” added to the cat image was enough to throw the model off completely.
“I was shocked that this was so easy to do—to make the models think the wrong thing,” says computer scientist Andrew Ilyas, a member of that student team who will start a new position at Carnegie Mellon University in Pittsburgh in January.
 In a classic example of tripping up an image-recognition program, a team in 2017 introduced some imperceptible noise into an image of a cat. Google’s InceptionV3 image classifier then mislabeled the image as guacamole. A. ILYAS ET AL / PROCEEDINGS OF THE 35TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2018
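The MIT team used its own attack method, but the same idea can be illustrated with the classic fast gradient sign method (FGSM) of Goodfellow and colleagues: work out which direction of change in each pixel most increases the model’s error, then nudge every pixel an imperceptible amount in that direction. The pretrained classifier, the random stand-in image and the step size below are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# An off-the-shelf pretrained classifier, for illustration only.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def fgsm(image, true_label, epsilon=0.003):
    """Return an adversarial copy of `image` (a 1x3xHxW float tensor)."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # Nudge each pixel slightly (epsilon) in the direction of higher loss.
    noisy = image + epsilon * image.grad.sign()
    return noisy.clamp(0, 1).detach()

image = torch.rand(1, 3, 224, 224)  # stand-in for a real photograph
label = torch.tensor([281])         # ImageNet class 281: "tabby cat"
adversarial = fgsm(image, label)
print(model(adversarial).argmax(dim=1))  # top guess may no longer match
```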
Shifting every pixel in an image just slightly to the left or right can also confuse visual networks. Researchers did this with images of otters, airplanes and binoculars, and the models could no longer identify the images, even though they remained obvious to a person, computer scientists Yair Weiss and Aharon Azulay of the Hebrew University of Jerusalem reported in 2019.
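A hedged sketch of that shift test: roll an image one pixel sideways and check whether an off-the-shelf classifier’s top guess changes. The random tensor below stands in for a real photograph.

```python
import torch
from torchvision import models

# An off-the-shelf pretrained classifier, for illustration only.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def top_label(image):
    with torch.no_grad():
        return model(image).argmax(dim=1).item()

image = torch.rand(1, 3, 224, 224)             # stand-in for a real photo
shifted = torch.roll(image, shifts=1, dims=3)  # one pixel to the right

# A robust model would give the same answer; brittle ones often don't.
print(top_label(image), top_label(shifted))
```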
This susceptibility to minute changes stems from the compartmentalized way that visual neural networks learn. Instead of identifying a cat based on a true understanding of what a cat looks like, these approaches see a set of features that the network associates with “cat.” These features, however, are not inherent to the notion of “cat,” which Ilyas and his colleagues exploited in their often-cited guacamole example.
“Computers learn lazy shortcuts that are easily tampered with,” Ilyas says.
Today, convolutional neural networks are gradually being replaced by what are called vision transformers (ViTs). Typically trained on millions or even billions of images, ViTs divide images into groups of pixels called patches and cluster regions based on properties such as color and shape. These groupings are identified as physical features, such as a body part or a piece of furniture.
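The “patch” step that vision transformers begin with can be shown in a few lines: carve an image into fixed-size squares of pixels, then flatten each square into a vector the transformer can treat much like a word in a sentence. The sizes here (a 224-by-224 image, 16-by-16 patches) follow a common ViT configuration and are assumptions for the sketch.

```python
import torch

image = torch.rand(1, 3, 224, 224)  # batch of one RGB image
patch = 16

# (1, 3, 224, 224) -> (1, 196, 768): a 14x14 grid of 196 patches,
# each a flattened 16x16x3 block of pixels.
patches = (
    image.unfold(2, patch, patch)    # split the height into patch rows
         .unfold(3, patch, patch)    # split the width into patch columns
         .permute(0, 2, 3, 1, 4, 5)  # bring the patch grid dimensions forward
         .reshape(1, -1, 3 * patch * patch)
)
print(patches.shape)  # torch.Size([1, 196, 768])
```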
Vision transformers often perform better than previous approaches because they synthesize information from different areas of an image more efficiently, says machine learning researcher Alexey Dosovitskiy, who worked on ViTs at Google.
Blind spots in computer vision programs can be revealed via subtly altered images. The bottom row features four such “adversarial images” that are still recognizable to human eyes but tripped up the computer. A. ILYAS ET AL / PROCEEDINGS OF THE 35TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2018
Mimicking how the brain sees
Some researchers are now combining elements of various visual neural networks to enable the computers to think more like humans.
Object-centric neural networks aim to do just that. They evaluate images as compositions of objects rather than just grouping similar properties, such as “yellow.” These models’ image-recognition success comes from their ability to recognize an object as separate from its background.
In one recent example, researchers compared object-centric neural networks to other visual neural networks via a series of tests that required the computers to match identical shapes. All the models were trained on regular polygons and performed similarly on these kinds of shapes, but the object-centric models were much better at applying what they learned to irregular, colored and striped shapes.
The top object-centric model correctly matched the abnormal shapes 86.4 percent of the time, while the other visual model was successful only 65.1 percent of the time, as reported this year by Jeffrey Bowers, a psychologist who focuses on machine learning at the University of Bristol in England, and his colleague Guillermo Puebla, a psychologist at the University of Tarapacá in Chile.
Object-centric models’ success has expanded beyond two-dimensional images. Newer systems can watch videos and reason about what they saw, correctly answering questions such as “How good are this person’s badminton skills?”
Object-centric algorithms also have been incorporated into robots. Some of these can accurately grab and rotate objects in three dimensions, completing tasks such as opening drawers and turning faucets. One company is even building flying robots that use these types of visual recognition strategies to harvest apples, peaches and plums. These robots’ precise object detection abilities allow them to determine when fruit looks ripe and deftly swoop in between trees to pick the fruit without damaging its delicate skin.
Scientists expect even more progress in visual neural networks, but there’s a long way to go before they can compete with the brain’s capabilities.
“There are ways in which the human visual system does strange stuff,” Bowers says, “but never is a cat mistaken as guacamole.”
Knowable Magazine is an independent journalistic endeavor from Annual Reviews.