A team from Google Research and DeepMind recently published new research proposing a system called "StreetViewAI". It aims to remove Street View's long-standing dependence on vision, which has limited its usefulness for visually impaired users, by letting them explore Google Street View's database of more than 220 billion images from more than 100 countries through AI dialogue.
Traditional Street View services are built around immersive 360-degree imagery. This gives sighted users an intuitive sense of the environment, but it is hard to use for visually impaired people who rely on hearing or assistive tools.
StreetViewAI is designed to change this. Using a multimodal model based on Google's Gemini 2.0 Flash, the research team built three subsystems: "AI Describer," "AI Chat Agent," and "AI Tour Guide."
The AI Describer converts objects, spatial relationships, and navigation cues in the image into concise spoken descriptions in real time. The AI Chat Agent lets users ask free-form questions such as "Is this sidewalk shaded?", "Is the cafe entrance wheelchair accessible?", and even "Are there any surprising attractions along this route?" The agent answers based on previously viewed perspectives and the context of the conversation.
The AI Tour Guide goes a step further, supplying guided-tour information on history, culture, and architecture to add depth to the exploration.
Summary of StreetViewAI functions:
| Subsystem | Main function | Usage scenarios / examples |
|---|---|---|
| AI Describer | Real-time spoken description of important objects, spatial relationships, and navigation cues in the image | Users hear information such as "There is a bus stop 10 meters ahead" and "There is a pedestrian crossing on the right" |
| AI Chat Agent | Natural conversational interaction that answers users' scenario-specific questions and preserves the conversation context | "Is this sidewalk shaded?", "Is the cafe entrance wheelchair accessible?", "Are there any surprising attractions along this route?" |
| AI Tour Guide | Supplementary guide information, including historical background, cultural significance, and architectural style | Describes the history or architectural features of a building while the user explores the streets of Paris |
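
To make the idea of "preserving the conversation context" more concrete, here is a minimal Python sketch of how a chat agent might keep track of previously visited views and earlier dialogue turns. This is not the research team's implementation: the class names `ViewSnapshot` and `ChatSession`, their fields, and the prompt layout are all hypothetical, and the call to the multimodal model is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class ViewSnapshot:
    """One Street View perspective the user has visited (illustrative fields)."""
    location: str
    heading_degrees: float
    description: str


@dataclass
class ChatSession:
    """Hypothetical session state: visited views plus dialogue turns."""
    views: list = field(default_factory=list)
    turns: list = field(default_factory=list)

    def record_view(self, view):
        # Remember each perspective so later questions can refer back to it.
        self.views.append(view)

    def record_answer(self, question, answer):
        # Store both sides of the exchange as (role, text) pairs.
        self.turns.append(("user", question))
        self.turns.append(("assistant", answer))

    def build_prompt(self, question, max_views=3):
        # Combine the most recent views, the dialogue so far, and the new question.
        lines = ["Previously seen views:"]
        for v in self.views[-max_views:]:
            lines.append(
                f"- {v.location} (heading {v.heading_degrees:.0f}): {v.description}"
            )
        lines.append("Conversation so far:")
        for role, text in self.turns:
            lines.append(f"{role}: {text}")
        lines.append(f"user: {question}")
        return "\n".join(lines)


session = ChatSession()
session.record_view(
    ViewSnapshot("Main St & 1st Ave", 90.0, "Cafe entrance with a ramp on the right")
)
session.record_answer(
    "Is the cafe entrance wheelchair accessible?", "Yes, there is a ramp."
)
print(session.build_prompt("Is this sidewalk shaded?"))
```

In a real system the resulting prompt would be sent to the model together with the current image. Keeping only the last few views is one simple way to bound the prompt size, and is a design choice of this sketch rather than something the article describes.
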
In user testing, the research team recruited 11 visually impaired participants who regularly use white canes and screen readers. Participants completed two tasks: destination search and free exploration. Across the sessions, they interacted with the AI Chat Agent 917 times, far more than the 136 interactions with the AI Describer, suggesting that conversational interaction better met their needs.
The AI answered 86.3% of questions correctly, with an incorrect-answer rate of only 3.9%. The most common question topics were spatial relationships (27%), confirming whether an object was present (26.5%), and descriptions of the current scene (18.4%).
Voice was the participants' primary mode of interaction, accounting for more than 90% of use. One tester noted that previous navigation systems often brought them only to within a few meters of a destination, whereas StreetViewAI led them to the door and also described its appearance and accessibility, giving more precise guidance.
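
For readers who want to put the reported figures side by side, the short Python sketch below works through the arithmetic. It uses only the numbers quoted in the article. The variable names are illustrative, and the article does not say how answers outside the "correct" and "incorrect" categories were classified, so the last value is simply the remainder.

```python
# Figures as reported in the article.
chat_interactions = 917
describer_interactions = 136
correct_pct = 86.3
incorrect_pct = 3.9

# How many times more often participants used the chat agent.
ratio = chat_interactions / describer_interactions

# Chat agent's share of these two interaction types combined.
chat_share = chat_interactions / (chat_interactions + describer_interactions)

# Share of answers that fall into neither reported category.
other_pct = round(100 - correct_pct - incorrect_pct, 1)

print(f"Chat vs. Describer ratio: {ratio:.1f}x")
print(f"Chat share of the two: {chat_share:.1%}")
print(f"Neither correct nor incorrect: {other_pct}%")
```

By this calculation, the chat agent was used roughly 6.7 times as often as the describer, and about 9.8% of answers fall outside the two reported accuracy categories.
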
This research highlights Google's ambitions in multimodal AI applications and demonstrates that AI is more than just a tool for entertainment or productivity; it can also serve as a vital bridge to improving the quality of life for vulnerable groups. With continued improvements to its accuracy and support, StreetViewAI may not only transform the digital experience for the visually impaired, but also expand into broader application scenarios such as education, tourism, and smart city navigation.



