Google Takes the Next Step in Multimodal Search

Among the evolutions in search – pre-dating the recent gen-AI convergence – voice and visual search have followed similar paths as alternative inputs. We all know what voice search is, but visual search – for those unfamiliar – means using your camera to identify and contextualize things, a la Google Lens.

The idea in both cases is to accommodate several modalities that can be situationally and contextually relevant. Think: voice search while driving, and visual search to identify a style item you encounter in the real world, using your camera instead of text. The latter is fueled by the camera-native Generation Z.

More recently, Google has begun to combine these two inputs for a potential peanut butter & chocolate moment. This was first seen in multimodal search, unveiled at Google I/O in 2022. In short, it lets you perform visual searches, then refine the results using text or voice (think: “the same jacket in blue”).

This concept took a step forward this week when Google made the multimodal query flow more natural. Specifically, a beta feature lets users long-press the shutter button while speaking, so Google processes an integrated – rather than sequential – mix of visuals and voice to compute the best result.

For a use-case example, point Google Lens at a landscaping shrub in your neighborhood while long-pressing and saying “What kind of plant is this, and what local nurseries carry it?” Similarly, point it at a new restaurant in your neighborhood and say “Does this place require a reservation?” And so on.
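For readers curious what the shift from sequential to integrated input means under the hood, here’s a minimal sketch. To be clear, the MultimodalQuery type and search() stub below are hypothetical illustrations – not a real Google API – meant only to show the shape of the two flows.

    # A minimal, hypothetical sketch of sequential vs. integrated
    # multimodal queries. MultimodalQuery and search() are illustrative
    # stand-ins, not a real Google API.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MultimodalQuery:
        image: bytes                      # camera frame from the Lens viewfinder
        transcript: Optional[str] = None  # speech captured during the long-press

    def search(query: MultimodalQuery) -> list[str]:
        # Stand-in for a search backend; returns result titles.
        return []

    frame = b"<jpeg bytes>"  # placeholder camera capture

    # Sequential flow (2022-era multimodal search): a visual query first,
    # then a follow-up text refinement applied to those results.
    visual_results = search(MultimodalQuery(image=frame))
    refined = [r for r in visual_results if "blue" in r.lower()]

    # Integrated flow (the new beta): image and speech are captured together
    # and sent as one request, so ranking weighs both signals at once.
    integrated = search(
        MultimodalQuery(image=frame, transcript="what kind of plant is this?")
    )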

Follow the Money 

One question that flows from all the above – as we always ask in such situations – is: why? And the answer, as is often the case, is all about following the money. With more – and more varied – search inputs, Google is hoping to boost one of the key metrics at the heart of its revenue model: query volume.

There’s a counterpoint here: voice and visual searches don’t carry the ten-blue-links ad inventory of Google’s traditional SERPs. But as with its forays into AI (to which both voice and visual search are tightly related), Google has the opportunity to pursue quality over quantity in its monetization.

In other words, though there’s less ad inventory – one search result versus several – there’s an opportunity for sponsored results, when relevant, that carry higher premiums than a typical CPC. That premium could flow from the commercial intent of using Google Lens to identify a fashion item, as noted above.

The same goes for local storefronts, as in using Google Lens to get the restaurant details in the example above. We know from the mobile era that proximity correlates with higher intent in local searches. Consider the additional boost in value when a subject isn’t just in proximity but in view.

This all aligns with Yext CDO Christian Ward’s premise for Google’s monetization path in AI: though it cannibalizes the traditional search model, AI-driven dialogue with a user can infer deeper levels of intent and thus surface higher-value leads for businesses. That shift brings us from the construct of clicks to that of offers.

And that’s one way Google could get around the innovator’s dilemma it currently faces in AI. As always, this will be a moving target.
