Google takes on GPT-4o with impressive Gemini voice and video prototype demo

Google Gemini AI app.

Yesterday, OpenAI launched a new multimodal GPT-4o model, conveniently timed right before Google's I/O 2024 event today.

GPT-4o looked incredibly impressive in demonstrations, with a more human-like ChatGPT voice and vision capabilities — but so does the new Gemini chatbot prototype Google just showed off in a tweet.

In Google's tweet, someone's pointing a phone at the I/O stage and chatting with Gemini. When asked what's happening in the camera view, Gemini responds, "It looks like people are setting up for a large event, perhaps a conference or presentation. Is there something particular that caught your eye?"

When prompted to analyze what the big I/O letters on the stage mean, Gemini says, "Those letters represent Google I/O." Should Gemini have recognized the I/O event after only one question? Perhaps, but more on that later.

When asked what it would be "really excited to hear" about at Google I/O, Gemini said it was "always excited to learn about new advancements in artificial intelligence and how they can help people in their daily lives."

Is Google's new Gemini prototype better than its predecessor? Absolutely. Gemini has a fluid, human-like speaking tone, with inflections where you'd expect for emphasis, and it was able to analyze the surroundings to recognize the I/O presentation setup.

But is it better than OpenAI's new GPT-4o chatbot? That's the burning question.

How does GPT-4o compare to Google's new Gemini prototype?

Because Google hasn't officially launched the new Gemini (the chat shown in the tweet was with a prototype model), it's too early to say definitively which model is better, but there are a few key differences between the two demos.

Both OpenAI's GPT-4o chatbot and Google's Gemini chatbot boast human-like speaking qualities, and they're both able to analyze visual input to answer a related question.

That said, I think Gemini's human-like voice sounds more natural. When listening to GPT-4o speak in the OpenAI live demo, there were a few times the chatbot sounded overly enthusiastic or stuttered slightly. Gemini, on the other hand, sounded fluid, though it clearly still has room for improvement, like saying "I'm" instead of "I am" to sound more conversational.

Another big difference is the ability to interrupt the chatbot while it's speaking. OpenAI designed the GPT-4o model to handle interruptions, so you don't have to wait through a long response before asking another question.

In Google's tweet, the speaker waited until the prototype Gemini model finished talking before responding with another question. To be fair, a listening-for-voice-input icon didn't pop up on the screen, so I don't know if the speaker could have interrupted Gemini and simply chose not to. But that seems like a feature you'd want to show off, no?

I'm also curious to know if GPT-4o could have described the Google I/O event after only one question. Gemini first recognized that the phone's camera was pointed at some type of event setup, and only mentioned the I/O event when asked about the I/O letters on stage.

After watching the GPT-4o live demo, I'm inclined to say OpenAI's chatbot would have said it was looking at the setup for the Google I/O 2024 event with only one question. Of course, that's pure speculation, but GPT-4o seems more well-rounded than Google's new Gemini.

My opinion may change after seeing more demos of Gemini over the course of Google I/O, but as of right now, GPT-4o seems like the chatbot to beat.

Check out our I/O live coverage throughout the day if you don't have time to catch the event.