OpenAI Aims to Release “Multi-Modal” Model Ahead of Google’s Gemini

OpenAI is reportedly working to release a next-generation multi-modal large language model (LLM), called Gobi, before Google launches Gemini. Gemini is Google’s upcoming multi-modal model and is expected to be unveiled this fall. In the meantime, OpenAI is looking to beat Google to market by incorporating similar multi-modal capabilities into GPT-4.
Multi-modal large language models (MLLMs) have gained significant attention recently, with ChatGPT showcasing exceptional abilities in various fields. These models use large language models as their “brains” to perform a range of multi-modal tasks. MLLMs have exhibited capabilities that traditional methods lack, such as generating stories based on images, answering questions about visual knowledge, and performing mathematical reasoning without the need for optical character recognition (OCR).
OpenAI already demonstrated these capabilities with GPT-4 in March but has opened access to only one company, Be My Eyes, which develops mobile applications for blind and visually impaired people. Six months later, OpenAI is preparing to launch a feature called GPT-Vision on a larger scale.
The delay in releasing this feature stems primarily from concerns about potential misuse by malicious actors. OpenAI’s engineers have been working to address legal concerns surrounding the new technology.
Google faces the same issue. When asked about measures taken to prevent misuse of Gemini, a Google spokesperson stated that the company made a series of commitments in July to ensure responsible development of all its products.
However, considering Google’s proprietary data related to text, images, videos, and audio (including data from platforms like search and YouTube), the development of multi-modal models may play to Google’s advantage. Early users of Gemini have reported fewer incorrect answers compared to existing models.
Sam Altman, CEO of OpenAI, hinted in recent interviews that GPT-5 has not yet emerged, but that the company plans to enhance GPT-4 with various improvements, and the new multi-modal capabilities may be among them.
It is still too early to say if Gobi will eventually become GPT-5 as OpenAI does not appear to have started training it yet.
This competition can be seen as the AI version of iPhone versus Android. People are eagerly awaiting the arrival of Gemini, which will reveal the extent of the gap between Google and OpenAI.
Source: Wallstreetcn (华尔街见闻)
Definitions:
1. Large language model (LLM): A large language model is a type of artificial intelligence model that can generate human-like text based on given prompts or queries.
2. Multi-modal model: A multi-modal model is an artificial intelligence model that can understand and generate text, images, videos, and other forms of information.
3. Optical character recognition (OCR): Optical character recognition is the technology that recognizes and extracts text from images or documents, converting it into machine-readable text.
