The competition to build the next generation of wearable AI devices is heating up. This article summarizes the work being done in this field by tech giants such as Meta, Google, Microsoft, OpenAI, and Apple. It is sourced from The Information and was translated and compiled by Founder Park.
Table of Contents:
Google: Napoleon does not return to his Waterloo lightly.
OpenAI: Altman has long harbored hardware ambitions.
Microsoft: Small models pave the way, with AI software in development for HoloLens.
Apple: The hardware is ready, the models…
Meta: The young move fast.
Amazon: A new device supporting multimodal AI is on the way.
AI needs new hardware platforms, and wearable devices, smart glasses in particular, are where the major tech giants are pinning their hopes. Meta, Google, Microsoft, OpenAI, and other leaders in the AI field want to build their vision- and language-related AI technologies into smart glasses and other camera-equipped wearables.
Although wearable devices have been around for many years, breakthroughs in multimodal AI (the recognition of text, sound, images, tables, objects, and gestures) have rekindled these giants' confidence in the field.
One recent example: OpenAI is considering integrating GPT-4 Vision's object recognition capabilities into Snap's smart glasses.
Although it will take time to bring these technologies to wearable and mobile devices, this progress points toward future voice-activated AI assistants. They could change daily life in far-reaching ways, whether by helping students write papers, solving math problems, or providing information about the surrounding environment, from translating road signs to walking someone through a car repair. Such assistants would go well beyond what today's smartphones can do.
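To make the idea of camera-driven assistance more concrete, here is a minimal sketch of sending a single camera frame to a vision-capable model through the OpenAI Python client and asking it to describe what it sees. The model name, prompt, and image URL are illustrative assumptions, not details reported in the article.

```python
# Minimal sketch: asking a vision-capable model to describe a camera frame.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY
# environment variable; model name, prompt, and image URL are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # a vision-capable model; names change over time
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What object is in this photo, and what is it used for?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/camera-frame.jpg"}},
            ],
        }
    ],
    max_tokens=200,
)

print(response.choices[0].message.content)
```

A wearable assistant would wrap this kind of call in voice input and output and run it continuously, which is where the latency and power constraints discussed below come in.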
As Pablo Mendes, CEO of Objective and a former Apple engineer, put it, AI models will become an indispensable part of our lives, appearing not only in our computers and phones but in many more devices. In his view, that day is not far off.
Smartphones are the immediate focus: Google, for one, is preparing to integrate small models directly into its phones. Other companies, however, are exploring how these technologies could power new kinds of devices. Meta recently demonstrated smart glasses developed with Ray-Ban that feature a demo version of a multimodal AI voice assistant.
The assistant can describe what the wearer sees and suggest which pants and shirts go together; it can also translate Spanish text in a publication into English, among other things.
Amazon's Alexa AI team is likewise discussing a new AI device with visual recognition capabilities.
Silicon Valley has long been obsessed with camera-equipped wearables. Google and Microsoft have spent years on AR headsets with mixed results, trying to overlay digital images on the headsets' optical see-through (OST) displays to guide wearers through specific tasks or surface information about the people and objects in their field of view. Limitations in optical technology, however, have kept this capability from going mainstream. Apple's forthcoming Vision Pro headset will include some AR features, but it may not ship with multimodal AI at first.
The emergence of large models changes everything. Thanks to multimodal large models, future AI will be able to "observe" what the wearer is doing through external cameras and analyze and comment on it. Many challenges remain, however, in shrinking LLMs so that they run efficiently and respond quickly on portable devices.
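A rough back-of-the-envelope calculation shows why the shrinking matters; the parameter counts and bit widths below are illustrative assumptions, not figures from any specific product.

```python
# Back-of-the-envelope memory math for holding an LLM's weights on a device.
# Parameter counts and bit widths are illustrative assumptions.
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to store the model weights, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("7B model", 7.0), ("3B model", 3.0), ("1B model", 1.0)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weights_gb(params, bits):.1f} GB of weights")

# A 7B model at 16-bit needs roughly 14 GB just for its weights, far more than
# a phone or a pair of glasses can spare, while a 1B model quantized to 4-bit
# fits in about 0.5 GB -- hence the push toward small and quantized models.
```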
Beyond that, society will have to come to terms with the privacy and ethical issues raised by cameras on wearable devices.
Here is some of the work the top giants and AI developers are doing to get such products into people's hands.
Google: Napoleon does not return to his Waterloo lightly.
Last week, Google's Gemini promotional video caused a sensation in the tech industry, showing off powerful multimodal capabilities such as recognizing a person imitating the iconic dodge from "The Matrix" and learning to play a map-based game.
Gemini Ultra, the most advanced model, has not been officially released. And although the model can in principle do what the video showed, its actual response times and the prompting it requires fall short of the real-time experience the video portrayed.
According to insiders familiar with Google's consumer hardware strategy, it may take several years to deliver that experience, because continuously perceiving the environment is computationally demanding and drains a great deal of power, and Google has already stumbled with high-end wearables (Google Glass).
As a starting point, Google is redesigning the Pixel phone's operating system to integrate smaller Gemini models. According to The Information's report on Thursday, these models will power Pixie, an AI assistant for complex multimodal tasks, such as directing users to the nearest store that sells a product they have photographed, with the aim of surpassing existing assistants like Siri.
For Google, an AI device that can learn and anticipate the information people need or want about the world around them is essential, because it would do physically what the company's core search technology already does digitally. Google made an early attempt with Google Glass, but the project failed a decade ago because of its clumsy design and limited practicality.
Google then turned its attention to camera-based processing and encouraged Android smartphone makers to treat the phone camera as "a third eye" that could scan the environment and upload images to Google's cloud for analysis, the intention being to give users more information about the objects in those images. The concept eventually took shape as the Google Lens application.
According to insiders, Google has recently slowed its development of similar eyewear but is still building software for such devices. These people said Google plans to license the software to hardware makers, much as it licenses Android to smartphone manufacturers.
OpenAI: Altman has long harbored hardware ambitions.
In March of this year, OpenAI, the Microsoft-backed startup, arguably kicked off the race for wearable AI devices when it demonstrated GPT-4 building a website from a handwritten sketch. Many OpenAI employees, including Andrej Karpathy, have compared language models to operating systems, since they can write and execute code, access the internet, and retrieve and reference files.
Since then, CEO Sam Altman has expressed interest in a new kind of AI consumer device to take advantage of these capabilities, and earlier this year he began discussing the possibility with former iPhone designer Jony Ive. Although OpenAI has no hardware team, it could partner with other companies, such as device makers like Snap or AI chip designers.
Altman has also invested in Humane, maker of the camera-equipped wearable AI Pin, a company that likewise hopes to build portable AI devices that can replace the smartphone.
Microsoft: Small models pave the way, with AI software in development for HoloLens.
Microsoft's researchers and product teams have recently made significant progress in multimodal AI, giving them more confidence to expand their own voice assistant and to develop small models suited to on-device deployment. According to patent applications and insiders, this technology could power lightweight, affordable smart glasses or other hardware. Just a few days ago, Microsoft released Phi-2, a small model that benchmarks better than Google's Gemini Nano.
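As a rough illustration of what running a small model locally looks like, the sketch below loads Phi-2 through the Hugging Face transformers library and generates a short completion. The half-precision setting and prompt are assumptions for illustration; this is not Microsoft's own deployment stack.

```python
# Rough sketch: running Microsoft's Phi-2 (~2.7B parameters) locally with
# Hugging Face transformers. Settings are illustrative, not Microsoft's
# own on-device stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision keeps the weights around 5-6 GB
    device_map="auto",           # place the model on GPU if one is available
    trust_remote_code=True,      # needed on older transformers versions
)

prompt = "In one sentence, why do small language models matter for wearable devices?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```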
Some of this work may build on Microsoft's HoloLens, an expensive and bulky mixed-reality headset aimed at commercial customers such as factories and militaries. Microsoft is currently developing AI software for HoloLens that lets users point the headset's front-facing camera at objects and chat about them with an OpenAI-powered chatbot that can recognize what it sees.
Apple: The hardware is ready, the models…
With Vision Pro about to launch, Apple has all the hardware it needs to ride the multimodal AI wave. Compared with its competitors, however, Apple has fallen behind in artificial intelligence: it only began seriously researching large language models this year, having merely dabbled before. There is currently no indication that Vision Pro will gain sophisticated object recognition or other multimodal capabilities in the near term. (At least for now, and unlike the iPhone, Vision Pro will not give developers access to raw camera data, owing to privacy concerns.)
Even so, Apple has spent years refining Vision Pro's computer vision, enabling the device to quickly map its surroundings, including identifying furniture and determining whether the wearer is sitting in the living room, kitchen, or bedroom. Apple is also researching multimodal models for image and video recognition.
Compared with other eyewear in development, however, Vision Pro is bulky and ill suited to outdoor wear. Apple reportedly paused development of its own AR glasses earlier this year to concentrate on shipping the headset, and it is unclear when that project will resume. Those glasses, though, are another device into which Apple could eventually build multimodal AI.
Meta: The young move fast.
Meta CTO Andrew Bosworth announced on Instagram this week that the company has begun testing multimodal features in the second generation of its Ray-Ban smart glasses, and some users will get early access to them. The glasses run on a new Qualcomm chip. Some Meta executives see the Ray-Ban smart glasses as forerunners of future AR glasses that will blend digital images with the wearer's view of the real world. The company plans to launch AR glasses within the next few years but has faced a string of challenges, including stagnating display technology and the lackluster market reception of the first-generation smart glasses.
As Tuesday's announcement showed, however, the arrival of multimodal AI seems to have reignited the enthusiasm of Bosworth and his team. They believe the glasses can deliver new surprises to consumers in the short term, with or without more advanced display technology.
Amazon: A new device supporting multimodal AI is on the way.
According to insiders familiar with the project, engineers on the Alexa team used Amazon's semi-annual product planning process this summer to lay out plans for a new device capable of running multimodal AI. The team is currently working to reduce the compute and memory needed to process images, video, and voice with AI on the device.
It is not yet clear whether the project will be funded or what customer problems it is meant to solve, but it is distinct from Amazon's Echo line of voice assistant devices, which has been on the market for nearly a decade.
The Alexa team has been developing new devices for years, including the Echo Frames smart audio glasses. It is unclear, however, whether that product will feed into Amazon's work on devices with visual recognition, since it has neither a screen nor a camera.
Related Reports:
Google admits staging the "Gemini" demo: the video was edited, the responses were not real-time speech, and prompts were used
Google's new AI model "Gemini": What makes it powerful? iKala founder: ChatGPT can't beat the version built into Google's ecosystem
Crushing GPT-4! Google releases its killer natively multimodal "Gemini" model: AI comprehension surpasses humans for the first time, runs offline, and is available on the Pixel 8 Pro
Amazon releases "Amazon Q," a new AI chatbot: the assistant that knows enterprises best
Tags:
AI
Amazon
Apple
ChatGPT
Gemini
Google
HoloLens
Meta
Microsoft
OpenAI
Vision Pro
Wearable Devices