By Sachin S Panicker
The world has been abuzz with conversations about Artificial Intelligence over the past several months. The current wave of excitement around AI was fueled by the release of ChatGPT, OpenAI’s revolutionary language model that generates authentic, human-sounding text in response to prompts and questions. Google subsequently launched its own AI-powered chat service, Bard, which gathers information from the Internet and generates human-like responses to inputs.
More recently, a new tool has entered the fray, one with the potential to chart a bold new course for AI development in the near future: ImageBind.
Meta’s new open-source AI model, ImageBind, seeks to mimic human perception through its capability to take one form of data and generate another. In simple terms, much like any of us could hear the sound of a car engine and picture a car, ImageBind can take the sound of a car engine as input and generate an image of a car!
ImageBind’s incredible capabilities do seem to indicate that we are getting significantly closer to the emergence of Artificial General Intelligence (AGI) – AI that replicates human cognition and can perform any task that humans can.
Diving Deep into the Implications of ImageBind
So, for the uninitiated, what does this exciting new development mean for the world of AI?
To begin with, ImageBind is a truly multi-modal model, unlike GPT-4, the model currently powering ChatGPT, which handles only two modalities (text and images). ImageBind currently supports six modalities: text, image/video, audio, depth (through 3D sensors), thermal (by detecting infrared radiation), and motion/position (through inertial measurement units), with scope for more to be added in the future. It achieves this through the novel approach of encoding all of these modalities in a single, shared representation space.
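To make this concrete, the snippet below is a minimal sketch of how that shared space can be queried, modelled on the usage example in Meta’s ImageBind repository (github.com/facebookresearch/ImageBind). The file paths are placeholders, and the exact module paths and function names may differ between versions of the code, so treat it as an illustration rather than a definitive recipe.

import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

# Placeholder sample files: replace with your own assets
text_list = ["a car", "a dog", "rainfall"]
image_paths = ["assets/car.jpg", "assets/dog.jpg", "assets/rain.jpg"]
audio_paths = ["assets/engine.wav", "assets/bark.wav", "assets/rain.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind (huge) model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Transform each modality and embed everything into the one shared space
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)

# Because all modalities live in one embedding space, cross-modal similarity
# is just a dot product, e.g. matching each audio clip to the text labels:
print(torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1))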
ImageBind employs diverse sets of image-paired data to create this unified representation space. It does not require all modalities to be present together in a single dataset. Instead, it exploits the binding property of images: aligning each modality’s embeddings with image embeddings produces an emergent alignment across all the modalities. It combines web-scale paired data, such as (image, text), with naturally occurring paired data, such as (video, audio) and (image, depth), to learn a single joint embedding space. This indirectly aligns text embeddings with modalities such as audio and depth, enabling zero-shot recognition on those modalities without any explicit semantic or textual pairing. Furthermore, ImageBind can be initialized from large-scale vision-language models such as CLIP, leveraging their rich image and text representations, which allows it to be applied across a broad range of modalities and tasks with minimal additional training.
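The alignment idea itself is easy to sketch. The toy code below is purely illustrative and is not ImageBind’s actual training code, but it shows the principle: an InfoNCE-style contrastive loss pulls each modality’s embeddings towards the image embeddings of its paired samples, so two modalities that were never paired with each other (say, audio and depth) still end up aligned through the image.

import torch
import torch.nn.functional as F

def align_to_images(modality_emb: torch.Tensor,
                    image_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Contrastive (InfoNCE-style) loss pulling a batch of modality embeddings
    towards the image embeddings of their paired samples (illustrative only)."""
    m = F.normalize(modality_emb, dim=-1)   # normalize so dot product = cosine similarity
    v = F.normalize(image_emb, dim=-1)
    logits = m @ v.T / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(m.size(0))       # the i-th image is the i-th positive
    # Symmetric loss: modality -> image and image -> modality
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage: audio is paired with images, depth is paired with images, but
# audio and depth are never paired with each other; both still align to the
# same image space, which is what yields the emergent cross-modal alignment.
audio_emb, depth_emb, image_emb = (torch.randn(8, 512) for _ in range(3))
loss = align_to_images(audio_emb, image_emb) + align_to_images(depth_emb, image_emb)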
The development also has implications for the leadership of the AI sector. Much of the recent discourse around AI has been dominated by ChatGPT and OpenAI. However, that momentum has slowed for several reasons: most users are still waitlisted for the GPT-4 API, image input is still not enabled in ChatGPT, plugins are still in beta testing, and the next major version is unlikely to launch before the end of 2023. Moreover, with ChatGPT’s trademark application still pending with the US Patent and Trademark Office, and several companies releasing imitators named [XYZ]GPT, the road ahead will likely not be smooth sailing for OpenAI. This opens up space for another player to seize the momentum, and with the release of ImageBind, Meta has shown itself to be a serious contender.
Experimenting with ImageBind
I recently decided to explore creating an entire video from scratch using AI. I first used Stable Diffusion to convert text to images, creating several frames with some back-and-forth while fine-tuning the prompt. I then put the frames together using the AI video generation tool Kaiber AI, added audio with the generative music app Mubert, and fed all of this output into the AI creative suite Runway to generate the video.
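For readers who want to try the text-to-image step themselves, here is a rough sketch using the Hugging Face diffusers library. The checkpoint name, prompt and seeds are illustrative placeholders rather than the exact ones from my experiment, and the sketch assumes a CUDA GPU; the Kaiber, Mubert and Runway steps were done in their respective apps rather than in code.

import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint (placeholder model id)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a red sports car driving through a neon-lit city at night"

# Generate a handful of candidate frames, varying the seed for each one
for seed in range(4):
    generator = torch.Generator("cuda").manual_seed(seed)
    frame = pipe(prompt, generator=generator).images[0]
    frame.save(f"frame_{seed:02d}.png")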
The entire exercise, spread across multiple tools, was intensely time-consuming and draining, demanding a significant amount of human effort and input.
This led me to experiment with ImageBind instead. I set up and ran the ImageBind code on an Apple M1 machine running macOS Ventura. The process was painstakingly slow, with the model weights alone coming to almost 5 GB.
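For anyone attempting the same on Apple Silicon, note that there is no CUDA support, so the model runs on the CPU unless PyTorch is pointed at its Metal (MPS) backend. A small device-selection sketch is below; whether every ImageBind operator runs on MPS is not something I can vouch for, hence the CPU fallback.

import torch

# Prefer Apple's Metal (MPS) backend when available, otherwise fall back to
# CUDA or CPU. Not every operator is guaranteed to be supported on MPS, so
# setting PYTORCH_ENABLE_MPS_FALLBACK=1 may still be necessary.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Running ImageBind on: {device}")
# model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)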
The documentation of the ImageBind code appears to have been written hastily, though I do not expect this to deter users from adopting this revolutionary new AI technology.
The Future is Bright for AI
Despite this initial setback with ImageBind, the immense capabilities of the new multi-modal model inspire me with confidence about the current direction of AI development. While Meta will no doubt further develop ImageBind, and OpenAI will continue to refine ChatGPT, the possibility remains that Google (through Bard or another tool) or some other player will introduce a disruptive innovation in the space. A brave new world lies ahead of us – one where the lines between AI and human capabilities will continue to blur, if not be erased!
The author is chief AI scientist, Fulcrum Digital
