NExT-GPT has been developed by researchers at the National University of Singapore and Tsinghua University.
NExT-GPT can accept inputs in different modalities and deliver responses in the desired output modality.
This means you can provide a text prompt and have NExT-GPT turn it into a video, give it an image and have it converted into audio, or input text and receive the output as a combination of text, image, and audio. It allows ‘any-to-any’ combinations of input and output.
Some examples:
Example-1: Text → Text+Image+Audio
Human Input – I’ve recently started practicing mindfulness and meditation as part of my daily routine, but sometimes it’s challenging. Can you help me with some techniques?
NExT-GPT Output – Of course! Mindfulness and meditation can be a wonderful way to cultivate a sense of calm and focus. One technique you might find helpful is called “mindful breathing.” It involves taking slow, deep breaths and focusing your attention on the sensation of the breath entering and leaving your body. Another technique is “body scan meditation,” where you consciously scan your entire body from head to toe, paying attention to any sensations or discomfort. You can also try guided meditations that provide a structured practice. Would you like me to find a video that demonstrates mindful breathing or body scan meditation?
Human Input – I wish to listen to some music for the meditation first.
NExT-GPT Output – I guess you may want some music that sounds soft. [generated audio clip]
Human Input – Based on the audio content, visualize it in an image.
NExT-GPT Output – [generated image]
Human Input – Can you infer what time it is in the picture?
NExT-GPT Output – Based on the image, it is likely that the time is either early morning or late afternoon. The sunlight is either low or setting, creating a serene atmosphere.
Abstract
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%) in certain projection layers, which not only benefits low-cost training but also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
Technical Description
• Architecture
Figure 1: By connecting an LLM with multimodal adaptors and diffusion decoders, NExT-GPT achieves universal multimodal understanding and any-to-any modality input and output.
- Multimodal Encoding Stage. Existing well-established models are leveraged to encode inputs of various modalities. Here we take advantage of ImageBind, a unified high-performance encoder across six modalities. Then, via a linear projection layer, the different input representations are mapped into language-like representations that are comprehensible to the LLM.
- LLM Understanding and Reasoning Stage. An LLM is used as the core agent of NExT-GPT; technically, we employ Vicuna. The LLM takes as input the representations from the different modalities and carries out semantic understanding and reasoning over them. It outputs 1) textual responses directly, and 2) signal tokens for each modality that serve as instructions dictating whether the decoding layers should generate multimodal content and, if so, what content to produce.
- Multimodal Generation Stage. Receiving the multimodal signals with specific instructions from the LLM (if any), the Transformer-based output projection layers map the signal token representations into ones that are understandable to the following multimodal decoders. Technically, we employ current off-the-shelf latent-conditioned diffusion models for the different modalities, i.e., Stable Diffusion (SD) for image synthesis, Zeroscope for video synthesis, and AudioLDM for audio synthesis. (A minimal code sketch of these tiers follows below.)
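Below is a minimal PyTorch sketch of the two projection layers that sit on either side of the LLM in this three-tier layout. All dimensions, class names, and the two-layer Transformer depth are illustrative assumptions, not the released NExT-GPT implementation.

```python
# Illustrative sketch of NExT-GPT's input/output projections (assumed sizes).
import torch
import torch.nn as nn

IMAGEBIND_DIM = 1024   # assumed ImageBind embedding width
LLM_DIM = 4096         # assumed Vicuna hidden size
SD_COND_DIM = 768      # assumed Stable Diffusion text-condition width

class InputProjection(nn.Module):
    """Linear layer mapping ImageBind features into the LLM token space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMAGEBIND_DIM, LLM_DIM)

    def forward(self, modality_feats):       # (batch, num_tokens, IMAGEBIND_DIM)
        return self.proj(modality_feats)     # (batch, num_tokens, LLM_DIM)

class OutputProjection(nn.Module):
    """Transformer head mapping LLM signal-token states to a decoder's condition space."""
    def __init__(self, cond_dim=SD_COND_DIM, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_cond = nn.Linear(LLM_DIM, cond_dim)

    def forward(self, signal_token_states):  # (batch, num_signal_tokens, LLM_DIM)
        return self.to_cond(self.encoder(signal_token_states))
```

In this sketch, only these two small modules carry trainable parameters; the encoder, the LLM, and the diffusion decoders they connect would stay frozen, which is what keeps the tuned parameter count around 1%.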
• System Inference
In Figure 2 we further illustrate the inference procedure of NExT-GPT. Given user inputs in any combination of modalities, the corresponding modal encoders and projectors transform them into feature representations, which are passed to the LLM (text inputs are fed into the LLM directly). The LLM then decides what content to generate, i.e., textual tokens and modality signal tokens. If the LLM determines that content of a certain modality (other than language) should be produced, it outputs a special type of token indicating the activation of that modality; the absence of such a token means that modality stays deactivated. Technically, we design ‘<IMGi>’ (i=0,…,4) as image signal tokens, ‘<AUDi>’ (i=0,…,8) as audio signal tokens, and ‘<VIDi>’ (i=0,…,24) as video signal tokens. After the LLM, the text response is returned to the user, while the representations of the signal tokens of the activated modalities are passed to the corresponding diffusion decoders for content generation. (A code sketch of this routing follows Figure 2.)
Figure 2: NExT-GPT inference process. Grey indicates deactivated modules.
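A rough sketch of how this signal-token routing could look in code is given below, assuming the LLM's decoded text and the projected signal-token states are already available. The function names and the decoder interface are hypothetical, not the project's actual API.

```python
# Illustrative routing of LLM signal tokens to diffusion decoders (assumed interface).
import re

# Signal-token patterns follow the counts given above.
SIGNAL_PATTERNS = {
    "image": re.compile(r"<IMG\d+>"),   # <IMG0> ... <IMG4>
    "audio": re.compile(r"<AUD\d+>"),   # <AUD0> ... <AUD8>
    "video": re.compile(r"<VID\d+>"),   # <VID0> ... <VID24>
}

def route_output(llm_text, signal_states, decoders):
    """Return the user-facing text and trigger any activated decoders.

    llm_text      -- decoded token string emitted by the LLM
    signal_states -- dict: modality -> projected signal-token representations
    decoders      -- dict: modality -> callable diffusion decoder (assumed interface)
    """
    outputs = {"text": None, "image": None, "audio": None, "video": None}
    clean_text = llm_text
    for modality, pattern in SIGNAL_PATTERNS.items():
        if pattern.search(llm_text):                 # signal token found: modality activated
            outputs[modality] = decoders[modality](signal_states[modality])
        clean_text = pattern.sub("", clean_text)     # strip signal tokens from the reply
    outputs["text"] = clean_text.strip()
    return outputs
```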
• Lightweight Multimodal Alignment Learning
The system is designed as three loosely coupled tiers, and only the two projection layers, one on the encoding side and one on the decoding side, need to be updated.
- Encoding-side LLM-centric Multimodal Alignment. We align the incoming multimodal features with the text feature space, i.e., representations that are understandable to the core LLM.
- Decoding-side Instruction-following Alignment. We minimize the distance between the LLM’s modal signal token representations (after each Transformer-based projection layer) and the conditional text representations of the diffusion models. Since only the textual condition encoders are used (with the diffusion backbones frozen), the learning is based purely on caption texts, i.e., without any visual or audio inputs. (A loss sketch follows Figure 3.)
Figure 3: Illustration of the lightweight multimodal alignment learning of encoding and decoding.
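The decoding-side objective can be sketched as follows, assuming a frozen diffusion text encoder and mean-pooling over tokens; the pooling choice and all names here are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative decoding-side alignment loss: pull projected signal tokens toward
# the caption embedding from the (frozen) diffusion text encoder.
import torch
import torch.nn.functional as F

def decoding_side_alignment_loss(signal_token_states, caption, output_proj, cond_text_encoder):
    """MSE between projected LLM signal tokens and frozen condition-encoder caption features."""
    pred_cond = output_proj(signal_token_states)      # (batch, n_signal, cond_dim)
    with torch.no_grad():                             # diffusion text encoder stays frozen
        target_cond = cond_text_encoder(caption)      # (batch, n_text, cond_dim)
    # Pool over the token dimension so the two sequence lengths need not match.
    return F.mse_loss(pred_cond.mean(dim=1), target_cond.mean(dim=1))
```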
• Modality-switching Instruction Tuning (MosIT)
Further instruction tuning (IT) is necessary to enhance the capabilities and controllability of the LLM. To facilitate the development of an any-to-any MM-LLM, we propose a novel Modality-switching Instruction Tuning (MosIT). As illustrated in Figure 4, when an IT dialogue sample is fed into the system, the LLM reconstructs and generates the textual content of the input (and represents the multimodal content with multimodal signal tokens). The optimization is imposed based on the gold annotations and the LLM’s outputs. In addition to tuning the LLM, we also fine-tune the decoding end of NExT-GPT: we align the modal signal token representations encoded by the output projection with the gold multimodal caption representations encoded by the diffusion condition encoder. Thereby, the comprehensive tuning process brings the system closer to the goal of faithful and effective interaction with users. (A sketch of this combined objective follows Figure 4.)
Figure 4: Illustration of modality-switching instruction tuning.
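A rough sketch of how the MosIT objective could combine the causal-LM loss over text and signal tokens with the decoding-side alignment term is shown below. It assumes a Hugging Face-style causal LM interface, a single-example batch, and reuses decoding_side_alignment_loss from the earlier sketch; the 0.5 weight is an arbitrary illustrative choice.

```python
# Illustrative MosIT training step (assumed interfaces and weighting).
def mosit_step(llm, batch, output_proj, cond_text_encoder, align_weight=0.5):
    # Standard causal-LM loss; gold targets contain both words and signal tokens.
    lm_out = llm(input_ids=batch["input_ids"], labels=batch["labels"],
                 output_hidden_states=True)

    # Gather hidden states at the positions of the emitted signal tokens
    # (single example for simplicity; mask shape matches the input sequence).
    signal_states = lm_out.hidden_states[-1][batch["signal_token_mask"]].unsqueeze(0)

    # Decoding-side alignment against the gold caption (see earlier sketch).
    align_loss = decoding_side_alignment_loss(
        signal_states, batch["gold_caption"], output_proj, cond_text_encoder
    )
    return lm_out.loss + align_weight * align_loss
```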
MosIT Data
All the existing IT datasets fail to meet the requirements of our any-to-any MM-LLM scenario. We thus construct a high-quality MosIT dataset. The data encompasses a wide range of multimodal inputs and outputs, offering the necessary complexity and variability to facilitate the training of MM-LLMs that can handle diverse user interactions and deliver the desired responses accurately.
Courtesy: https://next-gpt.github.io/