Microsoft Research introduces Visual ChatGPT: Interact with ChatGPT using Visual Foundation Models
Architecture of Visual ChatGPT

Overview
Researchers at Microsoft have released a new system called Visual ChatGPT that enables a conversational language model to handle complex visual tasks. The paper, titled "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models" and authored by Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan of Microsoft Research Asia, proposes combining ChatGPT with Visual Foundation Models (VFMs), which have shown great potential in computer vision.
The system employs a Prompt Manager that lets ChatGPT interact with a variety of VFMs (22 different models are incorporated) and supports iterative feedback. Through the Prompt Manager, Visual ChatGPT can use these VFMs to understand images and carry out complex, multi-step visual tasks, such as generating a red flower conditioned on a yellow flower image and its predicted depth, then transforming the result into a cartoon.
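To make the idea of chaining VFMs more concrete, here is a minimal Python sketch of how a manager might register tools and run them in sequence on the flower example. It is not the authors' implementation: the class `ToyPromptManager`, the tool names (`depth`, `depth2image`, `restyle`), and the file-naming scheme are all hypothetical, the "models" are stubs that only rename files, and the plan is hard-coded here, whereas in the real system ChatGPT decides which tools to call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Tool:
    """Stand-in for one Visual Foundation Model (hypothetical interface)."""
    name: str
    description: str
    run: Callable[[str, str], str]  # (image_path, text_argument) -> new image_path


class ToyPromptManager:
    """Illustrative only: keeps a tool registry and executes a plan step by step."""

    def __init__(self) -> None:
        self.tools: Dict[str, Tool] = {}
        self.history: List[str] = []

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def system_prompt(self) -> str:
        # Text that could be shown to the language model so it knows its options.
        lines = [f"> {t.name}: {t.description}" for t in self.tools.values()]
        return "You can use these tools:\n" + "\n".join(lines)

    def execute(self, image_path: str, plan: List[Tuple[str, str]]) -> str:
        # Each step consumes the previous step's output file (iterative chaining).
        current = image_path
        for tool_name, argument in plan:
            current = self.tools[tool_name].run(current, argument)
            self.history.append(f"{tool_name} -> {current}")
        return current


def fake_tool(suffix: str) -> Callable[[str, str], str]:
    """Build a stub 'model' that only derives a new file name (no real inference)."""
    def run(image_path: str, argument: str) -> str:
        stem = image_path.rsplit(".", 1)[0]
        return f"{stem}_{suffix}.png"
    return run


manager = ToyPromptManager()
manager.register(Tool("depth", "Predict a depth map from an image.", fake_tool("depth")))
manager.register(Tool("depth2image", "Generate an image from a depth map and a text prompt.", fake_tool("depth2image")))
manager.register(Tool("restyle", "Restyle an image following a text instruction.", fake_tool("restyle")))

# Hard-coded plan for the flower example; an assumption made for illustration.
plan = [("depth", ""), ("depth2image", "a red flower"), ("restyle", "make it a cartoon")]
result = manager.execute("flower.png", plan)
print(manager.system_prompt())
print(result)
```

Passing file names between steps, rather than image data, keeps every intermediate result addressable by name, so a later instruction can refer back to any earlier output recorded in `manager.history`.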
The paper also discusses related work, including efforts to combine language and vision, pre-trained models for vision-language (VL) tasks, and guidance of pre-trained large language models (LLMs) for VL tasks. Visual ChatGPT builds on this work and extends the potential of Chain-of-Thought (CoT) prompting to a wide range of tasks involving multiple modalities, including text-to-image generation, image-to-image translation, and image-to-text generation.