Alibaba Cloud Launches Open-Source Models that Understand Images

  • Alibaba Cloud’s large vision language models can understand image and text input in English and Chinese
  • The move underscores Alibaba Cloud’s efforts to develop multi-modal capabilities for its LLMs

Alibaba Cloud said on Friday that it’s releasing two open-source large vision language models that understand images and text.

Qwen-VL, a pre-trained large vision language model, and its conversationally finetuned version, Qwen-VL-Chat, are available for download on Alibaba Cloud's AI model community ModelScope and the collaborative AI platform Hugging Face.

The two models can understand both image and text input in English and Chinese. They can perform visual tasks, such as answering open-ended questions based on multiple images and generating image captions. Qwen-VL-Chat can perform more sophisticated tasks, such as doing mathematical calculations and creating a story based on multiple images.
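For developers who want to experiment, the models can be loaded with the open-source transformers library. The snippet below is a minimal sketch based on the usage pattern published on the Qwen-VL-Chat model page; the repository name Qwen/Qwen-VL-Chat reflects that listing, the image file and question are placeholder examples, and the from_list_format and chat helpers are supplied by the model's own remote code and may change between releases.

```python
# Minimal sketch: ask Qwen-VL-Chat a question about a local image.
# Assumes the transformers library is installed and the Qwen/Qwen-VL-Chat
# checkpoint is available on Hugging Face; from_list_format and chat are
# helpers provided by the model's remote code (hence trust_remote_code=True).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a mixed image-and-text query; "street_sign.jpg" is a placeholder.
query = tokenizer.from_list_format([
    {"image": "street_sign.jpg"},
    {"text": "What does this sign say, and what does it mean in English?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```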

The two models are built on Qwen-7B, the 7-billion-parameter large language model that Alibaba Cloud open-sourced earlier this month. Alibaba Cloud said that, compared with other open-source large vision language models, Qwen-VL can comprehend images at higher resolution, leading to better image recognition and understanding performance.

The release underscores the cloud computing company's efforts to develop advanced multi-modal capabilities for its large language models, enabling them to process data types including images and audio alongside text. Incorporating other sensory input into large language models opens up possibilities for new applications for researchers and commercial organizations.

Impact Beyond the Lab

The two models promise to transform how users interact with visual content. For example, researchers and commercial organizations can explore practical uses, such as generating photo captions for news outlets or helping non-Chinese speakers who can't read street signs written in Chinese.

The Qwen-VL-Chat model can support multiple rounds of Q&A. Photo credit: Alibaba Cloud

With their visual question answering capabilities, the models also hold the potential to make shopping more accessible to blind and partially sighted users, an endeavor that Alibaba Group has already undertaken.

Alibaba's online marketplace Taobao added optical character recognition (OCR) technology to its pages to help visually impaired users read text, such as product specifications and descriptions embedded in images. The newly launched large vision language models can simplify this process by letting visually impaired users get the answers they need about an image through multi-round conversation, as the sketch below illustrates.
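As a rough illustration of that multi-round flow, the follow-up below extends the earlier loading sketch: passing the returned conversation history back into the chat method lets a user keep asking about the same image. This is an illustrative sketch of the published API, not Taobao's production integration, and the product-image question is a placeholder.

```python
# Continuing the earlier sketch: a follow-up question in the same
# conversation. Passing the returned history back in lets the model
# answer with the previous image and exchange still in context.
response, history = model.chat(
    tokenizer,
    query="Read out the product specifications shown in the image.",
    history=history,
)
print(response)
```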

Alibaba Cloud said its pre-trained 7-billion-parameter large language model Qwen-7B and its conversationally finetuned version, Qwen-7B-Chat, have garnered more than 400,000 downloads in the month since their launch. It previously made the two models available to help developers, researchers and commercial organizations build generative AI models more cost-effectively.
