
VILA can reason among multiple images, support in-context learning, and understand videos


New Products | By Wisse Hettinga



Visual language models have evolved significantly in recent years. However, existing technology typically supports only a single image.

From the NVIDIA developer website: https://developer.nvidia.com

We developed VILA, a visual language model with a holistic pretraining, instruction tuning, and deployment pipeline that helps our NVIDIA clients succeed in their multi-modal products. VILA achieves SOTA performance on both image QA and video QA benchmarks, with strong multi-image reasoning and in-context learning capabilities. It is also optimized for speed.

It uses 1/4 of the tokens compared to other VLMs and is quantized with 4-bit AWQ without losing accuracy. VILA comes in multiple sizes, ranging from 40B, which delivers the highest performance, to 3.5B, which can be deployed on edge devices such as NVIDIA Jetson Orin.
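The article does not spell out how AWQ works. As a rough, hypothetical illustration of the underlying idea of weight-only 4-bit quantization, the NumPy sketch below quantizes a weight matrix with one scale per group of input channels; the real AWQ method additionally rescales weights using activation statistics before quantizing, which is what preserves accuracy at 4 bits.

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 128):
    """Quantize a 2-D weight matrix to 4-bit integers with one scale per group.

    Simplified sketch of weight-only INT4 quantization; actual AWQ also
    searches for activation-aware per-channel scales before quantizing.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    grouped = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group so the largest magnitude maps onto the int4 range [-8, 7].
    scale = np.maximum(np.abs(grouped).max(axis=-1, keepdims=True) / 7.0, 1e-8)
    q = np.clip(np.round(grouped / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate float matrix from int4 codes and group scales."""
    w = q.astype(np.float32) * scale
    return w.reshape(w.shape[0], -1)

# Quick check on a random matrix: storage drops to ~4 bits per weight plus
# scales, while the reconstruction error stays small.
w = np.random.randn(256, 512).astype(np.float32)
q, scale = quantize_int4_groupwise(w)
print("mean abs error:", float(np.abs(w - dequantize(q, scale)).mean()))
```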

We designed an efficient training pipeline that trained VILA-13B on 128 NVIDIA A100 GPUs in only two days. In addition to this research prototype, we demonstrated that VILA is scalable with more data and GPU hours.
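To put the reported training run in concrete terms, the figures above imply the following compute budget:

```python
gpus = 128              # NVIDIA A100 GPUs, as stated in the article
days = 2                # reported training time for VILA-13B
gpu_hours = gpus * days * 24
print(gpu_hours)        # 6144 GPU-hours
```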

For inference efficiency, VILA is TRT-LLM compatible. We quantized VILA using 4-bit AWQ, which runs at 10 ms/token for VILA-14B on a single NVIDIA RTX 4090 GPU.
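For context, the quoted decode latency translates directly into generation throughput:

```python
ms_per_token = 10                        # reported latency on an RTX 4090
tokens_per_second = 1000 / ms_per_token
print(tokens_per_second)                 # 100.0 tokens/s
```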


