Qwen2-VL Finetune - Vision-Language Model Training

Features

Everything you need for VLM training

Tools for fine-tuning Qwen2-VL vision-language models with parameter-efficient adapters, preference and reinforcement-learning objectives, multi-image and video data, and multi-GPU training

LoRA & QLoRA

Memory-efficient fine-tuning with Low-Rank Adaptation (LoRA). The frozen base model can also be loaded in 4-bit or 8-bit precision (QLoRA) to cut memory further.
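Here is a rough picture of why LoRA saves memory. The standard LoRA formulation (from the original LoRA paper, not necessarily this repository's exact code) freezes a pretrained weight $W_0 \in \mathbb{R}^{d \times k}$ and learns only a low-rank update. The dimensions and rank in the last line are illustrative assumptions:

```latex
\begin{aligned}
W &= W_0 + \tfrac{\alpha}{r}\, B A, \qquad
  B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k) \\
\text{trainable parameters} &= r\,(d + k) \quad \text{instead of} \quad d\,k \\
d = k = 4096,\ r = 8 &:\quad 8 \cdot 8192 = 65{,}536 \ \text{vs.}\ 4096^2 = 16{,}777{,}216 \ (\approx 0.39\%)
\end{aligned}
```

QLoRA keeps the same update but stores the frozen $W_0$ in 4-bit precision. A 7B-parameter base then needs roughly 7 × 10^9 × 0.5 bytes ≈ 3.5 GB for its weights, ignoring quantization constants and any layers left unquantized.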

Direct Preference Optimization (DPO)

Align the model with human preferences by training on pairs of preferred and rejected responses, without a separate reward model.
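For reference, the loss below is the objective from the original DPO formulation. This repository's trainer may add options such as label smoothing or use a different β, so read it as a sketch of the technique, not of the code. The symbols are as follows:

- $x$ is the prompt (for a VLM, text plus images).
- $y_w$ and $y_l$ are the preferred and rejected responses.
- $\pi_{\text{ref}}$ is a frozen copy of the starting model.
- $\beta$ controls how far the trained policy $\pi_\theta$ may drift from $\pi_{\text{ref}}$.

```latex
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
-\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\right) \right]
```

Minimizing this loss raises the likelihood of $y_w$ relative to $y_l$, measured against the reference model, with no explicit reward model in the loop.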

Group Relative Policy Optimization (GRPO)

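A reinforcement-learning method that scores each sampled response against the other responses to the same prompt, so no separate value model is needed. Useful for improving reasoning and instruction-following.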
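In the GRPO formulation introduced with DeepSeekMath, the model works in three steps:

1. Sample a group of $G$ responses per prompt.
2. Score each response with a reward function.
3. Use the group's own statistics as the baseline in place of a learned value function.

With outcome rewards, the advantage assigned to response $i$ is shown below. The reward functions and group size this repository uses are not specified here, so the numbers that follow are illustrative.

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}, \qquad i = 1, \dots, G
```

For example, with $G = 4$ and rewards $(1, 0, 0, 1)$, the mean is 0.5 and the population standard deviation is 0.5. That gives advantages $(1, -1, -1, 1)$: responses better than their siblings are reinforced and worse ones are penalized.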

Multi-Image & Video

Train on multi-image sequences and video data with configurable frame rates and resolution.
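To see why frame rate and resolution matter, here is a rough visual-token count. It assumes the Qwen2-VL vision encoder described in the Qwen2-VL paper:

- 14×14-pixel patches are merged 2×2, so each visual token covers a 28×28-pixel region.
- Every two consecutive video frames are grouped into one temporal patch.

The resolutions and clip length are hypothetical.

```latex
\begin{aligned}
\text{image tokens} &\approx \tfrac{H}{28} \cdot \tfrac{W}{28},
  \qquad \text{video tokens} \approx \tfrac{T}{2} \cdot \tfrac{H}{28} \cdot \tfrac{W}{28} \\
448 \times 448 \text{ image} &: \ 16 \cdot 16 = 256 \\
10\,\text{s clip at } 2\,\text{fps}\ (T = 20),\ 448 \times 448 &: \ 10 \cdot 256 = 2{,}560
\end{aligned}
```

Under these assumptions, halving the frame rate roughly halves the visual token count. Halving each spatial side cuts it by about 4×. Attention cost grows with that count, so both settings directly affect memory and speed.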

DeepSpeed Integration

Distributed training with ZeRO-2/3 optimization for scaling across multiple GPUs.
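Here is a back-of-the-envelope check on what ZeRO-2 and ZeRO-3 save. The ZeRO paper counts about 16 bytes of model state per parameter for mixed-precision Adam:

- 2 bytes for fp16/bf16 weights.
- 2 bytes for gradients.
- 12 bytes for fp32 optimizer state (master weights, momentum, variance).

ZeRO-2 shards gradients and optimizer state across $N$ GPUs, and ZeRO-3 also shards the weights. The scenario below is hypothetical: a full fine-tune of a 7B-parameter model ($\Psi = 7 \times 10^9$) on $N = 8$ GPUs. It ignores activations and temporary buffers, with GB = 10^9 bytes.

```latex
\begin{aligned}
\text{no ZeRO (plain data parallel)} &: \ 16\Psi = 112\ \text{GB per GPU} \\
\text{ZeRO-2 (shard gradients + optimizer)} &: \ 2\Psi + \tfrac{14\Psi}{N} = 14 + 12.25 = 26.25\ \text{GB per GPU} \\
\text{ZeRO-3 (also shard weights)} &: \ \tfrac{16\Psi}{N} = 14\ \text{GB per GPU}
\end{aligned}
```

With LoRA, only the adapter parameters carry gradients and optimizer state, so these savings matter most for full fine-tuning.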

Liger Kernel

Optimized Triton kernels for faster training and reduced memory consumption.
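One concrete source of savings is Liger's fused linear cross-entropy. It computes the loss chunk by chunk instead of materializing the full logits tensor at once. The size of that tensor for a single sequence is estimated below under two assumptions:

- A vocabulary of roughly 152k tokens (152,064 here; check your model's `config.json`).
- A hypothetical 8,192-token sequence with logits stored in fp32.

```latex
\underbrace{8{,}192}_{\text{tokens}} \times \underbrace{152{,}064}_{\text{vocab}} \times \underbrace{4\ \text{B}}_{\text{fp32}}
= 4{,}982{,}833{,}152\ \text{B} \approx 5.0\ \text{GB}
```

The gradient with respect to the logits is the same size again. Avoiding both can therefore free several gigabytes per sequence. The actual saving depends on batch size, sequence length, and dtype.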

Demo

Interactive Gradio WebUI

Chat with your fine-tuned models through an intuitive web interface

Example chat (Qwen2-VL-7B-Instruct):

User: What's in this image?

A: The image shows a beautiful sunset over a mountain landscape with vibrant orange and purple hues in the sky. In the foreground, there's a serene lake reflecting the colors of the sky...

User: Can you describe the colors in more detail?

A: Certainly! The color palette includes:
• Deep indigo and violet in the upper sky
• Warm orange and gold near the horizon
• Soft pink and lavender clouds
• Dark silhouettes of the mountains


Installation

Get started in minutes

Choose your preferred setup method

Docker (Recommended)

Pre-built environment

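# Pull the pre-built image
docker pull john119/vlm

# Run with GPU support; -v mounts a host directory into the container,
# --ipc=host shares host IPC so PyTorch data loaders have enough shared memory
docker run --gpus all -it \
  -v /host/path:/docker/path \
  --name vlm --ipc=host \
  john119/vlm /bin/bash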

Includes all dependencies with CUDA 12.8 support

pip install

# Clone the repository
git clone https://github.com/2U1/Qwen2-VL-Finetune.git
cd Qwen2-VL-Finetune

# Install dependencies
pip install -r requirements.txt \
  -f https://download.pytorch.org/whl/cu128
pip install qwen-vl-utils
# --no-build-isolation lets flash-attn build against the torch already installed
pip install flash-attn --no-build-isolation

Conda environment

# Create conda environment
conda env create -f environment.yaml
conda activate train

# Install additional packages
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation

Quick Start

Launch the Gradio Demo

Run the interactive WebUI locally with just a few commands

1. Install Gradio

Install the Gradio package for the web interface:

pip install gradio

2. Launch the WebUI

Start the Gradio demo, pointing it at your merged model weights:

python -m src.serve.app \
  --model-path /path/to/merged/weight

3. Open in Browser

Navigate to http://localhost:7860 to access the chat interface. Upload images or videos and start conversations!

Resources

Explore & Contribute

• GitHub Repository: source code, issues, and contributions
• Docker Hub: pre-built Docker images
• Qwen2-VL Model: base model on Hugging Face
