A comprehensive training framework for Qwen2-VL and Qwen2.5-VL models. Supports LoRA, QLoRA, DPO, GRPO, multi-image, video, and mixed-modality training.
Comprehensive toolkit for fine-tuning vision-language models with state-of-the-art techniques
Memory-efficient fine-tuning with Low-Rank Adaptation (LoRA). Supports 4-bit and 8-bit quantization.
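To make the memory claim concrete, here is a minimal sketch of how many trainable parameters a low-rank adapter adds to one linear layer compared with training the full weight. The helper name `lora_param_counts` and the layer sizes are illustrative assumptions, not values taken from this framework.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one linear layer.

    Hypothetical helper for illustration only.
    """
    full = d_in * d_out            # dense weight W with shape (d_out, d_in)
    lora = rank * (d_in + d_out)   # A: (rank, d_in) plus B: (d_out, rank)
    return full, lora


# Assumed example sizes: a 4096x4096 projection with rank 16.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.2%}")
```

Only the two small matrices receive gradients and optimizer state, which is where most of the savings come from; quantizing the frozen base weights reduces memory further.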
Train models on human preference data using Direct Preference Optimization (DPO) so that outputs align better with the desired responses.
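As a rough illustration of what DPO optimizes, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities. The function name, argument names, and `beta=0.1` are assumptions for this example; it is not this framework's trainer code.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers "chosen" over "rejected"
    # than the frozen reference model does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When policy and reference agree, the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# The loss drops once the policy favors the chosen response more than the reference.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference model stays frozen; only the policy's log-probabilities change during training.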
Group Relative Policy Optimization (GRPO), a reinforcement learning technique for improving reasoning and instruction-following capabilities.
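Assuming this feature refers to GRPO, which the overview lists among the supported methods, its core idea can be sketched as follows: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and spread. The helper below is a hypothetical illustration using the population standard deviation; real implementations may use a different estimator or epsilon.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # better-than-average answers get positive advantages
```

Because advantages are computed relative to the group, no separate value model is needed to provide a baseline.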
Train on multi-image sequences and video data with configurable frame rates and resolution.
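To show what a configurable frame rate means in practice, here is a small sketch that picks which frames to keep when a video is downsampled to a target rate. `sample_frame_indices` is a hypothetical helper written for this example; the framework's actual sampling logic may differ.

```python
def sample_frame_indices(total_frames: int, native_fps: float,
                         target_fps: float) -> list[int]:
    """Pick evenly spaced frame indices to approximate target_fps."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames and both frame rates must be positive")
    # Keep one frame every `step` source frames; never upsample.
    step = max(native_fps / target_fps, 1.0)
    count = max(1, int(total_frames / step))
    return [min(int(i * step), total_frames - 1) for i in range(count)]


# A 3-second clip at 30 fps, sampled at 1 fps, keeps three frames.
print(sample_frame_indices(total_frames=90, native_fps=30.0, target_fps=1.0))
```

A lower target rate or a lower resolution reduces the number of visual tokens per video, which directly affects memory use during training.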
Distributed training with DeepSpeed ZeRO stage 2 and stage 3 for scaling across multiple GPUs.
Optimized Triton kernels for faster training and reduced memory consumption.
Chat with your fine-tuned models through an intuitive web interface
Choose your preferred setup method
# Pull the pre-built image
docker pull john119/vlm
# Run with GPU support
docker run --gpus all -it \
    -v /host/path:/docker/path \
    --name vlm --ipc=host \
    john119/vlm /bin/bash
The image includes all dependencies, with CUDA 12.8 support
# Clone the repository
git clone https://github.com/2U1/Qwen2-VL-Finetune.git
cd Qwen2-VL-Finetune
# Install dependencies
pip install -r requirements.txt \
    -f https://download.pytorch.org/whl/cu128
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
# Create conda environment
conda env create -f environment.yaml
conda activate train
# Install additional packages
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
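Whichever setup method is used, a quick sanity check can confirm that the key packages are importable before training starts. The sketch below is an assumption-laden example: the list covers only the packages named in the commands above (using their import names), and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec


def missing_packages(modules: list[str]) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    # find_spec locates a top-level module without importing it.
    return [name for name in modules if find_spec(name) is None]


# Import names for PyTorch, qwen-vl-utils, and flash-attn.
REQUIRED = ["torch", "qwen_vl_utils", "flash_attn"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

This only checks that the packages are installed; it does not verify that a GPU is visible or that the CUDA versions match.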
Run the interactive WebUI locally with just a few commands
Install the Gradio package for the web interface:
pip install gradio
Start the Gradio demo with your model path:
python -m src.serve.app \
    --model-path /path/to/merged/weight
Navigate to http://localhost:7860 to access the chat interface. Upload images or videos and start conversations!