TECHNICAL REPORT

Vision LLM Photo Extraction Quality Benchmark

Comprehensive analysis of vision-capable LLMs for structured photo metadata extraction across multiple resolutions

🔭 Purpose of Experiment

This benchmark experiment was designed to comprehensively evaluate the performance and quality of multiple vision-capable Large Language Models (LLMs) in the context of automated photo metadata extraction. The primary objectives were twofold: First, to understand the quality and accuracy differences between various open-source vision LLMs when tasked with analyzing photographs and extracting structured information. Second, to quantify the impact of image resolution on both model performance (speed, token usage) and extraction quality.

The motivation behind this research stems from the need to select optimal vision models for production applications that process large photo libraries. Many real-world applications face critical trade-offs between processing speed, computational cost, and accuracy. Understanding how image resolution affects these factors is essential for making informed architectural decisions.

Specifically, we sought to answer several key questions: How do different models compare in their ability to accurately identify objects, describe scenes, classify moods, and extract contextual information? Does feeding higher-resolution images to vision models result in proportionally better output quality, or do we see diminishing returns? What are the performance penalties (latency and token consumption) associated with processing larger images? Which models offer the best balance of speed and accuracy for batch processing scenarios?

This benchmark represents a practical evaluation using a diverse set of 8 real-world photographs spanning multiple categories including architecture, wildlife, food, outdoor recreation, and lifestyle imagery. The complexity of the extraction task—requiring 11 distinct fields including tags, descriptions, object lists, mood classification, and technical quality assessment—mirrors the demands of modern photo management and search applications.

Approach and Methodology

Our experimental approach was designed to provide fair, reproducible, and comprehensive comparisons across models and image sizes. We selected 6 diverse vision LLMs representing different architectures, parameter counts, and design philosophies: granite3.2-vision:2b (IBM's lightweight model), llava:7b and llava:13b (popular open-source variants), llava-phi3:3.8b (Microsoft Phi-based), gemma3:4b (Google), and ministral-3:3b (Mistral).

The test corpus consisted of 8 carefully selected photographs representing real-world diversity: European historic architecture with construction activity, outdoor recreation (canoeing with a dog), elaborate food presentation, wildlife (collared peccaries in Costa Rica), mountain hiking scenes, trail signage, autumn pet photography, and grocery shopping. These images were chosen to challenge different aspects of vision understanding including object recognition, scene comprehension, technical quality assessment, and contextual understanding.

Each photograph was processed at three resolution levels: 256px, 512px, and 1024px width (maintaining aspect ratio). Images were resized using Lanczos resampling to ensure high-quality downscaling. For each model and image size combination, we measured: inference time (seconds per image), token usage (prompt tokens + generated tokens), and response quality against human-established ground truth.
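A minimal sketch of that resizing step, assuming Pillow (the benchmark's actual script is not reproduced here; function names are illustrative):

```python
from pathlib import Path


def target_size(width, height, new_width):
    """Height that preserves the original aspect ratio at new_width."""
    return new_width, max(1, round(height * new_width / width))


def resize_for_benchmark(src, dest_dir, widths=(256, 512, 1024)):
    """Write one Lanczos-downscaled copy of `src` per target width."""
    from PIL import Image  # third-party: Pillow

    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with Image.open(src) as im:
        for w in widths:
            copy = im.resize(target_size(im.width, im.height, w),
                             Image.Resampling.LANCZOS)
            copy.convert("RGB").save(
                dest_dir / f"{Path(src).stem}_{w}px.jpg", quality=90)
```

Lanczos resampling, as used here, preserves fine detail better than bilinear filtering when downscaling, which matters for the text-reading and small-object tests below.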

The extraction prompt was intentionally complex, requiring models to generate structured JSON with 11 fields: descriptive tags (list), short executive summary (string), detailed narrative description (string), visible objects (list), people identified (list), location/place (string), visual mood (categorical: serene, vibrant, cozy, etc.), technical quality (categorical: crisp, clear, detailed, etc.), actions occurring (list), and high-level category (categorical: landscape, architecture, wildlife, etc.).
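One way to keep such a prompt and its field contract in sync is to generate the prompt from a schema. A sketch with paraphrased field names (the report's exact prompt wording and key names are not reproduced; the text enumerates ten of the eleven fields by name):

```python
import json

# Paraphrased extraction fields with empty values of the expected type.
FIELDS = {
    "tags": [], "summary": "", "description": "", "objects": [],
    "people": [], "location": "", "mood": "", "technical_quality": "",
    "actions": [], "category": "",
}

PROMPT = (
    "Analyze the photo and respond with ONLY a JSON object "
    "containing exactly these keys:\n" + json.dumps(FIELDS, indent=2)
)
```

Embedding the empty-valued schema verbatim in the prompt gives weaker models a concrete template to fill, which tends to reduce (but, as the results show, not eliminate) malformed output.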

Ground truth was established through careful human analysis of each photograph following the identical prompt structure. Each model's response was then scored on a 0-100 scale evaluating: tag coverage and relevance (15 points), object detection accuracy (20 points), description quality and completeness (15 points), mood classification correctness (10 points), technical quality assessment (10 points), category classification (10 points), action detection (10 points), and people/places accuracy (10 points). Penalties were applied for hallucinations, missing critical elements, and categorical misclassifications.
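The rubric's weights sum to 100 points. A sketch of how the components might be combined in code (the per-component grading itself was done by hand against ground truth, and the penalty scheme here is simplified):

```python
# Component weights from the rubric above (points; total = 100).
WEIGHTS = {
    "tags": 15, "objects": 20, "description": 15, "mood": 10,
    "technical_quality": 10, "category": 10, "actions": 10,
    "people_places": 10,
}


def score_response(fractions, penalty=0.0):
    """Combine per-component correctness fractions (0.0-1.0) into a
    0-100 score, then subtract hallucination/misclassification penalties,
    clamping at zero."""
    raw = sum(w * fractions.get(name, 0.0) for name, w in WEIGHTS.items())
    return max(0.0, raw - penalty)
```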

🖥️ Hardware and Software Configuration

System Hardware

Processor: AMD Ryzen 9 5950X (16 cores, 32 threads, up to 5.1 GHz)

Graphics: NVIDIA GeForce RTX 4060 Ti 16GB (Driver 550.163.01)

Memory: 128GB DDR4 RAM (91GB available during testing)

Storage: 8GB swap partition (minimal usage during benchmark)

Software Stack

Operating System: Ubuntu 24.04.3 LTS (Noble Numbat)

Kernel: Linux 6.14.0-37-generic

LLM Server: Ollama 0.14.2

Python: 3.12.3 (venv)

Key Libraries: requests, Pillow, pillow-heif (plus pathlib from the standard library)

Hardware Footprint Challenge

The 16GB VRAM capacity allowed even the larger 13B parameter models to load completely into GPU memory without requiring system RAM offloading, ensuring consistent inference performance. Ollama handled GPU acceleration automatically using its bundled CUDA runtime, eliminating the need for a separate CUDA toolkit installation.
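Driving Ollama from Python needs only its local HTTP API; a stdlib-only sketch of one /api/generate call (default host/port assumed, error handling omitted):

```python
import base64
import json
from urllib import request

OLLAMA = "http://localhost:11434"  # Ollama's default local endpoint


def build_payload(model, prompt, image_bytes):
    """Ollama /api/generate payload; images are sent as base64 strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def generate(payload):
    req = request.Request(OLLAMA + "/api/generate",
                          data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        body = json.load(resp)
    # prompt_eval_count / eval_count are the token counters a benchmark
    # like this one records; "response" holds the model's text.
    return (body["response"],
            body.get("prompt_eval_count"),
            body.get("eval_count"))
```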

🎯 Quality Rankings

Scoring Methodology: Each model was scored 0-100 based on accuracy, completeness, and absence of hallucinations across 11 extraction fields. Scores reflect how well models identified objects, actions, moods, and contextual information in the test photos.

Rank  Model                 Avg Quality Score  256px  512px  1024px  Grade
1     granite3.2-vision:2b  43.4               37.8   45.1   47.4    B
2     llava:13b             39.2               37.5   38.9   41.2    B
3     llava:7b              38.9               39.0   39.5   38.2    C
4     gemma3:4b             28.0               28.9   25.6   29.5    D
5     llava-phi3:3.8b       12.8               7.4    12.9   18.0    D
6     ministral-3:3b        0.0                0.0    0.0    0.0     F

Key Quality Findings

granite3.2-vision:2b emerged as the quality leader with 43.4/100, despite being the smallest model at only 2 billion parameters. It demonstrated consistent performance with good object detection and accurate mood classification, improving with higher resolutions.

LLaVA family (7b, 13b) achieved similar scores around 38-39/100, indicating minimal quality improvement from additional parameters. Both showed competent object recognition but struggled with mood and category classification.

Critical failures: ministral-3:3b completely failed with 0/100 due to consistent JSON syntax errors. llava-phi3:3.8b scored only 12.8/100 with an 87.5% JSON parsing failure rate, making it unsuitable for production despite occasional valid outputs.
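Failure rates like these are why a harness should parse model output defensively rather than calling json.loads directly. A tolerant-parsing sketch (the fence-stripping and brace-matching heuristics are our assumptions, not the benchmark's code):

```python
import json
import re


def parse_model_json(text):
    """Try to recover a JSON object from a model response.
    Returns (dict, None) on success or (None, error_message)."""
    # Strip markdown code fences some models wrap around their output.
    text = re.sub(r"```(?:json)?", "", text).strip()
    # Fall back to the outermost {...} span if extra prose surrounds it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None, "no JSON object found"
    try:
        return json.loads(text[start:end + 1]), None
    except json.JSONDecodeError as exc:
        return None, str(exc)
```

Even with recovery heuristics like these, a model that emits structurally invalid JSON (unbalanced braces, unquoted keys) still fails, which is what drove ministral-3:3b's score to zero.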

📊 Performance Analysis

[Chart: Quality vs. Speed Trade-off (512px)]

[Chart: Processing Time vs. Image Size]

[Chart: Token Usage by Model and Resolution]

📈 Detailed Analysis

Impact of Photo Input Size

The relationship between image resolution and model performance proved more nuanced than a simple 'larger is better' pattern. For granite3.2-vision:2b, quality scaled positively with resolution: 37.8/100 at 256px, 45.1/100 at 512px, and 47.4/100 at 1024px—a 25% improvement suggesting effective utilization of additional visual information.

The llava family showed inconsistent scaling, with llava:13b sometimes performing worse at higher resolutions. Most notably, gemma3:4b demonstrated unusual non-monotonic behavior with lowest performance at 512px (25.6/100) compared to 256px (28.9/100) and 1024px (29.5/100).

Practical recommendation: 512px width represents the optimal balance for most models, providing 80-90% of the quality achievable at 1024px while requiring significantly less processing time and memory.

Response Times and Performance

granite3.2-vision:2b demonstrated exceptional speed: ~1.0 second at 256px, ~1.2 seconds at 512px, and ~2.0 seconds at 1024px. This near-linear scaling indicates efficient architecture.

llava:7b and llava:13b performed similarly despite parameter count differences (2.5-3.5 seconds at 256px, 3.5-5.0 seconds at 1024px), suggesting both are limited by memory bandwidth rather than compute capability.

Batch processing implications: Processing 1000 photos at 512px would require approximately 20 minutes with granite versus 50-80 minutes with LLaVA variants—a 2.5-4x throughput advantage that compounds significantly at scale.
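The arithmetic behind those estimates is easy to sanity-check (seconds-per-image figures taken from the measurements above; 512px LLaVA times fall between the reported 256px and 1024px figures):

```python
def batch_minutes(photos, seconds_per_image):
    """Wall-clock minutes to process a batch sequentially."""
    return photos * seconds_per_image / 60


granite = batch_minutes(1000, 1.2)     # ~1.2 s/image at 512px -> ~20 min
llava_low = batch_minutes(1000, 3.0)   # faster end of LLaVA range -> ~50 min
llava_high = batch_minutes(1000, 4.8)  # slower end -> ~80 min
```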

Token Usage and Efficiency

Prompt tokens scaled predictably with resolution: ~900-1000 tokens at 256px, ~1400-1600 at 512px, and ~2200-2500 at 1024px. Each doubling of width quadruples the pixel count but increases prompt tokens by only ~1.5x; this sublinear scaling reflects how vision encoders tile and downsample images into a bounded number of patches.

Response generation varied dramatically: ministral-3:3b generated the longest responses (1400-2500 tokens) despite 100% parsing failures. granite3.2-vision produced concise 600-900 token responses while maintaining competitive quality, demonstrating superior efficiency.

The inverse correlation between verbosity and quality suggests that focused, efficient models extract and communicate relevant information without superfluous elaboration.

📸 Sample Photos and Ground Truth

These 8 photos represent the diversity of real-world consumer and professional photography, challenging multiple dimensions: text reading, emotional context, technical composition, environmental conditions, species identification, and complex multi-element scenes.

Augsburg City Hall (Architecture)

Tests recognition of historic European architecture, baroque style identification, and modern elements (construction crane) juxtaposed with classical structures.

Tags: architecture, baroque, historic, clock tower, domes

Canoeing with Dog (Lifestyle)

Serene outdoor scene featuring person and fluffy dog in canoe on lily pad-covered pond. Tests animal recognition and mood assessment.

Tags: canoe, dog, lily pads, outdoor recreation

Catered Food Spread (Food)

Overhead food spread with strawberries, cheese platters, crackers, grapes, and decorative flowers. Tests specific food item identification and event context inference.

Tags: food, party, cheese board, strawberries

Autumn Pet Bonding (Candid)

Intimate autumn scene showing person embracing fluffy white dog in fallen leaves. Tests emotional content recognition and seasonal identification.

Tags: autumn, dog, bonding, cuddle

Grocery Shopping Cart (Still Life)

Close-up of shopping cart with Land O'Lakes butter, Cabot cheese, and a jar of pickles. Tests brand recognition and text reading capabilities.

Tags: shopping, groceries, dairy, butter

Mountain Summit Vista (Travel)

Two hikers resting on granite summit overlooking layered mountain ranges, with autumn foliage and visible backpacking gear.

Tags: hiking, backpacking, mountains, vista

White Mountains Trail Sign (Landscape)

Weathered trail sign on granite bedrock with misty mountains and low clouds. Tests text reading of trail names and atmospheric condition description.

Tags: trail sign, hiking, atmospheric, mountains

Costa Rica Wildlife (Wildlife)

Group of collared peccaries foraging on tropical grass. Tests wildlife species identification and tropical environment detection.

Tags: wildlife, peccary, tropical, foraging

🎯 Conclusion and Recommendations

🚀 Primary Recommendation: granite3.2-vision:2b @ 512px

This configuration offers the optimal balance of quality, speed, reliability, and resource efficiency:

  • Best 512px quality score: 45.1/100, the highest of any model tested at that resolution
  • Fastest inference: ~1.2 seconds per image at 512px
  • Smallest footprint: 2B parameters enabling deployment on modest GPUs
  • 100% reliability: no JSON parsing failures
  • Consistent performance across diverse image types

The 512px resolution provides 95% of the quality achievable at 1024px (45.1 vs 47.4) for roughly 40% less processing time, making it the optimal choice for production deployment.

Key Findings

1. Smaller models can outperform larger ones: The 2B parameter granite decisively outperformed models up to 6.5x larger, demonstrating that architectural efficiency matters more than raw parameter count.

2. Reliability crisis in structured output: The complex 11-field JSON requirement exposed significant weaknesses. Two models failed catastrophically (ministral-3:3b at 100%, llava-phi3:3.8b at 87.5%), highlighting the need for better structured output training.

3. Resolution scaling is non-linear: Only granite showed consistent quality improvements with resolution. Other models demonstrated unpredictable patterns, with some performing worse at intermediate sizes.

4. Speed and quality aren't inherently opposed: Granite's simultaneous leadership in both metrics breaks the conventional trade-off through architectural innovation rather than parameter scaling.

5. The task difficulty exposed limitations: Even the best model scored only 43.4/100, indicating current vision LLMs struggle with comprehensive multi-dimensional understanding requiring object recognition, mood assessment, and contextual reasoning.

Models to Avoid

AVOID ministral-3:3b - 100% JSON parsing failure rate makes it completely unusable

AVOID llava-phi3:3.8b - 87.5% failure rate is unacceptable for production

AVOID gemma3:4b - Poor quality (28/100) and unpredictable resolution scaling

CAUTION llava:13b - No quality advantage over llava:7b, wastes memory and compute

Cost-Benefit Analysis

For processing 10,000 photos:

  • granite3.2-vision @ 512px: ~3.3 hours, minimal GPU cost, 45% quality
  • llava:7b @ 512px: ~10 hours, moderate GPU cost, 39% quality
  • llava:13b @ 1024px: ~15 hours, high GPU cost, 41% quality

The granite approach offers 3x faster processing with 15% better quality, delivering clear ROI even before considering reduced hardware requirements.