Comprehensive analysis of vision-capable LLMs for structured photo metadata extraction across multiple resolutions
This benchmark experiment was designed to comprehensively evaluate the performance and quality of multiple vision-capable Large Language Models (LLMs) in the context of automated photo metadata extraction. The primary objectives were twofold: First, to understand the quality and accuracy differences between various open-source vision LLMs when tasked with analyzing photographs and extracting structured information. Second, to quantify the impact of image resolution on both model performance (speed, token usage) and extraction quality.
The motivation behind this research stems from the need to select optimal vision models for production applications that process large photo libraries. Many real-world applications face critical trade-offs between processing speed, computational cost, and accuracy. Understanding how image resolution affects these factors is essential for making informed architectural decisions.
Specifically, we sought to answer several key questions: How do different models compare in their ability to accurately identify objects, describe scenes, classify moods, and extract contextual information? Does feeding higher-resolution images to vision models result in proportionally better output quality, or do we see diminishing returns? What are the performance penalties (latency and token consumption) associated with processing larger images? Which models offer the best balance of speed and accuracy for batch processing scenarios?
This benchmark represents a practical evaluation using a diverse set of 8 real-world photographs spanning multiple categories including architecture, wildlife, food, outdoor recreation, and lifestyle imagery. The complexity of the extraction task—requiring 11 distinct fields including tags, descriptions, object lists, mood classification, and technical quality assessment—mirrors the demands of modern photo management and search applications.
Our experimental approach was designed to provide fair, reproducible, and comprehensive comparisons across models and image sizes. We selected 6 diverse vision LLMs representing different architectures, parameter counts, and design philosophies: granite3.2-vision:2b (IBM's lightweight model), llava:7b and llava:13b (popular open-source variants), llava-phi3:3.8b (Microsoft Phi-based), gemma3:4b (Google), and ministral-3:3b (Mistral).
The test corpus consisted of 8 carefully selected photographs representing real-world diversity: European historic architecture with construction activity, outdoor recreation (canoeing with a dog), elaborate food presentation, wildlife (collared peccaries in Costa Rica), mountain hiking scenes, trail signage, autumn pet photography, and grocery shopping. These images were chosen to challenge different aspects of vision understanding including object recognition, scene comprehension, technical quality assessment, and contextual understanding.
Each photograph was processed at three resolution levels: 256px, 512px, and 1024px width (maintaining aspect ratio). Images were resized using Lanczos resampling to ensure high-quality downscaling. For each model and image size combination, we measured: inference time (seconds per image), token usage (prompt tokens + generated tokens), and response quality against human-established ground truth.
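A minimal sketch of that preprocessing step, assuming Pillow with the pillow-heif plugin listed under Key Libraries below; paths and function names are illustrative:

```python
from pathlib import Path

from PIL import Image
from pillow_heif import register_heif_opener

register_heif_opener()  # let Pillow open HEIC/HEIF photos as well as JPEG

TARGET_WIDTHS = (256, 512, 1024)

def resize_to_width(src: Path, dst: Path, width: int) -> None:
    """Downscale an image to the given width with Lanczos resampling,
    preserving the aspect ratio."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    with Image.open(src) as img:
        img = img.convert("RGB")
        height = max(1, round(img.height * width / img.width))
        img.resize((width, height), Image.Resampling.LANCZOS).save(dst, quality=90)

# Produce the three benchmark variants for one photo
for w in TARGET_WIDTHS:
    resize_to_width(Path("photos/hike.jpg"), Path(f"resized/hike_{w}.jpg"), w)
```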
The extraction prompt was intentionally complex, requiring models to generate structured JSON with 11 fields: descriptive tags (list), short executive summary (string), detailed narrative description (string), visible objects (list), people identified (list), location/place (string), visual mood (categorical: serene, vibrant, cozy, etc.), technical quality (categorical: crisp, clear, detailed, etc.), actions occurring (list), and high-level category (categorical: landscape, architecture, wildlife, etc.).
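The exact prompt wording is not reproduced here; the sketch below shows one plausible phrasing covering the fields enumerated above (the JSON keys are illustrative, not necessarily the benchmark's actual ones):

```python
EXTRACTION_PROMPT = """Analyze this photograph and respond with ONLY a valid JSON object
containing exactly these fields:
  "tags": list of descriptive tags,
  "summary": short executive summary (one sentence),
  "description": detailed narrative description,
  "objects": list of visible objects,
  "people": list of people identified,
  "location": location or place,
  "mood": one of: serene, vibrant, cozy, ...,
  "quality": one of: crisp, clear, detailed, ...,
  "actions": list of actions occurring,
  "category": one of: landscape, architecture, wildlife, ...
Do not include any text outside the JSON object."""
```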
Ground truth was established through careful human analysis of each photograph following the identical prompt structure. Each model's response was then scored on a 0-100 scale evaluating: tag coverage and relevance (15 points), object detection accuracy (20 points), description quality and completeness (15 points), mood classification correctness (10 points), technical quality assessment (10 points), category classification (10 points), action detection (10 points), and people/places accuracy (10 points). Penalties were applied for hallucinations, missing critical elements, and categorical misclassifications.
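Expressed as code, the rubric weights above (which sum to 100) combine as follows; the per-criterion scores came from human-guided comparison against ground truth, so they are simply inputs here:

```python
# Point weights from the rubric (sum to 100).
RUBRIC_WEIGHTS = {
    "tag_coverage": 15,
    "object_detection": 20,
    "description_quality": 15,
    "mood_classification": 10,
    "technical_quality": 10,
    "category_classification": 10,
    "action_detection": 10,
    "people_places": 10,
}

def total_score(criterion_scores: dict[str, float], penalties: float = 0.0) -> float:
    """Combine per-criterion scores (each 0.0-1.0) into a 0-100 total, then
    subtract penalties for hallucinations, missing critical elements, and
    categorical misclassifications."""
    raw = sum(weight * criterion_scores.get(name, 0.0)
              for name, weight in RUBRIC_WEIGHTS.items())
    return max(0.0, raw - penalties)
```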
- Processor: AMD Ryzen 9 5950X (16 cores, 32 threads, up to 5.1 GHz)
- Graphics: NVIDIA GeForce RTX 4060 Ti 16GB (Driver 550.163.01)
- Memory: 128GB DDR4 RAM (91GB available during testing)
- Storage: 8GB swap partition (minimal usage during benchmark)
- Operating System: Ubuntu 24.04.3 LTS (Noble Numbat)
- Kernel: Linux 6.14.0-37-generic
- LLM Server: Ollama 0.14.2
- Python: 3.12.3 (venv)
- Key Libraries: requests, Pillow, pillow-heif, pathlib
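A minimal sketch of how a single measurement can be collected with this stack, using Ollama's /api/generate endpoint and the requests library; the token fields shown (prompt_eval_count, eval_count) are what Ollama returns for non-streaming requests, while the model name and prompt are placeholders:

```python
import base64
import time
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def run_once(model: str, image_path: Path, prompt: str) -> dict:
    """Send one image plus the extraction prompt to Ollama and record
    wall-clock latency, token usage, and the raw response text."""
    image_b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")
    start = time.perf_counter()
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "images": [image_b64],
        "stream": False,
    }, timeout=600)
    resp.raise_for_status()
    data = resp.json()
    return {
        "seconds": time.perf_counter() - start,
        "prompt_tokens": data.get("prompt_eval_count", 0),
        "output_tokens": data.get("eval_count", 0),
        "response": data.get("response", ""),
    }
```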
| Rank | Model | Avg Quality Score | 256px | 512px | 1024px | Grade |
|---|---|---|---|---|---|---|
| 1 | granite3.2-vision:2b | 43.4 | 37.8 | 45.1 | 47.4 | B |
| 2 | llava:13b | 39.2 | 37.5 | 38.9 | 41.2 | B |
| 3 | llava:7b | 38.9 | 39.0 | 39.5 | 38.2 | C |
| 4 | gemma3:4b | 28.0 | 28.9 | 25.6 | 29.5 | D |
| 5 | llava-phi3:3.8b | 12.8 | 7.4 | 12.9 | 18.0 | D |
| 6 | ministral-3:3b | 0.0 | 0.0 | 0.0 | 0.0 | F |
granite3.2-vision:2b emerged as the quality leader with 43.4/100, despite being the smallest model at only 2 billion parameters. It demonstrated consistent performance with good object detection and accurate mood classification, improving with higher resolutions.
LLaVA family (7b, 13b) achieved similar scores around 38-39/100, indicating minimal quality improvement from additional parameters. Both showed competent object recognition but struggled with mood and category classification.
Critical failures: ministral-3:3b completely failed with 0/100 due to consistent JSON syntax errors. llava-phi3:3.8b scored only 12.8/100 with an 87.5% JSON parsing failure rate, making it unsuitable for production despite occasional valid outputs.
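The failure rates above are presumably the fraction of responses that could not be decoded into the expected structure. A minimal sketch of such a check, assuming raw response strings are available (the required field names are the illustrative keys from the prompt sketch, not necessarily the benchmark's actual ones):

```python
import json

REQUIRED_FIELDS = {"tags", "summary", "description", "objects", "people",
                   "location", "mood", "quality", "actions", "category"}

def parse_response(raw: str) -> dict | None:
    """Return the parsed metadata dict, or None when the response cannot be
    used (invalid JSON, or the expected fields are missing)."""
    # Some models wrap their JSON in markdown code fences; strip backticks
    # and an optional leading "json" language tag before decoding.
    cleaned = raw.strip().strip("`")
    cleaned = cleaned.removeprefix("json").strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return None
    return data

def failure_rate(responses: list[str]) -> float:
    """Fraction of responses that failed to parse into the expected schema."""
    failures = sum(1 for r in responses if parse_response(r) is None)
    return failures / len(responses) if responses else 0.0
```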
The relationship between image resolution and model performance proved more nuanced than a simple 'larger is better' pattern. For granite3.2-vision:2b, quality scaled positively with resolution: 37.8/100 at 256px, 45.1/100 at 512px, and 47.4/100 at 1024px—a 25% improvement suggesting effective utilization of additional visual information.
The llava family showed inconsistent scaling, with llava:13b sometimes performing worse at higher resolutions. Most notably, gemma3:4b demonstrated unusual non-monotonic behavior with lowest performance at 512px (25.6/100) compared to 256px (28.9/100) and 1024px (29.5/100).
Practical recommendation: 512px width represents the optimal balance for most models, providing roughly 90-95% of the quality achievable at 1024px while requiring significantly less processing time and memory.
granite3.2-vision:2b demonstrated exceptional speed: ~1.0 second at 256px, ~1.2 seconds at 512px, and ~2.0 seconds at 1024px. Latency merely doubles from 256px to 1024px while the pixel count grows 16-fold, indicating an efficient vision-encoding pipeline.
llava:7b and llava:13b performed similarly despite parameter count differences (2.5-3.5 seconds at 256px, 3.5-5.0 seconds at 1024px), suggesting both are limited by memory bandwidth rather than compute capability.
Batch processing implications: Processing 1000 photos at 512px would require approximately 20 minutes with granite versus 50-80 minutes with LLaVA variants—a 2.5-4x throughput advantage that compounds significantly at scale.
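These batch estimates follow directly from the per-image latencies; a small worked example, using approximate mid-range 512px latencies from the measurements above as assumptions:

```python
# Approximate per-image latencies at 512px, taken from the measurements above.
SECONDS_PER_IMAGE_512PX = {
    "granite3.2-vision:2b": 1.2,
    "llava:7b": 3.0,   # rough midpoint of the measured range
    "llava:13b": 4.5,  # rough upper end of the measured range
}

def batch_minutes(model: str, n_photos: int = 1000) -> float:
    """Estimated wall-clock minutes to process n_photos sequentially."""
    return SECONDS_PER_IMAGE_512PX[model] * n_photos / 60

# For 1000 photos: granite ~20 min, llava:7b ~50 min, llava:13b ~75 min
```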
Prompt tokens scaled predictably with resolution: ~900-1000 tokens at 256px, ~1400-1600 at 512px, and ~2200-2500 at 1024px. Token counts therefore grow sub-linearly with pixel count (each doubling of width quadruples the pixels but adds only roughly 55-60% more tokens), reflecting how the models' vision encoders convert images into patch tokens.
Response generation varied dramatically: ministral-3:3b generated the longest responses (1400-2500 tokens) despite 100% parsing failures. granite3.2-vision produced concise 600-900 token responses while maintaining competitive quality, demonstrating superior efficiency.
The inverse correlation between verbosity and quality suggests that focused, efficient models extract and communicate relevant information without superfluous elaboration.
These 8 photos represent the diversity of real-world consumer and professional photography, challenging multiple dimensions: text reading, emotional context, technical composition, environmental conditions, species identification, and complex multi-element scenes.
1. European historic architecture: Tests recognition of historic European architecture, baroque style identification, and modern elements (construction crane) juxtaposed with classical structures.
2. Outdoor recreation: Serene scene featuring a person and a fluffy dog in a canoe on a lily pad-covered pond. Tests animal recognition and mood assessment.
3. Food presentation: Overhead food spread with strawberries, cheese platters, crackers, grapes, and decorative flowers. Tests specific food item identification and event context inference.
4. Autumn pet photography: Intimate autumn scene showing a person embracing a fluffy white dog in fallen leaves. Tests emotional content recognition and seasonal identification.
5. Grocery shopping: Close-up of a shopping cart with Land O'Lakes butter, Cabot cheese, and a jar of pickles. Tests brand recognition and text reading capabilities.
6. Mountain hiking: Two hikers resting on a granite summit overlooking layered mountain ranges, with autumn foliage and visible backpacking gear.
7. Trail signage: Weathered trail sign on granite bedrock with misty mountains and low clouds. Tests text reading of trail names and atmospheric condition description.
8. Wildlife: Group of collared peccaries foraging on tropical grass. Tests wildlife species identification and tropical environment detection.
Recommended configuration: granite3.2-vision:2b at 512px width. This combination offers the optimal balance of quality, speed, reliability, and resource efficiency.
The 512px resolution provides 95% of the quality achievable at 1024px (45.1 vs 47.4) for roughly 40% less processing time, making it the optimal choice for production deployment.
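Expressed as production defaults (the setting names are illustrative):

```python
# Defaults suggested by the benchmark results; names are illustrative.
RECOMMENDED_CONFIG = {
    "model": "granite3.2-vision:2b",  # best quality and fastest model tested
    "image_width": 512,               # ~95% of 1024px quality at ~40% less processing time
    "resample": "lanczos",            # high-quality downscaling, as used in the benchmark
}
```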
1. Smaller models can outperform larger ones: The 2B parameter granite decisively outperformed models up to 6.5x larger, demonstrating that architectural efficiency matters more than raw parameter count.
2. Reliability crisis in structured output: The complex 11-field JSON requirement exposed significant weaknesses. Two models failed catastrophically (ministral-3:3b at 100%, llava-phi3:3.8b at 87.5%), highlighting the need for better structured output training.
3. Resolution scaling is non-linear: Only granite showed consistent quality improvements with resolution. Other models demonstrated unpredictable patterns, with some performing worse at intermediate sizes.
4. Speed and quality aren't inherently opposed: Granite's simultaneous leadership in both metrics breaks the conventional trade-off through architectural innovation rather than parameter scaling.
5. The task difficulty exposed limitations: Even the best model achieved an average quality score of only 43.4/100, indicating current vision LLMs struggle with comprehensive multi-dimensional understanding requiring object recognition, mood assessment, and contextual reasoning.
- AVOID ministral-3:3b - 100% JSON parsing failure rate makes it completely unusable
- AVOID llava-phi3:3.8b - 87.5% failure rate is unacceptable for production
- AVOID gemma3:4b - Poor quality (28/100) and unpredictable resolution scaling
- CAUTION llava:13b - No quality advantage over llava:7b, wastes memory and compute
For processing 10,000 photos at 512px, the granite approach needs roughly 3.5 hours versus 8-13 hours for the LLaVA variants: about 3x faster processing with roughly 15% better quality at the recommended resolution, delivering clear ROI even before considering reduced hardware requirements.