TimeSpot

Benchmarking Geo-Temporal Understanding in VLMs
1,455 VQA pairs
Geographic Reasoning
Temporal Reasoning
Rubric-based Open-ended Evaluation

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision–Language Models in Real-World Settings

Azmine Toushik Wasi*, Shahriyar Zaman Ridoy*, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

Computational Intelligence and Operations Laboratory (CIOL) • Shahjalal University of Science and Technology (SUST) • North South University (NSU) • Qatar Computing Research Institute (QCRI)

*Equal Contribution

Correspondence: shahriyar.zaman01@gmail.com, mparvez@hbku.edu.qa

📄 OpenReview 📄 arXiv

Geo-temporal understanding—the ability to infer location, time, and contextual properties from visual input—is crucial for applications such as disaster management, traffic planning, navigation, world modeling, and geography education. Yet current vision–language models (VLMs) struggle with temporal reasoning and physically grounded spatial cues. To address this, we introduce TimeSpot, a benchmark of 1,455 ground-level images from 80 countries that evaluates structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude–longitude) under real-world uncertainty. Our evaluation reveals that existing VLMs perform poorly on these tasks, highlighting the need for new approaches to robust, physically grounded geo-temporal reasoning.

TimeSpot overview figure (boxed)

Benchmark Structure and Task Categories

Axis | Field | Categories (top shown)
Temporal | Season | Summer (400), Fall (399), Spring (335), Winter (321)
 | Daylight phase | Afternoon (584), Night (287), Sunset (210), Morning (203), Midday (124), Sunrise (47)
 | Month | 12 months represented; top: August (163), September (146), July (145), March (131)
 | Hemispheric tag | Northern Hemisphere Summer (703), Northern Hemisphere Winter (615), Southern Hemisphere Winter (81), Southern Hemisphere Summer (56)
 | Time coverage | Day (1182), Night (273)
 | Hour range | Full 0–23; densest 08–18
Geography | Continents | Asia (529), Europe (430), North America (326), South America (170)
 | Countries | 82 unique; top: USA (196), Russia (97), Japan (67), Italy (65), China (58)
 | Climate | Temperate (C) (582), Continental (D) (396), Tropical (A) (274), Arid (B) (180), Polar (E) (23)
 | Environment type | Urban (648), Rural (202), Mountain (193), Coastal (181), Suburban (118), Desert (113)
 | Lat/Lon span | lat -54.80 to 71.96, lon -173.24 to 170.31
Cues | Primary temporal cues | Sun/Shadows (573), Vegetation (325), Other (289), Snow/Ice (122), Human Clothing (95), Agricultural Activity (51)
 | Primary geolocation cues | Architecture (355), Natural Biome (354), Topography (Mountains/Coast) (295), Road Signage/Language (236), Vehicles (156), Other (58)

Examples

Every temporal reasoning category includes MCQ and open-ended formats to measure discriminative accuracy and generative reasoning capacity. Representative examples are shown below.
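To make the pairing concrete, a hypothetical sketch of how one benchmark item could carry both formats is shown below; the field names and helper are illustrative, not the released TimeSpot schema.

```python
# Hypothetical representation of one item with paired MCQ and
# open-ended formats (illustrative field names, not the actual schema).
item = {
    "image": "images/0001.jpg",
    "mcq": {
        "question": "What season is depicted in this image?",
        "options": ["Spring", "Summer", "Fall", "Winter"],
        "answer": "Fall",
    },
    "open_ended": {
        "question": "Describe the visual cues that indicate the season.",
        "reference": "Orange foliage and a low sun angle suggest autumn.",
    },
}

def to_mcq_prompt(entry):
    """Render the MCQ variant of an item as a single prompt string."""
    opts = ", ".join(entry["mcq"]["options"])
    return f"{entry['mcq']['question']} Options: {opts}"

print(to_mcq_prompt(item))
```

The MCQ variant scores discriminative accuracy against the keyed option, while the open-ended variant is scored against the reference rationale.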

Combined example: MCQ and Open-ended

Results

Evaluation results for the TimeSpot benchmark across multiple vision–language models, including both proprietary and open-source variants.

Multiple Choice Evaluation Accuracy (%) on TimeSpot-MCQ

Model | Cnt. | Cou. | Clim. | Env. | Lat.° | Long.° | Dist. (km) | Season | Month | Time (Ac.) | Time (MAE) | DLP

Proprietary Models
GPT-4o-mini | 82.68 | 49.14 | 50.93 | 57.87 | 12.40 | 24.70 | 2827.07 | 47.08 | 22.34 | 30.32 | 3:54 | 31.55
GPT-5-mini | 83.62 | 68.27 | 72.47 | 60.01 | 4.72 | 15.64 | 1389.79 | 58.43 | 34.27 | 21.55 | 4:10 | 44.60
Gemini-2.0-Flash | 89.07 | 76.91 | 68.52 | 60.96 | 3.32 | 11.23 | 994.30 | 49.76 | 22.89 | 27.35 | 4:22 | 30.24
Gemini-2.5-Flash | 90.51 | 77.25 | 71.34 | 64.32 | 3.05 | 10.38 | 917.61 | 50.92 | 23.91 | 25.15 | 3:56 | 41.92
Claude 3.5 Haiku | 77.25 | 55.53 | 61.86 | 55.74 | 6.85 | 27.51 | 2269.86 | 44.12 | 19.04 | 23.09 | 4:14 | 30.93
Mistral Medium 3.1 | 75.88 | 52.85 | 66.67 | 61.72 | 6.37 | 22.62 | 2045.61 | 36.84 | 15.26 | 30.73 | 3:36 | 36.01

Open-Source Models (≤11B)
InternVL3.5-1B | 43.02 | 14.15 | 32.50 | 53.54 | 44.68 | 4378.92 | 7700.42 | 30.65 | 3.78 | 7.77 | 11:45 | 35.80
InternVL3.5-2B | 60.00 | 29.41 | 51.82 | 57.80 | 13.11 | 43.71 | 3959.29 | 36.29 | 5.70 | 27.80 | 4:30 | 24.05
Qwen-VL2.5-3B-Instruct | 22.40 | 13.47 | 18.83 | 44.53 | 16.18 | 130.98 | 8231.18 | 27.49 | 9.96 | 22.06 | 4:34 | 8.52
InternVL3.5-4B | 60.79 | 30.12 | 57.77 | 56.74 | 15.34 | 44.15 | 4236.77 | 37.55 | 12.03 | 29.33 | 4:10 | 41.61
Qwen-VL2.5-7B-Instruct | 85.70 | 73.96 | 70.86 | 75.21 | 32.94 | 21.46 | 4719.95 | 61.46 | 44.96 | 25.68 | 3:47 | 64.09
Llama-3.2-11B-Vision-Instruct | 74.22 | 55.73 | 57.12 | 57.61 | 5.85 | 26.57 | 2072.35 | 43.50 | 16.68 | 25.74 | 4:18 | 43.57

Open-Source Models (>11B)
Gemma-3-27B-it | 79.59 | 54.02 | 60.41 | 53.12 | 6.83 | 23.58 | 2063.93 | 44.81 | 17.11 | 26.34 | 4:28 | 30.86
Qwen-VL2.5-32B-Instruct | 78.56 | 57.11 | 62.95 | 60.82 | 6.27 | 24.02 | 2010.12 | 44.81 | 17.86 | 31.10 | 3:44 | 44.54
InternVL3-78B | 77.46 | 53.26 | 71.61 | 61.37 | 7.42 | 23.63 | 2180.29 | 45.91 | 16.43 | 29.64 | 4:07 | 34.91
Qwen-VL2.5-72B-Instruct | 77.94 | 58.28 | 65.15 | 58.14 | 5.11 | 19.33 | 1711.42 | 44.47 | 18.28 | 28.71 | 4:00 | 36.84
Llama-3.2-90B-Vision-Instruct | 78.08 | 53.54 | 63.85 | 59.04 | 7.05 | 26.79 | 2284.85 | 45.15 | 19.72 | 23.33 | 4:29 | 33.88
GLM-4.5V-106B-MoE | 85.32 | 69.68 | 62.09 | 62.51 | 4.23 | 14.09 | 1280.87 | 57.55 | 36.04 | 30.51 | 4:09 | 42.45

Reasoning Models
o4-mini | 82.39 | 71.82 | 73.06 | 66.64 | 4.85 | 15.39 | 1359.96 | 65.81 | 48.20 | 23.91 | 4:04 | 51.79
Gemini-2-Flash-Thinking | 88.66 | 76.22 | 66.73 | 59.93 | 3.44 | 11.70 | 1024.14 | 49.28 | 22.68 | 27.49 | 4:22 | 29.76
Gemini-2.5-Flash-Thinking | 90.31 | 77.59 | 70.86 | 64.47 | 3.04 | 9.85 | 892.54 | 51.13 | 24.26 | 22.19 | 4:03 | 36.56
Kimi-VL-A3B-Thinking-2506 | 58.90 | 40.69 | 54.84 | 59.31 | 16.00 | 39.83 | 4034.15 | 39.72 | 12.65 | 32.23 | 4:18 | 25.70
GLM-4.1V-9B-Thinking | 84.44 | 68.34 | 70.19 | 68.54 | 4.34 | 23.01 | 1788.77 | 58.02 | 38.88 | 33.74 | 3:58 | 47.76

Abbreviations: Cnt. → Continent, Cou. → Country, Clim. → Climate Zone, Env. → Environment Type, Lat.° → Latitude error in degrees, Long.° → Longitude error in degrees, Dist. (km) → Mean distance (MD) from the actual location in kilometers, DLP → Daylight phase. Time (Ac.) denotes accuracy when the model predicts the time within a ±1-hour window, and Time (MAE) shows the mean absolute error in HH:MM format.

Key Takeaways

  • Overall Findings:
    • Temporal Inference is a Major Bottleneck: Time-of-day accuracy is extremely low across all models (22–34%), peaking at 33.74% (GLM-4.1V-9B-Thinking), with MAE ≈ 4 hours.
    • Geodesic Disconnect: High coarse-grained localization often coexists with large metric and temporal errors. Top models (Gemini-2.5-Flash-Thinking) reach ~77.59% country accuracy but median geodesic error = 892.54 km.
    • Proprietary Models Lead: Closed-source models outperform open-source in spatial reasoning, metric localization, and calibration.
    • Open-Source Variance: GLM-4.5V-106B-MoE attains competitive country accuracy (69.68%) but weaker metric grounding; Qwen-VL2.5-7B fails in coordinate estimation (MD 4719 km).
    • Reasoning-Augmented Models Excel: Gemini-2.5-Flash-Thinking, o4-mini outperform base counterparts across geolocation and temporal tasks, demonstrating multi-step inference benefits.
  • Geo-Temporal Consistency & Confidence:
    • Even strong models exhibit physical consistency violations (phase-time mismatches, season-month inconsistencies, spatial tail errors).
    • Calibration failures: all models show overconfidence on fine-grained temporal tasks (high ECE), signaling low reliability under ambiguous cues.
  • Error Analysis:
    • Seasonal Collapse: Summer is reliably predicted; Autumn collapses to 0% accuracy, showing models rely on static color cues rather than true phenology.
    • Daylight Phase Blindspots: Midday/Afternoon easier; Night and Sunrise/Sunset accuracy remains <35%, often confusing dawn/dusk.
    • Time Anchoring: Predictions collapse to round-hour anchors (09:00, 12:00, 18:00), indicating failure to compute continuous solar geometry.
    • Spatial Drift & Biases: Near-miss country errors (e.g., Bangladesh vs India); default to Temperate/Urban, poor performance in Polar, Arid, and Continental climates.
  • Qualitative Error Analysis:
    • Low-Light Breakdowns: Night/twilight scenes cause drastic drift; in the absence of solar cues, models fall back on round-hour guesses.
    • Urban Occlusion: Street canyons compress apparent sun elevation → late time predictions.
    • Missing Shorelines: Coastal/elevation cues systematically ignored; coordinates shift inland.
    • Shortcut Exploitation: Models rely on human-centric cues (architecture/signage), failing to use environmental cues (biome, topography) for precise geolocation or temporal estimation.
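The continuous solar geometry that the error analysis shows models failing to compute can be sketched with the standard declination and hour-angle approximation below. This is an illustrative sketch, not part of the benchmark's code: given latitude, day of year, and local solar time, the sun's elevation is fully determined, which is the physical constraint a round-hour guess ignores.

```python
import math

def solar_elevation(lat_deg, day_of_year, local_solar_hour):
    """Approximate solar elevation angle in degrees for a given latitude,
    day of year, and local solar time. Uses the standard declination
    approximation; a sketch, not an ephemeris-grade computation."""
    # Solar declination: -23.44° * cos(360°/365 * (d + 10))
    decl = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
    # Hour angle: 15° per hour from solar noon
    hour_angle = 15.0 * (local_solar_hour - 12.0)
    phi, d, h = (math.radians(x) for x in (lat_deg, decl, hour_angle))
    sin_elev = math.sin(phi) * math.sin(d) + math.cos(phi) * math.cos(d) * math.cos(h)
    return math.degrees(math.asin(sin_elev))

# At the equator around the equinox (day ~79), noon elevation is near 90°;
# at 51.5°N near the winter solstice (day ~355), noon elevation is ~15°.
print(solar_elevation(0.0, 79, 12.0), solar_elevation(51.5, 355, 12.0))
```

Inverting this relationship (elevation observed from shadow lengths, plus a latitude estimate) is what distinguishes continuous time estimation from anchoring on 09:00/12:00/18:00.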

Benchmark Design & Evaluation Methods

  • Dataset Construction & Protocol:
    • 1,455 ground-level images from 80 countries; landmarks and heavy text removed to force reliance on physical cues (illumination, shadows, sky, vegetation, materials).
    • Structured 9-field prediction: 4 temporal (season, month, local time, daylight phase) + 5 geographic (continent, country, climate, environment type, lat/lon).
    • Annotations verified with metadata (timestamps, GPS, solar ephemerides) and cross-checked manually for physical validity.
  • Evaluation Metrics:
    • Geographic: Top-1 categorical accuracy for region/climate; MAE for lat/lon; mean/median great-circle distance (MD) in km.
    • Temporal: Top-1 accuracy, ±1-hour window accuracy, HH:MM MAE for local time.
    • LLM-as-a-Judge: Semantic alignment for synonyms (USA vs United States, Fall vs Autumn) is handled by a Gemini-2.5-Flash judge model.
  • Performance Improvement Approaches:
    • Supervised Fine-Tuning (SFT): Fine-tuning Qwen-VL2.5-3B-Instruct on 40% of TimeSpot improved categorical geo-semantic accuracy (Country: 14.2% → 19.2%), but time predictions remain unstable.
    • Recommendations: Future models should include physical inductive biases (latitude/solar conditioning), constraint-aware reasoning (phase/time/longitude matching), and augmentations for extreme lighting and polar environments.
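The geographic and temporal metrics described above can be sketched as follows. This assumes a haversine great-circle distance and a wrap-around treatment of clock time (so 23:50 vs 00:10 counts as 20 minutes); the benchmark's exact implementation may differ.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometers (Earth radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def time_error_minutes(pred_hhmm, true_hhmm):
    """Circular absolute error between two HH:MM times, in minutes."""
    def to_min(s):
        h, m = map(int, s.split(":"))
        return h * 60 + m
    d = abs(to_min(pred_hhmm) - to_min(true_hhmm))
    return min(d, 1440 - d)  # wrap around midnight

def within_one_hour(pred_hhmm, true_hhmm):
    """±1-hour window criterion used for Time (Ac.)."""
    return time_error_minutes(pred_hhmm, true_hhmm) <= 60

print(round(great_circle_km(23.8, 90.4, 28.6, 77.2)))  # Dhaka vs Delhi, in km
```

Averaging `great_circle_km` over the dataset gives the MD column; averaging `time_error_minutes` gives Time (MAE).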

Citation

Please cite the paper as follows:

@misc{wasi2026timespot,
  title={TimeSpot: Benchmarking Geo-Temporal Understanding in Vision{\textendash}Language Models in Real-World Settings},
  author={Azmine Toushik Wasi and Shahriyar Zaman Ridoy and Koushik Ahamed Tonmoy and Kinga Tshering and S. M. Muhtasimul Hasan and Wahid Faisal and Tasnim Mohiuddin and Md Rizwan Parvez},
  year={2026},
  url={https://openreview.net/forum?id=ZLTbUvfej2}
}
      