TimeSpot

Benchmarking Geo-Temporal Understanding in VLMs
1,455 VQA pairs
Geographic Reasoning
Temporal Reasoning
Rubric-based Open-ended Evaluation

TimeSpot: Benchmarking Geo-Temporal Understanding in Vision–Language Models in Real-World Settings

Azmine Toushik Wasi*, Shahriyar Zaman Ridoy*, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez

Computational Intelligence and Operations Laboratory (CIOL) • Shahjalal University of Science and Technology (SUST) • North South University (NSU) • Qatar Computing Research Institute (QCRI)

*Equal Contribution

Correspondence: shahriyar.zaman01@gmail.com, mparvez@hbku.edu.qa

📄 OpenReview 📄 arXiv

Geo-temporal understanding—the ability to infer location, time, and contextual properties from visual input—is crucial for applications such as disaster management, traffic planning, navigation, world modeling, and geography education. Yet current vision–language models (VLMs) struggle with temporal reasoning and physically grounded spatial cues. To address this, we introduce TimeSpot, a benchmark of 1,455 ground-level images from 80 countries that evaluates structured prediction of temporal attributes (season, month, time of day, daylight phase) and geographic attributes (continent, country, climate zone, environment type, latitude–longitude) under real-world uncertainty. Our evaluation reveals that existing VLMs perform poorly on these tasks, highlighting the need for new approaches to robust, physically grounded geo-temporal reasoning.

TimeSpot overview figure (boxed)

Benchmark Structure and Task Categories

Axis | Field | Categories (top shown)
Temporal | Season | Summer (400), Fall (399), Spring (335), Winter (321)
 | Daylight phase | Afternoon (584), Night (287), Sunset (210), Morning (203), Midday (124), Sunrise (47)
 | Month | 12 months represented; top: August (163), September (146), July (145), March (131)
 | Hemispheric tag | Northern Hemisphere Summer (703), Northern Hemisphere Winter (615), Southern Hemisphere Winter (81), Southern Hemisphere Summer (56)
 | Time coverage | Day (1182), Night (273)
 | Hour range | Full 0–23; densest 08–18
Geography | Continents | Asia (529), Europe (430), North America (326), South America (170)
 | Countries | 82 unique; top: USA (196), Russia (97), Japan (67), Italy (65), China (58)
 | Climate | Temperate (C) (582), Continental (D) (396), Tropical (A) (274), Arid (B) (180), Polar (E) (23)
 | Environment type | Urban (648), Rural (202), Mountain (193), Coastal (181), Suburban (118), Desert (113)
 | Lat/Lon span | lat -54.80 to 71.96, lon -173.24 to 170.31
Cues | Primary temporal cues | Sun/Shadows (573), Vegetation (325), Other (289), Snow/Ice (122), Human Clothing (95), Agricultural Activity (51)
 | Primary geolocation cues | Architecture (355), Natural Biome (354), Topography (Mountains/Coast) (295), Road Signage/Language (236), Vehicles (156), Other (58)

Examples

Every temporal reasoning category includes MCQ and open-ended formats to measure discriminative accuracy and generative reasoning capacity. Representative examples are shown below.
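To make the pairing concrete, a hypothetical sketch of how one benchmark item could carry both formats is shown below; the field names and helper are illustrative, not the released TimeSpot schema.

```python
# Hypothetical representation of one item with paired MCQ and
# open-ended formats (illustrative field names, not the actual schema).
item = {
    "image": "images/0001.jpg",
    "mcq": {
        "question": "What season is depicted in this image?",
        "options": ["Spring", "Summer", "Fall", "Winter"],
        "answer": "Fall",
    },
    "open_ended": {
        "question": "Describe the visual cues that indicate the season.",
        "reference": "Orange foliage and a low sun angle suggest autumn.",
    },
}

def to_mcq_prompt(entry):
    """Render the MCQ variant of an item as a single prompt string."""
    opts = ", ".join(entry["mcq"]["options"])
    return f"{entry['mcq']['question']} Options: {opts}"

print(to_mcq_prompt(item))
```

The MCQ variant scores discriminative accuracy against the keyed option, while the open-ended variant is scored against the reference rationale.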

Combined example: MCQ and Open-ended

Results

Evaluation results for the TimeSpot benchmark across multiple vision–language models, including both proprietary and open-source variants.

Multiple Choice Evaluation Accuracy (%) on TimeSpot-MCQ

Model | Cnt. | Cou. | Clim. | Env. | Lat.° | Long.° | Dist. (km) | Season | Month | Time (Ac.) | Time (MAE) | DLP

Proprietary Models
GPT-4o-mini | 82.68 | 49.14 | 50.93 | 57.87 | 12.40 | 24.70 | 2827.07 | 47.08 | 22.34 | 30.32 | 3:54 | 31.55
GPT-5-mini | 83.62 | 68.27 | 72.47 | 60.01 | 4.72 | 15.64 | 1389.79 | 58.43 | 34.27 | 21.55 | 4:10 | 44.60
Gemini-2.0-Flash | 89.07 | 76.91 | 68.52 | 60.96 | 3.32 | 11.23 | 994.30 | 49.76 | 22.89 | 27.35 | 4:22 | 30.24
Gemini-2.5-Flash | 90.51 | 77.25 | 71.34 | 64.32 | 3.05 | 10.38 | 917.61 | 50.92 | 23.91 | 25.15 | 3:56 | 41.92
Claude 3.5 Haiku | 77.25 | 55.53 | 61.86 | 55.74 | 6.85 | 27.51 | 2269.86 | 44.12 | 19.04 | 23.09 | 4:14 | 30.93
Mistral Medium 3.1 | 75.88 | 52.85 | 66.67 | 61.72 | 6.37 | 22.62 | 2045.61 | 36.84 | 15.26 | 30.73 | 3:36 | 36.01

Open-Source Models (≤11B)
InternVL3.5-1B | 43.02 | 14.15 | 32.50 | 53.54 | 44.68 | 4378.92 | 7700.42 | 30.65 | 3.78 | 7.77 | 11:45 | 35.80
InternVL3.5-2B | 60.00 | 29.41 | 51.82 | 57.80 | 13.11 | 43.71 | 3959.29 | 36.29 | 5.70 | 27.80 | 4:30 | 24.05
Qwen-VL2.5-3B-Instruct | 22.40 | 13.47 | 18.83 | 44.53 | 16.18 | 130.98 | 8231.18 | 27.49 | 9.96 | 22.06 | 4:34 | 8.52
InternVL3.5-4B | 60.79 | 30.12 | 57.77 | 56.74 | 15.34 | 44.15 | 4236.77 | 37.55 | 12.03 | 29.33 | 4:10 | 41.61
Qwen-VL2.5-7B-Instruct | 85.70 | 73.96 | 70.86 | 75.21 | 32.94 | 21.46 | 4719.95 | 61.46 | 44.96 | 25.68 | 3:47 | 64.09
Llama-3.2-11B-Vision-Instruct | 74.22 | 55.73 | 57.12 | 57.61 | 5.85 | 26.57 | 2072.35 | 43.50 | 16.68 | 25.74 | 4:18 | 43.57

Open-Source Models (>11B)
Gemma-3-27B-it | 79.59 | 54.02 | 60.41 | 53.12 | 6.83 | 23.58 | 2063.93 | 44.81 | 17.11 | 26.34 | 4:28 | 30.86
Qwen-VL2.5-32B-Instruct | 78.56 | 57.11 | 62.95 | 60.82 | 6.27 | 24.02 | 2010.12 | 44.81 | 17.86 | 31.10 | 3:44 | 44.54
InternVL3-78B | 77.46 | 53.26 | 71.61 | 61.37 | 7.42 | 23.63 | 2180.29 | 45.91 | 16.43 | 29.64 | 4:07 | 34.91
Qwen-VL2.5-72B-Instruct | 77.94 | 58.28 | 65.15 | 58.14 | 5.11 | 19.33 | 1711.42 | 44.47 | 18.28 | 28.71 | 4:00 | 36.84
Llama-3.2-90B-Vision-Instruct | 78.08 | 53.54 | 63.85 | 59.04 | 7.05 | 26.79 | 2284.85 | 45.15 | 19.72 | 23.33 | 4:29 | 33.88
GLM-4.5V-106B-MoE | 85.32 | 69.68 | 62.09 | 62.51 | 4.23 | 14.09 | 1280.87 | 57.55 | 36.04 | 30.51 | 4:09 | 42.45

Reasoning Models
o4-mini | 82.39 | 71.82 | 73.06 | 66.64 | 4.85 | 15.39 | 1359.96 | 65.81 | 48.20 | 23.91 | 4:04 | 51.79
Gemini-2-Flash-Thinking | 88.66 | 76.22 | 66.73 | 59.93 | 3.44 | 11.70 | 1024.14 | 49.28 | 22.68 | 27.49 | 4:22 | 29.76
Gemini-2.5-Flash-Thinking | 90.31 | 77.59 | 70.86 | 64.47 | 3.04 | 9.85 | 892.54 | 51.13 | 24.26 | 22.19 | 4:03 | 36.56
Kimi-VL-A3B-Thinking-2506 | 58.90 | 40.69 | 54.84 | 59.31 | 16.00 | 39.83 | 4034.15 | 39.72 | 12.65 | 32.23 | 4:18 | 25.70
GLM-4.1V-9B-Thinking | 84.44 | 68.34 | 70.19 | 68.54 | 4.34 | 23.01 | 1788.77 | 58.02 | 38.88 | 33.74 | 3:58 | 47.76

Abbreviations: Cnt. → Continent, Cou. → Country, Clim. → Climate Zone, Env. → Environment Type, Lat.° → Latitude error in degrees, Long.° → Longitude error in degrees, Dist. (km) → Mean distance (MD) from the actual location in kilometers, DLP → Daylight phase. Time (Ac.) denotes accuracy when the model predicts the time within a ±1-hour window, and Time (MAE) shows the mean absolute error in HH:MM format.

Key Takeaways

  • Overall Findings:
    • Temporal Inference is a Major Bottleneck: Time-of-day accuracy is extremely low across all models (22–34%), peaking at 33.74% (GLM-4.1V-9B-Thinking), with MAE ≈ 4 hours.
    • Geodesic Disconnect: High coarse-grained localization often coexists with large metric and temporal errors. Top models (Gemini-2.5-Flash-Thinking) reach ~77.59% country accuracy but median geodesic error = 892.54 km.
    • Proprietary Models Lead: Closed-source models outperform open-source in spatial reasoning, metric localization, and calibration.
    • Open-Source Variance: GLM-4.5V-106B-MoE attains competitive country accuracy (69.68%) but weaker metric grounding; Qwen-VL2.5-7B fails in coordinate estimation (MD 4719 km).
    • Reasoning-Augmented Models Excel: Gemini-2.5-Flash-Thinking, o4-mini outperform base counterparts across geolocation and temporal tasks, demonstrating multi-step inference benefits.
  • Geo-Temporal Consistency & Confidence:
    • Even strong models exhibit physical consistency violations (phase-time mismatches, season-month inconsistencies, spatial tail errors).
    • Calibration failures: all models show overconfidence on fine-grained temporal tasks (high ECE), signaling low reliability under ambiguous cues.
  • Error Analysis:
    • Seasonal Collapse: Summer is reliably predicted; Autumn collapses to 0% accuracy, showing models rely on static color cues rather than true phenology.
    • Daylight Phase Blindspots: Midday/Afternoon easier; Night and Sunrise/Sunset accuracy remains <35%, often confusing dawn/dusk.
    • Time Anchoring: Predictions collapse to round-hour anchors (09:00, 12:00, 18:00), indicating failure to compute continuous solar geometry.
    • Spatial Drift & Biases: Near-miss country errors (e.g., Bangladesh vs India); default to Temperate/Urban, poor performance in Polar, Arid, and Continental climates.
  • Qualitative Error Analysis:
    • Low-Light Breakdowns: Night/twilight scenes cause drastic drift; in the absence of solar cues, models fall back on round-hour guesses.
    • Urban Occlusion: Street canyons compress apparent sun elevation → late time predictions.
    • Missing Shorelines: Coastal/elevation cues systematically ignored; coordinates shift inland.
    • Shortcut Exploitation: Models rely on human-centric cues (architecture/signage), failing to use environmental cues (biome, topography) for precise geolocation or temporal estimation.
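The continuous solar geometry that the error analysis shows models failing to compute can be sketched with the standard declination and hour-angle approximation below. This is an illustrative sketch, not part of the benchmark's code: given latitude, day of year, and local solar time, the sun's elevation is fully determined, which is the physical constraint a round-hour guess ignores.

```python
import math

def solar_elevation(lat_deg, day_of_year, local_solar_hour):
    """Approximate solar elevation angle in degrees for a given latitude,
    day of year, and local solar time. Uses the standard declination
    approximation; a sketch, not an ephemeris-grade computation."""
    # Solar declination: -23.44° * cos(360°/365 * (d + 10))
    decl = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
    # Hour angle: 15° per hour from solar noon
    hour_angle = 15.0 * (local_solar_hour - 12.0)
    phi, d, h = (math.radians(x) for x in (lat_deg, decl, hour_angle))
    sin_elev = math.sin(phi) * math.sin(d) + math.cos(phi) * math.cos(d) * math.cos(h)
    return math.degrees(math.asin(sin_elev))

# At the equator around the equinox (day ~79), noon elevation is near 90°;
# at 51.5°N near the winter solstice (day ~355), noon elevation is ~15°.
print(solar_elevation(0.0, 79, 12.0), solar_elevation(51.5, 355, 12.0))
```

Inverting this relationship (elevation observed from shadow lengths, plus a latitude estimate) is what distinguishes continuous time estimation from anchoring on 09:00/12:00/18:00.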

Benchmark Design & Evaluation Methods

  • Dataset Construction & Protocol:
    • 1,455 ground-level images from 80 countries; landmarks and heavy text removed to force reliance on physical cues (illumination, shadows, sky, vegetation, materials).
    • Structured 9-field prediction: 4 temporal (season, month, local time, daylight phase) + 5 geographic (continent, country, climate, environment type, lat/lon).
    • Annotations verified with metadata (timestamps, GPS, solar ephemerides) and cross-checked manually for physical validity.
  • Evaluation Metrics:
    • Geographic: Top-1 categorical accuracy for region/climate; MAE for lat/lon; mean/median great-circle distance (MD) in km.
    • Temporal: Top-1 accuracy, ±1-hour window accuracy, HH:MM MAE for local time.
    • LLM-as-a-Judge: Semantic alignment for synonyms (USA vs United States, Fall vs Autumn) is handled by a Gemini-2.5-Flash judge model.
  • Performance Improvement Approaches:
    • Supervised Fine-Tuning (SFT): Fine-tuning Qwen-VL2.5-3B-Instruct on 40% of TimeSpot improved categorical geo-semantic accuracy (Country: 14.2% → 19.2%), but time predictions remain unstable.
    • Recommendations: Future models should include physical inductive biases (latitude/solar conditioning), constraint-aware reasoning (phase/time/longitude matching), and augmentations for extreme lighting and polar environments.
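The geographic and temporal metrics described above can be sketched as follows. This assumes a haversine great-circle distance and a wrap-around treatment of clock time (so 23:50 vs 00:10 counts as 20 minutes); the benchmark's exact implementation may differ.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometers (Earth radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def time_error_minutes(pred_hhmm, true_hhmm):
    """Circular absolute error between two HH:MM times, in minutes."""
    def to_min(s):
        h, m = map(int, s.split(":"))
        return h * 60 + m
    d = abs(to_min(pred_hhmm) - to_min(true_hhmm))
    return min(d, 1440 - d)  # wrap around midnight

def within_one_hour(pred_hhmm, true_hhmm):
    """±1-hour window criterion used for Time (Ac.)."""
    return time_error_minutes(pred_hhmm, true_hhmm) <= 60

print(round(great_circle_km(23.8, 90.4, 28.6, 77.2)))  # Dhaka vs Delhi, in km
```

Averaging `great_circle_km` over the dataset gives the MD column; averaging `time_error_minutes` gives Time (MAE).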

Citation

Please cite the paper as follows:

@misc{wasi2026timespot,
  title={TimeSpot: Benchmarking Geo-Temporal Understanding in Vision{\textendash}Language Models in Real-World Settings},
  author={Azmine Toushik Wasi and Shahriyar Zaman Ridoy and Koushik Ahamed Tonmoy and Kinga Tshering and S. M. Muhtasimul Hasan and Wahid Faisal and Tasnim Mohiuddin and Md Rizwan Parvez},
  year={2026},
  url={https://openreview.net/forum?id=ZLTbUvfej2}
}
      