LLMs Reveal Fatal Flaw: They Can't Read Clocks! PhD Stunned by Accuracy Below 50%

Synced Report

Edited by: KingHZ

[Synced] AI can write papers, draw pictures, and score high marks, yet it struggles with tasks as simple as reading a clock or working out what day of the week it is. The latest research exposes these surprising cognitive deficits, a reminder that however powerful AI is, precise reasoning still requires human intervention.

Some tasks are effortless for humans, yet AI frequently gets them wrong.

For example, counting the number of 'r's in the word "strawberry" once stumped many top LLMs.

The latest research reveals that reading clocks or calendars is also very difficult for AI.

Figure 1: In test instances, 6 large models could not correctly read analog clocks, and only 2 could understand calendars.

Researchers from the University of Edinburgh and other institutions revealed this thought-provoking AI phenomenon.

They simulated clocks and calendars to systematically examine the ability of multimodal language models (MLLMs) to interpret time and dates.

The results were disappointing:

The accuracy of AI systems in reading clocks was only 38.7%, and the accuracy in determining calendar dates was only 26.3%.

At the ICLR 2025 Workshop on Reasoning and Planning for LLMs, they presented these unexpected deficits of LLMs.

Paper link: https://arxiv.org/abs/2502.05092

To investigate how well MLLMs handle temporal tasks, they built a purpose-made test set with two subsets: ClockQA and CalendarQA.

ClockQA covers six types of simulated clock images (including variants with Roman numerals, missing second hands, and different dial colors) and corresponding time questions;

CalendarQA includes calendar images for ten years, with questions ranging from simple to complex:

What day of the week is New Year's Day?

What day of the week is March 15th?

What date is the 153rd day of the year?
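For reference, the ground-truth answers to questions of this kind can be computed deterministically. The sketch below uses Python's standard datetime module; the year 2025 and the helper names are illustrative assumptions, not details taken from the benchmark.

```python
from datetime import date, timedelta

# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_name(year: int, month: int, day: int) -> str:
    """Return the English weekday name for a calendar date."""
    return DAY_NAMES[date(year, month, day).weekday()]


def nth_day_of_year(year: int, n: int) -> date:
    """Return the date of the n-th day of the year (1-indexed)."""
    return date(year, 1, 1) + timedelta(days=n - 1)


YEAR = 2025  # hypothetical example year, chosen only for illustration

print(weekday_name(YEAR, 1, 1))                 # Wednesday
print(weekday_name(YEAR, 3, 15))                # Saturday
print(nth_day_of_year(YEAR, 153).isoformat())   # 2025-06-02
```

A calendar library answers all three questions exactly, which is what makes the models' errors on them easy to score.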

Figure 2: Overview of the DateTimeReasoning task and its two main subsets: ClockQA and CalendarQA

Although the dataset size is relatively small, its design effectively probes the core dimensions of temporal reasoning, visual parsing, and date/time inference.

Initial findings indicate that although some models show potential in clock reading or calendar questions, fundamental problems still exist.

Among them, Gemini-2.0 had the lowest hour and minute errors in clock reading, while GPT-o1 had the highest accuracy on calendar questions.

Detailed Results

Table 1 summarizes the performance of each model on the two tasks.

In the ClockQA task, Gemini-2.0 achieved the highest Exact Match (EM) score (22.58%) and the smallest hour/minute error, showing an advantage in understanding clocks compared to other models.

However, the overall EM score is still low, indicating that multimodal large language models (MLLMs) still have significant difficulties in the clock reading task.
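The article does not spell out how these metrics are defined. As a rough sketch, assuming EM means the predicted hour and minute both match the ground truth, and assuming the hour and minute errors are mean absolute differences measured around the dial, the scores could be computed as follows. The function and key names are hypothetical.

```python
from typing import Dict, List, Tuple

Reading = Tuple[int, int]  # (hour, minute) on a 12-hour dial


def clock_metrics(predictions: List[Reading],
                  targets: List[Reading]) -> Dict[str, float]:
    """Compute EM (%) and mean circular hour/minute errors.

    Assumed definitions, not taken from the paper:
    - EM counts a prediction only if hour and minute both match.
    - Errors wrap around the dial, so 12 vs 1 is a 1-hour error.
    """
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and aligned")

    exact = 0
    hour_err = 0
    minute_err = 0
    for (ph, pm), (th, tm) in zip(predictions, targets):
        # Treat 12 and 0 as the same hour position.
        if (ph % 12, pm) == (th % 12, tm):
            exact += 1
        dh = abs(ph - th) % 12
        hour_err += min(dh, 12 - dh)
        dm = abs(pm - tm) % 60
        minute_err += min(dm, 60 - dm)

    n = len(targets)
    return {
        "em": 100.0 * exact / n,
        "hour_error": hour_err / n,
        "minute_error": minute_err / n,
    }


# Made-up readings for illustration only.
preds = [(3, 15), (12, 0), (11, 50)]
truth = [(3, 15), (1, 0), (11, 55)]
print(clock_metrics(preds, truth))
```

Reporting the errors alongside EM separates near misses from readings that are far off, which a single exact-match score cannot do.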

In contrast, GPT-o1 performed outstandingly in the CalendarQA task, with an accuracy rate of 80%, demonstrating its strong ability in date calculation and logical reasoning. Other models lagged significantly behind, indicating that date calculation and structured layout parsing remain challenges for AI.

Overall, apart from GPT-o1's strong result on CalendarQA, the models' performance on both ClockQA and CalendarQA was unsatisfactory.

Table 1: Performance of each model on the clock task (left) and calendar task (right). ↑ indicates higher values are better; ↓ indicates lower values are better.

Clock reading tasks are still prone to errors.

In the ClockQA subset, the models performed significantly worse than on calendar-related questions (see Table 1).

Figures 3a and 4a show that model performance is poor even on standard clock faces, and that some models tend to fall back on a "default" time.

Using Roman numerals or stylized hands further increased the error rate.

Removing the second hand did not simplify the model's reasoning process, indicating a fundamental problem in the models' ability to identify hands and understand angles.
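To make the geometry concrete, here is a minimal sketch of the conversion a reader has to perform once the hands have been located. It assumes angles measured in degrees clockwise from the 12 o'clock position; the function names are hypothetical, and this is an illustration rather than the paper's method.

```python
from typing import Tuple


def angles_from_time(hour: int, minute: int) -> Tuple[float, float]:
    """Return (hour_hand_angle, minute_hand_angle) in degrees from 12 o'clock."""
    # The minute hand moves 6 degrees per minute (360 / 60).
    # The hour hand moves 30 degrees per hour plus 0.5 degrees per minute.
    return ((hour % 12) * 30 + minute * 0.5, minute * 6.0)


def time_from_angles(hour_angle: float, minute_angle: float) -> Tuple[int, int]:
    """Recover (hour, minute) from the two hand angles."""
    minute = round(minute_angle / 6) % 60
    # Remove the hour hand's drift caused by the minutes before rounding,
    # otherwise 3:45 would be read as 4:45.
    hour = round((hour_angle - minute * 0.5) / 30) % 12
    return (12 if hour == 0 else hour, minute)


print(time_from_angles(97.5, 90.0))    # (3, 15)
print(time_from_angles(359.5, 354.0))  # (11, 59)
```

The drift correction is the step most likely to go wrong: the hour hand rarely points exactly at a numeral, so reading it in isolation gives the wrong hour near the end of each hour.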

Calendar reasoning fared slightly better.

In contrast, some models performed better on calendar tasks and certain question types.

GPT-o1 performed particularly well in the CalendarQA subset, with an overall accuracy rate of up to 80% (see Table 1 and Figure 3b).

Figure 3: Error analysis of ClockQA and CalendarQA

The points in Figure 3(a) represent the relationship between the time predicted by the model (vertical axis) and the actual time (horizontal axis). The black dashed line (y=x) represents the ideal situation where the model prediction is completely correct.

Figure 3(b) shows each model's accuracy by year. Blank bars indicate that the model's accuracy for the corresponding year is 0%.

Closed-source models like GPT-o1 and Claude-3.5 performed better than open-source models in handling questions about common holidays.

This may be because the models memorized patterns for these holidays from their training data (see Figure 4b).

However, on questions about lesser-known dates or those requiring more involved calculation (e.g., "the 153rd day"), accuracy dropped significantly, suggesting that offset-based reasoning transfers poorly.

The drop was especially pronounced for smaller or open-source models (such as MiniCPM, Qwen2-VL-7B, and Llama3.2-Vision), whose accuracy on these questions was close to random.

Figure 4: ClockQA and CalendarQA analysis based on question type and category

The study also revealed another problem: performance declines significantly when the relevant data was scarce during training, as with rare phenomena like leap years or complex calendar calculations.

Although large language models (LLMs) have seen many explanations of the concept of "leap year" during training, this does not mean they can carry out the reasoning that related tasks involving visual judgment require.
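Written out explicitly, the leap-year rule is short. The sketch below applies the Gregorian rule and steps through the month lengths to locate the n-th day of a year; the function names are illustrative and the example years are assumptions.

```python
from typing import Tuple


def is_leap(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def day_of_year_to_date(year: int, n: int) -> Tuple[int, int]:
    """Return (month, day) for the n-th day of the year (1-indexed)."""
    month_lengths = [31, 29 if is_leap(year) else 28, 31, 30, 31, 30,
                     31, 31, 30, 31, 30, 31]
    if not 1 <= n <= sum(month_lengths):
        raise ValueError("day number out of range for this year")
    for month, length in enumerate(month_lengths, start=1):
        if n <= length:
            return (month, n)
        n -= length  # skip past this month and keep counting
    raise AssertionError("unreachable: n was validated above")


print(day_of_year_to_date(2025, 153))  # (6, 2): June 2 in a common year
print(day_of_year_to_date(2024, 153))  # (6, 1): June 1 in a leap year
```

A single extra day in February shifts every later answer by one, which is why leap years make a useful stress test for offset-based questions.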

This research highlights two areas needing improvement:

The first is the need to include more targeted examples in the training data;

The second is to rethink how AI handles tasks that combine logical reasoning and spatial perception, especially those it is not usually exposed to.

Blind faith in AI is worse than no AI.

The accuracy of AI systems in correctly reading clocks was only 38.7%, and the accuracy in determining calendar dates was only 26.3%.

Earlier systems were trained on labeled examples, but reading a clock demands a different ability: spatial reasoning.

That may explain AI's poor performance here, said Rohit Saxena, a researcher at the University of Edinburgh and one of the paper's authors:

Models must recognize overlapping hands, measure angles, and adapt to various dial designs, such as Roman numerals or artistic markings.

It's relatively easy for AI to recognize "this is a clock," but it's much harder to actually read the time.

Date judgment is also a headache.

AI's error rate is also high on date-reasoning questions such as "What day of the week is the 153rd day of this year?"

This deficit is also surprising, as arithmetic should be one of the basic capabilities of a computer.

But as Saxena explained, AI processes arithmetic differently from traditional computers:

Arithmetic is simple for traditional computers, but it's not the case for large language models. AI doesn't run mathematical algorithms; instead, it predicts answers based on patterns learned from training data.

So it can sometimes answer arithmetic questions correctly, but the reasoning process is neither consistent nor rule-based, and our research precisely reveals this gap.

The study is part of a growing body of recent research on the difference between how AI "understands" and how humans do.

AI models arrive at answers by identifying familiar patterns; they perform excellently when there are enough examples in the training data, but they fail when generalization or abstract reasoning is required.

Most importantly, the research is a reminder that over-relying on AI's output is risky.

Saxena stated: "AI is indeed powerful, but when tasks involve both perception and precise reasoning, we still need rigorous testing, backup logic, and in many cases, human intervention."

Another author, Aryo Pradipta Gema, a PhD student at the University of Edinburgh, said:

Today's AI research often emphasizes complex reasoning tasks, but ironically, many systems still struggle with simpler everyday tasks.

Our findings indicate that it is time to address these fundamental capability deficits. Otherwise, AI may continue to struggle to be deployed in time-sensitive real-world applications.

References:

https://www.livescience.com/technology/artificial-intelligence/ai-models-cant-tell-time-or-read-a-calendar-study-reveals

https://arxiv.org/abs/2502.05092

https://www.ed.ac.uk/news/most-ai-struggles-to-read-clocks-and-calendars
