SciReC: Diagnostic Evaluation of Multimodal Multi-Turn Relational Reasoning with Adaptive Interaction

Nilay Yilmaz¹, Naga Sai Abhiram Kusumba*², Stella Wenxing Liu*¹, Yezhou Yang¹

¹Arizona State University, ²One Capital
*Indicates Equal Contribution


Example conversation flow and evaluation for an analogical relational question in the Astronomy domain. The panels show the question, the model response, and the ground truth with scores, respectively. The next question type is adjusted based on the scores. The last row shows the average score across all question types and the main failure rates. While relational reasoning is the primary cause (35.9%), the other three factors in the bar chart are associated with Figure 2, and the remaining portion (2.4%) corresponds to Figure 1.

Abstract

Relational reasoning requires perceiving, comparing, and integrating the underlying relationships between concepts. This ability spans multiple categories, such as analogical, structural, and cause-effect, each capturing a different aspect of higher-order understanding. To examine the performance of multimodal large language models (MLLMs) on these relational inference tasks, we developed SciReC, a model-adaptive multimodal academic dialog benchmark. Because relational reasoning involves multiple representations and several contributing factors (visual understanding, knowledge, and memory recall), we propose DMRA, a deficit-based diagnostic framework that quantifies the contribution of each factor to identify the primary cause of unsuccessful cases. Claude 4.6 achieved the best overall relational score with 73%, followed by GPT 5.4 with 68%. Performance trends indicate that open-source models achieve their lowest scores on spatial relations, while proprietary models struggle more with hierarchical and sequential relations. Across domains, model performance is lowest on Astronomy and highest on Psychology. The DMRA results reveal that relational reasoning is the primary source of error across all models, followed by memory limitations.

Example Relational Questions

DMRA (Deficit-Based Multimodal Relational Analysis)

DMRA diagnoses why MLLMs fail on relational reasoning tasks by separating upstream deficits (memory, knowledge, and visual understanding) from cross-image relational reasoning errors.

For each figure i, deficits are measured as D_x⁽ⁱ⁾ = max(0, T − X_i), where T is a performance threshold and X_i denotes the memory, knowledge, or visual score. Memory-validation questions dynamically adjust task importance through softmax-based weighting, enabling DMRA to distinguish true memory failures from perception and knowledge limitations.
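The deficit and weighting step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the page does not say what the softmax is taken over, so the sketch assumes it is applied to the deficits divided by a hypothetical temperature `tau`. The threshold `T = 0.7` and all scores are made-up values on a 0 to 1 scale.

```python
import math


def deficit(score: float, threshold: float) -> float:
    """D_x = max(0, T - X): how far a score falls below the threshold."""
    return max(0.0, threshold - score)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical inputs (assumptions, not values from the paper).
T = 0.7      # assumed performance threshold
tau = 0.1    # assumed softmax temperature
scores = {"memory": 0.4, "knowledge": 0.8, "visual": 0.6}

# Per-component deficits for one figure; scores above T give zero deficit.
deficits = {name: deficit(x, T) for name, x in scores.items()}

# Assumed weighting: softmax over temperature-scaled deficits, so the
# component with the largest deficit receives the largest weight.
names = list(deficits)
weights = dict(zip(names, softmax([deficits[n] / tau for n in names])))

for name in names:
    print(f"{name}: deficit={deficits[name]:.2f}, weight={weights[name]:.3f}")
```

In this toy case only memory and visual understanding fall below the threshold, and memory, with the larger deficit, receives the largest weight.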

The weighted upstream contribution is computed as R_x⁽ⁱ⁾ = α_i · w_x⁽ⁱ⁾ · D_x⁽ⁱ⁾, while the remaining unexplained error is attributed to relational reasoning: C_rel = F − min(F, U). This decomposition provides a fine-grained explanation of whether failures originate from perception, knowledge, memory, or the inability to connect concepts across multiple figures.
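A small worked example of the decomposition, again as a sketch under stated assumptions: the page does not define F or U, so the code treats F as an overall failure magnitude and U as the sum of all upstream contributions R_x⁽ⁱ⁾ across figures and components. The α values, weights, and deficits are hypothetical.

```python
def upstream_contribution(alpha: float, weight: float, deficit: float) -> float:
    """R_x = alpha_i * w_x * D_x for one component of one figure."""
    return alpha * weight * deficit


def relational_contribution(failure: float, upstream_total: float) -> float:
    """C_rel = F - min(F, U): the error left after upstream deficits."""
    return failure - min(failure, upstream_total)


# Hypothetical two-figure case (all numbers are assumptions).
figures = [
    {
        "alpha": 0.5,
        "weights": {"memory": 0.5, "knowledge": 0.25, "visual": 0.25},
        "deficits": {"memory": 0.2, "knowledge": 0.0, "visual": 0.4},
    },
    {
        "alpha": 0.5,
        "weights": {"memory": 0.2, "knowledge": 0.4, "visual": 0.4},
        "deficits": {"memory": 0.0, "knowledge": 0.1, "visual": 0.0},
    },
]

# Assumed definition: U sums R_x over every figure and component.
U = sum(
    upstream_contribution(fig["alpha"], fig["weights"][x], fig["deficits"][x])
    for fig in figures
    for x in fig["deficits"]
)

F = 0.6  # assumed failure magnitude for the relational question
c_rel = relational_contribution(F, U)

print(f"U = {U:.2f}, C_rel = {c_rel:.2f}")
```

Here the upstream deficits explain only a small share of the failure, so most of it is attributed to relational reasoning. When U is at least F, the `min` clamps the upstream share at F and C_rel is zero.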

Results

Claude 4.6 leads SciReC with 73.78% accuracy, followed by GPT-5.4 (68%) and Qwen-3.5 (56.25%). While open-source models are closing the performance gap, spatial reasoning remains their major challenge. Proprietary models excel in structural reasoning but continue to struggle with hierarchical and sequential relational understanding.

SciReC evaluates relational reasoning across eight academic domains and reveals substantial domain-specific variation in MLLM performance. Astronomy emerges as the most challenging domain, while Psychology is the easiest for most models. Claude 4.6 achieves the strongest overall cross-domain performance, leading in six of eight domains, whereas GPT-5.4 performs best in Economics and Behavioral Neuroscience. Open-source models, particularly Qwen-3.5, show increasing competitiveness in some domains but continue to lag behind proprietary systems in scientific domains such as Astronomy, Physics, and Chemistry.

The DMRA analysis shows that relational reasoning is the dominant cause of failure across most models, especially top performers such as GPT-5.4, Claude 4.6, and Qwen 3.5, where it accounts for about 66–72% of first-cause errors. These models generally understand visual inputs and possess the required knowledge but struggle to connect concepts and infer relationships. In contrast, weaker models exhibit higher rates of memory, knowledge, and visual-understanding errors, indicating that their failures stem from both relational reasoning and more fundamental limitations.
