CAT-BENCH: Assessing Language Models' Understanding of Temporal Dependencies in Procedural Text

https://arxiv.org/abs/2406.15823

Understanding how well LLMs comprehend natural language plans, such as instructions and recipes, is critical for their reliable use in decision-making systems. An important aspect of plans is the temporal ordering of their steps, which reflects the causal relationships between actions. Planning, integral to decision-making, has been studied extensively in domains such as robotics and embodied environments. Effectively using, revising, or customizing a plan requires the ability to reason about its steps and their causal connections. Although such evaluation is common in domains like blocksworld and simulated environments, real-world natural language plans pose unique challenges because they cannot be physically executed to verify their accuracy and reliability.

Researchers from Stony Brook University, the US Naval Academy, and the University of Texas at Austin have developed CAT-BENCH, a benchmark that assesses the ability of advanced language models to predict the ordering of steps in cooking recipes. Their study shows that current state-of-the-art language models struggle with this task, yielding low F1 scores even with techniques such as few-shot learning and explanation-based prompting. Although these models can generate coherent plans, the research highlights significant gaps in their understanding of causal and temporal relationships within instructional text. Analyses show that prompting models to explain their predictions after generating them improves performance compared to traditional chain-of-thought prompting, while also revealing inconsistencies in model reasoning.

Early research emphasized understanding plans and goals. Generating plans involves temporal reasoning and tracking entity states. NaturalPlan focuses on a few real-world tasks that involve natural language interaction. PlanBench demonstrated the challenges of generating effective plans under a strict syntax. Goal-oriented script construction tasks ask models to lay out the steps for achieving specific goals. ChattyChef uses a chat setting to refine step ordering. CoPlan revises steps to satisfy constraints. Studies on entity-state tracking, action linking, and next-event prediction explore plan understanding. Various datasets address dependencies in instructions and decision-making. However, more datasets are needed that focus on predicting and explaining temporal order constraints in instructional plans.

CAT-BENCH evaluates the ability of models to recognize temporal dependencies between steps in cooking recipes. Based on the causal relationships encoded in a recipe's directed acyclic graph (DAG), it poses questions about whether one step must precede or follow another. For example, deciding whether the dough must be placed on a baking tray before the baked result can be removed to cool relies on understanding the preconditions and effects of each step. CAT-BENCH consists of 2,840 questions over 57 sets, evenly divided between questions probing "before" and "after" temporal relationships. Models are evaluated on precision, recall, and F1 score for predicting these dependencies, as well as on their ability to provide valid explanations for their judgments.
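
To make the setup concrete, here is a minimal sketch (not the authors' code) of how step-dependence questions could be derived from a recipe's DAG and scored with F1. The recipe text, the edge list, and the helper names below are illustrative assumptions.

```python
# Minimal sketch: derive "must step i come before step j?" questions from a
# recipe dependency DAG and score binary predictions with F1.
from itertools import permutations

# Hypothetical recipe: step index -> step text
steps = {
    1: "Mix flour, sugar, and eggs into a dough.",
    2: "Place the dough on a baking tray.",
    3: "Bake for 25 minutes.",
    4: "Remove from the oven and let cool.",
}
# Causal dependencies: (earlier_step, later_step) edges of the DAG
edges = {(1, 2), (2, 3), (3, 4)}

def must_precede(i, j, edges):
    """True if step i must happen before step j (reachability in the DAG)."""
    frontier = {i}
    while frontier:
        frontier = {b for (a, b) in edges if a in frontier}
        if j in frontier:
            return True
    return False

# Build balanced "before"/"after" questions with gold labels.
questions = []
for i, j in permutations(steps, 2):
    gold = must_precede(i, j, edges)
    questions.append((f"Must step {i} happen before step {j}?", gold))
    questions.append((f"Must step {j} happen after step {i}?", gold))

def f1(gold, pred):
    """Binary F1 over 'dependent' answers."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# A model biased toward answering "dependent" gets high recall but low precision.
gold_labels = [g for _, g in questions]
always_yes = [True] * len(questions)
print(f"F1 of an always-'yes' baseline: {f1(gold_labels, always_yes):.2f}")
```
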

Several models were evaluated on CAT-BENCH for their performance on step-dependence prediction. In the zero-shot setting, GPT-4-turbo and GPT-3.5-turbo achieved the highest F1 scores, with GPT-4o performing unexpectedly worse. Asking models to include explanations with their responses generally improved performance, significantly increasing the F1 score of GPT-4o in particular. However, the models were biased toward predicting dependence, which hurt their precision and the balance between precision and recall. Human evaluation of model-generated explanations indicated varying quality, with larger models generally performing better than smaller ones. The models were also inconsistent in their step-order predictions, especially when explanations were included. Further analysis revealed common errors such as misunderstanding multi-hop dependencies and failing to identify causal relationships between steps.
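
The contrast between reasoning before answering (chain of thought) and explaining after answering can be illustrated with a short sketch. The prompt wording and the `call_llm` placeholder below are assumptions for illustration, not the benchmark's actual templates or any specific API.

```python
# Sketch of the two prompting orders discussed above: reason-before-answering
# (chain of thought) vs. answer-then-explain. `call_llm` is a stand-in for
# whatever chat-completion client is used.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    raise NotImplementedError("plug in your LLM client here")

RECIPE = "1. Mix the dough.\n2. Place it on a baking tray.\n3. Bake.\n4. Cool."
QUESTION = "Must step 2 happen before step 3? Answer yes or no."

# Chain of thought: the model reasons first, then commits to an answer.
cot_prompt = (
    f"{RECIPE}\n\n{QUESTION}\n"
    "Think step by step about the dependencies, then give your final answer."
)

# Answer-then-explain: the model answers first, then justifies its answer.
# CAT-BENCH's analysis found this ordering improved F1 over chain of thought.
answer_first_prompt = (
    f"{RECIPE}\n\n{QUESTION}\n"
    "First state 'yes' or 'no', then explain why that dependency does or "
    "does not hold."
)

for name, prompt in [("chain-of-thought", cot_prompt),
                     ("answer-then-explain", answer_first_prompt)]:
    try:
        print(name, "->", call_llm(prompt))
    except NotImplementedError:
        print(f"{name} prompt:\n{prompt}\n")
```
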

CAT-BENCH introduces a new benchmark for evaluating the causal and temporal reasoning capabilities of language models on procedural text such as cooking recipes. Despite advances in state-of-the-art LLMs, none can reliably determine whether one step in a plan must precede or follow another, particularly when it comes to recognizing non-dependencies. The models also show inconsistencies in their predictions. Prompting LLMs to explain their answers after responding significantly improved performance compared with reasoning before responding. However, human evaluation of these explanations indicates considerable room for improvement in the models' understanding of step dependencies. These findings highlight current limitations of LLMs for plan-based reasoning applications.


Check out the Paper. All credit for this research goes to the researchers of this project.


Sana Hasan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to tackle real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
