Eye On AI

Week Ending 2.4.2024

RESEARCH WATCH: 2.4.2024

SPONSORED BY

Digimarc digital watermarks invisibly guard your digital assets to protect against misuse, prove copyright ownership, and verify authenticity. In an era of artificial intelligence, don’t leave your images and other digital content exposed. Demand superior content protection and maintain trust in your brand with Digimarc.

Check out Digimarc - https://www.digimarc.com/

Generative AI for Education (GAIED): Advances, Opportunities, and Challenges

The GAIED paper surveys recent advances in using generative AI to improve education. The workshop brought together experts to explore challenges and opportunities in this emerging field, which could transform how we teach and learn.

Authors:  Paul Denny, Sumit Gulwani, Neil T. Heffernan, Tanja Käser, Steven Moore, Anna N. Rafferty, Adish Singla

Link:  https://arxiv.org/abs/2402.01580v1

Date: 2024-02-02

Summary:

This survey article has grown out of the GAIED (pronounced "guide") workshop organized by the authors at the NeurIPS 2023 conference. We organized the GAIED workshop as part of a community-building effort to bring together researchers, educators, and practitioners to explore the potential of generative AI for enhancing education. This article aims to provide an overview of the workshop activities and highlight several future research directions in the area of GAIED.

--------------------------------------------------------------------------------------------------------

Homogenization Effects of Large Language Models on Human Creative Ideation

The paper on homogenization effects of LLMs analyzes how large language models affect creative ideation when used as creativity support tools. Findings suggest LLMs make different users' ideas more similar to one another, even as each user produces more detailed ideas, with implications for the design of creative AI tools.

Authors:  Barrett R. Anderson, Jash Hemant Shah, Max Kreminski

Link:  https://arxiv.org/abs/2402.01536v1

Date: 2024-02-02

Summary:

Large language models (LLMs) are now being used in a wide variety of contexts, including as creativity support tools (CSTs) intended to help their users come up with new ideas. But do LLMs actually support user creativity? We hypothesized that the use of an LLM as a CST might make the LLM's users feel more creative, and even broaden the range of ideas suggested by each individual user, but also homogenize the ideas suggested by different users. We conducted a 36-participant comparative user study and found, in accordance with the homogenization hypothesis, that different users tended to produce less semantically distinct ideas with ChatGPT than with an alternative CST. Additionally, ChatGPT users generated a greater number of more detailed ideas, but felt less responsible for the ideas they generated. We discuss potential implications of these findings for users, designers, and developers of LLM-based CSTs.
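
For readers curious how homogenization can be measured: below is a minimal sketch of one way to quantify the semantic distinctness of a set of ideas, using sentence embeddings. The study's exact metric, and the embedding model named here, are assumptions for illustration.

```python
# Illustrative sketch: quantify how semantically distinct a set of ideas is
# by averaging pairwise cosine distances between sentence embeddings.
# The study's exact metric may differ; the model choice is a placeholder.
import numpy as np
from sentence_transformers import SentenceTransformer

def mean_pairwise_distance(ideas: list[str]) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative model
    emb = model.encode(ideas, normalize_embeddings=True)  # unit vectors
    sims = emb @ emb.T                                 # cosine similarity matrix
    iu = np.triu_indices(len(ideas), k=1)              # unique pairs only
    return float(np.mean(1.0 - sims[iu]))              # mean cosine distance

# Lower values indicate a more homogeneous (less distinct) idea set.
print(mean_pairwise_distance(["a solar-powered kettle", "a kettle that sings"]))
```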

--------------------------------------------------------------------------------------------------------

K-Level Reasoning with Large Language Models

The k-level reasoning paper formally evaluates how well LLMs make strategic decisions in dynamic, competitive settings such as business and finance. The proposed approach significantly improves LLMs' accuracy in predicting rivals' moves, enabling more effective planning and decision-making.

Authors:  Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Yan Xia, Man Lan, Furu Wei

Link:  https://arxiv.org/abs/2402.01521v1

Date: 2024-02-02

Summary:

While Large Language Models (LLMs) have demonstrated their proficiency in complex reasoning tasks, their performance in dynamic, interactive, and competitive scenarios - such as business strategy and stock market analysis - remains underexplored. To bridge this gap, we formally explore the dynamic reasoning capabilities of LLMs for decision-making in rapidly evolving environments. We introduce two game theory-based pilot challenges that mirror the complexities of real-world dynamic decision-making. These challenges are well-defined, enabling clear, controllable, and precise evaluation of LLMs' dynamic reasoning abilities. Through extensive experiments, we find that existing reasoning methods tend to falter in dynamic settings that require k-level thinking - a key concept not tackled by previous works. To address this, we propose a novel reasoning approach for LLMs, named "K-Level Reasoning". This approach adopts the perspective of rivals to recursively employ k-level thinking based on available historical information, which significantly improves the prediction accuracy of rivals' subsequent moves and informs more strategic decision-making. This research not only sets a robust quantitative benchmark for the assessment of dynamic reasoning but also markedly enhances the proficiency of LLMs in dynamic contexts.
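
To make the k-level idea concrete, here is a hedged sketch of the recursion: a level-k player first predicts its rival's move by reasoning at level k-1, then best-responds to that prediction. The ask_llm helper is hypothetical; the paper's prompts and implementation will differ.

```python
# Hedged sketch of recursive k-level reasoning (not the paper's code).
# ask_llm is a hypothetical helper standing in for any LLM client.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def k_level_move(history: str, role: str, rival: str, k: int) -> str:
    if k == 0:
        # Level 0: act directly on the observed history, no opponent modeling.
        return ask_llm(f"You are {role}. History: {history}. Choose your move.")
    # Level k: first predict the rival's move by reasoning at level k-1 ...
    predicted = k_level_move(history, rival, role, k - 1)
    # ... then best-respond to that prediction.
    return ask_llm(
        f"You are {role}. History: {history}. "
        f"Your rival is expected to play {predicted}. Choose your best response."
    )
```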

--------------------------------------------------------------------------------------------------------

Supervised Algorithmic Fairness in Distribution Shifts: A Survey

The survey on supervised algorithmic fairness focuses on maintaining equitable AI systems even when data distributions change over time. It summarizes approaches, datasets, and metrics in this critical area for responsible, trustworthy AI.

Authors:  Yujie Lin, Dong Li, Chen Zhao, Xintao Wu, Qin Tian, Minglai Shao

Link:  https://arxiv.org/abs/2402.01327v1

Date: 2024-02-02

Summary:

Supervised fairness-aware machine learning under distribution shifts is an emerging field that addresses the challenge of maintaining equitable and unbiased predictions when faced with changes in data distributions from source to target domains. In real-world applications, machine learning models are often trained on a specific dataset but deployed in environments where the data distribution may shift over time due to various factors. This shift can lead to unfair predictions, disproportionately affecting certain groups characterized by sensitive attributes, such as race and gender. In this survey, we provide a summary of various types of distribution shifts and comprehensively investigate existing methods based on these shifts, highlighting six commonly used approaches in the literature. Additionally, this survey lists publicly available datasets and evaluation metrics for empirical studies. We further explore the interconnection with related research fields, discuss the significant challenges, and identify potential directions for future studies.

--------------------------------------------------------------------------------------------------------

Beyond the Request: Harnessing HTTP Response Headers for Cross-Browser Web Tracker Classification in an Imbalanced Setting

The web tracker classification paper uses HTTP response headers and imbalance-aware learning to improve web tracker detection across browsers. High accuracy is achieved on Chrome and Firefox, supporting stronger user privacy and security online, though further analysis of tracker types is still needed.

Authors:  Wolf Rieder, Philip Raschke, Thomas Cory

Link:  https://arxiv.org/abs/2402.01240v1

Date: 2024-02-02

Summary:

The World Wide Web's connectivity is greatly attributed to the HTTP protocol, with HTTP messages offering informative header fields that appeal to disciplines like web security and privacy, especially concerning web tracking. Despite existing research employing HTTP/S request messages to identify web trackers, HTTP/S response headers are often overlooked. This study endeavors to design effective machine learning classifiers for web tracker detection using HTTP/S response headers. Data from the Chrome, Firefox, and Brave browsers, obtained through the traffic monitoring browser extension T.EX, serves as our data set. Eleven supervised models were trained on Chrome data and tested across all browsers. The results demonstrated high accuracy, F1-score, precision, recall, and minimal log-loss error for Chrome and Firefox, but subpar performance on Brave, potentially due to its distinct data distribution and feature set. The research suggests that these classifiers are viable for detecting web trackers in Chrome and Firefox. However, real-world application testing remains pending, and the distinction between tracker types and broader label sources could be explored in future studies.
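
As a rough illustration of the setup (not the authors' pipeline), a supervised classifier over response-header features with class weighting for the imbalanced labels might look like the sketch below; the header fields, labels, and model choice are placeholders.

```python
# Minimal sketch: classify trackers from HTTP response headers using one of
# many possible supervised models, with class weighting for label imbalance.
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Toy examples; real data would come from captured traffic (e.g., via T.EX).
X = [
    {"content-type": "image/gif", "p3p": "1", "set-cookie": "1"},  # tracker-like
    {"content-type": "text/html", "cache-control": "no-cache"},    # benign-like
]
y = [1, 0]  # 1 = tracker, 0 = non-tracker

clf = make_pipeline(
    DictVectorizer(sparse=True),                      # headers -> sparse features
    RandomForestClassifier(class_weight="balanced"),  # counteract imbalance
)
clf.fit(X, y)
print(clf.predict([{"content-type": "image/gif", "set-cookie": "1"}]))
```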

--------------------------------------------------------------------------------------------------------

AI Code Generators for Security: Friend or Foe?

The paper exploring AI code generators analyzes emerging opportunities and risks of using them for software security research. A benchmark is introduced to quantify capabilities and guide responsible use of generative coding models.

Authors:  Roberto Natella, Pietro Liguori, Cristina Improta, Bojan Cukic, Domenico Cotroneo

Link:  https://arxiv.org/abs/2402.01219v1

Date: 2024-02-02

Summary:

Recent advances of artificial intelligence (AI) code generators are opening new opportunities in software security research, including misuse by malicious actors. We review use cases for AI code generators for security and introduce an evaluation benchmark.

--------------------------------------------------------------------------------------------------------

Efficient Causal Graph Discovery Using Large Language Models

The causal graph discovery paper leverages LLMs to efficiently reconstruct causal relationships from data, using fewer queries than existing methods. Performance improvements demonstrate this approach's effectiveness for elucidating causality across different domains.

Authors:  Thomas Jiralerspong, Xiaoyin Chen, Yash More, Vedant Shah, Yoshua Bengio

Link:  https://arxiv.org/abs/2402.01207v1

Date: 2024-02-02

Summary:

We propose a novel framework that leverages LLMs for full causal graph discovery. While previous LLM-based methods have used a pairwise query approach, this requires a quadratic number of queries which quickly becomes impractical for larger causal graphs. In contrast, the proposed framework uses a breadth-first search (BFS) approach which allows it to use only a linear number of queries. We also show that the proposed method can easily incorporate observational data when available, to improve performance. In addition to being more time and data-efficient, the proposed framework achieves state-of-the-art results on real-world causal graphs of varying sizes. The results demonstrate the effectiveness and efficiency of the proposed method in discovering causal relationships, showcasing its potential for broad applicability in causal graph discovery tasks across different domains.
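
A hedged sketch of the BFS idea: rather than querying the LLM about every variable pair (quadratic in the number of variables), query it once per frontier variable for that variable's direct effects (linear). The query_children helper and the starting set of root variables are assumptions for illustration.

```python
# Hedged sketch of BFS-based causal discovery with an LLM in the loop.
# query_children is hypothetical; the paper's prompting will differ.
from collections import deque

def query_children(var: str, variables: list[str]) -> list[str]:
    raise NotImplementedError("ask an LLM which variables `var` directly causes")

def bfs_causal_graph(roots: list[str], variables: list[str]) -> dict:
    graph, queue, seen = {}, deque(roots), set(roots)
    while queue:
        v = queue.popleft()
        children = query_children(v, variables)  # one LLM query per variable
        graph[v] = children
        for c in children:
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return graph  # adjacency list of the discovered causal graph
```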

--------------------------------------------------------------------------------------------------------

A Survey on Self-Supervised Learning for Non-Sequential Tabular Data

The survey on self-supervised learning for tabular data reviews recent advances in representation learning for this ubiquitous but challenging data type. It highlights strengths of different techniques and discusses integrating domain knowledge and transfer learning to lower barriers for tabular SSL adoption.

Authors:  Wei-Yao Wang, Wei-Wei Du, Derek Xu, Wei Wang, Wen-Chih Peng

Link:  https://arxiv.org/abs/2402.01204v1

Date: 2024-02-02

Summary:

Self-supervised learning (SSL) has been incorporated into many state-of-the-art models in various domains, where SSL defines pretext tasks based on unlabeled datasets to learn contextualized and robust representations. Recently, SSL has become a new trend in exploring representation learning capability in the realm of tabular data, which is more challenging because there are no explicit relations to exploit for learning descriptive representations. This survey aims to systematically review and summarize the recent progress and challenges of SSL for non-sequential tabular data (SSL4NS-TD). We first present a formal definition of NS-TD and clarify its correlation to related studies. Then, these approaches are categorized into three groups -- predictive learning, contrastive learning, and hybrid learning -- along with the motivations and strengths of representative methods within each direction. On top of this, application issues of SSL4NS-TD are presented, including automatic data engineering, cross-table transferability, and domain knowledge integration. In addition, we elaborate on existing benchmarks and datasets for NS-TD applications to discuss the performance of existing tabular models. Finally, we discuss the challenges of SSL4NS-TD and provide potential directions for future research. We expect our work to be useful in encouraging more research on lowering the barrier to entry of SSL for the tabular domain and improving the foundations for implicit tabular data.

--------------------------------------------------------------------------------------------------------

2AFC Prompting of Large Multimodal Models for Image Quality Assessment

The image quality assessment paper employs comparison-based prompting of multimodal models to quantify their proficiency at ranking image quality. Analysis suggests room for improvement in fine-grained discrimination, informing future development of perception models for subjective tasks like aesthetics evaluation.

Authors:  Hanwei Zhu, Xiangjie Sui, Baoliang Chen, Xuelin Liu, Peilin Chen, Yuming Fang, Shiqi Wang

Link:  https://arxiv.org/abs/2402.01162v1

Date: 2024-02-02

Summary:

While abundant research has been conducted on improving the high-level visual understanding and reasoning capabilities of large multimodal models (LMMs), their image quality assessment (IQA) ability has been relatively under-explored. Here we take initial steps towards this goal by employing two-alternative forced choice (2AFC) prompting, as 2AFC is widely regarded as the most reliable way of collecting human opinions of visual quality. Subsequently, the global quality score of each image estimated by a particular LMM can be efficiently aggregated using maximum a posteriori estimation. Meanwhile, we introduce three evaluation criteria -- consistency, accuracy, and correlation -- to provide comprehensive quantifications and deeper insights into the IQA capability of five LMMs. Extensive experiments show that existing LMMs exhibit remarkable IQA ability on coarse-grained quality comparison, but there is room for improvement on fine-grained quality discrimination. The proposed dataset sheds light on the future development of IQA models based on LMMs. The code will be made publicly available at https://github.com/h4nwei/2AFC-LMMs.
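
To illustrate how pairwise 2AFC outcomes can be turned into global quality scores, below is a Bradley-Terry model fit by simple iteration, used here as a stand-in for the paper's maximum a posteriori aggregation; the details are assumptions.

```python
# Illustrative aggregation of 2AFC outcomes into global quality scores via a
# Bradley-Terry model (a stand-in for the paper's MAP aggregation).
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times image i was preferred over image j."""
    n = wins.shape[0]
    s = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()  # total wins of image i
            den = sum((wins[i, j] + wins[j, i]) / (s[i] + s[j])
                      for j in range(n) if j != i)
            s[i] = num / den if den > 0 else s[i]
        s /= s.sum()  # normalize for identifiability
    return s  # higher score = higher estimated perceptual quality

wins = np.array([[0, 8, 9], [2, 0, 6], [1, 4, 0]])
print(bradley_terry(wins))
```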

--------------------------------------------------------------------------------------------------------

Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints

The reinforcement learning paper introduces a near-optimal algorithm for multi-agent RL under adaptivity constraints motivated by real-world policy-update costs. Near-matching upper and lower bounds are derived, significantly advancing understanding of this new problem setting, which is critical for practical MARL deployment.

Authors:  Dan Qiao, Yu-Xiang Wang

Link:  https://arxiv.org/abs/2402.01111v1

Date: 2024-02-02

Summary:

We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -- a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A,B$ are the number of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound $\Omega(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity.

--------------------------------------------------------------------------------------------------------

Simulation of Graph Algorithms with Looped Transformers

The paper on looped transformers theoretically analyzes how neural networks can simulate reasoning over graph data. By constructing architectures that provably replicate algorithms such as Dijkstra's shortest path, depth-first search, and strongly connected components, the authors formally establish transformers' capacity for relational reasoning, subject to limits imposed by finite precision. This advances understanding of neural-symbolic computing.

Authors:  Artur Back de Luca, Kimon Fountoulakis

Link:  https://arxiv.org/abs/2402.01107v1

Date: 2024-02-02

Summary:

The execution of graph algorithms using neural networks has recently attracted significant interest due to promising empirical progress. This motivates further understanding of how neural networks can replicate reasoning steps with relational data. In this work, we study the ability of transformer networks to simulate algorithms on graphs from a theoretical perspective. The architecture that we utilize is a looped transformer with extra attention heads that interact with the graph. We prove by construction that this architecture can simulate algorithms such as Dijkstra's shortest path algorithm, Breadth- and Depth-First Search, and Kosaraju's strongly connected components algorithm. The width of the network does not increase with the size of the input graph, which implies that the network can simulate the above algorithms for any graph. Despite this property, we show that there is a limit to simulation in our solution due to finite precision. Finally, we show a Turing Completeness result with constant width when the extra attention heads are utilized.

--------------------------------------------------------------------------------------------------------

Compositional Generative Modeling: A Single Model is Not All You Need

The paper argues against reliance on massive monolithic generative models, instead advocating the composition of smaller models. Benefits include better generalization, the ability to program new tasks, and the discovery of modular components in data. This provides a promising direction for balancing the strengths of neural and symbolic AI.

Authors:  Yilun Du, Leslie Kaelbling

Link:  https://arxiv.org/abs/2402.01103v1

Date: 2024-02-02

Summary:

Large monolithic generative models trained on massive amounts of data have become an increasingly dominant approach in AI research. In this paper, we argue that we should instead construct large generative systems by composing smaller generative models together. We show how such a compositional generative approach enables us to learn distributions in a more data-efficient manner, enabling generalization to parts of the data distribution unseen at training time. We further show how this enables us to program and construct new generative models for tasks completely unseen at training. Finally, we show that in many cases, we can discover separate compositional components from data.
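
A toy instance of composition: multiplying two simple generative models. For 1-D Gaussians the product is available in closed form, which makes the "composing constraints = multiplying distributions" intuition easy to see; the paper covers far more general compositions than this sketch.

```python
# Toy sketch of compositional generation: sampling from the product of two
# simple generative models over the same variable (illustrative only).
import numpy as np

mu1, var1 = 0.0, 1.0   # e.g., a model of plausible object positions
mu2, var2 = 2.0, 0.5   # e.g., a model of positions near a target

# The product of two Gaussian densities is again Gaussian: precisions add,
# and the mean is the precision-weighted combination of the two means.
prec = 1 / var1 + 1 / var2
mu = (mu1 / var1 + mu2 / var2) / prec
samples = np.random.normal(mu, np.sqrt(1 / prec), size=5)
print(mu, 1 / prec, samples)
```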

--------------------------------------------------------------------------------------------------------

Let's Negotiate! A Survey of Negotiation Dialogue Systems

The survey on negotiation dialogue systems reviews the growing area of conversational agents that can resolve conflicts and reach agreements. Covering methods, benchmarks, and assessments, it identifies open challenges for multi-party, multimodal negotiation across cultures. The overview informs future work towards more capable and trustworthy conversational AI.

Authors:  Haolan Zhan, Yufei Wang, Tao Feng, Yuncheng Hua, Suraj Sharma, Zhuang Li, Lizhen Qu, Zhaleh Semnani Azad, Ingrid Zukerman, Gholamreza Haffari

Link:  https://arxiv.org/abs/2402.01097v1

Date: 2024-02-02

Summary:

Negotiation is a crucial ability in human communication. Recently, there has been a resurgent research interest in negotiation dialogue systems, whose goal is to create intelligent agents that can assist people in resolving conflicts or reaching agreements. Although there have been many explorations into negotiation dialogue systems, a systematic review of this task has not been performed to date. We aim to fill this gap by investigating recent studies in the field of negotiation dialogue systems, and covering benchmarks, evaluations and methodologies within the literature. We also discuss potential future directions, including multi-modal, multi-party and cross-cultural negotiation scenarios. Our goal is to provide the community with a systematic overview of negotiation dialogue systems and to inspire future research.

--------------------------------------------------------------------------------------------------------

Trustworthy Distributed AI Systems: Robustness, Privacy, and Governance

The paper examines vulnerabilities of distributed AI systems, which enable immense data processing yet pose risks around security, privacy, and fairness. A taxonomy of defenses is presented spanning adversarial robustness, privacy protection, and governance. Gaps in understanding are highlighted, motivating further research toward trustworthy, responsible distributed AI.

Authors:  Wenqi Wei, Ling Liu

Link:  https://arxiv.org/abs/2402.01096v1

Date: 2024-02-02

Summary:

Emerging Distributed AI systems are revolutionizing big data computing and data processing capabilities with growing economic and societal impact. However, recent studies have identified new attack surfaces and risks caused by security, privacy, and fairness issues in AI systems. In this paper, we review representative techniques, algorithms, and theoretical foundations for trustworthy distributed AI through robustness guarantee, privacy protection, and fairness awareness in distributed learning. We first provide a brief overview of alternative architectures for distributed learning, discuss inherent vulnerabilities for security, privacy, and fairness of AI algorithms in distributed learning, and analyze why these problems are present in distributed learning regardless of specific architectures. Then we provide a unique taxonomy of countermeasures for trustworthy distributed AI, covering (1) robustness to evasion attacks and irregular queries at inference, and robustness to poisoning attacks, Byzantine attacks, and irregular data distribution during training; (2) privacy protection during distributed learning and model inference at deployment; and (3) AI fairness and governance with respect to both data and models. We conclude with a discussion on open challenges and future research directions toward trustworthy distributed AI, such as the need for trustworthy AI policy guidelines, the AI responsibility-utility co-design, and incentives and compliance.

--------------------------------------------------------------------------------------------------------

How many views does your deep neural network use for prediction?

The paper introduces Minimal Sufficient Views (MSVs) to estimate the key features a deep neural network uses for a particular prediction. By relating generalization ability to the number of these explanatory views, results across CNNs and transformers suggest the concept also helps explain prediction success for non-ensemble models. This advances theoretical understanding of deep learning generalization.

Authors:  Keisuke Kawano, Takuro Kutsuna, Keisuke Sano

Link:  https://arxiv.org/abs/2402.01095v1

Date: 2024-02-02

Summary:

The generalization ability of Deep Neural Networks (DNNs) is still not fully understood, despite numerous theoretical and empirical analyses. Recently, Allen-Zhu & Li (2023) introduced the concept of multi-views to explain the generalization ability of DNNs, but their main target is ensemble or distilled models, and no method is discussed for estimating the multi-views used in a prediction for a specific input. In this paper, we propose Minimal Sufficient Views (MSVs), which are similar to multi-views but can be efficiently computed for real images. MSVs are a set of minimal and distinct features in an input, each of which preserves the model's prediction for that input. We empirically show that there is a clear relationship between the number of MSVs and prediction accuracy across models, including convolutional and transformer models, suggesting that a multi-view-like perspective is also important for understanding the generalization ability of (non-ensemble, non-distilled) DNNs.
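
A hedged, greedy simplification of the MSV idea: shrink a set of input regions while the model's predicted class is preserved. The paper's algorithm is more sophisticated; the region representation and predict function below are placeholders.

```python
# Greedy sketch of extracting a minimal view that preserves a prediction.
# Not the paper's algorithm; `predict` and the masking scheme are assumed.
import numpy as np

def mask_except(image: np.ndarray, regions: list) -> np.ndarray:
    """Keep only the given (y0, y1, x0, x1) regions; zero out everything else."""
    out = np.zeros_like(image)
    for (y0, y1, x0, x1) in regions:
        out[y0:y1, x0:x1] = image[y0:y1, x0:x1]
    return out

def minimal_view(image: np.ndarray, regions: list, predict) -> list:
    target = predict(image)          # class predicted on the full input
    view = list(regions)
    for r in list(view):
        candidate = [q for q in view if q != r]
        if candidate and predict(mask_except(image, candidate)) == target:
            view = candidate         # region r was not needed
    return view                      # a minimal view preserving the prediction
```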

--------------------------------------------------------------------------------------------------------

Recent Advances in Predictive Modeling with Electronic Health Records

The survey examines recent progress in predictive healthcare modeling using deep learning on electronic health records. Outlining tasks, models, benchmarks and open problems in this crucial application domain with complex heterogeneous data, it points to future directions like hybrid techniques, interpretable predictions, and enhanced knowledge transfer.

Authors:  Jiaqi Wang, Junyu Luo, Muchao Ye, Xiaochen Wang, Yuan Zhong, Aofei Chang, Guanjie Huang, Ziyi Yin, Cao Xiao, Jimeng Sun, Fenglong Ma

Link:  https://arxiv.org/abs/2402.01077v1

Date: 2024-02-02

Summary:

The development of electronic health records (EHR) systems has enabled the collection of a vast amount of digitized patient data. However, utilizing EHR data for predictive modeling presents several challenges due to its unique characteristics. With the advancements in machine learning techniques, deep learning has demonstrated its superiority in various applications, including healthcare. This survey systematically reviews recent advances in deep learning-based predictive models using EHR data. Specifically, we begin by introducing the background of EHR data and providing a mathematical definition of the predictive modeling task. We then categorize and summarize predictive deep models from multiple perspectives. Furthermore, we present benchmarks and toolkits relevant to predictive modeling in healthcare. Finally, we conclude this survey by discussing open challenges and suggesting promising directions for future research.

--------------------------------------------------------------------------------------------------------

Evaluating Large Language Models for Generalization and Robustness via Data Compression

The data compression paper proposes evaluating LLMs' generalization and robustness by measuring lossless compression of data published after each model's training cutoff. Testing 14 models on text, code, and multimodal sources reveals significant degradation on unseen data, with models like Mistral and Llama-2 balancing performance and robustness best.

Authors:  Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin

Link:  https://arxiv.org/abs/2402.00861v1

Date: 2024-02-01

Summary:

Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of benchmark creation. To address this, we propose a lossless data compression based evaluation approach that tests how models' predictive abilities generalize after their training cutoff. Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to models' training data cutoff. We measure: 1) the compression performance on the testing period as a measure of generalization on unseen data; and 2) the performance gap between the training and testing period as a measure of robustness. Our experiments test 14 representative large language models of various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. We find that the compression performance of many models degrades significantly on data from after their cutoff date, but models such as Mistral and Llama-2 demonstrate a good balance between performance and robustness. Results also suggest that models struggle to generalize on news and code data, but work especially well on arXiv papers. We also find that context size and tokenization implementation have a big impact on overall compression performance.
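
The core measurement is easy to reproduce in spirit: a model's negative log-likelihood in bits, divided by the raw size in bits, gives the compression rate achievable via arithmetic coding. A minimal sketch, with gpt2 as an illustrative stand-in for the models tested:

```python
# Minimal sketch of compression-based evaluation: a language model's NLL in
# bits over raw bits equals its achievable lossless compression rate.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_bit(text: str, model_name: str = "gpt2") -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)        # mean NLL (nats) over shifted tokens
    total_bits = out.loss.item() * (ids.numel() - 1) / math.log(2)
    return total_bits / (8 * len(text.encode("utf-8")))

# A value below 1.0 means the model compresses the text below its raw size.
print(bits_per_bit("The quick brown fox jumps over the lazy dog."))
```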

--------------------------------------------------------------------------------------------------------

SLIM: Skill Learning with Multiple Critics

The skill learning paper addresses a key limitation of mutual information objectives for robotic manipulation tasks. By integrating multiple reward functions within a novel multi-critic reinforcement learning approach, the authors greatly enhance discovery of useful behaviors. Hierarchical composition further enables efficient planning and control. This significantly advances self-supervised policy learning for real-world robotics.

Authors:  David Emukpere, Bingbing Wu, Julien Perez

Link:  https://arxiv.org/abs/2402.00823v1

Date: 2024-02-01

Summary:

Self-supervised skill learning aims to acquire useful behaviors that leverage the underlying dynamics of the environment. Latent variable models based on mutual information maximization have been particularly successful in this task but still struggle in the context of robotic manipulation. Because manipulation requires affecting a possibly large set of the degrees of freedom composing the environment, mutual information maximization alone fails to produce useful manipulation behaviors. To address this limitation, we introduce SLIM, a multi-critic learning approach for skill discovery with a particular focus on robotic manipulation. Our main insight is that utilizing multiple critics in an actor-critic framework to gracefully combine multiple reward functions leads to a significant improvement in latent-variable skill discovery for robotic manipulation, while overcoming possible interference among rewards that hinders convergence to useful skills. Furthermore, in the context of tabletop manipulation, we demonstrate the applicability of our novel skill discovery approach to acquire safe and efficient motor primitives in a hierarchical reinforcement learning fashion and leverage them through planning, surpassing state-of-the-art approaches for skill discovery by a large margin.
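
Schematically, the multi-critic idea assigns each reward signal its own critic and updates the actor on a combination of their values. The sketch below is an assumption-laden simplification, not the SLIM implementation; the equal weighting in particular is illustrative.

```python
# Schematic multi-critic actor update (not the SLIM implementation).
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
# One critic per reward signal (e.g., a reaching reward and a manipulation reward).
critics = [nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(),
                         nn.Linear(64, 1)) for _ in range(2)]

obs = torch.randn(32, obs_dim)
action = actor(obs)
q_values = [c(torch.cat([obs, action], dim=-1)) for c in critics]
# The actor ascends a combination of the critics' values (equal weights here);
# in practice each critic is trained separately on its own reward via TD.
actor_loss = -torch.stack(q_values).mean()
actor_loss.backward()
```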

--------------------------------------------------------------------------------------------------------

Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents

The paper on formal-language-guided agents aims to improve plan validity and user trust in LLM-based planning. By constraining the natural language generation process with automata that encode human requirements, the framework prevents invalid plans. Experiments demonstrate over 50% performance gains on planning tasks, validating the benefits of the proposed hybrid approach. This facilitates safer deployment of LLMs in applications requiring high reliability.

Authors:  Zelong Li, Wenyue Hua, Hao Wang, He Zhu, Yongfeng Zhang

Link:  https://arxiv.org/abs/2402.00798v1

Date: 2024-02-01

Summary:

Recent advancements on Large Language Models (LLMs) enable AI Agents to automatically generate and execute multi-step plans to solve complex tasks. However, since the content generation process of LLMs is hard to control, current LLM-based agents frequently generate invalid or non-executable plans, which jeopardizes the performance of the generated plans and corrupts users' trust in LLM-based agents. In response, this paper proposes a novel "Formal-LLM" framework for LLM-based agents by integrating the expressiveness of natural language and the precision of formal language. Specifically, the framework allows human users to express their requirements or constraints for the planning process as an automaton. A stack-based LLM plan generation process is then conducted under the supervision of the automaton to ensure that the generated plan satisfies the constraints, making the planning process controllable. We conduct experiments on both benchmark tasks and practical real-life tasks, and our framework achieves over 50% overall performance increase, which validates the feasibility and effectiveness of employing Formal-LLM to guide the plan generation of agents, preventing the agents from generating invalid and unsuccessful plans. Further, more controllable LLM-based agents can facilitate the broader utilization of LLMs in application scenarios where high validity of planning is essential. The work is open-sourced at https://github.com/agiresearch/Formal-LLM.
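
A hedged sketch of automaton-supervised planning: at each step the agent may only pick actions the automaton currently permits, so any finished plan satisfies the constraint by construction. The toy automaton and choose_with_llm helper are illustrative, not the paper's implementation.

```python
# Sketch of constraining an LLM planner with a finite automaton.
# choose_with_llm is a hypothetical helper; prompts are illustrative.
def choose_with_llm(prompt: str, options: list[str]) -> str:
    raise NotImplementedError("ask an LLM to pick one of `options`")

# Toy constraint: "retrieve" must happen before "answer".
transitions = {("s0", "retrieve"): "s1", ("s1", "answer"): "accept"}

def constrained_plan(task: str) -> list[str]:
    state, plan = "s0", []
    while state != "accept":
        allowed = [a for (s, a) in transitions if s == state]  # permitted actions
        step = choose_with_llm(f"Task: {task}. Next step?", allowed)
        plan.append(step)
        state = transitions[(state, step)]
    return plan  # guaranteed to respect the automaton's constraints
```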

--------------------------------------------------------------------------------------------------------

Distinguishing the Indistinguishable: Human Expertise in Algorithmic Prediction

The paper provides a new framework for selectively incorporating human expertise to improve algorithmic predictions. Though models often outperform humans on average, human judgment significantly improves accuracy on roughly 30% of instances in an X-ray classification task. Principled identification of such cases enables more effective collaboration. This natural approach to leveraging complementary strengths has wide applicability.

Authors:  Rohan Alur, Manish Raghavan, Devavrat Shah

Link:  https://arxiv.org/abs/2402.00793v1

Date: 2024-02-01

Summary:

We introduce a novel framework for incorporating human expertise into algorithmic predictions. Our approach focuses on the use of human judgment to distinguish inputs which `look the same' to any feasible predictive algorithm. We argue that this framing clarifies the problem of human/AI collaboration in prediction tasks, as experts often have access to information -- particularly subjective information -- which is not encoded in the algorithm's training data. We use this insight to develop a set of principled algorithms for selectively incorporating human feedback only when it improves the performance of any feasible predictor. We find empirically that although algorithms often outperform their human counterparts on average, human judgment can significantly improve algorithmic predictions on specific instances (which can be identified ex-ante). In an X-ray classification task, we find that this subset constitutes nearly 30% of the patient population. Our approach provides a natural way of uncovering this heterogeneity and thus enabling effective human-AI collaboration.
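
One crude way to operationalize the idea (a heavy simplification of the paper's framework): bin inputs by the model's score as a proxy for "indistinguishable" cases, and flag bins where human judgments still predict outcomes. The binning, threshold, and correlation test below are all assumptions for illustration.

```python
# Rough sketch: flag regions of model-score space where human judgment
# carries signal beyond the model. Not the paper's algorithm.
import numpy as np

def flag_bins_for_human_input(model_probs, human_labels, outcomes, bins=10):
    """Within bins of model confidence (a crude proxy for inputs the model
    cannot distinguish), test whether human judgments still predict outcomes."""
    bin_ids = np.minimum((model_probs * bins).astype(int), bins - 1)
    defer = np.zeros(bins, dtype=bool)
    for b in range(bins):
        m = bin_ids == b
        if m.sum() < 5 or human_labels[m].std() == 0 or outcomes[m].std() == 0:
            continue  # too few or degenerate samples in this bin
        corr = np.corrcoef(human_labels[m], outcomes[m])[0, 1]
        defer[b] = abs(corr) > 0.2  # threshold is an arbitrary illustration
    return defer  # bins where human input appears worth incorporating
```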

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.