Week Ending 4.21.2024

April 22, 2024 Craig Smith

RESEARCH WATCH: 4.21.2024

Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them?

Massive and poorly documented training data has fueled the recent progress in foundation models, but it poses major challenges around data transparency, consent, bias, and ethics. This paper conducts a large-scale analysis of the training data landscape, identifying the missing infrastructure needed to facilitate responsible AI development. By examining shortcomings in tracing data authenticity, consent, and documentation, the authors outline how developers, policymakers, and data creators can adopt universal provenance standards, an important step towards building trustworthy foundation models.

Authors: Shayne Longpre, Robert Mahari, Naana Obeng-Marnu, William Brannon, Tobin South, Katy Gero, Sandy Pentland, Jad Kabbara

Link: https://arxiv.org/abs/2404.12691v1

Date: 2024-04-19

Summary:

New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in documenting data transparency, tracing authenticity, verifying consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy foundation models. In response, regulation is emphasizing the need for training data transparency to understand foundation models' limitations. Based on a large-scale analysis of the foundation model training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible foundation model development practices. We examine the current shortcomings of common tools for tracing data authenticity, consent, and documentation, and outline how policymakers, developers, and data creators can facilitate responsible foundation model development by adopting universal data provenance standards.

--------------------------------------------------------------------------------------------------------

X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement Learner

X-Light: Reinforcement learning has significantly improved traffic signal control through better multi-agent cooperation. However, achieving algorithms that transfer effectively across diverse cities remains a challenge. This paper introduces X-Light, a transformer-based meta multi-agent approach that learns general decision trajectories from full MDP trajectories across cities. X-Light demonstrates remarkable transferability, outperforming baseline methods by up to 16.3% when directly applied to unseen scenarios, making it a promising solution for real-world, cross-city traffic optimization.

Authors: Haoyuan Jiang, Ziyue Li, Hua Wei, Xuantang Xiong, Jingqing Ruan, Jiaming Lu, Hangyu Mao, Rui Zhao

Link: https://arxiv.org/abs/2404.12090v1

Date: 2024-04-18

Summary:

The effectiveness of traffic light control has been significantly improved by current reinforcement learning-based approaches via better cooperation among multiple traffic lights. However, a persisting issue remains: how to obtain a multi-agent traffic signal control algorithm with remarkable transferability across diverse cities? In this paper, we propose a Transformer on Transformer (TonT) model for cross-city meta multi-agent traffic signal control, named X-Light. We input the full Markov Decision Process trajectories; the Lower Transformer aggregates the states, actions, and rewards among the target intersection and its neighbors within a city, and the Upper Transformer learns the general decision trajectories across different cities. This dual-level approach bolsters the model's robust generalization and transferability. Notably, when directly transferring to unseen scenarios, our method surpasses all baseline methods by +7.91% on average, and by as much as +16.3% in some cases, yielding the best results.

--------------------------------------------------------------------------------------------------------

What does CLIP know about peeling a banana?

Understanding the implicit knowledge within large vision-language models like CLIP is crucial for enabling intelligent affordance-based reasoning in robotics. This paper proposes AffordanceCLIP, which leverages CLIP's learned associations between objects, parts, and actions to perform zero-shot affordance segmentation without additional supervision. AffordanceCLIP achieves competitive performance while being task-agnostic, scalable to any action prompt, and efficient, opening new perspectives for functionality-driven reasoning in AI systems.

Authors: Claudia Cuttano, Gabriele Rosi, Gabriele Trivigno, Giuseppe Averta

Link: https://arxiv.org/abs/2404.12015v1

Date: 2024-04-18

Summary:

Humans show an innate capability to identify tools to support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.

--------------------------------------------------------------------------------------------------------

A Data-Driven Representation for Sign Language Production

Current sign language production methods rely on limited linguistic resources, hindering progress. This paper introduces an innovative data-driven approach that transforms continuous sign pose generation into a discrete sequence generation problem using vector quantization. By learning a codebook of motion tokens that can be sequenced, the proposed method circumvents annotation needs while leveraging available data. Extensive evaluations demonstrate significant improvements over prior work, advancing the field of automatic sign language production.

Authors: Harry Walsh, Abolfazl Ravanshad, Mariam Rahmani, Richard Bowden

Link: https://arxiv.org/abs/2404.11499v1

Date: 2024-04-17

Summary:

Phonetic representations are used when recording spoken languages, but no equivalent exists for recording signed languages. As a result, linguists have proposed several annotation systems that operate on the gloss or sub-unit level; however, these resources are notably irregular and scarce. Sign Language Production (SLP) aims to automatically translate spoken language sentences into continuous sequences of sign language. However, current state-of-the-art approaches rely on scarce linguistic resources to work, which has limited progress in the field. This paper introduces an innovative solution by transforming the continuous pose generation problem into a discrete sequence generation problem, thus overcoming the need for costly annotation, although, if annotation is available, we leverage the additional information to enhance our approach. By applying Vector Quantisation (VQ) to sign language data, we first learn a codebook of short motions that can be combined to create a natural sequence of sign, where the tokens in the codebook can be thought of as the lexicon of our representation. Then, using a transformer, we perform a translation from spoken language text to a sequence of codebook tokens. Each token can be directly mapped to a sequence of poses, allowing the translation to be performed by a single network. Furthermore, we present a sign stitching method to effectively join tokens together. We evaluate on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T) and the more challenging Meine DGS Annotated (mDGS) datasets. An extensive evaluation shows our approach outperforms previous methods, increasing the BLEU-1 back translation score by up to 72%.
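
As a rough illustration of the quantisation step described above, the sketch below maps each motion vector to its nearest codebook entry and decodes token indices back to vectors. It is a minimal toy under stated assumptions, not the paper's implementation: the 2-D vectors, the three-entry codebook, and the function names are all hypothetical, and in the paper each token stands for a short motion (a sequence of poses) learned from data rather than a hand-written point.

```python
import math

def nearest_token(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

def quantise(vectors, codebook):
    """Turn a continuous sequence into a discrete sequence of token indices."""
    return [nearest_token(v, codebook) for v in vectors]

def decode(tokens, codebook):
    """Map token indices back to their codebook vectors."""
    return [codebook[t] for t in tokens]

# Hypothetical codebook and inputs, for illustration only.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
motions = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.6, 0.1)]

tokens = quantise(motions, codebook)
reconstruction = decode(tokens, codebook)
print(tokens)
```

Once the sequence is discrete, translating text into sign becomes a token-to-token problem, which is what lets a standard sequence model handle it.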

--------------------------------------------------------------------------------------------------------

AdaIR: Exploiting Underlying Similarities of Image Restoration Tasks with Adapters

AdaIR: Traditional image restoration methods require extensive task-specific networks, leading to high storage and computational costs. AdaIR proposes a novel framework that exploits shared components across restoration tasks while augmenting them with lightweight task-adapters. By pre-training a generic network and subsequently adapting it to individual tasks, AdaIR achieves outstanding multi-task performance with significantly reduced parameter counts and training time, offering an efficient solution for practical image restoration applications.

Authors: Hao-Wei Chen, Yu-Syuan Xu, Kelvin C. K. Chan, Hsien-Kai Kuo, Chun-Yi Lee, Ming-Hsuan Yang

Link: https://arxiv.org/abs/2404.11475v1

Date: 2024-04-17

Summary:

Existing image restoration approaches typically employ extensive networks specifically trained for designated degradations. Despite being effective, such methods inevitably entail considerable storage costs and computational overheads due to the reliance on task-specific networks. In this work, we go beyond this well-established framework and exploit the inherent commonalities among image restoration tasks. The primary objective is to identify components that are shareable across restoration tasks and augment the shared components with modules specifically trained for individual tasks. Towards this goal, we propose AdaIR, a novel framework that enables low storage cost and efficient training without sacrificing performance. Specifically, a generic restoration network is first constructed through self-supervised pre-training using synthetic degradations. Subsequent to the pre-training phase, adapters are trained to adapt the pre-trained network to specific degradations. AdaIR requires solely the training of lightweight, task-specific modules, ensuring a more efficient storage and training regimen. We have conducted extensive experiments to validate the effectiveness of AdaIR and analyze the influence of the pre-training strategy on discovering shareable components. Extensive experimental results show that AdaIR achieves outstanding results on multi-task restoration while utilizing significantly fewer parameters (1.9 MB) and less training time (7 hours) for each restoration task. The source codes and trained models will be released.

--------------------------------------------------------------------------------------------------------

Advancing Social Intelligence in AI Agents: Technical Challenges and Open Questions

Building socially intelligent AI agents that can perceive, reason, and respond to human behavior and cognition is a multidisciplinary research goal. This position paper identifies key technical challenges and open questions across computing fields like NLP, vision, and robotics to advance social AI. Anchored in prior work, the paper provides a roadmap for future research on endowing AI agents with robust social intelligence capabilities.

Authors: Leena Mathur, Paul Pu Liang, Louis-Philippe Morency

Link: https://arxiv.org/abs/2404.11023v1

Date: 2024-04-17

Summary:

Building socially-intelligent AI agents (Social-AI) is a multidisciplinary, multimodal research goal that involves creating agents that can sense, perceive, reason about, learn from, and respond to affect, behavior, and cognition of other agents (human or artificial). Progress towards Social-AI has accelerated in the past decade across several computing communities, including natural language processing, machine learning, robotics, human-machine interaction, computer vision, and speech. Natural language processing, in particular, has been prominent in Social-AI research, as language plays a key role in constructing the social world. In this position paper, we identify a set of underlying technical challenges and open questions for researchers across computing communities to advance Social-AI. We anchor our discussion in the context of social intelligence concepts and prior progress in Social-AI research.

--------------------------------------------------------------------------------------------------------

Vision-and-Language Navigation via Causal Learning

Mitigating dataset biases is crucial for developing robust vision-and-language navigation (VLN) agents that generalize well. This paper proposes GOAT, a pioneering causal inference-based approach that comprehensively addresses potential spurious correlations in vision, language, and history through novel adjustment and contrastive learning modules. Extensive experiments across multiple VLN datasets demonstrate GOAT's superiority over prior methods in this important problem of generalizable multimodal understanding and navigation.

Authors: Liuyi Wang, Zongtao He, Ronghao Dang, Mengjiao Shen, Chengju Liu, Qijun Chen

Link: https://arxiv.org/abs/2404.10241v1

Date: 2024-04-16

Summary:

In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT.
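
For readers unfamiliar with back-door adjustment, the sketch below shows the textbook formula, P(Y | do(X=x)) = sum over z of P(Y | X=x, Z=z) P(Z=z), on a tiny discrete example. It is only meant to convey the idea: GOAT applies its adjustment modules to learned features inside a network, which is not modelled here, and every probability and function name below is hypothetical.

```python
def backdoor_adjust(p_z, p_y_given_xz, x):
    """P(Y=1 | do(X=x)): average P(Y=1 | x, z) over the marginal of the confounder Z."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())

def naive_conditional(p_z, p_x_given_z, p_y_given_xz, x):
    """P(Y=1 | X=x): the confounded estimate, which weights z by P(z | x) instead."""
    p_x = sum(p_x_given_z[z][x] * pz for z, pz in p_z.items())
    p_xy = sum(p_y_given_xz[(x, z)] * p_x_given_z[z][x] * pz for z, pz in p_z.items())
    return p_xy / p_x

# Hypothetical distribution: Z influences both X and Y.
p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
p_y_given_xz = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.6, (1, 1): 0.7}

effect_adjusted = backdoor_adjust(p_z, p_y_given_xz, 1) - backdoor_adjust(p_z, p_y_given_xz, 0)
effect_naive = (naive_conditional(p_z, p_x_given_z, p_y_given_xz, 1)
                - naive_conditional(p_z, p_x_given_z, p_y_given_xz, 0))
print(round(effect_adjusted, 3), round(effect_naive, 3))
```

In this toy setup the naive comparison overstates the effect of X because Z drives both variables; the adjusted estimate removes that spurious share.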

--------------------------------------------------------------------------------------------------------

Chinchilla Scaling: A replication attempt

Estimating optimal scaling laws for large language models is critical for efficient model development. This work attempts to replicate the third estimation procedure from a previous influential scaling study but finds inconsistencies and implausible results. The authors provide a rederivation that aligns better with the original work's first two approaches, offering a more reliable scaling law estimation method for guiding future LLM research and development.

Authors: Tamay Besiroglu, Ege Erdil, Matthew Barnett, Josh You

Link: https://arxiv.org/abs/2404.10102v1

Date: 2024-04-15

Summary:

Hoffmann et al. (2022) propose three methods for estimating a compute-optimal scaling law. We attempt to replicate their third estimation procedure, which involves fitting a parametric loss function to a reconstruction of data from their plots. We find that the reported estimates are inconsistent with their first two estimation methods, fail at fitting the extracted data, and report implausibly narrow confidence intervals--intervals this narrow would require over 600,000 experiments, while they likely only ran fewer than 500. In contrast, our rederivation of the scaling law using the third approach yields results that are compatible with the findings from the first two estimation procedures described by Hoffmann et al.
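
To make "fitting a parametric loss function" more concrete, the sketch below uses the functional form generally associated with Hoffmann et al., L(N, D) = E + A / N^alpha + B / D^beta, together with the common approximation that training compute is about 6 * N * D. The constants are placeholders chosen for illustration and are not estimates from either paper; the function names are likewise hypothetical.

```python
def parametric_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Loss as an irreducible term plus power-law terms in model size and data size."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

def compute_optimal(compute, A, B, alpha, beta):
    """Minimise the loss subject to compute = 6 * N * D (closed-form solution)."""
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    budget = compute / 6.0
    n_opt = g * budget ** (beta / (alpha + beta))
    d_opt = budget ** (alpha / (alpha + beta)) / g
    return n_opt, d_opt

# Placeholder constants, for illustration only.
E, A, B, ALPHA, BETA = 2.0, 400.0, 600.0, 0.35, 0.30
COMPUTE = 6e18

n_opt, d_opt = compute_optimal(COMPUTE, A, B, ALPHA, BETA)
best_loss = parametric_loss(n_opt, d_opt, E, A, B, ALPHA, BETA)
print(f"N = {n_opt:.3e}, D = {d_opt:.3e}, loss = {best_loss:.4f}")
```

Because the optimal split between parameters and tokens depends on the ratio of the fitted exponents, small changes in the estimated alpha and beta can shift the recommended allocation noticeably, which is why the quality of the fit matters.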

--------------------------------------------------------------------------------------------------------

Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video

Creating interactive virtual environments like games often requires complex manual modeling efforts. Video2Game proposes an automatic approach to convert real-world scene videos into realistic, interactive game environments through neural radiance fields, mesh distillation, and physics modeling. Benchmarked on indoor and outdoor scenes, the system can render highly realistic visuals in real-time while enabling user interactions, streamlining the development of immersive digital replicas of the real world.

Authors: Hongchi Xia, Zhi-Hao Lin, Wei-Chiu Ma, Shenlong Wang

Link: https://arxiv.org/abs/2404.09833v1

Date: 2024-04-15

Summary:

Creating high-quality and interactive virtual environments, such as games and simulators, often involves complex and costly manual modeling processes. In this paper, we present Video2Game, a novel approach that automatically converts videos of real-world scenes into realistic and interactive game environments. At the heart of our system are three core components: (i) a neural radiance fields (NeRF) module that effectively captures the geometry and visual appearance of the scene; (ii) a mesh module that distills the knowledge from NeRF for faster rendering; and (iii) a physics module that models the interactions and physical dynamics among the objects. By following the carefully designed pipeline, one can construct an interactable and actionable digital replica of the real world. We benchmark our system on both indoor and large-scale outdoor scenes. We show that we can not only produce highly-realistic renderings in real-time, but also build interactive games on top.

--------------------------------------------------------------------------------------------------------

Harnessing GPT-4V(ision) for Insurance: A Preliminary Exploration

Large multimodal models (LMMs) like GPT-4V present new opportunities in the insurance domain, which involves diverse data modalities. This paper explores GPT-4V's capabilities across multimodal insurance tasks categorized by insurance types and stages. While showing remarkable performance, the study also reveals limitations in risk assessment, loss estimation, hallucination, and multilingual support, highlighting areas for future research in applying cutting-edge LMM technology to the complex insurance field.

Authors: Chenwei Lin, Hanjia Lyu, Jiebo Luo, Xian Xu

Link: https://arxiv.org/abs/2404.09690v1

Date: 2024-04-15

Summary:

The emergence of Large Multimodal Models (LMMs) marks a significant milestone in the development of artificial intelligence. Insurance, as a vast and complex discipline, involves a wide variety of data forms in its operational processes, including text, images, and videos, thereby giving rise to diverse multimodal tasks. Despite this, there has been limited systematic exploration of multimodal tasks specific to insurance, nor a thorough investigation into how LMMs can address these challenges. In this paper, we explore GPT-4V's capabilities in the insurance domain. We categorize multimodal tasks by focusing primarily on visual aspects based on types of insurance (e.g., auto, household/commercial property, health, and agricultural insurance) and insurance stages (e.g., risk assessment, risk monitoring, and claims processing). Our experiment reveals that GPT-4V exhibits remarkable abilities in insurance-related tasks, demonstrating not only a robust understanding of multimodal content in the insurance domain but also a comprehensive knowledge of insurance scenarios. However, there are notable shortcomings: GPT-4V struggles with detailed risk rating and loss assessment, suffers from hallucination in image understanding, and shows variable support for different languages. Through this work, we aim to bridge the insurance domain with cutting-edge LMM technology, facilitate interdisciplinary exchange and development, and provide a foundation for the continued advancement and evolution of future research endeavors.

--------------------------------------------------------------------------------------------------------

Exploring Augmentation and Cognitive Strategies for AI based Synthetic Personae

Large language models (LLMs) show promise for creating synthetic personae in human-computer interaction research. However, their black-box nature and hallucinations pose challenges. This position paper advocates using LLMs as data augmentation systems instead of zero-shot generators. It also proposes developing robust cognitive and memory frameworks to guide LLM responses. Preliminary explorations suggest data enrichment, episodic memory, and self-reflection techniques can improve the reliability of synthetic personae, opening new avenues for innovative HCI research.

Authors: Rafael Arias Gonzalez, Steve DiPaola

Link: https://arxiv.org/abs/2404.10890v1

Date: 2024-04-16

Summary:

Large language models (LLMs) hold potential for innovative HCI research, including the creation of synthetic personae. However, their black-box nature and propensity for hallucinations pose challenges. To address these limitations, this position paper advocates for using LLMs as data augmentation systems rather than zero-shot generators. We further propose the development of robust cognitive and memory frameworks to guide LLM responses. Initial explorations suggest that data enrichment, episodic memory, and self-reflection techniques can improve the reliability of synthetic personae and open up new avenues for HCI research.

--------------------------------------------------------------------------------------------------------

REQUAL-LM: Reliability and Equity through Aggregation in Large Language Models

The randomized nature and inherent biases of large language models (LLMs) raise concerns about reliability and equity when deploying them in critical applications. REQUAL-LM introduces a novel aggregation method to find reliable and equitable LLM outputs through repeated sampling and equity-aware aggregation. By formally defining reliability and bias, it effectively mitigates harmful biases while selecting highly reliable outputs representative of minority groups. Its blackbox design enables scalability alongside rapid LLM advancements without retraining, facilitating responsible real-world deployment.

Authors: Sana Ebrahimi, Nima Shahbazi, Abolfazl Asudeh

Link: https://arxiv.org/abs/2404.11782v1

Date: 2024-04-17

Summary:

The extensive scope of large language models (LLMs) across various domains underscores the critical importance of responsibility in their application, beyond natural language processing. In particular, the randomized nature of LLMs, coupled with inherent biases and historical stereotypes in data, raises critical concerns regarding reliability and equity. Addressing these challenges is necessary before using LLMs for applications with societal impact. Towards addressing this gap, we introduce REQUAL-LM, a novel method for finding reliable and equitable LLM outputs through aggregation. Specifically, we develop a Monte Carlo method based on repeated sampling to find a reliable output close to the mean of the underlying distribution of possible outputs. We formally define terms such as reliability and bias, and design an equity-aware aggregation to minimize harmful bias while finding a highly reliable output. REQUAL-LM does not require specialized hardware, does not impose a significant computing load, and uses LLMs as a blackbox. This design choice enables seamless scalability alongside the rapid advancement of LLM technologies. Our system does not require retraining the LLMs, which makes it deployment ready and easy to adapt. Our comprehensive experiments using various tasks and datasets demonstrate that REQUAL-LM effectively mitigates bias and selects a more equitable response, specifically the outputs that properly represent minority groups.
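
The "reliable output close to the mean" idea can be sketched as follows: sample several outputs, embed them, and keep the one nearest the centroid. This is a simplified illustration under stated assumptions, not the paper's method: the embeddings are hard-coded hypothetical vectors rather than the output of a real model, Euclidean distance is an arbitrary choice, and the equity-aware part of the aggregation is not modelled at all.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def most_central(outputs, embeddings):
    """Return the sampled output whose embedding is closest to the sample mean."""
    mean = centroid(embeddings)
    best = min(range(len(outputs)), key=lambda i: math.dist(embeddings[i], mean))
    return outputs[best]

# Hypothetical sampled outputs and embeddings, for illustration only.
outputs = ["answer A", "answer B", "answer C", "answer D"]
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.0, 1.0)]

chosen = most_central(outputs, embeddings)
print(chosen)
```

Because the procedure only needs sampled outputs, it treats the model as a black box, which matches the design goal stated in the summary.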

--------------------------------------------------------------------------------------------------------

Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

Text-to-audio generation is increasingly important for content creation in industries like music and film. While recent diffusion models train on prompt-audio pairs, they often miss concepts or temporal ordering from the input prompts. Tango 2 introduces direct preference optimization to explicitly improve these aspects by fine-tuning an existing model on synthetically created preference datasets with "winner" and "loser" outputs. This approach leads to improved audio quality over baseline models on both automatic and human evaluation metrics.

Authors: Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria

Link: https://arxiv.org/abs/2404.09956v2

Date: 2024-04-16

Summary:

Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many of the recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on a large set of datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events and their temporal ordering in the output audio with respect to the input prompt. Our hypothesis is that focusing on these aspects of audio generation could improve audio generation performance in the presence of limited data. As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic- and manual-evaluation metrics.
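
For intuition about learning from winner and loser pairs, the sketch below implements the standard DPO loss for a single pair. Note the substitution: this is the ordinary log-likelihood form of DPO, not the diffusion-DPO loss the paper uses, in which the log-likelihood terms are replaced by quantities derived from denoising error. The numeric inputs and the `beta` value are hypothetical.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-likelihoods, for illustration only.
# Before fine-tuning, the model matches the frozen reference model.
untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# After fine-tuning, the winner is more likely and the loser less likely.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(round(untrained, 4), round(improved, 4))
```

The loss falls as the fine-tuned model prefers the winner over the loser by a wider margin than the reference model does.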

--------------------------------------------------------------------------------------------------------

CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts

Assuming English as the source for cross-lingual transfer can hinder performance for many languages due to language dissimilarities. This work highlights the importance of selecting source languages with high contact to the target for effective transfer. It proposes CORI, a novel benchmark for closely related Chinese-Japanese-Korean-Vietnamese languages, integrating Romanized transcriptions via contrastive learning to enhance cross-lingual representations and zero-shot transfer across these high-contact languages.

Authors: Hoang H. Nguyen, Chenwei Zhang, Ye Liu, Natalie Parde, Eugene Rohrbaugh, Philip S. Yu

Link: https://arxiv.org/abs/2404.12618v1

Date: 2024-04-19

Summary:

Naively assuming English as a source language may hinder cross-lingual transfer for many languages by failing to consider the importance of language contact. Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages; for many languages, the set of closely related languages does not include English. In this work, we study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language. We also construct a novel benchmark dataset for close contact Chinese-Japanese-Korean-Vietnamese (CJKV) languages to further encourage in-depth studies of language contact. To comprehensively capture contact between these languages, we propose to integrate Romanized transcription beyond textual scripts via Contrastive Learning objectives, leading to enhanced cross-lingual representations and effective zero-shot cross-lingual transfer.

--------------------------------------------------------------------------------------------------------

What's under the hood: Investigating Automatic Metrics on Meeting Summarization

Automatic evaluation metrics for meeting summarization often fail to capture important meeting-specific errors. This paper conducts a comprehensive study correlating frequently used metrics with human evaluations across various error types like missing information and speaker dynamics. The authors find metrics struggle with observable errors, showing weak correlations or masking errors, while only subsets accurately reflect certain errors' impact on summary quality. This analysis highlights the need for better metrics tailored to meeting summarization challenges.

Authors: Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp

Link: https://arxiv.org/abs/2404.11124v1

Date: 2024-04-17

Summary:

Meeting summarization has become a critical task considering the increase in online interactions. While new techniques are introduced regularly, their evaluation uses metrics not designed to capture meeting-specific errors, undermining effective evaluation. This paper investigates what the frequently used automatic metrics capture and which errors they mask by correlating automatic metric scores with human evaluations across a broad error taxonomy. We commence with a comprehensive literature review on English meeting summarization to define key challenges like speaker dynamics and contextual turn-taking and error types such as missing information and linguistic inaccuracy, concepts previously loosely defined in the field. We examine the relationship between characteristic challenges and errors by using annotated transcripts and summaries from Transformer-based sequence-to-sequence and autoregressive models from the general summary QMSum dataset. Through experimental validation, we find that different model architectures respond variably to challenges in meeting transcripts, resulting in differently pronounced links between challenges and errors. Current default-used metrics struggle to capture observable errors, showing weak to mid-correlations, while a third of the correlations show trends of error masking. Only a subset reacts accurately to specific errors, while most correlations show either unresponsiveness or failure to reflect the error's impact on summary quality.
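
The core measurement here, correlating a metric's scores with human error judgments, can be sketched in a few lines. The example uses Pearson correlation as an assumption (the summary does not say which coefficient the authors use), and all scores and severity ratings below are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical human ratings of error severity for four summaries (higher = worse).
error_severity = [0, 1, 2, 3]

# Hypothetical scores from two automatic metrics on the same summaries.
responsive_metric = [0.90, 0.80, 0.60, 0.50]    # drops as errors get worse
unresponsive_metric = [0.70, 0.72, 0.69, 0.71]  # barely moves

r_responsive = pearson(responsive_metric, error_severity)
r_unresponsive = pearson(unresponsive_metric, error_severity)
print(round(r_responsive, 3), round(r_unresponsive, 3))
```

A metric that tracks an error type should correlate strongly and negatively with its severity; a correlation near zero suggests the metric does not register that error.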

--------------------------------------------------------------------------------------------------------

HalluciBot: Is There No Such Thing as a Bad Question?

Most work on mitigating hallucinations in large language models (LLMs) focuses on post-generation analysis. HalluciBot instead predicts the probability of hallucination before generation, using a query perturbator and multi-agent Monte Carlo simulation during training. Introducing the concept of "truthful hallucination", HalluciBot enables judging a query's propensity to hallucinate, paving the way to revise or cancel problematic queries upfront and measure user accountability for hallucinatory queries, reducing computational waste.

Authors: William Watson, Nicole Cho

Link: https://arxiv.org/abs/2404.12535v1

Date: 2024-04-18

Summary:

Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). In this context, an overwhelming number of studies have focused on analyzing the post-generation phase: refining outputs via feedback, analyzing logit output values, or deriving clues via the outputs' artifacts. We propose HalluciBot, a model that predicts the probability of hallucination before generation, for any query posed to an LLM. In essence, HalluciBot does not invoke any generation during inference. To derive empirical evidence for HalluciBot, we employ a Multi-Agent Monte Carlo Simulation using a Query Perturbator to craft n variations per query at train time. The construction of our Query Perturbator is motivated by our introduction of a new definition of hallucination, "truthful hallucination". Our training methodology generated 2,219,022 estimates for a training corpus of 369,837 queries, spanning 13 diverse datasets and 3 question-answering scenarios. HalluciBot predicts both binary and multi-class probabilities of hallucination, enabling a means to judge the query's quality with regard to its propensity to hallucinate. Therefore, HalluciBot paves the way to revise or cancel a query before generation and the ensuing computational waste. Moreover, it provides a lucid means to measure user accountability for hallucinatory queries.
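The reported figures work out to six estimates per query (2,219,022 / 369,837 = 6). As a rough illustration of how a Monte Carlo estimate of this kind could become a training label, the sketch below computes the fraction of sampled answers judged incorrect for one query. The function names, the 0.5 threshold, and the sample outcomes are assumptions for illustration; the abstract does not specify how the authors derive their binary or multi-class labels.

```python
def hallucination_rate(correct_flags):
    """Fraction of sampled answers judged incorrect for one query.

    correct_flags holds one boolean per sampled answer
    (True = judged correct, False = judged hallucinated).
    """
    if not correct_flags:
        raise ValueError("need at least one sampled answer")
    wrong = sum(1 for ok in correct_flags if not ok)
    return wrong / len(correct_flags)


def binary_label(rate, threshold=0.5):
    """Hypothetical labeling rule: flag the query if the rate reaches the threshold."""
    return rate >= threshold


# Hypothetical outcomes for six sampled answers to variations of one query.
outcomes = [True, True, False, True, False, True]
rate = hallucination_rate(outcomes)
print(rate, binary_label(rate))
```

A predictor trained on such labels would only need the query text at inference time, which matches the abstract's point that no generation is invoked during inference.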

--------------------------------------------------------------------------------------------------------

Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors

Robust Adversarial Detection via Adversarial Retraining (RADAR) enhances the robustness of adversarial detectors against adaptive attacks while maintaining classification accuracy. It leverages adversarial training by integrating adversarial examples optimized to fool both the classifier and detector into the training dataset. This enables the detector to learn and adapt to potential attack scenarios. Evaluations on CIFAR-10 and SVHN demonstrate RADAR's improved ability to identify adaptive attacks without sacrificing clean data performance.

Authors: Raz Lapid, Almog Dubin, Moshe Sipper

Link: https://arxiv.org/abs/2404.12120v1

Date: 2024-04-18

Summary:

This paper presents RADAR (Robust Adversarial Detection via Adversarial Retraining), an approach designed to enhance the robustness of adversarial detectors against adaptive attacks, while maintaining classifier performance. An adaptive attack is one where the attacker is aware of the defenses and adapts their strategy accordingly. Our proposed method leverages adversarial training to reinforce the ability to detect attacks, without compromising clean accuracy. During the training phase, we integrate into the dataset adversarial examples, which were optimized to fool both the classifier and the adversarial detector, enabling the adversarial detector to learn and adapt to potential attack scenarios. Experimental evaluations on the CIFAR-10 and SVHN datasets demonstrate that our proposed algorithm significantly improves a detector's ability to accurately identify adaptive adversarial attacks, without sacrificing clean accuracy.

--------------------------------------------------------------------------------------------------------

When Life gives you LLMs, make LLM-ADE: Large Language Models with Adaptive Data Engineering

The LLM-ADE framework proposes a novel methodology for continually pre-training large language models (LLMs) while addressing catastrophic forgetting and double descent challenges. It employs dynamic architectural adjustments like selective block freezing and expansion tailored to specific datasets. This strategy enhances adaptability to new data while preserving prior knowledge. Evaluations on TinyLlama show LLM-ADE's significant performance gains over traditional continuous training across general knowledge benchmarks, offering a robust and efficient approach for real-world LLM applications.

Authors: Stephen Choi, William Gazeley

Link: https://arxiv.org/abs/2404.13028v1

Date: 2024-04-19

Summary:

This paper presents the LLM-ADE framework, a novel methodology for continued pre-training of large language models (LLMs) that addresses the challenges of catastrophic forgetting and double descent. LLM-ADE employs dynamic architectural adjustments, including selective block freezing and expansion, tailored to specific datasets. This strategy enhances model adaptability to new data while preserving previously acquired knowledge. We demonstrate LLM-ADE's effectiveness on the TinyLlama model across various general knowledge benchmarks, showing significant performance improvements without the drawbacks of traditional continuous training methods. This approach promises a more versatile and robust way to keep LLMs current and efficient in real-world applications.

--------------------------------------------------------------------------------------------------------

Modeling the Lane-Change Reactions to Merging Vehicles for Highway On-Ramp Simulations

Accurately modeling driver behavior in highway merge scenarios is crucial for developing autonomous vehicles. This work aims to improve simulations by including both yielding and lane-change reactions of lag vehicles to merging cars. Using a novel naturalistic U.S. highway dataset, the authors evaluate models for capturing reactive lane-change behavior and demonstrate their integration into high-fidelity simulations with adequate computational efficiency for large-scale autonomous vehicle development.

Authors: Dustin Holley, Jovin Dsa, Hossein Nourkhiz Mahjoub, Gibran Ali, Tyler Naes, Ehsan Moradi-Pari, Pawan Sai Kallepalli

Link: https://arxiv.org/abs/2404.09851v1

Date: 2024-04-15

Summary:

Enhancing simulation environments to replicate real-world driver behavior is essential for developing Autonomous Vehicle technology. While some previous works have studied the yielding reaction of lag vehicles in response to a merging car at highway on-ramps, the possible lane-change reaction of the lag car has not been widely studied. In this work we aim to improve the simulation of the highway merge scenario by including the lane-change reaction in addition to yielding behavior of main-lane lag vehicles, and we evaluate two different models for their ability to capture this reactive lane-change behavior. To tune the payoff functions of these models, a novel naturalistic dataset was collected on U.S. highways that provided several hours of merge-specific data to learn the lane change behavior of U.S. drivers. To make sure that we are collecting a representative set of different U.S. highway geometries in our data, we surveyed 50,000 U.S. highway on-ramps and then selected eight representative sites. The data were collected using roadside-mounted lidar sensors to capture various merge driver interactions. The models were demonstrated to be configurable for both keep-straight and lane-change behavior. The models were finally integrated into a high-fidelity simulation environment and confirmed to have adequate computation time efficiency for use in large-scale simulations to support autonomous vehicle development.

--------------------------------------------------------------------------------------------------------

Awareness of uncertainty in classification using a multivariate model and multi-views

Enabling AI systems to express uncertainty can make them more natural and reliable. This paper proposes an uncertainty-aware loss function for multivariate classification tasks to train models that estimate prediction uncertainties. It combines this with extensive data augmentation during training and testing to generate multiple predictions per sample. Various aggregation methods like mode values and bin counts are then explored to derive final predictions from the multi-view outputs and associated uncertainties. Evaluations on CIFAR-10 show promising results compared to other uncertainty estimation techniques.

Authors: Alexey Kornaev, Elena Kornaeva, Oleg Ivanov, Ilya Pershin, Danis Alukaev

Link: https://arxiv.org/abs/2404.10314v1

Date: 2024-04-16

Summary:

One of the ways to make artificial intelligence more natural is to give it some room for doubt. Two main questions should be resolved in that way. First, how to train a model to estimate uncertainties of its own predictions? And then, what to do with the uncertain predictions if they appear? First, we proposed an uncertainty-aware negative log-likelihood loss for the case of an N-dimensional multivariate normal distribution with a spherical variance matrix for solving N-class classification tasks. The loss is similar to the heteroscedastic regression loss. The proposed model regularizes uncertain predictions, and trains to calculate both the predictions and their uncertainty estimations. The model fits well with the label smoothing technique. Second, we expanded the limits of data augmentation at the training and test stages, and had the trained model give multiple predictions for a given number of augmented versions of each test sample. Given the multi-view predictions together with their uncertainties and confidences, we proposed several methods to calculate final predictions, including mode values and bin counts with soft and hard weights. For the latter method, we formalized the model tuning task in the form of multimodal optimization with non-differentiable criteria of maximum accuracy, and applied particle swarm optimization to solve the tuning task. The proposed methodology was tested using the CIFAR-10 dataset with clean and noisy labels and demonstrated good results in comparison with other uncertainty estimation methods related to sample selection, co-teaching, and label smoothing.
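For readers unfamiliar with this kind of loss, the sketch below writes out the standard negative log-likelihood of an N-dimensional normal distribution with spherical covariance, plus a simple mode-based vote over multi-view predictions. This is the textbook form and not necessarily the exact loss or aggregation rule in the paper, whose scaling and parameterization may differ. The function names and sample values are hypothetical.

```python
import math
from collections import Counter


def spherical_gaussian_nll(target, mean, variance):
    """NLL of `target` under a normal distribution N(mean, variance * I).

    NLL = ||target - mean||^2 / (2 * variance)
          + (N / 2) * log(variance) + (N / 2) * log(2 * pi)
    """
    n = len(target)
    squared_error = sum((t - m) ** 2 for t, m in zip(target, mean))
    return (squared_error / (2 * variance)
            + 0.5 * n * math.log(variance)
            + 0.5 * n * math.log(2 * math.pi))


def mode_prediction(view_predictions):
    """Most frequent class among predictions for augmented views of one sample."""
    return Counter(view_predictions).most_common(1)[0][0]


# Hypothetical 3-class example: one-hot target and a wrong predicted mean.
target = [0.0, 1.0, 0.0]
wrong_mean = [1.0, 0.0, 0.0]
print(spherical_gaussian_nll(target, wrong_mean, 0.1))  # confident and wrong
print(spherical_gaussian_nll(target, wrong_mean, 1.0))  # less confident, wrong
print(mode_prediction([3, 5, 3, 3, 5]))
```

In this form, a wrong prediction made with a small predicted variance costs more than the same wrong prediction made with a larger variance. The log-variance term, in turn, keeps the model from inflating its uncertainty without limit.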

--------------------------------------------------------------------------------------------------------
