Week Ending 6.9.2024

 

RESEARCH WATCH: 6.9.2024

 

Seeing the Unseen: Visual Metaphor Captioning for Videos

Visual Metaphor Captioning for Videos proposes a novel task of describing metaphors present in videos, along with a dataset and metric. This could aid in understanding complex language phenomena and enhance multimodal AI systems' ability to interpret abstract concepts, benefiting areas like creative content generation and language education.

Authors:  Abisek Rajakumar Kalarani, Pushpak Bhattacharyya, Sumit Shekhar

Link:  https://arxiv.org/abs/2406.04886v1

Date: 2024-06-07

Summary:

Metaphors are a common communication tool used in our day-to-day life. The detection and generation of metaphors in textual form have been studied extensively, but metaphors in other forms remain under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. To date, no probing studies have examined complex language phenomena like metaphors in videos. Hence, in this work we introduce a new VL task of describing the metaphors present in videos. To facilitate this novel task, we construct and release a manually created dataset with 705 videos and 2115 human-written captions, along with a new metric called Average Concept Distance (ACD), to automatically evaluate the creativity of the metaphors generated. We also propose a novel low-resource video metaphor captioning system, GIT-LLaVA, which obtains performance comparable to SoTA video language models on the proposed task. We perform a comprehensive analysis of existing video language models on this task and publish our dataset, models, and benchmark results to enable further research.
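
The abstract does not define Average Concept Distance precisely, but a concept-distance score of this kind can be sketched. The toy function below, a pure assumption rather than the authors' metric, averages cosine distances between embeddings of the concepts a generated caption links, so captions bridging more distant concepts score higher.

import numpy as np

def average_concept_distance(concept_embeddings: np.ndarray) -> float:
    """Toy ACD-style score: mean pairwise cosine distance between concept
    embeddings of a generated metaphor caption. Higher values would suggest
    the caption links more distant concepts (a rough proxy for creativity).
    The real ACD metric from the paper may be defined differently."""
    # Normalize each concept embedding to unit length.
    norms = np.linalg.norm(concept_embeddings, axis=1, keepdims=True)
    unit = concept_embeddings / np.clip(norms, 1e-12, None)
    # Cosine similarity matrix, converted to a distance matrix.
    dists = 1.0 - unit @ unit.T
    # Average over distinct pairs (upper triangle, excluding the diagonal).
    iu = np.triu_indices(len(unit), k=1)
    return float(dists[iu].mean())

# Example with placeholder embeddings for two concepts, e.g. "time" and "river".
rng = np.random.default_rng(0)
print(average_concept_distance(rng.normal(size=(2, 300))))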

--------------------------------------------------------------------------------------------------------

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

WildBench introduces an evaluation framework for benchmarking large language models using challenging real-world queries from human-chatbot conversations. It could help identify strengths and weaknesses of language models in practical applications, guiding model development and selection for deployable AI assistants.

Authors:  Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi

Link:  https://arxiv.org/abs/2406.04770v1

Date: 2024-06-07

Summary:

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias by converting outcomes of "slightly better/worse" to "tie" if the winner response exceeds the loser one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as its 0.87 for regular win rates.
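
The length-bias mitigation is concrete enough to illustrate. The sketch below maps the five WB-Reward outcomes to numeric rewards and downgrades narrow wins to ties when the winning response is more than K characters longer; the reward values, the default K, and the function names are illustrative assumptions, not WildBench's released code.

# Illustrative mapping of WB-Reward pairwise outcomes; the numeric rewards
# and threshold handling are assumptions, not the official WildBench code.
REWARDS = {"much better": 1.0, "slightly better": 0.5, "tie": 0.0,
           "slightly worse": -0.5, "much worse": -1.0}

def adjusted_outcome(outcome: str, model_resp: str, baseline_resp: str,
                     k_chars: int = 500) -> str:
    """Convert a 'slightly better/worse' verdict to a tie when the winner's
    response is longer than the loser's by more than k_chars characters."""
    length_gap = len(model_resp) - len(baseline_resp)
    if outcome == "slightly better" and length_gap > k_chars:
        return "tie"          # model won narrowly but was much longer
    if outcome == "slightly worse" and -length_gap > k_chars:
        return "tie"          # baseline won narrowly but was much longer
    return outcome

def wb_reward(outcome: str, model_resp: str, baseline_resp: str) -> float:
    return REWARDS[adjusted_outcome(outcome, model_resp, baseline_resp)]

print(wb_reward("slightly better", "x" * 2000, "y" * 800))  # -> 0.0 (tie)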

--------------------------------------------------------------------------------------------------------

Denoising-Aware Contrastive Learning for Noisy Time Series

Denoising-Aware Contrastive Learning for Noisy Time Series proposes a method to mitigate noise in self-supervised learning for time series data. This could improve the performance of AI systems in domains with noisy data, such as industrial monitoring, finance, and healthcare.

Authors:  Shuang Zhou, Daochen Zha, Xiao Shen, Xiao Huang, Rui Zhang, Fu-Lai Chung

Link:  https://arxiv.org/abs/2406.04627v1

Date: 2024-06-07

Summary:

Time series self-supervised learning (SSL) aims to exploit unlabeled data for pre-training to mitigate the reliance on labels. Despite the great success in recent years, there is limited discussion on the potential noise in the time series, which can severely impair the performance of existing SSL methods. To mitigate the noise, the de facto strategy is to apply conventional denoising methods before model training. However, this pre-processing approach may not fully eliminate the effect of noise in SSL for two reasons: (i) the diverse types of noise in time series make it difficult to automatically determine suitable denoising methods; (ii) noise can be amplified after mapping raw data into latent space. In this paper, we propose denoising-aware contrastive learning (DECL), which uses contrastive learning objectives to mitigate the noise in the representation and automatically selects suitable denoising methods for every sample. Extensive experiments on various datasets verify the effectiveness of our method. The code is open-sourced.
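
DECL's exact objectives are not spelled out above, so the snippet below is only a generic PyTorch sketch of a denoising-aware contrastive setup: a raw window and a denoised view of the same series form a positive pair under an InfoNCE-style loss. The encoder outputs, temperature, and pairing scheme are assumptions.

import torch
import torch.nn.functional as F

def info_nce(z_raw: torch.Tensor, z_denoised: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """Generic contrastive loss: each raw series embedding should be closest
    to the embedding of its own denoised view within the batch. This is a
    stand-in for DECL's denoising-aware objectives, not the paper's loss."""
    z_raw = F.normalize(z_raw, dim=-1)
    z_denoised = F.normalize(z_denoised, dim=-1)
    logits = z_raw @ z_denoised.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_raw.size(0))              # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random "embeddings" of a batch of 8 time-series windows.
loss = info_nce(torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())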

--------------------------------------------------------------------------------------------------------

Multi-layer Learnable Attention Mask for Multimodal Tasks

Multi-layer Learnable Attention Mask for Multimodal Tasks introduces a learnable attention mechanism to improve performance in diverse multimodal settings like movie understanding. It could enhance AI models' ability to process and integrate information from different modalities, benefiting applications like multimedia analysis and generation.

Authors:  Wayner Barrios, SouYoung Jin

Link:  https://arxiv.org/abs/2406.02761v1

Date: 2024-06-04

Summary:

While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the high computational demands of lengthy sequences. To address the challenges, we introduce the Learnable Attention Mask (LAM), strategically designed to globally regulate attention maps and prioritize critical tokens within the sequence. Leveraging the Self-Attention module in a BERT-like transformer network, our approach adeptly captures associations between tokens. The extension of the LAM to a multi-layer version accommodates the varied information aspects embedded at each layer of the Transformer network. Comprehensive experimental validation on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT, demonstrates the efficacy of the LAM, exemplifying its ability to enhance model performance while mitigating redundant computations. This pioneering approach presents a significant advancement in enhancing the understanding of complex scenarios, such as in movie understanding.
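
As a rough sketch of the core idea, the module below adds a trainable bias over token pairs to the attention logits before the softmax, letting the network learn to globally down-weight unimportant tokens. The parameterization and how LAM is wired into a BERT-like, multi-layer network are assumptions.

import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Self-attention with a learnable additive mask over token pairs.
    Only a sketch of the LAM idea: a trainable (L, L) bias is added to the
    attention logits so the model can globally regulate the attention map."""
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        # Learnable attention mask, initialized to zero (no masking at start).
        self.attn_mask = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        logits = logits + self.attn_mask[:L, :L]        # apply the learnable mask
        attn = logits.softmax(dim=-1)
        return self.proj(attn @ v)

out = MaskedSelfAttention(dim=64, max_len=128)(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])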

--------------------------------------------------------------------------------------------------------

Windex: Realtime Neural Whittle Indexing for Scalable Service Guarantees in NextG Cellular Networks

Windex presents a real-time resource allocation approach for next-generation cellular networks, enabling scalable service guarantees. It could improve the efficiency and quality of service in 5G and beyond networks, crucial for supporting emerging applications like autonomous vehicles and remote healthcare.

Authors:  Archana Bura, Ushasi Ghosh, Dinesh Bharadia, Srinivas Shakkottai

Link:  https://arxiv.org/abs/2406.01888v1

Date: 2024-06-04

Summary:

We address the resource allocation challenges in NextG cellular radio access networks (RAN), where heterogeneous user applications demand guarantees on throughput and service regularity. We leverage the Whittle indexability property to decompose the resource allocation problem, enabling the independent computation of relative priorities for each user. By simply allocating resources in decreasing order of these indices, we transform the combinatorial complexity of resource allocation into a linear one. We propose Windex, a lightweight approach for training neural networks to compute Whittle indices, considering constraint violation, channel quality, and system load. Implemented on a real-time RAN intelligent controller (RIC), our approach enables resource allocation decision times of less than 20 µs per user and efficiently allocates resources in each 1 ms scheduling time slot. Evaluation across standardized 3GPP service classes demonstrates significant improvements in service guarantees compared to existing schedulers, validated through simulations and emulations with over-the-air channel traces on a 5G testbed.
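
The allocate-by-index rule is simple enough to sketch. Below, a hypothetical neural network maps each user's state (constraint violation, channel quality, load) to a Whittle-style index, and resource blocks go to the users with the highest indices; the feature set, network shape, and block budget are illustrative, not the paper's implementation.

import torch
import torch.nn as nn

# Hypothetical index network: per-user features -> scalar priority index.
index_net = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))

def allocate(user_features: torch.Tensor, num_blocks: int) -> list[int]:
    """Allocate resource blocks to the users with the highest predicted
    indices, turning the combinatorial allocation into a simple sort."""
    with torch.no_grad():
        indices = index_net(user_features).squeeze(-1)   # one index per user
    order = torch.argsort(indices, descending=True)
    return order[:num_blocks].tolist()                   # users served this slot

# Toy slot: 6 users described by (violation, channel quality, load), 3 blocks.
features = torch.rand(6, 3)
print(allocate(features, num_blocks=3))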

--------------------------------------------------------------------------------------------------------

Effective Subset Selection Through The Lens of Neural Network Pruning

Effective Subset Selection Through The Lens of Neural Network Pruning explores using network pruning insights for data subset selection, which is important for efficient annotation in domains like medical imaging where annotation is expensive.

Authors:  Noga Bar, Raja Giryes

Link:  https://arxiv.org/abs/2406.01086v1

Date: 2024-06-03

Summary:

Having large amounts of annotated data significantly impacts the effectiveness of deep neural networks. However, the annotation task can be very expensive in some domains, such as medical data. Thus, it is important to select the data to be annotated wisely, which is known as the subset selection problem. We investigate the relationship between subset selection and neural network pruning, which is more widely studied, and establish a correspondence between them. Leveraging insights from network pruning, we propose utilizing the norm criterion of neural network features to improve subset selection methods. We empirically validate our proposed strategy on various networks and datasets, demonstrating enhanced accuracy. This shows the potential of employing pruning tools for subset selection.
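
A minimal sketch of the norm criterion, assuming it ranks samples by the L2 norm of their penultimate-layer features from a pretrained network and keeps the highest-norm samples for annotation; the actual criterion and selection rule in the paper may differ.

import numpy as np

def select_by_feature_norm(features: np.ndarray, budget: int) -> np.ndarray:
    """Pick `budget` samples whose feature vectors have the largest L2 norm.
    `features` is an (N, D) array of penultimate-layer activations from a
    pretrained network. Sketch only: the ranking mirrors magnitude-based
    pruning, not necessarily the paper's exact rule."""
    norms = np.linalg.norm(features, axis=1)
    return np.argsort(norms)[::-1][:budget]   # indices of the selected samples

# Toy usage: 1,000 samples with 512-dim features, keep 100 for annotation.
rng = np.random.default_rng(0)
selected = select_by_feature_norm(rng.normal(size=(1000, 512)), budget=100)
print(selected[:10])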

--------------------------------------------------------------------------------------------------------

A Synergistic Approach In Network Intrusion Detection By Neurosymbolic AI

A Synergistic Approach In Network Intrusion Detection By Neurosymbolic AI proposes incorporating neurosymbolic AI into network intrusion detection systems, leveraging the strengths of neural networks and symbolic reasoning. This could lead to more robust, interpretable, and adaptive cybersecurity systems.

Authors:  Alice Bizzarri, Chung-En Yu, Brian Jalaian, Fabrizio Riguzzi, Nathaniel D. Bastian

Link:  https://arxiv.org/abs/2406.00938v1

Date: 2024-06-03

Summary:

The prevailing approaches in Network Intrusion Detection Systems (NIDS) are often hampered by issues such as high resource consumption, significant computational demands, and poor interpretability. Furthermore, these systems generally struggle to identify novel, rapidly changing cyber threats. This paper delves into the potential of incorporating Neurosymbolic Artificial Intelligence (NSAI) into NIDS, combining deep learning's data-driven strengths with symbolic AI's logical reasoning to tackle the dynamic challenges in cybersecurity; it also provides a detailed introduction to NSAI techniques so that cyber professionals can explore the potential strengths of NSAI in NIDS. The inclusion of NSAI in NIDS marks potential advancements in both the detection and interpretation of intricate network threats, benefiting from the robust pattern recognition of neural networks and the interpretive prowess of symbolic reasoning. By analyzing network traffic data types and machine learning architectures, we illustrate NSAI's distinctive capability to offer more profound insights into network behavior, thereby improving both detection performance and the adaptability of the system. This merging of technologies not only enhances the functionality of traditional NIDS but also sets the stage for future developments in building more resilient, interpretable, and dynamic defense mechanisms against advanced cyber threats. The continued progress in this area is poised to transform NIDS into a system that is both responsive to known threats and anticipatory of emerging, unseen ones.

--------------------------------------------------------------------------------------------------------

Scaling Up Deep Clustering Methods Beyond ImageNet-1K

Scaling Up Deep Clustering Methods Beyond ImageNet-1K investigates the performance of deep clustering approaches on large-scale datasets, providing insights into their potential for real-world applications involving massive, imbalanced data.

Authors:  Nikolas Adaloglou, Felix Michels, Kaspar Senft, Diana Petrusheva, Markus Kollmann

Link:  https://arxiv.org/abs/2406.01203v1

Date: 2024-06-03

Summary:

Deep image clustering methods are typically evaluated on small-scale balanced classification datasets while feature-based k-means has been applied on proprietary billion-scale datasets. In this work, we explore the performance of feature-based deep clustering approaches on large-scale benchmarks whilst disentangling the impact of the following data-related factors: i) class imbalance, ii) class granularity, iii) easy-to-recognize classes, and iv) the ability to capture multiple classes. Consequently, we develop multiple new benchmarks based on ImageNet21K. Our experimental analysis reveals that feature-based k-means is often unfairly evaluated on balanced datasets. However, deep clustering methods outperform k-means across most large-scale benchmarks. Interestingly, k-means underperforms on easy-to-classify benchmarks by large margins. The performance gap, however, diminishes on the highest data regimes such as ImageNet21K. Finally, we find that non-primary cluster predictions capture meaningful classes (i.e. coarser classes).
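
For readers who want a concrete reference point, a feature-based k-means baseline of the kind discussed above can be approximated as follows: cluster frozen features with k-means and score the result against labels via Hungarian matching. The random features below are placeholders for real self-supervised embeddings.

import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(labels: np.ndarray, preds: np.ndarray) -> float:
    """Best one-to-one matching between predicted clusters and true classes."""
    k = max(labels.max(), preds.max()) + 1
    cost = np.zeros((k, k))
    for y, p in zip(labels, preds):
        cost[p, y] += 1
    row, col = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[row, col].sum() / len(labels)

# Placeholder features/labels standing in for frozen self-supervised features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 128))
labels = rng.integers(0, 10, size=500)
preds = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(feats)
print(cluster_accuracy(labels, preds))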

--------------------------------------------------------------------------------------------------------

Adaptive Sampling of k-Space in Magnetic Resonance for Rapid Pathology Prediction

Adaptive Sampling of k-Space in Magnetic Resonance proposes an adaptive sampling method for accelerating MR scans while maintaining diagnostic accuracy. This could make MR imaging more accessible for population-level disease surveillance and remote healthcare.

Authors:  Chen-Yu Yen, Raghav Singhal, Umang Sharma, Rajesh Ranganath, Sumit Chopra, Lerrel Pinto

Link:  https://arxiv.org/abs/2406.04318v1

Date: 2024-06-06

Summary:

Magnetic Resonance (MR) imaging, despite its proven diagnostic utility, remains an inaccessible imaging modality for disease surveillance at the population level. A major factor rendering MR inaccessible is lengthy scan times. An MR scanner collects measurements associated with the underlying anatomy in the Fourier space, also known as the k-space. Creating a high-fidelity image requires collecting large quantities of such measurements, increasing the scan time. Traditionally, to accelerate an MR scan, image reconstruction from under-sampled k-space data is the method of choice. However, recent works show the feasibility of bypassing image reconstruction and directly learning to detect disease from a sparser learned subset of the k-space measurements. In this work, we propose Adaptive Sampling for MR (ASMR), a sampling method that learns an adaptive policy to sequentially select k-space samples to optimize for target disease detection. On 6 out of 8 pathology classification tasks spanning the Knee, Brain, and Prostate MR scans, ASMR reaches within 2% of the performance of a fully sampled classifier while using only 8% of the k-space, as well as outperforming prior state-of-the-art work in k-space sampling such as EMRT, LOUPE, and DPS.
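
ASMR's policy architecture is not described here, so the loop below is only a schematic of sequential adaptive sampling: a hypothetical policy network scores the not-yet-acquired k-space lines given the measurements so far, the top line is acquired, and a classifier finally predicts pathology from the sparse measurements.

import torch
import torch.nn as nn

# Schematic only; network shapes and the 1-D k-space stand-in are assumptions.
num_lines, budget = 256, 20                     # acquire ~8% of k-space lines
policy = nn.Sequential(nn.Linear(2 * num_lines, 256), nn.ReLU(),
                       nn.Linear(256, num_lines))          # scores per line
classifier = nn.Sequential(nn.Linear(num_lines, 64), nn.ReLU(), nn.Linear(64, 2))

kspace = torch.randn(num_lines)                 # stand-in for one k-space readout
mask = torch.zeros(num_lines)                   # which lines have been acquired

with torch.no_grad():
    for _ in range(budget):
        state = torch.cat([kspace * mask, mask])    # measurements so far + mask
        scores = policy(state)
        scores[mask.bool()] = float("-inf")         # never re-acquire a line
        mask[scores.argmax()] = 1.0                 # acquire the highest-scoring line
    logits = classifier(kspace * mask)              # prediction from 8% of k-space

print(logits)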

--------------------------------------------------------------------------------------------------------

iKAN: Global Incremental Learning with KAN for Human Activity Recognition Across Heterogeneous Datasets

iKAN presents an incremental learning framework for human activity recognition across heterogeneous datasets, addressing catastrophic forgetting and non-uniform inputs. It could enable continual learning of new activities from diverse sensor modalities in real-world applications like health monitoring.

Authors:  Mengxi Liu, Sizhen Bian, Bo Zhou, Paul Lukowicz

Link:  https://arxiv.org/abs/2406.01646v1

Date: 2024-06-03

Summary:

This work proposes an incremental learning (IL) framework for wearable sensor human activity recognition (HAR) that tackles two challenges simultaneously: catastrophic forgetting and non-uniform inputs. The scalable framework, iKAN, pioneers IL with Kolmogorov-Arnold Networks (KAN), replacing multi-layer perceptrons as the classifier to leverage the local plasticity and global stability of splines. To adapt KAN for HAR, iKAN uses task-specific feature branches and a feature redistribution layer. Unlike existing IL methods that primarily adjust the output dimension or the number of classifier nodes to adapt to new tasks, iKAN focuses on expanding the feature extraction branches to accommodate new inputs from different sensor modalities while maintaining consistent dimensions and number of classifier outputs. Continual learning across six public HAR datasets demonstrated the iKAN framework's incremental learning performance, with a last performance of 84.9% (weighted F1 score) and an average incremental performance of 81.34%, significantly outperforming two existing incremental learning methods, EWC (51.42%) and experience replay (59.92%).
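
A structural sketch of the idea, with heavy assumptions: each new task or sensor modality gets its own feature branch mapped to a shared width, while the classifier head keeps fixed dimensions. A plain MLP stands in for the Kolmogorov-Arnold Network classifier, and the dataset names and dimensions are invented.

import torch
import torch.nn as nn

class IncrementalHAR(nn.Module):
    """Schematic of iKAN's structure: new sensor inputs get their own feature
    branch, every branch maps to a shared feature width, and the classifier
    head keeps fixed input/output dimensions. A plain MLP stands in here for
    the Kolmogorov-Arnold Network classifier."""
    def __init__(self, shared_dim: int = 64, num_classes: int = 20):
        super().__init__()
        self.branches = nn.ModuleDict()                      # one branch per task
        self.shared_dim = shared_dim
        self.classifier = nn.Sequential(nn.Linear(shared_dim, 128), nn.ReLU(),
                                        nn.Linear(128, num_classes))

    def add_task(self, name: str, input_dim: int) -> None:
        # Expand the feature-extraction side only; the classifier is untouched.
        self.branches[name] = nn.Sequential(nn.Linear(input_dim, self.shared_dim),
                                            nn.ReLU())

    def forward(self, task: str, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.branches[task](x))

model = IncrementalHAR()
model.add_task("imu_watch", input_dim=12)        # hypothetical dataset names/dims
model.add_task("pressure_insole", input_dim=32)
print(model("imu_watch", torch.randn(4, 12)).shape)   # torch.Size([4, 20])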

--------------------------------------------------------------------------------------------------------

Causal Inference in Randomized Trials with Partial Clustering and Imbalanced Dependence Structures

Causal Inference in Randomized Trials addresses challenges in analyzing data from trials where participants exhibit imbalanced dependence structures, proposing a robust estimation method. This could improve statistical power and validity of insights from clinical trials and other studies involving clustering or group interventions.

Authors:  Joshua R. Nugent, Elijah Kakande, Gabriel Chamie, Jane Kabami, Asiphas Owaraganise, Diane V. Havlir, Moses Kamya, Laura B. Balzer

Link:  https://arxiv.org/abs/2406.04505v1

Date: 2024-06-06

Summary:

In many randomized trials, participants are grouped into clusters, such as neighborhoods or schools, and these clusters are assumed to be the independent unit. This assumption, however, might not reflect the underlying dependence structure, with serious consequences to statistical power. First, consider a cluster randomized trial where participants are artificially grouped together for the purposes of randomization. For intervention participants the groups are the basis for intervention delivery, but for control participants the groups are dissolved. Second, consider an individually randomized group treatment trial where participants are randomized and then post-randomization, intervention participants are grouped together for intervention delivery, while the control participants continue with the standard of care. In both trial designs, outcomes among intervention participants will be dependent within each cluster, while outcomes for control participants will be effectively independent. We use causal models to non-parametrically describe the data generating process for each trial design and formalize the conditional independence in the observed data distribution. For estimation and inference, we propose a novel implementation of targeted minimum loss-based estimation (TMLE) accounting for partial clustering and the imbalanced dependence structure. TMLE is a model-robust approach, leverages covariate adjustment and machine learning to improve precision, and facilitates estimation of a large set of causal effects. In finite sample simulations, TMLE achieved comparable or markedly higher statistical power than common alternatives. Finally, application of TMLE to real data from the SEARCH-IPT trial resulted in 20-57% efficiency gains, demonstrating the real-world consequences of our proposed approach.

--------------------------------------------------------------------------------------------------------

Position: Cracking the Code of Cascading Disparity Towards Marginalized Communities

Position: Cracking the Code of Cascading Disparity highlights how disparities faced by marginalized communities in areas like performance, privacy and safety with AI systems are interconnected and can compound over time. It calls for mitigating these disparities at the source to ensure equitable AI development across different populations.

Authors:  Golnoosh Farnadi, Mohammad Havaei, Negar Rostamzadeh

Link:  https://arxiv.org/abs/2406.01757v1

Date: 2024-06-03

Summary:

The rise of foundation models holds immense promise for advancing AI, but this progress may amplify existing risks and inequalities, leaving marginalized communities behind. In this position paper, we discuss that disparities towards marginalized communities - performance, representation, privacy, robustness, interpretability and safety - are not isolated concerns but rather interconnected elements of a cascading disparity phenomenon. We contrast foundation models with traditional models and highlight the potential for exacerbated disparity against marginalized communities. Moreover, we emphasize the unique threat of cascading impacts in foundation models, where interconnected disparities can trigger long-lasting negative consequences, specifically to the people on the margin. We define marginalized communities within the machine learning context and explore the multifaceted nature of disparities. We analyze the sources of these disparities, tracing them from data creation, training and deployment procedures to highlight the complex technical and socio-technical landscape. To mitigate the pressing crisis, we conclude with a set of calls to action to mitigate disparity at its source.

--------------------------------------------------------------------------------------------------------

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

MeLFusion introduces a novel approach to synthesize music from both text and image cues using diffusion models. This multimodal capability could benefit creative applications like film scoring, by enabling musicians to compose based on visual scenes along with written scripts or descriptions.

Authors:  Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha

Link:  https://arxiv.org/abs/2406.04673v1

Date: 2024-06-07

Summary:

Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.

--------------------------------------------------------------------------------------------------------

REP: Resource-Efficient Prompting for On-device Continual Learning

REP proposes resource-efficient prompting methods for on-device continual learning with vision transformers, avoiding trade-offs between accuracy, computational cost and memory usage. This could enable deployable continual learning systems on resource-constrained devices for applications like mobile intelligence and IoT.

Authors:  Sungho Jeon, Xinyue Ma, Kwang In Kim, Myeongjae Jeon

Link:  https://arxiv.org/abs/2406.04772v1

Date: 2024-06-07

Summary:

On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical. This is extremely challenging because it must preserve accuracy while learning new tasks with continuously drifting data and maintain both high energy and memory efficiency to be deployable on real-world devices. Typically, a CL method leverages one of two types of backbone networks: CNN or ViT. It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance, making each option attractive only for a single aspect. In this paper, we revisit this comparison while embracing powerful pre-trained ViT models of various sizes, including ViT-Ti (5.8M parameters). Our detailed analysis reveals that many practical options exist today for making ViT-based methods more suitable for on-device CL, even when accuracy, energy, and memory are all considered. To further expand this impact, we introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods. Our key focus is on avoiding catastrophic trade-offs with accuracy while trimming computational and memory costs throughout the training process. We achieve this by exploiting swift prompt selection that enhances input data using a carefully provisioned model, and by developing two novel algorithms, adaptive token merging (AToM) and adaptive layer dropping (ALD), that optimize the prompt updating stage. In particular, AToM and ALD perform selective skipping across the data and model-layer dimensions without compromising task-specific features in vision transformer models. Extensive experiments on three image classification datasets validate REP's superior resource efficiency over current state-of-the-art methods.
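
AToM and ALD are not specified in detail above; the toy functions below convey the flavor of selective skipping along the data and layer dimensions: merging the most similar neighbouring tokens and skipping a subset of transformer blocks during the prompt-update pass. The merge rule and drop schedule are assumptions.

import torch
import torch.nn.functional as F

def merge_similar_tokens(x: torch.Tensor, num_merges: int) -> torch.Tensor:
    """Toy adaptive token merging: repeatedly average the most similar pair of
    neighbouring tokens, shrinking the sequence by one token per merge."""
    for _ in range(num_merges):
        sims = F.cosine_similarity(x[:, :-1], x[:, 1:], dim=-1)  # (B, L-1)
        i = int(sims.mean(dim=0).argmax())           # most redundant neighbour pair
        merged = (x[:, i] + x[:, i + 1]) / 2
        x = torch.cat([x[:, :i], merged.unsqueeze(1), x[:, i + 2:]], dim=1)
    return x

def forward_with_layer_drop(layers, x: torch.Tensor, drop_every: int = 3):
    """Toy adaptive layer dropping: skip every `drop_every`-th block during the
    prompt-update pass to save compute."""
    for idx, layer in enumerate(layers):
        if (idx + 1) % drop_every == 0:
            continue                                 # dropped layer
        x = layer(x)
    return x

blocks = torch.nn.ModuleList(torch.nn.Linear(64, 64) for _ in range(6))
tokens = merge_similar_tokens(torch.randn(2, 16, 64), num_merges=4)
print(forward_with_layer_drop(blocks, tokens).shape)  # torch.Size([2, 12, 64])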

--------------------------------------------------------------------------------------------------------

Think out Loud: Emotion Deducing Explanation in Dialogues

Think out Loud proposes the Emotion Deducing Explanation task for dialogues, where models explain perceived emotions by reasoning about potential causes and speaker's mental state. This capability could aid emotional intelligence in conversational AI assistants across domains like counseling and customer service.

Authors:  Jiangnan Li, Zheng Lin, Lanrui Wang, Qingyi Si, Yanan Cao, Mo Yu, Peng Fu, Weiping Wang, Jie Zhou

Link:  https://arxiv.org/abs/2406.04758v1

Date: 2024-06-07

Summary:

Humans convey emotions through daily dialogues, making emotion understanding a crucial step of affective intelligence. To understand emotions in dialogues, machines are first asked to recognize the emotion of an utterance (Emotion Recognition in Dialogues, ERD) and then, based on that emotion, to find the utterances that caused it (Emotion Cause Extraction in Dialogues, ECED). This setting performs ERD before ECED and ignores the mutual complement between emotion and cause, so newer tasks have been proposed to extract them simultaneously. Although research on these tasks has made strong progress, simply identifying emotion-related factors through classification modeling fails to capture, in an explainable way, the specific reasoning process by which causes stimulate an emotion. This reasoning process, especially as reflected in the capabilities of Large Language Models (LLMs), remains under-explored. To this end, we propose a new task, "Emotion Deducing Explanation in Dialogues" (EDEN), which recognizes emotions and causes through explicit reasoning: models must generate an explanation that first summarizes the causes, then uses common sense to analyze the inner activities of the speakers triggered by those causes, and finally infers the emotion accordingly. To support the study of EDEN, we construct two EDEN datasets through human annotation based on existing ECED resources. We further evaluate different models on EDEN and find that LLMs are more competent than conventional PLMs. Moreover, EDEN helps LLMs achieve better recognition of emotions and causes, opening a new research direction of explainable emotion understanding in dialogues.
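
As a rough illustration of the EDEN setup, the helper below assembles a prompt that follows the described order: summarize the causes, analyze the speaker's inner activities with common sense, then state the emotion. The wording is a guess, not the paper's prompt.

def build_eden_prompt(dialogue: list[str], target_utterance: str) -> str:
    """Compose an EDEN-style instruction: explain first, then name the emotion.
    The exact wording used in the paper is not given here; this is a guess."""
    history = "\n".join(f"Turn {i + 1}: {u}" for i, u in enumerate(dialogue))
    return (
        "Read the dialogue and reason out loud about the target utterance.\n"
        f"Dialogue:\n{history}\n"
        f"Target utterance: {target_utterance}\n"
        "Step 1: Summarize the causes that could trigger the speaker's feeling.\n"
        "Step 2: Using common sense, describe the speaker's inner activities.\n"
        "Step 3: Based on the above, state the speaker's emotion in one word."
    )

prompt = build_eden_prompt(
    ["A: I missed my train again.", "B: Oh no, that's the third time this week!"],
    "A: I missed my train again.",
)
print(prompt)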

--------------------------------------------------------------------------------------------------------

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Do Large Language Models Perform the Way People Expect? analyzes the mismatch between human expectations of where language models will perform well and their actual capabilities. Addressing this misalignment could improve trust and appropriate usage of language models across applications.

Authors:  Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan

Link:  https://arxiv.org/abs/2406.01382v1

Date: 2024-06-03

Summary:

What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that, especially for cases where the cost of mistakes is high, more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.
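
The paper shows the human generalization function can be predicted with NLP methods; as a toy stand-in only, the sketch below fits a bag-of-words logistic regression that, given a seen question and a new question, predicts whether people would expect the LLM to succeed on the new one. The features, the tiny dataset, and the [SEP] formatting are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (seen question, new question) pairs joined into one string, with a label
# for whether a human would expect the LLM to also succeed on the new question.
pairs = [
    "what is 12 times 9 [SEP] what is 7 times 8",
    "what is 12 times 9 [SEP] who wrote hamlet",
    "translate hello to french [SEP] translate goodbye to spanish",
    "translate hello to french [SEP] prove the riemann hypothesis",
]
human_expects_success = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(pairs, human_expects_success)
print(model.predict(["what is 12 times 9 [SEP] what is 15 times 6"]))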

--------------------------------------------------------------------------------------------------------

Leveraging automatic strategy discovery to teach people how to select better projects

Leveraging automatic strategy discovery teaches people optimized decision strategies for project selection through an intelligent tutor. This approach could enhance human decision-making skills and outcomes in domains like investment, resource allocation and strategic planning.

Authors:  Lovis Heindrich, Falk Lieder

Link:  https://arxiv.org/abs/2406.04082v1

Date: 2024-06-06

Summary:

The decisions of individuals and organizations are often suboptimal because normative decision strategies are too demanding in the real world. Recent work suggests that some errors can be prevented by leveraging artificial intelligence to discover and teach prescriptive decision strategies that take people's constraints into account. So far, this line of research has been limited to simplified decision problems. This article is the first to extend this approach to a real-world decision problem, namely project selection. We develop a computational method (MGPS) that automatically discovers project selection strategies that are optimized for real people and develop an intelligent tutor that teaches the discovered strategies. We evaluated MGPS on a computational benchmark and tested the intelligent tutor in a training experiment with two control conditions. MGPS outperformed a state-of-the-art method and was more computationally efficient. Moreover, the intelligent tutor significantly improved people's decision strategies. Our results indicate that our method can improve human decision-making in naturalistic settings similar to real-world project selection, a first step towards applying strategy discovery to the real world.

--------------------------------------------------------------------------------------------------------

Towards AI-Assisted Sustainable Adaptive Video Streaming Systems: Tutorial and Survey

Towards AI-Assisted Sustainable Adaptive Video Streaming surveys AI techniques to improve energy efficiency across the video streaming lifecycle, from encoding to delivery and quality assessment. Sustainable optimizations could reduce the environmental footprint of video streaming applications as usage grows.

Authors:  Reza Farahani, Zoha Azimi, Christian Timmerer, Radu Prodan

Link:  https://arxiv.org/abs/2406.02302v1

Date: 2024-06-04

Summary:

Improvements in networking technologies and the steadily increasing numbers of users, as well as the shift from traditional broadcasting to streaming content over the Internet, have made video applications (e.g., live and Video-on-Demand (VoD)) predominant sources of traffic. Recent advances in Artificial Intelligence (AI) and its widespread application in various academic and industrial fields have focused on designing and implementing a variety of video compression and content delivery techniques to improve user Quality of Experience (QoE). However, providing high QoE services results in more energy consumption and carbon footprint across the service delivery path, extending from the end user's device through the network and service infrastructure (e.g., cloud providers). Despite the importance of energy efficiency in video streaming, there is a lack of comprehensive surveys covering state-of-the-art AI techniques and their applications throughout the video streaming lifecycle. Existing surveys typically focus on specific parts, such as video encoding, delivery networks, playback, or quality assessment, without providing a holistic view of the entire lifecycle and its impact on energy consumption and QoE. Motivated by this research gap, this survey provides a comprehensive overview of the video streaming lifecycle, content delivery, energy and Video Quality Assessment (VQA) metrics and models, and AI techniques employed in video streaming. In addition, it conducts an in-depth state-of-the-art analysis focused on AI-driven approaches to enhance the energy efficiency of end-to-end aspects of video streaming systems (i.e., encoding, delivery network, playback, and VQA approaches). Finally, it discusses prospective research directions for developing AI-assisted energy-aware video streaming systems.

--------------------------------------------------------------------------------------------------------

3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs

3D-GRAND introduces a large-scale dataset of 3D scenes and instructions to improve language grounding and reduce hallucinations in 3D language models. This could benefit embodied AI agents and robots operating in real-world environments based on multimodal instructions.

Authors:  Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai

Link:  https://arxiv.org/abs/2406.05132v1

Date: 2024-06-07

Summary:

The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io
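
3D-POPE's protocol is not detailed above, but POPE-style hallucination probes typically ask yes/no object-existence questions and score the answers; the sketch below computes accuracy and yes-rate under that assumed setup.

def hallucination_report(answers: list[str], ground_truth: list[bool]) -> dict:
    """Score yes/no existence probes: `answers` are a 3D-LLM's replies to
    'Is there a <object> in this room?' and `ground_truth` says whether the
    object really is in the scene. A high yes-rate with low accuracy suggests
    hallucination. This mirrors POPE-style probing, not 3D-POPE exactly."""
    preds = [a.strip().lower().startswith("yes") for a in answers]
    correct = sum(p == g for p, g in zip(preds, ground_truth))
    return {
        "accuracy": correct / len(preds),
        "yes_rate": sum(preds) / len(preds),
    }

# Toy example: the model claims a sofa and a piano exist; only the sofa does.
print(hallucination_report(["Yes, there is a sofa.", "Yes, near the window.", "No."],
                           [True, False, False]))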

--------------------------------------------------------------------------------------------------------

AMOSL: Adaptive Modality-wise Structure Learning in Multi-view Graph Neural Networks For Enhanced Unified Representation

AMOSL proposes an adaptive structure learning approach for multi-view graph neural networks to better capture modality discrepancies and enhance representation learning. This could improve multi-modal reasoning capabilities for applications leveraging diverse data sources like social networks and sensor fusion.

Authors:  Peiyu Liang, Hongchang Gao, Xubin He

Link:  https://arxiv.org/abs/2406.02348v1

Date: 2024-06-04

Summary:

While Multi-view Graph Neural Networks (MVGNNs) excel at leveraging diverse modalities for learning object representations, existing methods assume identical local topology structures across modalities, overlooking real-world discrepancies. This causes MVGNNs to struggle with modality fusion and representation denoising. To address these issues, we propose adaptive modality-wise structure learning (AMoSL). AMoSL captures node correspondences between modalities via optimal transport and jointly learns them with the graph embeddings. To enable efficient end-to-end training, we employ an efficient solution for the resulting complex bilevel optimization problem. Furthermore, AMoSL adapts to downstream tasks through unsupervised learning on inter-modality distances. The effectiveness of AMoSL is demonstrated by its ability to train more accurate graph classifiers on six benchmark datasets.
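
The optimal-transport correspondence step can be sketched generically: the function below computes a soft node matching between two modalities' feature matrices with entropic (Sinkhorn) optimal transport on a cosine cost. The cost choice, regularization, and how the plan feeds the bilevel graph learning are assumptions.

import torch
import torch.nn.functional as F

def sinkhorn_correspondence(x_a: torch.Tensor, x_b: torch.Tensor,
                            eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Soft node correspondence between two modalities via entropic OT.
    Returns a transport plan whose (i, j) entry is the matching weight between
    node i in modality A and node j in modality B. Generic sketch only, not
    AMoSL's exact formulation."""
    cost = 1.0 - F.cosine_similarity(x_a.unsqueeze(1), x_b.unsqueeze(0), dim=-1)
    K = torch.exp(-cost / eps)                              # Gibbs kernel
    u = torch.ones(x_a.size(0)) / x_a.size(0)               # uniform marginals
    v = torch.ones(x_b.size(0)) / x_b.size(0)
    a, b = u.clone(), v.clone()
    for _ in range(iters):                                  # Sinkhorn scaling
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return torch.diag(a) @ K @ torch.diag(b)                # transport plan

plan = sinkhorn_correspondence(torch.randn(5, 16), torch.randn(7, 16))
print(plan.shape, plan.sum().item())   # (5, 7), total mass ~= 1.0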

--------------------------------------------------------------------------------------------------------

