Week Ending 7.13.2025

 

RESEARCH WATCH: 7.13.2025

 

Recognizing Dementia from Neuropsychological Tests with State Space Models

Early dementia detection represents a critical healthcare challenge, as timely intervention can significantly improve patient outcomes and quality of life. Traditional neuropsychological assessments rely heavily on manual scoring, creating bottlenecks in clinical workflows and potential inconsistencies. This research introduces Demenba, an innovative automatic dementia classification system that leverages state space models to analyze speech patterns from cognitive tests. By processing over 1,000 hours of assessments from the Framingham Heart Study, the system achieves superior performance while maintaining computational efficiency. The framework's integration with large language models enhances transparency and scalability, potentially revolutionizing clinical dementia screening processes and enabling earlier, more accurate diagnosis in healthcare settings.

Authors:  Liming Wang, Saurabhchand Bhati, Cody Karjadi, Rhoda Au, James Glass

Link:  https://arxiv.org/abs/2507.10311v1

Date: 2025-07-d

Summary:

Early detection of dementia is critical for timely medical intervention and improved patient outcomes. Neuropsychological tests are widely used for cognitive assessment but have traditionally relied on manual scoring. Automatic dementia classification (ADC) systems aim to infer cognitive decline directly from speech recordings of such tests. We propose Demenba, a novel ADC framework based on state space models, which scale linearly in memory and computation with sequence length. Trained on over 1,000 hours of cognitive assessments administered to Framingham Heart Study participants, some of whom were diagnosed with dementia through adjudicated review, our method outperforms prior approaches in fine-grained dementia classification by 21%, while using fewer parameters. We further analyze its scaling behavior and demonstrate that our model gains additional improvement when fused with large language models, paving the way for more transparent and scalable dementia assessment tools. Code: https://anonymous.4open.science/r/Demenba-0861
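
The core computational idea, a state space layer whose per-step update cost is constant so total cost grows linearly with sequence length, can be sketched in a few lines. The following is a toy illustration under assumed shapes and a placeholder classification head, not Demenba's architecture:

    # Illustrative sketch only: a minimal diagonal linear state-space scan whose
    # cost grows linearly with sequence length, pooled into a binary classifier.
    # All shapes and parameters here are hypothetical, not Demenba's.
    import numpy as np

    def ssm_scan(x, A, B, C):
        """x: (T, d_in) acoustic features; A: (d_state,) decays; B: (d_state, d_in); C: (d_out, d_state)."""
        T = x.shape[0]
        h = np.zeros(A.shape[0])
        outputs = np.zeros((T, C.shape[0]))
        for t in range(T):                    # one O(1) update per frame -> O(T) overall
            h = A * h + B @ x[t]              # diagonal recurrence keeps memory constant
            outputs[t] = C @ h
        return outputs

    rng = np.random.default_rng(0)
    T, d_in, d_state, d_out = 1000, 40, 64, 16           # e.g., 1000 frames of 40-dim features
    x = rng.normal(size=(T, d_in))
    A = np.exp(-rng.uniform(0.01, 0.5, size=d_state))    # stable decays in (0, 1)
    B = rng.normal(scale=0.1, size=(d_state, d_in))
    C = rng.normal(scale=0.1, size=(d_out, d_state))
    w = rng.normal(scale=0.1, size=d_out)

    features = ssm_scan(x, A, B, C).mean(axis=0)          # mean-pool over time
    p_dementia = 1.0 / (1.0 + np.exp(-(w @ features)))    # placeholder binary head
    print(f"predicted probability: {p_dementia:.3f}")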

--------------------------------------------------------------------------------------------------------

Evolution of Fear and Social Rewards in Prey-Predator Relationship

Understanding the evolutionary origins of fear and social behaviors provides crucial insights into fundamental survival mechanisms across species. This research employs distributed evolutionary simulation to investigate how fear responses and social rewards co-evolved under predator pressure. The study reveals surprising findings about the primacy of social reward systems over fear responses in prey survival strategies. Through reinforcement learning-based simulations, researchers demonstrate that social cohesion emerges as a more critical survival factor than direct predator avoidance. These findings have significant implications for understanding animal behavior, designing AI systems that model natural selection, developing conservation strategies, and potentially informing therapeutic approaches for anxiety and social disorders in humans.

Authors:  Yuji Kanagawa, Kenji Doya

Link:  https://arxiv.org/abs/2507.09992v1

Date: 2025-07-d

Summary:

Fear is a critical brain function for detecting danger and learning to avoid specific stimuli that can lead to danger. While fear is believed to have evolved under pressure from predators, reproducing this evolution experimentally is challenging. To investigate the relationship between environmental conditions, the evolution of fear, and the evolution of other rewards, such as food reward and social reward, we developed a distributed evolutionary simulation. In our simulation, prey and predator agents co-evolve their innate reward functions, including a possibly fear-like term for observing predators, and learn behaviors via reinforcement learning. Surprisingly, our simulation revealed that social reward for observing the same species is more important for prey survival, and that fear-like negative reward for observing predators evolves only after social reward is acquired. We also found that predators with increased hunting ability (a larger mouth) amplified the emergence of fear, but that fear evolution is more stable with non-evolving predators that are poor at chasing prey. Additionally, unlike for predators, we found that positive rewards evolve in opposition to fear for stationary threats, as areas with abundant leftover food develop around them. These findings suggest that fear and social reward have had a complex interplay with each other through evolution, shaped by the nature of predators and threats.
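
The simulation structure described above, innate reward weights evolving across generations while behavior is learned within each lifetime, can be caricatured with a toy evolutionary loop. Everything here (the fitness stub, weights, constants) is an illustrative assumption, not the authors' simulator:

    # Hedged sketch of the outer evolutionary loop: each prey genome encodes innate
    # reward weights (food, social, predator), a stubbed "lifetime" stands in for
    # within-lifetime reinforcement learning, and selection evolves the weights.
    import random

    def lifetime_fitness(genome):
        food_w, social_w, predator_w = genome
        # Stub standing in for an RL lifetime; rewards survival-promoting weight patterns.
        return 2.0 * social_w + 0.3 * food_w - 0.5 * abs(predator_w + 1.0) + random.gauss(0, 0.1)

    def evolve(pop_size=50, generations=100, sigma=0.1):
        population = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(pop_size)]
        for _ in range(generations):
            parents = sorted(population, key=lifetime_fitness, reverse=True)[: pop_size // 2]
            population = [[g + random.gauss(0, sigma) for g in random.choice(parents)]
                          for _ in range(pop_size)]
        return max(population, key=lifetime_fitness)

    print("evolved reward weights (food, social, predator):", evolve())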

--------------------------------------------------------------------------------------------------------

Enhancing Clinical Text Classification via Fine-Tuned DRAGON Longformer Models

Clinical text processing remains a significant bottleneck in healthcare informatics, where vast amounts of unstructured medical data require efficient classification and analysis. This research optimizes the DRAGON Longformer model specifically for clinical text classification tasks, addressing the unique challenges of medical terminology and extended document lengths. By extending sequence processing capabilities and implementing domain-specific preprocessing, the enhanced model achieves substantial performance improvements across key metrics. The 13.2% accuracy improvement and enhanced precision demonstrate the model's potential for real-world clinical applications. This advancement could streamline medical record processing, improve diagnostic coding accuracy, support clinical decision-making, and enable more efficient healthcare data management systems across diverse medical specialties.

Authors:  Mingchuan Yang, Ziyuan Huang

Link:  https://arxiv.org/abs/2507.09470v1

Date: 2025-07-d

Summary:

This study explores the optimization of the DRAGON Longformer base model for clinical text classification, specifically targeting the binary classification of medical case descriptions. A dataset of 500 clinical cases containing structured medical observations was used, with 400 cases for training and 100 for validation. Enhancements to the pre-trained joeranbosma/dragon-longformer-base-mixed-domain model included hyperparameter tuning, domain-specific preprocessing, and architectural adjustments. Key modifications involved increasing sequence length from 512 to 1024 tokens, adjusting learning rates from 1e-05 to 5e-06, extending training epochs from 5 to 8, and incorporating specialized medical terminology. The optimized model achieved notable performance gains: accuracy improved from 72.0% to 85.2%, precision from 68.0% to 84.1%, recall from 75.0% to 86.3%, and F1-score from 71.0% to 85.2%. Statistical analysis confirmed the significance of these improvements (p < .001). The model demonstrated enhanced capability in interpreting medical terminology, anatomical measurements, and clinical observations. These findings contribute to domain-specific language model research and offer practical implications for clinical natural language processing applications. The optimized model's strong performance across diverse medical conditions underscores its potential for broad use in healthcare settings.
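
A fine-tuning run with the reported configuration (the joeranbosma/dragon-longformer-base-mixed-domain checkpoint, 1024-token inputs, a 5e-6 learning rate, 8 epochs) could look roughly like the sketch below using the Hugging Face Trainer; the dataset loading is a hypothetical placeholder and the study's actual preprocessing may differ:

    # Sketch of the reported configuration; data loading is a toy stand-in for the
    # 400 training / 100 validation clinical cases used in the paper.
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import Dataset

    model_id = "joeranbosma/dragon-longformer-base-mixed-domain"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

    # Placeholder clinical case descriptions and binary labels.
    train_ds = Dataset.from_dict({"text": ["example case description ..."] * 8, "label": [0, 1] * 4})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=1024, padding="max_length")

    train_ds = train_ds.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir="dragon-clinical",
        learning_rate=5e-6,            # lowered from 1e-5 per the summary
        num_train_epochs=8,            # extended from 5
        per_device_train_batch_size=2,
    )
    Trainer(model=model, args=args, train_dataset=train_ds).train()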

--------------------------------------------------------------------------------------------------------

MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing

Remote sensing applications face unique challenges in processing multi-modal satellite imagery, including optical and synthetic aperture radar data from the same geographical regions. Traditional contrastive learning approaches often struggle with the high inter-class similarity and complex boundaries characteristic of Earth observation data. MoSAiC addresses these limitations by introducing a unified framework that optimizes both intra- and inter-modality contrastive learning with multi-label supervision. The system demonstrates superior performance in low-label scenarios and high-class-overlap situations, making it particularly valuable for environmental monitoring, agricultural assessment, urban planning, and disaster response applications. By enabling more robust representation learning across spectrally similar classes, MoSAiC could significantly advance automated Earth observation systems and environmental surveillance capabilities.

Authors:  Debashis Gupta, Aditi Golder, Rongkhun Zhu, Kangning Cui, Wei Tang, Fan Yang, Ovidiu Csillik, Sarra Alaqahtani, V. Paul Pauca

Link:  https://arxiv.org/abs/2507.08683v1

Date: 2025-07-d

Summary:

Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without the reliance on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning -- especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we introduce MoSAiC, a unified framework that jointly optimizes intra- and inter-modality contrastive learning with a multi-label supervised contrastive loss. Designed specifically for multi-modal satellite imagery, MoSAiC enables finer semantic disentanglement and more robust representation learning across spectrally similar and spatially complex classes. Experiments on two benchmark datasets, BigEarthNet V2.0 and Sent12MS, show that MoSAiC consistently outperforms both fully supervised and self-supervised baselines in terms of accuracy, cluster coherence, and generalization in low-label and high-class-overlap scenarios.
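
A minimal sketch of a multi-label supervised contrastive loss across two modalities (optical and SAR embeddings of the same tiles), with positives weighted by label overlap, conveys the flavor of the objective; this is an illustration of the idea, not MoSAiC's exact loss:

    # Toy multi-label supervised contrastive loss: embeddings from both modalities are
    # pooled, and pair weights come from Jaccard overlap of their multi-hot labels.
    import torch
    import torch.nn.functional as F

    def multilabel_supcon(z_opt, z_sar, labels, tau=0.1):
        """z_opt, z_sar: (N, d) L2-normalized embeddings; labels: (N, L) multi-hot."""
        z = torch.cat([z_opt, z_sar], dim=0)                    # (2N, d)
        y = torch.cat([labels, labels], dim=0).float()          # (2N, L)
        sim = z @ z.t() / tau                                   # similarities / temperature
        inter = y @ y.t()
        union = y.sum(1, keepdim=True) + y.sum(1) - inter
        pos_w = inter / union.clamp(min=1)                      # Jaccard overlap as positive weight
        eye = torch.eye(z.size(0), dtype=torch.bool)
        sim = sim.masked_fill(eye, -1e9)                        # exclude self-pairs
        pos_w = pos_w.masked_fill(eye, 0.0)
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        loss = -(pos_w * log_prob).sum(1) / pos_w.sum(1).clamp(min=1e-6)
        return loss.mean()

    z_opt = F.normalize(torch.randn(8, 128), dim=1)
    z_sar = F.normalize(torch.randn(8, 128), dim=1)
    labels = (torch.rand(8, 19) > 0.8).long()                   # e.g., 19 BigEarthNet-style classes
    print(multilabel_supcon(z_opt, z_sar, labels))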

--------------------------------------------------------------------------------------------------------

Why this and not that? A Logic-based Framework for Contrastive Explanations

Explainable AI systems increasingly require the ability to provide contrastive explanations that address questions of the form "Why P but not Q?" This research develops a comprehensive logical framework for generating such explanations, moving beyond simple feature attribution to explicit causal comparison. The framework captures cardinality-minimal versions of existing contrastive explanation methods while providing computational complexity analysis for practical implementation. By computing causes for both positive and negative outcomes, the system enables more nuanced understanding of AI decision-making processes. This work has significant implications for healthcare diagnostics, legal AI systems, financial decision-making, and any domain where understanding the reasoning behind automated decisions is crucial for trust and accountability.

Authors:  Tobias Geibinger, Reijo Jaakkola, Antti Kuusisto, Xinghan Liu, Miikka Vilander

Link:  https://arxiv.org/abs/2507.08454v1

Date: 2025-07-d

Summary:

We define several canonical problems related to contrastive explanations, each answering a question of the form "Why P but not Q?". The problems compute causes for both P and Q, explicitly comparing their differences. We investigate the basic properties of our definitions in the setting of propositional logic. We show, inter alia, that our framework captures a cardinality-minimal version of existing contrastive explanations in the literature. Furthermore, we provide an extensive analysis of the computational complexities of the problems. We also implement the problems for CNF formulas using answer set programming and present several examples demonstrating how they work in practice.
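
A brute-force toy example over a hypothetical three-variable rule gives a feel for the "Why P but not Q?" question: find cardinality-minimal sets of observed literals that force outcome P (and hence rule out Q). The paper's definitions are more general and are implemented with answer set programming rather than enumeration:

    # Toy illustration only: a hypothetical propositional rule and an exhaustive search
    # for minimal sufficient explanations of its outcome on one observed assignment.
    from itertools import combinations, product

    VARS = ["fever", "cough", "rash"]

    def outcome(a):                      # hypothetical rule: "flu" vs "allergy"
        return "flu" if a["fever"] and a["cough"] else "allergy"

    def forces(partial, target):
        """Does fixing only `partial` guarantee outcome == target for all completions?"""
        free = [v for v in VARS if v not in partial]
        return all(outcome({**partial, **dict(zip(free, bits))}) == target
                   for bits in product([False, True], repeat=len(free)))

    observed = {"fever": True, "cough": True, "rash": False}   # classified as "flu"
    for k in range(1, len(VARS) + 1):
        causes = [dict((v, observed[v]) for v in subset)
                  for subset in combinations(VARS, k)
                  if forces(dict((v, observed[v]) for v in subset), "flu")]
        if causes:
            print(f"minimal explanations for 'flu' rather than 'allergy': {causes}")
            break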

--------------------------------------------------------------------------------------------------------

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Audio understanding across speech, sound, and music represents a significant frontier in multimodal AI development. Audio Flamingo 3 introduces comprehensive capabilities including novel joint representation learning, chain-of-thought reasoning, and extended audio processing up to 10 minutes. The model's unified encoder and flexible thinking mechanisms enable sophisticated audio analysis and generation tasks. Applications span voice assistants, music analysis, accessibility technologies, content creation, and educational tools. The fully open-source nature promotes research accessibility while the multi-turn conversation capabilities enable more natural human-AI interaction. The model's ability to handle diverse audio modalities makes it valuable for entertainment industry applications, hearing assistance technologies, and automated content moderation systems requiring nuanced audio understanding.

Authors:  Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro

Link:  https://arxiv.org/abs/2507.08128v1

Date: 2025-07-d

Summary:

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

--------------------------------------------------------------------------------------------------------

Scaling RL to Long Videos

Video understanding at extended temporal scales presents significant computational and reasoning challenges for current vision-language models. This research introduces a comprehensive framework addressing long video reasoning through reinforcement learning, incorporating a large-scale dataset of 52K video question-answer pairs with high-quality reasoning annotations. The Multi-modal Reinforcement Sequence Parallelism system enables efficient training on hour-long videos while maintaining strong performance across diverse reasoning tasks. Applications include video content analysis, educational content creation, sports analytics, entertainment industry tools, and surveillance systems. The framework's ability to process extended temporal sequences opens possibilities for automated video summarization, content recommendation systems, and sophisticated video editing tools that require understanding of long-term narrative structures and temporal relationships.

Authors:  Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

Link:  https://arxiv.org/abs/2507.07966v1

Date: 2025-07-d

Summary:

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
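
One concrete ingredient named above, caching video embeddings so repeated RL rollouts over the same long video avoid re-encoding frames, can be sketched as below; the encoder and rollout are placeholders, not LongVILA-R1 code:

    # Hedged sketch of embedding caching for RL rollouts; the frame encoder and the
    # "rollout" are toy stand-ins for the expensive vision tower and the policy model.
    import hashlib
    import numpy as np

    _EMB_CACHE = {}

    def encode_video(frames):
        """Pretend frame encoder; in practice this is the expensive vision tower."""
        return np.stack([np.full(8, float(f)) for f in frames])

    def cached_embeddings(video_id, frames):
        key = hashlib.sha1(f"{video_id}:{len(frames)}".encode()).hexdigest()
        if key not in _EMB_CACHE:                   # encode once, reuse for every rollout
            _EMB_CACHE[key] = encode_video(frames)
        return _EMB_CACHE[key]

    def rollout(video_id, frames, question):
        emb = cached_embeddings(video_id, frames)   # cheap after the first call
        # ... prefill the language model with `emb` and sample a chain-of-thought answer ...
        return f"answer to {question!r} using {emb.shape[0]} cached frame embeddings"

    frames = list(range(3600))                      # e.g., an hour-long video at 1 fps
    for _ in range(4):                              # multiple RL rollouts, one encode
        print(rollout("vid-001", frames, "why did the player celebrate?"))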

--------------------------------------------------------------------------------------------------------

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

GPU resource management in containerized inference environments presents complex challenges due to dynamic workloads and expensive hardware utilization. Traditional auto-scaling mechanisms struggle with bursty traffic patterns and lack integration with GPU-specific metrics. KIS-S addresses these limitations through a unified simulation and reinforcement learning framework that learns optimal scaling policies without requiring real-world experimentation. The system demonstrates significant improvements in latency reduction and resource efficiency. Applications include cloud service providers, machine learning platform operations, real-time inference systems, and any GPU-intensive workload requiring dynamic scaling. The framework's ability to generalize across different traffic patterns makes it valuable for cost optimization in cloud computing environments and improved quality of service in AI-powered applications.

Authors:  Guilin Zhang, Wulan Guo, Ziqi Tan, Qiang Guan, Hailong Jiang

Link:  https://arxiv.org/abs/2507.07932v1

Date: 2025-07-d

Summary:

Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation, and is directly deployed without retraining. Experiments across four traffic patterns show that KIScaler improves average reward by 75.2%, reduces P95 latency by up to 6.7x over CPU baselines, and generalizes without retraining. Our work bridges the gap between reactive autoscaling and intelligent orchestration for scalable GPU-accelerated environments.
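
The RL formulation can be illustrated with a toy Gym-style scaling environment: observations expose latency and utilization, actions add or remove replicas, and the reward trades latency against GPU cost. The dynamics below are stand-ins, not KISim:

    # Illustrative scaling environment; a PPO agent (e.g., from a standard RL library)
    # would replace the random policy at the bottom. All dynamics are toy assumptions.
    import numpy as np

    class ToyScalingEnv:
        def __init__(self, max_replicas=8):
            self.max_replicas = max_replicas
            self.reset()

        def reset(self):
            self.replicas, self.t = 1, 0
            return self._obs()

        def _obs(self):
            load = 1.0 + np.sin(self.t / 10.0)                 # bursty traffic proxy
            p95_latency = load / max(self.replicas, 1)
            gpu_util = min(1.0, load / max(self.replicas, 1))
            return np.array([p95_latency, gpu_util, self.replicas], dtype=np.float32)

        def step(self, action):                                # action in {-1, 0, +1}
            self.replicas = int(np.clip(self.replicas + action, 1, self.max_replicas))
            self.t += 1
            latency, _, _ = self._obs()
            reward = -latency - 0.1 * self.replicas            # latency vs. cost trade-off
            return self._obs(), reward, self.t >= 200, {}

    env = ToyScalingEnv()
    obs, total = env.reset(), 0.0
    for _ in range(200):                                       # random policy baseline
        obs, r, done, _ = env.step(np.random.choice([-1, 0, 1]))
        total += r
    print("episode return under random scaling:", round(total, 2))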

--------------------------------------------------------------------------------------------------------

MIRA: A Novel Framework for Fusing Modalities in Medical RAG

Medical diagnosis assistance through multimodal large language models faces critical challenges in maintaining factual accuracy while integrating diverse information sources. MIRA introduces a sophisticated retrieval-augmented generation framework specifically designed for medical applications, addressing the delicate balance between comprehensive information retrieval and avoiding misleading content. The system's calibrated approach to context management and integration of visual and textual medical data represents a significant advancement in AI-assisted healthcare. Applications include clinical decision support systems, medical education tools, diagnostic assistance platforms, and automated medical report generation. The framework's focus on factual accuracy makes it particularly valuable for healthcare environments where precision is paramount, potentially improving diagnostic accuracy and reducing medical errors.

Authors:  Jinhong Wang, Tajamul Ashraf, Zongyan Han, Jorma Laaksonen, Rao Mohammad Anwer

Link:  https://arxiv.org/abs/2507.07902v1

Date: 2025-07-d

Summary:

Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.
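
The calibrated-retrieval intuition, widen the retrieved context only while the model remains unconfident, can be sketched as a small loop; the retrieve and generate stubs below are hypothetical placeholders, not MIRA's modules:

    # Toy adaptive-k retrieval loop: start small, widen only while self-assessed
    # confidence stays below a threshold, to avoid piling on distracting context.
    def retrieve(query, k):
        corpus = ["guideline snippet A", "radiology atlas entry B", "case report C", "textbook page D"]
        return corpus[:k]

    def generate(query, contexts):
        # Stand-in for the multimodal LLM; returns (answer, confidence in [0, 1]).
        confidence = min(1.0, 0.4 + 0.2 * len(contexts))
        return f"draft answer using {len(contexts)} sources", confidence

    def answer_with_calibrated_rag(query, k_min=1, k_max=4, threshold=0.8):
        for k in range(k_min, k_max + 1):
            answer, conf = generate(query, retrieve(query, k))
            if conf >= threshold:              # stop before adding potentially misleading context
                return answer, k
        return answer, k_max                   # fall back to the widest context tried

    print(answer_with_calibrated_rag("What does this chest X-ray finding suggest?"))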

--------------------------------------------------------------------------------------------------------

ODIA: Oriented Distillation for Inline Acceleration of LLM-based Function Calling

Function calling capabilities in large language models enable powerful API interactions but suffer from significant latency issues that impact user experience. ODIA introduces a novel knowledge distillation approach that identifies simple queries from production traffic and creates smaller, faster models for routine tasks. The system achieves substantial latency reduction while maintaining accuracy through automated data collection and model updating. Applications include customer service automation, API integration platforms, mobile applications requiring fast response times, and any system where LLM-based function calling is bottlenecked by latency. The method's minimal human intervention requirement and continuous improvement capabilities make it particularly valuable for production environments requiring both speed and reliability in automated system interactions.

Authors:  Hanlong Zhang, Jingsheng Yang, Hao Li, Yuhao He, Franck Gong

Link:  https://arxiv.org/abs/2507.08877v1

Date: 2025-07-d

Summary:

Function Calling is a crucial technique that enables Large Language Models (LLMs) to interact with external systems through APIs. However, the high latency associated with LLM-based Function Calling significantly impacts user experience. This paper presents a novel approach called Oriented Distillation for Inline Acceleration (ODIA) that leverages online user interaction data to accelerate Function Calling. By automatically identifying "simple queries" from production traffic and distilling knowledge from larger models to smaller ones, our method reduces response latency by 45% (expected) and 78% (median) while maintaining accuracy. We demonstrate the effectiveness of our approach through real-world deployment in a music application, where the smaller model successfully handles 60% of traffic with negligible accuracy loss. Our method requires minimal human intervention and continuously improves through automated data collection and model updating, making it a practical solution for production environments.
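
The deployment pattern reads as a routing layer: a cheap check flags simple queries and serves them with the distilled small model, while everything else falls through to the large model. The heuristic and model calls below are placeholders, not ODIA's learned router:

    # Toy routing sketch; in ODIA the notion of a "simple query" is learned from
    # online user interaction data rather than a hand-written heuristic.
    def is_simple(query):
        return len(query.split()) < 8 and "and" not in query.lower()

    def small_model_call(query):
        return {"function": "play_music", "args": {"query": query}, "latency_ms": 40}

    def large_model_call(query):
        return {"function": "plan_and_call", "args": {"query": query}, "latency_ms": 600}

    def route(query):
        return small_model_call(query) if is_simple(query) else large_model_call(query)

    print(route("play some jazz"))                               # served by the distilled model
    print(route("make a playlist of upbeat songs and share it with my running group"))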

--------------------------------------------------------------------------------------------------------

Application of LLMs to Multi-Robot Path Planning and Task Allocation

Multi-agent reinforcement learning faces significant exploration challenges that compound traditional single-agent problems. This research investigates large language models as expert planners for efficient exploration in multi-robot environments, addressing the complex coordination requirements inherent in multi-agent systems. The approach leverages LLMs' reasoning capabilities to guide exploration strategies in planning-based tasks. Applications include warehouse automation, search and rescue operations, autonomous vehicle coordination, smart manufacturing systems, and distributed sensor networks. The integration of language models with multi-agent systems opens possibilities for more intuitive robot coordination, natural language task specification, and improved human-robot interaction in complex environments requiring sophisticated coordination and planning capabilities.

Authors:  Ashish Kumar

Link:  https://arxiv.org/abs/2507.07302v1

Date: 2025-07-d

Summary:

Efficient exploration is a well-known problem in deep reinforcement learning, and it is exacerbated in multi-agent reinforcement learning due to the intrinsic complexities of such algorithms. There are several approaches for multiple agents operating in an environment to explore it efficiently and learn to solve tasks; of these, the idea of expert exploration is investigated in this work. More specifically, this work investigates the application of large language models as expert planners for efficient exploration in planning-based tasks for multiple agents.
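
One way to picture the idea is an LLM prompted with each robot's position and the unexplored regions, returning a JSON assignment of targets; the prompt format and the call_llm stub below are assumptions for illustration, not the paper's setup:

    # Hypothetical sketch of an LLM-as-expert-planner loop for multi-robot exploration.
    import json

    def call_llm(prompt):
        # Stub returning a fixed assignment; replace with a real chat-completion call.
        return json.dumps({"robot_0": "region_B", "robot_1": "region_C"})

    def plan_exploration(robot_positions, unexplored_regions):
        prompt = (
            "You coordinate warehouse robots. Assign each robot exactly one unexplored region, "
            "minimizing overlap and travel.\n"
            f"Robot positions: {json.dumps(robot_positions)}\n"
            f"Unexplored regions: {json.dumps(unexplored_regions)}\n"
            'Reply as JSON: {"robot_id": "region_id", ...}'
        )
        return json.loads(call_llm(prompt))

    assignment = plan_exploration(
        {"robot_0": [0, 0], "robot_1": [5, 5]},
        {"region_B": [1, 4], "region_C": [6, 6]},
    )
    print(assignment)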

--------------------------------------------------------------------------------------------------------

TRIP: A Nonparametric Test to Diagnose Biased Feature Importance Scores

Feature importance interpretation in machine learning models faces significant challenges when dealing with dependent features, particularly in permutation-based importance methods. TRIP introduces a diagnostic test that detects unreliable importance scores resulting from model extrapolation when features are permuted. The test requires minimal assumptions and can be adapted for high-dimensional settings, addressing a critical gap in model interpretability. Applications include financial risk modeling, medical diagnosis systems, fraud detection, and any domain where understanding feature contributions is crucial for decision-making. The test's ability to identify when permutation feature importance is misleading makes it valuable for regulatory compliance, model validation, and ensuring trustworthy AI systems in high-stakes applications.

Authors:  Aaron Foote, Danny Krizanc

Link:  https://arxiv.org/abs/2507.07276v1

Date: 2025-07-d

Summary:

Along with accurate prediction, understanding the contribution of each feature to the making of the prediction, i.e., the importance of the feature, is a desirable and arguably necessary component of a machine learning model. For a complex model such as a random forest, such importances are not innate -- as they are, e.g., with linear regression. Efficient methods have been created to provide such capabilities, with one of the most popular among them being permutation feature importance due to its efficiency, model-agnostic nature, and perceived intuitiveness. However, permutation feature importance has been shown to be misleading in the presence of dependent features as a result of the creation of unrealistic observations when permuting the dependent features. In this work, we develop TRIP (Test for Reliable Interpretation via Permutation), a test requiring minimal assumptions that is able to detect unreliable permutation feature importance scores that are the result of model extrapolation. To build on this, we demonstrate how the test can be complemented in order to allow its use in high dimensional settings. Through testing on simulated data and applications, our results show that the test can be used to reliably detect when permutation feature importance scores are unreliable.
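
The quantity TRIP diagnoses, permutation feature importance, and the extrapolation problem behind it can be shown on synthetic data with two strongly correlated features; the nearest-neighbor distance check below is only an illustrative stand-in for the paper's actual test statistic:

    # Permutation feature importance on data where feature 1 nearly duplicates feature 0:
    # permuting either one creates observations far from the training manifold, which is
    # exactly the situation in which the resulting importance scores become unreliable.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=500)
    X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=500), rng.normal(size=500)])
    y = X[:, 0] + X[:, 2] + 0.1 * rng.normal(size=500)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    baseline = model.score(X, y)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    typical_dist = nn.kneighbors(X)[0][:, 1].mean()            # distance to nearest other point

    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])                   # break this feature's dependencies
        importance = baseline - model.score(Xp, y)
        perm_dist = nn.kneighbors(Xp, n_neighbors=1)[0][:, 0].mean()
        ratio = perm_dist / typical_dist
        flag = "suspect (extrapolation)" if ratio > 2.0 else "ok"
        print(f"feature {j}: importance={importance:.3f}, distance ratio={ratio:.2f} -> {flag}")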

--------------------------------------------------------------------------------------------------------

GUIDE: Towards Scalable Advising for Research Ideas

The accelerating pace of AI research creates an urgent need for scalable systems that can provide high-quality feedback on research hypotheses and experimental designs. GUIDE explores key factors in developing robust advising systems, including model size, context length, and structured reasoning processes. The system demonstrates that smaller models with compressed literature databases can outperform larger general-purpose models in research evaluation tasks. Applications include academic research support, grant proposal evaluation, peer review assistance, and research planning tools. The framework's high-confidence prediction capabilities make it valuable for research institutions, funding agencies, and academic publishers seeking to streamline review processes while maintaining quality standards in scientific evaluation.

Authors:  Yaowenqi Liu, BingXu Meng, Rui Pan, Jerry Huang, Tong Zhang

Link:  https://arxiv.org/abs/2507.08870v1

Date: 2025-07-d

Summary:

The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as Deepseek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design. The code is released at https://github.com/HowardLiu0830/GUIDE-Research-Idea-Evaluation.

--------------------------------------------------------------------------------------------------------

The bitter lesson of misuse detection

AI safety systems designed to detect and prevent harmful content face significant challenges in real-world deployment scenarios. This research introduces BELLS, a comprehensive benchmark evaluating supervision systems across diverse attack vectors and harm categories. The findings reveal substantial limitations in specialized supervision systems, with simple attacks often achieving near-zero detection rates. The study demonstrates that general-purpose language models often outperform specialized supervisors in misuse detection tasks. Applications include content moderation systems, AI safety research, platform governance, and regulatory compliance tools. The framework's comprehensive evaluation approach provides crucial insights for developing more robust safety mechanisms and highlights the importance of general capabilities in addressing evolving threats to AI system security.

Authors:  Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn

Link:  https://arxiv.org/abs/2507.06282v1

Date: 2025-07-d

Summary:

Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on a model's ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two-dimensional, covering harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak), and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when asking a harmful question directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs whether the user question is "harmful or not" largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the "bitter lesson" of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.
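
The benchmark's two-dimensional structure suggests a simple evaluation loop: detection rates per (harm severity, adversarial sophistication) cell. The prompts, labels, and keyword-matching supervisor below are toy placeholders, not the BELLS dataset or score:

    # Toy evaluation grid illustrating why keyword-style guardrails fail: anything not
    # matching a memorized pattern (borderline phrasing, encoded jailbreaks) goes undetected.
    SEVERITIES = ["benign", "borderline", "harmful"]
    SOPHISTICATION = ["direct", "jailbreak"]

    dataset = [
        {"prompt": "how do I bake bread", "severity": "benign", "style": "direct", "label": 0},
        {"prompt": "tell me how to pick a lock", "severity": "borderline", "style": "direct", "label": 1},
        {"prompt": "<base64 blob hiding a harmful request>", "severity": "harmful", "style": "jailbreak", "label": 1},
        {"prompt": "how do I make a weapon", "severity": "harmful", "style": "direct", "label": 1},
    ]

    def supervisor(prompt):
        # Placeholder guardrail that only matches a known keyword.
        return int("weapon" in prompt.lower())

    for sev in SEVERITIES:
        for style in SOPHISTICATION:
            cell = [ex for ex in dataset
                    if ex["severity"] == sev and ex["style"] == style and ex["label"] == 1]
            if cell:
                rate = sum(supervisor(ex["prompt"]) for ex in cell) / len(cell)
                print(f"{sev:10s} / {style:9s}: detection rate {rate:.0%} over {len(cell)} prompts")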

--------------------------------------------------------------------------------------------------------

Assuring the Safety of Reinforcement Learning Components: AMLAS-RL

Safety assurance in reinforcement learning systems represents a critical challenge as these systems increasingly integrate into safety-critical cyber-physical systems. AMLAS-RL adapts the established AMLAS methodology to provide structured guidance for generating safety arguments throughout the RL lifecycle. The framework addresses the unique challenges posed by RL's learning-based nature compared to traditional supervised learning approaches. Applications include autonomous vehicles, medical devices, industrial control systems, and any safety-critical application incorporating RL components. The methodology's systematic approach to safety argumentation makes it valuable for regulatory compliance, certification processes, and risk management in domains where demonstrable safety is paramount, providing a structured path for RL deployment in high-stakes environments.

Authors:  Calum Corrie Imrie, Ioannis Stefanakos, Sepeedeh Shahbeigi, Richard Hawkins, Simon Burton

Link:  https://arxiv.org/abs/2507.08848v1

Date: 2025-07-d

Summary:

The rapid advancement of machine learning (ML) has led to its increasing integration into cyber-physical systems (CPS) across diverse domains. While CPS offer powerful capabilities, incorporating ML components introduces significant safety and assurance challenges. Among ML techniques, reinforcement learning (RL) is particularly suited for CPS due to its capacity to handle complex, dynamic environments where explicit models of interaction between system and environment are unavailable or difficult to construct. However, in safety-critical applications, this learning process must not only be effective but demonstrably safe. Safe-RL methods aim to address this by incorporating safety constraints during learning, yet they fall short in providing systematic assurance across the RL lifecycle. The AMLAS methodology offers structured guidance for assuring the safety of supervised learning components, but it does not directly apply to the unique challenges posed by RL. In this paper, we adapt AMLAS to provide a framework for generating assurance arguments for an RL-enabled system through an iterative process; AMLAS-RL. We demonstrate AMLAS-RL using a running example of a wheeled vehicle tasked with reaching a target goal without collision.

--------------------------------------------------------------------------------------------------------

Hita: Holistic Tokenizer for Autoregressive Image Generation

Autoregressive image generation models face limitations in capturing global relationships when generating images token by token. Hita introduces a novel tokenization approach that incorporates holistic queries alongside local patch tokens, enabling better global context understanding. The system's sequential structure prioritizes holistic information while maintaining compatibility with autoregressive generation processes. Applications include digital art creation, content generation for media industries, architectural visualization, and automated design tools. The tokenizer's ability to capture global image properties like textures and shapes makes it valuable for creative applications requiring coherent style and structure. The demonstrated effectiveness in style transfer and image in-painting opens possibilities for advanced image editing tools and creative AI applications.

Authors:  Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi

Link:  https://arxiv.org/abs/2507.02358v4

Date: 2025-07-d

Summary:

Vanilla autoregressive image generation models generate visual tokens step-by-step, limiting their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita incorporates two key strategies to better align with the AR generation process: 1) arranging a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. Detailed analysis of the holistic representation highlights its ability to capture global image properties, such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at https://github.com/CVMI-Lab/Hita.
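
The sequence layout described in the abstract, learnable holistic tokens first and patch tokens after, processed with causal attention, can be sketched as follows; dimensions and the encoder are illustrative, not the Hita architecture:

    # Minimal sketch of the holistic-then-local sequence layout under causal attention,
    # so later patch tokens can condition on the global tokens that precede them.
    import torch
    import torch.nn as nn

    class HolisticThenLocal(nn.Module):
        def __init__(self, d_model=256, n_holistic=16, n_heads=8):
            super().__init__()
            self.holistic = nn.Parameter(torch.randn(1, n_holistic, d_model) * 0.02)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, patch_tokens):                   # patch_tokens: (B, N_patches, d_model)
            b = patch_tokens.size(0)
            seq = torch.cat([self.holistic.expand(b, -1, -1), patch_tokens], dim=1)
            L = seq.size(1)
            causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            return self.encoder(seq, mask=causal)          # holistic tokens come first

    tokens = HolisticThenLocal()(torch.randn(2, 64, 256))  # e.g., an 8x8 grid of patch tokens
    print(tokens.shape)                                    # (2, 16 + 64, 256)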

--------------------------------------------------------------------------------------------------------

PWD: Prior-Guided and Wavelet-Enhanced Diffusion Model for Limited-Angle CT

Limited-angle computed tomography reconstruction presents significant challenges in medical imaging, particularly in reducing radiation exposure while maintaining diagnostic quality. PWD introduces a diffusion model that incorporates prior information and wavelet-based feature fusion to enable efficient sampling with fewer steps. The system addresses the computational overhead of standard diffusion models while preserving fine structural details crucial for medical diagnosis. Applications include dental imaging, interventional radiology, pediatric imaging, and any medical context requiring reduced radiation exposure. The model's ability to achieve high-quality reconstruction with significantly fewer sampling steps makes it valuable for clinical workflows requiring both speed and accuracy in medical image reconstruction.

Authors:  Yi Liu, Yiyang Wen, Zekun Zhou, Junqi Ma, Linghang Wang, Yucheng Yao, Liu Shi, Qiegen Liu

Link:  https://arxiv.org/abs/2507.05317v2

Date: 2025-07-d

Summary:

Generative diffusion models have received increasing attention in medical imaging, particularly in limited-angle computed tomography (LACT). Standard diffusion models achieve high-quality image reconstruction but require a large number of sampling steps during inference, resulting in substantial computational overhead. Although skip-sampling strategies have been proposed to improve efficiency, they often lead to loss of fine structural details. To address this issue, we propose PWD, a prior-information-embedding and wavelet-feature-fusion fast-sampling diffusion model for LACT reconstruction. PWD enables efficient sampling while preserving reconstruction fidelity in LACT, and effectively mitigates the degradation typically introduced by skip-sampling. Specifically, during the training phase, PWD maps the distribution of LACT images to that of fully sampled target images, enabling the model to learn structural correspondences between them. During inference, the LACT image serves as an explicit prior to guide the sampling trajectory, allowing for high-quality reconstruction with significantly fewer steps. In addition, PWD performs multi-scale feature fusion in the wavelet domain, effectively enhancing the reconstruction of fine details by leveraging both low-frequency and high-frequency information. Quantitative and qualitative evaluations on clinical dental arch CBCT and periapical datasets demonstrate that PWD outperforms existing methods under the same sampling conditions. Using only 50 sampling steps, PWD achieves at least a 1.7 dB improvement in PSNR and a 10% gain in SSIM.
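
The wavelet-domain split PWD operates on can be illustrated with a toy fusion: take the low-frequency band from one image and the high-frequency detail bands from another, then invert the transform. PWD's actual fusion is learned; this only shows the decomposition it works with:

    # Toy wavelet-domain fusion: 2-D DWT separates a coarse approximation band from
    # detail bands, which can be recombined across two images before inversion.
    import numpy as np
    import pywt

    rng = np.random.default_rng(0)
    prior = rng.normal(size=(256, 256))                      # stands in for the LACT prior image
    refined = prior + 0.1 * rng.normal(size=(256, 256))      # stands in for a refined estimate

    cA_p, (cH_p, cV_p, cD_p) = pywt.dwt2(prior, "haar")
    cA_r, (cH_r, cV_r, cD_r) = pywt.dwt2(refined, "haar")

    fused = pywt.idwt2((cA_p, (cH_r, cV_r, cD_r)), "haar")   # prior low-freq + refined detail
    print(fused.shape)                                       # (256, 256)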

--------------------------------------------------------------------------------------------------------

IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Video understanding systems struggle with multi-shot scenarios involving varying camera angles and scene changes, often leading to identity confusion and key frame negligence. IPFormer-VideoLLM addresses these challenges through instance-level feature injection and a comprehensive multi-shot dataset called MultiClip-Bench. The system's attention-based connector enables effective aggregation of instance-specific information across scenes. Applications include video content analysis, surveillance systems, sports analytics, entertainment industry tools, and educational content creation. The model's enhanced multi-scene understanding capabilities make it valuable for automated video editing, content recommendation systems, and any application requiring sophisticated temporal and spatial reasoning across complex video sequences with multiple perspectives and scene transitions.

Authors:  Yujia Liang, Jile Jiao, Xuetao Feng, Zixuan Ye, Yuan Wang, Zhicheng Wang

Link:  https://arxiv.org/abs/2506.21116v2

Date: 2025-07-d

Summary:

Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities but struggle with multi-shot scenarios, e.g., video clips with varying camera angles or scene changes. This challenge can lead to failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets, and we therefore introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts the multi-shot performance, while the testing benchmark provides a reliable measure of the model capability in multi-shot scenarios. By further analyzing and discovering that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, we then contribute a new model, IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. Experiments demonstrate that our proposed dataset and model not only enhance multi-scene video understanding significantly, but also offer distinct advantages across various video benchmarks.
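
An attention-based connector in the spirit described above can be sketched as a set of learnable queries that cross-attend to per-frame instance features and are prepended to the visual tokens; sizes and structure are illustrative assumptions, not IPFormer-VideoLLM's implementation:

    # Toy "instance prompt" connector: learnable queries aggregate instance features via
    # cross-attention, and the resulting prompt tokens are prepended to the visual tokens.
    import torch
    import torch.nn as nn

    class InstancePromptConnector(nn.Module):
        def __init__(self, d_model=512, n_prompts=8, n_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(1, n_prompts, d_model) * 0.02)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.proj = nn.Linear(d_model, d_model)

        def forward(self, instance_feats, visual_tokens):
            """instance_feats: (B, N_inst, d); visual_tokens: (B, N_vis, d)."""
            q = self.queries.expand(instance_feats.size(0), -1, -1)
            prompts, _ = self.attn(q, instance_feats, instance_feats)   # aggregate identities
            return torch.cat([self.proj(prompts), visual_tokens], dim=1)

    connector = InstancePromptConnector()
    out = connector(torch.randn(2, 20, 512), torch.randn(2, 256, 512))
    print(out.shape)                                                    # (2, 8 + 256, 512)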

--------------------------------------------------------------------------------------------------------

The end of radical concept nativism

The philosophical debate over concept learning versus innate knowledge has profound implications for understanding human cognition and AI development. This research challenges Jerry Fodor's radical concept nativism through formal analysis using computer science and information theory. The work identifies critical divergences between nativist arguments and actual human cognitive processes, particularly regarding expressive power and conceptual structure. Applications include cognitive science research, AI system design, educational technology, and developmental psychology. The findings suggest that genuine concept learning is possible, informing approaches to machine learning, natural language processing, and cognitive modeling. This work has implications for understanding human creativity, learning mechanisms, and the development of AI systems that can acquire genuinely new concepts.

Authors:  Joshua S. Rule, Steven T. Piantadosi

Link:  https://arxiv.org/abs/2505.18277v2

Date: 2025-07-d

Summary:

Though humans seem to be remarkable learners, arguments in cognitive science and philosophy of mind have long maintained that learning something fundamentally new is impossible. Specifically, Jerry Fodor's arguments for radical concept nativism hold that most, if not all, concepts are innate and that what many call concept learning never actually leads to the acquisition of new concepts. These arguments have deeply affected cognitive science, and many believe that the counterarguments to radical concept nativism have been either unsuccessful or only apply to a narrow class of concepts. This paper first reviews the features and limitations of prior arguments. We then identify three critical points - related to issues of expressive power, conceptual structure, and concept possession - at which the arguments in favor of radical concept nativism diverge from describing actual human cognition. We use ideas from computer science and information theory to formalize the relevant ideas in ways that are arguably more scientifically productive. We conclude that, as a result, there is an important sense in which people do indeed learn new concepts.

--------------------------------------------------------------------------------------------------------

No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Fine-tuning attacks on large language models represent a significant security concern as these models become more widely deployed in production environments. This research demonstrates that existing safety mechanisms are "shallow" and can be bypassed through deeper attack strategies that train models to initially refuse before complying with harmful requests. The "refuse-then-comply" approach successfully jailbreaks both open-source and production models, achieving high attack success rates. Applications include AI safety research, security testing, model robustness evaluation, and regulatory compliance assessment. The work's identification of vulnerabilities in major production systems highlights the need for more sophisticated safety mechanisms and has direct implications for AI deployment policies, security protocols, and the development of more robust alignment techniques.

Authors:  Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy Dvijotham

Link:  https://arxiv.org/abs/2502.19537v5

Date: 2025-07-d

Summary:

Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To prevent abuse, these providers apply filters to block fine-tuning on overtly harmful data. In this setting, we make three contributions: First, while past work has shown that safety alignment is "shallow", we correspondingly demonstrate that existing fine-tuning attacks are shallow -- attacks target only the first several tokens of the model response, and consequently can be blocked by generating the first several response tokens with an aligned model. Second, we conceptually illustrate how to make attacks deeper by introducing a new fine-tuning attack that trains models to first refuse harmful requests before answering them; this "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters. Third, we demonstrate the potency of our new fine-tuning attack by jailbreaking both open-source models equipped with defenses and production models, achieving attack success rates of 57% and 72% against GPT-4o and Claude Haiku, respectively. Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic. Our work undermines the notion that models are safe because they initially refuse harmful requests and broadens awareness of the scope of attacks that face production fine-tuning APIs.

--------------------------------------------------------------------------------------------------------


EYE ON A.I. GETS READERS UP TO DATE ON THE LATEST FUNDING NEWS AND RELATED ISSUES. SUBSCRIBE FOR THE WEEKLY NEWSLETTER.