Week Ending 5.11.2025
RESEARCH WATCH: 5.11.2025
SLOT: Structuring the Output of Large Language Models
Large language models excel at generating text but often struggle to produce strictly formatted outputs needed for enterprise applications like data extraction, API integration, and automated workflows. Current approaches using constrained decoding or model-specific solutions limit flexibility and deployment options. SLOT addresses this critical gap by introducing a lightweight post-processing layer that transforms unstructured LLM outputs into well-defined schemas. This model-agnostic approach enables developers to leverage any LLM while ensuring outputs conform to required formats. Potential applications include automated form filling, structured report generation, database population, and reliable agent-based systems where precise output formatting is essential for downstream processing.
Authors: Darren Yow-Bang Wang, Zhengyuan Shen, Soumya Smruti Mishra, Zhichao Xu, Yifei Teng, Haibo Ding
Link: https://arxiv.org/abs/2505.04016v1
Date: 2025-05-06
Summary:
Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. While existing solutions predominantly rely on constrained decoding techniques or are tightly coupled with specific models, SLOT employs a fine-tuned lightweight language model as a post-processing layer, achieving flexibility across various LLMs and schema specifications. We introduce a systematic pipeline for data curation and synthesis alongside a formal evaluation methodology that quantifies both schema accuracy and content fidelity. Our results demonstrate that a fine-tuned Mistral-7B model with constrained decoding achieves near-perfect schema accuracy (99.5%) and content similarity (94.0%), outperforming Claude-3.5-Sonnet by substantial margins (+25 and +20 percentage points, respectively). Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models when equipped with SLOT, enabling reliable structured generation in resource-constrained environments.
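The post-processing idea is easy to picture in code. Below is a minimal sketch, not the authors' implementation: any small fine-tuned model plays the role of the `structurer` callable, and the result is checked against a JSON Schema, with the `jsonschema` library standing in for the paper's schema-accuracy check. The `invoice_schema` example is hypothetical.

```python
import json
from typing import Callable

from jsonschema import ValidationError, validate  # pip install jsonschema

def slot_style_postprocess(raw_output: str,
                           schema: dict,
                           structurer: Callable[[str, str], str],
                           max_retries: int = 2) -> dict:
    """Coerce free-form LLM text into a schema-conforming dict.

    `structurer` stands in for a fine-tuned lightweight model: given
    (raw_text, schema_as_json) it returns a JSON string. Any small model or
    API call can be plugged in, which is what model-agnostic means here.
    """
    schema_json = json.dumps(schema)
    for _ in range(max_retries + 1):
        candidate = structurer(raw_output, schema_json)
        try:
            parsed = json.loads(candidate)
            validate(instance=parsed, schema=schema)  # schema-accuracy check
            return parsed
        except (json.JSONDecodeError, ValidationError):
            continue  # retry; constrained decoding could be enabled here instead
    raise ValueError("could not coerce the output into the target schema")

# Example target schema for an information-extraction task.
invoice_schema = {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total": {"type": "number"}},
    "required": ["vendor", "total"],
}
```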
--------------------------------------------------------------------------------------------------------
LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs
The surge in e-commerce has created unprecedented demand for efficient last-mile delivery solutions, yet current UAV navigation systems lack the precision required for terminal delivery to specific locations like apartment balconies or office windows. LogisticsVLN pioneers a multimodal approach combining vision and language understanding for fine-grained aerial navigation. By integrating lightweight language models with visual perception, the system enables UAVs to interpret complex delivery instructions and navigate urban environments autonomously. Applications include package delivery to high-rise buildings, emergency medical supply transport, and urban logistics in densely populated areas. The VLD dataset enables researchers to advance this critical intersection of AI and autonomous aviation.
Authors: Xinyuan Zhang, Yonglin Tian, Fei Lin, Yue Liu, Jing Ma, Kornélia Sára Szatmáry, Fei-Yue Wang
Link: https://arxiv.org/abs/2505.03460v1
Date: 2025-05-06
Summary:
The growing demand for intelligent logistics, particularly fine-grained terminal delivery, underscores the need for autonomous UAV (Unmanned Aerial Vehicle)-based delivery systems. However, most existing last-mile delivery studies rely on ground robots, while current UAV-based Vision-Language Navigation (VLN) tasks primarily focus on coarse-grained, long-range goals, making them unsuitable for precise terminal delivery. To bridge this gap, we propose LogisticsVLN, a scalable aerial delivery system built on multimodal large language models (MLLMs) for autonomous terminal delivery. LogisticsVLN integrates lightweight Large Language Models (LLMs) and Visual-Language Models (VLMs) in a modular pipeline for request understanding, floor localization, object detection, and action-decision making. To support research and evaluation in this new setting, we construct the Vision-Language Delivery (VLD) dataset within the CARLA simulator. Experimental results on the VLD dataset showcase the feasibility of the LogisticsVLN system. In addition, we conduct subtask-level evaluations of each module of our system, offering valuable insights for improving the robustness and real-world deployment of foundation model-based vision-language delivery systems.
--------------------------------------------------------------------------------------------------------
Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance
Speech enhancement remains vital for consumer electronics like smartphones, hearing aids, and smart glasses, yet state-of-the-art models are too computationally intensive for edge deployment. This work addresses the challenge of compressing powerful denoising models while maintaining performance through innovative knowledge distillation. By leveraging denoising-autoencoder frameworks and cosine similarity in latent spaces, the method enables efficient student models that retain teacher model quality. Applications include real-time voice enhancement in noisy environments, improved audio quality for video calls, and accessibility tools for hearing-impaired users. The approach's flexibility allows deployment across diverse hardware constraints, from wearables to automotive systems requiring clear voice communication.
Authors: Diep Luong, Mikko Heikkinen, Konstantinos Drossos, Tuomas Virtanen
Link: https://arxiv.org/abs/2505.03442v1
Date: 2025-05-06
Summary:
Speech denoising is a generally adopted and impactful task, appearing in many common and everyday-life use cases. Although there are very powerful methods published, most of those are too complex for deployment in everyday and low-resource computational environments, like hand-held devices, intelligent glasses, hearing aids, etc. Knowledge distillation (KD) is a prominent way of alleviating this complexity mismatch and is based on the transferring/distilling of knowledge from a pre-trained complex model, the teacher, to another less complex one, the student. Existing KD methods for speech denoising are based on processes that potentially hamper the KD by bounding the learning of the student to the distribution, information ordering, and feature dimensionality learned by the teacher. In this paper, we present and assess a method that addresses this issue by exploiting the well-known denoising-autoencoder framework, the linear inverted bottlenecks, and the properties of the cosine similarity. We use a public dataset and conduct repeated experiments with different mismatching scenarios between the teacher and the student, reporting the mean and standard deviation of the metrics of our method and another, state-of-the-art method that is used as a baseline. Our results show that with the proposed method, the student can perform better and can also retain performance under greater mismatching conditions compared to the teacher.
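A rough PyTorch sketch of the core objective follows: student and teacher latents are aligned with a cosine-distance loss through a learned projection. The dimensions and the single linear projection are illustrative assumptions, not necessarily the paper's exact bottleneck design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAlignmentKD(nn.Module):
    """Cosine-distance alignment between student and teacher latents.

    The linear projection handling the dimensionality mismatch is an
    illustrative choice, not the paper's exact inverted-bottleneck design.
    """
    def __init__(self, student_dim: int = 128, teacher_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # map student -> teacher space

    def forward(self, student_latent: torch.Tensor,
                teacher_latent: torch.Tensor) -> torch.Tensor:
        aligned = self.proj(student_latent)
        cos_sim = F.cosine_similarity(aligned, teacher_latent, dim=-1)
        return (1.0 - cos_sim).mean()  # cosine distance, averaged over the batch

# Typical use: total_loss = denoising_loss + kd_weight * kd(student_z, teacher_z)
```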
--------------------------------------------------------------------------------------------------------
Rethinking Federated Graph Learning: A Data Condensation Perspective
Federated learning on graph data presents unique challenges due to heterogeneous graph structures across distributed clients, limiting collaborative training effectiveness. FedGM introduces a paradigm shift by using condensed graphs as optimization carriers rather than traditional parameter sharing. This approach synthesizes comprehensive knowledge from distributed graphs while minimizing communication overhead and preserving privacy through single condensed data transmissions. Applications span social network analysis across organizations, collaborative drug discovery using molecular graphs, financial fraud detection across institutions, and multi-site medical knowledge graphs. The method's ability to handle diverse graph distributions makes it particularly valuable for cross-organizational collaborations where data sharing is restricted by privacy regulations.
Authors: Hao Zhang, Xunkai Li, Yinlin Zhu, Lianglin Hu
Link: https://arxiv.org/abs/2505.02573v1
Date: 2025-05-05
Summary:
Federated graph learning is a widely recognized technique that promotes collaborative training of graph neural networks (GNNs) on multi-client graphs. However, existing approaches heavily rely on the communication of model parameters or gradients for federated optimization and fail to adequately address the data heterogeneity introduced by intricate and diverse graph distributions. Although some methods attempt to share additional messages among the server and clients to improve federated convergence during communication, they introduce significant privacy risks and increase communication overhead. To address these issues, we introduce the concept of a condensed graph as a novel optimization carrier to address FGL data heterogeneity and propose a new FGL paradigm called FedGM. Specifically, we utilize a generalized condensation graph consensus to aggregate comprehensive knowledge from distributed graphs, while minimizing communication costs and privacy risks through a single transmission of the condensed data. Extensive experiments on six public datasets consistently demonstrate the superiority of FedGM over state-of-the-art baselines, highlighting its potential for a novel FGL paradigm.
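To make the "condensed graph as carrier" idea concrete, here is a toy server-side sketch under the paper's premise: each client ships a single condensed graph (features, dense adjacency, labels) once, and the server trains a global model on those condensed graphs alone. The condensation step and the consensus mechanism are omitted, and all shapes are made up for illustration.

```python
import torch
import torch.nn as nn

def gcn_layer(adj: torch.Tensor, x: torch.Tensor, w: nn.Linear) -> torch.Tensor:
    """One propagation step on a small dense adjacency: normalize(A) @ X @ W."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    return torch.relu(w((adj / deg) @ x))

# Hypothetical payloads: each client ships one condensed graph a single time,
# instead of exchanging parameters or gradients repeatedly.
client_payloads = [
    {"x": torch.randn(20, 16), "adj": torch.eye(20), "y": torch.randint(0, 3, (20,))},
    {"x": torch.randn(30, 16), "adj": torch.eye(30), "y": torch.randint(0, 3, (30,))},
]

w1, w2 = nn.Linear(16, 32), nn.Linear(32, 3)
optimizer = torch.optim.Adam(list(w1.parameters()) + list(w2.parameters()), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Server-side training touches only the condensed graphs, never raw client data.
for _ in range(50):
    for g in client_payloads:
        logits = w2(gcn_layer(g["adj"], g["x"], w1))
        loss = loss_fn(logits, g["y"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```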
--------------------------------------------------------------------------------------------------------
SymbioticRAG: Enhancing Document Intelligence Through Human-LLM Symbiotic Collaboration
Traditional RAG systems struggle with relevance determination and user query formulation, limiting their effectiveness in complex document intelligence tasks. SymbioticRAG revolutionizes this paradigm by creating bidirectional learning between humans and AI, where users directly curate retrieved content while the system learns personalized retrieval patterns. The framework's comprehensive document processing pipeline handles diverse formats including tables, formulas, and figures, making it ideal for scientific literature review, geological exploration reports, and educational content curation. By capturing user interactions to build personalized models, it addresses the fundamental challenge of relevance being inherently human-centered. Applications include research acceleration, technical documentation exploration, and domain-specific knowledge management systems.
Authors: Qiang Sun, Tingting Bi, Sirui Li, Eun-Jung Holden, Paul Duuring, Kai Niu, Wei Liu
Link: https://arxiv.org/abs/2505.02418v1
Date: 2025-05-05
Summary:
We present SymbioticRAG, a novel framework that fundamentally reimagines Retrieval-Augmented Generation (RAG) systems by establishing a bidirectional learning relationship between humans and machines. Our approach addresses two critical challenges in current RAG systems: the inherently human-centered nature of relevance determination and users' progression from "unconscious incompetence" in query formulation. SymbioticRAG introduces a two-tier solution where Level 1 enables direct human curation of retrieved content through interactive source document exploration, while Level 2 aims to build personalized retrieval models based on captured user interactions. We implement Level 1 through three key components: (1) a comprehensive document processing pipeline with specialized models for layout detection, OCR, and extraction of tables, formulas, and figures; (2) an extensible retriever module supporting multiple retrieval strategies; and (3) an interactive interface that facilitates both user engagement and interaction data logging. We experiment with a Level 2 implementation via a retrieval strategy that incorporates LLM-summarized user intention from user interaction logs. To maintain high-quality data preparation, we develop a human-on-the-loop validation interface that improves pipeline output while advancing research in specialized extraction tasks. Evaluation across three scenarios (literature review, geological exploration, and education) demonstrates significant improvements in retrieval relevance and user satisfaction compared to traditional RAG approaches. To facilitate broader research and further advancement of SymbioticRAG Level 2 implementation, we will make our system openly accessible to the research community.
--------------------------------------------------------------------------------------------------------
T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models
Time series data scarcity hampers development across domains from healthcare to finance, yet existing synthetic generation methods lack flexibility and generalization capabilities. T2S bridges natural language and time series through diffusion models, enabling generation of arbitrary-length sequences from textual descriptions. By categorizing captions into point, fragment, and instance levels, the framework provides comprehensive temporal understanding. The length-adaptive variational autoencoder ensures consistent generation across varying sequence lengths. Applications include synthetic patient data generation for medical research, financial market simulation, IoT sensor data augmentation, and climate modeling. The domain-agnostic approach and extensive evaluation across 13 datasets demonstrate its versatility for addressing data availability challenges.
Authors: Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, Shirui Pan
Link: https://arxiv.org/abs/2505.02417v2
Date: 2025-05-08
Summary:
Text-to-Time Series generation holds significant potential to address challenges such as data sparsity, imbalance, and limited availability of multimodal time series datasets across domains. While diffusion models have achieved remarkable success in Text-to-X (e.g., vision and audio data) generation, their use in time series generation remains in its nascent stages. Existing approaches face two critical limitations: (1) the lack of systematic exploration of general-purpose time series captions, which are often domain-specific and struggle with generalization; and (2) the inability to generate time series of arbitrary lengths, limiting their applicability to real-world scenarios. In this work, we first categorize time series captions into three levels: point-level, fragment-level, and instance-level. Additionally, we introduce a new fragment-level dataset containing over 600,000 high-resolution time series-text pairs. Second, we propose Text-to-Series (T2S), a diffusion-based framework that bridges the gap between natural language and time series in a domain-agnostic manner. T2S employs a length-adaptive variational autoencoder to encode time series of varying lengths into consistent latent embeddings. On top of that, T2S effectively aligns textual representations with latent embeddings by utilizing Flow Matching and employing Diffusion Transformer as the denoiser. We train T2S in an interleaved paradigm across multiple lengths, allowing it to generate sequences of any desired length. Extensive evaluations demonstrate that T2S achieves state-of-the-art performance across 13 datasets spanning 12 domains.
--------------------------------------------------------------------------------------------------------
Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques
The deployment of large language models on mobile and edge devices remains challenging due to computational constraints, limiting AI accessibility and real-time applications. This comprehensive survey examines three primary compression approaches: knowledge distillation, quantization, and pruning, providing practitioners with actionable insights for edge deployment. By analyzing successful applications and complementary techniques like mixture-of-experts, the survey guides optimization strategies for resource-limited environments. Potential applications include on-device translation, offline virtual assistants, automotive AI systems, and privacy-preserving mobile applications. The work is particularly relevant as organizations seek to democratize AI access while maintaining data privacy and reducing cloud dependence for latency-critical tasks.
Authors: Sanjay Surendranath Girija, Shashank Kapoor, Lakshit Arora, Dipen Pradhan, Aman Raj, Ankit Shetgaonkar
Link: https://arxiv.org/abs/2505.02309v2
Date: 2025-05-08
Summary:
Large Language Models (LLMs) have revolutionized many areas of artificial intelligence (AI), but their substantial resource requirements limit their deployment on mobile and edge devices. This survey paper provides a comprehensive overview of techniques for compressing LLMs to enable efficient inference in resource-constrained environments. We examine three primary approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For each technique, we discuss the underlying principles, present different variants, and provide examples of successful applications. We also briefly discuss complementary techniques such as mixture-of-experts and early-exit strategies. Finally, we highlight promising future directions, aiming to provide a valuable resource for both researchers and practitioners seeking to optimize LLMs for edge deployment.
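As a concrete instance of one surveyed technique, post-training dynamic quantization in PyTorch converts linear-layer weights to int8 without retraining. The toy model below merely stands in for a pretrained LLM block; this is a sketch of the general recipe, not a prescription from the survey.

```python
import torch
import torch.nn as nn

# A stand-in feed-forward block; in practice this would be a pretrained LLM.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Post-training dynamic quantization: weights stored in int8, activations
# quantized on the fly at inference time, with no retraining required.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)  # same interface, smaller and faster linear layers
```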
--------------------------------------------------------------------------------------------------------
EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation
Mobile automation faces a fundamental trade-off between powerful cloud-based reasoning and responsive edge deployment, limiting practical applications. EcoAgent introduces a collaborative multi-agent architecture that strategically distributes tasks between cloud planning and edge execution. The Pre-Understanding Module compresses visual information into text, drastically reducing communication overhead, while the Memory and Reflection modules enable adaptive replanning. This framework enables sophisticated mobile automation tasks like automated testing, accessibility assistance, and intelligent user interfaces while maintaining efficiency. By achieving cloud-level performance with significantly reduced token consumption, EcoAgent makes complex mobile AI applications economically viable and practically deployable across diverse smartphone capabilities.
Authors: Biao Yi, Xavier Hu, Yurun Chen, Shengyu Zhang, Hongxia Yang, Fan Wu, Fei Wu
Link: https://arxiv.org/abs/2505.05440v2
Date: 2025-05-09
Summary:
Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) offer strong reasoning abilities but suffer from high latency and cost. While fine-tuned (M)SLMs enable edge deployment, they often lose general capabilities and struggle with complex tasks. To address this, we propose EcoAgent, an Edge-Cloud cOllaborative multi-agent framework for mobile automation. EcoAgent features a closed-loop collaboration among a cloud-based Planning Agent and two edge-based agents: the Execution Agent for action execution and the Observation Agent for verifying outcomes. The Observation Agent uses a Pre-Understanding Module to compress screen images into concise text, reducing token usage and communication overhead. In case of failure, the Planning Agent retrieves screen history through a Memory Module and replans via a Reflection Module. Experiments on AndroidWorld show that EcoAgent achieves task success rates comparable to cloud-based mobile agents while significantly reducing MLLM token consumption, enabling efficient and practical mobile automation.
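The collaboration pattern can be sketched as a small control loop. The callables below are placeholders for the Planning, Execution, and Observation agents and the Pre-Understanding Module, not the paper's implementation; only the overall data flow is meant to be faithful.

```python
from typing import Callable, List

def eco_agent_loop(task: str,
                   observe_screen: Callable[[], str],            # edge: screenshot -> concise text
                   plan: Callable[[str, List[str]], List[str]],  # cloud: task + history -> actions
                   execute: Callable[[str], bool],               # edge: run one UI action
                   max_replans: int = 3) -> bool:
    """Minimal sketch of an edge-cloud collaborative loop.

    Only compressed text, never raw screen images, crosses the edge-cloud
    boundary, which is where the token savings come from.
    """
    history: List[str] = []
    for _ in range(max_replans):
        summary = observe_screen()                 # Pre-Understanding: image -> short text
        actions = plan(task, history + [summary])  # cloud-side planning on text only
        if all(execute(a) for a in actions):
            return True                            # outcome verified on the edge
        history.append(summary)                    # Memory: keep screen history, then replan
    return False
```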
--------------------------------------------------------------------------------------------------------
ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning
Large reasoning models generate unnecessarily verbose outputs due to redundant reflection steps, increasing computational costs and degrading user experience. ConCISE identifies two key patterns causing this inefficiency: Confidence Deficit and Termination Delay. By injecting confidence signals during inference and implementing early stopping mechanisms, the framework reduces output length by up to 50% while maintaining accuracy. This advancement is crucial for deploying reasoning models in production environments where response time and computational efficiency matter. Applications include conversational AI assistants, automated code review systems, mathematical problem solvers, and educational tutoring platforms. The method's model-agnostic nature ensures broad applicability across different large language models.
Authors: Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang
Link: https://arxiv.org/abs/2505.04881v1
Date: 2025-05-08
Summary:
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused by redundant content, which increases computational overhead and degrades user experience. Existing compression methods either operate via post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to intervene effectively during generation. In this work, we introduce a confidence-guided perspective to explain the emergence of redundant reflection in LRMs, identifying two key patterns: Confidence Deficit, where the model reconsiders correct steps due to low internal confidence, and Termination Delay, where reasoning continues even after reaching a confident answer. Based on this analysis, we propose ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains by reinforcing the model's confidence during inference, thus preventing the generation of redundant reflection steps. It integrates Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that fine-tuning LRMs on ConCISE-generated data yields significantly shorter outputs, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy. ConCISE consistently outperforms existing baselines across multiple reasoning benchmarks.
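The Early Stopping half of the idea can be sketched as follows. The `step` callable is a placeholder for producing one reasoning step plus a confidence estimate (e.g., mean token probability), and the answer-marker heuristic is an assumption for illustration; Confidence Injection is not shown.

```python
from typing import Callable, List, Tuple

def confident_generate(step: Callable[[List[str]], Tuple[str, float]],
                       confidence_threshold: float = 0.9,
                       max_steps: int = 64) -> List[str]:
    """Sketch of confidence-guided early stopping for step-by-step reasoning.

    `step` returns the next reasoning step and the model's confidence in it,
    given the steps generated so far. Once a step containing a final answer
    appears with high confidence, generation halts instead of emitting
    further reflection steps.
    """
    steps: List[str] = []
    for _ in range(max_steps):
        text, conf = step(steps)
        steps.append(text)
        if "answer:" in text.lower() and conf >= confidence_threshold:
            break  # stop early: the chain is already confident
    return steps
```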
--------------------------------------------------------------------------------------------------------
Healthcare prediction tools often lack transparency, limiting their adoption by medical professionals and patients. This work demonstrates an interactive web-based system for diabetes risk assessment that prioritizes explainability through SHAP and LIME visualizations. Using the CDC BRFSS dataset, the LightGBM model with undersampling achieves optimal recall for early detection. The Dash-based interface enables users to understand prediction rationales and explore comorbidity correlations, fostering trust and engagement. Applications include preventive healthcare screening, patient education tools, clinical decision support systems, and public health monitoring. By making AI predictions interpretable and interactive, the system bridges the gap between advanced machine learning and practical healthcare delivery.
Authors: Udaya Allani
Link: https://arxiv.org/abs/2505.05683v1
Date: 2025-05-08
Summary:
This study presents a web-based interactive health risk prediction tool designed to assess diabetes risk using machine learning models. Built on the 2015 CDC BRFSS dataset, the study evaluates models including Logistic Regression, Random Forest, XGBoost, LightGBM, KNN, and Neural Networks under original, SMOTE, and undersampling strategies. LightGBM with undersampling achieved the best recall, making it ideal for risk detection. The tool integrates SHAP and LIME to explain predictions and highlights comorbidity correlations using Pearson analysis. A Dash-based UI enables user-friendly interaction with model predictions, personalized suggestions, and feature insights, supporting data-driven health awareness.
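A minimal sketch of the modeling-plus-explanation pattern follows, on synthetic data rather than BRFSS, and with class weighting used as a stand-in for the paper's undersampling strategy; feature names and thresholds are invented for illustration.

```python
import numpy as np
import shap
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for BRFSS-style tabular features (BMI, age group, BP, ...)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=2000) > 0.8).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# class_weight="balanced" is used here as a stand-in for undersampling.
model = LGBMClassifier(n_estimators=200, class_weight="balanced")
model.fit(X_train, y_train)

# SHAP values attribute each individual's predicted risk to its features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
print(np.shape(shap_values))
```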
--------------------------------------------------------------------------------------------------------
ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling
Climate modeling at regional scales remains limited by computational constraints and sparse observational data, hampering local decision-making for climate adaptation. ORBIT-2 revolutionizes climate downscaling through two key innovations: Reslim architecture for efficient processing and TILES algorithm for linear-complexity attention scaling. Achieving 1.8 ExaFLOPS and supporting 0.9 km global resolution, the system enables unprecedented detail in weather prediction and climate projection. Applications include extreme weather forecasting, agricultural planning, urban climate modeling, and disaster preparedness. The model's ability to process 4.2 billion token sequences while maintaining 92-98% scaling efficiency represents a breakthrough in applying AI to one of humanity's most pressing challenges.
Authors: Xiao Wang, Jong-Youl Choi, Takuya Kurihaya, Isaac Lyngaas, Hong-Jun Yoon, Ming Fan, Nasik Muhammad Nafi, Aristeidis Tsaris, Ashwin M. Aji, Maliha Hossain, Mohamed Wahib, Dali Wang, Peter Thornton, Prasanna Balaprakash, Moetasim Ashfaq, Dan Lu
Link: https://arxiv.org/abs/2505.04802v1
Date: 2025-05-07
Summary:
Sparse observations and coarse-resolution climate models limit effective regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, hyper-resolution climate downscaling. ORBIT-2 incorporates two key innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism. ORBIT-2 scales to 10 billion parameters across 32,768 GPUs, achieving up to 1.8 ExaFLOPS sustained throughput and 92-98% strong scaling efficiency. It supports downscaling to 0.9 km global resolution and processes sequences up to 4.2 billion tokens. On 7 km resolution benchmarks, ORBIT-2 achieves high accuracy with R^2 scores in the range of 0.98 to 0.99 against observation data.
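The complexity argument behind TILES can be illustrated with a toy tile-wise attention: restricting attention to fixed-size tiles makes the cost linear in sequence length. The real algorithm is considerably more sophisticated; this sketch only conveys the scaling intuition.

```python
import torch
import torch.nn.functional as F

def tiled_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         tile: int = 256) -> torch.Tensor:
    """Toy tile-wise attention: attend only within fixed-size tiles.

    Restricting attention to local tiles drops the cost from O(L^2) to
    O(L * tile). Shapes are (batch, length, dim) with length divisible by tile.
    """
    b, length, d = q.shape
    n = length // tile
    # Reshape the sequence into (batch * n_tiles, tile, dim) blocks.
    qt, kt, vt = (x.reshape(b * n, tile, d) for x in (q, k, v))
    out = F.scaled_dot_product_attention(qt, kt, vt)  # quadratic only within a tile
    return out.reshape(b, length, d)

x = torch.randn(2, 1024, 64)
print(tiled_self_attention(x, x, x, tile=256).shape)  # torch.Size([2, 1024, 64])
```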
--------------------------------------------------------------------------------------------------------
Soft Best-of-n Sampling for Model Alignment
Aligning language models with human preferences typically requires expensive fine-tuning, making it impractical for many applications. Soft Best-of-n sampling generalizes the popular BoN approach by introducing a temperature parameter that smoothly interpolates between original and reward-maximizing distributions. This theoretical advancement provides practitioners with fine-grained control over the trade-off between alignment quality and distributional distortion. Applications include content moderation systems, customer service chatbots, educational AI tutors, and creative writing assistants. The method's simplicity and theoretical guarantees make it particularly valuable for organizations seeking to improve model outputs without the computational overhead of full retraining or the coarse control of traditional BoN sampling.
Authors: Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio P. Calmon
Link: https://arxiv.org/abs/2505.03156v1
Date: 2025-05-06
Summary:
Best-of-$n$ (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating $n$ responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distribution. This distortion is coarsely controlled by varying the number of samples: larger $n$ yields a higher reward at a higher distortion cost. We introduce Soft Best-of-$n$ sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and reward-maximizing distribution through a temperature parameter $\lambda$. We establish theoretical guarantees showing that Soft Best-of-$n$ sampling converges sharply to the optimal tilted distribution at a rate of $O(1/n)$ in KL and the expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
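One natural reading of the mechanism, sketched below: draw n candidates and select among them with a softmax over rewards at temperature λ, so λ near zero approaches standard BoN (argmax of reward) while large λ stays close to ordinary sampling. The `sample` and `reward` callables are placeholders for any LLM and reward model.

```python
import math
import random
from typing import Callable, List

def soft_best_of_n(prompt: str,
                   sample: Callable[[str], str],
                   reward: Callable[[str, str], float],
                   n: int = 8,
                   lam: float = 1.0) -> str:
    """Sketch of Soft Best-of-n selection among n sampled responses.

    Candidates are chosen with probability proportional to exp(reward / lam):
    small lam concentrates on the highest-reward candidate, large lam keeps
    the selection close to sampling from the base model.
    """
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    weights = [math.exp(reward(prompt, c) / lam) for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```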
--------------------------------------------------------------------------------------------------------
The transformative impact of generative AI on business creation and innovation demands systematic understanding for entrepreneurs and policymakers. This comprehensive review analyzes 83 papers using advanced NLP techniques, revealing five major thematic clusters spanning digital transformation to data-driven innovation. The study identifies critical gaps in macro-level research and regulatory frameworks while highlighting ethical concerns emerging from GenAI adoption. Applications include AI-powered business model innovation, educational programs for entrepreneurs, policy development for AI governance, and strategic planning for startups. By synthesizing scattered research, this work provides essential guidance for navigating the rapidly evolving intersection of AI and entrepreneurship in the modern business landscape.
Authors: Anna Kusetogullari, Huseyin Kusetogullari, Martin Andersson, Tony Gorschek
Link: https://arxiv.org/abs/2505.05523v1
Date: 2025-05-08
Summary:
Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs) are recognized to have significant effects on industry and business dynamics, not least because of their impact on the preconditions for entrepreneurship. There is still a lack of knowledge of GenAI as a theme in entrepreneurship research. This paper presents a systematic literature review aimed at identifying and analyzing the evolving landscape of research on the effects of GenAI on entrepreneurship. We analyze 83 peer-reviewed articles obtained from leading academic databases: Web of Science and Scopus. Using natural language processing and unsupervised machine learning techniques with TF-IDF vectorization, Principal Component Analysis (PCA), and hierarchical clustering, five major thematic clusters are identified: (1) Digital Transformation and Behavioral Models, (2) GenAI-Enhanced Education and Learning Systems, (3) Sustainable Innovation and Strategic AI Impact, (4) Business Models and Market Trends, and (5) Data-Driven Technological Trends in Entrepreneurship. Based on the review, we discuss future research directions, gaps in the current literature, as well as ethical concerns raised in the literature. We highlight the need for more macro-level research on GenAI and LLMs as external enablers for entrepreneurship and for research on effective regulatory frameworks that facilitate business experimentation, innovation, and further technology development.
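The review's text-mining pipeline is standard and easy to reproduce in outline. The sketch below runs TF-IDF, PCA, and hierarchical clustering on a few made-up abstracts standing in for the 83 articles; cluster counts and texts are illustrative only.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical abstracts standing in for the reviewed articles.
abstracts = [
    "Generative AI reshapes digital transformation in new ventures.",
    "LLM tutors support entrepreneurship education and learning.",
    "Business model innovation driven by data and generative models.",
    "Sustainable innovation strategies for AI-enabled startups.",
]

# TF-IDF -> PCA -> hierarchical clustering, mirroring the review's pipeline.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
reduced = PCA(n_components=2).fit_transform(tfidf.toarray())
labels = AgglomerativeClustering(n_clusters=2).fit_predict(reduced)
print(labels)
```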
--------------------------------------------------------------------------------------------------------
Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods
As AI systems approach transformative capabilities, ensuring their safety becomes paramount, yet evaluation methods remain fragmented and incomplete. This systematic review consolidates the field by proposing a three-dimensional taxonomy: what properties to measure (capabilities, propensities, control), how to measure them (behavioral and internal techniques), and integration into governance frameworks. The work examines critical capabilities like deception and autonomous replication alongside concerning propensities like power-seeking. Applications include AI safety auditing, regulatory compliance frameworks, development decision-making, and risk assessment protocols. By addressing challenges like capability absence proofs and potential sandbagging, this review provides essential guidance for organizations developing or deploying advanced AI systems.
Authors: Markov Grey, Charbel-Raphaël Segerie
Link: https://arxiv.org/abs/2505.05541v1
Date: 2025-05-08
Summary:
As frontier AI systems advance toward transformative capabilities, we need a parallel transformation in how we measure and evaluate these systems to ensure safety and inform governance. While benchmarks have been the primary method for estimating model capabilities, they often fail to establish true upper bounds or predict deployment behavior. This literature review consolidates the rapidly evolving field of AI safety evaluations, proposing a systematic taxonomy around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into frameworks. We show how evaluations go beyond benchmarks by measuring what models can do when pushed to the limit (capabilities), the behavioral tendencies exhibited by default (propensities), and whether our safety measures remain effective even when faced with subversive adversarial AI (control). These properties are measured through behavioral techniques like scaffolding, red teaming and supervised fine-tuning, alongside internal techniques such as representation analysis and mechanistic interpretability. We provide deeper explanations of some safety-critical capabilities like cybersecurity exploitation, deception, autonomous replication, and situational awareness, alongside concerning propensities like power-seeking and scheming. The review explores how these evaluation methods integrate into governance frameworks to translate results into concrete development decisions. We also highlight challenges to safety evaluations - proving absence of capabilities, potential model sandbagging, and incentives for "safetywashing" - while identifying promising research directions. By synthesizing scattered resources, this literature review aims to provide a central reference point for understanding AI safety evaluations.
--------------------------------------------------------------------------------------------------------
Traditional causal inference focuses on average effects, overlooking the rich information contained in distributional characteristics. This work extends causal analysis to moments beyond the mean, including variance, skewness, and kurtosis, providing deeper insights into treatment heterogeneity. By developing identification theorems and bounds for causal effect moments, researchers can better understand how interventions affect different population segments. Applications include personalized medicine where treatment variance matters, policy evaluation considering outcome inequality, A/B testing with heterogeneous effects, and economic interventions with distributional consequences. The framework's ability to capture relationships between causal effects through product moments enables more nuanced decision-making in complex systems.
Authors: Yuta Kawakami, Jin Tian
Link: https://arxiv.org/abs/2505.04971v1
Date: 2025-05-08
Summary:
The moments of random variables are fundamental statistical measures for characterizing the shape of a probability distribution, encompassing metrics such as mean, variance, skewness, and kurtosis. Additionally, the product moments, including covariance and correlation, reveal the relationships between multiple random variables. On the other hand, the primary focus of causal inference is the evaluation of causal effects, which are defined as the difference between two potential outcomes. While traditional causal effect assessment focuses on the average causal effect, this work provides definitions, identification theorems, and bounds for moments and product moments of causal effects to analyze their distribution and relationships. We conduct experiments to illustrate the estimation of the moments of causal effects from finite samples and demonstrate their practical application using a real-world medical dataset.
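For readers who want the objects pinned down, a minimal formalization in our own notation (the paper's definitions and identification conditions may differ in detail):

```latex
% k-th moment of the unit-level causal effect  \Delta = Y(1) - Y(0)
m_k = \mathbb{E}\!\left[(Y(1) - Y(0))^k\right], \qquad k = 1, 2, 3, \dots

% Variance and skewness of the causal effect follow from these moments:
\operatorname{Var}(\Delta) = m_2 - m_1^2, \qquad
\operatorname{Skew}(\Delta) = \frac{\mathbb{E}\left[(\Delta - m_1)^3\right]}{\operatorname{Var}(\Delta)^{3/2}}

% Product moments relate the effects on two outcomes, \Delta_X and \Delta_Y:
\operatorname{Cov}(\Delta_X, \Delta_Y)
  = \mathbb{E}[\Delta_X \Delta_Y] - \mathbb{E}[\Delta_X]\,\mathbb{E}[\Delta_Y]
```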
--------------------------------------------------------------------------------------------------------
The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete
Understanding how shared narratives influence AI agent behavior has profound implications for multi-agent system design. This study uses public goods games to demonstrate that story-based priming significantly affects negotiation outcomes between LLM agents. Common narratives enhance collaboration while conflicting stories promote competition, mirroring human social dynamics. Applications include designing cooperative AI systems for supply chain optimization, developing negotiation protocols for autonomous vehicles, creating collaborative research assistants, and building consensus mechanisms for decentralized systems. The findings suggest that carefully crafted narratives could align multi-agent systems toward collective goals, offering a novel approach to AI coordination beyond traditional reward engineering.
Authors: Gerrit Großmann, Larisa Ivanova, Sai Leela Poduru, Mohaddeseh Tabrizian, Islam Mesabah, David A. Selby, Sebastian J. Vollmer
Link: https://arxiv.org/abs/2505.03961v2
Date: 2025-05-08
Summary:
According to Yuval Noah Harari, large-scale human cooperation is driven by shared narratives that encode common beliefs and values. This study explores whether such narratives can similarly nudge LLM agents toward collaboration. We use a finitely repeated public goods game in which LLM agents choose either cooperative or egoistic spending strategies. We prime agents with stories highlighting teamwork to different degrees and test how this influences negotiation outcomes. Our experiments explore four questions: (1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What happens when the number of agents grows? (4) Are agents resilient against self-serving negotiators? We find that story-based priming significantly affects negotiation strategies and success rates. Common stories improve collaboration, benefiting each agent. By contrast, priming agents with different stories reverses this effect, and those agents primed toward self-interest prevail. We hypothesize that these results carry implications for multi-agent system design and AI alignment.
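The experimental setup is compact enough to sketch: a standard public goods payoff plus a narrative prepended to each agent's prompt. The `decide` callable is a toy stand-in for an LLM call, and the multiplier and endowment are illustrative values, not the paper's settings.

```python
from typing import Callable, List

def public_goods_round(contributions: List[float], multiplier: float = 1.6) -> List[float]:
    """Payoffs for one round: contributions are pooled, multiplied, and shared
    equally; whatever an agent kept out of the pool stays private (endowment 1)."""
    share = sum(contributions) * multiplier / len(contributions)
    return [share + (1.0 - c) for c in contributions]

def narrative_agent(story: str, decide: Callable[[str], float]) -> float:
    """`decide` is a placeholder for an LLM call; priming is just prepending the
    story to the prompt before asking for a contribution in [0, 1]."""
    prompt = f"{story}\n\nYou have 1 unit. How much do you contribute to the common pool?"
    return max(0.0, min(1.0, decide(prompt)))

# Toy stand-in for an LLM: stories about teamwork nudge contributions upward.
decide = lambda prompt: 0.8 if "teamwork" in prompt.lower() else 0.2
contributions = [narrative_agent("A story about teamwork.", decide),
                 narrative_agent("A story about looking out for yourself.", decide)]
print(public_goods_round(contributions))
```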
--------------------------------------------------------------------------------------------------------
Edge Large AI Models: Collaborative Deployment and IoT Applications
The deployment of large AI models at the network edge promises real-time intelligence but faces significant resource constraints. This framework addresses the challenge through collaborative training that adaptively decomposes models based on computational resources, data modalities, and objectives. The microservice-based inference architecture virtualizes functional modules to optimize resource utilization. Applications include smart city infrastructure, industrial IoT monitoring, autonomous vehicle networks, and healthcare wearables. By enabling context-aware generative tasks and multi-modal inference at the edge, this work bridges the gap between powerful AI capabilities and the distributed nature of IoT ecosystems, making sophisticated AI accessible where centralized processing is impractical.
Authors: Zixin Wang, Yuanming Shi, Khaled B. Letaief
Link: https://arxiv.org/abs/2505.03139v1
Date: 2025-05-06
Summary:
Large artificial intelligence models (LAMs) emulate human-like problem-solving capabilities across diverse domains, modalities, and tasks. By leveraging the communication and computation resources of geographically distributed edge devices, edge LAMs enable real-time intelligent services at the network edge. Unlike conventional edge AI, which relies on small or moderate-sized models for direct feature-to-prediction mappings, edge LAMs leverage the intricate coordination of modular components to enable context-aware generative tasks and multi-modal inference. We shall propose a collaborative deployment framework for edge LAM by characterizing the LAM intelligent capabilities and limited edge network resources. Specifically, we propose a collaborative training framework over heterogeneous edge networks that adaptively decomposes LAMs according to computation resources, data modalities, and training objectives, reducing communication and computation overheads during the fine-tuning process. Furthermore, we introduce a microservice-based inference framework that virtualizes the functional modules of edge LAMs according to their architectural characteristics, thereby improving resource utilization and reducing inference latency. The developed edge LAM will provide actionable solutions to enable diversified Internet-of-Things (IoT) applications, facilitated by constructing mappings from diverse sensor data to token representations and fine-tuning based on domain knowledge.
--------------------------------------------------------------------------------------------------------
AIOps solutions typically discard historical models during periodic retraining, potentially wasting models that could excel on future data patterns. This empirical study evaluates model selection mechanisms that leverage past models for current predictions, using datasets from Google, Alibaba, and BackBlaze. The findings reveal that temporal adjacency-based selection often outperforms periodic retraining, suggesting significant efficiency gains. Applications include cloud infrastructure monitoring, predictive maintenance systems, anomaly detection services, and performance optimization tools. By demonstrating a performance gap between current methods and theoretical bounds, this work motivates development of more sophisticated selection mechanisms that could dramatically improve AIOps efficiency and effectiveness.
Authors: Yingzhe Lyu, Hao Li, Heng Li, Ahmed E. Hassan
Link: https://arxiv.org/abs/2505.02961v1
Date: 2025-05-05
Summary:
AIOps (Artificial Intelligence for IT Operations) solutions leverage the tremendous amount of data produced during the operation of large-scale systems and machine learning models to assist software practitioners in their system operations. Existing AIOps solutions usually maintain AIOps models against concept drift through periodical retraining, despite leaving a pile of discarded historical models that may perform well on specific future data. Other prior works propose dynamically selecting models for prediction tasks from a set of candidate models to optimize the model performance. However, there is no prior work in the AIOps area that assesses the use of model selection mechanisms on historical models to improve model performance or robustness. To fill the gap, we evaluate several model selection mechanisms by assessing their capabilities in selecting the optimal AIOps models that were built in the past to make predictions for the target data. We performed a case study on three large-scale public operation datasets: two trace datasets from the cloud computing platforms of Google and Alibaba, and one disk stats dataset from the BackBlaze cloud storage data center. We observe that the model selection mechanisms utilizing temporal adjacency tend to have better performance and can prevail over the periodical retraining approach. Our findings also highlight a performance gap between existing model selection mechanisms and the theoretical upper bound, which may motivate future researchers and practitioners to investigate more efficient and effective model selection mechanisms that fit in the context of AIOps.
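The kind of selection mechanism being compared is simple to state in code. The sketch below reuses historical models either by temporal adjacency or by a score on a recent window; the interface and details are assumed for illustration, not taken from the study.

```python
from typing import Any, Callable, List, Optional, Tuple

def select_historical_model(models: List[Tuple[int, Any]],
                            target_period: int,
                            recent_score: Optional[Callable[[Any], float]] = None) -> Any:
    """Reuse a previously trained model for the target period instead of
    discarding it at the next retraining cycle.

    `models` holds (training_period, model) pairs. Without a scoring function
    this falls back to temporal adjacency (the model trained closest in time);
    with one, it picks the model scoring best on a recent labeled window,
    another mechanism of the kind the study evaluates.
    """
    candidates = [(p, m) for p, m in models if p <= target_period]
    if not candidates:
        raise ValueError("no historical model trained before the target period")
    if recent_score is None:
        return max(candidates, key=lambda pm: pm[0])[1]            # temporal adjacency
    return max(candidates, key=lambda pm: recent_score(pm[1]))[1]  # validation-based
```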
--------------------------------------------------------------------------------------------------------
The use of Artificial Intelligence for Intervention and Assessment in Individuals with ASD
Early diagnosis and effective intervention for autism spectrum disorder remain challenging, yet AI technologies offer promising solutions through objective assessment and personalized support. This paper explores deep learning algorithms for behavioral pattern recognition and AI-powered intervention tools including social robots and adaptive communication systems. Robots like NAO and Kaspar demonstrate effectiveness in enhancing social skills through structured interactions. Applications include early screening programs, personalized therapy protocols, educational support systems, and communication assistive technologies. The work addresses both opportunities and challenges, emphasizing the need for long-term evaluation and individual customization to ensure AI tools genuinely benefit individuals with ASD.
Authors: Aggeliki Sideraki, Christos-Nikolaos Anagnostopoulos
Link: https://arxiv.org/abs/2505.02747v1
Date: 2025-05-05
Summary:
This paper explores the use of Artificial Intelligence (AI) as a tool for diagnosis, assessment, and intervention for individuals with Autism Spectrum Disorder (ASD). It focuses particularly on AI's role in early diagnosis, utilizing advanced machine learning techniques and data analysis. Recent studies demonstrate that deep learning algorithms can identify behavioral patterns through biometric data analysis, video-based interaction assessments, and linguistic feature extraction, providing a more accurate and timely diagnosis compared to traditional methods. Additionally, AI automates diagnostic tools, reducing subjective biases and enabling the development of personalized assessment protocols for ASD monitoring. At the same time, the paper examines AI-powered intervention technologies, emphasizing educational robots and adaptive communication tools. Social robotic assistants, such as NAO and Kaspar, have been shown to enhance social skills in children by offering structured, repetitive interactions that reinforce learning. Furthermore, AI-driven Augmentative and Alternative Communication (AAC) systems allow children with ASD to express themselves more effectively, while machine-learning chatbots provide language development support through personalized responses. The study presents research findings supporting the effectiveness of these AI applications while addressing challenges such as long-term evaluation and customization to individual needs. In conclusion, the paper highlights the significance of AI as an innovative tool in ASD diagnosis and intervention, advocating for further research to assess its long-term impact.
--------------------------------------------------------------------------------------------------------
Sensing Framework Design and Performance Optimization with Action Detection for ISCC
Next-generation wireless networks require efficient integration of sensing, communication, and computation for human-centric applications. This framework introduces action detection modules at edge devices to identify relevant time windows, reducing unnecessary data transmission and processing. The ADMM-based distributed algorithm optimizes sensing accuracy under resource constraints through alternating device-level and server-level optimization. Applications include smart home automation, gesture recognition systems, health monitoring networks, and augmented reality interfaces. By achieving higher accuracy with limited resources compared to baselines, this work enables practical deployment of intelligent sensing applications in resource-constrained environments, advancing the vision of pervasive computing.
Authors: Weiwei Chen, Yinghui He, Guanding Yu, Jianfeng Wang, Haiyan Luo
Link: https://arxiv.org/abs/2505.02554v1
Date: 2025-05-05
Summary:
Integrated sensing, communication, and computation (ISCC) has been regarded as a prospective technology for the next-generation wireless network, supporting human-centric intelligent applications. However, the delay sensitivity of these computation-intensive applications, especially in a multi-device ISCC system with limited resources, highlights the urgent need for efficient sensing task execution frameworks. To address this, we propose a resource-efficient sensing framework in this paper. Different from existing solutions, it features a novel action detection module deployed at each device to detect the onset of an action. Only time windows filled with signals of interest are offloaded to the edge server and processed by the edge recognition module, thus reducing overhead. Furthermore, we quantitatively analyze the sensing performance of the proposed sensing framework and formulate a sensing accuracy maximization problem under power, delay, and resource limitations for the multi-device ISCC system. By decomposing it into two subproblems, we develop an alternating direction method of multipliers (ADMM)-based distributed algorithm. It alternatively solves a sensing accuracy maximization subproblem at each device and employs a closed-form computation resource allocation strategy at the edge server until convergence. Finally, a real-world test is conducted using commodity wireless devices to validate the sensing performance analysis. Extensive test results demonstrate that our proposal achieves higher sensing accuracy under limited resources compared to the two baselines.
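The "offload only signals of interest" idea can be illustrated with a toy energy-threshold detector; the paper's detection module and the ADMM-based resource optimization are far more involved, and the window size and threshold here are arbitrary.

```python
import numpy as np

def detect_action_windows(signal: np.ndarray, win: int = 64,
                          threshold: float = 2.0) -> list:
    """Toy on-device detector: flag windows whose energy exceeds a multiple of
    the noise floor, so only those windows are offloaded to the edge server."""
    windows = signal[: len(signal) // win * win].reshape(-1, win)
    energy = (windows ** 2).mean(axis=1)
    noise_floor = np.median(energy)
    return [i for i, e in enumerate(energy) if e > threshold * noise_floor]

rng = np.random.default_rng(1)
sig = rng.normal(scale=0.1, size=4096)
sig[1000:1300] += np.sin(np.linspace(0, 30, 300))  # an embedded "action" of interest
print(detect_action_windows(sig))                   # indices of windows worth offloading
```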
--------------------------------------------------------------------------------------------------------