Week Ending 4.6.2025
RESEARCH WATCH: 4.6.2025
MedSAM2: Segment Anything in 3D Medical Images and Videos
MedSAM2 represents a breakthrough in medical image segmentation, addressing the critical need for precise delineation of anatomical structures in 3D medical images and videos. By fine-tuning the Segment Anything Model 2 on over 455,000 3D image-mask pairs and 76,000 frames, this promptable foundation model outperforms previous approaches across diverse organs, lesions, and imaging modalities. The human-in-the-loop pipeline reduces manual annotation costs by more than 85%, as demonstrated in extensive user studies involving CT lesions, liver MRI lesions, and echocardiogram video frames. With integration into widely used platforms and user-friendly interfaces, MedSAM2 offers practical support for efficient, scalable, and high-quality segmentation in both research and healthcare settings.
Authors: Jun Ma, Zongxin Yang, Sumin Kim, Bihui Chen, Mohammed Baharoon, Adibvafa Fallahpour, Reza Asakereh, Hongwei Lyu, Bo Wang
Link: https://arxiv.org/abs/2504.03600v1
Date: 2025-04-04
Summary:
Medical image and video segmentation is a critical task for precision medicine, which has witnessed considerable progress in developing task or modality-specific and generalist models for 2D images. However, there have been limited studies on building general-purpose models for 3D images and videos with comprehensive user studies. Here, we present MedSAM2, a promptable segmentation foundation model for 3D image and video segmentation. The model is developed by fine-tuning the Segment Anything Model 2 on a large medical dataset with over 455,000 3D image-mask pairs and 76,000 frames, outperforming previous models across a wide range of organs, lesions, and imaging modalities. Furthermore, we implement a human-in-the-loop pipeline to facilitate the creation of large-scale datasets resulting in, to the best of our knowledge, the most extensive user study to date, involving the annotation of 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames, demonstrating that MedSAM2 can reduce manual costs by more than 85%. MedSAM2 is also integrated into widely used platforms with user-friendly interfaces for local and cloud deployment, making it a practical tool for supporting efficient, scalable, and high-quality segmentation in both research and healthcare environments.
--------------------------------------------------------------------------------------------------------
Towards Resilient Federated Learning in CyberEdge Networks: Recent Advances and Future Trends
This survey explores cutting-edge techniques in resilient federated learning (ResFL) for CyberEdge networks, focusing on joint training with agglomerative deduction and feature-oriented security mechanisms. The research addresses non-IID data challenges through adaptive hierarchical learning strategies, improving scalability while reducing communication overhead. The paper analyzes fault tolerance techniques for detecting unreliable devices and enhancing convergence stability, alongside comprehensive examination of feature-oriented threats including poisoning and inference attacks. Resilient aggregation techniques, anomaly detection, and cryptographic defenses are discussed as security enhancements. Additionally, the integration of 6G, large language models, and interoperable learning frameworks promotes privacy-preserving, decentralized cross-domain training with ultra-low latency and AI-driven network management, fostering secure ResFL deployment in CyberEdge networks.
Authors: Kai Li, Zhengyang Zhang, Azadeh Pourkabirian, Wei Ni, Falko Dressler, Ozgur B. Akan
Link: https://arxiv.org/abs/2504.01240v1
Date: 2025-04-01
Summary:
In this survey, we investigate the most recent techniques of resilient federated learning (ResFL) in CyberEdge networks, focusing on joint training with agglomerative deduction and feature-oriented security mechanisms. We explore adaptive hierarchical learning strategies to tackle non-IID data challenges, improving scalability and reducing communication overhead. Fault tolerance techniques and agglomerative deduction mechanisms are studied to detect unreliable devices, refine model updates, and enhance convergence stability. Unlike existing FL security research, we comprehensively analyze feature-oriented threats, such as poisoning, inference, and reconstruction attacks that exploit model features. Moreover, we examine resilient aggregation techniques, anomaly detection, and cryptographic defenses, including differential privacy and secure multi-party computation, to strengthen FL security. In addition, we discuss the integration of 6G, large language models (LLMs), and interoperable learning frameworks to enhance privacy-preserving and decentralized cross-domain training. These advancements offer ultra-low latency, artificial intelligence (AI)-driven network management, and improved resilience against adversarial attacks, fostering the deployment of secure ResFL in CyberEdge networks.
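One of the surveyed defenses, resilient aggregation, is concrete enough to sketch: coordinate-wise median aggregation tolerates a minority of poisoned client updates that would wreck a plain average. A minimal illustration in pure Python (not code from the survey):

```python
from statistics import median

def median_aggregate(client_updates):
    """Aggregate client model updates coordinate-wise by median.

    Unlike plain averaging, the median of each coordinate is unaffected
    by a minority of arbitrarily corrupted (poisoned) updates.
    """
    num_params = len(client_updates[0])
    return [median(u[i] for u in client_updates) for i in range(num_params)]

# Three honest clients roughly agree; one poisoned client sends huge values.
updates = [
    [0.9, 1.1, 1.0],
    [1.0, 1.0, 0.9],
    [1.1, 0.9, 1.1],
    [100.0, -100.0, 100.0],  # poisoned update
]
print(median_aggregate(updates))  # stays near [1.05, 0.95, 1.05]
```

A mean over the same updates would be dragged to roughly 25 in every coordinate; the median barely moves.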
--------------------------------------------------------------------------------------------------------
PolygoNet: Leveraging Simplified Polygonal Representation for Effective Image Classification
PolygoNet introduces an innovative approach to image classification by transforming input images into compact polygonal representations using dominant points or contour coordinates. This transformation significantly reduces computational requirements, accelerates training, and conserves resources—making it ideal for real-time and resource-constrained applications. By inherently capturing essential image features while filtering noise, these representations provide natural regularization that mitigates overfitting. The resulting lightweight models achieve performance comparable to state-of-the-art methods using full-resolution images while enabling deployment on edge devices. Extensive experiments on benchmark datasets validate the approach's effectiveness in reducing complexity, improving generalization, and facilitating edge computing applications, demonstrating the potential of polygonal representations in advancing efficient and scalable deep learning solutions.
Authors: Salim Khazem, Jeremy Fix, Cédric Pradalier
Link: https://arxiv.org/abs/2504.01214v1
Date: 2025-04-01
Summary:
Deep learning models have achieved significant success in various image-related tasks. However, they often encounter challenges related to computational complexity and overfitting. In this paper, we propose an efficient approach that leverages polygonal representations of images using dominant points or contour coordinates. By transforming input images into these compact forms, our method significantly reduces computational requirements, accelerates training, and conserves resources, making it suitable for real-time and resource-constrained applications. These representations inherently capture essential image features while filtering noise, providing a natural regularization effect that mitigates overfitting. The resulting lightweight models achieve performance comparable to state-of-the-art methods using full-resolution images while enabling deployment on edge devices. Extensive experiments on benchmark datasets validate the effectiveness of our approach in reducing complexity, improving generalization, and facilitating edge computing applications. This work demonstrates the potential of polygonal representations in advancing efficient and scalable deep learning solutions for real-world scenarios. The code for the experiments of the paper is provided at https://github.com/salimkhazem/PolygoNet.
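The idea of reducing a contour to its dominant points can be sketched with the classic Ramer-Douglas-Peucker simplification; this is a plausible stand-in for illustration, not necessarily the exact extraction method PolygoNet uses:

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: simplify a polyline/contour to its dominant
    points, keeping every point farther than `epsilon` from the chord
    between the current endpoints."""
    if len(points) < 3:
        return points
    (x1, y1), (x2, y2) = points[0], points[-1]
    chord = math.hypot(x2 - x1, y2 - y1) or 1e-12
    # Perpendicular distance of each interior point to the chord.
    dmax, imax = 0.0, 0
    for i in range(1, len(points) - 1):
        x0, y0 = points[i]
        d = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1) / chord
        if d > dmax:
            dmax, imax = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]  # all interior points are negligible
    # Keep the farthest point and recurse on both halves.
    left = rdp(points[: imax + 1], epsilon)
    right = rdp(points[imax:], epsilon)
    return left[:-1] + right

# A noisy, nearly straight edge collapses to its two endpoints...
print(rdp([(0, 0), (1, 0.05), (2, -0.04), (3, 0.02), (4, 0)], 0.1))
# ...while a genuine corner survives.
print(rdp([(0, 0), (2, 2), (4, 0)], 0.1))
```

The output of such a pass (a short list of vertices instead of a full pixel grid) is the kind of compact polygonal input the paper feeds to its lightweight models.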
--------------------------------------------------------------------------------------------------------
SeizureTransformer: Scaling U-Net with Transformer for Simultaneous Time-Step Level Seizure Detection from Long EEG Recordings
SeizureTransformer addresses the critical challenge of seizure detection for epilepsy patients, offering a novel solution to the limitations of existing methods that require extensive post-processing and struggle with long-range patterns in EEG data. This innovative model combines a deep encoder with 1D convolutions, a residual CNN stack with a transformer encoder for contextual information, and a streamlined decoder that directly indicates seizure presence at every time step. Ranking first in the 2025 "seizure detection challenge" at the International Conference on Artificial Intelligence in Epilepsy, SeizureTransformer demonstrates superior performance across both public and private EEG datasets, highlighting its potential for real-time, precise detection that could significantly improve epilepsy management for the 65 million affected individuals worldwide.
Authors: Kerui Wu, Ziyue Zhao, Bülent Yener
Link: https://arxiv.org/abs/2504.00336v2
Date: 2025-04-02
Summary:
Epilepsy is a common neurological disorder that affects around 65 million people worldwide. Detecting seizures quickly and accurately is vital, given the prevalence and severity of the associated complications. Recently, deep learning-based automated seizure detection methods have emerged as solutions; however, most existing methods require extensive post-processing and do not effectively handle the crucial long-range patterns in EEG data. In this work, we propose SeizureTransformer, a simple model comprising (i) a deep encoder of 1D convolutions, (ii) a residual CNN stack and a transformer encoder that embed the previous output into a high-level representation with contextual information, and (iii) a streamlined decoder that converts these features into a sequence of probabilities, directly indicating the presence or absence of seizures at every time step. Extensive experiments on public and private EEG seizure detection datasets demonstrate that our model significantly outperforms existing approaches (ranked first in the 2025 "seizure detection challenge" organized at the International Conference on Artificial Intelligence in Epilepsy and Other Neurological Disorders), underscoring its potential for real-time, precise seizure detection.
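Because the decoder emits one probability per time step, turning model output into seizure events reduces to a simple thresholding pass. A hypothetical sketch of such minimal post-processing (not the authors' code; the threshold and minimum duration are illustrative):

```python
def probs_to_events(probs, threshold=0.5, min_len=3):
    """Turn a sequence of per-time-step seizure probabilities into
    (start, end) index pairs, discarding runs shorter than `min_len`."""
    events, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                         # a run of high probability begins
        elif p < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i))     # long enough to count as an event
            start = None
    if start is not None and len(probs) - start >= min_len:
        events.append((start, len(probs)))    # run extends to the end
    return events

probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.95, 0.1, 0.6, 0.2, 0.1]
print(probs_to_events(probs))  # → [(2, 6)]: the lone spike at index 7 is dropped
```

The point of the architecture is that this pass is essentially all the post-processing required, in contrast to the heavier pipelines of prior methods.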
--------------------------------------------------------------------------------------------------------
FedPaI: Achieving Extreme Sparsity in Federated Learning via Pruning at Initialization
FedPaI introduces a groundbreaking approach to Federated Learning (FL) that addresses the significant resource constraints of edge environments through Pruning at Initialization (PaI). Unlike existing iterative pruning techniques that struggle with FL's decentralized and data-imbalanced nature, FedPaI identifies optimal sparse connections at the beginning of training, maximizing model capacity while drastically reducing communication and computation overhead. Supporting both structured and unstructured pruning, the framework adapts to diverse hardware and software environments, while introducing personalized client-side pruning mechanisms and sparsity-aware server-side aggregation for enhanced efficiency. Experimental results show FedPaI achieving an unprecedented 98% sparsity level without compromising accuracy, even under challenging non-IID settings, while accelerating training by 6.4 to 7.9 times.
Authors: Haonan Wang, Zeli Liu, Kajimusugura Hoshino, Tuo Zhang, John Paul Walters, Stephen Crago
Link: https://arxiv.org/abs/2504.00308v1
Date: 2025-04-01
Summary:
Federated Learning (FL) enables distributed training on edge devices but faces significant challenges due to resource constraints in edge environments, impacting both communication and computational efficiency. Existing iterative pruning techniques improve communication efficiency but are limited by their centralized design, which struggles with FL's decentralized and data-imbalanced nature, resulting in suboptimal sparsity levels. To address these issues, we propose FedPaI, a novel efficient FL framework that leverages Pruning at Initialization (PaI) to achieve extreme sparsity. FedPaI identifies optimal sparse connections at an early stage, maximizing model capacity and significantly reducing communication and computation overhead by fixing sparsity patterns at the start of training. To adapt to diverse hardware and software environments, FedPaI supports both structured and unstructured pruning. Additionally, we introduce personalized client-side pruning mechanisms for improved learning capacity and sparsity-aware server-side aggregation for enhanced efficiency. Experimental results demonstrate that FedPaI consistently outperforms existing efficient FL frameworks that apply conventional iterative pruning, with significant leads in both efficiency and model accuracy. For the first time, our proposed FedPaI achieves an extreme sparsity level of up to 98% without compromising model accuracy compared to unpruned baselines, even under challenging non-IID settings. By employing FedPaI with joint optimization of model learning capacity and sparsity, FL applications can benefit from faster convergence, accelerating training by 6.4 to 7.9 times.
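The core idea of pruning at initialization can be sketched with a SNIP-style saliency score (|weight × gradient|), fixing a binary mask before any training happens. This is an illustrative stand-in, not FedPaI's actual criterion, and it omits the client-side personalization and sparsity-aware aggregation described above:

```python
def prune_at_init(weights, grads, sparsity):
    """Fix a binary mask before training: rank connections by |w * g|
    saliency and zero out the lowest `sparsity` fraction. The mask
    then stays fixed for the rest of training."""
    saliency = [abs(w * g) for w, g in zip(weights, grads)]
    k = int(len(weights) * sparsity)  # number of connections to remove
    cutoff = sorted(saliency)[k] if k < len(weights) else float("inf")
    return [1 if s >= cutoff else 0 for s in saliency]

# Toy layer: initial weights and gradients from one warm-up batch.
weights = [0.5, -0.01, 0.3, 0.002, -0.8, 0.004, 0.02, -0.001, 0.6, 0.003]
grads   = [1.0] * len(weights)  # uniform gradients, so saliency = |w|
mask = prune_at_init(weights, grads, sparsity=0.8)
print(mask, f"sparsity = {1 - sum(mask) / len(mask):.0%}")
```

Because the mask is decided once, clients never exchange the pruned coordinates, which is where the communication savings come from.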
--------------------------------------------------------------------------------------------------------
Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future of eDemocracy
This research introduces an innovative approach to synthesizing public opinion data using Large Language Models (LLMs), addressing challenges in traditional survey methods like declining response rates and non-response bias. The novel technique of role creation based on knowledge injection leverages RAG and personality profiles from the HEXACO model alongside demographic information to generate dynamically crafted prompts. This method enables LLMs to simulate diverse opinions more accurately than existing prompt engineering approaches. Experiments using questions from the Cooperative Election Study demonstrate that this role-creation approach significantly improves the alignment of LLM-generated opinions with real-world human survey responses, increasing answer adherence. The research explores implications for electronic democracy while discussing challenges, limitations, and future research directions for synthesizing representative public opinion.
Authors: Rabimba Karanjai, Boris Shor, Amanda Austin, Ryan Kennedy, Yang Lu, Lei Xu, Weidong Shi
Link: https://arxiv.org/abs/2504.00241v1
Date: 2025-03-31
Summary:
This paper investigates the use of Large Language Models (LLMs) to synthesize public opinion data, addressing challenges in traditional survey methods like declining response rates and non-response bias. We introduce a novel technique: role creation based on knowledge injection, a form of in-context learning that leverages RAG and specified personality profiles from the HEXACO model and demographic information, and uses that for dynamically generated prompts. This method allows LLMs to simulate diverse opinions more accurately than existing prompt engineering approaches. We compare our results with pre-trained models with standard few-shot prompts. Experiments using questions from the Cooperative Election Study (CES) demonstrate that our role-creation approach significantly improves the alignment of LLM-generated opinions with real-world human survey responses, increasing answer adherence. In addition, we discuss challenges, limitations and future research directions.
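The role-creation idea is essentially structured prompt assembly. A hypothetical sketch, with template wording and field names invented for illustration (the paper's actual prompts and retrieval pipeline differ):

```python
def build_role_prompt(demographics, hexaco, retrieved_context):
    """Assemble a role-creation prompt from demographic fields, HEXACO
    personality scores (1-5), and RAG-retrieved background passages.
    All template text here is illustrative, not the paper's."""
    traits = ", ".join(f"{name}: {score}/5" for name, score in hexaco.items())
    profile = ", ".join(f"{k} {v}" for k, v in demographics.items())
    context = "\n".join(f"- {passage}" for passage in retrieved_context)
    return (
        f"You are a survey respondent: {profile}.\n"
        f"Your HEXACO personality profile is: {traits}.\n"
        f"Relevant background you are aware of:\n{context}\n"
        "Answer the following survey question in character, choosing "
        "one of the offered options."
    )

prompt = build_role_prompt(
    {"age": 46, "state": "Ohio", "education": "high school"},
    {"Honesty-Humility": 4, "Emotionality": 2, "Extraversion": 3},
    ["Local unemployment rose 1.2% this year."],
)
print(prompt)
```

The resulting string would be sent as the system or user prompt to an LLM; varying the injected profile over a sampled population is what lets the model simulate a distribution of opinions rather than a single "average" respondent.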
--------------------------------------------------------------------------------------------------------
Digital Innovation and ESG Performance: The Mediating Role of Generative AI Technology Adoption
This comprehensive study investigates how corporate digital innovation affects Environmental, Social, and Governance (ESG) performance, with particular attention to Generative AI technology adoption as a mediating factor. Using an extensive panel dataset of 8,000 observations from the CMARS and WIND database (2015-2023), the research employs multiple econometric techniques to reveal that digital innovation significantly enhances corporate ESG performance, with GAI technology serving as a crucial mediating mechanism. The relationship varies across firm size, industry type, and ownership structure, as confirmed through heterogeneity analysis. Results remain robust after addressing potential endogeneity concerns through instrumental variable estimation, propensity score matching, and difference-in-differences approaches. This research contributes valuable insights to the growing literature on technology-driven sustainability transformations and offers practical implications for corporate strategy and policy development.
Authors: Jun Cui
Link: https://arxiv.org/abs/2504.01041v1
Date: 2025-03-31
Summary:
This study investigates the relationship between corporate digital innovation and Environmental, Social, and Governance (ESG) performance, with a specific focus on the mediating role of Generative artificial intelligence technology adoption. Using a comprehensive panel dataset of 8,000 observations from the CMARS and WIND database spanning from 2015 to 2023, we employ multiple econometric techniques to examine this relationship. Our findings reveal that digital innovation significantly enhances corporate ESG performance, with GAI technology adoption serving as a crucial mediating mechanism. Specifically, digital innovation positively influences GAI technology adoption, which subsequently improves ESG performance. Furthermore, our heterogeneity analysis indicates that this relationship varies across firm size, industry type, and ownership structure. Finally, our results remain robust after addressing potential endogeneity concerns through instrumental variable estimation, propensity score matching, and difference-in-differences approaches. This research contributes to the growing literature on technology-driven sustainability transformations and offers practical implications for corporate strategy and policy development in promoting sustainable business practices through technological advancement.
--------------------------------------------------------------------------------------------------------
Learned Image Compression and Restoration for Digital Pathology
CLERIC introduces a groundbreaking deep learning-based image compression framework specifically designed for whole slide images (WSIs) in digital pathology. Addressing the challenges posed by ultra-high resolution and large file sizes that complicate storage, transmission, and real-time visualization, this innovative approach integrates a learnable lifting scheme with advanced convolutional techniques. The framework decomposes images into low- and high-frequency components through a lifting-scheme transform, enabling more structured latent representations. Parallel encoders with Deformable Residual Blocks and Recurrent Residual Blocks enhance feature extraction and spatial adaptability, while an inverse lifting transform ensures high-fidelity restoration of fine-grained tissue structures. Evaluated against state-of-the-art learned image compression models, CLERIC demonstrates superior rate-distortion performance, significantly reducing storage requirements while maintaining high diagnostic image quality for efficient clinical workflows.
Authors: SeonYeong Lee, EonSeung Seong, DongEon Lee, SiYeoul Lee, Yubin Cho, Chunsu Park, Seonho Kim, MinKyung Seo, YoungSin Ko, MinWoo Kim
Link: https://arxiv.org/abs/2503.23862v2
Date: 2025-04-01
Summary:
Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems. Code and models are available at: https://github.com/pnu-amilab/CLERIC.
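The lifting scheme at the heart of CLERIC's analysis stage can be illustrated with the simplest (Haar) lifting step: split the signal into even and odd samples, predict the odds from the evens (the residual is the high-frequency detail), then update the evens (yielding the low-frequency average). The transform is exactly invertible, which is what makes high-fidelity restoration possible. CLERIC's lifting filters are learned; this fixed Haar version is only a sketch:

```python
def haar_lift(signal):
    """One Haar lifting step: returns (low, high) frequency components."""
    even, odd = signal[0::2], signal[1::2]
    high = [o - e for e, o in zip(even, odd)]      # predict step: detail
    low = [e + h / 2 for e, h in zip(even, high)]  # update step: coarse average
    return low, high

def haar_unlift(low, high):
    """Exact inverse: undo the update, undo the predict, interleave."""
    even = [l - h / 2 for l, h in zip(low, high)]
    odd = [e + h for e, h in zip(even, high)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

x = [4.0, 6.0, 10.0, 12.0, 14.0, 14.0]
low, high = haar_lift(x)
print(low, high)                      # pairwise means and pairwise differences
assert haar_unlift(low, high) == x    # perfect reconstruction
```

In a learned codec, the fixed predict/update arithmetic above is replaced by trainable networks, and the resulting low/high components become the structured latents that the parallel encoders compress.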
--------------------------------------------------------------------------------------------------------
Large Language Models Pass the Turing Test
This groundbreaking study evaluates four systems—ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5—in randomized, controlled, and pre-registered Turing tests with independent populations. Participants engaged in 5-minute conversations simultaneously with another human and one system before judging which partner they believed was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be human 73% of the time—significantly more often than participants selected the actual human. LLaMa-3.1, with the same prompt, achieved a 56% selection rate, statistically indistinguishable from humans, while baseline models (ELIZA and GPT-4o) scored significantly below chance. These results provide the first empirical evidence of an artificial system passing a standard three-party Turing test, with profound implications for debates about LLM intelligence and their potential social and economic impacts.
Authors: Cameron R. Jones, Benjamin K. Bergen
Link: https://arxiv.org/abs/2503.23674v1
Date: 2025-03-31
Summary:
We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
--------------------------------------------------------------------------------------------------------
GIScience in the Era of Artificial Intelligence: A Research Agenda Towards Autonomous GIS
This visionary paper explores how generative AI and large language models are transforming geographic information systems toward autonomous GIS. The authors present a comprehensive framework defining five autonomous goals, five levels of autonomy, five core functions, and three operational scales for this emerging paradigm. Through proof-of-concept GIS agents, they demonstrate how autonomous GIS could independently perform geospatial data retrieval, spatial analysis, and map making without human intervention. The research identifies critical challenges and future directions, including fine-tuning decision cores, autonomous modeling, and addressing ethical implications. By establishing groundwork for a paradigm shift in GIScience, this paper envisions GIS evolving beyond traditional workflows to autonomously reason, derive, and innovate solutions for pressing global challenges, representing a fundamental rethinking of how geographic information is processed and applied.
Authors: Zhenlong Li, Huan Ning, Song Gao, Krzysztof Janowicz, Wenwen Li, Samantha T. Arundel, Chaowei Yang, Budhendra Bhaduri, Shaowen Wang, A-Xing Zhu, Mark Gahegan, Shashi Shekhar, Xinyue Ye, Grant McKenzie, Guido Cervone, Michael E. Hodgson
Link: https://arxiv.org/abs/2503.23633v2
Date: 2025-04-01
Summary:
The advent of generative AI exemplified by large language models (LLMs) opens new ways to represent and compute geographic information and transcend the process of geographic knowledge production, driving geographic information systems (GIS) towards autonomous GIS. Leveraging LLMs as the decision core, autonomous GIS can independently generate and execute geoprocessing workflows to perform spatial analysis. In this vision paper, we elaborate on the concept of autonomous GIS and present a framework that defines its five autonomous goals, five levels of autonomy, five core functions, and three operational scales. We demonstrate how autonomous GIS could perform geospatial data retrieval, spatial analysis, and map making with four proof-of-concept GIS agents. We conclude by identifying critical challenges and future research directions, including fine-tuning and self-growing decision cores, autonomous modeling, and examining the ethical and practical implications of autonomous GIS. By establishing the groundwork for a paradigm shift in GIScience, this paper envisions a future where GIS moves beyond traditional workflows to autonomously reason, derive, innovate, and advance solutions to pressing global challenges.
--------------------------------------------------------------------------------------------------------
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Unified World Models (UWM) presents a groundbreaking framework that bridges the gap between imitation learning and world modeling for robotics. The approach tackles a fundamental challenge: while high-quality expert demonstrations are scarce, abundant video data depicting diverse environments and behaviors lacks the action annotations required for conventional imitation learning. UWM integrates action diffusion and video diffusion processes within a unified transformer architecture, using independent diffusion timesteps for each modality. By controlling these timesteps, UWM flexibly represents policies, forward dynamics, inverse dynamics, and video generation. Experiments in simulated and real-world environments demonstrate that UWM enables effective pretraining on large-scale multitask robot datasets and facilitates learning from action-free video data, resulting in more generalizable and robust policies than traditional imitation learning approaches, offering a promising pathway toward leveraging heterogeneous datasets for scalable robot learning.
Authors: Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta
Link: https://arxiv.org/abs/2504.02792v1
Date: 2025-04-03
Summary:
Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation required for most contemporary methods. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
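The key trick, one transformer with two independent diffusion timesteps, can be caricatured in a few lines: setting a modality's timestep to 0 (clean, fully observed) or to T (pure noise, unobserved) selects which conditional the model computes. The mode table below is a sketch of the paper's idea, not its implementation, and it glosses over the policy/inverse-dynamics distinction (which depends on exactly which video frames are clean):

```python
T = 1000  # final diffusion timestep: modality is fully noised / unobserved

def uwm_mode(t_action, t_video):
    """Which conditional a Unified World Model computes, as a function of
    the two independent diffusion timesteps (0 = clean, T = pure noise)."""
    if t_action > 0 and t_video == 0:
        return "policy"             # denoise actions given clean video
    if t_action == 0 and t_video > 0:
        return "forward dynamics"   # denoise future video given clean actions
    if t_action > 0 and t_video > 0:
        return "joint (video + action) generation"
    return "fully conditioned (nothing to denoise)"

print(uwm_mode(t_action=T, t_video=0))  # → policy
print(uwm_mode(t_action=0, t_video=T))  # → forward dynamics
```

This is also why action-free video helps: such clips can still train the video side of the model (t_video > 0) while the action timestep is treated as fully noised, so no action labels are needed.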
--------------------------------------------------------------------------------------------------------
Role and Use of Race in AI/ML Models Related to Health
This perspective paper addresses the controversial and complex issue of race in health-related AI/ML models, providing a systematic framework to guide stakeholders through challenges across the AI/ML lifecycle. While race-related factors in healthcare AI have received increasing attention, until now there has been no comprehensive framework to examine and resolve these multifaceted issues. The authors present a broad-based, cross-cutting landscape analysis structured as "points to consider" to support inquiry and decision-making across different stages of AI/ML development and implementation. By adopting this holistic approach, the paper helps stakeholders navigate ethical considerations, potential biases, and appropriate applications of race in healthcare AI, promoting more responsible and equitable development of these increasingly important technologies.
Authors: Martin C. Were, Ang Li, Bradley A. Malin, Zhijun Yin, Joseph R. Coco, Benjamin X. Collins, Ellen Wright Clayton, Laurie L. Novak, Rachele Hendricks-Sturrup, Abiodun Oluyomi, Shilo Anders, Chao Yan
Link: https://arxiv.org/abs/2504.00899v1
Date: 2025-04-01
Summary:
The role and use of race within health-related artificial intelligence and machine learning (AI/ML) models has sparked increasing attention and controversy. Despite the complexity and breadth of related issues, a robust and holistic framework to guide stakeholders in their examination and resolution remains lacking. This perspective provides a broad-based, systematic, and cross-cutting landscape analysis of race-related challenges, structured around the AI/ML lifecycle and framed through "points to consider" to support inquiry and decision-making.
--------------------------------------------------------------------------------------------------------
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
This comprehensive survey explores the critical role of reasoning in artificial intelligence, highlighting recent advances in large language models (LLMs) and addressing the unique challenges of multimodal reasoning. While LLMs have made significant progress in arithmetic, commonsense, and symbolic reasoning domains, extending these capabilities to multimodal contexts—where models must integrate visual and textual inputs—presents substantial challenges. The paper identifies core difficulties such as handling conflicting cross-modal information, which requires sophisticated interpretative strategies. Through detailed comparisons of reasoning techniques in both textual and multimodal LLMs, the authors formulate fundamental challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. This work bridges theoretical frameworks with practical implementations, providing valuable guidance for researchers and practitioners working to enhance artificial reasoning systems across diverse modalities.
Authors: Jing Bi, Susan Liang, Xiaofei Zhou, Pinxin Liu, Junjia Guo, Yunlong Tang, Luchuan Song, Chao Huang, Guangyu Sun, Jinxi He, Jiarui Wu, Shu Yang, Daoan Zhang, Chen Chen, Lianggong Bruce Wen, Zhang Liu, Jiebo Luo, Chenliang Xu
Link: https://arxiv.org/abs/2504.03151v1
Date: 2025-04-04
Summary:
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts, where models must integrate both visual and textual inputs, continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.
--------------------------------------------------------------------------------------------------------
Over-the-Air Edge Inference via End-to-End Metasurfaces-Integrated Artificial Neural Networks
This innovative research reimagines wireless communication for Edge Inference (EI) by treating the wireless medium as an active computational element rather than merely a source of noise. Leveraging emerging Reconfigurable Intelligent Surfaces (RISs) and Stacked Intelligent Metasurfaces (SIM) technologies, the authors optimize the wireless environment to perform over-the-air computing similar to neural network layers. Their proposed Metasurfaces-Integrated Neural Networks (MINNs) framework includes comprehensive modeling, training through modified backpropagation for fading channels, and deployment considerations for both channel-aware and channel-agnostic transceivers. Numerical evaluations demonstrate that metasurfaces significantly enhance image classification performance under challenging link budgets that would impede conventional systems. Most impressively, the MINN framework achieves near-optimal performance with 50 dB lower testing signal-to-noise ratio compared to training, even without transceiver channel knowledge—potentially revolutionizing edge computing in resource-constrained wireless environments.
Authors: Kyriakos Stylianopoulos, Paolo Di Lorenzo, George C. Alexandropoulos
Link: https://arxiv.org/abs/2504.00233v1
Date: 2025-03-31
Summary:
In the Edge Inference (EI) paradigm, where a Deep Neural Network (DNN) is split across the transceivers to wirelessly communicate goal-defined features in solving a computational task, the wireless medium has been commonly treated as a source of noise. In this paper, motivated by the emerging technologies of Reconfigurable Intelligent Surfaces (RISs) and Stacked Intelligent Metasurfaces (SIM) that offer programmable propagation of wireless signals, either through controllable reflections or diffractions, we optimize the RIS/SIM-enabled smart wireless environment as a means of over-the-air computing, resembling the operations of DNN layers. We propose a framework of Metasurfaces-Integrated Neural Networks (MINNs) for EI, presenting its modeling, training through a backpropagation variation for fading channels, and deployment aspects. The overall end-to-end DNN architecture is general enough to admit RIS and SIM devices, through controllable reconfiguration before each transmission or fixed configurations after training, while both channel-aware and channel-agnostic transceivers are considered. Our numerical evaluation showcases that metasurfaces are instrumental in performing image classification under link budgets that impede conventional communications or metasurface-free systems. It is demonstrated that our MINN framework can significantly simplify EI requirements, achieving near-optimal performance with 50 dB lower testing signal-to-noise ratio compared to training, even without transceiver channel knowledge.
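As a rough illustration of the over-the-air computing idea (not the authors' MINN implementation), the sketch below models an RIS as a diagonal phase-shift layer between two Rayleigh-fading links and tunes its phases to shape the end-to-end channel. A simple greedy per-element phase search stands in for the paper's backpropagation variant; all dimensions and channel statistics are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 4 transmit features, 16 RIS elements, 4 receive antennas.
n_tx, n_ris, n_rx = 4, 16, 4

# Random Rayleigh-fading links: transmitter -> RIS and RIS -> receiver.
H1 = (rng.standard_normal((n_ris, n_tx)) + 1j * rng.standard_normal((n_ris, n_tx))) / np.sqrt(2)
H2 = (rng.standard_normal((n_rx, n_ris)) + 1j * rng.standard_normal((n_rx, n_ris))) / np.sqrt(2)

def effective_channel(phases):
    """End-to-end channel with the RIS acting as a diagonal phase-shift layer."""
    return H2 @ np.diag(np.exp(1j * phases)) @ H1

x = rng.standard_normal(n_tx) + 1j * rng.standard_normal(n_tx)  # feature vector

def rx_power(phases):
    """Received signal power for a given RIS phase configuration."""
    return float(np.linalg.norm(effective_channel(phases) @ x) ** 2)

# Greedy per-element phase search (a crude stand-in for gradient-based training):
# each element picks, with the others fixed, the phase that maximizes received power.
phases = np.zeros(n_ris)
candidates = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
for k in range(n_ris):
    powers = []
    for c in candidates:
        trial = phases.copy()
        trial[k] = c
        powers.append(rx_power(trial))
    phases[k] = candidates[int(np.argmax(powers))]

print(f"untuned power: {rx_power(np.zeros(n_ris)):.2f}, tuned power: {rx_power(phases):.2f}")
```

Because the all-zero configuration is among the candidates at every step, the tuned power can only improve on the untuned baseline, mirroring how a trained metasurface configuration compensates for fading that would otherwise drown the communicated features.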
--------------------------------------------------------------------------------------------------------
This thought-provoking analysis challenges the widely accepted "disclosure thesis" that clinicians are ethically obligated to inform patients when they use medical machine learning systems. The author examines the four major arguments supporting mandatory disclosure (risk-based, rights-based, materiality, and autonomy) and systematically demonstrates why each fails to provide compelling ethical justification. Beyond rejecting the disclosure thesis, the paper suggests that mandating disclosure could potentially harm patients by providing stakeholders with a mechanism to evade accountability for improper system applications or uses. This contrarian perspective invites a fundamental reconsideration of ethical frameworks surrounding AI transparency in healthcare, questioning whether automatic disclosure truly serves patient interests or merely creates an illusion of ethical practice while potentially undermining more substantive accountability mechanisms.
Authors: Joshua Hatherley
Link: https://arxiv.org/abs/2504.01043v2
Date: 2025-04-04
Summary:
It is commonly accepted that clinicians are ethically obligated to disclose their use of medical machine learning systems to patients, and that failure to do so would amount to a moral fault for which clinicians ought to be held accountable. Call this "the disclosure thesis." Four main arguments have been, or could be, given to support the disclosure thesis in the ethics literature: the risk-based argument, the rights-based argument, the materiality argument, and the autonomy argument. In this article, I argue that each of these four arguments is unconvincing, and therefore, that the disclosure thesis ought to be rejected. I suggest that mandating disclosure may even risk harming patients by providing stakeholders with a way to avoid accountability for harm that results from improper applications or uses of these systems.
--------------------------------------------------------------------------------------------------------
AI2Agent: An End-to-End Framework for Deploying AI Projects as Autonomous Agents
AI2Agent addresses a critical challenge in the AI industry: the complex, time-consuming deployment process that hinders widespread adoption of AI technologies. This end-to-end framework automates AI project deployment through three key mechanisms: guideline-driven execution, self-adaptive debugging, and case & solution accumulation. By dynamically analyzing deployment challenges, learning from past cases, and iteratively refining its approach, AI2Agent significantly reduces human intervention requirements. The framework's effectiveness was demonstrated through experiments on 30 diverse AI deployment cases spanning text-to-speech, image generation, image editing, and other applications. Results show substantial improvements in both deployment time and success rates compared to conventional methods. With publicly accessible code and demonstration videos, AI2Agent represents a significant advancement in making AI deployment more accessible, efficient, and reliable across industries.
Authors: Jiaxiang Chen, Jingwei Shi, Lei Gan, Jiale Zhang, Qingyu Zhang, Dongqian Zhang, Xin Pang, Zhucong Li, Yinghui Xu
Link: https://arxiv.org/abs/2503.23948v1
Date: 2025-03-31
Summary:
As AI technology advances, it is driving innovation across industries, increasing the demand for scalable AI project deployment. However, deployment remains a critical challenge due to complex environment configurations, dependency conflicts, cross-platform adaptation, and debugging difficulties, which hinder automation and adoption. This paper introduces AI2Agent, an end-to-end framework that automates AI project deployment through guideline-driven execution, self-adaptive debugging, and case & solution accumulation. AI2Agent dynamically analyzes deployment challenges, learns from past cases, and iteratively refines its approach, significantly reducing human intervention. To evaluate its effectiveness, we conducted experiments on 30 AI deployment cases, covering TTS, text-to-image generation, image editing, and other AI applications. Results show that AI2Agent significantly reduces deployment time and improves success rates. The code and demo video are now publicly accessible.
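The execute-and-debug loop at the heart of such a framework can be sketched roughly as follows. Here `fix_fn` is a hypothetical stand-in for the self-adaptive debugging step (in AI2Agent, a model that analyzes the failure and proposes a revised command); the function name, signature, and retry policy are illustrative assumptions, not the paper's actual interface.

```python
import subprocess

def deploy_with_retries(setup_commands, max_attempts=3, fix_fn=None):
    """Run deployment shell commands in order, retrying each with a proposed fix on failure.

    `fix_fn(command, stderr)` should return a revised command; when it is None,
    failed commands are simply retried as-is. Returns a log of (command, returncode).
    """
    log = []
    for cmd in setup_commands:
        for attempt in range(max_attempts):
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            log.append((cmd, result.returncode))
            if result.returncode == 0:
                break  # this step succeeded; move to the next command
            if fix_fn is not None:
                cmd = fix_fn(cmd, result.stderr)  # self-adaptive debugging step
        else:
            raise RuntimeError(f"deployment step failed after {max_attempts} attempts: {cmd}")
    return log
```

A case-and-solution store would sit behind `fix_fn`, letting previously successful fixes be replayed before consulting the model again; that accumulation step is what amortizes debugging cost across the 30 evaluated deployment cases.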
--------------------------------------------------------------------------------------------------------
Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations
This insightful study examines how web search integration affects users' ability to detect hallucinations in large language model (LLM) outputs. With 560 participants evaluating LLM-generated content (genuine, minor hallucination, or major hallucination) under three conditions—no search results, static search results, or dynamic participant-driven searches—the research reveals significant differences in performance. Participants with search capabilities rated hallucinated content as less accurate than the control group, with those using dynamic searches demonstrating greater accuracy in evaluating genuine content and higher overall confidence. Additionally, individual differences emerged, as participants with higher need for cognition more accurately identified major hallucinations. These findings highlight the potential benefits of integrating web search capabilities into LLMs as a hallucination detection mechanism, while underscoring the importance of personalized approaches that account for user characteristics when developing human-centered AI systems.
Authors: Mahjabin Nahar, Eun-Ju Lee, Jin Won Park, Dongwon Lee
Link: https://arxiv.org/abs/2504.01153v1
Date: 2025-04-01
Summary:
While we increasingly rely on large language models (LLMs) for various tasks, these models are known to produce inaccurate content or 'hallucinations' with potentially disastrous consequences. The recent integration of web search results into LLMs prompts the question of whether people utilize them to verify the generated content, thereby avoiding falling victim to hallucinations. This study (N = 560) investigated how the provision of search results, either static (fixed search results) or dynamic (participant-driven searches), affects participants' perceived accuracy and confidence in evaluating LLM-generated content (i.e., genuine, minor hallucination, major hallucination), compared to the control condition (no search results). Findings indicate that participants in both static and dynamic conditions (vs. control) rated hallucinated content to be less accurate. However, those in the dynamic condition rated genuine content as more accurate and demonstrated greater overall confidence in their assessments than those in the static or control conditions. In addition, participants higher in need for cognition (NFC) rated major hallucinations as less accurate than low-NFC participants did, with no corresponding difference for genuine content or minor hallucinations. These results underscore the potential benefits of integrating web search results into LLMs for the detection of hallucinations, as well as the need for a more nuanced approach when developing human-centered systems, taking user characteristics into account.
--------------------------------------------------------------------------------------------------------
The PENGWIN 2024 challenge represents a significant advancement in automated pelvic fracture segmentation, addressing crucial needs in trauma diagnosis, surgical planning, and intraoperative guidance. This benchmark study, organized as a MICCAI 2024 satellite event, evaluated state-of-the-art algorithms using a diverse dataset of 150 CT scans from multiple clinical centers and simulated X-ray images generated via DeepDRR. Submissions from 16 international teams were assessed under a rigorous multi-metric testing scheme, with the top CT algorithm achieving an impressive 0.930 fragment-wise intersection over union score, while X-ray performance peaked at 0.774, highlighting the challenges posed by overlapping structures. Beyond quantitative evaluation, the challenge revealed methodological diversity in algorithm design and exposed inherent uncertainties in fragment definition, particularly for incomplete fractures. These findings suggest that interactive approaches integrating human decision-making may be essential for improving model reliability in clinical settings.
Authors: Yudi Sang, Yanzhen Liu, Sutuke Yibulayimu, Yunning Wang, Benjamin D. Killeen, Mingxu Liu, Ping-Cheng Ku, Ole Johannsen, Karol Gotkowski, Maximilian Zenk, Klaus Maier-Hein, Fabian Isensee, Peiyan Yue, Yi Wang, Haidong Yu, Zhaohong Pan, Yutong He, Xiaokun Liang, Daiqi Liu, Fuxin Fan, Artur Jurgas, Andrzej Skalski, Yuxi Ma, Jing Yang, Szymon Płotka, Rafał Litka, Gang Zhu, Yingchun Song, Mathias Unberath, Mehran Armand, Dan Ruan, S. Kevin Zhou, Qiyong Cao, Chunpeng Zhao, Xinbao Wu, Yu Wang
Link: https://arxiv.org/abs/2504.02382v1
Date: 2025-04-03
Summary:
The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm attained an IoU of 0.774, highlighting the greater challenges posed by overlapping anatomical structures. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
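For concreteness, the fragment-wise IoU metric used to rank submissions can be computed along these lines. This is a simplified sketch: the matching of predicted to ground-truth fragments here is a greedy largest-overlap assignment, and the challenge's official evaluation may match fragments differently.

```python
import numpy as np

def fragment_wise_iou(pred, gt):
    """Mean IoU over ground-truth fragments (label 0 = background).

    `pred` and `gt` are integer label maps of identical shape; each positive
    label marks one bone fragment. Each ground-truth fragment is matched to
    the predicted fragment that overlaps it most, then IoU is averaged.
    """
    ious = []
    for label in np.unique(gt):
        if label == 0:
            continue
        gt_mask = gt == label
        # Predicted fragment labels overlapping this ground-truth fragment.
        overlapping = [lab for lab in np.unique(pred[gt_mask]) if lab != 0]
        if not overlapping:
            ious.append(0.0)  # fragment entirely missed
            continue
        best = max(overlapping, key=lambda lab: int(np.sum(pred[gt_mask] == lab)))
        pred_mask = pred == best
        inter = np.logical_and(pred_mask, gt_mask).sum()
        union = np.logical_or(pred_mask, gt_mask).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Because the score is averaged per fragment rather than per voxel, missing a small fragment costs as much as missing a large one, which is why the ambiguity of fragment definition in incomplete fractures directly affects the reported 0.930 (CT) and 0.774 (X-ray) figures.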
--------------------------------------------------------------------------------------------------------
This groundbreaking research presents a hybrid photonic architecture that integrates quantum dot (QD)-containing waveguides with low-loss lithium niobate circuits, incorporating 20 deterministic single-photon sources. Leveraging the piezoelectric properties of thin-film lithium niobate, the researchers achieve unprecedented on-chip local spectral tuning of QD emissions—up to 7.7 meV, three orders of magnitude greater than the transform-limited linewidth. This innovation enables on-chip quantum interference with 0.73 visibility between spatially separated QD sources connected by 0.48 mm waveguides, establishing a functional quantum network. The large-scale integration of spectrally tunable QD-based single-photon sources into low-loss lithium niobate circuits, combined with fast electro-optical switching, represents a significant advancement toward compact, lightweight, and scalable photonic quantum networks, addressing key challenges in quantum communication technology.
Authors: Xudong Wang, Xiuqi Zhang, Bowen Chen, Yifan Zhu, Yuanhao Qin, Lvbin Dong, Jiachen Cai, Dongchen Sui, Jinbo Wu, Quan Zhang
Link: https://arxiv.org/abs/2503.23755v1
Date: 2025-03-31
Summary:
Hybrid integrated quantum photonics combines solid-state artificial atoms with reconfigurable photonic circuits, enabling scalable chip-based quantum networks. Self-assembled quantum dots (QDs) are ideal for this goal due to their ability to generate highly indistinguishable single photons with exceptional brightness. Integrating QDs into low-loss photonic circuits can facilitate complex quantum networks by enabling entanglement transfer via two-photon interference. However, challenges such as limited scalability, spectral inhomogeneity, and quantum interference between independent sources remain. We present a hybrid photonic architecture that integrates QD-containing waveguides with low-loss lithium niobate (LN) circuits, incorporating 20 deterministic single-photon sources (SPSs). Using the piezoelectric properties of thin-film lithium niobate (TFLN), we achieve on-chip local spectral tuning of QD emissions by up to 7.7 meV, three orders of magnitude greater than the transform-limited linewidth. This approach enables on-chip quantum interference with a visibility of 0.73 between two spatially separated QD SPSs connected by 0.48 mm long waveguides, establishing a functional quantum network. The large-scale integration of spectrally tunable QD-based SPSs into low-loss LN circuits, combined with fast electro-optical switching, paves the way for compact, lightweight, and scalable photonic quantum networks.
--------------------------------------------------------------------------------------------------------
Bias in Large Language Models Across Clinical Applications: A Systematic Review
This comprehensive systematic review investigates bias in large language models (LLMs) applied to clinical tasks, examining 38 studies from database inception through 2025. The research reveals pervasive bias across various LLMs and clinical applications, stemming from both biased training data and model training processes. These biases manifest as allocative harm (differential treatment recommendations), representational harm (stereotypical associations), and performance disparities (variable output quality), predominantly affecting race/ethnicity and gender, but also age, disability, and language. The findings highlight how clinical LLM bias could lead to misdiagnosis and inappropriate treatment, particularly for marginalized populations. The authors emphasize the critical need for rigorous model evaluation, effective mitigation strategies, and continuous monitoring in real-world clinical settings to ensure safe, equitable, and trustworthy deployment of LLMs in healthcare, protecting patient outcomes and preventing exacerbation of existing health inequities.
Authors: Thanathip Suenghataiphorn, Narisara Tribuddharat, Pojsakorn Danpanichkul, Narathorn Kulthamrongsri
Link: https://arxiv.org/abs/2504.02917v1
Date: 2025-04-03
Summary:
Background: Large language models (LLMs) are rapidly being integrated into healthcare, promising to enhance various clinical tasks. However, concerns exist regarding their potential for bias, which could compromise patient care and exacerbate health inequities. This systematic review investigates the prevalence, sources, manifestations, and clinical implications of bias in LLMs. Methods: We conducted a systematic search of PubMed, OVID, and EMBASE from database inception through 2025, for studies evaluating bias in LLMs applied to clinical tasks. We extracted data on LLM type, bias source, bias manifestation, affected attributes, clinical task, evaluation methods, and outcomes. Risk of bias was assessed using a modified ROBINS-I tool. Results: Thirty-eight studies met inclusion criteria, revealing pervasive bias across various LLMs and clinical applications. Both data-related bias (from biased training data) and model-related bias (from model training) were significant contributors. Biases manifested as: allocative harm (e.g., differential treatment recommendations); representational harm (e.g., stereotypical associations, biased image generation); and performance disparities (e.g., variable output quality). These biases affected multiple attributes, most frequently race/ethnicity and gender, but also age, disability, and language. Conclusions: Bias in clinical LLMs is a pervasive and systemic issue, with a potential to lead to misdiagnosis and inappropriate treatment, particularly for marginalized patient populations. Rigorous evaluation of the model is crucial. Furthermore, the development and implementation of effective mitigation strategies, coupled with continuous monitoring in real-world clinical settings, are essential to ensure the safe, equitable, and trustworthy deployment of LLMs in healthcare.
--------------------------------------------------------------------------------------------------------