Week Ending 12.22.2024
RESEARCH WATCH: 12.22.2024
Explores how self-supervised learning from video can improve fundamental computer vision tasks like tracking and depth estimation. Using massive video datasets and transformer models scaling up to 22B parameters, the authors show consistent improvements on spatial and temporal tasks, advancing our understanding of how AI systems can learn to perceive the world in four dimensions (space + time). This could enhance the ability of robotics, autonomous vehicles, and surveillance systems to understand complex scenes.
Authors: João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman
Link: https://arxiv.org/abs/2412.15212v1
Date: 2024-12-19
Summary:
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
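To make the training signal concrete, below is a minimal, illustrative masked auto-encoding sketch over flattened spatio-temporal video patches. The patch size, masking ratio, and tiny encoder/decoder are placeholders, positional embeddings are omitted, and this is not the paper's 22B-parameter model.

```python
import torch
import torch.nn as nn

class TinyVideoMAE(nn.Module):
    """Toy masked auto-encoder over flattened video patches (illustrative only)."""
    def __init__(self, patch_dim=3 * 2 * 16 * 16, dim=256, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=1)
        self.head = nn.Linear(dim, patch_dim)

    def forward(self, patches):                       # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep, masked = perm[:, :n_keep], perm[:, n_keep:]
        take = lambda x, i: torch.gather(x, 1, i.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        latent = self.encoder(self.embed(take(patches, keep)))        # encode visible patches only
        dec_in = torch.cat([latent, self.mask_token.expand(B, N - n_keep, -1)], dim=1)
        pred = self.head(self.decoder(dec_in))[:, n_keep:]            # predictions for masked slots
        return nn.functional.mse_loss(pred, take(patches, masked))    # reconstruct masked patches

loss = TinyVideoMAE()(torch.randn(2, 128, 3 * 2 * 16 * 16))
```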
--------------------------------------------------------------------------------------------------------
A Full Transformer-based Framework for Automatic Pain Estimation using Videos
Presents a novel system using transformer neural networks to automatically assess pain levels from video footage. Testing on the BioVid database, their dual-transformer approach achieves state-of-the-art results in pain detection tasks. This technology could transform healthcare by enabling continuous, objective pain monitoring for patients unable to communicate verbally, improving pain management in hospitals and care facilities.
Authors: Stefanos Gkikas, Manolis Tsiknakis
Link: https://arxiv.org/abs/2412.15095v1
Date: 2024-12-19
Summary:
The automatic estimation of pain is essential in designing an optimal pain management system offering reliable assessment and reducing the suffering of patients. In this study, we present a novel full transformer-based framework consisting of a Transformer in Transformer (TNT) model and a Transformer leveraging cross-attention and self-attention blocks. Elaborating on videos from the BioVid database, we demonstrate state-of-the-art performance, showing the efficacy, efficiency, and generalization capability across all the primary pain estimation tasks.
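As a rough illustration of the kind of attention blocks the abstract mentions, the sketch below fuses two feature streams with cross-attention followed by self-attention. The dimensions and the two-stream setup are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossThenSelfFusion(nn.Module):
    """Illustrative block: one stream cross-attends to the other, then self-attention mixes the result."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, stream_a, stream_b):             # (B, Na, D), (B, Nb, D)
        fused, _ = self.cross(stream_a, stream_b, stream_b)
        x = self.norm1(stream_a + fused)                # residual over the query stream
        mixed, _ = self.self_attn(x, x, x)
        return self.norm2(x + mixed)

# e.g. frame-level embeddings from two transformer branches
out = CrossThenSelfFusion()(torch.randn(2, 16, 256), torch.randn(2, 16, 256))
```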
--------------------------------------------------------------------------------------------------------
Movie2Story: A framework for understanding videos and telling stories in the form of novel text
Introduces a framework that converts videos into detailed written narratives by analyzing audio, video, and character information. Using GPT4o for integration, it generates novel-length text descriptions that capture both visual and audio elements. This technology could automate video summarization for content creators, assist visually impaired individuals in understanding videos, and create written records of video content for archival or educational purposes.
Authors: Kangning Li, Zheyang Jia, Anyu Ying
Link: https://arxiv.org/abs/2412.14965v1
Date: 2024-12-19
Summary:
Multimodal video-to-text models have made considerable progress, primarily in generating brief descriptions of video content. However, there is still a deficiency in generating rich long-form text descriptions that integrate both video and audio. In this paper, we introduce a framework called M2S, designed to generate novel-length text by combining audio, video, and character recognition. M2S includes modules for video long-form text description and comprehension, audio-based analysis of emotion, speech rate, and character alignment, and visual-based character recognition alignment. By integrating multimodal information using the large language model GPT4o, M2S stands out in the field of multimodal text generation. We demonstrate the effectiveness and accuracy of M2S through comparative experiments and human evaluation. Additionally, the model framework has good scalability and significant potential for future research.
--------------------------------------------------------------------------------------------------------
Proposes a mobile app algorithm for comprehensive child development monitoring using digital phenotyping. By incorporating Bayesian AI, it tracks multiple aspects of growth including physical, emotional, cognitive, and environmental factors. This could transform pediatric care by providing real-time, multidimensional insights into child development, enabling earlier intervention and more personalized care strategies for families and healthcare providers.
Authors: Rolando Gonzales Martinez, Hinke Haisma
Link: https://arxiv.org/abs/2412.14720v1
Date: 2024-12-19
Summary:
This document proposes an algorithm for a mobile application designed to monitor multidimensional child growth through digital phenotyping. Digital phenotyping offers a unique opportunity to collect and analyze high-frequency data in real time, capturing behavioral, psychological, and physiological states of children in naturalistic settings. Traditional models of child growth primarily focus on physical metrics, often overlooking multidimensional aspects such as emotional, social, and cognitive development. In this paper, we introduce a Bayesian artificial intelligence (AI) algorithm that leverages digital phenotyping to create a Multidimensional Index of Child Growth (MICG). This index integrates data from various dimensions of child development, including physical, emotional, cognitive, and environmental factors. By incorporating probabilistic modeling, the proposed algorithm dynamically updates its learning based on data collected by the mobile app used by mothers and children. The app also infers uncertainty from response times, adjusting the importance of each dimension of child growth accordingly. Our contribution applies state-of-the-art technology to track multidimensional child development, enabling families and healthcare providers to make more informed decisions in real time.
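A minimal sketch of one way a precision-weighted composite index could be computed, with longer response times treated as noisier observations that receive lower weight. The response-time-to-precision mapping and the example scores below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def multidimensional_index(scores, response_times, tau=5.0):
    """Toy precision-weighted composite of standardized child-growth dimension scores.
    Slower responses are treated as less reliable (placeholder assumption)."""
    scores = np.asarray(scores, dtype=float)
    rt = np.asarray(response_times, dtype=float)
    precision = np.exp(-rt / tau)            # longer response time -> lower precision
    weights = precision / precision.sum()
    return float(np.dot(weights, scores)), weights

# physical, emotional, cognitive, environmental dimensions (z-scores), with response times in seconds
index, weights = multidimensional_index([0.3, -0.5, 0.8, 0.1], [2.0, 9.0, 3.5, 4.0])
```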
--------------------------------------------------------------------------------------------------------
Deep reinforcement learning with time-scale invariant memory
Integrates neuroscience-inspired scale-invariant memory into AI systems. This approach allows agents to learn effectively across different time scales, mirroring human learning capabilities. Applications could include more adaptable robots, improved AI decision-making systems, and better models for understanding human cognition and learning processes.
Authors: Md Rysul Kabir, James Mochizuki-Freeman, Zoran Tiganj
Link: https://arxiv.org/abs/2412.15292v1
Date: 2024-12-19
Summary:
The ability to estimate temporal relationships is critical for both animals and artificial agents. Cognitive science and neuroscience provide remarkable insights into behavioral and neural aspects of temporal credit assignment. In particular, scale invariance of learning dynamics, observed in behavior and supported by neural data, is one of the key principles that governs animal perception: proportional rescaling of temporal relationships does not alter the overall learning efficiency. Here we integrate a computational neuroscience model of scale invariant memory into deep reinforcement learning (RL) agents. We first provide a theoretical analysis and then demonstrate through experiments that such agents can learn robustly across a wide range of temporal scales, unlike agents built with commonly used recurrent memory architectures such as LSTM. This result illustrates that incorporating computational principles from neuroscience and cognitive science into deep neural networks can enhance adaptability to complex temporal dynamics, mirroring some of the core properties of human learning.
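The core property can be illustrated with a bank of leaky integrators whose time constants are logarithmically spaced: rescaling the input in time shifts activity across the bank rather than changing its shape. This is only a sketch of the principle; the paper's actual memory model and its integration into deep RL agents differ in detail.

```python
import numpy as np

def log_spaced_memory(signal, n_units=16, tau_min=1.0, tau_max=1000.0):
    """Bank of leaky integrators with log-spaced time constants (scale-invariant memory sketch)."""
    taus = np.geomspace(tau_min, tau_max, n_units)
    mem = np.zeros(n_units)
    history = []
    for x in signal:
        mem += (-mem + x) / taus              # dF/dt = (-F + input) / tau, with dt = 1
        history.append(mem.copy())
    return np.array(history), taus

impulse = np.r_[1.0, np.zeros(199)]           # a single event at t = 0
activity, taus = log_spaced_memory(impulse)   # decay curves spread across time scales
```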
--------------------------------------------------------------------------------------------------------
Addresses language bias in ophthalmological AI systems across different languages. The study reveals performance disparities in medical question-answering across languages and proposes CLARA, a method to reduce these biases. This work could improve healthcare access in Low and Middle-Income Countries by making AI-powered ophthalmological tools more equitable and effective across different languages and cultures.
Authors: David Restrepo, Chenwei Wu, Zhengxu Tang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding, Cong-Tinh Dao, Jack Gallifant, Robyn Gayle Dychiao, Jose Carlo Artiaga, André Hiroshi Bando, Carolina Pelegrini Barbosa Gracitelli, Vincenz Ferrer, Leo Anthony Celi, Danielle Bitterman, Michael G Morley, Luis Filipe Nakayama
Link: https://arxiv.org/abs/2412.14304v1
Date: 2024-12-18
Summary:
Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-augmented generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time de-biasing method leveraging retrieval augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.
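A rough sketch of an inference-time retrieve, answer, and self-verify loop in the spirit of CLARA is shown below. The `llm` and `retrieve` callables and the prompt wording are placeholders, not the paper's actual components.

```python
from typing import Callable, List

def reflective_rag_answer(question: str,
                          llm: Callable[[str], str],
                          retrieve: Callable[[str], List[str]],
                          max_rounds: int = 2) -> str:
    """Sketch of retrieval-augmented answering with a self-verification step (assumed prompts)."""
    def answer_with(evidence: List[str]) -> str:
        context = "\n".join(evidence)
        return llm(f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:")

    evidence = retrieve(question)
    answer = answer_with(evidence)
    for _ in range(max_rounds):
        context = "\n".join(evidence)
        verdict = llm(f"Question: {question}\nProposed answer: {answer}\n"
                      f"Evidence:\n{context}\nIs the answer fully supported? Reply YES or NO.")
        if verdict.strip().upper().startswith("YES"):
            break
        evidence = retrieve(question + " " + answer)   # refine retrieval and try again
        answer = answer_with(evidence)
    return answer
```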
--------------------------------------------------------------------------------------------------------
A Computationally Grounded Framework for Cognitive Attitudes (extended version)
Presents a formal system for modeling agents' beliefs and motivations. Using belief bases and modal operators, it provides a mathematical foundation for understanding psychological concepts. This framework could enhance AI systems' ability to model human psychology, improve human-AI interaction, and advance our theoretical understanding of cognitive processes.
Authors: Tiago de Lima, Emiliano Lorini, Elise Perrotin, François Schwarzentruber
Link: https://arxiv.org/abs/2412.14073v1
Date: 2024-12-18
Summary:
We introduce a novel language for reasoning about agents' cognitive attitudes of both epistemic and motivational type. We interpret it by means of a computationally grounded semantics using belief bases. Our language includes five types of modal operators for implicit belief, complete attraction, complete repulsion, realistic attraction and realistic repulsion. We give an axiomatization and show that our operators are not mutually expressible and that they can be combined to represent a large variety of psychological concepts including ambivalence, indifference, being motivated, being demotivated and preference. We present a dynamic extension of the language that supports reasoning about the effects of belief change operations. Finally, we provide a succinct formulation of model checking for our languages and a PSPACE model checking algorithm relying on a reduction to TQBF. We present experimental results on the computation time of the implemented algorithm in a concrete example.
--------------------------------------------------------------------------------------------------------
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
Introduces a novel approach to video action detection that combines visual, audio, and language information. Using actor-centric aggregation of multiple data streams, it achieves state-of-the-art performance in detecting actions in videos. This technology could improve surveillance systems, content moderation, and automated video analysis for various applications.
Authors: Taein Son, Soo Won Seo, Jisong Kim, Seok Hwan Lee, Jun Won Choi
Link: https://arxiv.org/abs/2412.13708v1
Date: 2024-12-18
Summary:
Video Action Detection (VAD) involves localizing and categorizing action instances in videos. Videos inherently contain various information sources, including audio, visual cues, and surrounding scene contexts. Effectively leveraging this multi-modal information for VAD is challenging, as the model must accurately focus on action-relevant cues. In this study, we introduce a novel multi-modal VAD architecture called the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context derived from large image captioning models. The core principle of JoVALE is the actor-centric aggregation of audio, visual, and scene descriptive contexts, where action-related cues from each modality are identified and adaptively combined. We propose a specialized module called the Actor-centric Multi-modal Fusion Network, designed to capture the joint interactions among actors and multi-modal contexts through a Transformer architecture. Our evaluation conducted on three popular VAD benchmarks, AVA, UCF101-24, and JHMDB51-21, demonstrates that incorporating multi-modal information leads to significant performance gains. JoVALE achieves state-of-the-art performance. The code will be available at https://github.com/taeiin/AAAI2025-JoVALE.
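The sketch below illustrates actor-centric aggregation in a generic form: one query per detected actor attends over concatenated audio, visual, and scene-caption tokens before per-actor action classification. Dimensions and the single attention layer are assumptions, not JoVALE's actual fusion network.

```python
import torch
import torch.nn as nn

class ActorCentricFusion(nn.Module):
    """Toy actor-centric fusion: actor queries attend over concatenated modality tokens."""
    def __init__(self, dim=256, heads=8, n_actions=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, n_actions)

    def forward(self, actor_queries, audio, visual, caption):
        context = torch.cat([audio, visual, caption], dim=1)      # (B, N_ctx, D)
        fused, _ = self.attn(actor_queries, context, context)     # (B, N_actors, D)
        return self.classifier(fused)                             # per-actor action logits

logits = ActorCentricFusion()(torch.randn(2, 5, 256),    # 5 actor queries
                              torch.randn(2, 50, 256),   # audio tokens
                              torch.randn(2, 196, 256),  # visual tokens
                              torch.randn(2, 32, 256))   # scene-caption tokens
```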
--------------------------------------------------------------------------------------------------------
Continuous Patient Monitoring with AI: Real-Time Analysis of Video in Hospital Care Settings
Describes a platform for real-time video analysis in hospital settings. The system monitors patient behavior and interactions, focusing on fall prevention and safety monitoring. This technology could revolutionize hospital care by providing continuous, automated patient monitoring, reducing adverse events, and enabling more efficient allocation of healthcare resources.
Authors: Paolo Gabriel, Peter Rehani, Tyler Troy, Tiffany Wyatt, Michael Choma, Narinder Singh
Link: https://arxiv.org/abs/2412.13152v1
Date: 2024-12-17
Summary:
This study introduces an AI-driven platform for continuous and passive patient monitoring in hospital settings, developed by LookDeep Health. Leveraging advanced computer vision, the platform provides real-time insights into patient behavior and interactions through video analysis, securely storing inference results in the cloud for retrospective evaluation. The dataset, compiled in collaboration with 11 hospital partners, encompasses over 300 high-risk fall patients and over 1,000 days of inference, enabling applications such as fall detection and safety monitoring for vulnerable patient populations. To foster innovation and reproducibility, an anonymized subset of this dataset is publicly available. The AI system detects key components in hospital rooms, including individual presence and role, furniture location, motion magnitude, and boundary crossings. Performance evaluation demonstrates strong accuracy in object detection (macro F1-score = 0.92) and patient-role classification (F1-score = 0.98), as well as reliable trend analysis for the "patient alone" metric (mean logistic regression accuracy = 0.82 ± 0.15). These capabilities enable automated detection of patient isolation, wandering, or unsupervised movement, key indicators for fall risk and other adverse events. This work establishes benchmarks for validating AI-driven patient monitoring systems, highlighting the platform's potential to enhance patient safety and care by providing continuous, data-driven insights into patient behavior and interactions.
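As a toy illustration of how a "patient alone" trend could be derived from per-frame person detections with role labels, consider the sketch below; the detection schema is a placeholder assumption, not LookDeep's actual format.

```python
from typing import Dict, List

def patient_alone_fraction(frames: List[List[Dict[str, str]]]) -> float:
    """Fraction of frames in which a patient is detected with no other people present (toy schema)."""
    alone = 0
    for detections in frames:
        roles = {d["role"] for d in detections}
        if roles == {"patient"}:              # at least one person, all of them patients
            alone += 1
    return alone / max(len(frames), 1)

frac = patient_alone_fraction([
    [{"role": "patient"}],
    [{"role": "patient"}, {"role": "staff"}],
    [{"role": "patient"}],
])  # -> patient alone in 2 of 3 frames
```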
--------------------------------------------------------------------------------------------------------
Creating an LLM-based AI-agent: A high-level methodology towards enhancing LLMs with APIs
Outlines a methodology for enhancing language models with API capabilities. The seven-step approach covers model selection, task decomposition, and API integration. This framework could guide the development of more capable AI assistants that can interact with external systems and perform real-world tasks more effectively.
Authors: Ioannis Tzachristas
Link: https://arxiv.org/abs/2412.13233v1
Date: 2024-12-17
Summary:
Large Language Models (LLMs) have revolutionized various aspects of engineering and science. Their utility is often bottlenecked by the lack of interaction with the external digital environment. To overcome this limitation and achieve integration of LLMs and Artificial Intelligence (AI) into real-world applications, customized AI agents are being constructed. Based on the technological trends and techniques, we extract a high-level approach for constructing these AI agents, focusing on their underlying architecture. This thesis serves as a comprehensive guide that elucidates a multi-faceted approach for empowering LLMs with the capability to leverage Application Programming Interfaces (APIs). We present a 7-step methodology that begins with the selection of suitable LLMs and the task decomposition that is necessary for complex problem-solving. This methodology includes techniques for generating training data for API interactions and heuristics for selecting the appropriate API among a plethora of options. These steps eventually lead to the generation of API calls that are both syntactically and semantically aligned with the LLM's understanding of a given task. Moreover, we review existing frameworks and tools that facilitate these processes and highlight the gaps in current attempts. In this direction, we propose an on-device architecture that aims to exploit the functionality of carry-on devices by using small models from the Hugging Face community. We examine the effectiveness of these approaches on real-world applications of various domains, including the generation of a piano sheet. Through an extensive analysis of the literature and available technologies, this thesis aims to set a compass for researchers and practitioners to harness the full potential of LLMs augmented with external tool capabilities, thus paving the way for more autonomous, robust, and context-aware AI agents.
--------------------------------------------------------------------------------------------------------
Investigates how transformers develop internal representations during in-context learning. By studying concept encoding and decoding mechanisms, it reveals how these models form and use abstractions. This research could lead to more interpretable AI systems and improve our understanding of machine learning and human cognition.
Authors: Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal
Link: https://arxiv.org/abs/2412.12276v2
Date: 2024-12-18
Summary:
Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how. In this paper, we propose a concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "finding the first noun in a sentence") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to and predictive of ICL performance. Our empirical insights shed light on the success and failure modes of large language models via their representations.
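One common way to quantify how well a latent concept is encoded is to fit a linear probe on hidden activations; the sketch below does this on synthetic vectors standing in for residual-stream states. It illustrates the general probing idea, not the paper's specific analysis pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic "activations": each latent concept is a direction plus noise.
rng = np.random.default_rng(0)
n, d, n_concepts = 600, 64, 3
concepts = rng.integers(0, n_concepts, size=n)
directions = rng.normal(size=(n_concepts, d))
acts = directions[concepts] + 0.8 * rng.normal(size=(n, d))

# Probe accuracy serves as a proxy for how separable (decodable) the concept code is.
X_tr, X_te, y_tr, y_te = train_test_split(acts, concepts, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
decodability = probe.score(X_te, y_te)
```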
--------------------------------------------------------------------------------------------------------
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
Addresses the challenge of processing long audio sequences in speech recognition systems. The token pruning strategy improves accuracy while reducing computational demands. This could enable more efficient speech recognition systems for longer conversations, meetings, and lectures, making audio content more accessible and searchable.
Authors: Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai "Helen" Li, Yiran Chen
Link: https://arxiv.org/abs/2412.12009v1
Date: 2024-12-16
Summary:
We introduce Speech Information Retrieval (SIR), a new long-context task for Speech Large Language Models (Speech LLMs), and present SPIRAL, a 1,012-sample benchmark testing models' ability to extract critical details from approximately 90-second spoken inputs. While current Speech LLMs excel at short-form tasks, they struggle with the computational and representational demands of longer audio sequences. To address this limitation, we propose SpeechPrune, a training-free token pruning strategy that uses speech-text similarity and approximated attention scores to efficiently discard irrelevant tokens. In SPIRAL, SpeechPrune achieves accuracy improvements of 29% and up to 47% over the original model and the random pruning model at a pruning rate of 20%, respectively. SpeechPrune can maintain network performance even at a pruning level of 80%. This approach highlights the potential of token-level pruning for efficient and scalable long-form speech understanding.
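A minimal, training-free pruning sketch in the spirit of the method: score each speech token by similarity to the text query and keep the top fraction in temporal order. It omits the approximated attention scores the paper also uses, and all shapes are placeholders.

```python
import torch

def prune_speech_tokens(speech_tokens, text_tokens, keep_ratio=0.2):
    """Keep the speech tokens most similar to the mean text embedding (illustrative sketch)."""
    text_query = text_tokens.mean(dim=1, keepdim=True)                       # (B, 1, D)
    scores = torch.nn.functional.cosine_similarity(
        speech_tokens, text_query.expand_as(speech_tokens), dim=-1)          # (B, N)
    n_keep = max(1, int(speech_tokens.size(1) * keep_ratio))
    idx = scores.topk(n_keep, dim=1).indices.sort(dim=1).values              # preserve temporal order
    return torch.gather(
        speech_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, speech_tokens.size(-1)))

kept = prune_speech_tokens(torch.randn(1, 4500, 768), torch.randn(1, 40, 768))
```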
--------------------------------------------------------------------------------------------------------
Transformers Use Causal World Models in Maze-Solving Tasks
Examines how transformer models develop internal representations when solving maze tasks. Using sparse autoencoders and attention analysis, it reveals surprising capabilities in handling complex features. This research could advance our understanding of AI reasoning and lead to more interpretable and controllable AI systems.
Authors: Alex F. Spies, William Edwards, Michael I. Ivanitskiy, Adrians Skapars, Tilman Räuker, Katsumi Inoue, Alessandra Russo, Murray Shanahan
Link: https://arxiv.org/abs/2412.11867v1
Date: 2024-12-16
Summary:
Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop surprisingly structured representations. When such representations comprehensively reflect the task domain's structure, they are commonly referred to as "World Models" (WMs). In this work, we discover such WMs in transformers trained on maze tasks. In particular, by employing Sparse Autoencoders (SAEs) and analysing attention patterns, we examine the construction of WMs and demonstrate consistency between the circuit analysis and the SAE feature-based analysis. We intervene upon the isolated features to confirm their causal role and, in doing so, find asymmetries between certain types of interventions. Surprisingly, we find that models are able to reason with respect to a greater number of active features than they see during training, even if attempting to specify these in the input token sequence would lead the model to fail. Furthermore, we observe that varying positional encodings can alter how WMs are encoded in a model's residual stream. By analyzing the causal role of these WMs in a toy domain, we hope to make progress toward an understanding of emergent structure in the representations acquired by Transformers, leading to the development of more interpretable and controllable AI systems.
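For reference, the sparse autoencoders used in this line of interpretability work are typically overcomplete ReLU autoencoders trained with an L1 sparsity penalty on the code; a minimal sketch follows, with placeholder sizes rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder for residual-stream activations: overcomplete ReLU code + L1 penalty."""
    def __init__(self, d_model=128, d_hidden=1024):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x, l1_coeff=1e-3):
        code = torch.relu(self.enc(x))                       # sparse feature activations
        recon = self.dec(code)
        loss = nn.functional.mse_loss(recon, x) + l1_coeff * code.abs().mean()
        return loss, code

sae = SparseAutoencoder()
loss, features = sae(torch.randn(256, 128))                  # batch of activation vectors
```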
--------------------------------------------------------------------------------------------------------
Predicting the Original Appearance of Damaged Historical Documents
Presents a new approach to repairing damaged historical documents using AI. The system can predict original document appearance and includes a large dataset for training. This technology could aid in preserving cultural heritage by restoring damaged historical documents and making them more accessible for study.
Authors: Zhenhua Yang, Dezhi Peng, Yongxin Shi, Yuyi Zhang, Chongyu Liu, Lianwen Jin
Link: https://arxiv.org/abs/2412.11634v1
Date: 2024-12-16
Summary:
Historical documents encompass a wealth of cultural treasures but suffer from severe damage, including missing characters, paper damage, and ink erosion over time. However, existing document processing methods primarily focus on binarization, enhancement, etc., neglecting the repair of these damages. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset HDR28K and a diffusion-based network DiffHDR for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that the proposed DiffHDR trained using HDR28K significantly surpasses existing approaches and exhibits remarkable performance in handling real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction of document processing and contribute to the inheritance of invaluable cultures and civilizations. The dataset and code are available at https://github.com/yeungchenwa/HDR.
--------------------------------------------------------------------------------------------------------
Analyzes deep learning integration in cybersecurity systems. The study examines GPU support and implementation details across frameworks. This research could guide the development of more effective AI-powered cybersecurity systems and help organizations choose appropriate security solutions.
Authors: Tobias Becher, Simon Torka
Link: https://arxiv.org/abs/2412.12648v1
Date: 2024-12-17
Summary:
Traditional rule-based cybersecurity systems have proven highly effective against known malware threats. However, they face challenges in detecting novel threats. To address this issue, emerging cybersecurity systems are incorporating AI techniques, specifically deep-learning algorithms, to enhance their ability to detect incidents, analyze alerts, and respond to events. While these techniques offer a promising approach to combating dynamic security threats, they often require significant computational resources. Therefore, frameworks that incorporate AI-based cybersecurity mechanisms need to support the use of GPUs to ensure optimal performance. Many cybersecurity framework vendors do not provide sufficiently detailed information about their implementation, making it difficult to assess the techniques employed and their effectiveness. This study aims to overcome this limitation by providing an overview of the most used cybersecurity frameworks that utilize AI techniques, specifically focusing on frameworks that provide comprehensive information about their implementation. Our primary objective is to identify the deep-learning techniques employed by these frameworks and evaluate their support for GPU acceleration. We have identified a total of two deep-learning algorithms that are utilized by three out of 38 selected cybersecurity frameworks. Our findings aim to assist in selecting open-source cybersecurity frameworks for future research and assessing any discrepancies between deep-learning techniques used in theory and practice.
--------------------------------------------------------------------------------------------------------
Proposes a new method for evaluating AI art tools through dialogue with artists and art experts. Focusing on non-western art worlds, it examines how AI tools interact with cultural contexts. This approach could lead to more culturally inclusive AI art tools and better understanding of AI's role in creative practices.
Authors: Rida Qadri, Piotr Mirowski, Aroussiak Gabriellan, Farbod Mehr, Huma Gupta, Pamela Karimi, Remi Denton
Link: https://arxiv.org/abs/2412.14077v1
Date: 2024-12-18
Summary:
This paper proposes dialogue as a method for evaluating generative AI tools for culturally-situated creative practice, one that recognizes the socially situated nature of art. Drawing on sociologist Howard Becker's concept of Art Worlds, this method expands the scope of traditional AI and creativity evaluations beyond benchmarks, user studies with crowd-workers, or focus groups conducted with artists. Our method involves two mutually informed dialogues: 1) 'dialogues with art worlds,' placing artists in conversation with experts such as art historians, curators, and archivists, and 2) 'dialogues with the machine,' facilitated through structured artist- and critic-led experimentation with state-of-the-art generative AI tools. We demonstrate the value of this method through a case study with artists and experts steeped in non-western art worlds, specifically the Persian Gulf. We trace how these dialogues help create culturally rich and situated forms of evaluation for the representational possibilities of generative AI that mimic the reception of generative artwork in the broader art ecosystem. Putting artists in conversation with commentators also allows artists to shift their use of the tools to respond to their cultural and creative context. Our study can provide generative AI researchers with an understanding of the complex dynamics of technology, human creativity, and the socio-politics of art worlds, helping them build more inclusive machines for diverse art worlds.
--------------------------------------------------------------------------------------------------------
Modality-Inconsistent Continual Learning of Multimodal Large Language Models
Addresses how AI systems can learn to handle different types of input (image, audio, video) without forgetting previous capabilities. Their solution helps preserve knowledge across modality shifts. This could enable more flexible AI systems that can continuously learn new skills while maintaining existing ones.
Authors: Weiguo Pian, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian
Link: https://arxiv.org/abs/2412.13050v1
Date: 2024-12-17
Summary:
In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our proposed MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.
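The distillation component can be illustrated with the standard response-level KL objective between a frozen previous-stage teacher and the current student; MoInCL's exact instruction-based formulation may differ, so treat this as a generic sketch of the idea.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard KL-based knowledge distillation: the student matches the frozen
    teacher's token distribution on prompts from previously learned modalities."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# e.g. next-token logits over a 32k vocabulary for 8 positions
loss = distillation_loss(torch.randn(8, 32000), torch.randn(8, 32000))
```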
--------------------------------------------------------------------------------------------------------
Conjecture: the set of prime numbers is supernatural
Introduces a new formal mathematical conjecture about prime numbers. While prime numbers' distribution has long been considered mysterious, this paper provides a formal framework to characterize this mystery through the concept of "supernatural" properties. The author presents this as a fundamental challenge for both human and artificial intelligence, suggesting it may represent a canonical problem in number theory.
Authors: Arnaud Mayeux
Link: https://arxiv.org/abs/2412.12041v1
Date: 2024-12-16
Summary:
Prime numbers are fascinating in the way they appear within the set of natural numbers. Despite several results that shed light on their distribution, the set of prime numbers is often informally described as mysterious. In the present paper, we introduce a formalism that allows us to state a formal conjecture: the set of prime numbers is supernatural. Our conjecture has no analog in the existing literature. We explain that this conjecture is expected to be a hard challenge for any kind of intelligence. However, it is quite natural and even seems very close to being canonical.
--------------------------------------------------------------------------------------------------------
EPOCHS XI: The Structure and Morphology of Galaxies in the Epoch of Reionization to z ~ 12.5
Analyzes galaxy structure and morphology in the early universe using JWST data. The study reveals insights about galaxy sizes and merger rates up to 12.5 billion years ago. This research advances our understanding of galaxy evolution and the early universe, potentially reshaping cosmological models.
Authors: Lewi Westcott, Christopher J. Conselice, Thomas Harvey, Duncan Austin, Nathan Adams, Fabricio Ferrari, Leonardo Ferreira, James Trussler, Qiong Li, Vadim Rusakov, Qiao Duan, Honor Harris, Caio Goolsby, Thomas J. Broadhurst, Dan Coe, Seth H. Cohen, Simon P. Driver, Jordan C. J. D'Silva, Brenda Frye, Norman A. Grogin, Nimish P. Hathi, Rolf A. Jansen, Anton M. Koekemoer, Madeline A. Marshall, Rafael Ortiz III, Nor Pirzkal, Aaron Robotham, Russell E. Ryan Jr., Jake Summers, Christopher N. A. Willmer, Rogier A. Windhorst, Haojing Yan
Link: https://arxiv.org/abs/2412.14970v1
Date: 2024-12-19
Summary:
We present a structural analysis of 521 galaxy candidates at 6.5 < z < 12.5, with $SNR > 10\sigma$ in the F444W filter, taken from the EPOCHS v1 sample, consisting of uniformly reduced deep JWST NIRCam data covering the CEERS, JADES GOODS-S, NGDEEP, SMACS0723, GLASS and PEARLS surveys. We use standard software to fit single Sérsic models to each galaxy in the rest-frame optical and extract their parametric structural parameters (Sérsic index, half-light radius and axis ratio), and Morfometryka to measure their non-parametric concentration and asymmetry parameters. We find a wide range of sizes for these early galaxies, but with a strong galaxy size-mass correlation up to $z \sim 12$, such that galaxy sizes continue to get progressively smaller in the high-redshift regime, following $R_{e} = 2.74 \pm 0.49 \left( 1 + z \right)^{-0.79 \pm 0.08}$ kpc. Using non-parametric methods, we find that galaxy merger fractions at these redshifts, classified through asymmetry parameters, remain consistent with those in the literature, maintaining a value of $f_{m} \sim 0.12 \pm 0.07$ with little dependence on redshift when combined with the literature at $z > 4$. We find that galaxies which are smaller in size also appear rounder, with an excess of high axis-ratio objects. Finally, we artificially redshift a subsample of our objects to test how robust the observed trends are, finding that they reflect real evolutionary effects rather than being a consequence of redshift effects.
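The reported size evolution can be evaluated directly from the best-fit relation quoted in the abstract (the sketch below uses the central values and omits the quoted uncertainties):

```python
import numpy as np

def half_light_radius_kpc(z, a=2.74, beta=-0.79):
    """Best-fit size evolution R_e = a * (1 + z)^beta kpc, central values only."""
    return a * (1.0 + np.asarray(z, dtype=float)) ** beta

print(half_light_radius_kpc([7.0, 9.0, 12.0]))   # typical half-light radii shrink toward high z
```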
--------------------------------------------------------------------------------------------------------
Develops a language-independent framework for conditional independence using approximation fixpoint theory. This theoretical work enables parallel processing of complex reasoning tasks, improving computational efficiency. The framework's applications span logic programming and knowledge representation systems, potentially enabling faster and more efficient AI reasoning systems across multiple domains.
Authors: Jesse Heyninck
Link: https://arxiv.org/abs/2412.13712v1
Date: 2024-12-18
Summary:
Conditional independence is a crucial concept supporting adequate modelling and efficient reasoning in probabilistic settings. In knowledge representation, the idea of conditional independence has also been introduced for specific formalisms, such as propositional logic and belief revision. In this paper, the notion of conditional independence is studied in the algebraic framework of approximation fixpoint theory. This gives a language-independent account of conditional independence that can be straightforwardly applied to any logic with fixpoint semantics. It is shown how this notion allows global reasoning to be reduced to parallel instances of local reasoning, leading to fixed-parameter tractability results. Furthermore, relations to existing notions of conditional independence are discussed and the framework is applied to normal logic programming.
--------------------------------------------------------------------------------------------------------