📚 Daily ArXiv Digest — 2026-02-23

Topics covered: #embodied #representation #rl | 5 selected recent high-value papers

#1 #embodied DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving

Authors: Chenxu Dang, Sining Ang, Yongkang Li, Haochen Tian, Jie Wang, Guang Li, Hangjun Ye, Jie Ma et al.

Date: 2026-02-16

Link: https://arxiv.org/pdf/2602.14577v1

Abstract in brief: Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient...

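Personal insight: If this work maintains its success rate on real robot tasks, its policy has strong potential for deployment. Worth prioritizing: data collection cost, hardware generalization, and failure replay analysis.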

#2 #representation Be Wary of Your Time Series Preprocessing

Authors: Sofiane Ennadir, Tianze Wang, Oleg Smirnov, Sahar Asadi, Lele Cao

Date: 2026-02-19

Link: https://arxiv.org/pdf/2602.17568v1

Abstract in brief: Normalization and scaling are fundamental preprocessing steps in time series modeling, yet their role in Transformer-based models remains underexplored from a theoretical perspective. In this work, we present the first formal analysis of how different normalization strategies, specifically instance-based and global scaling, impact the expressivity of Transformer-based architectures for time series representation learning. We propose a novel expressivity framework tailored to time series, which quantifies a model's ability to distinguish between similar and dissimilar inputs in the representation space. Using this framework, we derive theoretical bounds for two widely used normalization methods: Standard and Min-Max scaling. Our analysis reveals that the choice of normalization strategy can...

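Personal insight: The key question is whether the "learned representation" is truly transferable. If it improves linear probing, low-sample transfer, and OOD robustness at the same time, the representation is closer to a general-purpose capability.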
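To make the instance-based versus global distinction in the abstract above concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not the paper's expressivity framework: the function names and the two example series are assumptions made up for this digest. It shows one intuitive reason the choice can matter: per-series scaling can make two inputs that differ only in offset and scale indistinguishable, while pooled statistics keep them apart.

```python
from statistics import mean, pstdev


def standard_scale(values, mu, sigma):
    """Standard (z-score) scaling with the given statistics (sigma must be non-zero)."""
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, lo, hi):
    """Min-Max scaling into [0, 1] with the given range (hi must differ from lo)."""
    return [(v - lo) / (hi - lo) for v in values]


def instance_standard(series):
    """Instance-based scaling: each series uses its own mean and std."""
    return standard_scale(series, mean(series), pstdev(series))


def global_standard(dataset):
    """Global scaling: all series share statistics pooled over the dataset."""
    pooled = [v for series in dataset for v in series]
    mu, sigma = mean(pooled), pstdev(pooled)
    return [standard_scale(series, mu, sigma) for series in dataset]


# Two hypothetical series with the same shape but different offset and scale.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]

# Instance-based: both series map to (numerically) the same values.
inst_a, inst_b = instance_standard(a), instance_standard(b)

# Global: the difference in level and scale between a and b is preserved.
glob_a, glob_b = global_standard([a, b])

print(inst_a, inst_b)
print(glob_a, glob_b)
```

Whether collapsing such inputs helps or hurts depends on the downstream task; the paper's theoretical bounds, not this toy, are the place to look for the actual analysis.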

#3 #rl TempoNet: Slack-Quantized Transformer-Guided Reinforcement Scheduler for Adaptive Deadline-Centric Real-Time Dispatchs

Authors: Rong Fu, Yibo Meng, Guangzhen Yao, Jiaxuan Lu, Zeyu Zhang, Zhaolu Kang, Ziming Guo, Jia Yee Tan et al.

Date: 2026-02-20

Link: https://arxiv.org/pdf/2602.18109v1

Abstract in brief: Real-time schedulers must reason about tight deadlines under strict compute budgets. We present TempoNet, a reinforcement learning scheduler that pairs a permutation-invariant Transformer with a deep Q-approximation. An Urgency Tokenizer discretizes temporal slack into learnable embeddings, stabilizing value learning and capturing deadline proximity. A latency-aware sparse attention stack with blockwise top-k selection and locality-sensitive chunking enables global reasoning over unordered task sets with near-linear scaling and sub-millisecond inference. A multicore mapping layer converts contextualized Q-scores into processor assignments through masked-greedy selection or differentiable matching. Extensive evaluations on industrial mixed-criticality traces and large multiprocessor setting...

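Personal insight: Worth scrutinizing whether the gains hold across different environment families. The work is more valuable if it is effective beyond a single benchmark and also improves training stability and sample efficiency.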
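Two steps named in the abstract above, discretizing temporal slack and mapping Q-scores to processors by masked-greedy selection, can be sketched roughly as follows. This is one plausible reading, not TempoNet's implementation: the bin edges, the slack definition, the function names, and the Q-scores are all hypothetical, and in the paper the scores would come from the Transformer rather than being hard-coded.

```python
from bisect import bisect_right

# Hypothetical bin edges in milliseconds; the abstract does not give the real ones.
SLACK_EDGES_MS = [0.0, 1.0, 5.0, 20.0]


def slack_ms(now, deadline, remaining):
    """Assumed slack definition: time left before the deadline after the remaining work."""
    return deadline - now - remaining


def urgency_token(slack):
    """Map continuous slack to a discrete bin index (0 means negative slack)."""
    return bisect_right(SLACK_EDGES_MS, slack)


def masked_greedy_assign(q_scores, ready_mask, num_cores):
    """Give each core the highest-scoring task that is not masked out."""
    candidates = [i for i, ready in enumerate(ready_mask) if ready]
    ranked = sorted(candidates, key=lambda i: q_scores[i], reverse=True)
    return {core: task for core, task in enumerate(ranked[:num_cores])}


# Hypothetical task set at time 0: (deadline_ms, remaining_ms) per task.
tasks = [(30.0, 4.0), (6.0, 5.5), (3.0, 4.0), (12.0, 2.0)]
tokens = [urgency_token(slack_ms(0.0, d, r)) for d, r in tasks]

# Made-up Q-scores; task 2 is masked out (for example, not ready to run).
q_scores = [0.1, 0.9, 0.4, 0.7]
ready_mask = [True, True, False, True]
assignment = masked_greedy_assign(q_scores, ready_mask, num_cores=2)

print(tokens)      # bin index per task
print(assignment)  # core index -> task index
```

In a learned scheduler each bin index would select a row of an embedding table, which is what makes the quantized slack "learnable"; the greedy step here ignores the differentiable-matching alternative the abstract also mentions.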

#4 #embodied DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

Authors: En Yu, Haoran Lv, Jianjian Sun, Kangheng Lin, Ruitao Zhang, Yukang Shi, Yuyang Chen, Ze Chen et al.

Date: 2026-02-16

Link: https://arxiv.org/pdf/2602.14974v1

Abstract in brief: Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the onset. Our methodology follows a comprehensive three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining on the Vision-Language Model (VLM) using diverse corpora, seamlessly integrating web text, autonomous driving scenarios, and embodied interaction logs, to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action e...

#5 #embodied ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

Authors: Haibo Tong, Feifei Zhao, Linghao Feng, Ruoyu Wu, Ruolin Chen, Lu Jia, Zhou Zhao, Jindong Li et al.

Date: 2026-02-15

Link: https://arxiv.org/pdf/2602.14135v2

Abstract in brief: Rapidly evolving AI exhibits increasingly strong autonomy and goal-directed capabilities, accompanied by derivative systemic risks that are more unpredictable, difficult to control, and potentially irreversible. However, current AI safety evaluation systems suffer from critical limitations such as restricted risk dimensions and failed frontier risk detection. The lagging safety benchmarks and alignment technologies can hardly address the complex challenges posed by cutting-edge AI models. To bridge this gap, we propose the "ForesightSafety Bench" AI Safety Evaluation Framework, which begins with 7 major Fundamental Safety pillars and progressively extends to advanced Embodied AI Safety, AI4Science Safety, Social and Environmental AI risks, Catastrophic and Existential Risks, as well as 8 crit...