Podcast

Ashish's AI News Briefings.

Morning and evening AI news briefings, plus research paper deep dives.

RSS feed

Subscribe in Apple Podcasts, Spotify, or any app via RSS.

Episodes

Evening Jul 15

2026-07-14

The real AI race may no longer be at the frontier; Introducing Claude for Teachers; We Must Act Now: economists and AI leaders warn on AI's economic impact; Anthropic commits $10 million to Canadian AI research

Morning Jul 15

2026-07-14

GenPage: Towards End-to-End Generative Homepage Construction at Netflix; How Siemens “sliced the elephant,” modernizing legacy code with agentic workflows; EdgeBench: Measuring Real-World Environment Learning and Discovering a New Scaling Law; Apple Sues OpenAI, Accusing It of Stealing Company Secrets; Why AI Might Actually Help Solve the Next Labor Crisis

Morning Jul 9

2026-07-09

How Schneider Electric Built Their LLMOps Foundations At Enterprise Scale With LangSmith; SWE-1.7: Frontier Intelligence at a Fraction of the Cost; Lakeflow: A new era of agentic data engineering; Introducing Robostral Navigate; Claude Cowork is coming to mobile and web

Research Jul 8

Vision Pretraining for Dense Spatial Perception

Dense spatial perception is essential for physical intelligence, where visual systems are expected to recover structured, metric, and actionable representations from pixel observations. Modern visual foundation models tend to prioritize semantic invariance, often at the expense of detailed spatial understanding. In this work, we study vision pretraining through a boundary-centric lens, motivated by the premise that boundaries and shape discontinuities offer essential cues for perceiving geometric properties. Concretely, we propose masked boundary modeling, a self-supervised paradigm that dynamically learns sub-pixel boundary representations and subsequently leverages the discovered boundary-bearing tokens as masked targets to facilitate dense visual token learning. By scaling this framework, we develop LingBot-Vision and demonstrate its efficacy across a diverse set of downstream vision tasks with DINOv3 as a strong baseline. Remarkably, LingBot-Vision drives the progression from LingBot-Depth 1.0 to LingBot-Depth 2.0 for depth completion, and thereby yields enhanced depth estimation, a key pillar for embodied artificial intelligence. Our findings reveal that boundary modeling goes beyond simple line segments and instead serves as a scalable pretraining principle for learning spatially structured visual representations.

Morning Jul 8

2026-07-08

Harness Engineering for Self-Improvement; A global workspace in language models; Reducing Doom Loops with Final Token Preference Optimization; How tech workers are feeling in 2026: a workforce splitting in two; Challenging the Chatbot

Evening Jun 20

2026-06-20

New global order: AI CEOs as heads of nation-states; Google shake-up highlights how human brains may be the scarcest AI resource of all; From PGP to Mythos: a brief history of export controls that didn't stop anyone; OpenAI admits enterprises need better control over AI costs; Nobel laureate John Jumper is leaving DeepMind for rival Anthropic

Research Jun 8

dots.tts Technical Report

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.

Evening Jun 8

2026-06-08

Nvidia clinches deals with South Korean giants including SK Group to advance AI boom; New plans to stop children taking, sharing or viewing nude images; Uber, Wayve, and Waymo are headed toward a robotaxi showdown in London; Apple debuts software updates amid Siri overhaul; OpenAI is still working on that super app

Research Jun 5

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions from the previous model, applies targeted enhancement to these regions, and improves the reliability of supervision signals. It further adopts a progressive post-training recipe based on curated data selection and reinforcement learning, pushing model performance to a higher level through staged optimization. PaddleOCR-VL-1.6 achieves a new state-of-the-art score of 96.33% on OmniDocBench v1.6, demonstrates strong competitiveness against top-tier VLMs, and provides a practical post-training recipe for the PaddleOCR-VL series.

Evening Jun 5

2026-06-05

Anthropic says Claude now writes most of its production code and warns about recursive self-improvement; Federal Register publishes U.S. executive order on AI innovation, frontier-model cyber review, and AI cybersecurity clearinghouse; OpenAI rolls out “Dreaming” memory architecture for scalable, reviewable ChatGPT personalization; Enterprises start governing AI spend as token costs collide with infrastructure buildout

Morning Jun 5

2026-06-05

Designing Efficient Verifiers for Legal Agents; NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents; Introducing Gemma 4 12B: a unified, encoder-free multimodal model; Ideogram 4.0

Research Jun 4

GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.

Evening Jun 4

2026-06-04

OpenAI updates GPT‑Rosalind for life-sciences research workflows; Microsoft Foundry Build 2026 ships production-agent platform primitives; House draft bill would preempt state AI model-development laws for three years; Canada unveils national “AI for all” strategy with sovereign compute and adoption targets

Morning Jun 4

2026-06-04

Rewiring software delivery for the agentic era; Announcing Microsoft Web IQ; Introducing Microsoft Scout: Your always-on personal agent; Trump Signs AI Executive Order to Increase Government Oversight; Composing a new platform for agent-first devices

Research Jun 3

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.

Evening Jun 3

2026-06-03

Alphabet raises AI infrastructure stakes with ~$85B financing plan and $180B-$190B 2026 capex; DeepSeek reportedly nears $7.4B raise for research-first open-source AI; Anthropic formalizes Claude enterprise delivery tiers with Partner Hub and production-deployment metrics; GitHub ships VS Code Stable preview for agent-first development workflows, remote agents, air-gapped BYOK, and terminal risk controls

Morning Jun 3

2026-06-03

How Rippling Went AI-Native Across Every Product in 6 Months with Deep Agents and LangSmith; Rethinking Search as Code Generation; Building a hill-climbing machine: Launching seven new MAI models; Codex for every role, tool, and workflow; The Next Era of Knowledge Work

Research Jun 2

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.

Evening Jun 2

2026-06-02

Microsoft Build 2026: Microsoft consolidates agent context, models, governance, Windows sandboxes, and developer workflows; Promoting Advanced Artificial Intelligence Innovation and Security; Enterprise Software Leaders Build AI Agents With NVIDIA; Expanding Project Glasswing; Workday Launches Agent Passport to Test, Verify, and Continuously Monitor Every AI Agent in the Enterprise

Morning Jun 2

2026-06-02

Anthropic’s browser agent got hijacked 31.5% of the time before safeguards engaged; Why we Built our own Cloud Agent Infrastructure; Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3; Anthropic Files to Go Public, Setting Stage for Huge I.P.O.; MiniMax M3 - Coding & Agentic Frontier, 1M Context, Multimodal

Research Jun 1

VLM3: Vision Language Models Are Native 3D Learners

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

Evening Jun 1

2026-06-01

Anthropic confidentially submits draft S-1 to the SEC; NVIDIA at COMPUTEX 2026: RTX Spark + DLSS 4.5 updates; Deepening OpenAI collaboration with U.S. Department of Energy; GitHub Copilot usage-based billing now live; Organize My Files in Drive now generally available

Morning Jun 1

2026-06-01

Mistral AI launches Vibe, expands into industrial AI and announces data center push to challenge OpenAI; From data overload to actionable insights: How Verizon Connect scaled agentic AI to 100,000 users; Solving Long-Context Evals for Production Agents; AI’s Impact on SaaS Will Be Uneven. Here’s What Leaders Need to Know; AI Is Already Rewiring the Aftermarket and Services

Research May 31

LongCat-Video Technical Report

Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.

Evening May 31

2026-05-31

Model Release Notes | OpenAI Help Center; Governor Newsom signs first-of-its-kind executive order to prepare workers and businesses for potential AI disruption; SoftBank says it will invest up to €75 billion to build French data centers

Research May 30

stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation

World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication-specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable-worldmodel (SWM), a modular, tested, and documented world-model research ecosystem that provides efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero-shot robustness in DINO-WM.

Evening May 30

2026-05-30

NVIDIA Unveils New Open Models, Data and Tools to Advance AI Across Every Industry; Strengthening societal resilience with Rosalind Biodefense; Runway started by helping filmmakers — now it wants to beat Google at AI

Research May 29

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models.

Evening May 29

2026-05-29

Building self-improving tax agents with Codex; Building a safe, effective sandbox to enable Codex on Windows; News — Google DeepMind (May 2026 updates index)

Research May 28

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

Morning May 28

2026-05-28

Building self-improving tax agents with Codex; Is a compute crunch coming?; Introducing 1-bit and Ternary Bonsai Image 4B: Image Generation for Local Devices; CFOs Funded the AI Revolution. Now They’re Joining It.; Choosing to Stay Human

Morning May 27

2026-05-27

A terminal is all you need for web agents; How Glance turns hours of video into mobile-ready clips with AI; Introducing Grok Build; Some ideas for what comes next, May 2026

Research May 26

TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.

Evening May 26

2026-05-26

China's DeepSeek to make permanent 75% price cut on flagship V4‑Pro AI model; Trending Papers - Hugging Face (SkillOpt spotlight)

Morning May 26

2026-05-26

SkillOpt: Executive Strategy for Self-Evolving Agent Skills; AdventHealth advances whole-person care with OpenAI; The Best Manufacturers Build AI with Workers, Not for Them; How I Choose Which Cloudflare Employees to Replace With AI; How AI is forcing McKinsey and its peers to rethink pricing

Research May 25

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
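The acceptance rule described in this abstract (keep an edit only when it strictly improves a held-out validation score) can be illustrated with a small Python sketch. This is not the SkillOpt implementation: the skill is modeled as a list of lines, and `apply_edit`, `train_skill`, and `toy_validate` are hypothetical names invented for illustration. The learning-rate budget and the slow/meta update mentioned in the abstract are omitted.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical edit format: (op, target_line, new_text), op in {"add", "delete", "replace"}.
Edit = Tuple[str, str, str]


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill document with one bounded edit applied."""
    op, target, text = edit
    if op == "add":
        return skill + [text]
    if op == "delete":
        return [line for line in skill if line != target]
    if op == "replace":
        return [text if line == target else line for line in skill]
    raise ValueError(f"unknown op: {op}")


def train_skill(
    skill: List[str],
    proposals: Iterable[Edit],
    validate: Callable[[List[str]], float],
) -> Tuple[List[str], float, List[Edit]]:
    """Accept an edit only if it strictly improves the validation score."""
    best = validate(skill)
    rejected: List[Edit] = []  # stand-in for a rejected-edit buffer
    for edit in proposals:
        candidate = apply_edit(skill, edit)
        score = validate(candidate)
        if score > best:  # strict improvement; ties are rejected
            skill, best = candidate, score
        else:
            rejected.append(edit)
    return skill, best, rejected


def toy_validate(skill: List[str]) -> float:
    """Toy held-out score: reward two useful lines, penalize a harmful one."""
    wanted = {"check units", "show work"}
    return len(wanted & set(skill)) - 0.5 * ("guess" in skill)


if __name__ == "__main__":
    proposals = [
        ("add", "", "check units"),
        ("add", "", "be brief"),
        ("replace", "guess", "show work"),
    ]
    print(train_skill(["guess"], proposals, toy_validate))
```

In this toy run the neutral edit ("be brief") is rejected because it only ties the current score, which is the behavior a strict-improvement gate is meant to enforce.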

Evening May 25

2026-05-25

100 things we announced at I/O 2026; All the news from the Google I/O 2026 Developer keynote; From AI pilots to enterprise impact: Why execution is the new differentiator; Anthropic in talks to use Microsoft's AI chips, The Information reports

Morning May 25

2026-05-25

State of AI: May 2026; Think 2026: IBM Delivers the Blueprint for the AI Operating Model as the AI Divide Widens; AI Updates Today (May 2026) – Latest AI Model Releases

Research May 24

L2P: Unlocking Latent Potential for Pixel Generation

Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.

Evening May 24

2026-05-24

A new era for AI Search; A new personal finance experience in ChatGPT

Research May 23

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
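To put the benchmark pairs quoted above on the same footing as a relative reduction, the arithmetic can be worked out in a few lines of Python. This sketch assumes the quoted percentages are word error rates and that the lower figure in each pair is Mega-ASR's; the abstract does not state this explicitly. `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers an error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - system) / baseline


# (prior best, Mega-ASR) pairs as quoted in the abstract, assumed to be WER in percent.
pairs = {
    "VOiCES R4-B-F": (54.01, 45.69),
    "NOIZEUS Sta-0": (29.34, 21.49),
}

if __name__ == "__main__":
    for name, (prior, mega) in pairs.items():
        print(f"{name}: {relative_reduction(prior, mega):.1%} relative reduction")
```

Under that assumption the two pairs work out to roughly 15.4% and 26.8% relative reduction. The "over 30%" figure in the abstract refers to the separate compositional scenarios, so it is not contradicted by these numbers.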

Evening May 23

2026-05-23

CAISI Evaluation of DeepSeek V4 Pro; The Art of Building Verifiers for Computer Use Agents; State of AI: May 2026

Research May 22

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

Evening May 22

2026-05-22

Gemini 3.5: frontier intelligence with action; OpenAI named a Leader in enterprise coding agents by Gartner; An OpenAI model has disproved a central conjecture in discrete geometry; Center for AI Standards and Innovation (CAISI) frontier AI testing posture

MorningMay 22

2026-05-22

Stable Audio 3.0, the model family built for artistic experimentation with open-weight models; How Ramp engineers accelerate code review with Codex; Presien reduces critical safety events on construction sites by 70%+ with Claude; Qwen3.7: The Agent Frontier; Your AI Change Is Actually a People Change

ResearchMay 21

LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.

EveningMay 21

2026-05-21

An OpenAI model has disproved a central conjecture in discrete geometry; Co-Scientist: A multi-agent AI partner to accelerate research; Accelerating scientific discovery with Co-Scientist (Nature); Vera Arrives: NVIDIA’s First CPU Built for Agents Lands at Top AI Labs

MorningMay 20

2026-05-20

General Agent: A Self-Evolving, Synthetic Agent Environment

ResearchMay 19

Lance: Unified Multimodal Modeling by Multi-Task Synergy

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at this https URL.

EveningMay 19

2026-05-19

I/O 2026: Welcome to the agentic Gemini era; Advancing content provenance for a safer, more transparent AI ecosystem; Anthropic acquires Stainless; The 13 biggest announcements at Google I/O 2026; AI Act | Shaping Europe’s digital future

MorningMay 19

2026-05-19

Project Glasswing: what Mythos showed us; Introducing Composer 2.5; Starchild-1: The First Real-Time Multimodal World Model; How Claude Code works in large codebases: Best practices and where to start; A new personal finance experience in ChatGPT

ResearchMay 18

MMSkills: Towards Multimodal Skills for General Visual Agents

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.

EveningMay 18

2026-05-18

NASA’s new AI space chip could let spacecraft think for themselves; We’re launching the Google DeepMind Accelerator program in Asia Pacific to tackle environmental risks; Google I/O 2026: How to Watch the Keynote and What to Expect; New Models Today — AI & LLM Releases Last 24 Hours

MorningMay 18

2026-05-18

Work with Codex from anywhere

ResearchMay 17

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves an overall score around 2% higher than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.

EveningMay 17

2026-05-17

OpenAI launches the OpenAI Deployment Company to help businesses build around intelligence; GPT-5.5 Instant: smarter, clearer, and more personalized; Introducing Claude Design by Anthropic Labs; Gemini Robotics 1.5 brings AI agents into the physical world

ResearchMay 16

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.

EveningMay 16

2026-05-16

Introducing GPT-5; Anthropic forms $200 million partnership with the Gates Foundation; A smarter, more proactive Android with Gemini Intelligence; OpenAI to give EU access to new cyber model but Anthropic still holding out on Mythos

ResearchMay 15

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.

EveningMay 15

2026-05-15

Grok Model Retirement on May 15, 2026; Defense at AI speed: Microsoft’s new multi-model agentic security system tops leading industry benchmark

ResearchMay 14

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

EveningMay 14

2026-05-14

Exclusive: US clears H200 chip sales to 10 China firms as Nvidia CEO looks for breakthrough; OpenAI explores legal options against Apple, source says; Changelog – Codex | OpenAI Developers; Anthropic and Gates Foundation launch $200 million partnership for AI in health, education; U.S. clears H200 chip sales to 10 China firms as Nvidia CEO looks for breakthrough: Reuters exclusive (BNN Bloomberg syndication)

MorningMay 14

2026-05-14

Protect your enterprise now from the Shai-Hulud worm and npm vulnerability in 6 actionable steps; The end of the trade-off: How AI agents broke the onboarding trilemma; The Math Behind the Cost of AI Agents; Is Software Losing Its Head?; Soon, access to frontier AI will be scarce and selective

MorningMay 13

2026-05-13

Interaction Models: A Scalable Approach to Human-AI Collaboration; How Miro uses Amazon Bedrock to boost software bug routing accuracy and improve time-to-resolution from days to hours; The Inference Shift; The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents; Introducing Perceptron Mk1

ResearchMay 12

Pixal3D: Pixel-Aligned 3D Generation from Images

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: this https URL

EveningMay 12

2026-05-12

Thomson Reuters + Anthropic: MCP integration connects Claude with CoCounsel Legal; Google DeepMind publishes expanded AlphaEvolve impact metrics; Reuters: Isomorphic Labs raises $2.1B for AI drug discovery scale-up; Reuters: Anthropic expands Claude legal tooling for law firms

MorningMay 12

2026-05-12

EMO: Pretraining mixture of experts for emergent modularity; Uber uses OpenAI to help people earn smarter and book faster; Halliburton enhances seismic workflow creation with Amazon Bedrock and Generative AI; What Are Your Company’s AI Nightmares?; What Corporate Functions of the Future Won’t Look Like Functions at All

MorningMay 11

2026-05-11

Teaching Claude why; ZAYA1-74B-Preview: Scaling Pretraining on AMD; Redesigning Your Marketing Organization for the Agentic Age; Cracking the Code of Campaign Success with Google’s AlphaEvolve Agent; Learning on the Shop floor

ResearchMay 8

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

This report describes ARIS (Auto-Research-in-sleep), an open-source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information to store, retrieve, and present to the model. For long-horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long-running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor's framing. Therefore, we present ARIS as a research harness that coordinates machine-learning research workflows through cross-model adversarial collaboration as a default configuration: an executor model drives forward progress while a reviewer from a different model family is recommended to critique intermediate artifacts and request revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end-to-end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence: integrity verification, result-to-claim mapping, and claim auditing that cross-checks manuscript statements against the claim ledger and raw evidence, as well as a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of the rendered PDF. A prototype self-improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.

EveningMay 8

2026-05-08

Running Codex safely at OpenAI; Advancing voice intelligence with new models in the API; Advancing AI evaluation with CAISI (US) and AISI (UK); Google, Microsoft and xAI agree to US government AI testing programme; OpenAI introduces GPT-5.5-Cyber (limited preview)

MorningMay 8

2026-05-08

Natural Language Autoencoders: Turning Claude’s thoughts into text; Agents that transact: Introducing Amazon Bedrock AgentCore payments, built with Coinbase and Stripe; Parloa builds service agents customers want to talk to; Notes from inside China’s AI labs; The New Rules of Customer Experience in the Age of AI

ResearchMay 7

DFlash: Block Diffusion for Flash Speculative Decoding

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.

EveningMay 7

2026-05-07

Advancing voice intelligence with new models in the API; Anthropic strikes SpaceX data center deal as it plows ahead on AI coding; What Google Cloud announced in AI this month – and how it helps you

MorningMay 7

2026-05-07

Higher usage limits for Claude and a compute deal with SpaceX

ResearchMay 6

PersonaLive! Expressive Portrait Image Animation for Live Streaming

Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.

EveningMay 6

2026-05-06

OpenAI expands ChatGPT ads with self-serve buying, CPC, and conversion measurement; Reuters: Anthropic reportedly commits $200B to Google Cloud/chips over five years; Microsoft and OpenAI amend partnership terms (cloud, IP license, revenue-share mechanics); Google AI April roundup highlights enterprise agent platform, TPUs, Gemma 4, Deep Research Max; White House releases U.S. national AI legislative framework

MorningMay 6

2026-05-06

Five contrarian ideas about GenAI in the workplace

ResearchMay 5

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

EveningMay 5

2026-05-05

Google DeepMind UK workers vote to unionize amid military-AI concerns

MorningMay 5

2026-05-05

Anthropic and OpenAI are both launching joint ventures for enterprise AI services

MorningMay 4

2026-05-04

EveningMay 4

2026-05-03

Remote agents in Vibe. Powered by Mistral Medium 3.5; OpenAI models, Codex, and Managed Agents come to AWS; Amazon Bedrock now offers OpenAI models, Codex, and Managed Agents (Limited Preview); Gemini 3 — Google DeepMind

EveningMay 4

2026-05-02

Exclusive: US officials weigh cutting deadlines to fix digital flaws amid worries over AI-powered hacking, sources say; Anthropic Economic Index report: Economic primitives; Bringing AI to the next generation of fusion energy

ResearchMay 1

Geometric Context Transformer for Streaming 3D Reconstruction

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around 20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.

EveningMay 1

2026-05-01

Introducing Advanced Account Security; Pentagon tech chief says Anthropic is still blacklisted, but Mythos is a separate issue; Huawei expects AI chip revenue to jump at least 60% this year, FT reports; Beacon Biosignals is mapping the brain during sleep

MorningMay 1

2026-05-01

Enabling a new model for healthcare with AI co-clinician; Build programmatic agents with the Cursor SDK; Writer launches AI agents that can act without prompts, taking on Amazon, Microsoft and Salesforce; Sun Finance automates ID extraction and fraud detection with generative AI on AWS; From scan to fix, done seamlessly

ResearchApr 30

Kronos: A Foundation Model for the Language of Financial Markets

The success of the large-scale pre-training paradigm, exemplified by Large Language Models (LLMs), has inspired the development of Time Series Foundation Models (TSFMs). However, their application to financial candlestick (K-line) data remains limited, often underperforming non-pre-trained architectures. Moreover, existing TSFMs often overlook crucial downstream tasks such as volatility prediction and synthetic data generation. To address these limitations, we propose Kronos, a unified, scalable pre-training framework tailored to financial K-line modeling. Kronos introduces a specialized tokenizer that discretizes continuous market information into token sequences, preserving both price dynamics and trade activity patterns. We pre-train Kronos using an autoregressive objective on a massive, multi-market corpus of over 12 billion K-line records from 45 global exchanges, enabling it to learn nuanced temporal and cross-asset representations. Kronos excels in a zero-shot setting across a diverse set of financial tasks. On benchmark datasets, Kronos boosts price series forecasting RankIC by 93% over the leading TSFM and 87% over the best non-pre-trained baseline. It also achieves a 9% lower MAE in volatility forecasting and a 22% improvement in generative fidelity for synthetic K-line sequences. These results establish Kronos as a robust, versatile foundation model for end-to-end financial time series analysis. Our pre-trained model is publicly available at this https URL.

EveningApr 30

2026-04-30

OpenAI: “Building the compute infrastructure for the Intelligence Age”; Google/Alphabet Q1 2026: AI full-stack monetization acceleration; Anthropic Research: BioMysteryBench for agentic bioinformatics

MorningApr 30

2026-04-30

Remote agents in Vibe. Powered by Mistral Medium 3.5; Introducing NVIDIA Nemotron 3 Nano Omni; Generative AI in healthcare: Adoption matures as agentic AI emerges; How Popsa used Amazon Nova to inspire customers with personalised title suggestions; Shifting from AI-assisted coding to AI-assisted delivery with IBM Bob

EveningApr 29

2026-04-29

Microsoft + OpenAI rewrite economics/governance of the flagship AI alliance; OpenAI GPT-5.5: higher autonomy at similar latency envelope; Google DeepMind Deep Research Max productizes high-compute research agents; Google grants broader Pentagon classified-network AI access; Anthropic Project Glasswing: frontier cyber capability redirected toward defense

ResearchApr 30

VibeVoice Technical Report

This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.

EveningApr 28

2026-04-28

Google signs classified AI deal with Pentagon; Microsoft-OpenAI partnership terms reset; Anthropic releases Claude Opus 4.7 GA; NVIDIA details enterprise Codex/GPT-5.5 deployment pattern

MorningApr 28

2026-04-28

How Delivery Hero's agent merges 100+ pull requests a day with Claude; How SAP Concur automates expense reporting with agentic AI; Learning to Orchestrate Agents in Natural Language with the Conductor; Lowe’s Enhances Customer Experience With Gen AI and Digital Twins; It’s the Age of Electricity and America Isn’t Ready

EveningApr 27

2026-04-27

DeepMind + Republic of Korea: national AI science partnership; AWS: Bedrock AgentCore adds managed harness + CLI + coding-agent skills; Reuters: DeepSeek-V4 marks normalization of low-cost challenger dynamics

MorningApr 27

2026-04-27

Orchestrating AI Code Review at scale; Project Deal; Context decay, orchestration drift, and silent AI failures; Tech Services Buyer Survey: Betting Big on AI and Resilience; The End of One-Size-Fits-All Enterprise Software

EveningApr 26

2026-04-26

Reuters: Google to invest up to $40B in Anthropic; OpenAI: Codex for (almost) everything; Anthropic + NEC partnership

Guest appearances

On other shows.

YouTube

Ashish's AI News Briefings.

Two Months with OpenClaw: Real-World Lessons for Enterprises

Becoming an AI Builder: Claude Code & OpenClaw Explained

Moving Beyond Chatbots and Automation

Perspectives on the Agentic Web

AWAAI Learn & Lead with Ashish Bhatia

Laid Off: Beating the H1B Countdown (#024)

Democratizing AI: Ashish Bhatia's Journey from Microsoft to Power Automate and the Evolution of AI Builder

2024 AI Predictions — Microsoft Principal Product Manager

AI with AI Builder

Using GPT in AI Builder — Microsoft Official

Azure OpenAI in AI Builder

AI Builder + OpenAI Demo

Unleash the Power of Azure OpenAI Service with AI Builder

AI Builder | A World of AI at Your Fingertips