Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs
Authors: Hanlin Cai, Kai Li, Houtianfu Wang, Haofan Dong, Yichen Li, Falko Dressler, Ozgur B. Akan
First: 2026-05-08T16:24:54+00:00 · Latest: 2026-05-08T16:24:54+00:00
Abstract
Federated fine-tuning (FFT) has emerged as a privacy-preserving paradigm for collaboratively adapting large language models (LLMs). Built upon federated learning, FFT enables distributed agents to jointly refine a shared pretrained LLM by aggregating local LLM updates without sharing local raw data. However, FFT-based LLMs remain vulnerable to model manipulation threats, in which adversarial participants upload manipulated LLM updates that corrupt the aggregation process and degrade the performance of the global LLM. In this paper, we propose an Augmented Model maniPulation (AugMP) strategy against FFT-based LLMs. Specifically, we design a novel graph representation learning framework that captures feature correlations among benign LLM updates to guide the generation of malicious updates. To enhance manipulation effectiveness and stealthiness, we develop an iterative manipulation algorithm based on an augmented Lagrangian dual formulation. Through this formulation, malicious updates are optimized to embed adversarial objectives while preserving benign-like parameter characteristics. Experimental results across multiple LLM backbones demonstrate that the AugMP strategy achieves the strongest manipulation performance among all competing baselines, reducing the global LLM accuracy by up to 26% and degrading the average accuracy of local LLM agents by up to 22%. Meanwhile, AugMP maintains high statistical and geometric consistency with benign updates, enabling it to evade conventional distance- and similarity-based defense methods.
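For readers unfamiliar with the term, the generic augmented Lagrangian method that the abstract refers to can be written as below. This is the textbook form only, not the paper's actual formulation: the symbols $f$ (an objective), $h$ (an equality constraint), $\lambda$ (the dual variable), and $\rho$ (the penalty weight) are placeholders introduced here for illustration.

```latex
% Textbook augmented Lagrangian for  min_x f(x)  subject to  h(x) = 0.
% All symbols are illustrative placeholders, not taken from the paper.
\begin{align}
  \mathcal{L}_{\rho}(x, \lambda) &= f(x) + \lambda^{\top} h(x) + \frac{\rho}{2}\,\lVert h(x) \rVert_2^2, \\
  x^{k+1} &= \arg\min_{x}\; \mathcal{L}_{\rho}(x, \lambda^{k}), \\
  \lambda^{k+1} &= \lambda^{k} + \rho\, h(x^{k+1}).
\end{align}
```

The primal step and the dual step alternate until the constraint residual $h(x^{k})$ is sufficiently small, which is what makes the procedure iterative.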
Deadline-Driven Hierarchical Agentic Resource Sharing for AI Services and RAN Functions in AI-RAN
Authors: Haiyuan Li, Yulei Wu, Dimitra Simeonidou
First: 2026-05-08T10:22:12+00:00 · Latest: 2026-05-08T10:22:12+00:00
Abstract
AI-RAN consolidates AI services and Radio Access Network (RAN) functions onto a unified, GPU-accelerated infrastructure at the network edge. However, compute sharing between real-time RAN functions and highly heterogeneous AI services requires coordination of scheduling decisions at mismatched timescales, and placement adaptation may require service migration across nodes with non-negligible interruptions. This paper proposes a hierarchical agentic framework (HAF) for compute sharing in AI-RAN that combines a large language model (LLM)-based agent for slow-timescale placement of AI services and RAN functions with a closed-form, deadline-aware convex algorithm for fast-timescale GPU/CPU allocation. The LLM agent is further equipped with a predictive critic that filters out migrations when the induced service interruption outweighs the expected service-level objective (SLO) benefit. Experimental results show that HAF reaches 90.0% overall SLO fulfillment, a 20.5% improvement over the strongest baseline, and raises AI service request fulfillment from 51% to 85.3%. Further evaluations show that HAF retains its advantage under diverse load conditions, while the critic consistently improves SLO fulfillment across multiple open-source LLM agents.
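The critic's filtering rule described above can be sketched as follows. This is a minimal illustration, assuming the critic yields two scalar predictions per proposed migration; the names `MigrationProposal`, `expected_slo_gain`, `interruption_cost`, and `margin` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class MigrationProposal:
    service: str
    expected_slo_gain: float   # hypothetical: predicted SLO benefit after migrating
    interruption_cost: float   # hypothetical: predicted SLO loss during the migration


def critic_filter(proposals, margin=0.0):
    """Keep only migrations whose predicted gain exceeds the interruption cost.

    `margin` is an assumed safety threshold; 0.0 means "any net benefit".
    """
    return [
        p for p in proposals
        if p.expected_slo_gain - p.interruption_cost > margin
    ]


proposals = [
    MigrationProposal("llm-inference", expected_slo_gain=0.12, interruption_cost=0.03),
    MigrationProposal("video-analytics", expected_slo_gain=0.02, interruption_cost=0.05),
]
accepted = critic_filter(proposals)
print([p.service for p in accepted])
```

Only the first proposal survives, because the second one would cost more in interruption than it is predicted to return.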
From Map-and-Encap to BIER: Observations on Network Routing Scalability
Authors: Tianyuan Yu, Lan Wang, Beichuan Zhang, Lixia Zhang
First: 2026-05-08T00:41:20+00:00 · Latest: 2026-05-08T00:41:20+00:00
Abstract
The TCP/IP protocol stack uses IP addresses for two distinct roles: identifying hosts and locating their attachment points in the network topology. This dual purpose creates a fundamental tension that has led to routing and forwarding scalability challenges throughout the history of the Internet in unicast packet delivery and, more notably, in multicast delivery. This paper reviews the evolution of routing scalability solutions over the years and makes four observations. First, map-and-encap is a recurring architectural solution shared by all scalable unicast and multicast delivery methods, developed independently across different problem contexts. Second, a new solution tends to succeed when it can bring immediate local gains to early adopters without requiring coordination across administrative domains. Third, network routing and forwarding designs that depend on external factors, such as the number of distinct end sites or even application-specific deliveries, inherently preclude an upper bound on their scalability. Fourth, today's inter-domain routing protocol, BGP, lacks a topological abstraction equivalent to an egress router within a routing domain, thereby inherently preventing a map-and-encap solution for scalability. These observations offer insights into the design of future scalable routing system architectures.
CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure
Authors: Eric Ding, Byungsoo Oh, Bhaskar Kataria, Kaiwen Guo, Jelena Gvero, Abhishek Vijaya Kumar, Arjun Devraj, Lindsey Bowen, Atharv Sonwane, Emaad Manzoor, Rachee Singh
First: 2026-05-07T16:40:39+00:00 · Latest: 2026-05-07T16:40:39+00:00
Abstract
Evaluative claims about LLM infrastructure -- "workload X is fastest on hardware Y with software Z" -- depend on a complex configuration space spanning hardware accelerators, interconnect bandwidth, software frameworks, parallelism plans, and communication libraries. Current infrastructure evaluation benchmarks publish a small set of end-to-end numbers that do not explain why one configuration outperforms another. We present CCL-Bench, a trace-based benchmark that addresses the limitations of existing benchmarks by recording reusable evidence for every ML workload. Each contributed data point in CCL-Bench packages an execution trace, a YAML workload card, and the launch scripts. We have developed a community-extensible toolkit to compute fine-grained compute, memory, and communication efficiency metrics from this evidence. Using CCL-Bench, we surface three claims that summary-statistic benchmarks cannot support: (i) higher compute-communication overlap can coincide with longer training step time and reveal inefficient parallelization choices, (ii) doubling TPU interconnect bandwidth yields a much higher end-to-end improvement in step time than doubling GPU interconnect bandwidth on small and medium workloads, and (iii) the best-tuned configuration on one training framework can run up to 3$\times$ slower than the best-tuned configuration on a peer framework on identical hardware.
Binary Image-Based Intrusion Detection for Operational Technology Networks: Extending the SPHBI Methodology from IoT to Modbus TCP
Authors: Aamir Omar
First: 2026-05-05T19:34:45+00:00 · Latest: 2026-05-07T09:46:06+00:00
Comments: 14 pages, 5 figures, 5 tables. Preprint
Abstract
This paper extends the Single Packet Header Binary Image (SPHBI) intrusion detection methodology from IoT to Modbus TCP, evaluating five approaches spanning a gradient of protocol depth on the CIC Modbus 2023 dataset (11.4 million packets, eight detectable attack types). TCP/IP headers alone achieve only 51.8% binary accuracy, confirming that header-level heterogeneity exploited in IoT traffic is absent in uniform SCADA environments. Adding eight bytes of application-layer information improves binary accuracy to 98.1% with just 63 parameters, directly relevant to per-packet classification on resource-constrained OT edge devices. The best-performing approach achieves 94.4% +/- 2.2pp multiclass accuracy across nine classes (95% CI [92.9%, 95.9%], 10 seeds) with 56,873 parameters, roughly 430 times fewer than comparable ResNet50-based approaches. Per-class recall analysis shows seven of eight detectable attack types identified with recall above 94%, while replay attacks remain structurally undetectable by any single-packet method.
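The core idea of turning a packet header into a binary image can be sketched as below. This is an illustrative assumption about the encoding (one pixel per bit, most significant bit first, rows of `width` pixels, zero padding); the abstract does not specify the exact SPHBI layout, and `header_to_binary_image` is a hypothetical helper name.

```python
def header_to_binary_image(header: bytes, width: int = 8) -> list[list[int]]:
    """Unpack header bytes into rows of 0/1 pixels (assumed layout, MSB first)."""
    bits = [(byte >> shift) & 1 for byte in header for shift in range(7, -1, -1)]
    # Zero-pad so the final row is complete (padding scheme is an assumption).
    bits.extend([0] * ((-len(bits)) % width))
    return [bits[i:i + width] for i in range(0, len(bits), width)]


# Two example bytes standing in for part of a header.
image = header_to_binary_image(b"\x80\x01")
for row in image:
    print(row)
```

A small classifier can then consume the resulting 0/1 grid directly, which is consistent with the very low parameter counts the abstract reports.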
Optimizing Split Learning Latency in TinyML-Based IoT Systems
Authors: Zied Jenhani, Mounir Bensalem, Jasenka Dizdarević, Admela Jukan
First: 2025-07-22T13:50:12+00:00 · Latest: 2026-05-06T12:54:05+00:00
Comments: This paper is uploaded here for research community, thus it is for non-commercial purposes
Abstract
Split learning (SL) addresses the limitation of running deep learning inference directly on low-power edge/IoT nodes by executing part of the inference process on the sensor and offloading the remainder to a companion device. Despite its promise, the inference latency of SL on constrained hardware under realistic low-power wireless protocols remains unexplored. This paper presents the first experimental latency benchmark of TinyML-based SL on ESP32-S3 boards, comparing four wireless communication protocol solutions (UDP, TCP, ESP-NOW, BLE). We also analyze how the choice of split point affects communication and computation overhead across different models (MobileNet-V2 and ResNet50), as a way to minimize the end-to-end inference latency. We propose a Beam Search-based algorithm for split point optimization that minimizes end-to-end latency, and compare it with other methods, including Greedy Search, First-Fit, Random-Fit, and Brute Force. ESP-NOW achieves the best RTT (3.6 s) and serves as the base protocol for the algorithm, which delivers near-optimal latency with a processing time of 0.1 s for 5 devices.
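A beam search over split points can be sketched as follows. This is a simplified model written for illustration, not the paper's algorithm: it assumes layers are assigned to devices as contiguous segments in pipeline order, that latency is the sum of per-segment compute time and per-cut transfer time, and that `layer_cost`, `tx_cost`, and `device_speed` are known in advance. All names are hypothetical.

```python
def beam_search_splits(layer_cost, tx_cost, device_speed, beam_width=3):
    """Return (latency, cut_points) for a contiguous layer-to-device assignment.

    layer_cost[i]   : base compute time of layer i
    tx_cost[i]      : time to transmit the output of layer i (a cut after layer i)
    device_speed[d] : slowdown multiplier of device d (1.0 = baseline)
    A cut point c means layers [previous cut, c) run on one device.
    """
    n, k = len(layer_cost), len(device_speed)
    prefix = [0.0]
    for cost in layer_cost:
        prefix.append(prefix[-1] + cost)

    beam = [(0.0, ())]  # (latency so far, segment end indices chosen so far)
    for d in range(k):
        last = d == k - 1
        candidates = []
        for latency, cuts in beam:
            start = cuts[-1] if cuts else 0
            # The last device takes all remaining layers; earlier devices must
            # leave at least one layer for each device that follows.
            ends = [n] if last else range(start + 1, n - (k - d - 1) + 1)
            for end in ends:
                compute = (prefix[end] - prefix[start]) * device_speed[d]
                hop = 0.0 if last else tx_cost[end - 1]
                candidates.append((latency + compute + hop, cuts + (end,)))
        # Keep only the best `beam_width` partial assignments.
        beam = sorted(candidates)[:beam_width]

    best_latency, cuts = beam[0]
    return best_latency, list(cuts[:-1])


latency, cuts = beam_search_splits(
    layer_cost=[1, 1, 1, 1],
    tx_cost=[5, 1, 5, 5],
    device_speed=[2.0, 1.0],
)
print(latency, cuts)
```

With a beam width at least as large as the number of candidates per step this degenerates to exhaustive search; a narrower beam trades optimality for speed, which is the trade-off the abstract's comparison against Brute Force and Greedy Search is about.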
AFL-ICP: Enhancing Industrial Control Protocol Reliability via Specification-Guided Fuzzing
Authors: Jiaying Meng, Xuewei Feng, Qi Li, Min Liu, Ke Xu
First: 2026-05-06T11:07:24+00:00 · Latest: 2026-05-06T11:07:24+00:00
Comments: 11 pages, 5 figures
Abstract
Industrial Control Protocols (ICPs) are critical to the reliability and stability of industrial infrastructure, yet their security is fundamentally compromised by a specification-blindness bottleneck. Modern fuzzers, constrained by observation-driven inference, struggle to penetrate deep protocol states or detect subtle semantic deviations. In this paper, we present AFL-ICP, an autonomous fuzzing framework that pioneers a specification-driven paradigm. AFL-ICP features a context-aware specification formalization pipeline to transform complex specifications into rigorous machine-executable grammars. Building on this formalized specification, AFL-ICP leverages LLMs to enable automated protocol adaptation and seed generation, allowing for rapid extension to new protocols with minimal manual effort. Additionally, it includes an LLM-powered differential checker that cross-references implementation outputs with specification requirements to detect subtle semantic and logic bugs that existing fuzzers cannot detect. We implement AFL-ICP and evaluate it on four widely used ICPs, including both open-source and closed-source variants. Results show that AFL-ICP significantly outperforms state-of-the-art fuzzers in coverage and uncovers 24 previously unknown vulnerabilities, for which we have received acknowledgments from affected vendors (e.g., FreyrSCADA). Specifically, the identified vulnerabilities include 16 semantic and logic bugs that can silently disrupt industrial operations and degrade service availability.
SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting
Authors: Kuan-Hao Tseng, Niruth Bogahawatta, Yasod Ginige, Kosta Dekic, Arunan Sivanathan, Suranga Seneviratne
First: 2026-05-06T06:15:08+00:00 · Latest: 2026-05-06T06:15:08+00:00
Abstract
Large language model (LLM) agents are increasingly applied to network troubleshooting, but root-cause localization on public benchmarks remains well below practical deployment thresholds. We argue this is because existing agents do not encode the disciplined, layer-by-layer methodology that human network engineers use, and instead rely on free-form deliberation that conflates evidence acquisition with hypothesis commitment. We present SADE (Symptom-Aware Diagnostic Escalation), an agent that encodes the classical Cisco troubleshooting methodology as an explicit policy. SADE pairs a phase-gated diagnostic workflow, which separates evidence acquisition from hypothesis commitment, with a routed library of fault-family skills and high-yield diagnostic helpers. On a held-out set of 523 incidents from the public NIKA benchmark, covering eleven unseen scenarios, SADE improves root-cause F1 by 37 percentage points over a ReAct + GPT-5 baseline; a model-controlled comparison against the same Claude Sonnet backend without the SADE policy attributes 22 of those points to the diagnostic policy alone, showing that the gain is not a side-effect of the model upgrade.
Joint Optimization of Trajectory Control, Resource Allocation, and Task Offloading for Multi-UAV-Assisted IoV
Authors: Maoxin Ji, Qiong Wu, Pingyi Fan, Cui Zhang, Nan Cheng, Wen Chen, Khaled B. Letaief
First: 2026-05-06T02:59:18+00:00 · Latest: 2026-05-06T02:59:18+00:00
Comments: This paper has been submitted to TMC
Abstract
This paper investigates a multi-Unmanned Aerial Vehicle (UAV) joint base station-assisted Internet of Vehicles (IoV) task offloading system in dense urban environments. To minimize system delay and energy consumption under strict coupling constraints, the complex non-convex optimization problem is decoupled into a hierarchical execution framework. First, a sequential distributed optimization algorithm based on Second-Order Cone Programming (SOCP) is proposed to optimize the 3D flight trajectory of each UAV, ensuring adaptive network coverage. Second, a novel hybrid resource scheduling paradigm synergizing Deep Reinforcement Learning (DRL) and Large Language Models (LLMs) is developed. Within this framework, the DRL agent dictates the initial resource allocation, while the LLM acts as a semantic macro-scheduler to rectify long-tail allocation imbalances for failed and surplus tasks. Crucially, a reward decoupling mechanism is introduced to isolate DRL training from external LLM interventions, thereby ensuring policy convergence. Finally, the task offloading ratios are precisely determined via Linear Programming (LP) within an alternating optimization loop. Simulation results demonstrate that the proposed method significantly outperforms traditional multi-agent reinforcement learning baselines in terms of task success rate and system efficiency.
Worst-Case Discovery and Runtime Protection for RL-Based Network Controllers
Authors: Hongyu Hè, Minhao Jin, Maria Apostolaki
First: 2026-05-06T00:42:32+00:00 · Latest: 2026-05-06T00:42:32+00:00
Comments: 23 pages, 12 figures, 4 tables
Abstract
RL-based controllers achieve strong average-case performance in networking tasks such as congestion control and adaptive bitrate streaming. Yet their performance can degrade severely under network conditions where strong performance is still achievable. Identifying such conditions and quantifying the resulting performance gap is intractable by enumeration, while the sequential and closed-loop nature of RL controllers makes formal verification methods impractical.
We present ReGuard, a framework that discovers worst-case scenarios for a given RL controller and protects it against them at inference time without retraining. Discovery is formulated as a bilevel regret-maximization problem, which yields a certified lower bound on the worst-case performance gap. The discovered trajectories are then analyzed as counterfactuals and compiled into lightweight logic rules that intervene only when a risky state is detected, leaving the controller's behavior unchanged otherwise.
We evaluate ReGuard across three RL-based network controllers: Pensieve, Sage, and Park. ReGuard discovers scenarios in which the controller's performance is 43$-$64% worse than what is achievable. ReGuard not only discovers gaps 57% to 6$\times$ larger than those found by the strongest baselines but also shrinks them by 79$-$85% via lightweight rule-based protection while preserving nominal performance. ReGuard's protection extends beyond the scenarios it discovers, improving performance across a wider range of network conditions.
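The runtime-protection idea, rules that intervene only in risky states and otherwise leave the controller untouched, can be sketched as a thin wrapper. This is an illustrative pattern only; the rule representation (a predicate paired with an override) and the names `guarded_action`, `is_risky`, and `override` are assumptions, and the example rule is invented for demonstration rather than taken from ReGuard.

```python
def guarded_action(state, controller, rules):
    """Return an override if any rule flags the state, else the controller's action.

    rules: list of (is_risky, override) pairs, both callables taking `state`.
    """
    for is_risky, override in rules:
        if is_risky(state):
            return override(state)
    # No rule fired: the learned controller's behavior is unchanged.
    return controller(state)


# Hypothetical adaptive-bitrate example: fall back to the lowest bitrate
# when the playback buffer is nearly empty.
rules = [(lambda s: s["buffer_s"] < 1.0, lambda s: 0)]
controller = lambda s: 3  # stand-in for the RL policy's chosen bitrate index

print(guarded_action({"buffer_s": 0.5}, controller, rules))
print(guarded_action({"buffer_s": 5.0}, controller, rules))
```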
Surviving the Edge: Federated Learning under Networking and Resource Constraints
Authors: Mike Mwanje, Okemawo Obadofin, Theophilus Benson, Joao Barros
First: 2026-05-05T15:30:11+00:00 · Latest: 2026-05-05T15:30:11+00:00
Abstract
Motivated by the growing proliferation of federated learning (FL) in edge environments, we present the first systematic characterization of transport-layer breaking points in FL systems operating under conditions of highly constrained network and compute resources. Using a reproducible testbed with chaos engineering tools, we evaluate Flower under progressively degraded network conditions representative of resource-constrained deployments in Africa and similar environments. Our empirical investigation reveals a fundamental mismatch between FL's burst-idle communication pattern and standard TCP connection management. We identify precise operational boundaries: FL training catastrophically fails at 5-second one-way latency due to TCP handshake timeouts, above 50% packet loss due to buffer exhaustion, and with 90% client dropout rates. Through systematic analysis of connection patterns during training rounds, we demonstrate that FL's periodic model update bursts, separated by extended local training periods, violate the assumptions underlying default TCP configurations. To validate the significance of these findings, we show that adjusting just three TCP connection management parameters can significantly reduce training time under extreme latency, proving that transport-layer awareness is not merely beneficial but essential for FL deployment at the network edge. Our characterization methodology and findings provide practitioners with concrete thresholds for determining when standard FL deployments will fail and when advanced reliability techniques become necessary.
Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones
Authors: Andrea Iannoli, Lorenzo Gigli, Luca Sciullo, Angelo Trotta, Marco Di Felice
First: 2026-05-05T14:14:57+00:00 · Latest: 2026-05-05T14:14:57+00:00
Comments: 15 pages, 5 figures. This paper has been accepted for presentation at the 27th IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM 2026)
Abstract
Large Language Models (LLMs) are increasingly explored as high-level reasoning engines for cyber-physical systems, yet their application to real-time UAV swarm management remains challenging due to heterogeneous interfaces, limited grounding, and the need for long-running closed-loop execution. This paper presents a mission-agnostic, agent-enhanced LLM framework for UAV swarm control, where users express mission objectives in natural language and the system autonomously executes them through grounded, real-time interactions. The proposed architecture combines an LLM-based Agent Core with a Model Context Protocol (MCP) gateway and a Web-of-Drones abstraction based on W3C Web of Things (WoT) standards. By exposing drones, sensors, and services as standardized WoT Things, the framework enables structured tool-based interaction, continuous state observation, and safe actuation without relying on code generation. We evaluate the framework using ArduPilot-based simulation across four swarm missions and six state-of-the-art LLMs. Results show that, despite strong reasoning abilities, current general-purpose LLMs still struggle to achieve reliable execution - even for simple swarm tasks - when operating without explicit grounding and execution support. Task-specific planning tools and runtime guardrails substantially improve robustness, while token consumption alone is not indicative of execution quality or reliability.
Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
Authors: Zongze Li, Jingyu Liu, Zhen Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang
Venue: ICML 2026
First: 2026-03-09T06:11:23+00:00 · Latest: 2026-05-05T10:21:59+00:00
Comments: 19 pages, 11 figures. Accepted at ICML 2026
Abstract
Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines, which alleviates the interference of two distinctive workloads. With the growing demand for multi-turn interactions in chatbots and agentic systems, we re-examined PD in this case and found two fundamental inefficiencies: (1) every turn requires prefilling the new prompt and response from the last turn, and (2) repeated KV transfers between prefill and decode nodes saturate the bandwidth, leading to high latency and even service degradation. Our key insight is that not all prefill operations are equally disruptive: append-prefill, which processes only the new input tokens while reusing cached KV states, incurs an order-of-magnitude smaller decoding slowdown than full prefill. This motivates routing append-prefill to decode nodes locally. However, through comprehensive analysis, we show that no single fixed routing strategy satisfies all Service Level Objectives (SLOs) simultaneously. Based on this insight, we propose Prefill Prefill-capable Decode (PPD) disaggregation, a dynamic routing system that decides when to process Turn 2+ requests locally on decode nodes using cached KV states. PPD adapts to varying SLOs via configurable weights and seamlessly integrates with traditional PD deployments. With extensive evaluations, we show that PPD reduces Turn 2+ time-to-first-token (TTFT) by $\sim$68% while maintaining competitive time-per-output-token (TPOT), effectively alleviating KV transfer congestion under high load. PPD provides a flexible and efficient paradigm for multi-turn LLM serving.
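A weight-based routing decision of the kind described above might look like the sketch below. The scoring function is an assumption made for illustration: the abstract says only that PPD uses configurable weights, so the inputs (`est_local_ttft`, `est_local_tpot_penalty`, `est_remote_ttft`) and the linear score are hypothetical.

```python
def route_turn(est_local_ttft, est_local_tpot_penalty, est_remote_ttft,
               w_ttft=0.5, w_tpot=0.5):
    """Choose where to run the prefill for a Turn 2+ request.

    est_local_ttft         : estimated TTFT if append-prefill runs on the decode node
    est_local_tpot_penalty : estimated slowdown imposed on co-located decoding
    est_remote_ttft        : estimated TTFT via a prefill node, including KV transfer
    w_ttft, w_tpot         : assumed SLO weights (higher = that metric matters more)
    """
    local_score = w_ttft * est_local_ttft + w_tpot * est_local_tpot_penalty
    remote_score = w_ttft * est_remote_ttft
    return "decode-local" if local_score <= remote_score else "prefill-node"


# TTFT-sensitive deployment: the local append-prefill wins.
print(route_turn(0.1, 0.05, 0.4))
# TPOT-sensitive deployment with a heavy decode penalty: send it to a prefill node.
print(route_turn(0.1, 1.0, 0.4, w_ttft=0.1, w_tpot=0.9))
```

Making the weights explicit is what lets the same router serve different SLO mixes, which matches the abstract's point that no single fixed routing strategy satisfies every SLO.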
ARC: Consistent, Low-Latency Delivery via Receiver-Side Scheduling
Authors: Michael Luby
First: 2025-11-21T02:32:33+00:00 · Latest: 2026-05-05T00:54:03+00:00
Comments: 30 pages, 6 figures, 1 table
Abstract
Applications such as cloud gaming, video streaming, telemetry, ML inference, and data transfer provide a better experience when data is released at the receiver with timing reflecting how the data enters the sender. In practice, network delay variation and recovery dynamics at the receiver distort this timing even when transports deliver all packets correctly, producing visible jitter, stalls, and unstable playback. Many such applications operate best when delivery preserves this timing behavior and its implied order; out-of-order or irregular delivery can significantly degrade performance even when all data eventually arrives. We present a lightweight receiver-side release scheduling protocol, Adaptive Release Control (ARC), that restores this timing at the receiver. ARC releases recovered data in a manner that follows the sender's timing, maintaining ordering and limiting reordering when necessary while producing smooth delivery with minimal added latency given network conditions. It operates entirely on the receiver clock and requires no feedback, synchronization, or changes to the underlying transport. As an example, we integrate ARC into LT3, a network-layer system currently deployed as a software overlay that forwards traffic without altering the transport protocols it carries, where ARC functions as an independent module that regulates release timing for forwarded data. Evaluating LT3 with ARC on a cloud-gaming workload shows that the protocol removes virtually all large jitter excursions and yields release intervals that closely match the sender's timing, translating into improved perceptual smoothness. Broader latency improvements arise from the behavior of the full LT3 system. The benefits of ARC extend to transport protocols carried over LT3, including TCP, QUIC, WebRTC, UDP, and RTP, as preserving sender timing improves their behavior across a wide range of conditions.
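Receiver-side release scheduling in general can be sketched as below. This is a generic playout-buffer illustration and not ARC's actual algorithm: it assumes each item carries a sender-relative timestamp, uses a fixed `playout_delay`, and anchors everything to the first arrival on the receiver clock, so no clock synchronization is needed. All names are hypothetical.

```python
def release_times(packets, playout_delay):
    """Compute receiver-clock release times that follow the sender's spacing.

    packets       : list of (sender_ts, arrival_ts) in sender order
    playout_delay : assumed fixed cushion added after the first arrival
    """
    if not packets:
        return []
    first_sender, first_arrival = packets[0]
    base = first_arrival + playout_delay
    releases, previous = [], float("-inf")
    for sender_ts, arrival_ts in packets:
        target = base + (sender_ts - first_sender)
        # Never release before arrival, and never ahead of an earlier item.
        release = max(target, arrival_ts, previous)
        releases.append(release)
        previous = release
    return releases


# Sender emits every 10 time units; arrivals are jittered.
print(release_times([(0, 100), (10, 125), (20, 121), (30, 160)], playout_delay=20))
```

The first three items are released on the sender's 10-unit cadence despite jittered arrivals; the fourth arrives too late for its target and is released on arrival, which is the kind of excursion a larger cushion would absorb.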
Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
Authors: Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, Ion Stoica
First: 2025-11-04T03:43:05+00:00 · Latest: 2026-05-04T23:49:24+00:00
Abstract
KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, which interleave LLM calls with tools, introducing pauses that prevent effective KV reuse across turns. Since many tool calls have much shorter durations than human responses in multi-turn chatbots, it would be promising to retain the KV cache in GPU memory during these tool calls. However, many challenges remain. First, we need to consider both the potential cost of recomputation or reloading (if offloading is enabled) and the increased queueing delays after eviction from the GPU. Second, due to the inherent variance of tool call durations, the method needs to remain robust when those durations are hard to predict.
We present CacheTTL, a serving system that optimizes job completion time for multi-turn agent workloads by introducing a time-to-live mechanism for KV cache retention. For requests that generate tool calls, CacheTTL selectively pins the KV cache in GPU memory with a time-to-live value determined by the reload cost and the potential queueing delay induced by eviction. When the TTL expires, the KV cache can be automatically evicted to free up GPU memory, providing robust performance under edge cases. When combined with program-level first-come-first-serve scheduling, CacheTTL preserves multi-turn continuity and reduces delay for agentic workflows. Evaluations on real-world agents (SWE-Bench, BFCL, OpenHand) with Llama-3.1 8B/70B, Gemma-3 12B, and GLM-4.5 355B show that CacheTTL improves the average job completion time by over 8x while improving throughput.
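The TTL-based pinning described above can be sketched with a small bookkeeping class. The TTL formula used here (reload cost plus expected queueing delay) is an assumption chosen for illustration; the abstract says only that the TTL is determined by those two quantities. The class and method names are hypothetical.

```python
class PinnedKV:
    """Tracks which requests have their KV cache pinned, and until when."""

    def __init__(self):
        self._expiry = {}  # request_id -> absolute expiry time

    def pin(self, request_id, now, reload_cost_s, queue_delay_s):
        # Assumed rule: keeping the cache is worthwhile for as long as a
        # return within that window would be cheaper than evict-and-reload.
        ttl = reload_cost_s + queue_delay_s
        self._expiry[request_id] = now + ttl
        return ttl

    def is_pinned(self, request_id, now):
        return self._expiry.get(request_id, float("-inf")) > now

    def evict_expired(self, now):
        """Drop every entry whose TTL has elapsed and return their ids."""
        expired = [rid for rid, t in self._expiry.items() if t <= now]
        for rid in expired:
            del self._expiry[rid]
        return expired


cache = PinnedKV()
cache.pin("req-1", now=0.0, reload_cost_s=2.0, queue_delay_s=3.0)
print(cache.is_pinned("req-1", now=4.0))
print(cache.evict_expired(now=5.0))
```

Because expiry is automatic, a tool call that never returns cannot hold GPU memory indefinitely, which is the robustness property the abstract highlights.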
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
Authors: Hongyao Liu, Liuqun Zhai, Junyi Wang, Zhengru Fang
First: 2026-04-23T02:55:31+00:00 · Latest: 2026-05-04T22:59:12+00:00
Comments: Withdrawn by the authors due to an incorrect assumption in the model definition in Section 4, which affects the conclusions
Abstract
Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x to 5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x to 3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
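The per-chunk stream-or-compute decision with two overlapping paths can be sketched as a simple greedy scheduler. This is a generic illustration, not SparKV's cost model (which the withdrawal note above says rested on an incorrect assumption): it assumes each chunk has a known streaming time and local compute time, that the two paths run fully in parallel, and that chunks are independent. The function name and inputs are hypothetical.

```python
def schedule_chunks(chunks):
    """Greedily assign each chunk to whichever path would finish it sooner.

    chunks: list of (stream_s, compute_s) time estimates per KV chunk.
    Returns (plan, makespan), where plan[i] is "stream" or "compute".
    """
    stream_busy = 0.0   # time at which the network path becomes free
    compute_busy = 0.0  # time at which the on-device compute path becomes free
    plan = []
    for stream_s, compute_s in chunks:
        if stream_busy + stream_s <= compute_busy + compute_s:
            stream_busy += stream_s
            plan.append("stream")
        else:
            compute_busy += compute_s
            plan.append("compute")
    # Both paths overlap, so the slower one determines the total latency.
    return plan, max(stream_busy, compute_busy)


plan, makespan = schedule_chunks([(1.0, 2.0), (1.0, 2.0), (3.0, 1.0)])
print(plan, makespan)
```

A greedy pass like this is not guaranteed to be optimal; it only illustrates why overlapping the two paths can finish sooner than sending every chunk down a single path.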
Beyond State Machines: Executing Network Procedures with Agentic Tool-Calling Sequences
Authors: Purna Sai Garigipati, Onur Ayan, Kishor Chandra Joshi, Xueli An
First: 2026-05-04T13:34:20+00:00 · Latest: 2026-05-04T13:34:20+00:00
Abstract
Agentic AI will be an essential enabling technology for designing future mobile communication systems, which could provide flexible and customized services, automate complex network operations, and drive autonomous decision-making across the network. This work studies how Large Language Model (LLM)-based network AI agents can be utilized to execute network procedures expressed as sequences of tool invocations. We investigate four approaches, which differ in how the agent obtains the procedure and in how execution is distributed between the agent and the underlying tools. We evaluate the latency and execution correctness of these approaches using a User Equipment (UE) IP allocation procedure as a case study. Furthermore, we conduct a stress test to examine how many sequential procedural steps an LLM agent can reliably execute before failure. Our results show that approaches relying on iterative agent-side reasoning incur higher latency and are more prone to execution errors, while approaches where the procedure is encapsulated within a single tool, which internally orchestrates the required steps by invoking other tools, reduce latency by limiting repeated reasoning. The stress-test results further show that the model with advanced tool-calling capability maintains reliable execution over longer procedures than the other evaluated models; however, all models exhibit reliability degradation as procedure length increases, revealing clear execution limits in multi-step tool-based workflows. To systematically analyze failures in procedure execution, we introduce a procedure-specific error taxonomy that categorizes deviations in multi-step procedural execution.
Summary / 总结
Agentic AI will be an essential enabling technology for designing future mobile communication systems, which could provide flexible and customized services, automate complex network operations, and drive autonomous decision-making across the network.
A Protocol-Independent Transport Architecture
Authors: Kimiya Mohammadtaheri, David Gao, Samuel Zhang, Matthew Chen, Eric Su, Pengyu Ji, Saad Syed, Chris Neely, Mario Baldi, Nachiket Kapre, Mina Tahmasbi Arashloo
First: 2026-05-04T04:20:04+00:00 · Latest: 2026-05-04T04:20:04+00:00
Abstract
The network transport layer is increasingly implemented in the NIC hardware to meet the performance demands of modern workloads, but this has made it difficult to evolve or deploy new transport protocols. Existing approaches either fix protocol logic in the data-path or build protocol-specific assumptions into the architecture that limit the range of protocols that can be supported on a single hardware substrate.
We present PITA, a protocol-independent transport architecture that enables full data-path programmability while sustaining line-rate performance. PITA eliminates protocol-specific assumptions by structuring the data-path around a uniform abstraction over events, state, and instructions, and rethinks core components, including scheduling, packet generation, and data reassembly, to operate on this abstraction. We evaluate PITA along key dimensions reflecting the goals of its protocol-agnostic data-path design. Specifically, we show that PITA supports diverse protocol semantics by implementing TCP and RoCE on the same data-path while preserving their distinct end-to-end behavior. Through targeted microbenchmarks and synthesis on Alveo U250 cards, we show that PITA's redesigned components sustain high performance under demanding conditions, with modest hardware overhead, while meeting timing at 250 MHz.
Summary / 总结
The network transport layer is increasingly implemented in the NIC hardware to meet the performance demands of modern workloads, but this has made it difficult to evolve or deploy new transport protocols.
AdvNet: Revealing Performance Issues in Network Protocols by Generating Adversarial Environments
Authors: Shehab Sarar Ahmed, William Sentosa, Yinjie Zhang, Yoav Lebendiker, Michael Shnaiderman, Tomer Gilad, Nathan H. Jay, Brighten Godfrey, Michael Schapira
First: 2026-05-01T16:12:27+00:00 · Latest: 2026-05-04T02:55:17+00:00
Comments: 18 pages, 8 figures
Abstract
Infrastructure protocols like Congestion Control (CC) seek to provide reliable performance across a wide range of Internet environments. Currently, protocol designers assess performance through hand-designed test cases or data sets captured from real environments. However, such approaches may inadvertently overlook critical facets of the algorithm's behavior when they encounter an unanticipated environment or workload.
We seek to understand the unanticipated with AdvNet, a system that automatically generates adversarial network environments that cause a target protocol implementation to perform poorly. AdvNet employs machine learning-based optimization to generate environments, and incorporates a robust noise-handling mechanism to mitigate the variability inherent in real-world protocol performance. Although our approach is more general, this paper focuses specifically on transport protocols and their CC implementations. We showcase AdvNet's capability to create adversarial scenarios for 27 kernel-space implementations of both single-path and multi-path CC protocols, for several use cases with different performance goals. AdvNet identifies problematic network conditions that expose previously unnoticed Linux kernel bugs and uncovers hidden limitations in CC implementations, and provides insights about robustness. These results suggest that automated adversarial testing can be a valuable tool in protocol development, and that robustness is a useful new dimension for benchmarking CC protocols.
Summary / 总结
Infrastructure protocols like Congestion Control (CC) seek to provide reliable performance across a wide range of Internet environments.
6G Needs Agents: Toward Agentic AI-Native Networks for Autonomous Intelligence
Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
First: 2026-05-02T17:24:12+00:00 · Latest: 2026-05-02T17:24:12+00:00
Abstract
Sixth-generation (6G) networks are increasingly envisioned as AI-native infrastructures integrating communication, sensing, and computing into a unified fabric. However, existing approaches remain largely optimization-centric, relying on closed-loop control with limited reasoning capability. In this paper, we argue for a paradigm shift toward Agentic AI-Native 6G, in which Large Language Model (LLM)-based agents operate as bounded, policy-governed reasoning entities within a semantic control plane layered above deterministic 3GPP infrastructure. We propose a four-layer architecture that integrates deterministic network infrastructure, semantic abstraction of intent and context, hierarchical reasoning, and a distributed multi-agent fabric spanning device, edge, and core domains. To assess feasibility, we develop a proof-of-concept agentic reasoning and orchestration framework and conduct an extensive empirical study using a domain-specific 6G benchmark under realistic deployment constraints. Our results reveal a fundamental tradeoff between reasoning capability and system efficiency, showing that no single model simultaneously satisfies latency, throughput, and accuracy requirements. Instead, heterogeneous deployment of LLM agents across the device-edge-core continuum is necessary to balance these constraints. We further demonstrate that quantization introduces non-uniform effects across models, reinforcing the need for system-level optimization rather than model-level compression alone. These findings establish agentic intelligence as a viable architectural direction for 6G and highlight key challenges in achieving scalable, trustworthy, and self-reasoning networks. All experimental results and evaluation scripts are publicly available to support reproducibility.
Summary / 总结
Sixth-generation (6G) networks are increasingly envisioned as AI-native infrastructures integrating communication, sensing, and computing into a unified fabric.
Space Network of Experts: Architecture and Expert Placement
Authors: Zhanwei Wang, Huiling Yang, Min Sheng, Khaled B. Letaief, Kaibin Huang
First: 2026-05-01T08:40:31+00:00 · Latest: 2026-05-01T08:40:31+00:00
Abstract
Leveraging continuous solar energy harvesting at high efficiency, space data centers are envisioned as a promising platform for executing energy-intensive large language models (LLMs). Recognizing this advantage, space and AI conglomerates (e.g., SpaceX, Google) are actively investing in this vision. One key challenge, however, is the efficient distributed deployment of a large-scale LLM in a satellite network due to the limited onboard computing and communication resources. This gives rise to a placement problem that involves partitioning and mapping model components to satellites such that the fundamentally different model architecture and network topology can be reconciled to ensure low-latency token generation. To address this problem, we present the Space Network of Experts (Space-XNet) framework targeting the distributed execution of a popular mixture-of-experts (MoE) model in space. The proposed placement strategies are two-level: (1) layer placement, which assigns MoE layers to satellite subnets; and (2) intra-layer expert placement, which assigns individual experts to satellites associated with the same layer/subnet. For layer placement, we exploit the ring-like communication pattern of autoregressive inference to partition the satellite constellation along the orbiting direction into subnets arranged on a ring, each hosting one MoE layer. Based on this architecture, we formulate and solve an optimization problem for intra-layer expert placement to map experts with heterogeneous activation probabilities onto satellites. The derived strategy reveals an intuitive principle: a frequently activated expert should be mapped to a satellite on a routing path with low expected latency. Experiments over a thousand-satellite constellation show that Space-XNet achieves at least a threefold latency reduction compared with conventional random and ablation-based placement strategies.
Summary / 总结
Leveraging continuous solar energy harvesting at high efficiency, space data centers are envisioned as a promising platform for executing energy-intensive large language models (LLMs).
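The placement principle stated in the abstract above (a frequently activated expert should sit on a low-latency routing path) can be illustrated with a small sketch. This is not the paper's optimization: it assumes exactly one expert per satellite and a fixed expected latency per satellite, and the names `place_experts`, `expected_latency`, and all numbers are hypothetical.

```python
def place_experts(activation_prob, satellite_latency):
    """Pair experts with satellites: highest activation probability gets lowest latency.

    Simplifying assumption (not from the paper): one expert per satellite.
    """
    if len(activation_prob) != len(satellite_latency):
        raise ValueError("this sketch assumes one expert per satellite")
    # Experts from most to least frequently activated.
    experts = sorted(activation_prob, key=activation_prob.get, reverse=True)
    # Satellites from lowest to highest expected latency.
    satellites = sorted(satellite_latency, key=satellite_latency.get)
    return dict(zip(experts, satellites))


def expected_latency(placement, activation_prob, satellite_latency):
    """Activation-weighted latency of a placement."""
    return sum(activation_prob[e] * satellite_latency[s] for e, s in placement.items())


# Hypothetical activation probabilities and per-satellite latencies (ms).
activation_prob = {"expert_0": 0.1, "expert_1": 0.6, "expert_2": 0.3}
satellite_latency = {"sat_a": 5.0, "sat_b": 1.0, "sat_c": 3.0}

placement = place_experts(activation_prob, satellite_latency)
print(placement)
print(expected_latency(placement, activation_prob, satellite_latency))
```

Under these assumptions the sorted pairing minimizes the activation-weighted latency (a consequence of the rearrangement inequality); the paper's actual formulation over routing paths is likely richer than this.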
LLM-Based Agentic Negotiation for 6G: Addressing Uncertainty Neglect and Tail-Event Risk
Authors: Hatim Chergui, Farhad Rezazadeh, Mehdi Bennis, Merouane Debbah, Christos Verikoukis
First: 2025-11-24T14:36:11+00:00 · Latest: 2026-04-30T21:49:18+00:00
Abstract
A critical barrier to the trustworthiness of sixth-generation (6G) agentic autonomous networks is the uncertainty neglect bias: a cognitive tendency for large language model (LLM)-powered agents to make high-stakes decisions based on simple averages while ignoring the tail risk of extreme events. This paper proposes an unbiased, risk-aware framework for agentic negotiation, designed to ensure robust resource allocation in 6G network slicing. Specifically, agents leverage Digital Twins (DTs) to predict full latency distributions, which are then evaluated using a formal framework from extreme value theory, namely, Conditional Value-at-Risk (CVaR). This approach fundamentally shifts the agent's objective from reasoning over the mean to reasoning over the tail, thereby building a statistically-grounded buffer against worst-case outcomes. Furthermore, our framework ensures full uncertainty awareness by requiring agents to quantify epistemic uncertainty (confidence in their own DTs' predictions) and propagate this meta-verification to make robust decisions, preventing them from acting on unreliable data. We validate this framework in a 6G inter-slice negotiation use-case between an eMBB and a URLLC agent across 200 trials. The results demonstrate the profound failure of the biased, mean-based baseline, which systematically violates the strict URLLC SLA 11 times. Our unbiased, CVaR-aware agent successfully mitigates this bias, eliminating SLA violations entirely and significantly reducing the 99.999th-percentile latencies by up to 51.7%. We show this reliability comes at the rational and quantifiable cost of reduced energy savings, exposing the false economy of the biased approach. Crucially, executing our framework with an otel-llm-1b-it model on a single NVIDIA RTX A4000 GPU achieves sub-1.5-second inference times, validating the feasibility for non-real-time RIC use-cases.
Summary / 总结
A critical barrier to the trustworthiness of sixth-generation (6G) agentic autonomous networks is the uncertainty neglect bias: a cognitive tendency for large language model (LLM)-powered agents to make high-stakes decisions based on simple averages while ignoring the tail risk of extreme events.
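To make the mean-versus-tail distinction in the abstract above concrete, here is a minimal sketch of an empirical CVaR estimate over latency samples. The estimator, the confidence level, and the sample values are assumptions for illustration; the abstract does not specify how the authors compute CVaR from the DT-predicted distributions.

```python
import math


def cvar(samples, alpha=0.95):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of samples.

    Larger values are treated as worse (e.g. latency).
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(samples)
    # Tail size, at least one sample; rounding guards against float noise in n * (1 - alpha).
    k = max(1, math.ceil(round(len(ordered) * (1.0 - alpha), 9)))
    tail = ordered[-k:]
    return sum(tail) / k


def mean(samples):
    return sum(samples) / len(samples)


# Hypothetical latency samples (ms) for two candidate allocations.
bursty = [1.0] * 9 + [20.0]   # low average, one extreme event
steady = [4.0] * 10           # higher average, no tail

print(mean(bursty), mean(steady))            # the mean prefers "bursty"
print(cvar(bursty, 0.9), cvar(steady, 0.9))  # the tail measure prefers "steady"
```

An agent ranking allocations by the mean would pick the bursty option, while ranking by CVaR picks the steady one, which is the shift in objective the abstract describes.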
Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving
Authors: Junsun Choi, Sam Son, Sunjin Choi, Hansung Kim, Yakun Sophia Shao, Scott Shenker, Sylvia Ratnasamy, Borivoje Nikolic
First: 2026-04-30T21:35:22+00:00 · Latest: 2026-04-30T21:35:22+00:00
Abstract
Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication consumes a considerable portion of LLM serving runtime. This has prompted industry to invest heavily in expensive high-bandwidth scale-up networks. We question whether such costly infrastructure is strictly necessary. We present the first systematic cross-layer analysis of network cost-effectiveness for MoE LLM serving, comparing four representative XPU (e.g., GPU/TPU) topologies (scale-up, scale-out, 3D torus, and 3D full-mesh). We find that lower-cost switchless topologies are more cost-effective than the scale-up topology across all serving scenarios explored, improving cost-effectiveness by 20.6-56.2%. In particular, the 3D full-mesh topology is Pareto-optimal in terms of the performance-cost tradeoff. We also find that current scale-up link bandwidths are over-provisioned: reducing the link bandwidth improves throughput per cost by up to 27%. A forward-looking analysis of upcoming GPU generations indicates that the cost-performance advantage of switchless networks will likely persist.
Summary / 总结
Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication consumes a considerable portion of LLM serving runtime.
A Multi-Perspective Study of the Internet Shutdown in Iran
Authors: Ali Sadeghi Jahromi, Jason Jaskolka
First: 2026-04-30T20:04:12+00:00 · Latest: 2026-04-30T20:04:12+00:00
Comments: 7 pages, 1 figure
Abstract
Iran conducted two nationwide Internet shutdowns in January and March 2026, the latter ongoing at the time of writing and the longest documented Iranian disruption. Using a three-plane methodology combining passive Censys scan data, active TCP reachability probing from five vantage points, and BGP analysis across 33 RIPE RIS snapshots from 2019 to 2026, we show that the 2022 and 2026 shutdowns are enforced via forwarding-plane null-routing at a centralized border while BGP announcements remain stable, and that Iran shifted from partial BGP withdrawal in 2019 to pure null-routing by 2022. This control- and forwarding-plane decoupling prevents BGP-based outage monitors from detecting shutdowns.
Active probing of 4,571 BGP-visible Iranian prefixes shows that 96.5 to 97.4% are null-routed across all vantage points, indicating a centrally coordinated mechanism. Passive scan analysis reveals a 3.7 times increase in visible hosts between shutdown events due to measurement artifacts rather than recovery, along with two structural exemptions: academic networks rise from 1.4 to 66.6% of visible hosts during partial recovery, and ArvanCloud CDN retains 99.7% visibility while other major operators drop by at least 77%.
Summary / 总结
Iran conducted two nationwide Internet shutdowns in January and March 2026, the latter ongoing at the time of writing and the longest documented Iranian disruption.
RouteProfile: Elucidating the Design Space of LLM Profiles for Routing
Authors: Jingjun Xu, Hongji Pu, Tao Feng, Haozhen Zhang, Jiaxuan You, Ge Liu
First: 2026-04-30T19:56:08+00:00 · Latest: 2026-04-30T19:56:08+00:00
Abstract
As the large language model (LLM) ecosystem expands, individual models exhibit varying capabilities across queries, benchmarks, and domains, motivating the development of LLM routing. While prior work has largely focused on router mechanism design, LLM profiles, which capture model capabilities, remain underexplored. In this work, we ask: How does LLM profile design affect routing performance across different routers? Addressing this question helps clarify the role of profiles in routing, disentangle profile design from router design, and enable fairer comparison and more principled development of routing systems. To this end, we view LLM profiling as a structured information integration problem over heterogeneous interaction histories. We develop a general design space of LLM profiles, named RouteProfile, along four key dimensions: organizational form, representation type, aggregation depth, and learning configuration. Through systematic evaluation across three representative routers under both standard and new-LLM generalization settings, we show that: (1) structured profiles consistently outperform flat ones; (2) query-level signals are more reliable than coarse domain-level signals; and (3) generalization to newly introduced models benefits most from structured profiles under trainable configurations. Overall, our work highlights LLM profile design as an important direction for future routing research.
Summary / 总结
As the large language model (LLM) ecosystem expands, individual models exhibit varying capabilities across queries, benchmarks, and domains, motivating the development of LLM routing.
DeGenTWeb: A First Look at LLM-dominant Websites
Authors: Sichang Steven He, Calvin Ardi, Ramesh Govindan, Harsha V. Madhyastha
First: 2026-04-30T17:54:35+00:00 · Latest: 2026-04-30T17:54:35+00:00
Comments: 6 pages, 6 figures, 13 page total; in submission
Abstract
Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web.
We describe DeGenTWeb, which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that this share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.
Summary / 总结
Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web.
Multi-Connectivity for UAVs: A Measurement Study of Integrating Cellular, Aerial Mesh, and LEO Satellite Links
Authors: Aygun Baltaci, Irshad A. Meer, Mustafa Ozger, Cicek Cavdar, Dominic Schupke
First: 2026-04-30T09:33:36+00:00 · Latest: 2026-04-30T09:33:36+00:00
Comments: Accepted in IEEE EuCNC
Abstract
Future uncrewed aerial vehicle (UAV) systems increasingly combine heterogeneous communication technologies, such as low-latency aerial mesh, terrestrial cellular, and satellite links, to improve robustness and coverage. Multipath transport is a natural mechanism for aggregating these links, yet its ability to support real-time UAV services in highly heterogeneous environments remains insufficiently characterized. We present a measurement-driven study based on UAV flight experiments in an integrated network comprising UAV-to-UAV aerial mesh, private cellular, and low Earth orbit (LEO) satellite connectivity. Using Multipath TCP (MPTCP) as a representative lossless, in-order multipath transport framework, we find that aggregation can preserve end-to-end connectivity under severe link outages. However, large round-trip time (RTT) heterogeneity amplifies packet reordering, leading to substantial receiver-side buffering and bursty delivery. In addition, when the available links do not provide sufficient capacity for the offered load, pronounced sender-side buffering emerges. These effects cause real-time streaming to violate delay constraints, including cases where aggregate capacity is sufficient. To interpret these results, we formalize the distinction between connectivity continuity and service continuity and show empirically that maintaining connectivity is necessary but not sufficient for timely real-time delivery in multi-technology UAV networks. The findings motivate multipath designs that explicitly account for delay constraints, rather than optimizing for connectivity alone.
Summary / 总结
Future uncrewed aerial vehicle (UAV) systems increasingly combine heterogeneous communication technologies, such as low-latency aerial mesh, terrestrial cellular, and satellite links, to improve robustness and coverage.
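The receiver-side buffering effect described in the abstract above can be reproduced with a toy model of in-order delivery over two paths with very different delays. This is only a sketch: the round-robin path choice, the delay values, and the function name `in_order_delivery` are assumptions, and it does not model MPTCP's actual schedulers, loss, or congestion control.

```python
def in_order_delivery(num_packets, send_interval_ms, path_delay_ms):
    """Return (arrival, delivery, wait) per packet under strict in-order delivery.

    Packets are spread round-robin over the paths (an assumption of this sketch).
    A packet is delivered to the application only after all earlier packets.
    """
    results = []
    last_delivery = 0.0
    for seq in range(num_packets):
        send_time = seq * send_interval_ms
        delay = path_delay_ms[seq % len(path_delay_ms)]
        arrival = send_time + delay
        # Head-of-line blocking: wait for the slowest earlier packet.
        delivery = max(arrival, last_delivery)
        results.append((arrival, delivery, delivery - arrival))
        last_delivery = delivery
    return results


# Hypothetical one-way delays (ms): a short mesh-like path and a long satellite-like path.
for arrival, delivery, wait in in_order_delivery(4, 1.0, [10.0, 100.0]):
    print(f"arrival={arrival:6.1f}  delivery={delivery:6.1f}  buffered_for={wait:5.1f}")
```

Packets that arrive early on the fast path sit in the receive buffer until the slow path catches up, so delivery becomes bursty even though connectivity is never lost.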
BLINC: Context-Specific Causal Learning for Automated RAN Configuration
Authors: Reshma Prasad, Michele Polese, Tommaso Melodia
First: 2026-04-29T18:25:01+00:00 · Latest: 2026-04-29T18:25:01+00:00
Comments: 10 pages
Abstract
Radio Access Network (RAN) configuration has traditionally required significant manual effort due to indirect causal dependencies between observable Key Performance Indicators (KPIs), and context-dependent characteristics, where the optimal configurations vary with network conditions. Although recent data-driven approaches improve parameter tuning, they remain limited in distinguishing causal direction from statistical correlation and in generalizing across diverse operating contexts.
To address these challenges, we propose BLINC (Bayesian Large Language Model (LLM)-Driven Intelligent Network Configuration), an LLM-assisted Bayesian Network framework that integrates telecommunications domain knowledge into causal structure learning. Trained and validated on a private 5G deployment, our method achieves a throughput improvement of 63.5% with a 19.7% reduction in block error rate over data-only baselines through joint optimization of power control and link adaptation parameters. The framework provides interpretable causal structure, while also quantifying prediction uncertainty. We also demonstrate the ability of the Bayesian Network framework to adapt to different deployment scenarios and propose an incremental Conditional Probability Distribution (CPD) update mechanism with learning rate for continuous model adaptation as network conditions evolve.
Summary / 总结
Radio Access Network (RAN) configuration has traditionally required significant manual effort due to indirect causal dependencies between observable Key Performance Indicators (KPIs), and context-dependent characteristics, where the optimal configurations vary with network conditions.
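One plausible reading of an incremental CPD update with a learning rate, as mentioned in the abstract above, is an exponential moving average between the stored conditional probabilities and the frequencies observed in a new batch. The sketch below implements that reading only; the update rule, the function name `update_cpd`, and the example states are assumptions, not the authors' mechanism.

```python
def update_cpd(cpd, new_counts, learning_rate=0.1):
    """Blend each CPD row with empirical frequencies from new observations.

    cpd:        {parent_state: [P(child = k | parent_state) for each k]}
    new_counts: {parent_state: [count of child = k in the new batch]}
    Rows with no new observations are left unchanged.
    """
    if not 0.0 <= learning_rate <= 1.0:
        raise ValueError("learning_rate must be in [0, 1]")
    updated = {}
    for parent_state, probs in cpd.items():
        counts = new_counts.get(parent_state)
        total = sum(counts) if counts else 0
        if total == 0:
            updated[parent_state] = list(probs)
            continue
        empirical = [c / total for c in counts]
        # Convex combination keeps each row a valid probability distribution.
        updated[parent_state] = [
            (1.0 - learning_rate) * p + learning_rate * q
            for p, q in zip(probs, empirical)
        ]
    return updated


# Hypothetical CPD: P(throughput in {low, high} | channel state).
cpd = {"good_channel": [0.5, 0.5], "poor_channel": [0.9, 0.1]}
batch = {"good_channel": [3, 1]}  # new observations for one parent state only
print(update_cpd(cpd, batch, learning_rate=0.2))
```

A small learning rate makes the model adapt slowly and resist noisy batches, while a rate of 1.0 replaces the row with the latest empirical frequencies.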
SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks
Authors: Jiao Chen, Jianhua Tang, Xiaotong Yang, Zuohong Lv
First: 2026-04-29T04:20:35+00:00 · Latest: 2026-04-29T04:20:35+00:00
Abstract
AI coding agents demonstrate strong performance on general-purpose software benchmarks. However, their ability to handle 5G network engineering tasks remains unexplored. We propose SWE-Bench 5G, the first benchmark designed to investigate whether AI coding agents can resolve real-world bugs in 5G core network software. The benchmark collects task instances from three open-source 5G projects, packages each as a self-contained Docker environment with automated fail-to-pass tests, and provides a dual test strategy tailored to the complex runtime dependencies of telecom code. In addition, for instances whose original issues reference 3GPP specification clauses, we construct concise specification context documents, enabling controlled evaluation of whether domain knowledge improves agent performance. Experiments on four LLMs reveal that all models diagnose bugs at rates exceeding 91%, yet resolve rates remain between 10% and 30%, suggesting that both iterative code editing capability and domain knowledge play important roles. The specification injection experiment further confirms that 3GPP excerpts improve resolve rates on specification-dependent bugs, while the gains on generic defensive checks remain limited, indicating that the effect of domain knowledge is conditional on bug type.
Summary / 总结
AI coding agents demonstrate strong performance on general-purpose software benchmarks.
Assistants, Not Architects: The Role of LLMs in Networked Systems Design
Authors: Pratyush Sahu, Rahul Bothra, Venkat Arun, Brighten Godfrey, Akshay Narayan, Ahmed Saeed
First: 2026-04-28T11:08:11+00:00 · Latest: 2026-04-28T11:08:11+00:00
Abstract
Designing the architecture of modern networked systems requires navigating a large, combinatorial space of hardware, systems, and configuration choices with complex cross-layer interactions. Architects must balance competing objectives such as performance, cost, and deployability while satisfying compatibility and resource constraints, often relying on scattered rules-of-thumb drawn from benchmarks, papers, documentation, and expert experience. This raises a natural question: can large language models (LLMs) reliably perform this kind of architectural reasoning? We find that they cannot. While LLMs produce plausible configurations, they frequently miss critical constraints, encode incorrect assumptions, and exhibit "stickiness" to familiar patterns. A natural workaround, iterative validation via simulation or experimentation, is often prohibitively expensive at scale and, in many cases, infeasible, particularly when comparing hardware-dependent alternatives.
Motivated by this gap, we present Kepler, a lightweight reasoning framework for architecture design that combines structured, expert-driven specifications with SMT-based optimization. Kepler encodes architecturally significant properties (requirements, incompatibilities, and qualitative trade-offs) about systems, hardware, and workloads as constraints, and synthesizes feasible designs that optimize user-defined objectives. It operates at an abstract level, capturing "rules-of-thumb" rather than detailed system behavior, enabling tractable reasoning while preserving key interactions, and provides explanations for its decisions. Through experiments and case studies, we show that Kepler uncovers interactions missed by LLMs and supports systematic, explainable design exploration.
Summary / 总结
Designing the architecture of modern networked systems requires navigating a large, combinatorial space of hardware, systems, and configuration choices with complex cross-layer interactions.
TrimCaching: Parameter-sharing Edge Caching for AI Model Downloading
Authors: Guanqiao Qu, Zheng Lin, Qian Chen, Jian Li, Fangming Liu, Xianhao Chen, Kaibin Huang
First: 2024-04-22T14:13:36+00:00 · Latest: 2026-04-27T06:22:12+00:00
Comments: 19 pages, 13 figures. Part of this work has been accepted by ICDCS 2024
Abstract
Next-generation mobile networks are expected to facilitate fast AI model downloading to end users. By caching models on edge servers, mobile networks can deliver models to end users with low latency, resulting in a paradigm of edge model caching. In this paper, we develop a novel model placement framework, called parameter-sharing model caching (TrimCaching). TrimCaching exploits the key observation that a wide range of AI models, such as convolutional neural networks or large language models, can share a significant proportion of parameter blocks containing reusable knowledge, thereby improving storage efficiency. To this end, we formulate a parameter-sharing model placement problem to maximize the cache hit ratio in multi-edge wireless networks by balancing the fundamental tradeoff between storage efficiency and service latency. We show that the formulated problem is a submodular maximization problem with submodular constraints, for which no polynomial-time approximation algorithm exists. To tackle this challenge, we study an important special case, where a small fixed number of parameter blocks are shared across models, which often holds in practice. In such a case, a polynomial-time algorithm with a $(1-\epsilon)/2$-approximation guarantee is developed. Subsequently, we address the original problem for the general case by developing a greedy algorithm. Simulation results demonstrate that the proposed TrimCaching framework significantly improves the cache hit ratio compared with state-of-the-art content caching without exploiting shared parameters in AI models.
Summary / 总结
Next-generation mobile networks are expected to facilitate fast AI model downloading to end users.
MatchRDMA: A Segmented and Rate-Matched Long-Haul RDMA Scheme for Geo-distributed LLM Training over OTN
Authors: Jun Dai, Xiaorun Wang, Xingde Li, Zheng Yang, Kexiong Fang, Zhiqun Gu, Hongxiang Wang, Yuefeng Ji, Jiawei Zhang
First: 2026-04-27T01:16:21+00:00 · Latest: 2026-04-27T01:16:21+00:00
Comments: 4 pages, 3 figures
Abstract
We propose MatchRDMA, a proactive, segmented, and rate-matched long-haul RDMA scheme for geo-distributed LLM training over OTN. By coordinating source and destination OTN rates, it improves inter-DC throughput by up to 20x compared with conventional RDMA, and reduces destination-OTN buffer occupancy by up to 62.7%.
Summary / 总结
We propose MatchRDMA, a proactive, segmented, and rate-matched long-haul RDMA scheme for geo-distributed LLM training over OTN.
An Agentic Framework for Intent Co-Creation in 6G NaaS: Architecture and Open-Source Model Evaluation
Authors: Kostis Trantzas, Besiana Agko, Christos Tranoris, Irene Denazi
First: 2026-04-25T13:04:23+00:00 · Latest: 2026-04-25T13:04:23+00:00
Abstract
6G network complexity necessitates high levels of autonomy, yet current intent-based systems struggle with ambiguous or incomplete human requests. This paper introduces an agent-based, intent-driven end-to-end (E2E) orchestration framework designed for Network-as-a-Service (NaaS) delivery through collaborative intent co-creation. The proposed system leverages a pool of Domain Expert Agents and a TM Forum-aligned Body-of-Knowledge (BoK) to iteratively refine user requests into deterministic, machine-readable actions. A fundamental design principle is the decoupling of cognition and actuation, where AI-driven reasoning is isolated from standardized execution controllers to ensure safety and operational trust. The framework includes a dual-layer memory system to maintain coherence during multi-step collaborations. The presented prototype, built on ETSI OpenSlice and the Model Context Protocol (MCP), is evaluated across several open-source Large Language Models (LLMs). While these models demonstrate high instruction compliance, results reveal a significant gap in translating high-resolution intents into valid, catalog-backed orders without hallucinations.
Summary / 总结
6G network complexity necessitates high levels of autonomy, yet current intent-based systems struggle with ambiguous or incomplete human requests.
Towards Agentic Test-Driven Quality Assurance for 6G Networks
Authors: Christos Tranoris, Besiana Agko, Kostis Trantzas, Irene Denazi
First: 2026-04-25T12:58:36+00:00 · Latest: 2026-04-25T12:58:36+00:00
Comments: 8 pages
Abstract
This work proposes an agentic, intent-driven end-to-end (E2E) orchestration framework that integrates intent co-creation with a Test-Driven Quality Assurance paradigm. In this framework, autonomous agents iteratively refine a user's initial intent into a confirmed, auditable specification. Furthermore, the system automatically derives validation tests from these intents before provisioning, directly mirroring the Test-Driven Development workflow in software engineering to ensure proactive Service Level Agreement (SLA) compliance. The architecture is grounded in a standards-aligned knowledge representation using TM Forum (TMF) information models and catalogs. This enables deterministic graph traversal from high-level Product Offerings down to granular Service/Resource and Test specifications. We prototyped this architecture by extending OpenSlice with a message-driven, multi-agent pattern and integrating MCP-enabled (Model Context Protocol) tool access for real-time knowledge retrieval. Currently, our evaluation of the agents targets the intent co-creation phase as a baseline toward full-scale orchestration. Building on experiments with multiple open-source Large Language Model (LLM) backends integrated with the TMF-based knowledge base, we observe substantial variability in tool-use reliability and hallucination patterns, underscoring the critical importance of robust knowledge integration in agentic 6G systems.
Summary / 总结
This work proposes an agentic, intent-driven end-to-end (E2E) orchestration framework that integrates intent co-creation with a Test-Driven Quality Assurance paradigm.
RANalyzer: Automated Continuous RAN Software Evaluation and Regression Analysis
Authors: Ravis Shirkhani, Reshma Prasad, Leonardo Bonati, Tommaso Melodia, Michele Polese
First: 2026-04-25T05:49:44+00:00 · Latest: 2026-04-25T05:49:44+00:00
Comments: 9 pages, 11 figures, 2 tables
Abstract
Software-driven O-RAN architectures enable rapid innovation through frequent, independent updates to virtualized components. However, attributing performance variations to specific software changes is challenging due to the stochastic nature of wireless systems, where channel conditions, interference, and hardware variability confound analysis. Traditional threshold-based monitoring and manual troubleshooting do not scale with modern software evolution.
This paper presents RANalyzer, an automated test analysis framework that quantifies the performance impact of software updates beyond what can be explained by wireless channel conditions. RANalyzer combines LLM-assisted semantic extraction with residuals analysis. The first categorizes code changes by affected protocol layers and functional components, while the second provides insights on the effect of load, channel, or code changes on the test performance. We contribute an extensive dataset collected over more than two years of continuous over-the-air testing on an experimental O-RAN testbed, comprising over 8,600 automated tests across 69 releases of the OAI stack. By modeling expected performance and interpreting deviations as software-induced effects, we identify degraded instances attributable to code changes and correlate them with specific change categories. The framework can be integrated into CI/CD/CT pipelines for automated, continuous evaluation of software updates at scale.
Summary / 总结
Software-driven O-RAN architectures enable rapid innovation through frequent, independent updates to virtualized components.
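The residuals idea in the RANalyzer abstract (model expected performance, then read deviations as software-induced effects) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the single-feature linear model of throughput against SNR, the function names, the toy numbers, and the `k`-sigma rule are not taken from the paper, which does not specify its model in the abstract.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (single feature)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx

def residuals(x, y, slope, intercept):
    """Observed minus expected performance for each test."""
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

def is_regression(baseline_res, new_res, k=3.0):
    """Flag a release whose mean residual falls k baseline std-devs below the baseline mean."""
    return mean(new_res) < mean(baseline_res) - k * stdev(baseline_res)

# Hypothetical baseline tests: channel quality (SNR, dB) vs. throughput (Mbit/s).
base_snr = [10, 15, 20, 25, 30]
base_tput = [21, 29, 41, 49, 61]
slope, intercept = fit_line(base_snr, base_tput)
base_res = residuals(base_snr, base_tput, slope, intercept)

# Hypothetical tests of two new releases under different channel conditions.
new_snr = [12, 22, 28]
slow_res = residuals(new_snr, [15, 35, 47], slope, intercept)
ok_res = residuals(new_snr, [24, 45, 56], slope, intercept)

print(is_regression(base_res, slow_res))  # release that underperforms its expected throughput
print(is_regression(base_res, ok_res))    # release consistent with the channel model
```

Comparing residuals rather than raw throughput is what keeps a test run in poor channel conditions from being mistaken for a code regression.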
Chamelio: A Fast Shared Cloud Network Stack for Isolated Tenant-Defined Protocols
Authors: Matheus Stolet, Simon Peter, Antoine Kaufmann
First: 2026-04-24T14:28:37+00:00 · Latest: 2026-04-24T14:28:37+00:00
Abstract
Conventional cloud network virtualization sends packets through multiple guest and host layers, inflating CPU cost and tail latency. Shared host datapaths collapse this layering into one optimized path across tenants, but existing shared stacks are fixed-function: tenants cannot specialize their protocols. eBPF is the natural vehicle for restoring programmability to a shared datapath, but today's extensions are hook-sized, and its verifier provides safety -- not performance isolation: one tenant's per-packet work can inflate every other tenant's tail latency.
Chamelio is a programmable shared network stack that lets tenants implement full protocols through a bounded eBPF fast path and a tenant slow path, while approaching the performance and preserving the strong isolation of fixed shared stacks. It combines three ideas: a shared-stack architecture for tenant-defined protocols; joint optimisation of tenant handlers with provider infrastructure and co-resident tenants in the shared fast path; and a bounded fast path contract with runtime cycle accounting that keeps tenant programmability compatible with strong performance isolation. A tenant programmable TCP on Chamelio reaches 9.2 Mreq/s, matching the hand-tuned TAS stack; joint compilation shrinks the programmability tax from 23.9% to 3.8%; and under a scaling TCP adversary that drives uninstrumented stacks to 154 microseconds, Chamelio bounds victim tail latency at 46 microseconds.
Summary / 总结
Conventional cloud network virtualization sends packets through multiple guest and host layers, inflating CPU cost and tail latency.
Benchmarking LLM-Driven Network Configuration Repair
Authors: Ioannis Protogeros, Rufat Asadli, Benjamin Hoffman, Laurent Vanbever
First: 2026-04-24T12:53:07+00:00 · Latest: 2026-04-24T12:53:07+00:00
Abstract
There is a rapidly growing interest in using Large Language Models (LLMs) to automate complex network operations, but their reliable adoption requires rigorous assessment of their effectiveness and safety. Existing benchmarks do not address whether LLMs can successfully resolve errors in large-scale, interdependent network configurations without introducing new disruptions. Developing such a benchmark is challenging: scenarios must be diverse and increasingly complex, yet their evaluation must be straightforward and meaningful. In this paper, we present Cornetto, the first benchmark to evaluate LLM-driven network configuration repair functionally and at scale. Cornetto features a generation pipeline that synthesizes representative and plausible misconfiguration scenarios, coupled with an evaluation framework that uses formal verification to assess functional correctness of proposed fixes against ground-truth specifications. Using this pipeline, we synthesize a dataset of 231 problems for fixing configurations across varying network topologies (20--754 nodes) and diverse protocols. We evaluate 9 state-of-the-art LLMs and find that while they show promise, they often introduce regressions and their performance degrades at scale. Our results indicate that reliable LLM-powered network automation requires integrating LLMs into iterative workflows guided by formal verification.
Summary / 总结
There is a rapidly growing interest in using Large Language Models (LLMs) to automate complex network operations, but their reliable adoption requires rigorous assessment of their effectiveness and safety.
OCC: Physical-Layer Assisted Congestion Control for Real-Time Communications
Authors: Yufan Zhuang, Zili Meng, Zehong Lin, Jun Zhang
First: 2026-04-24T09:21:29+00:00 · Latest: 2026-04-24T09:21:29+00:00
Abstract
Real-time communications (RTC) is a core technology for emerging applications in 6G, such as cloud gaming, teleoperation, and extended reality (XR), which require consistently low latency and high bitrates. Existing RTC solutions fundamentally struggle to maintain low latency while supporting high bitrates due to their reliance on trial-and-error-based mechanisms. These mechanisms fail to probe the available bandwidth (ABW) promptly and accurately, leading to a trade-off between latency reliability and bandwidth utilization. This tension becomes even more critical as cellular bandwidth and application demand now fluctuate over a wider range in cellular networks. To address this trade-off, we propose OCC, a novel approach that utilizes physical-layer information to explicitly obtain the ABW in real time, enabling rapid adaptation to dynamic wireless network conditions. However, the unique characteristics of RTC, including traffic bursts, application (APP) limits, and encoder lag, make physical-layer-informed control non-trivial. OCC effectively addresses these issues through three innovative strategies: frame-aware bandwidth measurement, APP-limit-aware bandwidth estimation, and encoder-friendly rate control. Extensive over-the-air experiments on an open-source cellular testbed demonstrate that OCC significantly enhances the performance of mobile RTC, reducing tail network latency by $13\%$ to $68\%$ and improving video frame bitrate by $1.2\times$ to $3.5\times$.
Summary / 总结
Real-time communications (RTC) is a core technology for emerging applications in 6G, such as cloud gaming, teleoperation, and extended reality (XR), which require consistently low latency and high bitrates.
SDN-SYN PoW: Adaptive Ingress-Aware Defense with Non-Interactive PoW Against Volumetric SYN Floods
Authors: Wenyang Jia, Jingjing Wang, Xianneng Zou, Kai Lei
Venue: The 10th Asia-Pacific Workshop on Networking (APNet 2026)
First: 2026-03-02T18:49:34+00:00 · Latest: 2026-04-24T06:53:24+00:00
Abstract
The stability of Internet services is persistently challenged by large volumetric TCP SYN floods, for which conventional defenses such as SYN Cookies preserve server state but still amplify bandwidth pressure. This paper presents SDN-SYN PoW, an ingress-aware defense architecture that integrates non-interactive Proof of Work with an SDN control plane for managed edge networks. The controller monitors per-ingress SYN pressure and raises PoW difficulty when flooding is detected. If traffic mainly originates from a stable source region, enforcement is refined to the offending source prefix to reduce overhead on benign co-located clients; otherwise, ingress-wide enforcement is retained under randomized or spoofed sources. We further design a conservative Difficulty Discovery Protocol that reuses TCP retransmissions and commits difficulty updates only after a successful handshake. Experiments on a custom SDN testbed show restored application QoS under concentrated and spoofed floods, 11.7% higher benign client throughput than ingress-only enforcement, and below 0.8% transient false escalations under 2% random loss.
Summary / 总结
The stability of Internet services is persistently challenged by large volumetric TCP SYN floods, for which conventional defenses such as SYN Cookies preserve server state but still amplify bandwidth pressure.
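As a rough illustration of the non-interactive Proof of Work and adaptive difficulty described in the SDN-SYN PoW abstract, the sketch below uses a generic hash-based puzzle. It is not the paper's protocol: the SHA-256 leading-zero-bits puzzle, the challenge string built from connection fields, the function names, and the rule that lowers difficulty once pressure subsides are all assumptions made for this example.

```python
import hashlib

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Accept a nonce if SHA-256(challenge || nonce) has enough leading zero bits."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force the smallest nonce that verifies."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce

def adjust_difficulty(syn_rate: float, threshold: float, current: int,
                      step: int = 2, max_bits: int = 20) -> int:
    """Controller side (assumed policy): raise difficulty under SYN pressure, relax otherwise."""
    if syn_rate > threshold:
        return min(current + step, max_bits)
    return max(current - step, 0)

# Non-interactive: the challenge is derived from values the client already knows
# (a hypothetical address/port/time-window string), so no server round trip is needed.
challenge = b"198.51.100.7:443|203.0.113.9:51512|window-42"
difficulty = adjust_difficulty(syn_rate=1000.0, threshold=500.0, current=6)
nonce = solve(challenge, difficulty)
print(difficulty, verify(challenge, nonce, difficulty))
```

Verification costs one hash while solving costs about 2^difficulty hashes on average, which is the asymmetry such a defense relies on.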
A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
Authors: Mingqi Han, Xinghua Sun
First: 2026-04-23T08:05:10+00:00 · Latest: 2026-04-23T08:05:10+00:00
Comments: 7 pages, 4 figures, conference version
Abstract
AI WiFi offload is emerging as a promising approach for providing large language model (LLM) services to resource-constrained wireless devices. However, unlike conventional edge computing, LLM inference over WiFi must jointly address heterogeneous model capabilities, wireless contention, uncertain task complexity, and semantic correlation among reasoning tasks. In this paper, we investigate LLM inference offloading in a multi-user multi-edge WiFi network, where each task can be executed locally, directly offloaded to a nearby edge access point (AP), or decomposed into multiple subtasks for collaborative execution across local and edge nodes. To this end, we propose a user-edge collaborative framework with an LLM-based planner that not only performs task decomposition but also infers subtask difficulty and expected output token length, enabling more accurate estimation of execution quality and latency on heterogeneous nodes. Based on these estimates, we further design a decomposition-aware scheduling strategy that jointly optimizes subtask assignment, execution, and aggregation under communication, queuing, and computation constraints. Simulation results show that the proposed framework achieves a better latency-accuracy tradeoff than local-only and nearest-edge baselines, reducing the average latency by $20\%$ and improving the overall reward by $80\%$. Moreover, the distilled lightweight planner approaches the performance of the large teacher model while remaining more suitable for practical edge deployment.
Summary / 总结
AI WiFi offload is emerging as a promising approach for providing large language model (LLM) services to resource-constrained wireless devices.
Modeling AI-RAN Economics: A Techno-Economic Framework
Authors: Gabriele Gemmi, Michele Polese, Tommaso Melodia
First: 2026-03-30T16:59:15+00:00 · Latest: 2026-04-22T22:04:54+00:00
Abstract
The large-scale deployment of 5G networks has not delivered the expected return on investment for mobile network operators, raising concerns about the economic viability of future 6G rollouts. At the same time, surging demand for Artificial Intelligence (AI) inference and training workloads is straining global compute capacity. AI-RAN architectures, in which Radio Access Network (RAN) platforms accelerated on Graphics Processing Units (GPUs) share idle capacity with AI workloads during off-peak periods, offer a potential path to improved capital efficiency. However, the economic case for such systems remains unsubstantiated. In this paper, we present a techno-economic analysis of AI-RAN deployments by combining publicly available benchmarks of 5G Layer-1 processing on heterogeneous platforms -- from x86 servers with accelerators for channel coding to modern GPUs -- with realistic traffic models and AI service demand profiles for Large Language Model (LLM) inference. We construct a joint cost and revenue model that quantifies the surplus compute capacity available in GPU-based RAN deployments and evaluates the returns from leasing it to AI tenants. Our results show that, across a range of scenarios encompassing token depreciation, varying demand dynamics, and diverse GPU serving densities, the additional capital and operational expenditures of GPU-heavy deployments are offset by AI-on-RAN revenue, yielding a return on investment of up to 8x. These findings strengthen the long-term economic case for accelerator-based RAN architectures and future 6G deployments.
Summary / 总结
The large-scale deployment of 5G networks has not delivered the expected return on investment for mobile network operators, raising concerns about the economic viability of future 6G rollouts.
Behavioral Consistency and Transparency Analysis on Large Language Model API Gateways
Authors: Guanjie Lin, Yinxin Wan, Shichao Pei, Ting Xu, Kuai Xu, Guoliang Xue
First: 2026-04-22T20:51:20+00:00 · Latest: 2026-04-22T20:51:20+00:00
Comments: 11 pages. Initially submitted to IMC 2026 Cycle 1 on November 20, 2025; accepted on March 13, 2026. To appear in Proceedings of the 2026 ACM Internet Measurement Conference (IMC '26)
Abstract
Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors. However, the internal routing, caching, and billing policies of these gateways are largely undisclosed, leaving users with limited visibility into whether requests are served by the advertised models, whether responses remain faithful to upstream APIs, or whether invoices accurately reflect public pricing policies. To address this gap, we introduce GateScope, a lightweight black-box measurement framework for evaluating behavioral consistency and operational transparency in commercial LLM gateways. GateScope is designed to detect key misbehaviors, including model downgrading or switching, silent truncation, billing inaccuracies, and instability in latency by auditing gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Our measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.
Summary / 总结
Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors.
Revisiting and Expanding the IPv6 Network Periphery: Global-Scale Measurement and Security Analysis
Authors: Zixuan Xie, Zitao Yang, Shurui Fang, Zhaoyang Li, Wenxing Xie, Nannan Fu, Liangyu Dong, Xiang Li
First: 2026-04-21T14:09:59+00:00 · Latest: 2026-04-22T02:16:59+00:00
Comments: 15 pages, 7 figures, 9 tables. Submitted to IEEE Transactions on Dependable and Secure Computing
Abstract
As IPv6 deployment accelerates, understanding the evolving security posture of network peripheries becomes increasingly important. A DSN 2021 study introduced the first large-scale discovery of IPv6 network peripheries, uncovering risks like service exposure and routing loops. However, its scope was limited to three regions and is now outdated. In this paper, we revisit and significantly expand upon that work, presenting a comprehensive, up-to-date security assessment of IPv6 network peripheries. To support efficient large-scale scanning, we propose a novel Response-Guided Prefix Selection (RGPS) strategy to identify high-value IPv6 prefixes for probing. Our global-scale measurement covers 73 countries/regions and identifies over 281.9M active IPv6 network peripheries, including a 371.2% increase (245M) over the 52M reported in 2021 for India, China, and America. Our service exposure analysis shows that 2.5% of reachable services are still dangerously exposed, including outdated administrative interfaces and misconfigured servers, while correlation with known CVEs reveals recurring software vulnerabilities. Building on this service-exposure perspective, we further design a Hierarchical LLM Exposure Verification (HLEV) framework to identify unauthorized-access risks in exposed LLM deployment tools, revealing multiple security weaknesses caused by insecure default configurations and missing authentication. Additionally, we revisit routing loop vulnerabilities and identify 4.5M loop-prone responses, confirming that flawed routing behaviors remain widespread across vendors and countries/regions. These findings suggest that while IPv6 adoption has surged, key security challenges persist and are structurally embedded.
Summary / 总结
As IPv6 deployment accelerates, understanding the evolving security posture of network peripheries becomes increasingly important.
Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey
Authors: Yasmin Moslem, John D. Kelleher
First: 2026-02-23T21:57:27+00:00 · Latest: 2026-04-21T10:38:10+00:00
Comments: Work funded by ADAPT Centre, Trinity College Dublin, and Huawei Ireland
Abstract
The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time. While smaller models suffice for routine queries, complex tasks demand more capable models. However, static model deployment does not account for the complexity and domain of incoming queries, leading to suboptimal performance and increased costs. Dynamic routing systems that adaptively select models based on query characteristics have emerged as a solution to this challenge.
We provide a systematic analysis of state-of-the-art multi-LLM routing and cascading approaches. In contrast to mixture-of-experts architectures, which route within a single model, we study routing across multiple independently trained LLMs. We cover diverse routing paradigms, including query difficulty, human preferences, clustering, uncertainty quantification, reinforcement learning, multimodality, and cascading. For each paradigm, we analyze representative methods and examine key trade-offs. Beyond taxonomy, we introduce a conceptual framework that characterizes routing systems along three dimensions: when decisions are made, what information is used, and how they are computed. This perspective highlights that practical systems are often compositional, integrating multiple paradigms under operational constraints.
Our analysis demonstrates that effective multi-LLM routing requires balancing competing objectives. Choosing the optimal routing strategy depends on deployment and computational constraints. Well-designed routing systems can outperform even the most powerful individual models by strategically leveraging specialized capabilities across models while maximizing efficiency gains. Meanwhile, open challenges remain in developing routing mechanisms that generalize across diverse architectures, modalities, and applications.
Summary / 总结
The rapid growth of large language models (LLMs) with diverse capabilities, costs, and domains has created a critical need for intelligent model selection at inference time.
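The cascading paradigm mentioned in the routing survey abstract can be sketched in a few lines: try models from cheapest to most capable and stop at the first answer that looks confident enough. The stand-in models, the confidence scores, and the 0.8 threshold below are hypothetical and do not correspond to any specific method covered by the survey.

```python
from typing import Callable, List, Tuple

# A model maps a query to (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

def cascade(query: str, models: List[Tuple[str, Model]],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (model name, answer) from the first model whose confidence meets the threshold.

    Models are assumed to be ordered from cheapest to most capable; if none is
    confident, the last model's answer is returned.
    """
    name, answer = "", ""
    for name, model in models:
        answer, confidence = model(query)
        if confidence >= threshold:
            break
    return name, answer

# Toy stand-ins: the small model is only confident on short queries.
def small_model(query: str) -> Tuple[str, float]:
    confidence = 0.9 if len(query.split()) <= 5 else 0.3
    return "small-answer", confidence

def large_model(query: str) -> Tuple[str, float]:
    return "large-answer", 0.95

models = [("small", small_model), ("large", large_model)]
print(cascade("What is 2 + 2?", models))
print(cascade("Explain the trade-offs between routing and cascading in multi-LLM systems", models))
```

A router would instead pick one model up front from query features; a cascade pays for the cheap attempt first and escalates only when needed.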
End-to-End Performance of Video Streaming With MPEG-DASH Over Satellite 5G IAB Networks
Authors: Muhammad Adeel Zahid, Ekram Hossain, Peng Hu
First: 2026-04-17T18:45:22+00:00 · Latest: 2026-04-17T18:45:22+00:00
Abstract
We present an end-to-end performance evaluation of MPEG-DASH video streaming over a Low-Earth Orbit (LEO) satellite-based 5G Integrated Access and Backhaul (IAB) network. Our objective is to investigate how modern transport protocols and congestion control algorithms affect adaptive video delivery in an integrated satellite-terrestrial network (ISTN), where latency, throughput variation, and playback continuity jointly shape the user Quality-of-Experience (QoE). We implement a simulation framework in ns-3 by adapting open-source modules for the 5G radio access network, LEOS backhaul, transport layer protocols, and MPEG-DASH application behavior. Within this framework, TCP and QUIC are evaluated with multiple congestion control algorithms, including CUBIC, NewReno, and BBR. Performance is assessed using application-level and transport-level metrics, including playback duration, interruption duration, stall count, playback bitrate, throughput, latency, and fairness. The results show that no single configuration is uniformly optimal across all metrics. However, clear tradeoffs are observed among throughput, latency, playback continuity, and fairness. In particular, QUIC-BBR provides the most balanced overall behavior from a streaming QoE perspective, combining adequate playback duration with fewer interruptions and substantially lower latency than other alternatives. These findings highlight the importance of jointly considering transport design and congestion control when evaluating adaptive video streaming over ISTNs.
Summary / 总结
We present an end-to-end performance evaluation of MPEG-DASH video streaming over a Low-Earth Orbit (LEO) satellite-based 5G Integrated Access and Backhaul (IAB) network.
MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications
Authors: Anshul Kumar, Gagan Raj Gupta, Manish Rai, Apu Chakraborty, Ashutosh Modi, Abdelaali Chaoub, Soumajit Pramanik, Moyank Giri, Yashwanth Holla, Sunny Kumar, M. V. Kiran Sooraj
First: 2025-11-17T08:34:41+00:00 · Latest: 2026-04-17T03:37:10+00:00
Abstract
Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks. In telecommunications, they hold the potential to transform network optimization, automate troubleshooting, enhance customer support, and ensure regulatory compliance. However, their deployment in telecom is hindered by domain-specific challenges that demand specialized adaptation. To overcome these challenges and to accelerate the adaptation of LLMs for telecom, we propose MM-Telco, a comprehensive suite of multimodal benchmarks and models tailored for the telecom domain. The benchmark introduces a range of tasks (both text-based and image-based) that address practical real-life use cases such as network operations, network management, improving documentation quality, and retrieval of relevant text and images. Further, we perform baseline experiments with various LLMs and VLMs. The models fine-tuned on our dataset exhibit a significant boost in performance. Our experiments also help analyze the weak areas of current state-of-the-art multimodal LLMs, thus guiding further development and research.
Summary / 总结
Large Language Models (LLMs) have emerged as powerful tools for automating complex reasoning and decision-making tasks.
SCENIC: Stream Computation-Enhanced SmartNIC
Authors: Benjamin Ramhorst, Maximilian Jakob Heer, Luhao Liu, Heejae Kim, Jonas Dann, Jin-Soo Kim, Gustavo Alonso
First: 2026-04-16T15:13:48+00:00 · Latest: 2026-04-16T15:13:48+00:00
Abstract
Although modern, AI-centric datacenters heavily rely on SmartNICs, existing devices impose a hard trade-off. Commercial SmartNICs provide high bandwidth and easy software integration, but offer limited support for customization and data processing offload. In contrast, research SmartNICs often suffer from low bandwidth, limited functionality, and poor software compatibility -- to the point that many are not actual NICs in a technical sense. This gap can be closed by treating the NIC datapath as a first-class stream computation substrate with shared hardware/software abstractions for a tight co-design of infrastructure and applications. To demonstrate this, we introduce SCENIC, an open-source datacenter SmartNIC. SCENIC implements a 200G network datapath over offloaded TCP/IP and RDMA stacks, as well as a fallback path for processing arbitrary network traffic. On top of the network logic, SCENIC combines on-datapath Stream Compute Units (SCUs) for data processing and embedded ARM cores for flexible control path manipulation with direct access to GPUs and SSDs. SCENIC is fully integrated with the OS, exposing native Linux network and RDMA verb interfaces, making the programmable datapath transparent to existing applications while enabling control of, e.g., user-defined offloads and programmable congestion control. SCENIC's performance matches commercial platforms, and we show its versatility through several use cases such as offloaded collective communication and network-to-GPU hash-based data partitioning.
Summary / 总结
Although modern, AI-centric datacenters heavily rely on SmartNICs, existing devices impose a hard trade-off.
Tail Contagion: Sub-microsecond Time Protection in Shared Software Network Datapaths
Authors: Matheus Stolet, Liam Arzola, Simon Peter, Antoine Kaufmann
First: 2023-09-25T10:29:06+00:00 · Latest: 2026-04-16T09:53:27+00:00
Comments: Under submission for conference peer review
Abstract
Shared software datapaths underpin modern datacentre networking. They implement mechanisms such as virtual switching, network virtualisation tunneling, or reliable transport, and enforce policies, such as tenant rate limits, virtual network isolation, or congestion control. However, because multiple applications, containers, or VMs share them, often across tenants, they pose a tail latency isolation challenge. Current isolation approaches either sacrifice efficiency via coarse-grained core partitioning or provide weak tail latency isolation when sharing cores with basic rate limits.
This paper presents Virtuoso, a time protection mechanism for shared software datapaths that provides strong cross-tenant tail latency isolation while preserving low overhead and microsecond-scale latency. Our key insight is that tail latency is fundamentally a time metric, so byte or packet throughput is the wrong metric for controlling interference when packet processing costs vary. Our design instead enforces isolation through per-tenant CPU-time budgets at datapath intervention points within run-to-completion loops, without relying on preemption. In a case study, we instantiate Virtuoso in the TAS TCP stack and demonstrate a 7.8X reduction in victim tail latency under adversarial interference while keeping throughput within 5% of unmodified TAS. We also observe a 3X per-core efficiency improvement compared to siloed datapaths under bursty workloads.
Summary / 总结
Shared software datapaths underpin modern datacentre networking.
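The Virtuoso abstract describes isolation through per-tenant CPU-time budgets checked at intervention points inside a run-to-completion loop, without preemption. The toy simulation below illustrates that idea only; the class and function names, the per-period budget reset, and the use of fixed per-packet costs in place of measured CPU time are assumptions for this sketch and not the authors' implementation.

```python
from collections import deque

class Tenant:
    def __init__(self, name, budget_us, packet_costs_us):
        self.name = name
        self.budget_us = budget_us            # CPU time granted per accounting period
        self.used_us = 0.0                    # CPU time consumed in the current period
        self.queue = deque(packet_costs_us)   # pending packets, each given as a cost in microseconds

def run_period(tenants):
    """Simulate one accounting period; return the number of packets processed per tenant."""
    processed = {t.name: 0 for t in tenants}
    for t in tenants:
        t.used_us = 0.0                       # budgets are replenished each period
    progress = True
    while progress:
        progress = False
        for t in tenants:
            # Intervention point: the budget is checked before a packet starts.
            # A packet that has started always runs to completion (no preemption).
            if t.queue and t.used_us < t.budget_us:
                t.used_us += t.queue.popleft()
                processed[t.name] += 1
                progress = True
    return processed

victim = Tenant("victim", budget_us=10, packet_costs_us=[1] * 10)
adversary = Tenant("adversary", budget_us=10, packet_costs_us=[5] * 100)
print(run_period([victim, adversary]))
```

Because the budget is expressed in time rather than packets or bytes, a tenant sending expensive packets exhausts its share after few packets and cannot stretch the loop for co-resident tenants.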
Switching Efficiency: A Novel Framework for Dissecting AI Data Center Network Efficiency
Authors: Niangen Ye, Jiawen Zhu, Baojun Chen, Dong Wang, Jiang Sun, Weiqiang Sun, Weisheng Hu
First: 2026-04-16T06:49:12+00:00 · Latest: 2026-04-16T06:49:12+00:00
Abstract
Communication is pivotal in LLM training, and a thorough analysis of the communication efficiency of AI data center (AIDC) network is essential for guiding the design of these capital-intensive clusters. However, conventional metrics are inadequate for such analysis, as they do not directly link network activity to computational progress and lack the granularity to diagnose the impact of different network design patterns. To address this, we introduce a metric framework, the Switching Efficiency Framework, whose core metric - Switching Efficiency ($η$) - quantifies computationally effective data throughput per unit switching capacity. We further decompose $η$ into three factors - Data, Routing Efficiency, and Port Utilization - to facilitate analysis of distinct communication bottlenecks.
Using this metric framework, we demonstrate how the symmetric, distributed switching of 3D-Torus and the centralized, hierarchical switching of Rail-Optimized architecture align with sparse or imbalanced LLM training traffic, and show that All-to-All traffic from Mixture-of-Experts models severely degrades their port utilization and routing efficiency. Our analysis also demonstrates how key design choices - such as adjusting switching resource allocation, expanding server size, adopting in-network computing, and multi-plane design - positively influence distinct facets of communication efficiency. Ultimately, the Switching Efficiency Framework provides an analytical tool for analyzing efficiency bottlenecks, thereby informing the design of future-generation AIDC networks.
Summary / 总结
Communication is pivotal in LLM training, and a thorough analysis of the communication efficiency of AI data center (AIDC) network is essential for guiding the design of these capital-intensive clusters.
SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment
Authors: Fumikazu Konishi, Yuuki Tsubouchi, Hirofumi Tsuruta
First: 2026-04-15T08:09:15+00:00 · Latest: 2026-04-16T06:22:54+00:00
Comments: Accepted at MLSys 2026
Abstract
SAKURAONE is a managed high performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the KOKARYOKU PHY bare metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. In the ISC 2025 TOP500 list, SAKURAONE is ranked 49th by HPL and is the only top 100 system that uses a fully open networking stack - 800 GbE with SONiC - demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95 PFLOP/s (HPL Rmax), 396.295 TFLOP/s (HPCG), and 339.86 PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs, and a 2 PB all-flash Lustre file system, interconnected via a rail-optimized 800 GbE leaf-spine fabric with RoCEv2. Through exclusive use by a single research project, we observed the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most GPU resource time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.
Summary / 总结
SAKURAONE is a managed high performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center.