Siyuan Liang

Long-context modeling & recurrent architectures


Siyuan Liang (梁思远) works on long-context modeling and recurrent architectures for sequence models. He proposed the Truncated Recurrent Transformer, which achieves strong length extrapolation in train-short, test-long settings.
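For readers unfamiliar with TBPTT (truncated backpropagation through time), the training scheme named in the paper title below, here is a minimal PyTorch-style sketch: a long token sequence is processed in fixed-size chunks, and the recurrent state is detached between chunks so gradients flow only within each truncation window. The model(inputs, state) API, chunk length, and all names are illustrative assumptions, not code from llama2RNN.c or the paper.

import torch
import torch.nn as nn

def train_tbptt(model: nn.Module, optimizer, tokens: torch.Tensor, chunk_len: int = 128):
    """One TBPTT pass over a (batch, seq_len) tensor of token ids.

    Hypothetical sketch: assumes `model(inputs, state)` returns
    (logits, new_state); this API is illustrative, not from llama2RNN.c.
    """
    state = None  # recurrent state carried across truncation windows
    for start in range(0, tokens.size(1) - 1, chunk_len):
        end = min(start + chunk_len, tokens.size(1) - 1)
        inputs = tokens[:, start:end]            # current window
        targets = tokens[:, start + 1:end + 1]   # next-token targets
        logits, state = model(inputs, state)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
        optimizer.zero_grad()
        loss.backward()  # gradients stop at the window boundary
        optimizer.step()
        # Detach so the next window's backward pass cannot reach into this
        # one; the state's *values* still carry long-range context forward.
        if state is not None:
            state = (tuple(s.detach() for s in state)
                     if isinstance(state, tuple) else state.detach())

The key point for length extrapolation is that training cost is bounded by chunk_len rather than full sequence length, while the carried state lets the model see contexts far longer than any single window.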

Previously, he was an algorithm researcher at Megvii in Beijing, where he delivered production algorithms for fingerprint and face liveness detection, display demura, and XR hand tracking.

He received his M.S. in Electronic and Communication Engineering from Xidian University, where his research centered on deep-learning-based radar anti-jamming target detection and intelligent electromagnetic games.


llama2RNN.c — Truncated Recurrent Transformer implementations in C

LEDiT — PyTorch Implementation, NeurIPS 2025

SimpleDG — Training and test code for the NICO challenge at the ECCV 2022 workshop


selected publications

  1. Truncated Recurrent Transformer: Unlocking Length Extrapolation via TBPTT
     Siyuan Liang
     arXiv preprint, 2025
  2. LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding
     Shen Zhang, Siyuan Liang, Yaning Tan, and 9 more authors
     In Advances in Neural Information Processing Systems, 2025
  3. An End-to-End Anti-Jamming Target Detection Method Based on CNN
     Yu Zhang, Bo Jiu, Penghui Wang, and 2 more authors
     IEEE Sensors Journal, 2021