publications | Siyuan Liang

2025

Truncated Recurrent Transformer: Unlocking Length Extrapolation via TBPTT

Siyuan Liang

arXiv preprint, 2025

Abs Bib PDF Code

The Transformer architecture, especially in decoder-only form for autoregressive language modeling, has achieved remarkable success in natural language processing due to its powerful parallel capabilities and attention mechanism. However, the standard Transformer’s attention mechanism is stateless, which poses dual challenges of O(N^2) computational complexity and inference memory consumption when processing long texts. Existing solutions such as Transformer-XL introduce segment-level recurrence, but their state transfer is limited to direct copying of the KV cache, lacking deep state evolution, which restricts their theoretical receptive field to N x L. In this paper, we propose a Truncated Recurrent Transformer architecture. This architecture transforms the Transformer from a stateless parallel computer into a stateful sequence model by introducing an explicit Recurrent State. Our core innovations lie in: (1) Explicit State Evolution: introducing a non-linear projection (Projection FFN) between blocks, enabling memory to ’think’ and ’compress’ in the temporal dimension, rather than just being passively stored; and (2) Stateful Segment Training: combining the classic truncated backpropagation through time (TBPTT) idea from RNNs, maintaining state transfer across batches during training, enabling the model to learn true long-distance dependencies. Experimental results show that the model, trained on a length of only 256, can generalize to sequences of 4096 or even longer, with loss continuously decreasing as length increases (Train Short, Test Long), demonstrating effective length extrapolation.
@article{liang2025recurrent, title = {Truncated Recurrent Transformer: Unlocking Length Extrapolation via TBPTT}, author = {Liang, Siyuan}, year = {2025}, journal = {arXiv preprint}, }
NeurIPS

LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding

Shen Zhang, Siyuan Liang, Yaning Tan, and 9 more authors

In Advances in Neural Information Processing Systems, 2025

Abs HTML Code Poster Website

Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolutions. The primary obstacle is that the explicit positional encodings(PE), such as RoPE, need extrapolating to unseen positions which degrades performance when the inference resolution differs from training. In this paper, We propose a Length-Extrapolatable Diffusion Transformer (LEDiT) to overcome this limitation. LEDiT needs no explicit PEs, thereby avoiding PE extrapolation. The key innovation of LEDiT lies in the use of causal attention. We demonstrate that causal attention can implicitly encode global positional information and show that such information facilitates extrapolation. We further introduce a locality enhancement module, which captures fine-grained local information to complement the global coarse-grained position information encoded by causal attention. Experimental results on both conditional and text-to-image generation tasks demonstrate that LEDiT supports up to 4x resolution scaling (e.g., from 256x256 to 512x512), achieving better image quality compared to the state-of-the-art length extrapolation methods. We believe that LEDiT marks a departure from the standard RoPE-based methods and offers a promising insight into length extrapolation.

2022

ECCV Workshop

SimpleDG: Simple Domain Generalization Baseline Without Bells and Whistles

Zhi Lv, Bo Lin, Siyuan Liang, and 4 more authors

In European Conference on Computer Vision, 2022

HTML PDF Code

2021

IEEE Sensors

An End-to-End Anti-Jamming Target Detection Method Based on CNN

Yu Zhang, Bo Jiu, Penghui Wang, and 2 more authors

IEEE Sensors Journal, 2021

DOI HTML PDF

2019

IET JoE

Waveform design for cognitive radar in presence of jammer using Stackelberg game

Kang Li, Bo Jiu, Hongwei Liu, and 1 more author

The Journal of Engineering, 2019

DOI HTML