<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://siyuanseever.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://siyuanseever.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-05T12:39:03+00:00</updated><id>https://siyuanseever.github.io/feed.xml</id><title type="html">blank</title><subtitle>Researcher focused on algorithms and engineering. </subtitle><entry><title type="html">World Models (II): Intelligent Electromagnetic Game</title><link href="https://siyuanseever.github.io/blog/2019/intelligent-radar-en/" rel="alternate" type="text/html" title="World Models (II): Intelligent Electromagnetic Game"/><published>2019-06-01T12:00:00+00:00</published><updated>2019-06-01T12:00:00+00:00</updated><id>https://siyuanseever.github.io/blog/2019/intelligent-radar-en</id><content type="html" xml:base="https://siyuanseever.github.io/blog/2019/intelligent-radar-en/"><![CDATA[<h2 id="preface">Preface</h2> <p>%% The following is the abridged content of my master’s thesis in 2019 %%</p> <p>My master’s thesis mainly introduced and studied the radar anti-jamming detection network in detail. We solved the generalization problem of the detection network under arbitrary transmit waveforms.</p> <p><a href="https://www.doc88.com/p-33371846067474.html">Research on Radar Anti-jamming Method Based on Deep Learning</a></p> <p>Here, we will address another problem of the anti-jamming detection network: generalization to jamming forms. To enable the network to generalize to as many jamming forms as possible, we must provide samples of sufficiently diverse jamming forms. This is something manual design cannot satisfy. We will use a jamming network to generate jamming signals, so the first step is to solve the construction and training of the jamming generation network. Afterward, this chapter will present conjectures and related experiments on other parts of the intelligent electromagnetic game, such as transmit waveforms, memory, and the dynamic game between radar and jamming.</p> <h2 id="joint-optimization-of-jamming-detection-and-generation">Joint Optimization of Jamming, Detection, and Generation</h2> <h3 id="detection-denoising-and-recovery-network-for-radar-signals-received-by-jammer">Detection, Denoising, and Recovery Network for Radar Signals Received by Jammer</h3> <p>When the jammer receives a radar signal, it has two main tasks: detection of the radar signal and waveform recovery. The performance of its detection network and traditional detection theory can be found in reference [37]. We (in the thesis above) verified that the network performance approaches the theoretical value of optimal detection. Below, we mainly introduce the recovery network of the jammer for the received radar signal. 
We establish the following cost function, where the echo is:</p> \[X=S+N\] <p>The mean square error loss between the jammer-generated signal and the original radar signal is:</p> \[L_{MSE}=|G(X)-S|^2\] <p>The pulse compression loss is defined as:</p> \[L_{PC}=1-\frac{S^H G(X)}{|S|\;|G(X)|}\] <p>The Generative Adversarial Network (GAN) loss is:</p> \[L_{GAN}=\log(1-D(S|X))+\log(D(G(X)|X))\] <p>The final optimization function for training the jamming network is:</p> \[\min_G \max_D C_{MSE} L_{MSE}+C_{PC} L_{PC}+C_{GAN} L_{GAN}\] <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/jammer_gen_net-480.webp 480w,/assets/img/intelligent_radar/jammer_gen_net-800.webp 800w,/assets/img/intelligent_radar/jammer_gen_net-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/jammer_gen_net.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Jamming Generation Network </div> <p>Where $X$ is the noisy radar signal received by the jammer, $S$ represents the radar’s transmit signal, $N$ is Gaussian white noise, and $Y$ is the jamming signal recovered and transmitted by the jammer. Our goal is to denoise the received noisy signal. $G$ is the jammer’s waveform recovery network for the radar signal. This network is similar to the one above; it is also a generative network, meaning its output is a tensor containing structural information, and it can utilize the same network structure as the radar detection network mentioned earlier. $L_{MSE}$ is the Mean Square Error loss, measuring the distance between the recovered signal and the original signal in Euclidean space; $L_{PC}$ is the Pulse Compression loss, measuring the peak loss of the recovered jamming signal after pulse compression (i.e., the difference between the projection length of the recovered signal on the original signal and the unit length). When the radar uses traditional pulse compression processing to detect targets, the jamming recovery network should be optimized against this loss function; $L_{GAN}$ is the GAN loss, measuring the degree of confusion of the given discriminator network between the real signal and the generated signal. When the radar side uses a deep network to detect targets, the jamming recovery network should be optimized against this loss function. $c_{MSE}$, $c_{PC}$, and $c_{GAN}$ are the proportional coefficients of the three loss functions, which can be determined according to the specific situation.</p> <p>We define the waveform similarity as:</p> \[similarity(G(X))=1-L_{PC}=\frac{S^H G(X)}{|S|\;|G(X)|}\] <p>The figure below shows the signal recovery effect of the jamming recovery network for different types of transmit waveforms. From left to right in the figure are the waveform recovery effects for phase-coded, third-order frequency modulation, and linear frequency modulation signals. The upper part shows examples of waveform recovery, where the blue line is the noisy signal received by the jammer, and the orange line is the denoised signal generated by the jammer. 
The lower part shows the improvement effect of waveform denoising similarity with the signal-to-noise ratio (SNR), where the blue line is the similarity between the noisy signal received by the jammer and the radar transmit waveform, and the orange line is the similarity of the recovered signal. As can be seen from the figure, the recovery effect of the denoising network varies for different types of waveforms. The more complex the waveform, the worse the denoising improvement effect, which is consistent with reality.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E7%9A%84%E6%B5%8B%E8%AF%95%E6%A0%B7%E4%BE%8B-480.webp 480w,/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E7%9A%84%E6%B5%8B%E8%AF%95%E6%A0%B7%E4%BE%8B-800.webp 800w,/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E7%9A%84%E6%B5%8B%E8%AF%95%E6%A0%B7%E4%BE%8B-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E7%9A%84%E6%B5%8B%E8%AF%95%E6%A0%B7%E4%BE%8B.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Test Examples of Jamming Recovery Network </div> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BF%A1%E5%8F%B7%E7%9A%84%E6%94%B9%E5%96%84%E6%80%A7%E8%83%BD-480.webp 480w,/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BF%A1%E5%8F%B7%E7%9A%84%E6%94%B9%E5%96%84%E6%80%A7%E8%83%BD-800.webp 800w,/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BF%A1%E5%8F%B7%E7%9A%84%E6%94%B9%E5%96%84%E6%80%A7%E8%83%BD-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BF%A1%E5%8F%B7%E7%9A%84%E6%94%B9%E5%96%84%E6%80%A7%E8%83%BD.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Improvement Performance of Jamming Recovery Network on Different Signals </div> <h3 id="jamming-generation-network-targeting-radar-detection-network">Jamming Generation Network Targeting Radar Detection Network</h3> <p>In the radar anti-jamming target detection method mentioned above (referring to the master’s thesis), we established the following target detection cross-entropy loss:</p> \[L(P(Y|X,S),D(X,S))=-P(Y|X,S) \log(D(X,S))-(1-P(Y|X,S))\log(1-D(X,S))\] <p>We optimize the detection network by minimizing the loss function:</p> \[\min_D E_X [L(P(Y|X,S),D(X,S))]\] <p>Here, the jamming form is given. Is it possible to solve for a jamming form that maximizes the jamming effect? 
We know that the radar echo contains the target signal, jamming signal, and noise:</p> \[X=T+J+N\] <p>Where $T=HS$ is the target echo, $H$ represents the target’s response mode, $S$ is the radar’s transmit waveform, $N$ is the noise, and $J$ is the jamming signal produced by the jammer through the generation network, where:</p> \[J=G(\Gamma S)\] <p>Where $\Gamma$ is the jammer’s sampling method of the radar transmit waveform, and $G$ is the jamming generation network. For the specific structure of this network, please refer to the target detection network in this article (master’s thesis), except that the output is no longer the probability of targets at each range cell, but the jamming signal. Here we only need to focus on an end-to-end network model. Then we can solve it like this:</p> \[\min_D \max_G E_X[L(P(Y|X,S),D(X,S))]\] \[X=HS+G(\Gamma S)+N\] <p>When the entire process from radar transmit waveform to jamming, target echo generation, and then to radar anti-jamming detection can be expressed in the above <strong>differentiable</strong> form, we can alternately optimize the radar detection network $D$ and the jamming generation network $G$. While continuously improving the jamming capability of the jamming generation network, it will also continuously improve the capability of radar anti-jamming detection. Among them, when training the detection network $D$, we use all forms of jamming generated by the continuously updated jamming generation network $G$, so the finally obtained detection network will inevitably be robust to various forms of jamming. That is to say, we have obtained a target detection network that can generalize to arbitrary forms of jamming.</p> <h3 id="end-to-end-anti-jamming-detection-transmit-waveform-optimization">End-to-End Anti-Jamming Detection Transmit Waveform Optimization</h3> <p>After obtaining the optimized radar detection network $D$ and jamming generation network $G$, we essentially have a detection network $D$ that is optimal for arbitrary transmit waveforms and arbitrary forms of jamming, and a jamming generation network $G$ that is optimal for arbitrary transmit waveforms. At this point, we can solve for the optimal transmit waveform:</p> \[\min_S E_X [L(P(Y|X,S),D(X,S))]\] \[X=HS+G(\Gamma S)+N\] <p>By directly maximizing the detection result to optimize the transmit waveform, we obtain a transmit waveform that possesses both low sidelobes and anti-jamming capabilities. When facing the optimal anti-jamming detection network and the optimal jamming generation network, it can achieve the best detection effect, truly achieving end-to-end model optimization. As shown in the schematic diagram below, backpropagation of gradients will simultaneously optimize two performances of the transmit waveform: target detection performance (i.e., low sidelobe requirements of autocorrelation, etc.) and anti-jamming performance. This idea is reflected in Appendix A. 
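</p> <p>To make the alternating optimization above concrete, the following is a minimal, illustrative PyTorch-style sketch. The module names $D$, $G$ and $H$, the omission of the sampling operator $\Gamma$, the Gaussian noise level, and the assumption that the detector ends in a sigmoid are all assumptions made for the sketch, not the implementation used in the thesis:</p> <pre><code class="language-python">import torch
import torch.nn.functional as F

# Illustrative sketch only. Assumed interfaces: D(X, S) is the detection network
# (sigmoid output, one probability per range cell), G(S) is the jamming generation
# network, H(S) is the target echo operator; the sampling operator Gamma is omitted.

def make_echo(S, H, G, noise_std=0.1):
    """Differentiable echo model: X = H S + G(S) + N."""
    return H(S) + G(S) + noise_std * torch.randn_like(S)

def adversarial_step(D, G, opt_D, opt_G, S, H, labels):
    # (1) Detector step: minimize the detection cross-entropy on jammed echoes.
    X = make_echo(S, H, G).detach()   # hold the jammer fixed for this step
    loss_D = F.binary_cross_entropy(D(X, S), labels)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # (2) Jammer step: maximize the same detection loss (minimize its negative).
    X = make_echo(S, H, G)            # keep gradients flowing into G through the echo
    loss_G = -F.binary_cross_entropy(D(X, S), labels)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
</code></pre> <p>The waveform optimization step $\min_S$ can be sketched in the same way: treat $S$ as a learnable tensor (for example via <code>S.requires_grad_(True)</code>), freeze $D$ and $G$, and descend the detection loss with respect to $S$.</p> <p>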
Compared with traditional manual design of transmit waveforms, end-to-end network optimization achieves automation and closed-loop feedback.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/jammer_net-480.webp 480w,/assets/img/intelligent_radar/jammer_net-800.webp 800w,/assets/img/intelligent_radar/jammer_net-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/jammer_net.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> End-to-End Anti-Jamming Detection Network </div> <h2 id="superposition-of-long-term-memory-evaluation-and-strategy">Superposition of Long-Term Memory, Evaluation, and Strategy</h2> <h3 id="multi-pulse-joint-anti-jamming-detection-network">Multi-Pulse Joint Anti-Jamming Detection Network</h3> <p>The radar detections discussed so far were all single-pulse detections, but in many cases targets can only be detected with multiple pulses, such as moving targets in static clutter environments. At this point, the problem of multi-pulse joint detection emerges.</p> <p>Let the radar received data at the current moment and several adjacent previous pulses be the observation information at the current moment:</p> \[o_t=[X_{t-T},…,X_t ]\] <p>Establish a multi-pulse joint anti-jamming detection network $D(o_t)$ and optimize it by minimizing the detection error:</p> \[\min_D E_{o_t} [L(P(Y_t|o_t ),D(o_t ))]\] <p>Through the above equation, the multi-pulse joint anti-jamming detection network can be optimized. It should be noted that the input information of the multi-pulse joint anti-jamming detection network, in addition to the observation information mentioned above, naturally also requires knowledge of the radar transmit waveform (this is the same as single-pulse detection above). Since the radar transmit waveform is knowledge easily obtained by the radar detection network (for a radar that transmits and receives simultaneously), it is omitted here for convenience of expression.</p> <p>Regarding the structure of the multi-pulse joint anti-jamming network, we have the following thoughts. On each single pulse echo we use the same convolutional network form as the previous single-pulse detection. At the same time, to carry the environmental state information extracted from previous pulses, we add a Long Short-Term Memory (LSTM) structure to each layer of the convolutional network. Each layer combines the current pulse with the information from previous pulses, extracts feature information through convolution as the input for the next layer, and outputs the environmental state information at the current moment for detection at the next moment. The recurrent convolutional network (Conv-LSTM) is exactly such a structure that adds LSTM to a convolutional network. The detection network can then be expressed as:</p> \[{detect}_t,\;{state}_t=D(X_t,{state}_{t-1})\] <p>Where $detect$ is the detection result output by the network, and $state$ is the environmental state information extracted by the network, which can also be called the memory information of the detection network.</p> <p>Of course, this is one possible method. 
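</p> <p>As a minimal illustration of such a recurrent detection step, the sketch below implements ${detect}_t,{state}_t=D(X_t,{state}_{t-1})$ with ConvLSTM-style gating; the layer sizes, the 1-D convolutional form, and all names are assumptions for illustration, not the architecture used in the thesis:</p> <pre><code class="language-python">import torch
import torch.nn as nn

class RecurrentDetector(nn.Module):
    """Sketch of detect_t, state_t = D(X_t, state_{t-1}) with ConvLSTM-style gating."""

    def __init__(self, channels=16, kernel=7):
        super().__init__()
        # One 1-D convolution produces the four LSTM gates from [current pulse, hidden state].
        self.gates = nn.Conv1d(1 + channels, 4 * channels, kernel, padding=kernel // 2)
        self.head = nn.Conv1d(channels, 1, 1)  # per-range-cell detection probability

    def forward(self, x_t, state):
        h, c = state  # memory carried over from the previous pulses
        i, f, o, g = torch.chunk(self.gates(torch.cat([x_t, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        detect_t = torch.sigmoid(self.head(h))
        return detect_t, (h, c)
</code></pre> <p>The state can be initialized with zero tensors of shape (batch, channels, number of range cells), and the same cell is then applied pulse after pulse.</p> <p>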
The advantage of this method is that for each moment, only the raw pulse information of the current moment needs to be recalculated, while the pulse information of previous moments has been preserved as environmental state information through the processing of the network at the previous moment. There is no need to process all echo information of previous moments one by one, which can simplify the calculation process. But conversely, compressing echo information from multiple moments into a single environmental state information is inevitably not the most direct processing method. For the above method to achieve a good detection result, the following formula must hold approximately:</p> \[P(Y_t |X_t,X_{t-1},…,X_{t-T} )=P(Y_t |X_t,{state}_{t-1} )\] <p>That is to say, the information relevant to the current target in multiple previous pulses can be fully represented by one environmental state information.</p> <p>In addition, a more direct multi-pulse joint detection method is to perform 2D convolution (or even 3D convolution, if the sliding window matching method in Chapter 3 of this article (master’s thesis) is used to adapt to variable transmit waveforms) directly on multiple pulses and multiple range cells. However, the computational pressure and even model complexity pressure brought by doing so need to be carefully considered. Meanwhile, using convolution in the multi-pulse dimension is the same as traditional coherent pulse integration, requiring the number of coherent pulses to be set artificially. Remaining non-coherent pulse information will be lost, whereas LSTM can use long-term memory to preserve useful information from all historical pulses.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/lstm-cnn-480.webp 480w,/assets/img/intelligent_radar/lstm-cnn-800.webp 800w,/assets/img/intelligent_radar/lstm-cnn-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/lstm-cnn.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Multi-Pulse Joint Anti-Jamming Detection Network </div> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/lstm-conv-480.webp 480w,/assets/img/intelligent_radar/lstm-conv-800.webp 800w,/assets/img/intelligent_radar/lstm-conv-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/lstm-conv.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Combination of Convolutional and Recurrent Networks: Conv-LSTM </div> <h3 id="end-to-end-multi-pulse-joint-anti-jamming-detection-transmit-waveform-optimization">End-to-End Multi-Pulse Joint Anti-Jamming Detection Transmit Waveform Optimization</h3> <p>After obtaining the multi-pulse joint detection network $D(o_t)$, we can optimize the transmit waveform by minimizing the detection error. First, establish the strategy action network for the transmit waveform:</p> \[S_t=\pi (o_{t-1})\] <p>The transmit waveform at the current moment is obtained through the observation information at the previous moment. 
That is to say, the transmit waveform we want to optimize is obtained by analyzing historical observation information. This is a reasonable and common assumption, which has been widely used in the field of cognitive radar.</p> <p>We take the negative of the detection error at the current moment as the detection reward at the current moment:</p> \[R_t=-L(P(Y_t|o_t ),D(o_t ))\] <p>And our ultimate goal is to optimize the transmit waveform at the current moment by maximizing future detection rewards:</p> \[\max_{S_t} \sum_{\tau=t}^{+\infty} R_\tau\] <p>However, the above equation cannot be solved directly because future rewards are unknown: future detection rewards require future observation information, while future observation information requires future transmit waveforms, and optimizing transmit waveforms at future moments requires detection rewards at even further moments.</p> <p>But we can use a value network to evaluate future rewards and solve via the Bellman equation:</p> \[V(o_t )=\sum_{\tau=t}^{+\infty} R_\tau = R_t + E_{X_{t+1}} [V(o_{t+1})]\] <p>The value network is an estimation function for future rewards. It directly evaluates future rewards only through current observation information, without needing to actually give the detection reward value for every future moment. Where the transmit waveform at the current moment is given by the policy network, i.e., $S_t=\pi(o_{t-1})$, and $R_t$ can be calculated from the detection result of the detection network, i.e., $R_t=-L(P(Y_t|o_t ),D(o_t ))$. We use the value on the right side of the Bellman equation to continuously correct the evaluation network on the left side until the equation holds approximately, i.e.:</p> \[\min_V \text{ValueLoss} = \min_{V_{new}} [V_{new}(o_t) - [R_t+V_{old}(o_{t+1})]]^2\] <p>Finally, optimize the transmit waveform strategy at the current moment by maximizing the future rewards evaluated by the value network:</p> \[\max_\pi V(o_t)\] <p>Continuously alternate updating the value network and the policy network to complete the optimization of the transmit waveform.</p> <p>In fact, the entire optimization process utilizes Reinforcement Learning [45] methods, specifically as follows:</p> <ul> <li>View the radar side as an agent.</li> <li>Use the echo or jamming data received by the radar as the agent’s observation information of the environment: $o$.</li> <li>View the radar’s transmit waveform as the agent’s action. The agent takes action based on different observation information according to the policy function: $S_t=\pi(o_{t-1})$.</li> <li>View the radar’s detection of targets in the environment: $D(o_t)$ as the agent’s perception of the environment. (The value network $V(o_t)$ mentioned above, which evaluates future rewards based on observation information, also belongs to environment perception. Therefore, when specifically building the detection network and value network, low-level convolutional parameters can be shared. At the same time, the policy network $\pi(o_{t-1})$ also obtains actions by analyzing observation information, so these parameters can also be shared.)</li> <li>View the detection effect of the radar detection network as the immediate reward for the agent’s action: $R$.</li> </ul> <p>The above idea is shown in the figure below. 
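</p> <p>Before turning to the figure, the Bellman-style correction of the value network can be written as a one-step temporal-difference update. The following PyTorch-style sketch is illustrative only; the function names and the optional discount factor are assumptions, and the text above uses an undiscounted sum:</p> <pre><code class="language-python">import torch
import torch.nn.functional as F

def value_update(V, opt_V, o_t, o_next, r_t, gamma=1.0):
    """One correction step: pull V_new(o_t) toward R_t + V_old(o_{t+1})."""
    with torch.no_grad():
        target = r_t + gamma * V(o_next)  # right-hand side of the Bellman equation, held fixed
    value_loss = F.mse_loss(V(o_t), target)
    opt_V.zero_grad()
    value_loss.backward()
    opt_V.step()
    return value_loss.item()
</code></pre> <p>The policy step $\max_\pi V(o_t)$ then reuses the same value network; sketches of its model-based and model-free variants are given further below.</p> <p>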
In the process of interaction with the environment, detection, value evaluation, and strategy selection are performed through the same multi-layer Conv-LSTM network, and network parameters are updated in reverse using various optimization objectives. Ultimately, using only one network, we complete anti-jamming detection of targets, evaluation of long-term detection rewards, and optimization of transmit waveforms based on maximizing long-term detection rewards. The above ideas can be seen in literature [46] [47].</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rf-480.webp 480w,/assets/img/intelligent_radar/ladar_rf-800.webp 800w,/assets/img/intelligent_radar/ladar_rf-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rf.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Training of Detection Network and Evaluation Network </div> <p>In reality, backpropagation of gradients is not as simple as described in the figure above. Its true forward and backward propagation is shown in the figure below. Among them, the optimization of the detection network $D$ only needs to use the current detection loss; the optimization of the value network $V$ requires using the current evaluation error, and calculating the evaluation error requires not only the current detection loss but also the evaluation value of the next moment and the evaluation value of the current moment.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rf_time-480.webp 480w,/assets/img/intelligent_radar/ladar_rf_time-800.webp 800w,/assets/img/intelligent_radar/ladar_rf_time-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rf_time.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Schematic Diagram of Forward and Backward Propagation </div> <h3 id="policy-network-training-method-real-environment-or-simulation-estimation">Policy Network Training Method: Real Environment or Simulation Estimation</h3> <p>For the optimization of the policy function, there are the following two methods. One is model-based. This method requires us to model the environmental information and establish a feedforward differentiable process from transmit waveform to echo signal. This establishes a differentiable feedforward process from the policy network to the evaluation network: obtain the current transmit waveform through the observation information of the previous moment, then obtain the current observation information through environmental interaction, and then obtain the reward evaluation through the value network. 
Finally, along the feedforward calculation process, the policy network can be optimized by backpropagation, maximizing the reward evaluation.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rf_train_env-480.webp 480w,/assets/img/intelligent_radar/ladar_rf_train_env-800.webp 800w,/assets/img/intelligent_radar/ladar_rf_train_env-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rf_train_env.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Model-Based Policy Network Training </div> <p>The condition for the above method to be effective is that the environment can be modeled differentiably. If a model-free method is to be established, the input of the value network needs to be modified:</p> \[V(o_t )\rightarrow V(o_{t-1},S_t)\] <p>The value network no longer evaluates rewards through the current observation information, but evaluates based on the observation information of the previous moment and the transmit signal of the current moment. In effect, we implicitly build the estimate of the environment model into the value network, which has to estimate on its own how likely the current echo is given the current transmit waveform before it can produce an evaluation.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rl_train_no_env-480.webp 480w,/assets/img/intelligent_radar/ladar_rl_train_no_env-800.webp 800w,/assets/img/intelligent_radar/ladar_rl_train_no_env-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rl_train_no_env.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Model-Free Policy Network Training </div> <h3 id="model-free-environment-modeling-radar-agent-coping-with-location-scenarios-map-fog">Radar Agent without Environment Modeling: Coping with Unknown Scenarios (Fog of War)</h3> <p>The radar agent without environment modeling is shown in the figure below. An important advantage of the model-free method is that in real anti-jamming detection tasks, we naturally cannot know the jammer’s model. In this case, using the model-free method, we can still learn the detection network, value network, and policy network online. When the jammer or environment changes, the environment can be re-evaluated through the optimization of the value network. 
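</p> <p>The two training routes can be contrasted in a few lines. This is again a hedged sketch with assumed names; the differentiable environment model required by the first route is exactly what the second route avoids:</p> <pre><code class="language-python">import torch

def policy_step_model_based(pi, V, opt_pi, o_prev, env_step):
    """Requires a differentiable environment model env_step mapping (o_{t-1}, S_t) to o_t."""
    S_t = pi(o_prev)
    o_t = env_step(o_prev, S_t)  # differentiable echo / jamming model
    loss = -V(o_t).mean()        # maximize the evaluated future reward
    opt_pi.zero_grad()
    loss.backward()
    opt_pi.step()
    return loss.item()

def policy_step_model_free(pi, V, opt_pi, o_prev):
    """The value network itself takes (o_{t-1}, S_t), so no explicit environment model is needed."""
    S_t = pi(o_prev)
    loss = -V(o_prev, S_t).mean()
    opt_pi.zero_grad()
    loss.backward()
    opt_pi.step()
    return loss.item()
</code></pre> <p>In the model-free variant the gradient reaches the policy through the value network itself, which is what allows online learning without knowing the jammer or the environment explicitly.</p> <p>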
These ideas have been applied in some simple experiments, such as radar frequency hopping strategy optimization under fixed jamming strategies.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rl_train_no_env_full-480.webp 480w,/assets/img/intelligent_radar/ladar_rl_train_no_env_full-800.webp 800w,/assets/img/intelligent_radar/ladar_rl_train_no_env_full-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rl_train_no_env_full.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Model-Free Radar Agent </div> <p>Finally, it is particularly necessary to point out that in the above process of optimizing radar transmit waveforms with deep reinforcement learning, we used the actual detection performance of the detection network to generate the reward signal, forming a closed loop between detection and transmit waveforms. We truly achieved designing transmit waveforms by maximizing detection performance. Compared with artificial rewards obtained through modeling analysis, this is clearly more realistic and effective.</p> <h2 id="intelligent-electromagnetic-game-deep-network-adversarial-detection-of-radar-and-jamming-on-continuous-pulses">Intelligent Electromagnetic Game: Deep Network Adversarial Detection of Radar and Jamming on Continuous Pulses</h2> <p>We call the above multi-pulse joint target detection “continuous pulse detection”. In the continuous pulse detection above, we modeled the radar agent, but did not model the environment, especially the jamming in the environment, as an agent. This means that when training the radar agent above, we must provide some fixed form of jamming, and the optimized radar agent can only target the given form of jamming; its anti-jamming capability against unknown jamming forms cannot be guaranteed. To solve this problem and simultaneously optimize the jammer’s jamming strategy, we need to model the jamming during multi-pulse detection as an agent, just like the adversarial improvement of radar and jamming networks in single-pulse detection, and establish a deep network adversarial detection model for radar and jamming.</p> <p>Since the radar performs multi-pulse joint detection, the jamming must also target multi-pulse joint detection. This requires the jamming network to rely not only on the current radar transmit waveform but also on the radar’s previous transmit waveforms. That is to say, the jamming network should be a Conv-LSTM network.</p> \[G(S_t,S_{t-1},…,S_{t-T})=G(S_t,{state}_{t-1} )\] <p>Regarding the optimization criterion of the jamming generation network, we can leverage the detection effect at the radar end. However, it should be noted that unlike single-pulse detection, we no longer aim to maximize the radar’s current detection error, but to minimize the future reward given by the value network. 
This allows the jamming system to also have a long-term vision, rather than just focusing on current jamming effects:</p> \[\min_G V(o_t) = \min_G V(o_{t-1},X_t) = \min_G V(o_{t-1},G(S_t,{state}_{t-1} ) + N + HS_t )\] <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_agent_train-480.webp 480w,/assets/img/intelligent_radar/ladar_agent_train-800.webp 800w,/assets/img/intelligent_radar/ladar_agent_train-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_agent_train.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Training Method of Jamming Generation Network on Continuous Pulses </div> <p>For the optimization of the value network, one can choose either a model-based method or a model-free method. So for the entire deep network adversarial training process of radar and jamming, see the figure below. It can be seen that in the entire process, we have obtained at least four useful functional networks in electromagnetic warfare:</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/intelligent_electromagnetic_game-480.webp 480w,/assets/img/intelligent_radar/intelligent_electromagnetic_game-800.webp 800w,/assets/img/intelligent_radar/intelligent_electromagnetic_game-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/intelligent_electromagnetic_game.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Deep Network Adversarial Detection of Radar and Jamming on Continuous Pulses </div> <ul> <li>An anti-jamming detection network that can be used for multi-pulse joint detection, which can make optimal anti-jamming detection for arbitrary jamming forms.</li> <li>A value network that can be used to evaluate detection effects, which will make long-term effect evaluations of anti-jamming detection based on the jammer’s jamming capability.</li> <li>A policy network for transmit waveforms that can be used for multi-pulse joint anti-jamming detection, which will give the optimal transmit waveform for anti-jamming target detection based on the environmental jamming and target information already mastered.</li> <li>A jamming generation network that can be used for multi-pulse joint detection, which will give optimal jamming for future detection rewards targeting multi-pulse coherent detection based on the received transmit waveforms.</li> </ul> <p>Regarding the elaboration and understanding of the intelligent electromagnetic game, compared with the cognitive radar mentioned in the introduction, the deep network adversarial detection model can achieve the following points:</p> <ul> <li>Leverage deep learning to achieve intelligent information perception of targets and the environment.</li> <li>Leverage deep reinforcement learning to achieve closed-loop optimization processing from transmit waveform to target detection.</li> <li>Leverage recurrent neural networks to achieve the memory function of the radar agent.</li> </ul> <p>In the deep network adversarial detection model, 
the radar agent can rely on the algorithm’s self-learning and improvement capabilities to achieve closed-loop processing from transmit waveform to target detection results. Relying on the final detection result to improve the radar’s working mode and processing process end-to-end, its scope of use is wider and optimization is more integrated. In a stable environment, it will continuously iterate and update; while in an unknown or changing environment, the intelligent radar can also adapt quickly during interaction with the environment. Compared with traditional radar technology which mostly uses preset working modes and reception processing methods, the radar agent in the deep network adversarial detection model forms a closed loop from reception to transmission. It can more actively perceive external environmental information and perform cognitive transmission and cognitive reception processing based on this prior information. In the continuous adversarial training with jamming, it can simultaneously improve the performance of both radar and jamming.</p> <p>The above introduction mainly modeled the radar agent with reinforcement learning. The training of the jamming network relied on the detection jamming effect given by the radar end evaluation network. Of course, reinforcement learning modeling can also be performed on the jamming end, which will not be repeated here. Finally, I believe this is the future of intelligent adversarial radar, and the figure above is the symbol.</p> <h2 id="summary">Summary</h2> <p>Here, the jamming generation network and the radar agent were established successively. The radar agent includes memory, detection network, evaluation network, and policy network. Finally, an intelligent game system of radar and jamming based on deep reinforcement learning was constructed, completing the integrated design of electromagnetic games such as radar anti-jamming strategy, echo signal processing, detection effect evaluation, and jamming strategy. This article conjectured a radar agent possessing most of the functions in the above capabilities, but some advanced functions were not introduced in detail. Perception functions can be implemented relying on autoencoder networks [49], prediction functions can be trained relying on continuously obtained time-series data, and evaluation and action functions can be implemented relying on reinforcement learning. The construction of self-learning capabilities relies on continued research in meta-learning [48] and other artificial intelligence methods. I believe that research in deep reinforcement learning and related fields will lead to Artificial General Intelligence (AGI) and will also bring true intelligent electromagnetic games.</p> <h2 id="some-thoughts-now">Some Thoughts Now</h2> <p>When I now—a person who has been working in the workplace for six years—look back and organize the content of this unpublished thesis from seven years ago, I am truly filled with emotion. I am surprised by the depth and complexity of the thoughts at that time. Although my engineering ability was very weak at that time, my thoughts were free. I hope my thoughts can remain free in the life to come.</p> <h2 id="references">References</h2> <ul> <li>[37] Mark A Richards. Fundamentals of Radar Signal Processing [M]. 2008.</li> <li>[30] Bacon P, Harb J, Precup D, et al. The Option-Critic Architecture[J]. arXiv: Artificial Intelligence, 2016.</li> <li>[46] Tang Y, Tian Y, Lu J, et al. 
Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition[C]. computer vision and pattern recognition, 2018: 5323-5332.</li> <li>[47] L. Kang, J. Bo, L. Hongwei and L. Siyuan. Reinforcement Learning based Anti-jamming Frequency Hopping Strategies Design for Cognitive Radar[C]. 2018 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC). Qingdao. 2018, pp. 1-5.</li> <li>[48] Wang J X, Kurthnelson Z, Tirumala D, et al. Learning to reinforcement learn[J]. Cognitive Science, 2016.</li> <li>[49] Bengio Y, Courville A C, Vincent P, et al. Representation Learning: A Review and New Perspectives[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798-1828.</li> </ul>]]></content><author><name></name></author><category term="research"/><category term="AGI"/><category term="Agent"/><category term="Radar"/><summary type="html"><![CDATA[In the deep network adversarial detection model, the radar agent relies on the algorithm's self-learning and improvement capabilities to achieve closed-loop processing from transmit waveform to target detection results.]]></summary></entry><entry><title type="html">世界模型（二）：智能电磁博弈</title><link href="https://siyuanseever.github.io/blog/2019/intelligent-radar/" rel="alternate" type="text/html" title="世界模型（二）：智能电磁博弈"/><published>2019-06-01T12:00:00+00:00</published><updated>2019-06-01T12:00:00+00:00</updated><id>https://siyuanseever.github.io/blog/2019/intelligent-radar</id><content type="html" xml:base="https://siyuanseever.github.io/blog/2019/intelligent-radar/"><![CDATA[<h2 id="前言">前言</h2> <p>%% 以下为2019年我硕士论文的删减内容 %%</p> <p>如下我的硕士论文主要对雷达的抗干扰检测网络进行了详细的介绍和研究，我们解决了任意发射波形下的检测网络的泛化问题。</p> <p><a href="https://www.doc88.com/p-33371846067474.html">基于深度学习的雷达抗干扰方法研究</a></p> <p>这里将解决抗干扰检测网络的另一个问题：对干扰形式的泛化。要让网络能够泛化尽可能多的干扰形式，就必须能够给出足够多干扰形式的样本，这仍然是手工设计无法满足的，我们将借助干扰网络来生成干扰，所以首先要解决的是干扰生成网络的构建和训练。之后本章将对智能电磁博弈中的其他部分，如发射波形、记忆体以及雷达与干扰的动态博弈给出猜想和相关实验。</p> <h2 id="干扰检测生成的联合优化">干扰、检测、生成的联合优化</h2> <h3 id="干扰接收的雷达信号的检测降噪及恢复网络">干扰接收的雷达信号的检测降噪及恢复网络</h3> <p>当干扰机接收到雷达信号时，首先有两个主要工作：对雷达信号的检测和波形恢复。关于其检测网络的性能及传统检测理论可见文献[37]，我们（在上面的论文中）验证了网络性能逼近于最优检测的理论值，下面主要介绍干扰机对接收的雷达信号的恢复网络。我们建立如下的代价函数，其中回波为</p> \[X=S+N\] <p>干扰机生成信号与原始雷达信号的均方误差损失为</p> \[L_{MSE}=|G(X)-S|^2\] <p>脉冲压缩损失定义为</p> \[L_{PC}=1-\frac{S^H G(X)}{|S|\;|G(X)|}\] <p>生成对抗网络损失为</p> \[L_{GAN}=\log(1-D(S|X))+\log(D(G(X)|X))\] <p>最终训练干扰网络的优化函数为</p> \[\min_G \max_D C_{MSE} L_{MSE}+C_{PC} L_{PC}+C_{GAN} L_{GAN}\] <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/jammer_gen_net-480.webp 480w,/assets/img/intelligent_radar/jammer_gen_net-800.webp 800w,/assets/img/intelligent_radar/jammer_gen_net-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/jammer_gen_net.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 干扰生成网络 </div> <p>其中 $X$ 为干扰机接收到的带噪声的雷达信号，$S$ 表示雷达的发射信号，$N$ 为高斯白噪声，$Y$ 为干扰端恢复并发射的干扰信号，我们的目标便是对接收到的带噪声信号进行去噪，$G$ 便是干扰机对雷达信号的波形恢复网络，该网络与上文相似，也是一个生成网络，即其输出为包含结构信息的张量，可以利用与上文中雷达检测网络相同的网络结构。$L_{MSE}$ 为均方误差损失，衡量的是恢复信号与原始信号在欧式空间中的距离；$L_{PC}$ 为脉冲压缩损失，衡量的是恢复的干扰信号经过脉冲压缩后的峰值损失（即恢复信号在原始信号上投影长度与单位长度的差），当雷达采用传统的脉冲压缩处理来检测目标时干扰恢复网络应当针对这种损失函数来优化；$L_{GAN}$ 为生成对抗网络损失，衡量的是给定的鉴别网络对真实信号和生成信号的鉴别混淆程度，当雷达方采用深度网络来检测目标时干扰恢复网络应当针对这种损失函数来优化。$c_{MSE}$, $c_{PC}$, $c_{GAN}$ 
为三种损失函数的比例系数，可视具体情况而定。</p> <p>我们定义波形相似度为</p> \[similarity(G(X))=1-L_{PC}=\frac{S^H G(X)}{|S|\;|G(X)|}\] <p>下图为针对不同类型的发射波形，干扰恢复网络的信号恢复效果。图中从左到右，分别为相位码、三阶调频码以及线性调频码信号的波形恢复效果，上侧为波形恢复的样例，其中蓝色线条为干扰接收的带噪声信号，橘黄色线条为干扰端生成的降噪信号，下侧为波形降噪的相似度随信噪比变化的改善效果，其中蓝色线条为干扰接收的带噪声信号与雷达发射波形的相似度，橘黄色线条为。从图中可以看出，不同类型的波形，降噪网络的恢复效果是不同的，越是复杂的波形，其降噪改善效果越差，这是符合实际情况的。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E7%9A%84%E6%B5%8B%E8%AF%95%E6%A0%B7%E4%BE%8B-480.webp 480w,/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E7%9A%84%E6%B5%8B%E8%AF%95%E6%A0%B7%E4%BE%8B-800.webp 800w,/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E7%9A%84%E6%B5%8B%E8%AF%95%E6%A0%B7%E4%BE%8B-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E7%9A%84%E6%B5%8B%E8%AF%95%E6%A0%B7%E4%BE%8B.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 干扰恢复网络的测试样例 </div> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BF%A1%E5%8F%B7%E7%9A%84%E6%94%B9%E5%96%84%E6%80%A7%E8%83%BD-480.webp 480w,/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BF%A1%E5%8F%B7%E7%9A%84%E6%94%B9%E5%96%84%E6%80%A7%E8%83%BD-800.webp 800w,/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BF%A1%E5%8F%B7%E7%9A%84%E6%94%B9%E5%96%84%E6%80%A7%E8%83%BD-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/%E5%B9%B2%E6%89%B0%E6%81%A2%E5%A4%8D%E7%BD%91%E7%BB%9C%E5%AF%B9%E4%B8%8D%E5%90%8C%E4%BF%A1%E5%8F%B7%E7%9A%84%E6%94%B9%E5%96%84%E6%80%A7%E8%83%BD.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 干扰恢复网络对不同信号的改善性能 </div> <h3 id="针对雷达检测网络的干扰生成网络">针对雷达检测网络的干扰生成网络</h3> <p>在上文（指硕士论文）中雷达的抗干扰目标检测方法中，我们建立了如下的目标检测交叉熵损失：</p> \[L(P(Y|X,S),D(X,S))=-P(Y|X,S) \log(D(X,S))-(1-P(Y|X,S))\log(1-D(X,S))\] <p>我们通过最小化损失函数来优化检测网络：</p> \[\min_D E_X [L(P(Y|X,S),D(X,S))]\] <p>其中干扰形式是给定的，那么有没有可能去求解出一种干扰形式能最大化干扰效果呢？我们知道雷达回波中包含有目标信号、干扰信号和噪声：</p> \[X=T+J+N\] <p>其中 $T=HS$ 为目标回波，$H$ 表示目标的响应方式，$S$ 为雷达的发射波形，$N$ 为噪声，$J$ 为干扰端经过生成网络产生的干扰信号，有</p> \[J=G(\Gamma S)\] <p>其中 $\Gamma$ 为干扰方对雷达发射波形的采样方式，而 $G$ 即为干扰生成网络，对于该网络的具体结构可以参考本文（硕士论文）中的目标检测网络，只不过输出的不再是各个距离单元上目标的概率，而是干扰信号，这里我们需要关注的只是一个端到端的网络模型。那么我们可以这样求解：</p> \[\min_D \max_G E_X[L(P(Y|X,S),D(X,S))]\] \[X=HS+G(\Gamma S)+N\] <p>当整个从雷达发射波形到干扰、目标回波生成，再到雷达抗干扰检测的过程都可以表达成上述<strong>可微分</strong>的形式之后，我们便可以通过交替的优化雷达检测网络 $D$ 和干扰生成网络 $G$，在不断提高干扰生成网络的干扰能力同时，也会不断提高雷达抗干扰检测的能力。其中，在训练检测网络 $D$ 时，我们使用了不断更新的干扰生成网络 $G$ 产生的所有形式的干扰，所以最终得到的检测网络必然会对各种形式的干扰都具有鲁棒性，也就是说我们得到了一个能够泛化到任意形式的干扰的目标检测网络。</p> <h3 id="端到端抗干扰检测的发射波形优化">端到端抗干扰检测的发射波形优化</h3> <p>在我们得到了优化好的雷达检测网络 $D$ 和干扰生成网络 $G$ 之后，我们其实得到了一个关于任意发射波形和任意形式的干扰都是最优的检测网络 $D$，和一个对于任意发射波形都是最优的干扰生成网络 $G$，此时我们便可以去求解最优化的发射波形：</p> \[\min_S E_X 
[L(P(Y|X,S),D(X,S))]\] \[X=HS+G(\Gamma S)+N\] <p>通过直接最大化检测结果，来优化发射波形，得到一个同时拥有低旁瓣和抗干扰能力的发射波形，在面对最优的抗干扰检测网络以及最优的干扰生成网络时，能达到最好的检测效果，真正做到端到端的模型优化。在下面的示意图中可以看到，反向梯度传播将同时优化发射波形的两个性能，即目标检测性能（也就是自相关的低旁瓣要求等）和抗干扰性能。该想法在附录 A 中可以体现。与传统的手工设计发射波形相比，端到端的网络优化做到了自动化和闭环反馈。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/jammer_net-480.webp 480w,/assets/img/intelligent_radar/jammer_net-800.webp 800w,/assets/img/intelligent_radar/jammer_net-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/jammer_net.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 端到端抗干扰检测网络 </div> <h2 id="长期记忆评估策略的叠加">长期记忆、评估、策略的叠加</h2> <h3 id="多脉冲联合抗干扰检测网络">多脉冲联合抗干扰检测网络</h3> <p>之前的雷达检测都是单脉冲的检测，而更多情况下目标需要多个脉冲才能被检测出来，如静态杂波环境中的运动目标。此时，关于多脉冲联合检测的问题便显露出来。</p> <p>把当前时刻及之前相邻的多个脉冲的雷达接收数据称为当前时刻的观测信息：</p> \[o_t=[X_{t-T},…,X_t ]\] <p>建立多脉冲联合抗干扰检测网络 $D(o_t)$，并通过最小化检测误差进行优化：</p> \[\min_D E_{o_t} [L(P(Y_t|o_t ),D(o_t ))]\] <p>通过上式，可以优化得到多脉冲联合抗干扰检测网。需要说明的是，多脉冲联合抗干扰检测网络的输入信息除了上面提到的观测信息外，当然也需要已知雷达发射波形（这点和上文中单脉冲检测是相同的），而雷达发射波形是雷达检测网络很容易获得的知识（对于一个同时收发的雷达来说），为了表达方便这里省略。</p> <p>关于多脉冲联合抗干扰的网络结构形式有以下思考：我们在单个脉冲回波上采用和之前单脉冲检测相同的卷积网络形式，同时针对之前脉冲提取的环境状态信息，在每一层卷积网络中添加长短期记忆网络（LSTM）结构，结合当前脉冲和之前脉冲共同通过卷积提取特征信息作为下一层的输入，并输出当前时刻的环境状态信息，用于下一时刻的检测。其中循环卷积网络 Conv-LSTM 便是将 LSTM 添加到卷积网络中的结构。此时可将检测网络表达为</p> \[{detect}_t,{state}_t=D(X_t,{state}_{(t-1)})\] <p>其中 $detect$ 为网络输出的检测结果，而 $state$ 便是网络提取的环境状态信息，也可称为检测网络的记忆信息。</p> <p>当然这是一种可能的方法，该方法的优势在于，对于每一个时刻来说，只用重新计算当前时刻的原始脉冲信息，而之前时刻的脉冲信息已经通过上一时刻的网络处理为环境状态信息保留了下来，不需要再对之前时刻的所有回波信息一一处理，这样便可以简化计算过程。但相反的，将多个时刻的回波信息压缩为一个环境状态信息，必然不是最直接的处理方式。上述方法想要取得一个好的检测结果，必须有以下公式近似成立：</p> \[P(Y_t |X_t,X_(t-1),…,X_(t-T) )=P(Y_t |X_t,state_(t-1) )\] <p>也就是说之前多个脉冲中的与当前目标相关的信息能够被一个环境状态信息全部表示。</p> <p>另外，更加直接的多脉冲联合检测方式是直接在多个脉冲和多个距离单元上做二维卷积（甚至是三维卷积，如果采用本文（硕士论文）第三章中的滑窗匹配的方式来适应多变的发射波形），但这样做所带来的计算压力甚至是模型复杂度的压力则需要多加考虑；同时在多脉冲维度使用卷积和传统的相干脉冲积累一样，需要人为地设定相干脉冲个数，其余的非相干脉冲信息将会丢失，而 LSTM 却可以利用长期记忆保留所有历史脉冲中的有用信息。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/lstm-cnn-480.webp 480w,/assets/img/intelligent_radar/lstm-cnn-800.webp 800w,/assets/img/intelligent_radar/lstm-cnn-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/lstm-cnn.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 多脉冲联合抗干扰检测网络 </div> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/lstm-conv-480.webp 480w,/assets/img/intelligent_radar/lstm-conv-800.webp 800w,/assets/img/intelligent_radar/lstm-conv-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/lstm-conv.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 卷积与循环网络的结合 Conv-LSTM </div> <h3 id="端到端多脉冲联合抗干扰检测的发射波形优化">端到端多脉冲联合抗干扰检测的发射波形优化</h3> <p>在得到关于多脉冲的联合检测网络 $D(o_t)$ 
后，我们便可以通过最小化检测误差，来优化发射波形。首先建立发射波形的策略行动网络：</p> \[S_t=\pi (o_{(t-1)})\] <p>通过上一时刻的观测信息得出当前时刻的发射波形，也就是说我们要优化的发射波形是通过对历史的观测信息进行分析得到的，这是一个合理而常见的假设，在认知雷达领域中已有广泛的应用。</p> <p>我们把当前时刻的检测误差取负后称为当前时刻的检测回报：</p> \[R_t=-L(P(Y_t|o_t ),D(o_t ))\] <p>而我们最终的目标就是，通过最大化未来的检测回报来优化当前时刻的发射波形：</p> \[\max_{S_t} \sum_{\tau=t}^{+\infty} R_\tau\] <p>但上式不可直接求解，因为未来回报不可知：未来的检测回报需要未来的观测信息，而未来的观测信息则需要未来的发射波形，优化未来时刻的发射波形则需要更远时刻的检测回报。</p> <p>但我们可以利用价值网络来评估未来回报，通过 Bellman 方程求解：</p> \[V(o_t )=\sum_{\tau=t}^{+\infty} R_\tau = R_t + E_{X_{t+1}} [V(o_{t+1}]\] <table> <tbody> <tr> <td>价值网络是一个对未来回报的估计函数，它仅通过当前观测信息直接评估未来回报，而不需要实际给出未来每一时刻的检测回报值。其中当前时刻的发射波形由策略网络给出，即 $S_t=\pi(o_{t-1})$，而 $R_t$ 则可以由检测网络的检测结果计算得到，即 $R_t=-L(P(Y_t</td> <td>o_t ),D(o_t ))$。我们用 Bellman 方程的右边的值来不断修正左侧的评估网络，直至等式近似成立，即：</td> </tr> </tbody> </table> \[\min_V {ValueLoss} = \min_{V_{new}} [V_{new}(o_t) - [R_t+V_{old}(o_{t+1})]]^2\] <p>最后再通过最大化价值网络评估的未来回报来优化当前时刻的发射波形策略：</p> \[\max_\pi V(o_t)\] <p>不断交替重复更新价值网络和策略网络，完成对发射波形的优化。</p> <p>实际上整个优化过程是利用了强化学习[45]的方法，具体如下：</p> <ul> <li>将雷达端看做一个智能体（agent）。</li> <li>将雷达接收的回波或干扰数据作为智能体对环境的观测信息（observation）：$o$。</li> <li>将雷达的发射波形看做智能体的行动（action），智能体依据策略（policy）函数，根据不同的观测信息采取行动：$S_t=\pi(o_{t-1})$。</li> <li>雷达对环境中目标的检测（detection）：$D(o_t)$ 看做智能体对环境的感知（上文提到的根据观测信息对未来回报进行评估的价值（value）网络：$V(o_t)$ 也属于环境感知，所以在具体构建检测网络和价值网络时可以共享低层的卷积参数，同时策略网络 $\pi(o_{t-1})$ 也是通过分析观测信息才获得的行动，所以也可以共享这些参数。）</li> <li>将雷达检测网络的检测效果看做智能体行动的立即回报奖励（reward）：$R$。</li> </ul> <p>上述想法如下图所示，在于环境的交互过程中，通过同一个多层的 Conv-LSTM 网络进行检测、价值评估及策略选择，并利用各种优化目标反向更新网络参数，最终，我们仅利用一个网络，便完成了包括目标的抗干扰检测、长期检测回报的评估以及基于最大化长期检测回报优化得到的发射波形。上述想法在文献 [46] [47] 中可以看到。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rf-480.webp 480w,/assets/img/intelligent_radar/ladar_rf-800.webp 800w,/assets/img/intelligent_radar/ladar_rf-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rf.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 检测网络和评估网络的训练 </div> <p>实际上，反向梯度传播并没有上图中描述的那么简单，其真正的前向和反向传播如下图所示。其中，对检测网络 $D$ 的优化仅需要利用当前的检测损失就可以了；对价值网络 $V$ 的优化，则需要利用当前的评估误差，而计算评估误差不仅需要当前的检测损失，还需要下一时刻的评估价值以及当前时刻的评估价值。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rf_time-480.webp 480w,/assets/img/intelligent_radar/ladar_rf_time-800.webp 800w,/assets/img/intelligent_radar/ladar_rf_time-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rf_time.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 前向和反向传播示意图 </div> <h3 id="策略网络的训练方式真实环境-or-模拟估计">策略网络的训练方式：真实环境 or 模拟估计</h3> <p>而对于策略函数的优化则有以下两种方式，一种是基于模型的方式，这种方式需要我们对环境信息进行建模，建立从发射波形到回波信号的前馈可微分过程。这样便建立了可微分的从策略网络到评估网络的前馈过程：通过上一时刻的观测信息得到当前发射波形，再通过环境作用得到当前的观测信息，然后通过价值网络得到回报评估。最后，便可以沿着前馈的计算过程，通过最大化回报评估，反向传播优化策略网络。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rf_train_env-480.webp 480w,/assets/img/intelligent_radar/ladar_rf_train_env-800.webp 800w,/assets/img/intelligent_radar/ladar_rf_train_env-1400.webp 1400w," 
type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rf_train_env.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 基于模型的策略网络训练 </div> <p>上述方法有效的条件是要对环境进行可微分的建模，而若要建立免环境建模的方法，则需要对价值网络的输入信号做出改进：</p> \[V(o_t )\rightarrow V(o_{t-1},S_t)\] <p>价值网络不再通过当前的观测信息来进行回报评估，而是根据上一时刻的观测信息以及当前时刻的发射信号进行评估。实际上我们隐性的把对环境模型的估计建立在了价值网络当中，其需要自行的根据当前发射波形来估计当前回波的可能性，进而才能做出回波评估。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rl_train_no_env-480.webp 480w,/assets/img/intelligent_radar/ladar_rl_train_no_env-800.webp 800w,/assets/img/intelligent_radar/ladar_rl_train_no_env-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rl_train_no_env.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 免模型的策略网络训练 </div> <h3 id="免环境建模雷达智能体应对位置场景地图迷雾">免环境建模雷达智能体：应对位置场景（地图迷雾）</h3> <p>免环境建模的雷达智能体如下图所示。免环境建模的方法有一个重要的优势就是，在面对真实的抗干扰检测任务中，我们自然是无法得知干扰方的模型。此时利用免环境建模的方法，我们依然可以在线进行学习，包括检测网络、价值网络和策略网络，当干扰方或环境发生变化时，也可以通过对价值网络的优化，重新对环境进行评估。这些想法已应用于一些简单的实验当中，如固定干扰策略下的雷达跳频策略优化。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_rl_train_no_env_full-480.webp 480w,/assets/img/intelligent_radar/ladar_rl_train_no_env_full-800.webp 800w,/assets/img/intelligent_radar/ladar_rl_train_no_env_full-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_rl_train_no_env_full.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 免模型雷达智能体 </div> <p>最后，特别需要说明的是，在以上利用深度强化学习对雷达发射波形做优化的过程中，我们使用了检测网络的实际检测效果来产生回报奖励，在检测与发射波形间形成了闭环，真正做到了以最优化检测效果来设计发射波形，相比于通过建模分析得到的虚假的回报奖励，这样做显然更加的真实有效。</p> <h2 id="智能电磁博弈雷达与干扰在连续脉冲上的深度网络对抗检测">智能电磁博弈：雷达与干扰在连续脉冲上的深度网络对抗检测</h2> <p>我们将上述多脉冲联合的目标检测称为连续脉冲检测。在上面的连续脉冲检测中，我们对雷达的智能体进行了建模，而对环境，尤其是对环境中的干扰并没有进行智能体建模。这就导致我们在训练上面雷达的智能体时必须给出某种固定形式的干扰，而优化的雷达智能体也只能针对给定形式的干扰，对于未知的干扰形式其抗干扰能力将无法保证。为了解决这个问题，同时优化干扰方的干扰策略，需要我们如同单脉冲检测中的雷达与干扰网络的对抗提升，对多脉冲检测时的干扰也进行智能体建模，建立雷达与干扰的深度网络对抗检测模型。</p> <p>由于雷达进行的是多脉冲联合检测，那么干扰便也要针对多脉冲联合检测进行干扰，这就要求干扰网络不仅仅要依据当前的雷达发射波形，也要考虑雷达之前的发射波形，也就是说干扰网络应该是一个 Conv-LSTM 网络。</p> \[G(S_t,S_{t-1},…,S_{t-T})=G(S_t,{stat}_{t-1} )\] <p>关于干扰生成网络的优化准则，可以借助雷达端的检测效果。但需要注意的是，与单脉冲检测不同的是，我们不再以最大化雷达当前检测误差为目标，而是以最小化价值网络给出的未来回报为目标，这样便可以使得干扰系统也拥有长远的眼光，而不仅仅只注重当前的干扰效果：</p> \[\min_G V(o_t) = \min_G V(o_{t-1},X_t) = \min_G V(o_{t-1},G(S_t,{state}_{t-1} ) + N + HS_t )\] <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/ladar_agent_train-480.webp 480w,/assets/img/intelligent_radar/ladar_agent_train-800.webp 800w,/assets/img/intelligent_radar/ladar_agent_train-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/ladar_agent_train.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; 
$('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 连续脉冲上干扰生成网络的训练方式 </div> <p>而对于价值网络的优化，既可以选择有模型的方法，也可以选择免模型的方法。那么对于整个雷达与干扰的深度网络对抗训练过程，可见下图。可以看出，在整个过程中，我们至少得到了四个在电磁对抗当中有用的功能网络：</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/intelligent_radar/intelligent_electromagnetic_game-480.webp 480w,/assets/img/intelligent_radar/intelligent_electromagnetic_game-800.webp 800w,/assets/img/intelligent_radar/intelligent_electromagnetic_game-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/intelligent_radar/intelligent_electromagnetic_game.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 雷达与干扰在连续脉冲上的深度网络对抗检测 </div> <ul> <li>一个可以用于多脉冲联合的抗干扰检测网络，该网络可以针对任意的干扰形式作出最优的抗干扰检测。</li> <li>一个可以用于评估检测效果的价值网络，该网络会根据干扰方的干扰能力作出抗干扰检测的长期效果评估。</li> <li>一个可以用于多脉冲联合的抗干扰检测的发射波形的策略网络，该网络会根据已经掌握的环境干扰和目标信息，给出最优的抗干扰目标检测的发射波形。</li> <li>一个可以用于多脉冲联合检测的干扰生成网络，该网络会根据接收到的发射波形针对多脉冲相参检测对未来检测回报给出最优的干扰。</li> </ul> <p>关于对智能电磁博弈的阐述和理解，与绪论中提到的认知雷达相比，深度网络对抗检测模型可以做到以下几点：</p> <ul> <li>借助深度学习，实现对目标和环境的智能化信息感知。</li> <li>借助深度强化学习，实现从发射波形到目标检测的闭环优化处理。</li> <li>借助循环神经网络，实现雷达智能体的记忆功能。</li> </ul> <p>深度网络对抗检测模型中雷达智能体能够依靠算法本身的自我学习和改善能力，实现从发射波形到目标检测结果的闭环处理，依靠最终检测结果端到端地改善雷达的工作方式和处理过程，其使用范围更广，优化更加一体化。在平稳的环境下其会不断地迭代更新；而在未知或变化的环境中，智能化雷达也能够在与环境的交互中快速适应。相比于传统雷达技术多采用预设的工作模式和接收处理方式，深度网络对抗检测模型中雷达智能体形成了从接收到发射的闭环，可以更加主动的感知外部环境信息，并基于这些先验信息进行认知发射和认知接收处理，在与干扰的不断对抗训练中，能够同时改善雷达与干扰的性能。</p> <p>上面的介绍主要是对雷达智能体进行了强化学习建模，干扰网络的训练依赖于雷达端评估网络给出的检测干扰效果，当然也可以对干扰端进行强化学习建模，这里不再赘述。最后，我相信这是智能化对抗雷达的未来，而上图便是象征。</p> <h2 id="小结">小结</h2> <p>这里先后建立了干扰生成网络和雷达智能体，其中雷达智能体包含了记忆体、检测网络、评估网络和策略网络，最终构建了基于深度强化学习的雷达与干扰的智能博弈体系，完成了对雷达抗干扰策略、回波信号处理、检测效果评估和干扰策略等电磁博弈的一体化设计。本文猜想了一个雷达智能体，其拥有上述能力中的大部分功能，但还有一些高级的功能并没有给出详细的介绍，感知功能可以依靠自编码网络[49]实现，预测功能可以依靠不断得到的时序数据来训练，评估和行动功能可以依靠强化学习来实现，而如自我学习能力的构建，则要依靠元学习[48]和其它人工智能方法的继续研究。我相信，深度强化学习和相关领域的研究将通向通用人工智能，也将带来真正的智能电磁博弈。</p> <h2 id="一些现在的感想">一些现在的感想</h2> <p>当现在的我——一个已经在职场工作六年的人，回头再整理七年前未发表的这篇论文内容时，真的感慨万千。我惊讶于那时候的思想深度和复杂度。虽然那时候自己工程能力很弱，但思想是自由的。希望往后的人生我的思想都能是自由的。</p> <h2 id="文献">文献</h2> <ul> <li>[37] Mark A Richards. 雷达信号处理基础[M]. 2008.</li> <li>[30] Bacon P, Harb J, Precup D, et al. The Option-Critic Architecture[J]. arXiv: Artificial Intelligence, 2016.</li> <li>[46] Tang Y, Tian Y, Lu J, et al. Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition[C]. computer vision and pattern recognition, 2018: 5323-5332.</li> <li>[47] L. Kang, J. Bo, L. Hongwei and L. Siyuan. Reinforcement Learning based Anti-jamming Frequency Hopping Strategies Design for Cognitive Radar[C]. 2018 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC). Qingdao. 2018, pp. 1-5.</li> <li>[48] Wang J X, Kurthnelson Z, Tirumala D, et al. Learning to reinforcement learn[J]. Cognitive Science, 2016.</li> <li>[49] Bengio Y, Courville A C, Vincent P, et al. Representation Learning: A Review and New Perspectives[J]. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798-1828.</li> </ul>]]></content><author><name></name></author><category term="research"/><category term="AGI"/><category term="Agent"/><category term="Radar"/><summary type="html"><![CDATA[深度网络对抗检测模型中雷达智能体能够依靠算法本身的自我学习和改善能力，实现从发射波形到目标检测结果的闭环处理。]]></summary></entry><entry><title type="html">World Models (I): The Union of Memory, Perception, Prediction, Evaluation, and Decision</title><link href="https://siyuanseever.github.io/blog/2019/world-model-1-en/" rel="alternate" type="text/html" title="World Models (I): The Union of Memory, Perception, Prediction, Evaluation, and Decision"/><published>2019-03-31T16:00:00+00:00</published><updated>2026-02-14T16:00:00+00:00</updated><id>https://siyuanseever.github.io/blog/2019/world-model-1-en</id><content type="html" xml:base="https://siyuanseever.github.io/blog/2019/world-model-1-en/"><![CDATA[<blockquote> <p>First drafted in April 2019 for my M.S. thesis on intelligent radar; revived in Jan 2026 with new insights from LLMs and spatial intelligence. May this note serve fellow travellers on the road to AGI.</p> </blockquote> <p>Before diving in, let us distinguish two concepts: <strong>simulating the world</strong> and <strong>understanding the world</strong>. Modern video-generative models (e.g. Sora, MovieGen) excel at pixel-level <em>simulation</em>, yet do they <em>understand</em> the underlying physics and causality? Borrowing the metaphor of a “unified field theory” from physics, I define a <strong>World Model</strong> as a <strong>differentiable, end-to-end framework</strong> that tightly couples five functions—<strong>memory, perception, prediction, evaluation, and decision</strong>—into a single, learnable closed loop. The goal is not merely photorealistic frames, but a reasoning, interactive <em>mind</em>.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/world_model-480.webp 480w,/assets/img/world-model/world_model-800.webp 800w,/assets/img/world-model/world_model-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/world_model.jpg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Conceptual diagram of a world model </div> <h2 id="prelude">Prelude</h2> <p>The metaphor is borrowed from physics: a unified field theory that merges the four fundamental forces.</p> <blockquote> <p>A <strong>unified model</strong> here means one that fuses memory, perception, prediction, evaluation, and decision into a <strong>single, differentiable, end-to-end architecture</strong>.</p> </blockquote> <p>Below I detail each module and show how to weave them together “organically” (i.e. differentiably). Later sections instantiate the framework on concrete tasks.</p> <h2 id="five-functional-modules">Five Functional Modules</h2> <p>An agent should implement the following:</p> <ul> <li><strong>Memory</strong> – temporal, causal memory</li> <li><strong>Perception</strong> – compressive representation</li> <li><strong>Prediction</strong> – next-state forecasting</li> <li><strong>Evaluation</strong> – value estimation</li> <li><strong>Decision</strong> – policy / action selection</li> </ul> <h3 id="1-memory">1. 
Memory</h3> <blockquote> <p>Memory is <strong>not</strong> passive storage; it <strong>actively</strong> combines the previous memory state $m_{t-1}$ with the current observation $o_t$ to produce an updated state $s_t$ and memory $m_t$.</p> </blockquote> \[s_t,\; m_t \;=\; D\!\big(o_t,\; m_{t-1}\big)\] <p>Any recurrent architecture that preserves long-term causality qualifies—classic RNNs, LSTMs, and recent “Renaissance” hybrids such as RWKV, RetNet, Mamba, etc. In 2023 I hacked <a href="https://github.com/siyuanseever/llama2RNN.c">llama2RNN.c</a> as a toy demo; a longer write-up is forthcoming.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/memoryAttention-480.webp 480w,/assets/img/world-model/memoryAttention-800.webp 800w,/assets/img/world-model/memoryAttention-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/memoryAttention.jpg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Memory-attention mechanism </div> <h3 id="2-perception">2. Perception</h3> <blockquote> <p>Perception compresses high-dimensional observations into abstract states and approximately reconstructs the original signal.</p> </blockquote> \[\hat{o}\;=\;D^{-1}\!\big(D(o)\big)\] <p>The code $D(o)$ must be dramatically smaller than the raw observation $o$. Vanilla auto-encoders or MAE already satisfy this template.</p> <h3 id="3-prediction">3. Prediction</h3> <blockquote> <p>From the abstract state (and any prior) the agent forecasts the <strong>next abstract state</strong>, not the next pixel frame.</p> </blockquote> \[s'_{t+1}\;=\;P(s_t)\] <p>Large language models follow the same principle, except they predict raw tokens rather than states.</p> <h3 id="4-evaluation">4. Evaluation</h3> <blockquote> <p>The agent assigns a scalar <strong>value</strong> to each state, reflecting expected cumulative reward.</p> </blockquote> \[v_t \;=\; E(s_t) \;=\; \mathbb{E}\!\left[r \;+\; \gamma\, E\!\big(s_{t+1}\big)\right]\] <p>This is the value network familiar in RL.</p> <h3 id="5-decision">5. Decision</h3> <blockquote> <p>The agent <strong>acts</strong> to change both the external world and its own internal state.</p> </blockquote> \[\pi(s) \;=\; \arg\max_{a}\, Q(s, a)\] <p>Actions include not only motor commands but also <strong>self-modifications</strong>—e.g. architecture search (NASNet-style), learning-rate updates, or any differentiable controller that rewrites its own parameters.</p> <h2 id="instantiations">Instantiations</h2> <h3 id="a-vision-based-multi-task-manipulation-from-demonstration">A. 
Vision-based Multi-task Manipulation from Demonstration</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/demonstration-480.webp 480w,/assets/img/world-model/demonstration-800.webp 800w,/assets/img/world-model/demonstration-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/demonstration.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> End-to-end imitation learning for cheap robot arms </div> <p>The system couples a multi-modal auto-regressive control network with a VAE-GAN reconstructor; the encoder (perception) feeds state features to the controller, yielding a minimal but complete perception–action loop.</p> <h3 id="b-next-state-prediction-instead-of-next-token-prediction">B. Next-State Prediction instead of Next-Token Prediction</h3> <p>If LLMs push <em>next-token</em> prediction to the extreme, <strong>next-state</strong> prediction couples forecasting with perception for data-efficient learning on high-bandwidth modalities such as video.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/next_word_prediction.drawio-480.webp 480w,/assets/img/world-model/next_word_prediction.drawio-800.webp 800w,/assets/img/world-model/next_word_prediction.drawio-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/next_word_prediction.drawio.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Next-state predictive framework </div> <p>Key references:</p> <ul> <li>Joint Embedding Predictive Architecture (JEPA)</li> <li>Emu3.5</li> </ul> <p>LeCun’s roadmap to autonomous machine intelligence resonates strongly with this line of thought—sadly I still lack the engineering muscle to ship a full-scale demo.</p> <h3 id="c-v-jepa-2-ac-self-supervised-video-understanding--planning">C. V-JEPA 2-AC: Self-supervised Video Understanding &amp; Planning</h3> <p>V-JEPA 2-AC adds <strong>action conditioning</strong> to perception and prediction. Although it does not emit <em>actions</em> directly (evaluation + RL are still needed), it learns to imitate state-action transitions observed in the training videos.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/V-JEPA2-AC-480.webp 480w,/assets/img/world-model/V-JEPA2-AC-800.webp 800w,/assets/img/world-model/V-JEPA2-AC-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/V-JEPA2-AC.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> V-JEPA 2-AC overview </div> <h2 id="frontiers-spatial-intelligence">Frontiers: Spatial Intelligence</h2> <p>Prof. Fei-Fei Li’s team (World Labs) recently popularised <strong>Spatial Intelligence</strong>—a perfect sandbox for world models.</p> <h3 id="1-vision-before-language">1. 
Vision before Language?</h3> <blockquote> <p>Perception and action became the core loop driving the evolution of intelligence.</p> </blockquote> <p>Even pre-vertebrate animals without language rely on vision to grasp physics (gravity, occlusion) and act. The next leap toward AGI must therefore endow AI with <strong>spatial cognition</strong>, not merely linguistic competence.</p> <h3 id="2-definition">2. Definition</h3> <blockquote> <p>Building frontier models that can <strong>perceive, generate, reason, and interact</strong> with the 3D world.</p> </blockquote> <p>This aligns one-to-one with our five-module taxonomy:</p> <ul> <li><strong>Perceive</strong> – 3D structure understanding</li> <li><strong>Generate</strong> – imagine future states</li> <li><strong>Reason</strong> – causal inference (evaluation + memory)</li> <li><strong>Interact</strong> – decision-making in physical spaces</li> </ul> <h3 id="3-marble-from-generating-videos-to-generating-worlds">3. Marble: From “Generating Videos” to “Generating Worlds”</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/Marble-480.webp 480w,/assets/img/world-model/Marble-800.webp 800w,/assets/img/world-model/Marble-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/Marble.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Marble: persistent, editable 3D worlds </div> <p>Marble highlights two deficits of video-centric models:</p> <ul> <li><strong>Spatial inconsistency</strong> – objects drift or vanish; perspective violates physics.</li> <li><strong>Ephemerality</strong> – pixels disappear; no persistent 3D substrate.</li> </ul> <p>Spatial intelligence demands an <strong>explicit 3D latent state</strong> that respects physics and remains editable. 
The AI graduates from <em>painter</em> to <em>demiurge</em>.</p> <p>Long-form temporal consistency can also be injected via <strong>long-context memory</strong>, from early ConvLSTM to modern state-space models and my own Truncated Recurrent Transformer experiments.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/Long-Context_State-Space_Video_World_Models-480.webp 480w,/assets/img/world-model/Long-Context_State-Space_Video_World_Models-800.webp 800w,/assets/img/world-model/Long-Context_State-Space_Video_World_Models-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/Long-Context_State-Space_Video_World_Models.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Long-context state-space video world models </div> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/Long-Context_State-Space_Model_architecture-480.webp 480w,/assets/img/world-model/Long-Context_State-Space_Model_architecture-800.webp 800w,/assets/img/world-model/Long-Context_State-Space_Model_architecture-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/Long-Context_State-Space_Model_architecture.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> State-space model architecture for long contexts </div> <h2 id="learning-like-humans">Learning like Humans</h2> <p>World models diverge from mainstream deep learning in <strong>data efficiency</strong> and <strong>adaptation</strong>.</p> <ol> <li><strong>Abstract Learning</strong> – physicians read MRI scans by <em>concepts</em>, not pixels; future AI must exploit spatial commonsense.</li> <li><strong>Continual Learning</strong> – we should target an <strong>evolving intelligence</strong> that adapts lifelong, rather than a frozen AGI that ships once.</li> <li><strong>Temporal Awareness</strong> – time is the only unquestionable physical quantity. 
Any serious model (CNN or Transformer) will eventually re-acquire an <strong>RNN backbone</strong>; without it, entropy and causality remain invisible, precluding true <em>silicon life</em>.</li> </ol> <p>Recurrent inductive biases endow models with <strong>long-term, causal memory</strong>, solving length extrapolation <em>and</em> letting AI accumulate experience across training steps instead of being <em>reformatted</em> after every restart.</p> <h2 id="case-study-intelligent-electromagnetic-game">Case Study: Intelligent Electromagnetic Game</h2> <p>To show that the framework is <em>not</em> limited to video games, I apply it to <strong>radar–jammer adversarial signalling</strong>—a decidedly <em>hardcore</em> domain.</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/intelligent_electromagnetic_game-480.webp 480w,/assets/img/world-model/intelligent_electromagnetic_game-800.webp 800w,/assets/img/world-model/intelligent_electromagnetic_game-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/intelligent_electromagnetic_game.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Intelligent electromagnetic game: radar vs. jammer </div> <p>My M.S. thesis built a <strong>deep-RL radar agent</strong> implementing the full loop:</p> <ol> <li><strong>Perception + Memory</strong> – Conv-LSTM ingests pulse echoes, retaining long-term memory of earlier pulses.</li> <li><strong>Decision</strong> – a policy network $\pi(o_{t-1})$ <em>generates</em> the next transmit waveform instead of using a fixed template.</li> <li><strong>Evaluation</strong> – a value network $V(o_t)$ predicts the long-term detection return of the chosen waveform under future jamming.</li> <li><strong>World</strong> – radar and jammer co-train in a <strong>fully differentiable</strong> adversarial channel.</li> </ol> <p>The cycle <strong>transmit (decision) → jamming (world feedback) → echo detection (perception / evaluation)</strong> forms an end-to-end closed loop.</p> <h2 id="epilogue">Epilogue</h2> <p>History offers a constellation of ideas—RL, meta-learning, self-supervised prediction, compressive sensing, RNNs, ResNets, Transformers, NAS, and more. Each has its merits. 
The AGI of tomorrow will weave them together without disdain, greeting even today’s over-industrialised LLMs with the words:</p> <blockquote> <p>“You have arrived precisely on time.”</p> </blockquote> <hr/> <p>Series Navigation</p> <ul> <li>Next: <a href="/blog/2019/intelligent-radar/">World Models (II): Intelligent Electromagnetic Game</a></li> </ul>]]></content><author><name></name></author><category term="research"/><category term="AGI"/><category term="Agent"/><category term="World-Model"/><category term="Memory"/><category term="Perception"/><category term="Prediction"/><category term="RL"/><summary type="html"><![CDATA[Distinguishing "simulating the world" from "understanding the world", we present a unified, differentiable framework that couples five core modules, and discuss frontiers such as spatial intelligence, abstract learning, and long-term memory.]]></summary></entry><entry><title type="html">世界模型（一）：记忆、感知、预测、评估、决策的联合</title><link href="https://siyuanseever.github.io/blog/2019/world-model-1/" rel="alternate" type="text/html" title="世界模型（一）：记忆、感知、预测、评估、决策的联合"/><published>2019-03-31T16:00:00+00:00</published><updated>2026-02-14T16:00:00+00:00</updated><id>https://siyuanseever.github.io/blog/2019/world-model-1</id><content type="html" xml:base="https://siyuanseever.github.io/blog/2019/world-model-1/"><![CDATA[<blockquote> <p>本文原写于 2019 年 4 月，拟作为雷达硕士论文的一章，因精力有限搁置。2026 年 1 月重审，补入近年 LLM 与空间智能的新思考，仍愿为相关方向的同学提供一份“初心”笔记。</p> </blockquote> <p>在讨论世界模型之前，我们先区分两个概念：<strong>模拟世界（simulate）</strong> 与 <strong>理解世界（understand）</strong>。当下的视频生成模型（如 Sora、MovieGen）能在像素层面“模拟”世界，但是否真正把握了背后的物理与因果？借用物理学中的“统一场论”隐喻，我将<strong>世界模型（World Models）</strong>定义为：能够将<strong>记忆、感知、预测、评估、决策</strong>功能联合为<strong>整体可微可导</strong>闭环的模型框架。它不止于生成逼真的帧，还要构建一个能推理、能交互的“心智”。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/world_model-480.webp 480w,/assets/img/world-model/world_model-800.webp 800w,/assets/img/world-model/world_model-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/world_model.jpg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 世界模型概念图 </div> <h2 id="前言">前言</h2> <p>这里其实是借鉴物理学中统一场论的概念：一个可以统一四种基本力的物理理论。</p> <blockquote> <p><strong>统一模型是指一个能够把记忆、感知、预测、评估、决策功能联合为整体可微可导的模型框架。</strong></p> </blockquote> <p>下面我会详细说明这些功能的具体内容以及如何将它们“有机”（可微可导）地结合在一起。后续章节则尝试在具体任务中构建它们。</p> <h2 id="多种功能的具体介绍">多种功能的具体介绍</h2> <p>首先我们对智能体给出这样的描述，智能体应该拥有如下几个功能：</p> <ul> <li>记忆功能</li> <li>感知功能</li> <li>预测功能</li> <li>评估功能</li> <li>行动功能</li> </ul> <p>下面将逐一介绍这些功能。</p> <h3 id="记忆功能">记忆功能</h3> <blockquote> <p>记忆能力并不是指能够记录信息，而是要能够利用上一时刻的记忆信息和当前时刻的观测信息共同完成信息处理（包括但不限于信息的感知、预测、评估、决策）和当前时刻的记忆形成。以信息感知为例：</p> </blockquote> \[s_t,\; m_t \;=\; D\!\big(o_t,\; m_{t-1}\big)\] <p>其中 $D$ 为感知系统，$o$ 为观测信息，$s$ 为感知得到的状态信息，$m$ 便是记忆信息。</p> <p>其实只要能保留长期记忆的时序因果模型在结构上都属于带记忆功能的，这部分比较古老的框架如 RNN、LSTM；最近几年也重新兴起了重铸 RNN 荣光的事情，也就是将 RNN 与 Transformer 相结合，如 RWKV、RetNet 等。我在 2023 年也兴致勃勃地构建了 <a href="https://github.com/siyuanseever/llama2RNN.c">llama2RNN.c</a> 的 demo（可下载），<a href="https://zhuanlan.zhihu.com/p/681684286">这里</a> 是一些零碎的介绍，后续会整理成长文。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/memoryAttention-480.webp 480w,/assets/img/world-model/memoryAttention-800.webp 
800w,/assets/img/world-model/memoryAttention-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/memoryAttention.jpg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Truncated Recurrent Transformer </div> <h3 id="感知功能">感知功能</h3> <blockquote> <p>感知能力是指系统能够将观测信息进行压缩理解，得到抽象概念，并根据抽象概念大致还原出原始信息的能力：</p> </blockquote> \[\hat{o}\;=\;D^{-1}\!\big(D(o)\big)\] <p>其中 $D^{-1}$ 为 $D$ 的逆处理系统，得到的抽象概念 $D(o)$ 的数据大小要远小于观测信息 $o$ 的数据大小。</p> <p>这里比较简单的自编码器就可以完成感知任务了，MAE 也大致可以算作这个思路。</p> <h3 id="预测功能">预测功能</h3> <blockquote> <p>预测能力是指系统能够根据上一时刻从感知系统中得到的状态信息（以及其它能够获取的先验信息）预测下一时刻的状态信息：</p> </blockquote> \[s'_{t+1}\;=\;P(s_t)\] <p>其中 $P$ 为预测系统。</p> <p>现在的 LLM 大抵就是这么学习的了，不过它们不是针对状态预测，而是直接对原始信息（仅仅简单的做了下压缩分词变成 Token）做预测。</p> <h3 id="评估功能">评估功能</h3> <blockquote> <p>评估能力是指系统能够对给定的状态做出价值评估，估计出自身状态的好坏，用一个单值表示：</p> </blockquote> \[v_t \;=\; E(s_t) \;=\; \mathbb{E}\!\left[r \;+\; \gamma\, E\!\big(s_{t+1}\big)\right]\] <p>其中 $E$ 为评估系统，$v$ 为给定状态 $s$ 下的评估价值，该评估值与系统自身接收的真实奖励 $r$ 以及未来奖励 $E(s_{t+1})$ 有关。</p> <h3 id="决策功能">决策功能</h3> <blockquote> <p>行动能力是指系统可以根据状态信息作出行动决策，该行动能够改变环境和自身状态及价值：</p> </blockquote> \[\pi(s) \;=\; \arg\max_{a}\, Q(s, a)\] <p>这里的重点是能够改变环境和自身状态的行动才是有效的行动，需要注意。其中 $\pi$ 为决策系统，$a$ 为决策系统给出的有效行动，$Q$ 为动作价值函数。决策不仅包括对外部环境和自身状态的改变，甚至是对自身网络结构（完成类似 2019 年比较火的模型架构搜索相关的功能，如 NASNet）和训练过程的改变和控制，指系统能够利用所有可用资源不断地改善智能体的各种功能的效果，也就是系统拥有“自我学习能力”。</p> <h2 id="统一模型的一些例子">统一模型的一些例子</h2> <h3 id="基于端到端模仿学习的廉价机器人视觉多任务操作系统">基于端到端模仿学习的廉价机器人视觉多任务操作系统</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/demonstration-480.webp 480w,/assets/img/world-model/demonstration-800.webp 800w,/assets/img/world-model/demonstration-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/demonstration.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 基于端到端模仿学习的廉价机器人视觉多任务操作系统示意 </div> <p>上图为一个基于端到端模仿学习的廉价机器人视觉多任务操作系统（Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-to-End Learning from Demonstration），完成了上述的感知、记忆、行动功能的组合。系统包含一个基于多模式自回归估计的输出联合命令的控制网络，和一个重构图片的 VAE-GAN 自编码器，其中编码器（感知系统）为控制网络提供状态特征信息。</p> <h3 id="next-state-prediction">Next State prediction</h3> <p>如果说 LLM 的 Next World prediction 是一个将预测功能发挥到极致效果的表现，那 Next State prediction 就是将预测与感知相结合，来解决信息量密集型数据（如图像）的高效学习。类似于现在我们已经完成了对互联网全部文本数据的学习，但就算是文本也可以用隐层状态预测来极大地提高效率。</p> <p>下面是我针对视频预测的一个简单构想的框架示意图：</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/next_word_prediction.drawio-480.webp 480w,/assets/img/world-model/next_word_prediction.drawio-800.webp 800w,/assets/img/world-model/next_word_prediction.drawio-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/next_word_prediction.drawio.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Next State 预测构想 </div> <p>这个构想非常简单直接，就是结合自编码器的感知能力和自回归模型的预测能力，所以有很多相似的想法可以参考：</p> <ul> <li>Joint Embedding Predictive Architecture（JEPA）</li> 
<li>Emu3.5（我都想入职了：）</li> </ul> <p>by the way, 感觉自己很多思路和出发点都能在 LeCun 老爷子的世界模型那里获得认同感（见“通往自主机器智能的道路”），而且我也同样没有能力把想法给工程化：）。</p> <h3 id="v-jepa-2-ac自监督视频模型实现理解预测和规划">V-JEPA 2-AC：自监督视频模型实现理解、预测和规划</h3> <p>这里在感知和预测的基础上，增加了决策的影响，虽然不是直接给出行动（这可能还需要评估模块的引入以及强化学习，我将在下一篇博客中具体介绍：），而是有监督的学习什么样的动作会演变为下一时刻的状态。所以最终实现了对训练数据中动作的简单模仿：）（如果有误，欢迎指正）。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/V-JEPA2-AC-480.webp 480w,/assets/img/world-model/V-JEPA2-AC-800.webp 800w,/assets/img/world-model/V-JEPA2-AC-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/V-JEPA2-AC.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> V-JEPA 2-AC </div> <h2 id="前沿方向空间智能-spatial-intelligence">前沿方向：空间智能 (Spatial Intelligence)</h2> <p>斯坦福李飞飞教授团队（World Labs）最近提出的<strong>空间智能 (Spatial Intelligence)</strong> 概念，为世界模型提供了一个极佳的落地场景。</p> <h3 id="1-视觉--语言">1. 视觉 &gt; 语言？</h3> <p>李飞飞教授指出，相比于语言，<strong>视觉（Vision）</strong> 是更为基础的生物本能。</p> <blockquote> <p>Perception and action became the core loop driving the evolution of intelligence.</p> </blockquote> <p>从寒武纪大爆发开始，<strong>“感知-行动”</strong> 的循环就是推动智能进化的核心动力。没有语言的动物依然可以通过视觉理解物理世界的规则（如重力、空间遮挡）并做出决策。因此，构建 AGI 的下一步，不应仅仅局限于 LLM 的文本逻辑，更需要让 AI 拥有<strong>“空间认知”</strong> 能力。</p> <h3 id="2-核心定义">2. 核心定义</h3> <p>她定义的空间智能模型需要具备：</p> <blockquote> <p>building frontier models that can perceive, generate, reason, and interact with the 3D world.</p> </blockquote> <p>这与我上述的“五位一体”定义不谋而合：</p> <ul> <li><strong>Perceive (感知)</strong>：理解 3D 空间结构。</li> <li><strong>Generate (预测/生成)</strong>：想象未来的可能性。</li> <li><strong>Reason (评估/记忆)</strong>：进行因果推理。</li> <li><strong>Interact (决策)</strong>：与物理世界进行交互。</li> </ul> <h3 id="3-marble从生成视频到生成世界">3. 
Marble：从“生成视频”到“生成世界”</h3> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/Marble-480.webp 480w,/assets/img/world-model/Marble-800.webp 800w,/assets/img/world-model/Marble-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/Marble.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Marble：持久化 3D 世界生成 </div> <p>World Labs 推出的首款产品 <strong>Marble</strong>，展示了空间智能与普通视频生成的关键区别：</p> <ul> <li><strong>空间一致性 (Spatial Consistency)</strong>：Sora 等视频生成模型往往存在“空间崩坏”的问题（如人走着走着消失了，或者透视关系错误）。而空间智能要求模型内部有一个显式的、符合物理规律的 3D 表达（Hidden State）。</li> <li><strong>持久性 (Persistence)</strong>：生成的不是稍纵即逝的像素帧，而是一个可以被存储、编辑、反复进入的<strong>持久化 3D 世界</strong>。</li> </ul> <p>这种能力让 AI 从“画师”变成了“造物主”，能够构建一个可交互的虚拟实验场。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/Long-Context_State-Space_Video_World_Models-480.webp 480w,/assets/img/world-model/Long-Context_State-Space_Video_World_Models-800.webp 800w,/assets/img/world-model/Long-Context_State-Space_Video_World_Models-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/Long-Context_State-Space_Video_World_Models.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 长时序状态空间视频世界模型 </div> <p>当然“生成视频”的路子也有解决思路，那就是引入<strong>长期记忆</strong>，从早期的 ConvLSTM，到最新的 State-Space Model，甚至我之前设计的 Truncated Recurrent Transformer 都是要做这样的时序一致性推理。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/Long-Context_State-Space_Model_architecture-480.webp 480w,/assets/img/world-model/Long-Context_State-Space_Model_architecture-800.webp 800w,/assets/img/world-model/Long-Context_State-Space_Model_architecture-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/Long-Context_State-Space_Model_architecture.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 长时序状态空间模型架构 </div> <hr/> <h2 id="核心特征像人类一样学习">核心特征：像人类一样学习</h2> <p>世界模型与传统深度学习系统的一个显著区别在于<strong>学习路径</strong>。现有的深度学习（DL）效率极低，依赖海量标注数据（Supervised Learning）或试错（RL）。未来十年，AI 的学习方式可能有一些本质的改变：</p> <ol> <li><strong>抽象学习（Abstract Learning）</strong>：像人类医生看 MRI 影像一样，AI 将学会利用“空间常识”和“抽象概念”进行学习，而非死记硬背像素点或者下一个单词。</li> <li><strong>持续学习（Continual Learning）</strong>：<strong>从“通用”到“进化”</strong>：我们不应追求一个出厂即巅峰的 AGI，而应追求像人类一样能不断适应环境、持续进化的 <strong>Evolving Intelligence</strong>。</li> <li><strong>时间感知</strong>：现实世界中，时间的流逝是唯一的物理真理。未来的模型（无论是 CNN 还是 Transformer）最终都要加上类似 <strong>LSTM 的 RNN 体质</strong>。如果模型无法从结构上感知到时间，就无法理解熵增与因果，也就无法诞生真正的“硅基生命”。</li> </ol> <p>通过 RNN 类的架构，模型将具备<strong>时序因果的长期记忆</strong>。这不仅能解决“长度外推”问题，更能让 AI 在物理世界的单向时间流中，通过持续的 Training Step 和状态保留，像生物一样积累经验，而非每次重启都被“格式化”。</p> <hr/> <h2 id="实践案例智能电磁博弈intelligent-electromagnetic-game">实践案例：智能电磁博弈（Intelligent Electromagnetic Game）</h2> 
<p>为了证明这个框架不仅仅适用于生成视频或玩游戏，我以一个更“硬核”的领域——<strong>雷达与干扰的博弈</strong>为例，展示世界模型是如何在信号处理领域落地的。</p> <div class="row mt-3"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/world-model/intelligent_electromagnetic_game-480.webp 480w,/assets/img/world-model/intelligent_electromagnetic_game-800.webp 800w,/assets/img/world-model/intelligent_electromagnetic_game-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/world-model/intelligent_electromagnetic_game.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> 智能电磁博弈示意 </div> <p>在我的硕士论文研究中，构建了一个基于深度强化学习的<strong>雷达智能体</strong>，它完整地体现了世界模型的闭环：</p> <ol> <li><strong>感知与记忆</strong>：利用 <strong>Conv-LSTM</strong> 处理连续脉冲回波，不仅提取当前特征，还保留了历史脉冲的长期记忆。</li> <li><strong>决策（Action）</strong>：雷达的发射波形不再是固定的，而是由策略网络 $\pi(o_{t-1})$ 根据历史观测生成的。</li> <li><strong>评估（Value）</strong>：构建价值网络 $V(o_t)$，预测当前波形在未来对抗中的长期检测回报。</li> <li><strong>博弈（World）</strong>：雷达与干扰机（环境）在<strong>全微分</strong>的链路中进行对抗训练。</li> </ol> <p>在这个模型中，<strong>发射波形（决策）-&gt; 环境干扰（反馈）-&gt; 回波检测（感知/评估）</strong> 形成了一个完整的端到端闭环。</p> <hr/> <h2 id="结束语">结束语</h2> <p>我认为历史上众多璀璨的想法和技术，无论是强化学习、元学习、自回归预测、压缩感知等学习方式，还是 RNN、ResNet、Transformer 等具体的模型结构，亦或者是模型架构或者训练超参数搜索等等技术，它们都有自己的可取之处。我也相信未来 AGI 的构建需要这些智慧结晶，而对于现在极致工业化和商用流行的 LLM 也不会嫌弃，它们都是构成这个宏大“世界模型”拼图的一部分。</p> <hr/> <p>系列导航</p> <ul> <li>下一篇：<a href="/blog/2019/intelligent-radar/">世界模型（二）：智能电磁博弈</a></li> </ul>]]></content><author><name></name></author><category term="research"/><category term="AGI"/><category term="Agent"/><category term="World-Model"/><category term="Memory"/><category term="Perception"/><category term="Prediction"/><category term="RL"/><summary type="html"><![CDATA[区分“模拟世界”与“理解世界”，阐述世界模型五大模块的统一可微框架，并讨论空间智能、抽象学习与长期记忆等前沿方向。]]></summary></entry></feed>