This write-up is based on TianXing Chen's GitHub repo: embodied-ai-guide.
Its main purpose is to help me build a cognitive framework that matches the group's work and to guide me as I gradually fill it in. Before reading, it's best to have a deep learning background, with some familiarity with gradient descent, backpropagation, Transformer, ViT, LLM, and so on.
Building the overall cognitive framework (paper foundations); white nodes mark the main line:
flowchart TB
%% =========================
%% Style defs
%% =========================
classDef core fill:#ffffff,stroke:#111111,stroke-width:3px,color:#111;
classDef wmcore fill:#e6f0ff,stroke:#003d99,stroke-width:2px,color:#111;
%% =========================================================
%% Tier 0 Foundations
%% =========================================================
subgraph T0["Tier 0 Foundations"]
ATT["Attention Is All You Need"]
VIT["ViT An Image is Worth 16x16 Words"]
EFF["EfficientNet Rethinking Model Scaling"]
FILM["FiLM Visual Reasoning with a General Conditioning Layer"]
TOK["TokenLearner What Can 8 Learned Tokens Do"]
PERC["Perceiver General Perception with Iterative Attention"]
PERCIO["Perceiver IO General Architecture for Structured Inputs and Outputs"]
CLIP["CLIP Learning Transferable Visual Models from Natural Language Supervision"]
SIGLIP["Sigmoid Loss for Language Image Pre-Training"]
DINO["DINOv2 Learning Robust Visual Features without Supervision"]
TTT["Test Time Training with Self Supervision under Distribution Shifts"]
FLAN["Finetuned Language Models Are Zero Shot Learners"]
INSTRUCTGPT["Training language models to follow instructions with human feedback"]
end
ATT --> VIT
ATT --> PERC
PERC --> PERCIO
VIT --> CLIP
VIT --> DINO
CLIP -.-> SIGLIP
ATT -.-> TTT
INSTRUCTGPT -.-> FLAN
%% =========================================================
%% Tier 1 World Models and Model Based RL
%% =========================================================
subgraph T1W["Tier 1 World Models and Model Based RL"]
WM["World Models"]
PLANET["Learning Latent Dynamics for Planning from Pixels"]
DREAMER["Dreaming Model based RL by Latent Imagination without Reconstruction"]
MUZERO["Mastering Atari Go Chess and Shogi by Planning with a Learned Model"]
EBM["Introduction to Latent Variable Energy Based Models"]
end
WM --> PLANET
PLANET --> DREAMER
WM -.-> MUZERO
EBM -.-> WM
%% =========================================================
%% Tier 1 Generative Modeling
%% =========================================================
subgraph T1G["Tier 1 Generative Modeling"]
DDPM["Denoising Diffusion Probabilistic Models"]
DDIM["Denoising Diffusion Implicit Models"]
ADM["Diffusion Models Beat GANs on Image Synthesis"]
CFG["Classifier Free Diffusion Guidance"]
LDM["High Resolution Image Synthesis with Latent Diffusion Models"]
CTRL["Adding Conditional Control to Text to Image Diffusion Models"]
DIT["Scalable Diffusion Models with Transformers"]
SVD["Stable Video Diffusion Scaling Latent Video Diffusion Models"]
FM["Flow Matching for Generative Modeling"]
VAR["Visual Autoregressive Modeling Next Scale Prediction"]
MAMMOTH["MammothModa2 Unified AR Diffusion Framework"]
end
DDPM --> DDIM
DDPM --> ADM
DDPM --> CFG
DDIM --> LDM
CFG --> LDM
LDM --> CTRL
LDM --> SVD
ATT --> DIT
DDPM --> DIT
ATT --> VAR
VAR --> MAMMOTH
DDPM -.-> MAMMOTH
FM -.-> MAMMOTH
%% =========================================================
%% Tier 1 Robotics and Multimodal Bases
%% =========================================================
subgraph T1R["Tier 1 Robotics and Multimodal Bases"]
PERACT["Perceiver Actor A Multi Task Transformer for Robotic Manipulation"]
SAYCAN["Do As I Can Not As I Say"]
VOX["VoxPoser Composable 3D Value Maps"]
RT1["RT 1 Robotics Transformer"]
PALME["PaLM E An Embodied Multimodal Language Model"]
PALIX["PaLI X Scaling up a Multilingual Vision and Language Model"]
RT2["RT 2 Vision Language Action Models"]
OXE["Open X Embodiment Datasets and RT X Models"]
ALOHA["Learning Fine Grained Bimanual Manipulation with Low Cost Hardware"]
end
PERCIO --> PERACT
SAYCAN -.-> VOX
EFF --> RT1
FILM --> RT1
TOK --> RT1
ATT -.-> RT1
PALME --> RT2
PALIX --> RT2
RT1 --> RT2
RT2 -.-> OXE
OXE -.-> RT1
ALOHA -.-> OXE
%% =========================================================
%% Tier 2 VLM and VLA Systems
%% =========================================================
subgraph T2["Tier 2 VLM and VLA Systems"]
VITUN["Visual Instruction Tuning"]
DEEPSTACK["DeepStack Deeply Stacking Visual Tokens"]
QWEN["Qwen3 VL Technical Report"]
PRISM["Prismatic VLMs Design Space"]
OPENVLA["OpenVLA An Open Source VLA"]
PI0["pi0 A Vision Language Action Flow Model"]
DP["Diffusion Policy Action Diffusion"]
DP3["3D Diffusion Policy"]
OCTO["Octo Open Source Generalist Robot Policy"]
HYBRID["HybridVLA Diffusion plus Autoregression"]
MANUAL["ManualVLA Chain of Thought Manual Generation"]
FIS["Fast in Slow Dual System Foundation Model"]
LIFT3D["Lift3D Foundation Policy"]
CORDVIP["CordViP Correspondence based Visuomotor Policy"]
MLA["MLA Multisensory Language Action Model"]
SOFAR["SoFar Language Grounded Orientation"]
ACTIVEUMI["ActiveUMI Active Perception from Robot Free Human Demonstrations"]
GRASPVLA["GraspVLA Billion scale Synthetic Action Data"]
WOW["WoW World omniscient World model Through Embodied Interaction"]
end
INSTRUCTGPT -.-> VITUN
FLAN -.-> VITUN
DEEPSTACK --> QWEN
VITUN -.-> QWEN
CLIP --> PRISM
DINO --> PRISM
SIGLIP -.-> PRISM
PRISM --> OPENVLA
OXE --> OPENVLA
FM --> PI0
OXE -.-> PI0
DDPM --> DP
DP --> DP3
DP --> OCTO
OXE -.-> OCTO
OPENVLA --> HYBRID
DP --> HYBRID
VAR -.-> HYBRID
OPENVLA --> MANUAL
OPENVLA --> FIS
DINO -.-> LIFT3D
LIFT3D --> CORDVIP
LIFT3D -.-> MLA
DP3 -.-> MLA
VOX -.-> SOFAR
ALOHA -.-> ACTIVEUMI
OPENVLA -.-> GRASPVLA
PI0 -.-> GRASPVLA
DREAMER -.-> WOW
MUZERO -.-> WOW
DIT -.-> WOW
SVD -.-> WOW
%% =========================================================
%% Highlight classes
%% =========================================================
class ATT,VIT,CLIP,DINO,INSTRUCTGPT,DDPM,FM,VAR,PRISM,OPENVLA,DP,DP3,OCTO,HYBRID,LIFT3D,MLA,OXE,RT1,RT2 core;
class WM,PLANET,DREAMER,MUZERO wmcore;
Algorithms
a. Tools
Point cloud downsampling:
Common approaches:
1. Random sampling (O(1), but the quality is poor; barely usable).
2. Uniform sampling (partition the cloud into voxel grids, then sample within each voxel).
3. FPS (Farthest Point Sampling: repeatedly pick the point farthest from the already-selected set; it keeps plenty of structural and edge information, but the computational cost is high). Judging from the papers, both PointNet and MLA run FPS before feeding points into the network, so it seems to be the de facto best choice; a minimal sketch follows this list.
4. Normal Space Sampling: sample in the space of surface normals so that the normals are covered uniformly; this compresses flat regions and keeps the salient ones. In essence it is uniform sampling in normal space.
5. Denoising.
6. Learning-based downsampling: the sampler is embedded in the neural network and trained end-to-end by gradient descent, fully differentiable (personally I find this the most elegant). But there is no ground truth, so the loss is hard to design; relying only on the downstream task loss makes it hard to reach the optimum directly, and the approach is extremely data-driven, so it may underperform when data is scarce, whereas FPS does not depend on data at all. Even so, SampleNet can be seen as the future trend.
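A minimal NumPy sketch of FPS (the shapes and the loop structure are my own illustration, not any particular library's implementation); it also shows where the O(N·k) cost comes from, since every new sample triggers a distance update over all N points:

import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Naive FPS: points of shape (N, 3) -> indices of k sampled points, O(N*k)."""
    n = points.shape[0]
    selected = np.zeros(k, dtype=np.int64)
    dist = np.full(n, np.inf)          # distance of every point to the sampled set
    selected[0] = 0                    # start from an arbitrary (here: the first) point
    for i in range(1, k):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))  # update with the newest sample
        selected[i] = int(np.argmax(dist))                          # farthest remaining point
    return selected

# usage: idx = farthest_point_sampling(cloud, 1024); sampled = cloud[idx]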
Hand-eye calibration
Common approaches
1. FishROS: fairly traditional. Attach a calibration board (e.g. an ArUco / ARToolKit marker), move the robot through 15-20 poses, record the arm's joint states and the board corner coordinates seen by the camera, then plug everything into AX = XB and solve for X (a small synthetic-data sketch follows the note below).

Note on the calibration board: the camera image is 2D and subject to lens distortion; the board fixes the scale, establishes a coordinate frame, and lets you correct the distortion.
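A self-contained sketch of the AX = XB step using OpenCV's cv2.calibrateHandEye. The poses below are synthetic stand-ins for what you would actually record (gripper-to-base from forward kinematics, target-to-camera from solvePnP on the board), so it only illustrates the solver call, not the data collection:

import cv2
import numpy as np

rng = np.random.default_rng(0)

def random_pose():
    R, _ = cv2.Rodrigues(rng.normal(size=3))
    return R, rng.normal(size=(3, 1))

R_x, t_x = random_pose()      # ground-truth hand-eye transform X (camera -> gripper)
R_bt, t_bt = random_pose()    # fixed calibration-board pose in the base frame

R_g2b, t_g2b, R_t2c, t_t2c = [], [], [], []
for _ in range(15):           # 15 robot poses, as in the procedure above
    R_g, t_g = random_pose()  # gripper -> base, from forward kinematics
    # target -> camera = X^-1 * (gripper -> base)^-1 * (target -> base)
    R_tc = R_x.T @ R_g.T @ R_bt
    t_tc = R_x.T @ (R_g.T @ (t_bt - t_g) - t_x)
    R_g2b.append(R_g); t_g2b.append(t_g)
    R_t2c.append(R_tc); t_t2c.append(t_tc)

R_est, t_est = cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c,
                                    method=cv2.CALIB_HAND_EYE_TSAI)
# R_est, t_est should recover X up to numerical error on this noiseless data
print(np.allclose(R_est, R_x, atol=1e-4), np.allclose(t_est, t_x, atol=1e-4))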
2. EasyHeC: uses the arm's virtual assets (mesh / URDF) and aligns them to the camera view via backpropagation.
Pipeline:
Final output: a 4x4 homogeneous transformation matrix. Using the camera intrinsics and the hand-eye extrinsics, 2D pixels from the captured image (lifted to homogeneous coordinates) can be back-projected into 3D to recover real-world 3D coordinates (a minimal back-projection sketch follows this item).

The only drawback is the dependence on mesh / URDF virtual assets, but these are fairly easy to find for mainstream arms; at the very least I have already located the official assets for the FR3 used in our lab.
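A minimal sketch of that back-projection, assuming a pinhole model, a known depth for the pixel, and a precomposed camera-to-base 4x4 transform T_cam2base (gripper-to-base from the robot state composed with the hand-eye result); all names here are my own:

import numpy as np

def pixel_to_base(u, v, d, K, T_cam2base):
    # 2D pixel + depth -> 3D point in the camera frame (pinhole intrinsics K)
    x = (u - K[0, 2]) / K[0, 0] * d
    y = (v - K[1, 2]) / K[1, 1] * d
    p_cam = np.array([x, y, d, 1.0])        # homogeneous coordinates
    return (T_cam2base @ p_cam)[:3]         # camera frame -> robot base frame

K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
print(pixel_to_base(400, 300, 0.8, K, np.eye(4)))   # toy intrinsics, identity extrinsics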
b. visual models
1. CLIP / SigLIP (semantic understanding: what is it?): image → tokens → text. Robotics mostly uses SigLIP (OpenVLA does, for example), because it turns the single classification problem into a set of independent binary classifications scored with a sigmoid. Overall, these two focus on understanding the semantics of an image (a small sketch of the sigmoid loss appears at the end of this subsection).
2. DINOv1/v2/v3 (visual features: what does it look like?): the extracted features carry rich geometric information (depth, shape, correspondences), which makes them well suited to grasping.
3. SAM1/2/3 (pixel segmentation: where is it in the image?): performs segmentation and retrieval, yielding precise contours.
4. Point Transformer V3: processes point clouds extremely efficiently via serialization and extracts features.
That said, MLA offers an encoder-free path: the first few LLM layers learn the capabilities of the visual models above, making the pipeline end-to-end. My view: although this is more efficient, before scaling up it probably underperforms pretrained encoders; once scaled up, the picture should change.
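A tiny NumPy sketch of the SigLIP idea mentioned in point 1 (simplified from the paper; the temperature t and bias b are learnable there, and I fix them to typical initial values here): every image-text pair in the batch is scored independently with a sigmoid, diagonal pairs as positives and everything else as negatives, instead of CLIP's batch-wise softmax:

import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b                 # (B, B) pairwise similarity scores
    labels = 2.0 * np.eye(len(img)) - 1.0        # +1 on the diagonal, -1 elsewhere
    # independent binary classification for every pair: -mean log sigmoid(label * logit)
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-labels * logits))))

rng = np.random.default_rng(0)
print(siglip_loss(rng.normal(size=(4, 16)), rng.normal(size=(4, 16))))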
c. Robot Learning
1. MPC:
2. RL: CS285, and Hung-yi Lee's open course
3. Imitation Learning:
VLA: all kinds of vision-to-action models, organized below mainly from the viewpoint of information flow.
a. Action Chunking Transformer (ACT): vision encoder + transformer encoder-decoder (a decoder sketch in code follows the diagram)
Representative work: ALOHA
graph TD
%% Inputs
IMG[Image Observation] --> V_ENC[Vision Encoder: SigLIP/DINO]
STATE[Robot State] --> S_PROJ[State MLP]
%% Encoder internals
subgraph Transformer_Encoder [Transformer Encoder]
V_ENC --> V_TOKENS[Visual Tokens]
S_PROJ --> S_TOKENS[State Token]
V_TOKENS & S_TOKENS --> CAT_ENC[Concatenate]
POS_ENC[Positional Embedding 2D+1D, learned and updated via backprop] -->|add| CAT_ENC
CAT_ENC --> SELF_ATT_E[Self-Attention Layers]
end
%% Decoder internals
subgraph Transformer_Decoder [Transformer Decoder]
QUERIES[Learned Action Queries <br/> 1...N steps] --> POS_DEC[Positional Embedding 1D]
POS_DEC -->|add| SELF_ATT_D[Self-Attention Layers]
SELF_ATT_E -->|as K, V| CROSS_ATT[Cross-Attention Layers]
SELF_ATT_D -->|as Q| CROSS_ATT
end
%% Outputs
CROSS_ATT --> MLP_HEAD[Action Head]
MLP_HEAD --> ACTIONS[Action Chunk: a1, a2, ... aN]
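A hedged PyTorch sketch of the decoder half of the diagram (hyperparameters and shapes are my own assumptions, not the official ACT code): N learned action queries cross-attend to the encoder memory, and an MLP head maps each query to one action, so the whole chunk a_1..a_N comes out in a single forward pass:

import torch
import torch.nn as nn

class ActionChunkDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, chunk_size=20, act_dim=7):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(chunk_size, d_model))   # learned action queries
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)           # self- + cross-attention
        self.head = nn.Linear(d_model, act_dim)                         # action head

    def forward(self, memory):                      # memory: (B, n_tokens, d_model) from the encoder
        q = self.queries.unsqueeze(0).repeat(memory.size(0), 1, 1)
        return self.head(self.decoder(q, memory))   # (B, chunk_size, act_dim) action chunk

# usage: actions = ActionChunkDecoder()(torch.randn(2, 300, 256))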
b. VLA (autoregressive, discrete action tokens)
Representative work: RT-2 / OpenVLA (a de-tokenization sketch follows the diagram)
graph TD
%% Style definitions
%% 1. Input layer
subgraph Inputs [Multimodal Inputs]
IMG[Image Observation <br/> 224x224]:::vision
TXT["Text Instruction <br/> 'Put the cup in the sink'"]:::text
end
%% 2. Encoding layer: vision enters through the side door, text through the front door
subgraph Encoding [Tokenization & Projection]
%% Vision path
IMG --> V_ENC[Vision Encoder <br/> SigLIP / ViT]:::vision
V_ENC --> PATCHES[Patch Features]:::vision
PATCHES --> PROJ["Projector / MLP <br/> dimension alignment: 768 -> 4096"]:::vision
PROJ --> V_EMBEDS["Visual Embeddings <br/> (Soft Tokens)"]:::vision
%% Text path
TXT --> TOKENIZER[Tokenizer]:::text
TOKENIZER --> T_EMBEDS[Text Embeddings]:::text
end
%% 3. Concatenation
V_EMBEDS & T_EMBEDS --> CONCAT["Input Sequence <br/> [Img_Tokens, Text_Tokens]"]
%% 4. Brain: decoder-only autoregression
subgraph Backbone [Transformer Decoder LLM]
CONCAT --> DECODER[Llama 2 / PaLM Decoder Layers]:::llm
DECODER -->|Causal Self-Attention| DECODER
DECODER -->|Autoregressive Generation| NEXT_TOKEN[Predict Next Token Logits]:::llm
end
%% 5. Output layer: routing
subgraph Outputs [Output Processing]
NEXT_TOKEN -->|Argmax/Sample| GEN_TOKEN[Generated Token ID]
GEN_TOKEN -- "regular word" --> COT["Text Output / CoT <br/> 'I will pick up...'"]:::text
GEN_TOKEN -- "action bin" --> UNBIN["De-Tokenization <br/> <BIN_255> -> 0.99"]:::action
UNBIN --> VECTOR["Action Vector <br/> 7-DoF: x,y,z,r,p,y,gripper"]:::action
end
%% Autoregressive loop
GEN_TOKEN -.->|Append back to input| CONCAT
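A small sketch of the de-tokenization branch in the diagram, assuming 256 uniform bins over [-1, 1] mapped back to bin centers (RT-2 uses uniform bins; OpenVLA derives bin edges from data statistics, so treat the range as illustrative):

import numpy as np

N_BINS = 256

def bins_to_action(bin_ids, low=-1.0, high=1.0):
    """Map generated action-bin token ids back to continuous values (bin centers)."""
    bin_ids = np.asarray(bin_ids, dtype=np.float64)
    width = (high - low) / N_BINS
    return low + (bin_ids + 0.5) * width

# 7 generated bin tokens -> 7-DoF action (x, y, z, roll, pitch, yaw, gripper);
# bin 255 maps to ~0.996, matching the "<BIN_255> -> 0.99" arrow above
print(bins_to_action([255, 128, 0, 64, 192, 32, 255]))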
c. Diffusion models: diffusion-based, outputting continuous actions (a denoising-loop sketch follows the diagram)
Classic work: Diffusion Policy, RDT-1B
graph TD
%% Style definitions
%% --- Input modalities (Conditioning) ---
subgraph Inputs [Multimodal Conditioning Inputs]
direction LR
IMG([RGB Images<br/>Multi-view / Wrist]) -->|SigLIP| VIS_ENC[Vision Encoder<br/>SigLIP]:::vision
INST([Language Instruction]) -->|T5| LANG_ENC[Language Encoder<br/>T5-XXL]:::lang
PROP([Proprioception<br/>Joint Pos / Vel]) -->|MLP| STATE_ENC[State Projector]:::prop
end
%% --- Diffusion process (Denoising) ---
subgraph Diffusion_Process [Diffusion Process]
NOISE([Gaussian Noise<br/>Action Chunk ã_k]) --> ACT_PROJ[Action Projector]:::action
TIME([Timestep k]) --> TIME_MLP[Time Embedding]:::action
end
%% --- Core backbone: DiT ---
subgraph RDT_Backbone [RDT Backbone: Diffusion Transformer]
%% Token fusion
VIS_ENC & LANG_ENC & STATE_ENC --> CONCAT[Unified Multimodal Tokens]
%% DiT internal logic
subgraph DiT_Block [DiT Block x N]
ACT_PROJ & TIME_MLP --> ADALN[Adaptive Layer Norm<br/>Modulation]
ADALN --> SELF[Self-Attention<br/>Modeling Temporal Dependencies]
CONCAT --> CROSS[Cross-Attention<br/>Conditioning on Inputs]
SELF --> CROSS
CROSS --> FFN[Feed Forward Network]
end
end
%% --- Output ---
FFN --> PREDICT[Noise Prediction / Denoised Action]
PREDICT --> DECODER[Physically Interpretable<br/>Unified Action Space]:::action
DECODER --> OUTPUT([Final Action Trajectory<br/>64 Steps]):::action
%% Link style
linkStyle default stroke:#333,stroke-width:1.5px;
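A minimal PyTorch sketch of the denoising loop behind this kind of policy (generic DDPM reverse sampling; eps_model stands in for the conditioned noise-prediction network in the diagram, and the shapes and schedule are my own assumptions):

import torch

def sample_action_chunk(eps_model, obs_cond, horizon=16, act_dim=7, K=100):
    betas = torch.linspace(1e-4, 0.02, K)                # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    a = torch.randn(1, horizon, act_dim)                 # start from Gaussian noise
    for k in reversed(range(K)):
        eps = eps_model(a, torch.tensor([k]), obs_cond)  # predict the noise that was added
        a = (a - betas[k] / torch.sqrt(1.0 - alpha_bar[k]) * eps) / torch.sqrt(alphas[k])
        if k > 0:
            a = a + torch.sqrt(betas[k]) * torch.randn_like(a)   # stochastic part of the step
    return a                                             # denoised action chunk

# usage (dummy network): chunk = sample_action_chunk(lambda a, k, c: torch.zeros_like(a), None)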
d. LLM+Diffusion
Representative work: Octo, pi0 (an Euler-integration sketch follows the diagram)
graph TD
%% --- Input modalities (VLM Context) ---
subgraph Inputs [Multimodal Inputs & VLM Encoding]
direction LR
IMG([RGB Images<br/>Multi-view]) -->|SigLIP| VIS_ENC[Vision Features<br/>SigLIP]:::vision
INST([Language Instruction]) -->|Gemma| LANG_ENC[Text Features<br/>Gemma]:::lang
PROP([Proprioception<br/>Joint State]) -->|MLP| STATE_ENC[State Tokens]:::prop
end
%% --- Flow Matching process (replaces the diffusion process) ---
subgraph Flow_Process [Flow Matching Process]
NOISE([Noisy Action<br/>x_t]) --> ACT_PROJ[Action Projector]:::action
TIME([Timestep t]) --> TIME_MLP[Time Embedding]:::action
end
%% --- Core backbone: VLM + Action Head ---
subgraph Pi0_Backbone [Pi-Zero Architecture]
%% 1. VLM semantic understanding
subgraph VLM_Backbone [VLM Backbone: PaliGemma]
VIS_ENC & LANG_ENC & STATE_ENC --> INTERLEAVE[Interleaved Sequence<br/>Vision + Text + State]
INTERLEAVE --> VLM_LAYERS[VLM Transformer Layers<br/>Physical Reasoning]:::core
VLM_LAYERS --> CONTEXT[Rich Context Embeddings]
end
%% 2. Flow Matching action generation
subgraph Action_Head [Flow Matching Action Head]
ACT_PROJ & TIME_MLP --> ADALN[Adaptive Layer Norm<br/>Time Conditioning]
ADALN --> SELF_ATT[Self-Attention<br/>Action Consistency]
CONTEXT --> CROSS_ATT[Cross-Attention<br/>Conditioning on VLM Context]
SELF_ATT --> CROSS_ATT
CROSS_ATT --> FFN[Feed Forward Network]
end
end
%% --- Output ---
FFN --> PREDICT[Vector Field Prediction<br/>Velocity v_t]
PREDICT --> INTEGRATOR[ODE Solver<br/>Euler / Heun Integration]:::action
INTEGRATOR --> OUTPUT([Recovered Action Trajectory<br/>Fast Inference < 10 Steps]):::action
%% Link style
linkStyle default stroke:#333,stroke-width:1.5px;
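A minimal sketch of the pi0-style inference loop at the bottom of the diagram (the interface of v_model is hypothetical): the action expert predicts a velocity field, and a handful of Euler steps integrate the ODE from Gaussian noise to the action chunk:

import torch

def flow_matching_sample(v_model, context, horizon=50, act_dim=7, n_steps=10):
    x = torch.randn(1, horizon, act_dim)        # x_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1,), i * dt)
        x = x + dt * v_model(x, t, context)     # Euler step along the predicted velocity v_t
    return x                                    # x_1: the recovered action trajectory

# usage (dummy velocity field): traj = flow_matching_sample(lambda x, t, c: -x, None)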
VLAs built on fast/slow hierarchies (System 1 / System 2)
Computer Vision
CS231n: entering the deep learning world from the CV side is a very good choice. If you particularly like LLMs and NLP, CS224n might also work, but CS231n definitely matches the group far better, since vision is the sensory channel of embodied AI.
2D vision
3D vision
4D vision
5. Visual Prompting: better suited to OOD settings
6. Affordance Grounding
The goal of affordance grounding is to localize, from an image, the regions of an object that can be interacted with; it bridges perception and action and is an important piece of embodied AI. It requires the model not only to detect and recognize objects and their local structures, but also to understand the potential interactions between objects and humans or robots. For example, in robotic grasping, affordance grounding helps the model find the best grasping region on an object and thus determine the best grasp pose. By combining computer vision with multimodal large models, this direction can precisely localize interaction possibilities under weak supervision or in zero-shot settings, improving robot grasping, manipulation, and human-robot interaction (a tiny sketch of turning an affordance heatmap into a grasp pixel follows).
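A tiny sketch of the grasp-point selection described above, assuming some grounding model has already produced an affordance heatmap for the image (the interface and the top-k trick are my own illustration):

import numpy as np

def grasp_pixel_from_affordance(heatmap, k=50):
    """Pick a grasp pixel as the score-weighted centroid of the k highest-scoring pixels."""
    flat = np.argsort(heatmap, axis=None)[-k:]           # indices of the k best pixels
    vs, us = np.unravel_index(flat, heatmap.shape)
    w = heatmap[vs, us]
    v = int(round(float(np.average(vs, weights=w))))     # weighted centroid is more robust
    u = int(round(float(np.average(us, weights=w))))     # than a single argmax
    return u, v

heat = np.zeros((480, 640)); heat[200:210, 300:310] = 1.0
print(grasp_pixel_from_affordance(heat))                 # a pixel inside the bright patch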
7. RL
1. Westlake University, Mathematical Foundations of Reinforcement Learning by Shiyu Zhao: builds the basic picture of RL, covering the Bellman equation, Monte Carlo methods, and actor-critic, and makes the first move from a discrete to a probabilistic perspective. Professor Zhao's course is truly excellent! It is clear yet rigorous; I think understanding is the process of turning abstract symbols into intuitions inside your own head, and his lucid explanations and careful mathematical derivations let you build that full picture with little effort.
2. CS285, UC Berkeley Deep Reinforcement Learning: very hard, but extremely closely tied to the group's practice. From my own experience, if you have never studied RL systematically and only have a deep learning background, it is better to finish Professor Zhao's Westlake course before following this one. (Though I've only watched a small part so far.)
Computer Graphics
GAMES101
GAMES202
Important for simulation and sim2real.
…the middle part is skipped; the UAV section is not my focus…