Inspired by trajectory stitching, compositional planning and hierarchical planning methods

Updates (per Sat.)

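11/29: Survey similar works; find suitable VL-planning benchmarks
12/13: Vulkan does not seem able to render visualizations on the lpk-1 server (though headless mode still appears to work), so testing needs to run on a local Linux machine, while training and coding stay on the server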

SimplerEnv:

['google_robot_pick_coke_can', 'google_robot_pick_horizontal_coke_can', 'google_robot_pick_vertical_coke_can', 'google_robot_pick_standing_coke_can', 'google_robot_pick_object', 'google_robot_move_near_v0', 'google_robot_move_near_v1', 'google_robot_move_near', 'google_robot_open_drawer', 'google_robot_open_top_drawer', 'google_robot_open_middle_drawer', 'google_robot_open_bottom_drawer', 'google_robot_close_drawer', 'google_robot_close_top_drawer', 'google_robot_close_middle_drawer', 'google_robot_close_bottom_drawer', 'google_robot_place_in_closed_drawer', 'google_robot_place_in_closed_top_drawer', 'google_robot_place_in_closed_middle_drawer', 'google_robot_place_in_closed_bottom_drawer', 'google_robot_place_apple_in_closed_top_drawer', 'widowx_spoon_on_towel', 'widowx_carrot_on_plate', 'widowx_stack_cube', 'widowx_put_eggplant_in_basket']

Motivation

Can we mitigate error accumulation in long-horizon planning and delicate robot manipulation methods (e.g. video planning) by modeling temporal and multi-modal correlations with a compositional energy landscape?
- e.g. teleportation, physically implausible trajectories (such as passing through walls)
Train different energy functions to model the local and global planning objectives and correlations, including but not limited to:
- global goal (constraint) achievement, given the goal and the current state
- local temporal correlations, given a context window of consecutive frames of the same modality (i.e. image frames)
- local multi-modal correlations, given two consecutive frames and the motion between them
The energy landscapes can be trained jointly or separately.
The trained compositional landscape can then serve as an oracle to guide trajectory sampling and recovery, search, or RL methods for long-horizon planning.
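
One way to read the composition above is as a weighted sum of the three energy terms over a candidate trajectory. The sketch below is a toy illustration only: frames are short float vectors, the textual goal is replaced by a target frame, and the hand-written quadratic energies, `total_energy`, the window size, and the weights are all hypothetical stand-ins for the learned energy functions.

```python
from typing import Sequence

Frame = Sequence[float]  # toy stand-in for an image frame (e.g. a feature vector)


def sq_dist(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two toy frames."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def goal_energy(frame: Frame, goal: Frame) -> float:
    # Global term (assumed form): low when the last frame matches the goal.
    return sq_dist(frame, goal)


def temporal_energy(window: Sequence[Frame]) -> float:
    # Local temporal term (assumed form): low when consecutive frames
    # inside the context window change smoothly.
    return sum(sq_dist(window[i], window[i + 1]) for i in range(len(window) - 1))


def motion_energy(prev: Frame, nxt: Frame, action: Frame) -> float:
    # Local multi-modal term (assumed form): low when the motion between
    # two consecutive frames explains the observed frame change.
    predicted = [p + a for p, a in zip(prev, action)]
    return sq_dist(predicted, nxt)


def total_energy(
    frames: Sequence[Frame],
    actions: Sequence[Frame],
    goal: Frame,
    window: int = 3,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the goal, temporal, and multi-modal energies."""
    w_goal, w_temp, w_motion = weights
    e_goal = goal_energy(frames[-1], goal)
    # Sliding context windows; overlapping windows count some pairs more than once.
    n_windows = max(1, len(frames) - window + 1)
    e_temp = sum(temporal_energy(frames[i:i + window]) for i in range(n_windows))
    e_motion = sum(
        motion_energy(frames[i], frames[i + 1], actions[i])
        for i in range(len(actions))
    )
    return w_goal * e_goal + w_temp * e_temp + w_motion * e_motion


goal = [2.0, 0.0]
actions = [[1.0, 0.0], [1.0, 0.0]]
smooth = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
teleport = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]  # jumps straight to the goal

e_smooth = total_energy(smooth, actions, goal)
e_teleport = total_energy(teleport, actions, goal)
```

In this toy setup the teleporting trajectory reaches the goal but is penalized by the temporal and multi-modal terms, so it scores a higher total energy than the smooth one. A sampler or search procedure could use such a score to rank or reject candidates.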

Problem Setting

video planning: $p(\bold x_1, \bold x_2, ..., \bold x_t | \bold g, \bold x_0)$
- inputs - Given:
  - initial image frame $\bold x_0$
  - textual goal $\bold g$
- outputs - We want to get:
  - a video plan $\bold x_1, \bold x_2, ..., \bold x_t$
  - error-eliminated downstream motions: $\bold x_0, \bold x_1, \bold x_2, ..., \bold x_t \rightarrow\bold a_1, ... , \bold a_t$
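
The frame-to-motion mapping above pairs each action $\bold a_i$ with the consecutive frames $(\bold x_{i-1}, \bold x_i)$, so $t+1$ frames yield $t$ actions. The sketch below only illustrates that indexing: `plan_to_actions` and `toy_inverse_dynamics` are hypothetical names, and the frame-difference rule is a placeholder for whatever motion decoder is actually used.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # toy stand-in for an image frame
Action = List[float]


def toy_inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    # Placeholder assumption: the motion is the per-dimension frame difference.
    return [n - p for p, n in zip(prev, nxt)]


def plan_to_actions(
    frames: Sequence[Frame],
    inverse_dynamics: Callable[[Frame, Frame], Action] = toy_inverse_dynamics,
) -> List[Action]:
    """Map frames x_0..x_t to actions a_1..a_t, one per consecutive pair."""
    return [
        inverse_dynamics(frames[i - 1], frames[i])  # a_i from (x_{i-1}, x_i)
        for i in range(1, len(frames))
    ]


video_plan = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]  # x_0, x_1, x_2
plan_actions = plan_to_actions(video_plan)          # a_1, a_2
```

Because each action depends only on a pair of neighboring frames, an error in one frame of the video plan affects at most the two actions adjacent to it in this sketch.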

Training & Planning Algorithms