Physical manipulation planning with differentiable closed-loop manipulation primitives

Reuniting planning and reacting

Research Unit: 3

Project Number: 39

Disciplines:
Robotics

 

Principal Investigators:
Marc Toussaint

Doctoral Researchers:
Danny Driess

 

Project Duration
2021 - 2023


Courtesy of Marc Toussaint

Optimization-based Task and Motion Planning (TAMP) approaches show remarkable capabilities in finding paths given a scene description and (intuitive) physics models. However, the result of a TAMP algorithm is usually an open-loop trajectory, which, when executed in the real world, is likely to fail under disturbances or other sources of uncertainty. The objective of this project is to bring the generality and computational strength of our TAMP framework to real-world execution. Instead of trying to find controllers that execute a planning result, this project investigates a different, novel approach: Can the behavior of closed-loop (perception-based) control primitives be directly embedded into the planning framework itself such that the result is a sequence of reactive closed-loop control policies instead of open-loop paths? We aim to demonstrate the robustness of our control strategies to severe perturbations, e.g., human interventions, in real-world sequential manipulation tasks such as the escape room scenario.
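
To make the contrast between open-loop paths and closed-loop primitives concrete, the following toy sketch may help. It is not the project's TAMP framework: it assumes a one-dimensional single-integrator system, a single external push standing in for a human intervention, and hypothetical names (`simulate`, `open_loop`, `closed_loop`, `GOAL`) chosen only for illustration.

```python
GOAL = 1.0          # target position (illustrative assumption)
STEPS, DT = 50, 0.1  # horizon and time step (illustrative assumptions)


def simulate(policy, push_step=20, push=-0.5):
    """Roll out x' = u for STEPS steps with one external push."""
    x = 0.0
    for t in range(STEPS):
        x += policy(t, x) * DT
        if t == push_step:
            x += push  # disturbance, e.g. a human moving the object
    return x


def open_loop(t, x):
    """Velocity fixed at planning time; ignores the observed state x."""
    return GOAL / (STEPS * DT)


def closed_loop(t, x, gain=2.0):
    """Proportional feedback on the observed state x."""
    return gain * (GOAL - x)


open_err = abs(GOAL - simulate(open_loop))
closed_err = abs(GOAL - simulate(closed_loop))
print(f"open-loop final error:   {open_err:.3f}")
print(f"closed-loop final error: {closed_err:.3f}")
```

In this sketch the open-loop rollout ends off target by the full size of the push, while the feedback policy recovers because it reacts to the observed state at every step. A plan made of such reactive primitives, rather than of fixed trajectories, is the kind of result the project description refers to.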


uous%20motion%20planning%20such%20as%20nonlinear%20trajectory%20optimization.%20The%20action%20sequences%20represent%20the%20discrete%20decisions%20on%20a%20symbolic%20level%2C%20which%2C%20in%20turn%2C%20parameterize%20a%20nonlinear%20trajectory%20optimization%20problem.%20Owing%20to%20the%20great%20combinatorial%20complexity%20of%20possible%20discrete%20action%20sequences%2C%20a%20large%20number%20of%20optimization%5C%2Fmotion%20planning%20problems%20have%20to%20be%20solved%20to%20find%20a%20solution%2C%20which%20limits%20the%20scalability%20of%20these%20approaches.%20To%20circumvent%20this%20combinatorial%20complexity%2C%20we%20introduce%20deep%20visual%20reasoning%3A%20based%20on%20a%20segmented%20initial%20image%20of%20the%20scene%2C%20a%20neural%20network%20directly%20predicts%20promising%20discrete%20action%20sequences%20such%20that%20ideally%20only%20one%20motion%20planning%20problem%20has%20to%20be%20solved%20to%20find%20a%20solution%20to%20the%20overall%20TAMP%20problem.%20Our%20method%20generalizes%20to%20scenes%20with%20many%20and%20varying%20numbers%20of%20objects%2C%20although%20being%20trained%20on%20only%20two%20objects%20at%20a%20time.%20This%20is%20possible%20by%20encoding%20the%20objects%20of%20the%20scene%20and%20the%20goal%20in%20%28segmented%29%20images%20as%20input%20to%20the%20neural%20network%2C%20instead%20of%20a%20fixed%20feature%20vector.%20We%20show%20that%20the%20framework%20can%20not%20only%20handle%20kinematic%20problems%20such%20as%20pick-and-place%20%28as%20typical%20in%20TAMP%29%2C%20but%20also%20tool-use%20scenarios%20for%20planar%20pushing%20under%20quasi-static%20dynamic%20models.%20Here%2C%20the%20image-based%20representation%20enables%20generalization%20to%20other%20shapes%20than%20during%20training.%20Results%20show%20runtime%20improvements%20of%20several%20orders%20of%20magnitudes%20by%2C%20in%20many%20cases%2C%20removing%20the%20need%20to%20search%20over%20the%20discrete%20action%20sequences.%22%2C%22date%22%3A%2212%5C%2F20
21%22%2C%22section%22%3A%22%22%2C%22partNumber%22%3A%22%22%2C%22partTitle%22%3A%22%22%2C%22DOI%22%3A%2210.1177%5C%2F02783649211056967%22%2C%22citationKey%22%3A%22%22%2C%22url%22%3A%22https%3A%5C%2F%5C%2Fjournals.sagepub.com%5C%2Fdoi%5C%2F10.1177%5C%2F02783649211056967%22%2C%22PMID%22%3A%22%22%2C%22PMCID%22%3A%22%22%2C%22ISSN%22%3A%220278-3649%2C%201741-3176%22%2C%22language%22%3A%22en%22%2C%22collections%22%3A%5B%5D%2C%22dateModified%22%3A%222026-06-22T11%3A48%3A25Z%22%7D%7D%2C%7B%22key%22%3A%22H2UBVSA3%22%2C%22library%22%3A%7B%22id%22%3A6984777%7D%2C%22meta%22%3A%7B%22creatorSummary%22%3A%22Driess%20et%20al.%22%2C%22parsedDate%22%3A%222022%22%2C%22numChildren%22%3A1%7D%2C%22bib%22%3A%22%26lt%3Bdiv%20class%3D%26quot%3Bcsl-bib-body%26quot%3B%20style%3D%26quot%3Bline-height%3A%202%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%26quot%3B%26gt%3B%5Cn%20%20%26lt%3Bdiv%20class%3D%26quot%3Bcsl-entry%26quot%3B%26gt%3BDriess%2C%20D.%2C%20Schubert%2C%20I.%2C%20Florence%2C%20P.%2C%20Li%2C%20Y.%2C%20%26amp%3B%20Toussaint%2C%20M.%20%282022%29.%20Reinforcement%20Learning%20with%20Neural%20Radiance%20Fields.%20%26lt%3Bi%26gt%3BAdvances%20in%20Neural%20Information%20Processing%20Systems%20%28NeurIPS%29%26lt%3B%5C%2Fi%26gt%3B%2C%20%26lt%3Bi%26gt%3B35%26lt%3B%5C%2Fi%26gt%3B.%20%26lt%3Ba%20class%3D%26%23039%3Bzp-ItemURL%26%23039%3B%20href%3D%26%23039%3Bhttps%3A%5C%2F%5C%2Fproceedings.neurips.cc%5C%2Fpaper_files%5C%2Fpaper%5C%2F2022%5C%2Ffile%5C%2F6c294f059e3d77d58dbb8fe48f21fe00-Paper-Conference.pdf%26%23039%3B%26gt%3Bhttps%3A%5C%2F%5C%2Fproceedings.neurips.cc%5C%2Fpaper_files%5C%2Fpaper%5C%2F2022%5C%2Ffile%5C%2F6c294f059e3d77d58dbb8fe48f21fe00-Paper-Conference.pdf%26lt%3B%5C%2Fa%26gt%3B%26lt%3B%5C%2Fdiv%26gt%3B%5Cn%26lt%3B%5C%2Fdiv%26gt%3B%22%2C%22data%22%3A%7B%22itemType%22%3A%22conferencePaper%22%2C%22title%22%3A%22Reinforcement%20Learning%20with%20Neural%20Radiance%20Fields%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Danny%22%2C%22las
tName%22%3A%22Driess%22%7D%2C%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Ingmar%22%2C%22lastName%22%3A%22Schubert%22%7D%2C%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Pete%22%2C%22lastName%22%3A%22Florence%22%7D%2C%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Yunzhu%22%2C%22lastName%22%3A%22Li%22%7D%2C%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Marc%22%2C%22lastName%22%3A%22Toussaint%22%7D%5D%2C%22abstractNote%22%3A%22%22%2C%22proceedingsTitle%22%3A%22Advances%20in%20Neural%20Information%20Processing%20Systems%20%28NeurIPS%29%22%2C%22conferenceName%22%3A%22%22%2C%22date%22%3A%222022%22%2C%22eventPlace%22%3A%22%22%2C%22DOI%22%3A%22%22%2C%22ISBN%22%3A%22%22%2C%22citationKey%22%3A%22%22%2C%22url%22%3A%22https%3A%5C%2F%5C%2Fproceedings.neurips.cc%5C%2Fpaper_files%5C%2Fpaper%5C%2F2022%5C%2Ffile%5C%2F6c294f059e3d77d58dbb8fe48f21fe00-Paper-Conference.pdf%22%2C%22ISSN%22%3A%22%22%2C%22language%22%3A%22%22%2C%22collections%22%3A%5B%5D%2C%22dateModified%22%3A%222026-05-26T15%3A31%3A56Z%22%7D%7D%5D%7D
Zhou, H., Schubert, I., Toussaint, M., & Oguz, O. S. (2023). Spatial Reasoning via Deep Vision Models for Robotic Sequential Manipulation. 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 11328–11335. https://doi.org/10.1109/IROS55552.2023.10342010
Toussaint, M., Harris, J., Ha, J.-S., Driess, D., & Hönig, W. (2022). Sequence-of-Constraints MPC: Reactive Timing-Optimal Control of Sequential Manipulation. 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 13753–13760. https://doi.org/10.1109/IROS47612.2022.9982236
Harris, J., Driess, D., & Toussaint, M. (2022). FC3: Feasibility-Based Control Chain Coordination. 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 13769–13776. https://doi.org/10.1109/IROS47612.2022.9981758
Ha, J.-S., Driess, D., & Toussaint, M. (2022). Deep Visual Constraints: Neural Implicit Models for Manipulation Planning From Visual Input. IEEE Robotics and Automation Letters, 7(4), 10857–10864. https://doi.org/10.1109/LRA.2022.3194955
Grote, P., Ortiz-Haro, J., Toussaint, M., & Oguz, O. S. (2023). Neural Field Representations of Articulated Objects for Robotic Manipulation Planning. arXiv. https://doi.org/10.48550/ARXIV.2309.07620
Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., … Florence, P. (2023). PaLM-E: An Embodied Multimodal Language Model. Proceedings of the 40th International Conference on Machine Learning (ICML), 8469–8488. https://proceedings.mlr.press/v202/driess23a.html
Driess, D., Huang, Z., Li, Y., Tedrake, R., & Toussaint, M. (2023). Learning Multi-Object Dynamics with Compositional Neural Radiance Fields. Proceedings of the 6th Conference on Robot Learning (CoRL 2022), 1755–1768. https://proceedings.mlr.press/v205/driess23a.html
Driess, D., Ha, J.-S., Toussaint, M., & Tedrake, R. (2022). Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning. Proceedings of the 5th Conference on Robot Learning (CoRL 2021), 245–255. https://proceedings.mlr.press/v164/driess22a.html
Driess, D., Ha, J.-S., & Toussaint, M. (2020). Deep Visual Reasoning: Learning to Predict Action Sequences for Task and Motion Planning from an Initial Scene Image. Robotics: Science and Systems, Corvallis, Oregon, USA. https://doi.org/10.15607/RSS.2020.XVI.003
Driess, D., Ha, J.-S., & Toussaint, M. (2021). Learning to solve sequential physical reasoning problems from a scene image. The International Journal of Robotics Research, 40(12–14), 1435–1466. https://doi.org/10.1177/02783649211056967
Driess, D., Schubert, I., Florence, P., Li, Y., & Toussaint, M. (2022). Reinforcement Learning with Neural Radiance Fields. Advances in Neural Information Processing Systems (NeurIPS), 35. https://proceedings.neurips.cc/paper_files/paper/2022/file/6c294f059e3d77d58dbb8fe48f21fe00-Paper-Conference.pdf
