Deep Learning Research Papers for Robot Perception, Grasping and Manipulation

This is a collection of deep learning research papers covering perception and associated robotic tasks. Within each research area outlined below, the course staff has identified a core set and an extended set of research papers. The core set will form the basis of our seminar-style lectures starting in week 7. The extended set provides additional coverage of other notable work within each area.

Table of contents

RGB-D Architectures
1. Core List
2. Extended List
Pointcloud Processing
1. Core List
2. Extended List
Object Pose, Geometry, SDF, Implicit surfaces
1. Core List
2. Extended List
Dense object descriptors, Category-level representations
1. Core List
2. Extended List
Recurrent Networks and Object Tracking
1. Core List
2. Extended List
Visual Odometry and Localization
1. Core List
2. Extended List
Semantic Scene Graphs and Explicit Representations
1. Core List
2. Extended List
Neural Radiance Fields and Implicit Representations
1. Core List
2. Extended List
Datasets
Self-Supervised Learning
1. Core List
Grasp Pose Detection
1. Core List
2. Extended List
Tactile Perception for Grasping and Manipulation
1. Core List
2. Extended List
Pre-training for Robot Manipulation and Transformer Architectures
1. Core List
2. Extended List
More Frontiers

RGB-D Architectures

Core List

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, Xiang et al., 2018
A Unified Framework for Multi-View Multi-Class Object Pose Estimation, Li et al., 2018
Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation, Li et al., 2020
PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation, He et al., 2020

Extended List

3D ShapeNets: A Deep Representation for Volumetric Shapes, Wu et al., 2015
VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition, Maturana and Scherer, 2015
Multi-view Convolutional Neural Networks for 3D Shape Recognition, Su et al., 2015
Volumetric and Multi-View CNNs for Object Classification on 3D Data, Qi et al., 2016
Robust 6D Object Pose Estimation with Stochastic Congruent Sets, Mitash et al., 2018

Pointcloud Processing

Core List

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, Qi et al., 2017
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, Qi et al., 2017
PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation, Xu et al., 2018
DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion, Wang et al., 2019

Extended List

3D Object Detection with Pointformer, Pan et al., 2021

Object Pose, Geometry, SDF, Implicit surfaces

Core List

SUM: Sequential scene understanding and manipulation, Sui et al., 2017
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation, Park et al., 2019
Implicit surface representations as layers in neural networks, Michalkiewicz et al., 2019

Extended List

Local Deep Implicit Functions for 3D Shape, Genova et al., 2020
Implicit geometric regularization for learning shapes, Gropp et al., 2020

Dense object descriptors, Category-level representations

Core List

Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation, Florence et al., 2018
Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation, Wang et al., 2019
kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation, Manuelli et al., 2019
Single-Stage Keypoint-Based Category-Level Object Pose Estimation from an RGB Image, Lin et al., 2022

Extended List

Visual Descriptor Learning from Monocular Video, Deekshith et al., 2020
SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings, Haugaard et al., 2021
Fully Self-Supervised Class Awareness in Dense Object Descriptors, Hadjivelichkov et al., 2022

Recurrent Networks and Object Tracking

Core List

DeepIM: Deep Iterative Matching for 6D Pose Estimation, Li et al., 2018
PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking, Deng et al., 2019
6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints, Wang et al., 2020
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model, Cheng and Schwing, 2022

Extended List

Long Short-Term Memory, Hochreiter and Schmidhuber, 1997
TrackFormer: Multi-Object Tracking with Transformers, Meinhardt et al., 2022

Visual Odometry and Localization

Core List

Backprop KF: Learning Discriminative Deterministic State Estimators, Haarnoja et al., 2016
Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors, Jonschkowski et al., 2018
Multimodal Sensor Fusion with Differentiable Filters, Lee et al., 2020
Differentiable SLAM-net: Learning Particle SLAM for Visual Navigation, Karkus et al., 2021

Extended List

Differentiable Algorithm Networks for Composable Robot Learning, Karkus et al., 2019
Chasing Ghosts: Instruction Following as Bayesian State Tracking, Anderson et al., 2019
Differentiable Factor Graph Optimization for Learning Smoothers, Yi et al., 2021
How to train your differentiable filter, Kloss et al., 2021
Differentiable Nonparametric Belief Propagation, Opipari et al., 2021
NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields, Rosinol et al., 2022

Semantic Scene Graphs and Explicit Representations

Core List

Image Retrieval using Scene Graphs, Johnson et al., 2015
Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes, Zeng et al., 2018
Differentiable Scene Graphs, Raboh et al., 2020
Semantic Linking Maps for Active Visual Object Search, Zeng et al., 2020

Extended List

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, Krishna et al., 2016
Image Generation from Scene Graphs, Johnson et al., 2018
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera, Armeni et al., 2020
Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization, Hughes et al., 2022

Neural Radiance Fields and Implicit Representations

Core List

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Mildenhall et al., 2020
Object-Centric Neural Scene Rendering, Guo et al., 2020
Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation, Simeonov et al., 2021
NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields, Yen-Chen et al., 2022
NARF22: Neural Articulated Radiance Fields for Configuration-Aware Rendering, Lewis et al., 2022
Learning Multi-Object Dynamics with Compositional Neural Radiance Fields, Driess et al., 2022

Extended List

NeRF Explosion 2020, Dellaert, 2020
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations, Sitzmann et al., 2019
Local Implicit Grid Representations for 3D Scenes, Jiang et al., 2020
Convolutional occupancy networks, Peng et al., 2020
iNeRF: Inverting Neural Radiance Fields for Pose Estimation, Yen-Chen et al., 2021
iLabel: Interactive Neural Scene Labelling, Zhi et al., 2021
BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering, Xiangli et al., 2021
Block-NeRF: Scalable Large Scene Neural View Synthesis, Tancik et al., 2022
NeRF2Real: Sim2real Transfer of Vision-guided Bipedal Motion Skills using Neural Radiance Fields, Byravan et al., 2022

Datasets

RGB-D Datasets:

(NYU Depth v2) Indoor Segmentation and Support Inference from RGBD Images, Silberman et al., 2012
SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite, Song et al., 2015
YCB-Video Dataset, Xiang et al., 2018
BOP: Benchmark for 6D Object Pose Estimation, Hodaň et al., 2019
ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, Dai et al., 2017
TO-Scene: A Large-scale Dataset for Understanding 3D Tabletop Scenes, Xu et al., 2022
ProgressLabeller: Visual Data Stream Annotation for Training Object-Centric 3D Perception, Chen et al., 2022

Collecting data with robots:

Semantic Datasets:

Habitat-Matterport 3D Semantics Dataset, Yadav et al., 2022

Object Model Datasets:

ShapeNet: An Information-Rich 3D Model Repository, Chang et al., 2015
PartNet-Mobility Dataset

Simulators:

MuJoCo: A physics engine for model-based control, Todorov et al., 2012
PyBullet, a Python module for physics simulation for games, robotics and machine learning, Coumans et al., 2015
NVIDIA Isaac Sim
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning, Makoviychuk et al., 2021
SAPIEN: A SimulAted Part-based Interactive ENvironment
Habitat
iGibson: Interactive Simulation of Large Scale Virtualized Realistic Scenes for Robot Learning

Self-Supervised Learning

Core List

VICRegL: Self-Supervised Learning of Local Visual Features, Bardes et al., 2022
Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild, Zhang et al., 2022

Grasp Pose Detection

Core List

Using Geometry to Detect Grasps in 3D Point Clouds, ten Pas and Platt, 2015
Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics, Mahler et al., 2017
GlassLoc: Plenoptic Grasp Pose Detection in Transparent Clutter, Zhou et al., 2019
Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes, Sundermeyer et al., 2021
Sample Efficient Grasp Learning Using Equivariant Models, Zhu et al., 2022

Extended List

High precision grasp pose detection in dense clutter, Gualtieri et al., 2016
Grasp Learning: Models, Methods, and Performance, Platt, 2022

Tactile Perception for Grasping and Manipulation

Core List

Visuotactile Affordances for Cloth Manipulation with Local Control, Sunil et al., 2022
Tactile Object Pose Estimation from the First Touch with Geometric Contact Rendering, Bauza et al., 2020
ShapeMap 3-D: Efficient shape mapping through dense touch and vision, Suresh et al., 2022
More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch, Calandra et al., 2018

Extended List

A Review of Tactile Information: Perception and Action Through Touch, Li et al., 2020
Active Visuo-Haptic Object Shape Completion, Rustler et al., 2022
Active Extrinsic Contact Sensing: Application to General Peg-in-Hole Insertion, Kim et al., 2021
Soft-bubble: A highly compliant dense geometry tactile sensor for robot manipulation, Alspach et al., 2019
TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors, Wang et al., 2020
The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?, Calandra et al., 2017
Learning Self-Supervised Representations from Vision and Touch for Active Sliding Perception of Deformable Surfaces, Kerr and Huang et al., 2022
GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation, Li et al., 2022

Pre-training for Robot Manipulation and Transformer Architectures

Core List

SORNet: Spatial Object-Centric Representations for Sequential Manipulation, Yuan et al., 2021
Masked Visual Pre-training for Motor Control, Xiao et al., 2022
R3M: A Universal Visual Representation for Robot Manipulation, Nair et al., 2022
CLIPort: What and Where Pathways for Robotic Manipulation, Shridhar et al., 2021
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
RT-1: Robotics Transformer for Real-World Control at Scale, Brohan et al., 2022

Extended List

Interactive Language: Talking to Robots in Real Time, Lynch et al., 2022
Transformers are Adaptable Task Planners, Jain et al., 2022
CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Masked Autoencoders Are Scalable Vision Learners, He et al., 2021
Transporter Networks: Rearranging the Visual World for Robotic Manipulation, Zeng et al., 2020

More Frontiers

Interpreting Deep Learning Models

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Simonyan et al., 2013
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2016
The Building Blocks of Interpretability, Olah et al., 2018
Multimodal Neurons in Artificial Neural Networks, Goh et al., 2021

Fairness and Ethics

Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Buolamwini and Gebru, 2018
Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing, Raji et al., 2020

Articulated and Deformable Objects

DensePose: Dense Human Pose Estimation In The Wild, Güler et al., 2018
Differentiable Nonparametric Belief Propagation, Opipari et al., 2021

Transparent Objects

LIT: Light-field Inference of Transparency for Refractive Object Localization, Zhou et al., 2019
Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects, Ichnowski et al., 2021
ClearPose: Large-scale Transparent Object Dataset and Benchmark, Chen et al., 2022
TransNet: Category-Level Transparent Object Pose Estimation, Zhang et al., 2022

Dynamic Scenes

D-NeRF: Neural Radiance Fields for Dynamic Scenes, Pumarola et al., 2020