Deep Learning Research Papers for Robot Perception, Grasping and Manipulation
A collection of deep learning research papers covering perception and the associated robotic tasks of grasping and manipulation. Within each research area outlined below, the course staff has identified a core set and an extended set of research papers. The core set forms the basis of our seminar-style lectures starting in week 10; the extended set offers broader coverage of further work within each area. We will keep adding papers discovered by students and staff during the semester.
Table of contents
- RGB-D Architectures
- Pointcloud Processing
- Object Pose, Geometry, SDF, Implicit surfaces
- Dense object descriptors, Category-level representations
- Recurrent Networks and Object Tracking
- Semantic Scene Graphs and Explicit Representations
- Neural Radiance Fields and Implicit Representations
- Datasets
- Self-Supervised Learning
- Grasp Pose Detection
- Tactile Perception for Grasping and Manipulation
- Pre-training for Robot Manipulation and Transformer Architectures
- Perception Beyond Vision
- More Frontiers
RGB-D Architectures
Core List
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes, Xiang et al., 2018
A Unified Framework for Multi-View Multi-Class Object Pose Estimation, Li et al., 2018
Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation, Xiang et al., 2020
PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation, He et al., 2020
Extended List
3D ShapeNets: A Deep Representation for Volumetric Shapes, Wu et al., 2015
VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition, Maturana and Scherer, 2015
Multi-view Convolutional Neural Networks for 3D Shape Recognition, Su et al., 2015
Volumetric and Multi-View CNNs for Object Classification on 3D Data, Qi et al., 2016
Robust 6D Object Pose Estimation with Stochastic Congruent Sets, Mitash et al., 2018
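A pattern running through many of these RGB-D papers is late fusion: color and depth are encoded by separate streams whose features are combined before a joint head. As a rough illustration only (a toy network of our own, not the architecture of any paper above), a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class TwoStreamRGBD(nn.Module):
    """Toy two-stream network: separate RGB and depth encoders,
    concatenated features, shared classification head."""

    def __init__(self, num_classes: int = 10):
        super().__init__()

        def encoder(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        self.rgb_enc = encoder(3)    # color stream
        self.depth_enc = encoder(1)  # depth stream
        self.head = nn.Linear(64 + 64, num_classes)

    def forward(self, rgb, depth):
        # Late fusion: concatenate per-stream features before the head.
        feat = torch.cat([self.rgb_enc(rgb), self.depth_enc(depth)], dim=1)
        return self.head(feat)

# Smoke test on random data.
net = TwoStreamRGBD()
logits = net(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```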
Pointcloud Processing
Core List
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, Qi et al., 2017
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space, Qi et al., 2017
PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation, Xu et al., 2018
DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion, Wang et al., 2019
Extended List
3D Object Detection with Pointformer, Pan et al., 2021
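The idea underpinning the PointNet line of work is a permutation-invariant set function: a shared per-point MLP followed by a symmetric pooling operator (max), so the output cannot depend on point ordering. A minimal sketch, assuming PyTorch and omitting the input/feature transform networks of the full architecture:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + max pooling = order-invariant global feature."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Applied identically to every point: (B, N, 3) -> (B, N, 256).
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, points):                    # points: (B, N, 3)
        per_point = self.point_mlp(points)        # (B, N, 256)
        global_feat = per_point.max(dim=1).values # symmetric over N
        return self.classifier(global_feat)

net = TinyPointNet()
pts = torch.randn(4, 1024, 3)
perm = torch.randperm(1024)
# Output is identical under any permutation of the input points.
assert torch.allclose(net(pts), net(pts[:, perm]), atol=1e-5)
```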
Object Pose, Geometry, SDF, Implicit surfaces
Core List
SUM: Sequential scene understanding and manipulation, Sui et al., 2017
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation, Park et al., 2019
Implicit Surface Representations as Layers in Neural Networks, Michalkiewicz et al., 2019
Extended List
Local Deep Implicit Functions for 3D Shape, Genova et al., 2020
Implicit Geometric Regularization for Learning Shapes, Gropp et al., 2020
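DeepSDF-style implicit representations model a shape as a network f(x) ≈ SDF(x), trained by regressing signed distances at sampled 3D points; the surface is recovered as the zero level set. A minimal single-shape sketch, assuming PyTorch, with a sphere as ground truth so the target SDF is known in closed form (the paper additionally uses per-shape latent codes and a clamped L1 loss):

```python
import torch
import torch.nn as nn

# MLP mapping a 3D point to a scalar signed distance.
sdf_net = nn.Sequential(
    nn.Linear(3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
opt = torch.optim.Adam(sdf_net.parameters(), lr=1e-3)

for step in range(500):
    x = torch.rand(512, 3) * 2 - 1              # sample points in [-1, 1]^3
    target = x.norm(dim=1, keepdim=True) - 0.5  # SDF of a radius-0.5 sphere
    loss = (sdf_net(x) - target).abs().mean()   # plain L1 regression
    opt.zero_grad(); loss.backward(); opt.step()

# The zero level set {x : f(x) = 0} approximates the surface.
probe = torch.tensor([[0.5, 0.0, 0.0]])  # a point on the sphere
print(sdf_net(probe))  # should be close to 0
```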
Dense object descriptors, Category-level representations
Core List
Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation, Florence et al., 2018
Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation, Wang et al., 2019
kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation, Manuelli et al., 2019
Single-Stage Keypoint-Based Category-Level Object Pose Estimation from an RGB Image, Lin et al., 2022
Extended List
Visual Descriptor Learning from Monocular Video, Deekshith et al., 2020
SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings, Haugaard et al., 2021
Fully Self-Supervised Class Awareness in Dense Object Descriptors, Hadjivelichkov et al., 2022
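Dense-descriptor networks in this area are typically trained with a pixelwise contrastive loss: descriptors of corresponding pixels (correspondences known from registered RGB-D views) are pulled together, while non-correspondences are pushed beyond a margin. A sketch of that loss under assumed tensor shapes (function and shapes are our own illustration, not a specific paper's API):

```python
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               non_matches_a, non_matches_b, margin=0.5):
    """desc_*: (D, H, W) descriptor images; *_a/*_b: (K, 2) pixel (row, col)
    indices of matching / non-matching pairs between the two images."""
    def gather(desc, px):  # pull out K descriptors -> (K, D)
        return desc[:, px[:, 0], px[:, 1]].t()

    d_match = gather(desc_a, matches_a) - gather(desc_b, matches_b)
    d_non = gather(desc_a, non_matches_a) - gather(desc_b, non_matches_b)
    match_loss = d_match.pow(2).sum(dim=1).mean()                      # pull together
    non_match_loss = F.relu(margin - d_non.norm(dim=1)).pow(2).mean()  # push apart
    return match_loss + non_match_loss

# Smoke test with random descriptors and pixel pairs.
D, H, W, K = 16, 60, 80, 32
loss = pixelwise_contrastive_loss(
    torch.randn(D, H, W), torch.randn(D, H, W),
    torch.randint(0, 60, (K, 2)), torch.randint(0, 60, (K, 2)),
    torch.randint(0, 60, (K, 2)), torch.randint(0, 60, (K, 2)),
)
print(loss)
```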
Recurrent Networks and Object Tracking
Core List
DeepIM: Deep Iterative Matching for 6D Pose Estimation, Li et al., 2018
PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking, Deng et al., 2019
6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints, Wang et al., 2020
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model, Cheng and Schwing, 2022
Extended List
Long Short-Term Memory, Hochreiter and Schmidhuber, 1997
TrackFormer: Multi-Object Tracking with Transformers, Meinhardt et al., 2022
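The recurrent models behind several of these trackers share one mechanism: a learned state carried across frames. A toy sketch using PyTorch's built-in nn.LSTM that learns to smooth a noisy trajectory (our own illustration, not any paper's tracking model):

```python
import torch
import torch.nn as nn

# Toy tracker: map a sequence of noisy 3D positions to filtered estimates.
lstm = nn.LSTM(input_size=3, hidden_size=32, batch_first=True)
head = nn.Linear(32, 3)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-3)

for step in range(300):
    t = torch.linspace(0, 1, 50)
    clean = torch.stack([t, t**2, torch.sin(4 * t)], dim=-1)  # (50, 3) trajectory
    noisy = clean + 0.05 * torch.randn_like(clean)
    out, _ = lstm(noisy[None])  # (1, 50, 32); hidden state carried over time
    loss = (head(out[0]) - clean).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(loss.item())  # decreases as the LSTM learns temporal smoothing
```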
Semantic Scene Graphs and Explicit Representations
Core List
Image Retrieval using Scene Graphs, Johnson et al., 2015
Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes, Zeng et al., 2018
Differentiable Scene Graphs, Raboh et al., 2020
Semantic Linking Maps for Active Visual Object Search, Zeng et al., 2020
Extended List
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, Krishna et al., 2016
Image Generation from Scene Graphs, Johnson et al., 2018
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera, Armeni et al., 2020
Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization, Hughes et al., 2022
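A semantic scene graph factors a scene into object nodes with attributes and labeled relation edges, which planners and search methods can query as predicates. A bare-bones sketch of the data structure (our own illustration; real systems ground nodes and edges in perception):

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Objects as nodes, (subject, predicate, object) triples as edges."""
    objects: dict = field(default_factory=dict)    # name -> attribute dict
    relations: list = field(default_factory=list)  # (subj, pred, obj) triples

    def add_object(self, name, **attrs):
        self.objects[name] = attrs

    def relate(self, subj, pred, obj):
        self.relations.append((subj, pred, obj))

    def query(self, pred):
        # All (subject, object) pairs connected by a given predicate.
        return [(s, o) for s, p, o in self.relations if p == pred]

g = SceneGraph()
g.add_object("mug", color="red")
g.add_object("table")
g.relate("mug", "on", "table")
print(g.query("on"))  # [('mug', 'table')]
```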
Neural Radiance Fields and Implicit Representations
Core List
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Mildenhall et al., 2020
Object-Centric Neural Scene Rendering, Guo et al., 2020
Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation, Simeonov et al., 2021
NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields, Yen-Chen et al., 2022
NARF22: Neural Articulated Radiance Fields for Configuration-Aware Rendering, Lewis et al., 2022
Learning Multi-Object Dynamics with Compositional Neural Radiance Fields, Driess et al., 2022
Extended List
NeRF Explosion 2020, Dellaert, 2020
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations, Sitzmann et al., 2019
Local Implicit Grid Representations for 3D Scenes, Jiang et al., 2020
Convolutional Occupancy Networks, Peng et al., 2020
INeRF: Inverting Neural Radiance Fields for Pose Estimation, Yen-Chen et al., 2021
ILabel: Interactive Neural Scene Labelling, Zhi et al., 2021
BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering, Xiangli et al., 2021
Block-NeRF: Scalable Large Scene Neural View Synthesis, Tancik et al., 2022
NeRF2Real: Sim2real Transfer of Vision-guided Bipedal Motion Skills using Neural Radiance Fields, Byravan et al., 2022
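At the core of NeRF is the volume rendering quadrature: densities σᵢ and colors cᵢ sampled along a ray are composited with weights wᵢ = Tᵢ(1 − exp(−σᵢδᵢ)), where Tᵢ is the transmittance accumulated over earlier samples. A self-contained sketch of just that compositing step, assuming PyTorch (the MLP producing σ and c is omitted):

```python
import torch

def composite_ray(sigmas, colors, deltas):
    """sigmas: (N,) densities, colors: (N, 3), deltas: (N,) sample spacings.
    Returns the rendered RGB for one ray."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)  # per-sample opacity
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), via a shifted cumprod.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = trans * alphas                     # w_i = T_i * alpha_i
    return (weights[:, None] * colors).sum(dim=0)  # (3,)

rgb = composite_ray(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.02))
print(rgb)
```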
Datasets
RGB-D Datasets:
(NYU Depth v2) Indoor Segmentation and Support Inference from RGBD Images, Silberman et al., 2012
SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite, Song et al., 2015
YCB-Video Dataset, Xiang et al., 2018
BOP: Benchmark for 6D Object Pose Estimation, Hodaň et al., 2018
ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes, Dai et al., 2017
TO-Scene: A Large-scale Dataset for Understanding 3D Tabletop Scenes, Xu et al., 2022
ProgressLabeller: Visual Data Stream Annotation for Training Object-Centric 3D Perception, Chen et al., 2022
Collecting data with robots:
Deep Learning for Robots: Learning from Large-Scale Interaction
All You Need is LUV: Unsupervised Collection of Labeled Images using Invisible UV Fluorescent Indicators, Thananjeyan et al., 2022
Semantic Datasets:
Habitat-Matterport 3D Semantics Dataset, Yadav et al., 2022
Object Model Datasets:
ShapeNet: An Information-Rich 3D Model Repository, Chang et al., 2015
Simulators:
MuJoCo: A physics engine for model-based control, Todorov et al., 2012
PyBullet: A Python module for physics simulation for games, robotics and machine learning, Coumans and Bai, 2016
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning, Makoviychuk et al., 2021
iGibson: Interactive Simulation of Large Scale Virtualized Realistic Scenes for Robot Learning
Self-Supervised Learning
Core List
VICRegL: Self-Supervised Learning of Local Visual Features, Bardes et al., 2022
Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild, Zhang et al., 2022
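VICReg-style objectives avoid representation collapse without negative pairs by combining three terms: an invariance term pulling two views' embeddings together, a variance hinge keeping each embedding dimension spread out, and a covariance penalty decorrelating dimensions. A sketch of the global (image-level) loss, assuming PyTorch; VICRegL additionally applies a matched version to local feature maps:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """z_a, z_b: (N, D) embeddings of two augmented views of the same batch."""
    n, d = z_a.shape
    sim = F.mse_loss(z_a, z_b)  # invariance term

    def variance(z):            # hinge keeping per-dimension std above 1
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return F.relu(1.0 - std).mean()

    def covariance(z):          # penalize off-diagonal covariance entries
        z = z - z.mean(dim=0)
        cov = (z.t() @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    return (sim_w * sim
            + var_w * (variance(z_a) + variance(z_b))
            + cov_w * (covariance(z_a) + covariance(z_b)))

loss = vicreg_loss(torch.randn(128, 32), torch.randn(128, 32))
print(loss)
```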
Grasp Pose Detection
Core List
Using Geometry to Detect Grasp Poses in 3D Point Clouds, ten Pas and Platt, 2015
Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics, Mahler et al., 2017
GlassLoc: Plenoptic Grasp Pose Detection in Transparent Clutter, Zhou et al., 2019
Contact-GraspNet: Efficient 6-DoF Grasp Generation in Cluttered Scenes, Sundermeyer et al., 2021
Sample Efficient Grasp Learning Using Equivariant Models, Zhu et al., 2022
Extended List
High Precision Grasp Pose Detection in Dense Clutter, Gualtieri et al., 2016
Grasp Learning: Models, Methods, and Performance, Platt, 2022
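A geometric primitive beneath several of these grasp detectors is the antipodal test: a parallel-jaw grasp at two contacts is stable under a Coulomb friction model when each surface normal lies inside the friction cone around the line joining the contacts. A minimal NumPy sketch on oriented points (our own illustration, not a specific paper's method):

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, friction_coef=0.5):
    """p*: contact points, n*: outward unit surface normals.
    Checks both normals against the friction cone around the grasp axis."""
    axis = p2 - p1
    axis = axis / np.linalg.norm(axis)
    half_angle = np.arctan(friction_coef)  # friction cone half-angle
    # n1 should point roughly against +axis, n2 roughly along +axis.
    ang1 = np.arccos(np.clip(np.dot(-n1, axis), -1.0, 1.0))
    ang2 = np.arccos(np.clip(np.dot(n2, axis), -1.0, 1.0))
    return ang1 <= half_angle and ang2 <= half_angle

# Opposite faces of a box: normals point away from each other -> antipodal.
print(is_antipodal(np.array([0., 0., 0.]), np.array([-1., 0., 0.]),
                   np.array([0.1, 0., 0.]), np.array([1., 0., 0.])))  # True
```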
Tactile Perception for Grasping and Manipulation
Core List
Visuotactile Affordances for Cloth Manipulation with Local Control, Sunil et al., 2022
Tactile Object Pose Estimation from the First Touch with Geometric Contact Rendering, Bauza et al., 2020
ShapeMap 3-D: Efficient shape mapping through dense touch and vision, Suresh et al., 2022
More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch, Calandra et al., 2018
Extended List
A Review of Tactile Information: Perception and Action Through Touch, Li et al., 2020
Active Visuo-Haptic Object Shape Completion, Rustler et al., 2022
Active Extrinsic Contact Sensing: Application to General Peg-in-Hole Insertion, Kim et al., 2021
Soft-bubble: A highly compliant dense geometry tactile sensor for robot manipulation, Alspach et al., 2019
TACTO: A Fast, Flexible, and Open-source Simulator for High-Resolution Vision-based Tactile Sensors, Wang et al., 2020
The Feeling of Success: Does Touch Sensing Help Predict Grasp Outcomes?, Calandra et al., 2017
Learning Self-Supervised Representations from Vision and Touch for Active Sliding Perception of Deformable Surfaces, Kerr and Huang et al., 2022
GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation, Li et al., 2022
Pre-training for Robot Manipulation and Transformer Architectures
Core List
SORNet: Spatial Object-Centric Representations for Sequential Manipulation, Yuan et al., 2021
Masked Visual Pre-training for Motor Control, Xiao et al., 2022
R3M: A Universal Visual Representation for Robot Manipulation, Nair et al., 2022
CLIPort: What and Where Pathways for Robotic Manipulation, Shridhar et al., 2021
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
RT-1: Robotics Transformer for Real-World Control at Scale, Brohan et al., 2022
Extended List
Interactive Language: Talking to Robots in Real Time, Lynch et al., 2022
Transformers are Adaptable Task Planners, Jain et al., 2022
CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021
Masked Autoencoders Are Scalable Vision Learners, He et al., 2021
Transporter Networks: Rearranging the Visual World for Robotic Manipulation, Zeng et al., 2020
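Masked-autoencoder pre-training, which several entries above build on, hides most image patches and reconstructs them from a small visible subset; the encoder only ever processes the kept patches. A sketch of just the patchify-and-mask step, assuming PyTorch (function name and shapes are ours):

```python
import torch

def random_mask_patches(images, patch=16, mask_ratio=0.75):
    """Split images (B, C, H, W) into patches and keep a random subset,
    as in MAE-style pre-training. Returns kept patches and their indices."""
    B, C, H, W = images.shape
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.reshape(B, C, -1, patch, patch).transpose(1, 2)  # (B, N, C, p, p)
    N = patches.shape[1]
    keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]  # random subset per image
    batch_idx = torch.arange(B)[:, None]
    return patches[batch_idx, idx], idx

kept, idx = random_mask_patches(torch.randn(2, 3, 224, 224))
print(kept.shape)  # torch.Size([2, 49, 3, 16, 16]) -- 25% of 196 patches kept
```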
Perception Beyond Vision
Specialized Sensors
Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images, Levenson et al., 2015
Automatic color correction for 3D reconstruction of underwater scenes, Skinner et al., 2017
GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force, Yuan et al., 2017
Classification of Household Materials via Spectroscopy, Erickson et al., 2018
Through-Wall Human Pose Estimation Using Radio Signals, Zhao et al., 2018
A bio-hybrid odor-guided autonomous palm-sized air vehicle, Anderson et al., 2020
Event-based, Direct Camera Tracking from a Photometric 3D Map using Nonlinear Optimization, Bryner et al., 2019
SoundSpaces: Audio-Visual Navigation in 3D Environments, Chen et al., 2019
Neural Implicit Surface Reconstruction using Imaging Sonar, Qadri et al., 2022
More Frontiers
Interpreting Deep Learning Models
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Simonyan et al., 2013
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Selvaraju et al., 2016
The Building Blocks of Interpretability, Olah et al., 2018
Multimodal Neurons in Artificial Neural Networks, Goh et al., 2021
Fairness and Ethics
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Buolamwini and Gebru, 2018
Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing, Raji et al., 2020
Articulated Objects
Autonomous Tool Construction Using Part Shape and Attachment Prediction, Nair et al., 2019
Parts-Based Articulated Object Localization in Clutter Using Belief Propagation, Pavlasek et al., 2020
Category-Level Articulated Object Pose Estimation, Li et al., 2020
Differentiable Nonparametric Belief Propagation, Opipari et al., 2021
Category-Independent Articulated Object Tracking with Factor Graphs, Heppert et al., 2022
Kineverse: A Symbolic Articulation Model Framework for Model-Agnostic Mobile Manipulation, Röfer et al., 2022
Deformable Objects
DensePose: Dense Human Pose Estimation In The Wild, Güler et al., 2018
FabricFlowNet: Bimanual Cloth Manipulation with a Flow-based Policy, Weng et al., 2021
DextAIRity: Deformable Manipulation Can be a Breeze, Xu et al., 2022
Self-supervised Transparent Liquid Segmentation for Robotic Pouring, Narasimhan et al., 2022
Visio-tactile Implicit Representations of Deformable Objects, Wi et al., 2022
Transparent Objects
LIT: Light-field Inference of Transparency for Refractive Object Localization, Zhou et al., 2019
Multi-modal Transfer Learning for Grasping Transparent and Specular Objects, Weng et al., 2020
Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects, Ichnowski et al., 2021
ClearPose: Large-scale Transparent Object Dataset and Benchmark, Chen et al., 2022
TransNet: Category-Level Transparent Object Pose Estimation, Zhang et al., 2022
Dynamic Scenes
D-NeRF: Neural Radiance Fields for Dynamic Scenes, Pumarola et al., 2020
3D Neural Scene Representations for Visuomotor Control, Li et al., 2021
HexPlane: A Fast Representation for Dynamic Scenes, Cao and Johnson, 2023
Beyond 2D Convolutions
Learning Decentralized Controllers for Robot Swarms with Graph Neural Networks, Tolstaya et al., 2019
A Gentle Introduction to Graph Neural Networks, Sanchez-Lengeling et al., 2021
Reinforcement Learning
Deep Reinforcement Learning from Human Preferences, Christiano et al., 2017
Understanding RL Vision, Hilton et al., 2020
Generative Modeling
WaterGAN: Unsupervised Generative Network to Enable Real-time Color Correction of Monocular Underwater Images, Li et al., 2017
Differentiable Particle Filters through Conditional Normalizing Flow, Chen et al., 2021
Planning with Diffusion for Flexible Behavior Synthesis, Janner et al., 2022
Anything-3D: Towards Single-view Anything Reconstruction in the Wild, Shen et al., 2023