Deep Perception (for manipulation), Part 1

MIT 6.4210/2 Robotic Manipulation

Fall 2022, Lecture 11

Limitations of using geometry only

  • No understanding of what an object is.
    • "Double picks"
    • Might pick up a heavy object from one corner
  • Partial views
  • Depth returns don't work for transparent objects
  • ...
  • Some tasks require object recognition, e.g. "pick the mustard bottles"

A sample annotated image from the COCO dataset

What object categories/labels are in COCO?

Fine tuning


R-CNN (Regions with CNN features)


Faster R-CNN adds a "region proposal network"


Pick up the mustard bottles...

  1. Segmentation + ICP => model-based grasp selection
  2. Segmentation => antipodal grasp selection
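Both pipelines start the same way: use the segmentation output to crop the point cloud down to the requested object. A minimal sketch of that cropping step, assuming per-point class labels that are already aligned with the depth points. The function names and the toy data are hypothetical, and the downstream ICP or antipodal grasp sampler is not shown.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def points_for_label(points: List[Point], labels: List[str], target: str) -> List[Point]:
    """Keep only the 3D points whose predicted label equals `target`."""
    return [p for p, label in zip(points, labels) if label == target]


def centroid(points: List[Point]) -> Point:
    """Mean of a non-empty list of 3D points."""
    if not points:
        raise ValueError("no points for the requested label")
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


# Toy scene (illustrative values only): two mustard points, two table points.
scene_points = [(0.0, 0.0, 0.25), (1.0, 0.0, 0.25), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
scene_labels = ["mustard", "mustard", "table", "table"]

mustard_points = points_for_label(scene_points, scene_labels, "mustard")
mustard_center = centroid(mustard_points)
```

The cropped cloud `mustard_points` is what would be handed to ICP (option 1) or to an antipodal grasp sampler (option 2).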

6D Object Pose Estimation Challenge

  • Until 2019, geometric pose estimation was still winning*.
  • In 2020, CosyPose (Mask R-CNN + deep pose estimation + geometric pose refinement) was best.

* - partly due to low render quality?

Self-supervised pretraining

Example: SimCLR
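SimCLR trains an encoder so that two augmented views of the same image get similar embeddings, while views of different images get dissimilar ones. A minimal pure-Python sketch of its contrastive (NT-Xent) loss. It assumes the embeddings arrive as consecutive pairs of views and uses a temperature of 0.5 as a placeholder; the encoder and the augmentations are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(embeddings, temperature=0.5):
    """NT-Xent loss; embeddings[2i] and embeddings[2i+1] are two views of image i."""
    n = len(embeddings)
    if n < 2 or n % 2 != 0:
        raise ValueError("expected an even number of embeddings (pairs of views)")
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the other view of the same image
        positive = cosine(embeddings[i], embeddings[j]) / temperature
        # Denominator ranges over every other embedding, positive included.
        logits = [cosine(embeddings[i], embeddings[k]) / temperature
                  for k in range(n) if k != i]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / n


# Toy embeddings (illustrative only): views of the same image are close together.
aligned = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
# Same vectors, but paired so that the two "views" disagree.
misaligned = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
```

The loss is lower for `aligned` than for `misaligned`, which is the signal that drives the representation without any human labels.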

"Contrastive visual representation learning"

Example: Monocular Depth Estimation


Decentralized self-supervised learning

Goal: Testing in simulation matches testing in reality.  Continual learning / improvement.


Challenge: Distribution shift / non-iid data

Federated Learning

Why Federated Learning?

Why not aggregate all data in the cloud and train a centralized model?

  • Too much data: the fleet can induct over 2M/day (bandwidth limits and costs)
  • It is not clear that we should pool all of the data (generalization vs. specialization). More data can hurt!

Distribution shifts in Amazon Robotics (AR) Dataset

  • Lighting conditions
  • Density (e.g. time of year/holidays)
  • Upstream material handling systems
  • Altitude
  • Hardware configuration
    • End of arm tooling (EoAT) type
    • Arm type
    • Sensor types
    • Conveyors and walls

Distribution shifts in AR Data

[Figure panels: Site A, Site B, Site C, Site D]

Average number of segments per induct


Key finding

Distributed training on the primary objective (e.g. classification / segmentation) is subject to overfitting and shows limited robustness to distribution shift.

Distributed training on a surrogate self-supervised objective (e.g. SimCLR, SimSiam) reduces overfitting and shows superior generalization across distributions.

  • Bonus: it requires fewer human-annotated labels.

We will compare two algorithms

Data: We created distribution shift datasets grouped by clustering labels, images, or features.
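The text does not say how the clustering was done. As one simple illustration of a label-based split, the sketch below assigns whole label groups to different clients so that each client sees a skewed (non-iid) label distribution. The function name and the round-robin assignment are assumptions, not the procedure used for the datasets above.

```python
from collections import defaultdict


def partition_by_label(samples, num_clients):
    """Split (x, label) samples so that each label goes to exactly one client."""
    by_label = defaultdict(list)
    for x, label in samples:
        by_label[label].append((x, label))
    clients = [[] for _ in range(num_clients)]
    # Round-robin over sorted labels: client c receives labels c, c + num_clients, ...
    for index, label in enumerate(sorted(by_label)):
        clients[index % num_clients].extend(by_label[label])
    return clients


# Toy dataset (illustrative only): 12 samples over 4 labels, split across 2 clients.
toy_samples = [(i, i % 4) for i in range(12)]
toy_clients = partition_by_label(toy_samples, 2)
client_label_sets = [sorted({label for _, label in client}) for client in toy_clients]
```

Here client 0 only ever sees labels 0 and 2, and client 1 only labels 1 and 3, which is a label-shift setting in miniature.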


Algorithm 1: Supervised Learning (SL): Train on the classification or segmentation objective directly.


Algorithm 2: Self-Supervised Learning (SSL): Train a common visual representation, then "fine-tune" only a small "head" on the supervised data.
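The second stage of Algorithm 2 can be sketched as fitting a small head on top of a frozen encoder. In the sketch below, `frozen_encoder` is a hypothetical stand-in for the pretrained representation, and the head is a logistic-regression layer trained with plain SGD; neither is claimed to match the models used in the experiments.

```python
import math


def frozen_encoder(x):
    """Hypothetical stand-in for a pretrained SSL backbone; never updated below."""
    return [x[0] + x[1], x[0] - x[1]]


def head_logit(w, b, x):
    """Linear head applied to the frozen features."""
    features = frozen_encoder(x)
    return sum(wi * fi for wi, fi in zip(w, features)) + b


def train_head(labeled, epochs=500, lr=0.1):
    """Fit only the head parameters (w, b) with SGD on the logistic loss."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in labeled:
            features = frozen_encoder(x)
            p = 1.0 / (1.0 + math.exp(-head_logit(w, b, x)))
            grad = p - y  # derivative of the logistic loss w.r.t. the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, features)]
            b -= lr * grad
    return w, b


# Toy labeled set (illustrative only): label is 1 when x[0] > x[1].
labeled = [((1.0, 0.0), 1), ((2.0, 1.0), 1), ((0.0, 1.0), 0), ((1.0, 2.0), 0)]
w, b = train_head(labeled)
predictions = [1 if head_logit(w, b, x) > 0 else 0 for x, _ in labeled]
```

Only `w` and `b` are updated, so the amount of labeled data needed scales with the size of the head rather than the size of the encoder.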

FedAvg algorithm (McMahan et al., 2017)

\(N\) robots.
\(p_k\) is weight for robot \(k\)
\(\ell(x)\) is the loss function

\(n_k\) training samples at \(k\)

distributed client update: (\(E\) steps with random samples \(\xi\))

server update: (after responses from \(K\) clients)
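The update formulas themselves are not present in the text here. For reference, a hedged reconstruction in the symbols defined above, following the usual statement of FedAvg. The step size \(\eta\), the local iterate \(x_k^{(e)}\), writing the loss as \(\ell(x; \xi)\), and the choice \(p_k = n_k / \sum_j n_j\) are assumptions.

Client update, starting from the current server parameters \(x\):

\[ x_k^{(0)} = x, \qquad x_k^{(e+1)} = x_k^{(e)} - \eta \, \nabla \ell\big(x_k^{(e)}; \xi_k^{(e)}\big), \quad e = 0, \dots, E-1 \]

Server update, averaging over the \(K\) responding clients:

\[ x \leftarrow \sum_{k=1}^{K} \frac{n_k}{\sum_{j=1}^{K} n_j} \, x_k^{(E)} \]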

E (number of decentralized steps) 
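A minimal pure-Python sketch of the FedAvg loop on a toy problem. It assumes a scalar model \(y = w x\) with squared loss, full client participation in every round, and averaging weights proportional to each client's sample count; the function names, hyperparameters, and data are all illustrative.

```python
import random


def local_sgd(w, data, steps, lr, rng):
    """Client update: `steps` SGD steps on 0.5 * (w * x - y)^2 with random samples."""
    for _ in range(steps):
        x, y = rng.choice(data)
        w -= lr * (w * x - y) * x
    return w


def fedavg(clients, rounds=50, steps=5, lr=0.05, seed=0):
    """Server loop: broadcast w, collect local results, average weighted by n_k."""
    rng = random.Random(seed)
    w = 0.0
    total = sum(len(data) for data in clients)
    for _ in range(rounds):
        local_models = [local_sgd(w, data, steps, lr, rng) for data in clients]
        w = sum(len(data) / total * w_k for data, w_k in zip(clients, local_models))
    return w


# Toy clients (illustrative only): same relation y = 2x, different input ranges.
toy_fleet = [
    [(0.5, 1.0), (1.0, 2.0)],
    [(1.5, 3.0), (2.0, 4.0)],
]
w_final = fedavg(toy_fleet)
```

Increasing `steps` (the \(E\) above) reduces communication rounds but lets each client drift further toward its own data before averaging, which is where distribution shift between clients starts to matter.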

Classification on CIFAR distribution shift dataset

[Plot: classification success rate]

Foundation models

Quick experiments using CLIP "out of the box" by Kevin Zakka

Lecture 11: Deep Perception (part 1)

By russtedrake
