Single-Photon 3D Imaging with
Deep Sensor Fusion
David B. Lindell, Matthew O’Toole, Gordon Wetzstein
Stanford University
d = c · t / 2
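As a quick illustration of this relation, here is a minimal Python sketch (the helper name and the 80 ps example bin are our own, chosen only for illustration):

```python
# Depth from round-trip time of flight: d = c * t / 2.
C = 3.0e8  # speed of light in m/s

def tof_to_depth(t_seconds):
    """Depth in meters for a photon timestamp t measured from pulse emission."""
    return C * t_seconds / 2.0

# Example: one 80 ps timing bin corresponds to roughly 1.2 cm of depth.
print(tof_to_depth(80e-12))  # ~0.012 m
```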
Tradeoffs in Active 3D Imaging
• maximum range
• acquisition speed
• resolution
Tradeoffs in Active 3D Imaging
• maximum range
• acquisition speed
• resolution
Velodyne
Single-Photon Avalanche Diode (SPAD)
(Figure: a laser pulse is sent into the scene and the SPAD records a timestamp for each photon event; detections are binned into a histogram of num. detections vs. time of flight, which also includes ambient light.)
Active 3D Imaging
Pulsed:
  Hardware: Conventional LIDAR, Single-Photon Avalanche Diodes (SPADs)
  Algorithms: Single-Photon Depth Estimation [Kirmani et al. 12; Shin et al. 15, 16; Rapp and Goyal 17]
  Tradeoff: + Maximum Range, - Acquisition Speed, - Resolution
Continuous Wave:
  Hardware: Kinect-type systems
  Algorithms: Deep Depth [Li et al. 16, Hui et al. 16], Deep Sensor Fusion [Marco et al. 17, Su et al. 18]
  Tradeoff: - Maximum Range, + Acquisition Speed, + Resolution
Active 3D Imaging
Pulsed:
  Hardware: Conventional LIDAR, Single-Photon Avalanche Diodes (SPADs)
  Algorithms: Single-Photon Depth Estimation [Kirmani et al. 12; Shin et al. 15, 16; Rapp and Goyal 17]
    + single-photon depth estimation
    - make assumptions
  Tradeoff: + Maximum Range, - Acquisition Speed, - Resolution
Continuous Wave:
  Hardware: Kinect-type systems
  Algorithms: Deep Depth [Li et al. 16, Hui et al. 16], Deep Sensor Fusion [Marco et al. 17, Su et al. 18]
  Tradeoff: - Maximum Range, + Acquisition Speed, + Resolution
[Kirmani et al. 12;
Shin et al. 15, 16]
[Rapp and Goyal 17]
Laser detections
Noise/Ambient light
Single-Photon Depth Estimation
[Kirmani et al. 12;
Shin et al. 15, 16]
[Rapp and Goyal 17]
Censored Detections
Laser detections
Noise/Ambient light
Single-Photon Depth Estimation
[Kirmani et al. 12;
Shin et al. 15, 16]
[Rapp and Goyal 17]
Censored Detections → Solve for Depth
Laser detections
Noise/Ambient light
Noisy Detections → Censored Detections → Solve for Depth
Heuristics for censoring:
• Median time of flight of surrounding pixels (sketched below)
• Use superpixel clustering
• Less effective with low signal, high ambient light, or complex scenes
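To make the first heuristic above concrete, the following is a minimal Python sketch of median-based censoring; the neighborhood radius, the time window, and the data layout are illustrative assumptions rather than the parameters of the cited methods:

```python
import numpy as np

def censor_by_neighborhood_median(timestamps, window_ps=1000, radius=2):
    """Simplified censoring heuristic (illustrative, not the published algorithm):
    keep only detections whose time of flight lies within +/- window_ps of the
    median time of flight over a (2*radius+1)^2 pixel neighborhood.

    timestamps: H x W list-of-lists of photon arrival times (picoseconds).
    Returns an H x W list-of-lists of censored detections.
    """
    H, W = len(timestamps), len(timestamps[0])
    censored = [[[] for _ in range(W)] for _ in range(H)]
    for i in range(H):
        for j in range(W):
            neighborhood = []
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < H and 0 <= x < W:
                        neighborhood.extend(timestamps[y][x])
            if not neighborhood:
                continue
            med = np.median(neighborhood)
            censored[i][j] = [t for t in timestamps[i][j] if abs(t - med) <= window_ps]
    return censored
```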
Single-Photon Depth Estimation
[Kirmani et al. 12;
Shin et al. 15, 16]
[Rapp and Goyal 17]
Active 3D Imaging
Pulsed:
  Hardware: Conventional LIDAR, Single-Photon Avalanche Diodes (SPADs)
  Algorithms: Single-Photon Depth Estimation [Kirmani et al. 12; Shin et al. 15, 16; Rapp and Goyal 17]
    + single-photon depth estimation
    - make assumptions
  Tradeoff: + Maximum Range, - Acquisition Speed, - Resolution
Continuous Wave:
  Hardware: Kinect-type systems
  Algorithms: Deep Depth [Li et al. 16, Hui et al. 16], Deep Sensor Fusion [Marco et al. 17, Su et al. 18]
    - Applies to CW-TOF
  Tradeoff: - Maximum Range, + Acquisition Speed, + Resolution
Active 3D Imaging
Pulsed:
  Hardware: Conventional LIDAR, Single-Photon Avalanche Diodes (SPADs)
  Algorithms: Single-Photon Depth Estimation [Kirmani et al. 12; Shin et al. 15, 16; Rapp and Goyal 17]
    + single-photon depth estimation
    - make assumptions
  Tradeoff: + Maximum Range, - Acquisition Speed, - Resolution
Continuous Wave:
  Hardware: Kinect-type systems
  Algorithms: Deep Depth [Li et al. 16, Hui et al. 16], Deep Sensor Fusion [Marco et al. 17, Su et al. 18]
    - don't use raw photon counts
  Tradeoff: - Maximum Range, + Acquisition Speed, + Resolution
Goal: + Maximum range, + Acquisition speed, + Resolution
Single-Photon 3D Imaging
Intensity Image + Photon Detections (20 Hz; avg. < 1 laser photon per spatial position) → CNN Sensor Fusion → 3D Reconstruction
Single-Photon 3D Imaging
• Single-Photon Avalanche Diodes (SPADs) + Intensity Image
• CNN Processing
• Photon-Efficient Prototype
SPAD Image Formation
Measurement Histogram (avg. detections per time bin) = Poisson Sampling of
  [ Num. Laser Shots × ( Detection Efficiency × Radial Falloff/Reflectivity × Avg. Laser Photon Arrivals + Avg. Ambient Photons + Sensor Noise Detections ) ]
SPAD Image Formation
Parameter          Value
pulse duration     80 ps
peak power         30 W
range              200 m
num. pulses (N)    10000 (~8 ms exposure)
→ on average, ~2 photons detected from the laser and ~1 photon from ambient light over the exposure
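The image-formation model above can be simulated directly as a per-bin Poisson process. Below is a minimal NumPy sketch under that model; the grouping of ambient light and dark counts into a single background term, the Gaussian pulse shape, and the specific constants are our assumptions, and SPAD dead-time/pile-up effects are ignored, which is reasonable in this low-flux regime:

```python
import numpy as np

def simulate_spad_histogram(tau, gamma=1.0, eta=1.0, background=0.0, n_pulses=10000, rng=None):
    """Sample one measurement histogram from the per-bin Poisson model.

    tau        : average laser photon arrivals per time bin, per pulse
    gamma      : radial falloff x reflectivity factor for this pixel
    eta        : detection efficiency
    background : average ambient + dark-count detections per bin, per pulse
                 (grouped into a single term here for simplicity)
    n_pulses   : number of laser shots N over the exposure
    """
    rng = np.random.default_rng() if rng is None else rng
    rate = n_pulses * (eta * gamma * tau + background)  # avg. detections per bin
    return rng.poisson(rate)

# Example matching the numbers above: a Gaussian pulse centered at bin 300,
# scaled to ~2 signal photons and ~1 ambient photon over the 10,000-pulse exposure.
bins = np.arange(1024)
pulse = np.exp(-0.5 * ((bins - 300) / 2.0) ** 2)
pulse *= 2.0 / (10000 * pulse.sum())                 # ~2 signal photons total
hist = simulate_spad_histogram(pulse, background=1.0 / (10000 * 1024))
```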
Single-Photon Avalanche Diodes (SPADs) + Intensity Image
CNN Processing
Photon-Efficient Prototype
Dataset
• ~16K simulated SPAD measurements from NYU v2
• Time of flight from depth
• Intrinsic decomposition [Chen and Koltun 13]
• Train on various avg. signal and noise levels
(Intensity + depth images from NYU v2 [Silberman et al. 12])
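As a rough sketch of how one pixel of an NYU v2 depth/albedo pair could be turned into a per-bin signal rate for a simulator like the one above (the bin width, number of bins, and photon-budget constant are illustrative assumptions, not the paper's calibration):

```python
import numpy as np

C = 3.0e8            # speed of light (m/s)
BIN_WIDTH = 80e-12   # TDC bin width in seconds (illustrative)
NUM_BINS = 1024      # time bins per histogram (illustrative)

def pixel_signal_rate(depth_m, albedo, pulse_shape, photons_at_unit_range=2.0):
    """Per-bin average laser photon arrivals for one pixel (a sketch, not the
    paper's exact calibration).

    depth_m     : ground-truth depth at this pixel (meters)
    albedo      : reflectivity estimate from intrinsic decomposition
    pulse_shape : discretized laser pulse, normalized to sum to 1
    """
    bin_idx = int(round(2.0 * depth_m / (C * BIN_WIDTH)))   # round-trip time -> bin index
    gamma = albedo / max(depth_m, 1e-3) ** 2                # reflectivity and 1/d^2 falloff
    tau = np.zeros(NUM_BINS)
    lo = max(0, bin_idx - len(pulse_shape) // 2)
    hi = min(NUM_BINS, lo + len(pulse_shape))
    tau[lo:hi] = pulse_shape[: hi - lo]
    return photons_at_unit_range * gamma * tau              # feed this to the simulator above
```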
Processing Pipeline
Noisy Detections (+ Intensity Image) → CNN Processing → Censored Detections & Regressed Pulse → Estimated Depth, or Estimated Depth + Upsampling
Training
Simulated Noisy Detections (+ intensity image) → CNN Processing → Estimated Pulse → Depth Map
KL Divergence Loss against the Ground Truth Pulse:
L(ĥ, h) = Σ_k D_KL(ĥ_k, h_k)
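A minimal PyTorch sketch of this loss is shown below; the softmax normalization of the network output, the direction of the divergence, and the tensor layout are our assumptions:

```python
import torch
import torch.nn.functional as F

def kl_pulse_loss(pred_logits, target_pulse, eps=1e-8):
    """KL divergence between the ground-truth pulse and the regressed pulse,
    summed over time bins and averaged over pixels (simplified sketch).

    pred_logits  : (B, T, H, W) raw network outputs over T time bins
    target_pulse : (B, T, H, W) ground-truth pulse at each pixel
    """
    log_pred = F.log_softmax(pred_logits, dim=1)                     # normalize over time
    target = target_pulse / (target_pulse.sum(dim=1, keepdim=True) + eps)
    kl = (target * (torch.log(target + eps) - log_pred)).sum(dim=1)  # KL(target || pred)
    return kl.mean()
```

At test time, the depth at each pixel can then be read off from the peak (argmax) of the regressed pulse.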
CNN Architecture for Depth Estimation (1 of 3): SPAD measurements only
  Input: SPAD measurements → Output: geometry
CNN Architecture for Depth Estimation (2 of 3): improved denoising by sensor fusion
  Input: SPAD measurements + intensity image → Output: geometry
CNN Architecture for Depth Estimation (3 of 3): guided upsampling by sensor fusion
  Input: SPAD measurements + intensity image → Output: upsampled geometry
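To make the three stages concrete, here is a toy PyTorch sketch; the layer counts, channel widths, and the placeholder residual in the upsampler are illustrative assumptions and far smaller and simpler than the published architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthFusionCNN(nn.Module):
    """Toy sketch: multi-scale 3D convolutions over the SPAD volume, fusion with
    2D intensity features, and a per-bin pulse output (not the published network).
    """
    def __init__(self):
        super().__init__()
        # (1 of 3): 3D feature extraction on the (T, H, W) photon-count volume
        self.enc1 = nn.Sequential(nn.Conv3d(1, 4, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv3d(4, 8, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.Conv3d(8, 4, 3, padding=1), nn.ReLU())
        # (2 of 3): intensity branch producing 2D features broadcast along time
        self.intensity = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv3d(8, 1, 3, padding=1)

    def forward(self, spad, intensity):
        # spad: (B, 1, T, H, W) photon counts; intensity: (B, 1, H, W)
        f1 = self.enc1(spad)
        f2 = self.dec(self.enc2(f1))
        f2 = F.interpolate(f2, size=f1.shape[2:], mode="trilinear", align_corners=False)
        img = self.intensity(intensity).unsqueeze(2)      # (B, 4, 1, H, W)
        img = img.expand(-1, -1, f1.shape[2], -1, -1)     # repeat along time bins
        fused = torch.cat([f1 + f2, img], dim=1)          # (B, 8, T, H, W)
        return self.fuse(fused).squeeze(1)                # per-bin logits (B, T, H, W)

def guided_upsample(depth_lr, scale=8):
    """(3 of 3) sketch: residual-plus-bicubic structure with a trivial residual.
    depth_lr: (B, 1, h, w) low-resolution depth map."""
    bicubic = F.interpolate(depth_lr, scale_factor=scale, mode="bicubic", align_corners=False)
    residual = torch.zeros_like(bicubic)  # placeholder for the learned high-pass residual
    return bicubic + residual
```

In the full system the upsampling stage is itself a CNN that predicts the high-frequency residual using the intensity image, and the whole pipeline can be trained end to end with the KL loss above.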
Depth Estimation (simulated)
Scene 1 (SBR 2/50): Ground Truth Depth, Intensity, Measurements; Log-matched Filter (RMSE: 5.7336), Shin et al. 2016 (RMSE: 5.0787), Rapp and Goyal 2017 (RMSE: 0.0482), Ours with intensity (RMSE: 0.0343); 4x magnification insets.
Scene 2 (SBR 2/50): Ground Truth Depth, Intensity, Measurements; Log-matched Filter (RMSE: 5.795), Rapp and Goyal 2017 (RMSE: 0.0242), Ours w/o intensity (RMSE: 0.0765), Ours with intensity (RMSE: 0.0239).
Depth Estimation + Upsampling (simulated)
Inputs: Intensity, Low-Res. Measurements (SBR 2/50); reference: High-Res. Ground Truth Depth
Depth CNN + Bicubic Upsampling    RMSE: 0.0622
Rapp & Goyal + Upsample CNN       RMSE: 0.1156
Depth CNN + Upsample CNN          RMSE: 0.0663
Proposed End-to-End               RMSE: 0.0593
Single-Photon Avalanche Diodes (SPADs) + Intensity Image
CNN Processing
Photon-Efficient Prototype
Prototype setup: illumination optics, imaging optics, intensity camera, SPAD line array; the laser sweeps a vertical scanline across the scene
(note: laser illumination is too weak to observe visually while scanning under ambient light)
scan rate: 20 Hz, lights on
scan rate: 20 Hz, lights off
Intensity image and SPAD measurements (20 Hz); average per spatial position: 0.64 signal detections, 0.87 background detections
Results: Intensity image, Log-matched filter, [Rapp and Goyal 2017], Denoised (w/o intensity), Denoised (w/ intensity), Denoised + Guided upsampling
Intensity image and SPAD measurements (5 Hz); average per spatial position: 0.85 signal detections, 2.9 background detections
Results: Intensity image, Log-matched filter, [Rapp and Goyal 2017], Denoised (w/o intensity), Denoised (w/ intensity), Denoised + Guided upsampling
Intensity image and SPAD measurements (5 Hz), captured outdoors under indirect sunlight
Results: Intensity image, Log-matched filter, [Rapp and Goyal 2017], Denoised (w/o intensity), Denoised (w/ intensity), Denoised + Guided upsampling
Limitations
• Signal limitations (~800 m SPAD ranging has been demonstrated with a more engineered optical system [Pawlikowska et al. 17])
• Processing
• Intensity image
• Temporal consistency
• 2D SPAD arrays [Burri et al. 17]
Summary
Single-Photon Avalanche Diodes (SPADs) + Intensity Image
CNN Processing
Photon-Efficient Prototype
• Maximum range
• Acquisition speed
• Resolution
Single-Photon 3D Imaging with Deep Sensor Fusion
David B. Lindell, Matthew O’Toole, Gordon Wetzstein
Stanford University
Contact: lindell@stanford.edu
Code and data available: computationalimaging.org
Editor's Notes
  • #3: LIDAR stands for light detection and ranging, and is an active 3D imaging technology that is frequently used, for example, in autonomous navigation. LIDAR calculates distance by measuring the time it takes for a pulse of light to travel to an object and back. Here, the measurements show the arrival times of the detected photons from the pulse.
  • #4: If we increase the distance and we have ambient detections, the signal-to-noise ratio is much lower and it becomes difficult to see the return pulse in the measurements. We could solve this by increasing the exposure time, but then the acquisition speed goes down and so does the number of points we could potentially scan. So we have this tradeoff between range, acquisition speed, and resolution.
  • #5: Commercial lidars that need to achieve long range return a sparsely scanned 3D point cloud of the scene.
  • #6: So how does a lidar work? There are different types of detectors used in LIDAR systems, but one emerging class of extremely sensitive sensor is the single-photon avalanche diode, or SPAD. Here's how a SPAD works: we use a picosecond laser to send millions of short pulses of light into the scene every second. Each pulse reflects off the scene and scatters back to our detector. For each pulse emitted into the scene, the SPAD has a chance of detecting the arrival of a single photon. When a detection occurs, a piece of circuitry called a time-to-digital converter, or TDC, generates a timestamp to record the time of flight of the photon. Time zero corresponds to when the laser pulse was emitted, and the largest timestamp value corresponds to the instant just before the next laser pulse is emitted. By binning detections by their time of flight, we start to see when the laser pulse arrives. This histogram corresponds to a single spatial position of a 3D scan of the scene.
  • #7: Pulsed systems scan a concentrated pulse of light around the scene and can be paired with sensitive detectors called SPADs, which can time the arrival of individual photons. These systems achieve long range at the cost of long acquisition times and low resolution. Kinect-type systems diffuse their light over the whole scene; since they don't require scanning, they achieve high acquisition speed and resolution at the cost of range.
  • #8: There are also different reconstruction algorithms for these sensors. While SPADs can be used to recover depth with only a single detected photon, existing algorithms make limiting assumptions about the types of scenes that can be reconstructed. Depth estimation with deep networks and sensor fusion approaches exists for Kinect-type sensors, but those methods use entirely different measurements than sensitive SPAD detectors.
  • #9-11: These methods take as input a noisy set of photon detections at each pixel. Again, some photons come from the laser and some come from sensor noise or ambient light. The ambient and noise detections make solving for the depth a non-convex problem, so these methods attempt to remove any photon detections from noise or ambient light and then solve an optimization problem to determine the depth from the remaining photon detections.
  • #12: There are different heuristic approaches for censoring the ambient and noise photons. Some methods filter detections by comparing them to the median time of flight of surrounding pixels, or average measurements from pixels with similar albedo values to try to increase the amount of signal from the laser pulse. Another method histograms the detections and tries to identify peaks corresponding to a sparse set of depth planes. These heuristic approaches might not work as well for scenes with complex geometry or with extremely low signal and high ambient light.
  • #13: There are also different reconstruction algorithms for these sensors. While SPADs can be used to recover depth with only a single detected photon, existing algorithms make limiting assumptions about the types of scenes that can be reconstructed. Depth estimation with deep networks and sensor fusion approaches exists for Kinect-type sensors, but those methods use entirely different measurements than sensitive SPAD detectors.
  • #14: So we want to alleviate this tradeoff between range, acquisition speed and resolution.
  • #15: In this work, we use sensor fusion with a normal intensity image and the time of flight of detected photons from sensitive photodetectors to robustly recover 3D geometry with less than a single laser photon returning on average at each spatial position. We'd like to fuse information from the intensity image and the time of flight, because they contain complementary information. The intensity image is 2D with low noise and high resolution, and the depth map is 3D, but noisy and low resolution. But it's not clear how we can actually jointly model and exploit these modalities, so we propose to use a learned approach with convolutional neural networks to robustly recover 3D geometry. Moreover, I'll show how we can learn an end-to-end mapping for depth upsampling, and we also imagine using such an approach for higher-level goals such as object detection or waypoint estimation for driving from the raw measurements.
  • #16: So I’m going to talk about how we model these SPAD measurements, a CNN framework for depth estimation and sensor fusion, and how we combine this with a photon efficient prototype for depth estimation.
  • #17: I'll first talk about how these SPAD sensors work and the types of measurements that we get.
  • #18-19: If we look at this histogram of measurements, we can identify three sources of signal: noise from the sensor, ambient light, and our laser pulse. We can model this histogram with the average rate of detections from each source. The forward image formation of the measurement histogram is then given by sampling a Poisson process with these average detections as a time-varying arrival rate. This is described by the equation where N is the number of laser pulses over the exposure period, tau is the arrival rate of the laser pulse, gamma accounts for radial distance falloff and object reflectivity, eta accounts for the efficiency of the photodetector, and a and d indicate the average ambient and sensor noise detections.
  • #20: So we could run the numbers for a simple example with 80 ps pulse duration, 30 W peak power and 200 m range with 10K pulses. On average from the return pulse we might expect to get back just a couple photons, and then during that same time interval we would get around a single photon from ambient light. So this helps to identify some of the difficulty facing commercial lidar systems that are trying to estimate depth at such long ranges. And this also motivates a sensor fusion approach to help disentangle the signal photons from the background detections.
  • #21-22: Additional information might help with the problem. For example, a conventional image of the scene contains image gradients with information about physical structure and a measure of how much ambient light comes from each location. But an analytical mapping between these complementary sensor measurements is not easily modeled, so we look towards a learned algorithm. Moreover, we can train a learned approach end-to-end not just for denoising tasks, but for upsampling, classification, or even to estimate waypoints for self-driving cars.
  • #23: So given the difficulty in isolating the signal
  • #24: So how can we use CNNs to learn to perform the sensor fusion and depth estimation task?
  • #25: In order to train the CNN, we use our image formation model to construct a dataset of simulated SPAD measurements and intensity images. We model the SPAD measurements from the NYU v2 dataset of RGB-D images, where at each pixel location we calculate the laser pulse arrival time and radial falloff based on the depth, and we account for object reflectivity by estimating the albedo using a method for intrinsic decomposition. We don't model multipath effects, though we find this is not necessary for the CNN to learn the depth reconstruction.
  • #26: The algorithm takes as input the noisy photon detections and optionally an intensity image. The CNN censors spurious detections and regresses the laser pulse from the measurements at each spatial location. From there we can output a depth map, or use the intensity image, which is higher resolution than the SPAD sensor, to do a guided depth upsampling.
  • #27: We train the network on the dataset of SPAD measurements. The network regresses the pulse at each spatial location, and we can take the peak to determine the depth value. We compare the output pulse from the network to the normalized ground truth pulse arrival and use the KL divergence as a loss function. The intuition is that depth values close to the ground truth should be penalized less than those far away, and this loss allows the network to receive some reward for getting close to the ground truth pulse if it's not perfect.
  • #28: Here’s an overview of the CNN architecture for the case where we use only the SPAD measurements to estimate depth. We pass the spad measurement volume into a network of 3D convolutions at multiple resolution scales, which are then upsampled to the original resolution and processed with additional 3d convolutional layers. The output of the network is the regressed laser pulse at each spatial location.
  • #29: We can concatenate the intensity image into the network and jointly process it with features from the SPAD measurements to improve the output depth estimate.
  • #30: Finally, we can use the high-resolution intensity image to upsample the initial low-resolution depth map output. We model the upsampling network on a state of the art image-guided depth upsampling approach. This takes a high-pass filtered version of the low-resolution depth map, upsamples it, and adds it to a bicubicly upsampled version of the depth map to predict the high-resolution version. In this way we can have an end-to-end trained network that predicts an upsampled depth map from the raw photon counts and an intensity image.
  • #31-32: Here are some results for simulated SPAD measurements on a scene from the Middlebury dataset. The scene and the following are all simulated with an average of 2 laser photon detections and 50 background detections per pixel. We can outperform conventional log-matched filtering and Shin et al.,
  • #33: and recover more fine details in the scene compared to recent work from Rapp and Goyal.
  • #34-35: Here is another scene. Our method with the intensity image recovers details in the laundry basket that are smoothed over by Rapp and Goyal. These details are also preserved in our approach without the intensity image, though at this noise level there are artifacts in the reconstruction. Note that the RMSE metric does not heavily penalize the output of Rapp and Goyal despite the loss of detail.
  • #36: This table shows RMSE values averaged across a test set of Middlebury scenes for a range of signal and background detection levels. Our approach with the intensity image achieves comparable quantitative results to Rapp and Goyal, but qualitatively better preserves detail as shown in the Laundry scene. These results are for a single model trained across a range of noise levels. We can also improve the performance in some cases by training models specific to each noise level.
  • #37-38: Here's an example of the image-guided depth upsampling. In this case we show the upsampled result after depth estimation with different methods. At lower resolutions the superpixel clustering approach of Rapp and Goyal doesn't work as well to produce a good initial depth estimate, and so the upsampled output fails to recover many of the details. We can run our depth estimate with the intensity image and then use bicubic upsampling, or run our depth CNN and an image-guided depth upsampling CNN in a two-step procedure. Finally, we can train the depth estimation and upsampling end-to-end and achieve a result which has the least error and a better qualitative appearance than the other images.
  • #39: Our end-to-end approach also demonstrates significant improvements over other approaches to image guided upsampling from the raw spad measurements.
  • #40: Finally, to demonstrate our method, we built a hardware prototype
  • #41: The prototype contains two optical paths: one path for the illumination optics and one for the imaging optics. Along the illumination path, a picosecond laser pulse passes through a cylindrical lens to illuminate a scanline on the scene. We use a linear 256x1 array of SPAD pixels to image this illuminated line. The linear SPAD array and the laser line are scanned using two sets of synchronized scanning mirrors.
  • #42: Here’s an example of the prototype in action
  • #44: In slow motion you can see the scanning path of the laser.
  • #45: Here’s an example video we captured with the SPAD prototype along with the SPAD measurement volume
  • #46: Our reconstruction approach with and without the intensity image outperforms both log-matched filtering and Rapp and Goyal's approach. We can also increase the density of the point cloud with upsampling, though at the cost of introducing floating pixel artifacts.
  • #47: Here’s another example of an indoor scene which we captured.
  • #48: … and the reconstructed point clouds
  • #49: Here’s another example of an indoor scene which we captured.
  • #50: … and the reconstructed point clouds
  • #51: Finally, we also captured these last scenes outdoors in indirect sunlight.
  • #53: Finally, we also captured this last scene outdoors in indirect sunlight. Notice that we barely get any signal, so it’s surprising we can recover anything at all.
  • #55: One clear limitation of the prototype is that our range was limited to a couple of meters. This is because our low-power laser and low fill-factor SPAD give us very low photon counts and limit the range we can achieve. This first plot shows the number of photon detections vs. distance, and shows that at around 1.5 meters we receive less than a single photon detection from the laser pulse on average. This is less than the average number of ambient and dark counts that are detected. However, another group has been able to achieve 800 m range for the same or better signal-to-background ratio with a more engineered optical system, and our method could apply to that as well.
  • #56-59: Finally, we note that our algorithm, along with other techniques for single-photon depth estimation, currently requires offline processing of the data. We use a fairly large input measurement volume, which could be downsampled to mitigate the processing requirements. But along with other solutions, I would point out that compute resources are becoming increasingly capable, so processing on a TPU would already increase the bandwidth by nearly 20x. Also, if we were to deploy the system at night-time, we wouldn't be able to make much use of the intensity image. However, the measurements would also be much less noisy without ambient light, and so we could still use the denoising framework without the intensity image. Note too that since the SPAD observes the world at the laser wavelength and the intensity image captures other wavelengths of light, we assume that the image of the scene changes smoothly over the spectrum so that the measurements are compatible. While we do achieve fairly good temporal consistency in the video results, this isn't explicitly encoded into the framework; it could be exploited with a different kind of architecture. Finally, acquisition speed could be further improved using a 2D SPAD array rather than a line sensor, which would mitigate the scanning requirement. SPADs are a CMOS-based technology, and 2D SPAD arrays are already being produced at sizeable resolutions, including greater than 256 x 256.
  • #60: In summary, we've presented a photon-efficient method for 3D imaging which leverages single-photon avalanche diodes and a high-resolution intensity image. While sensor fusion for this task is a very challenging analytical problem, we can readily leverage CNNs in a learned approach, and we demonstrate a photon-efficient hardware prototype.
  • #61: I’d like to thank our sponsors and also mention that code and data for this project are online on our project webpage. Thank you