Overview: This document presents DynIBaR, a new dynamic image-based rendering method for synthesizing novel views of complex dynamic scenes from monocular video. Existing methods such as HyperNeRF and NSFF produce blurry or inaccurate results on long videos with complex motion. DynIBaR adopts a volumetric image-based rendering framework that aggregates features from nearby views while accounting for scene motion, enabling high-quality novel view synthesis. It introduces motion trajectory fields to efficiently model scene motion across multiple frames, proposes cross-time rendering to enforce temporal consistency, and combines static and dynamic models supervised by a motion segmentation module within a Bayesian learning framework. Experiments show that DynIBaR significantly outperforms existing methods on several benchmark datasets and maintains high rendering quality on in-the-wild videos.
Intended audience: computer vision researchers, image processing engineers, and deep learning practitioners.
Use cases and goals: (1) synthesizing high-quality novel views of complex dynamic scenes; (2) handling videos with long duration, unconstrained camera trajectories, and fast, complex object motion; (3) improving the temporal consistency and rendering quality of dynamic scene reconstruction.
Additional notes: despite its strong performance, DynIBaR still has limitations, such as difficulty with very small, fast-moving objects and sensitivity to certain degenerate motion patterns; the quality of rendered static content also depends on the selected source views. Overall, DynIBaR provides a powerful tool for dynamic scene reconstruction in real-world settings.
DynIBaR: Neural Dynamic Image-Based Rendering
Zhengqi Li¹, Qianqian Wang¹,², Forrester Cole¹, Richard Tucker¹, Noah Snavely¹
¹Google Research   ²Cornell Tech
[Figure 1 image: comparison panels labeled HyperNeRF, NSFF, Ours (left) and HyperNeRF, NSFF, Ours, GT (right), with LPIPS values 0.31, 0.19, and 0.04 overlaid.]
Figure 1. Recent methods for synthesizing novel views from monocular videos of dynamic scenes, like HyperNeRF [50] and NSFF [35], struggle to render high-quality views from long videos featuring complex camera and scene motion. We present a new approach that addresses these limitations, illustrated above via an application to 6-DoF video stabilization, where we apply our approach and prior methods on a 30-second, shaky video clip, and compare novel views rendered along a smoothed camera path (left). On a dynamic scenes dataset (right) [75], our approach significantly improves rendering fidelity, as indicated by synthesized images and LPIPS errors computed on pixels corresponding to moving objects (yellow numbers). Please see the supplementary video for full results.
Abstract
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of MLPs, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings.
1. Introduction
Computer vision methods can now produce free-viewpoint renderings of static 3D scenes with spectacular quality. What about moving scenes, like those featuring people or pets? Novel view synthesis from a monocular video of a dynamic scene is a much more challenging dynamic scene reconstruction problem. Recent work has made progress towards synthesizing novel views in both space and time, thanks to new time-varying neural volumetric representations like HyperNeRF [50] and Neural Scene Flow Fields (NSFF) [35], which encode spatiotemporally varying scene content volumetrically within a coordinate-based multi-layer perceptron (MLP).
However, these dynamic NeRF methods have limitations that prevent their application to casual, in-the-wild videos. Local scene flow-based methods like NSFF struggle to scale to longer input videos captured with unconstrained camera motions: the NSFF paper only claims good performance for 1-second, forward-facing videos [35]. Methods like HyperNeRF that construct a canonical model are mostly constrained to object-centric scenes with controlled camera paths, and can fail on scenes with complex object motion.
In this work, we present a new approach that is scalable to dynamic videos captured with 1) long time duration, 2) unbounded scenes, 3) uncontrolled camera trajectories, and 4) fast and complex object motion. Our approach retains the advantages of volumetric scene representations that can model intricate scene geometry with view-dependent effects, while significantly improving rendering fidelity for both static and dynamic scene content compared to recent methods [35, 50], as illustrated in Fig. 1.

We take inspiration from recent methods for rendering static scenes that synthesize novel images by aggregating local image features from nearby views along epipolar lines [39, 64, 70]. However, scenes that are in motion violate the epipolar constraints assumed by those methods. We instead propose to aggregate multi-view image features in scene-motion-adjusted ray space, which allows us to correctly reason about spatio-temporally varying geometry and appearance.
We also encountered many efficiency and robustness challenges in scaling up aggregation-based methods to dynamic scenes. To efficiently model scene motion across multiple views, we model this motion using motion trajectory fields that span multiple frames, represented with learned basis functions. Furthermore, to achieve temporal coherence in our dynamic scene reconstruction, we introduce a new temporal photometric loss that operates in motion-adjusted ray space. Finally, to improve the quality of novel views, we propose to factor the scene into static and dynamic components through a new IBR-based motion segmentation technique within a Bayesian learning framework.

On two dynamic scene benchmarks, we show that our approach can render highly detailed scene content and significantly improves upon the state-of-the-art, leading to an average reduction in LPIPS errors by over 50% both across entire scenes, as well as on regions corresponding to dynamic objects. We also show that our method can be applied to in-the-wild videos with long duration, complex scene motion, and uncontrolled camera trajectories, where prior state-of-the-art methods fail to produce high-quality renderings. We hope that our work advances the applicability of dynamic view synthesis methods to real-world videos.
2. Related Work
Novel view synthesis. Classic image-based rendering (IBR) methods synthesize novel views by integrating pixel information from input images [58], and can be categorized according to their dependence on explicit geometry. Light field or lumigraph rendering methods [9, 21, 26, 32] generate new views by filtering and interpolating sampled rays, without use of explicit geometric models. To handle sparser input views, many approaches [7, 14, 18, 23, 24, 26, 30, 52, 54, 55] leverage pre-computed proxy geometry such as depth maps or meshes to render novel views.

Recently, neural representations have demonstrated high-quality novel view synthesis [12, 17, 38, 40, 46, 48, 59-62, 72, 81]. In particular, Neural Radiance Fields (NeRF) [46] achieves an unprecedented level of fidelity by encoding continuous scene radiance fields within multi-layer perceptrons (MLPs). Among all methods building on NeRF, IBRNet [70] is the most relevant to our work. IBRNet combines classical IBR techniques with volume rendering to produce a generalized IBR module that can render high-quality views without per-scene optimization. Our work extends this kind of volumetric IBR framework designed for static scenes [11, 64, 70] to more challenging dynamic scenes. Note that our focus is on synthesizing higher-quality novel views for long videos with complex camera and object motion, rather than on generalization across scenes.
Dynamic scene view synthesis. Our work is related to geometric reconstruction of dynamic scenes from RGBD [5, 15, 25, 47, 68, 83] or monocular videos [31, 44, 78, 80]. However, depth- or mesh-based representations struggle to model complex geometry and view-dependent effects.

Most prior work on novel view synthesis for dynamic scenes requires multiple synchronized input videos [1, 3, 6, 27, 33, 63, 69, 76, 82], limiting their real-world applicability. Some methods [8, 13, 22, 51, 71] use domain knowledge such as template models to achieve high-quality results, but are restricted to specific categories [41, 56]. More recently, many works propose to synthesize novel views of dynamic scenes from a single camera. Yoon et al. [75] render novel views through explicit warping using depth maps obtained via single-view depth and multi-view stereo. However, this method fails to model complex scene geometry and to fill in realistic and consistent content at disocclusions. With advances in neural rendering, NeRF-based dynamic view synthesis methods have shown state-of-the-art results [16, 35, 53, 66, 74]. Some approaches, such as Nerfies [49] and HyperNeRF [50], represent scenes using a deformation field mapping each local observation to a canonical scene representation. These deformations are conditioned on time [53] or a per-frame latent code [49, 50, 66], and are parameterized as translations [53, 66] or rigid body motion fields [49, 50]. These methods can handle long videos, but are mostly limited to object-centric scenes with relatively small object motion and controlled camera paths. Other methods represent scenes as time-varying NeRFs [19, 20, 35, 67, 74].
In particular, NSFF uses neural scene flow fields that can capture fast and complex 3D scene motion for in-the-wild videos [35]. However, this method only works well for short (1-2 second), forward-facing videos.

[Figure 2 image: pipeline diagram showing a target view, source views, the ray transformer, volume rendering, and a density $\sigma$ versus ray distance plot producing the rendered view.]
Figure 2. Rendering via motion-adjusted multi-view feature aggregation. Given a sampled location $\mathbf{x}$ at time $i$ along a target ray $\mathbf{r}$, we estimate its motion trajectory, which determines the 3D correspondence of $\mathbf{x}$ at nearby time $j \in \mathcal{N}(i)$, denoted $\mathbf{x}_{i \to j}$. Each warped point is then projected into its corresponding source view. Image features $\mathbf{f}_j$ extracted along the projected curves are aggregated and fed to the ray transformer with time embedding $\gamma(i)$, producing per-sample color and density $(\mathbf{c}_i, \sigma_i)$. The final pixel color $\hat{C}_i$ is then synthesized by volume rendering $(\mathbf{c}_i, \sigma_i)$ along $\mathbf{r}$.
3. Dynamic Image-Based Rendering
Given a monocular video of a dynamic scene with frames $(I_1, I_2, \dots, I_N)$ and known camera parameters $(P_1, P_2, \dots, P_N)$, our goal is to synthesize a novel viewpoint at any desired time within the video. Like many other approaches, we train per-video, first optimizing a model to reconstruct the input frames, then using this model to render novel views.
Rather than encoding 3D color and density directly in the weights of an MLP as in recent dynamic NeRF methods, we integrate classical IBR ideas into a volumetric rendering framework. Compared to explicit surfaces, volumetric representations can more readily model complex scene geometry with view-dependent effects.
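As a point of reference for the volumetric rendering step used throughout this framework, the sketch below composites per-sample colors and densities along a ray using standard NeRF-style numerical quadrature. The function name and tensor shapes are illustrative choices, not taken from the paper's implementation.

```python
import torch

def volume_render(colors, sigmas, z_vals):
    """Composite per-sample (color, density) along a ray, NeRF-style.

    colors: (num_rays, num_samples, 3) per-sample RGB c_i
    sigmas: (num_rays, num_samples)    per-sample density sigma_i
    z_vals: (num_rays, num_samples)    sample distances along each ray
    """
    # Distances between adjacent samples; pad the last interval.
    deltas = z_vals[:, 1:] - z_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)

    # Per-interval opacity and accumulated transmittance along the ray.
    alpha = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = alpha * trans                            # (num_rays, num_samples)

    rgb = (weights[..., None] * colors).sum(dim=1)     # final pixel color C_hat
    return rgb, weights
```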
The following sections introduce our methods for scene-motion-adjusted multi-view feature aggregation (Sec. 3.1), and enforcing temporal consistency via cross-time rendering in motion-adjusted ray space (Sec. 3.2). Our full system combines a static model and a dynamic model to produce a color at each pixel. Accurate scene factorization is achieved via segmentation masks derived from a separately trained motion segmentation module within a Bayesian learning framework (Sec. 3.3).
3.1. Motion-adjusted feature aggregation
We synthesize new views by aggregating features extracted from temporally nearby source views. To render an image at time $i$, we first identify source views $I_j$ within a temporal radius $r$ frames of $i$, $j \in \mathcal{N}(i) = [i - r, i + r]$. For each source view, we extract a 2D feature map $F_j$ through a shared convolutional encoder network to form an input tuple $\{I_j, P_j, F_j\}$.
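As a rough sketch of this setup (with a placeholder encoder architecture, since the paper does not specify its network here), the following code selects source views within a temporal radius of the target time and extracts a shared-CNN feature map for each:

```python
import torch
import torch.nn as nn

class SharedFeatureEncoder(nn.Module):
    """Small shared CNN mapping each source image to a 2D feature map F_j.
    (Illustrative architecture; not the encoder used in the paper.)"""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1),
        )

    def forward(self, images):            # (V, 3, H, W) -> (V, feat_dim, H/2, W/2)
        return self.net(images)

def select_source_views(i, num_frames, radius):
    """Indices j in N(i) = [i - r, i + r], clamped to the video bounds."""
    return list(range(max(0, i - radius), min(num_frames - 1, i + radius) + 1))

# Example usage (frames: (N, 3, H, W) video tensor; cameras[j] would hold P_j):
# src_ids = select_source_views(i=10, num_frames=frames.shape[0], radius=3)
# feats = SharedFeatureEncoder()(frames[src_ids])   # one F_j per source view
```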
To predict the color and density of each point sampled along a target ray $\mathbf{r}$, we must aggregate source view features while accounting for scene motion. For a static scene, points along a target ray will lie along a corresponding epipolar line in a neighboring source view, hence we can aggregate potential correspondences by simply sampling along neighboring epipolar lines [64, 70]. However, moving scene elements violate epipolar constraints, leading to inconsistent feature aggregation if motion is not accounted for. Hence, we perform motion-adjusted feature aggregation, as shown in Fig. 3. To determine correspondence in dynamic scenes, one straightforward idea is to estimate a scene flow field via an MLP [35] to determine a given point's motion-adjusted 3D location at a nearby time. However, this strategy is computationally infeasible in a volumetric IBR framework due to recursive unrolling of the MLPs.
Motion trajectory fields. Instead, we represent scene motion using motion trajectory fields described in terms of learned basis functions. For a given 3D point $\mathbf{x}$ along target ray $\mathbf{r}$ at time $i$, we encode its trajectory coefficients with an MLP $G_{\mathrm{MT}}$:

$$\{\phi^l_i(\mathbf{x})\}_{l=1}^{L} = G_{\mathrm{MT}}(\gamma(\mathbf{x}), \gamma(i)), \tag{1}$$

where $\phi^l_i \in \mathbb{R}^3$ are basis coefficients (with separate coefficients for $x$, $y$, and $z$, using the motion basis described below) and $\gamma$ denotes positional encoding. We choose $L = 6$ bases and 16 linearly increasing frequencies for the encoding $\gamma$, based on the assumption that scene motion tends to be low frequency [80].
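A minimal sketch of Eq. (1), assuming a sinusoidal positional encoding $\gamma$ with 16 frequencies and a small fully connected network for $G_{\mathrm{MT}}$; the layer widths and depth are illustrative assumptions rather than the paper's architecture:

```python
import math
import torch
import torch.nn as nn

def positional_encoding(p, num_freqs=16):
    """gamma(p): [sin(2^k * pi * p), cos(2^k * pi * p)] for k = 0..num_freqs-1."""
    freqs = 2.0 ** torch.arange(num_freqs, device=p.device) * math.pi
    angles = p[..., None] * freqs                        # (..., D, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)                     # (..., D * 2 * num_freqs)

class TrajectoryMLP(nn.Module):
    """G_MT: (gamma(x), gamma(i)) -> {phi_i^l(x)}_{l=1..L}, each phi in R^3."""
    def __init__(self, num_bases=6, num_freqs=16, hidden=128):
        super().__init__()
        in_dim = (3 + 1) * 2 * num_freqs                 # encoded x (3D) and time i (1D)
        self.num_bases = num_bases
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bases * 3),
        )

    def forward(self, x, t):
        # x: (num_pts, 3) sample locations; t: (num_pts, 1) normalized time i.
        h = torch.cat([positional_encoding(x), positional_encoding(t)], dim=-1)
        return self.net(h).view(-1, self.num_bases, 3)   # (num_pts, L, 3)
```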
We also introduce a global learnable motion basis $\{h^l_i\}_{l=1}^{L}$, $h^l_i \in \mathbb{R}$, spanning every time step $i$ of the input video, which is optimized jointly with the MLP. The motion trajectory of $\mathbf{x}$ is then defined as $\Gamma_{\mathbf{x},i}(j) = \sum_{l=1}^{L} h^l_j \phi^l_i(\mathbf{x})$, and thus, the relative displacement between $\mathbf{x}$ and its 3D correspondence $\mathbf{x}_{i \to j}$ at time $j$ is computed as

$$\Delta_{\mathbf{x},i}(j) = \Gamma_{\mathbf{x},i}(j) - \Gamma_{\mathbf{x},i}(i). \tag{2}$$

With this motion trajectory representation, finding 3D correspondences for a query point $\mathbf{x}$ in neighboring views requires just a single MLP query, allowing efficient multi-view feature aggregation within our volume rendering framework.
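Given basis values $h^l_j$ for every time step and the coefficients $\phi^l_i(\mathbf{x})$ predicted by $G_{\mathrm{MT}}$, the trajectory $\Gamma_{\mathbf{x},i}(j)$ and the displacement of Eq. (2) reduce to simple tensor contractions, as in this sketch (shapes are illustrative):

```python
import torch

def trajectory(h, phi, j):
    """Gamma_{x,i}(j) = sum_l h_j^l * phi_i^l(x).

    h:   (num_frames, L)   learnable basis values h_j^l for every time step
    phi: (num_pts, L, 3)   coefficients phi_i^l(x) predicted by G_MT
    j:   int               query time step
    """
    return torch.einsum('l,plc->pc', h[j], phi)          # (num_pts, 3)

def displacement(h, phi, i, j):
    """Delta_{x,i}(j) = Gamma_{x,i}(j) - Gamma_{x,i}(i), Eq. (2)."""
    return trajectory(h, phi, j) - trajectory(h, phi, i)

# x_warped = x + displacement(h, phi, i, j) gives the correspondence x_{i->j}.
```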
We initialize the basis $\{h^l_i\}_{l=1}^{L}$ with the DCT basis as proposed by Wang et al. [67], but fine-tune it along with other components during optimization, since we observe that a fixed DCT basis can fail to model a wide range of real-world motions (see third column of Fig. 4).
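For illustration, one way to build such an initialization is to sample a DCT-II-style cosine basis at every time step and register it as a learnable parameter; the exact basis and normalization used by Wang et al. [67] may differ, so treat this sketch as an assumption:

```python
import math
import torch

def dct_basis(num_frames, num_bases=6):
    """DCT-style temporal basis: h[i, l] = cos(pi / N * (i + 0.5) * l).

    Returned as a learnable parameter so it can be fine-tuned jointly with the
    trajectory MLP, as described above. (Normalization is an illustrative choice.)
    """
    i = torch.arange(num_frames, dtype=torch.float32)[:, None]   # time steps
    l = torch.arange(num_bases, dtype=torch.float32)[None, :]    # basis index
    basis = torch.cos(math.pi / num_frames * (i + 0.5) * l)      # (N, L)
    return torch.nn.Parameter(basis)
```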
Using the estimated motion trajectory of $\mathbf{x}$ at time $i$, we denote $\mathbf{x}$'s corresponding 3D point at time $j$ as $\mathbf{x}_{i \to j} = \mathbf{x} + \Delta_{\mathbf{x},i}(j)$. We project each warped point $\mathbf{x}_{i \to j}$ into its source view $I_j$ using camera parameters $P_j$, and extract color and feature vector $\mathbf{f}_j$ at the projected 2D pixel location.
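To make the projection step concrete, the sketch below maps warped points into a source view with a pinhole model and bilinearly samples the source feature map at the projected pixels. The camera convention (intrinsics K and a world-to-camera rotation/translation standing in for P_j) is an assumed parameterization for illustration:

```python
import torch
import torch.nn.functional as F

def project_points(x_warped, K, R, t):
    """Project 3D points (world frame) into a source view with a pinhole model.

    x_warped: (P, 3); K: (3, 3) intrinsics; R: (3, 3), t: (3,) world-to-camera.
    Returns pixel coordinates (P, 2); assumes points lie in front of the camera.
    """
    x_cam = x_warped @ R.T + t                       # world -> camera
    x_img = x_cam @ K.T                              # camera -> image plane
    return x_img[:, :2] / x_img[:, 2:3].clamp(min=1e-6)

def sample_features(feat_map, pix, height, width):
    """Bilinearly sample a feature map F_j at projected pixel locations.

    feat_map: (C, H, W); pix: (P, 2) in pixel coordinates of that feature map.
    """
    # Normalize pixel coordinates to [-1, 1] for grid_sample (x = width, y = height).
    grid = torch.stack([pix[:, 0] / (width - 1), pix[:, 1] / (height - 1)], dim=-1)
    grid = (grid * 2.0 - 1.0).view(1, 1, -1, 2)      # (1, 1, P, 2)
    sampled = F.grid_sample(feat_map[None], grid, align_corners=True)
    return sampled[0, :, 0].T                        # (P, C), one f_j per point
```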
[The remaining 11 pages of the paper are not included in this extract.]