Overview: This document presents DynIBaR, a new dynamic image-based rendering method for synthesizing novel views of complex dynamic scenes from monocular video. Existing methods such as HyperNeRF and NSFF produce blurry or inaccurate results on long videos with complex motion. DynIBaR adopts a volumetric image-based rendering framework that aggregates features from nearby views while accounting for scene motion, enabling high-quality novel view synthesis. It introduces motion trajectory fields to efficiently model scene motion across multiple frames, proposes cross-time rendering to enforce temporal consistency, and combines static and dynamic models supervised by a motion segmentation module within a Bayesian learning framework. Experiments show that DynIBaR significantly outperforms existing methods on several benchmark datasets and maintains high rendering quality on in-the-wild videos.
Intended audience: computer vision researchers, image processing engineers, and deep learning practitioners.
Use cases and goals: (1) synthesizing high-quality novel views of complex dynamic scenes; (2) handling videos with long duration, unconstrained camera trajectories, and fast, complex object motion; (3) improving the temporal consistency and rendering quality of dynamic scene reconstruction.
Additional notes: despite its strong performance, DynIBaR still has limitations, such as difficulty with very small, fast-moving objects and sensitivity to certain degenerate motion patterns; the quality of rendered static content also depends on the selected source views. Overall, DynIBaR provides a powerful tool for dynamic scene reconstruction in real-world settings.
DynIBaR: Neural Dynamic Image-Based Rendering
Zhengqi Li¹, Qianqian Wang¹,², Forrester Cole¹, Richard Tucker¹, Noah Snavely¹
¹Google Research   ²Cornell Tech
[Figure 1 image: comparison panels labeled HyperNeRF, NSFF, Ours (left) and HyperNeRF, NSFF, Ours, GT (right), with LPIPS values 0.31, 0.19, and 0.04 overlaid.]
Figure 1. Recent methods for synthesizing novel views from monocular videos of dynamic scenes, like HyperNeRF [50] and NSFF [35], struggle to render high-quality views from long videos featuring complex camera and scene motion. We present a new approach that addresses these limitations, illustrated above via an application to 6-DoF video stabilization, where we apply our approach and prior methods on a 30-second, shaky video clip, and compare novel views rendered along a smoothed camera path (left). On a dynamic scenes dataset (right) [75], our approach significantly improves rendering fidelity, as indicated by synthesized images and LPIPS errors computed on pixels corresponding to moving objects (yellow numbers). Please see the supplementary video for full results.
Abstract
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of MLPs, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings.
1. Introduction
Computer vision methods can now produce free-viewpoint renderings of static 3D scenes with spectacular quality. What about moving scenes, like those featuring people or pets? Novel view synthesis from a monocular video of a dynamic scene is a much more challenging dynamic scene reconstruction problem. Recent work has made progress towards synthesizing novel views in both space and time, thanks to new time-varying neural volumetric representations like HyperNeRF [50] and Neural Scene Flow Fields (NSFF) [35], which encode spatiotemporally varying scene content volumetrically within a coordinate-based multi-layer perceptron (MLP).
However, these dynamic NeRF methods have limitations that prevent their application to casual, in-the-wild videos. Local scene flow-based methods like NSFF struggle to scale to longer input videos captured with unconstrained camera motions: the NSFF paper only claims good performance for 1-second, forward-facing videos [35]. Methods like HyperNeRF that construct a canonical model are mostly constrained to object-centric scenes with controlled camera paths, and can fail on scenes with complex object motion.
In this work, we present a new approach that is scalable to dynamic videos captured with 1) long time duration, 2) unbounded scenes, 3) uncontrolled camera trajectories, and 4) fast and complex object motion. Our approach retains the advantages of volumetric scene representations that can model intricate scene geometry with view-dependent effects, while significantly improving rendering fidelity for both static and dynamic scene content compared to recent methods [35, 50], as illustrated in Fig. 1.

We take inspiration from recent methods for rendering static scenes that synthesize novel images by aggregating local image features from nearby views along epipolar lines [39, 64, 70]. However, scenes that are in motion violate the epipolar constraints assumed by those methods. We instead propose to aggregate multi-view image features in scene-motion-adjusted ray space, which allows us to correctly reason about spatio-temporally varying geometry and appearance.
We also encountered many efficiency and robustness challenges in scaling up aggregation-based methods to dynamic scenes. To efficiently model scene motion across multiple views, we model this motion using motion trajectory fields that span multiple frames, represented with learned basis functions. Furthermore, to achieve temporal coherence in our dynamic scene reconstruction, we introduce a new temporal photometric loss that operates in motion-adjusted ray space. Finally, to improve the quality of novel views, we propose to factor the scene into static and dynamic components through a new IBR-based motion segmentation technique within a Bayesian learning framework.

On two dynamic scene benchmarks, we show that our approach can render highly detailed scene content and significantly improves upon the state-of-the-art, leading to an average reduction in LPIPS errors by over 50% both across entire scenes, as well as on regions corresponding to dynamic objects. We also show that our method can be applied to in-the-wild videos with long duration, complex scene motion, and uncontrolled camera trajectories, where prior state-of-the-art methods fail to produce high-quality renderings. We hope that our work advances the applicability of dynamic view synthesis methods to real-world videos.
2. Related Work
Novel view synthesis. Classic image-based rendering (IBR) methods synthesize novel views by integrating pixel information from input images [58], and can be categorized according to their dependence on explicit geometry. Light field or lumigraph rendering methods [9, 21, 26, 32] generate new views by filtering and interpolating sampled rays, without use of explicit geometric models. To handle sparser input views, many approaches [7, 14, 18, 23, 24, 26, 30, 52, 54, 55] leverage pre-computed proxy geometry such as depth maps or meshes to render novel views.

Recently, neural representations have demonstrated high-quality novel view synthesis [12, 17, 38, 40, 46, 48, 59-62, 72, 81]. In particular, Neural Radiance Fields (NeRF) [46] achieves an unprecedented level of fidelity by encoding continuous scene radiance fields within multi-layer perceptrons (MLPs). Among all methods building on NeRF, IBRNet [70] is the most relevant to our work. IBRNet combines classical IBR techniques with volume rendering to produce a generalized IBR module that can render high-quality views without per-scene optimization. Our work extends this kind of volumetric IBR framework designed for static scenes [11, 64, 70] to more challenging dynamic scenes. Note that our focus is on synthesizing higher-quality novel views for long videos with complex camera and object motion, rather than on generalization across scenes.
Dynamic scene view synthesis. Our work is related to geometric reconstruction of dynamic scenes from RGBD [5, 15, 25, 47, 68, 83] or monocular videos [31, 44, 78, 80]. However, depth- or mesh-based representations struggle to model complex geometry and view-dependent effects.

Most prior work on novel view synthesis for dynamic scenes requires multiple synchronized input videos [1, 3, 6, 27, 33, 63, 69, 76, 82], limiting their real-world applicability. Some methods [8, 13, 22, 51, 71] use domain knowledge such as template models to achieve high-quality results, but are restricted to specific categories [41, 56]. More recently, many works propose to synthesize novel views of dynamic scenes from a single camera. Yoon et al. [75] render novel views through explicit warping using depth maps obtained via single-view depth and multi-view stereo. However, this method fails to model complex scene geometry and to fill in realistic and consistent content at disocclusions. With advances in neural rendering, NeRF-based dynamic view synthesis methods have shown state-of-the-art results [16, 35, 53, 66, 74]. Some approaches, such as Nerfies [49] and HyperNeRF [50], represent scenes using a deformation field mapping each local observation to a canonical scene representation. These deformations are conditioned on time [53] or a per-frame latent code [49, 50, 66], and are parameterized as translations [53, 66] or rigid body motion fields [49, 50]. These methods can handle long videos, but are mostly limited to object-centric scenes with relatively small object motion and controlled camera paths. Other methods represent scenes as time-varying NeRFs [19, 20, 35, 67, 74].
In particular, NSFF uses neural scene flow fields that can capture fast and complex 3D scene motion for in-the-wild videos [35]. However, this method only works well for short (1-2 second), forward-facing videos.

[Figure 2 image: pipeline diagram showing a target view, source views, the ray transformer, volume rendering, and a density $\sigma$ versus ray distance plot producing the rendered view.]
Figure 2. Rendering via motion-adjusted multi-view feature aggregation. Given a sampled location $\mathbf{x}$ at time $i$ along a target ray $\mathbf{r}$, we estimate its motion trajectory, which determines the 3D correspondence of $\mathbf{x}$ at nearby time $j \in \mathcal{N}(i)$, denoted $\mathbf{x}_{i \to j}$. Each warped point is then projected into its corresponding source view. Image features $\mathbf{f}_j$ extracted along the projected curves are aggregated and fed to the ray transformer with time embedding $\gamma(i)$, producing per-sample color and density $(\mathbf{c}_i, \sigma_i)$. The final pixel color $\hat{C}_i$ is then synthesized by volume rendering $(\mathbf{c}_i, \sigma_i)$ along $\mathbf{r}$.
3. Dynamic Image-Based Rendering
Given a monocular video of a dynamic scene with frames $(I_1, I_2, \dots, I_N)$ and known camera parameters $(P_1, P_2, \dots, P_N)$, our goal is to synthesize a novel viewpoint at any desired time within the video. Like many other approaches, we train per-video, first optimizing a model to reconstruct the input frames, then using this model to render novel views.
Rather than encoding 3D color and density directly in the weights of an MLP as in recent dynamic NeRF methods, we integrate classical IBR ideas into a volumetric rendering framework. Compared to explicit surfaces, volumetric representations can more readily model complex scene geometry with view-dependent effects.
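As a point of reference for the volumetric rendering step used throughout this framework, the sketch below composites per-sample colors and densities along a ray using standard NeRF-style numerical quadrature. The function name and tensor shapes are illustrative choices, not taken from the paper's implementation.

```python
import torch

def volume_render(colors, sigmas, z_vals):
    """Composite per-sample (color, density) along a ray, NeRF-style.

    colors: (num_rays, num_samples, 3) per-sample RGB c_i
    sigmas: (num_rays, num_samples)    per-sample density sigma_i
    z_vals: (num_rays, num_samples)    sample distances along each ray
    """
    # Distances between adjacent samples; pad the last interval.
    deltas = z_vals[:, 1:] - z_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)

    # Per-interval opacity and accumulated transmittance along the ray.
    alpha = 1.0 - torch.exp(-sigmas * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = alpha * trans                            # (num_rays, num_samples)

    rgb = (weights[..., None] * colors).sum(dim=1)     # final pixel color C_hat
    return rgb, weights
```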
The following sections introduce our methods for scene-motion-adjusted multi-view feature aggregation (Sec. 3.1), and enforcing temporal consistency via cross-time rendering in motion-adjusted ray space (Sec. 3.2). Our full system combines a static model and a dynamic model to produce a color at each pixel. Accurate scene factorization is achieved via segmentation masks derived from a separately trained motion segmentation module within a Bayesian learning framework (Sec. 3.3).
3.1. Motion-adjusted feature aggregation
We synthesize new views by aggregating features extracted from temporally nearby source views. To render an image at time $i$, we first identify source views $I_j$ within a temporal radius $r$ frames of $i$, $j \in \mathcal{N}(i) = [i - r, i + r]$. For each source view, we extract a 2D feature map $F_j$ through a shared convolutional encoder network to form an input tuple $\{I_j, P_j, F_j\}$.
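As a rough sketch of this setup (with a placeholder encoder architecture, since the paper does not specify its network here), the following code selects source views within a temporal radius of the target time and extracts a shared-CNN feature map for each:

```python
import torch
import torch.nn as nn

class SharedFeatureEncoder(nn.Module):
    """Small shared CNN mapping each source image to a 2D feature map F_j.
    (Illustrative architecture; not the encoder used in the paper.)"""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1),
        )

    def forward(self, images):            # (V, 3, H, W) -> (V, feat_dim, H/2, W/2)
        return self.net(images)

def select_source_views(i, num_frames, radius):
    """Indices j in N(i) = [i - r, i + r], clamped to the video bounds."""
    return list(range(max(0, i - radius), min(num_frames - 1, i + radius) + 1))

# Example usage (frames: (N, 3, H, W) video tensor; cameras[j] would hold P_j):
# src_ids = select_source_views(i=10, num_frames=frames.shape[0], radius=3)
# feats = SharedFeatureEncoder()(frames[src_ids])   # one F_j per source view
```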
To predict the color and density of each point sampled along a target ray $\mathbf{r}$, we must aggregate source view features while accounting for scene motion. For a static scene, points along a target ray will lie along a corresponding epipolar line in a neighboring source view, hence we can aggregate potential correspondences by simply sampling along neighboring epipolar lines [64, 70]. However, moving scene elements violate epipolar constraints, leading to inconsistent feature aggregation if motion is not accounted for. Hence, we perform motion-adjusted feature aggregation, as shown in Fig. 3. To determine correspondence in dynamic scenes, one straightforward idea is to estimate a scene flow field via an MLP [35] to determine a given point's motion-adjusted 3D location at a nearby time. However, this strategy is computationally infeasible in a volumetric IBR framework due to recursive unrolling of the MLPs.
Motion trajectory fields. Instead, we represent scene motion using motion trajectory fields described in terms of learned basis functions. For a given 3D point $\mathbf{x}$ along target ray $\mathbf{r}$ at time $i$, we encode its trajectory coefficients with an MLP $G_{\mathrm{MT}}$:

$$\{\phi^l_i(\mathbf{x})\}_{l=1}^{L} = G_{\mathrm{MT}}(\gamma(\mathbf{x}), \gamma(i)), \tag{1}$$

where $\phi^l_i \in \mathbb{R}^3$ are basis coefficients (with separate coefficients for $x$, $y$, and $z$, using the motion basis described below) and $\gamma$ denotes positional encoding. We choose $L = 6$ bases and 16 linearly increasing frequencies for the encoding $\gamma$, based on the assumption that scene motion tends to be low frequency [80].
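A minimal sketch of Eq. (1), assuming a sinusoidal positional encoding $\gamma$ with 16 frequencies and a small fully connected network for $G_{\mathrm{MT}}$; the layer widths and depth are illustrative assumptions rather than the paper's architecture:

```python
import math
import torch
import torch.nn as nn

def positional_encoding(p, num_freqs=16):
    """gamma(p): [sin(2^k * pi * p), cos(2^k * pi * p)] for k = 0..num_freqs-1."""
    freqs = 2.0 ** torch.arange(num_freqs, device=p.device) * math.pi
    angles = p[..., None] * freqs                        # (..., D, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)                     # (..., D * 2 * num_freqs)

class TrajectoryMLP(nn.Module):
    """G_MT: (gamma(x), gamma(i)) -> {phi_i^l(x)}_{l=1..L}, each phi in R^3."""
    def __init__(self, num_bases=6, num_freqs=16, hidden=128):
        super().__init__()
        in_dim = (3 + 1) * 2 * num_freqs                 # encoded x (3D) and time i (1D)
        self.num_bases = num_bases
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bases * 3),
        )

    def forward(self, x, t):
        # x: (num_pts, 3) sample locations; t: (num_pts, 1) normalized time i.
        h = torch.cat([positional_encoding(x), positional_encoding(t)], dim=-1)
        return self.net(h).view(-1, self.num_bases, 3)   # (num_pts, L, 3)
```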
We also introduce a global learnable motion basis $\{h^l_i\}_{l=1}^{L}$, $h^l_i \in \mathbb{R}$, spanning every time step $i$ of the input video, which is optimized jointly with the MLP. The motion trajectory of $\mathbf{x}$ is then defined as $\Gamma_{\mathbf{x},i}(j) = \sum_{l=1}^{L} h^l_j \phi^l_i(\mathbf{x})$, and thus, the relative displacement between $\mathbf{x}$ and its 3D correspondence $\mathbf{x}_{i \to j}$ at time $j$ is computed as

$$\Delta_{\mathbf{x},i}(j) = \Gamma_{\mathbf{x},i}(j) - \Gamma_{\mathbf{x},i}(i). \tag{2}$$

With this motion trajectory representation, finding 3D correspondences for a query point $\mathbf{x}$ in neighboring views requires just a single MLP query, allowing efficient multi-view feature aggregation within our volume rendering framework.
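Given basis values $h^l_j$ for every time step and the coefficients $\phi^l_i(\mathbf{x})$ predicted by $G_{\mathrm{MT}}$, the trajectory $\Gamma_{\mathbf{x},i}(j)$ and the displacement of Eq. (2) reduce to simple tensor contractions, as in this sketch (shapes are illustrative):

```python
import torch

def trajectory(h, phi, j):
    """Gamma_{x,i}(j) = sum_l h_j^l * phi_i^l(x).

    h:   (num_frames, L)   learnable basis values h_j^l for every time step
    phi: (num_pts, L, 3)   coefficients phi_i^l(x) predicted by G_MT
    j:   int               query time step
    """
    return torch.einsum('l,plc->pc', h[j], phi)          # (num_pts, 3)

def displacement(h, phi, i, j):
    """Delta_{x,i}(j) = Gamma_{x,i}(j) - Gamma_{x,i}(i), Eq. (2)."""
    return trajectory(h, phi, j) - trajectory(h, phi, i)

# x_warped = x + displacement(h, phi, i, j) gives the correspondence x_{i->j}.
```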
We initialize the basis $\{h^l_i\}_{l=1}^{L}$ with the DCT basis as proposed by Wang et al. [67], but fine-tune it along with other components during optimization, since we observe that a fixed DCT basis can fail to model a wide range of real-world motions (see third column of Fig. 4).
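For illustration, one way to build such an initialization is to sample a DCT-II-style cosine basis at every time step and register it as a learnable parameter; the exact basis and normalization used by Wang et al. [67] may differ, so treat this sketch as an assumption:

```python
import math
import torch

def dct_basis(num_frames, num_bases=6):
    """DCT-style temporal basis: h[i, l] = cos(pi / N * (i + 0.5) * l).

    Returned as a learnable parameter so it can be fine-tuned jointly with the
    trajectory MLP, as described above. (Normalization is an illustrative choice.)
    """
    i = torch.arange(num_frames, dtype=torch.float32)[:, None]   # time steps
    l = torch.arange(num_bases, dtype=torch.float32)[None, :]    # basis index
    basis = torch.cos(math.pi / num_frames * (i + 0.5) * l)      # (N, L)
    return torch.nn.Parameter(basis)
```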
Using the estimated motion trajectory of $\mathbf{x}$ at time $i$, we denote $\mathbf{x}$'s corresponding 3D point at time $j$ as $\mathbf{x}_{i \to j} = \mathbf{x} + \Delta_{\mathbf{x},i}(j)$. We project each warped point $\mathbf{x}_{i \to j}$ into its source view $I_j$ using camera parameters $P_j$, and extract color and feature vector $\mathbf{f}_j$ at the projected 2D pixel location.
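To make the projection step concrete, the sketch below maps warped points into a source view with a pinhole model and bilinearly samples the source feature map at the projected pixels. The camera convention (intrinsics K and a world-to-camera rotation/translation standing in for P_j) is an assumed parameterization for illustration:

```python
import torch
import torch.nn.functional as F

def project_points(x_warped, K, R, t):
    """Project 3D points (world frame) into a source view with a pinhole model.

    x_warped: (P, 3); K: (3, 3) intrinsics; R: (3, 3), t: (3,) world-to-camera.
    Returns pixel coordinates (P, 2); assumes points lie in front of the camera.
    """
    x_cam = x_warped @ R.T + t                       # world -> camera
    x_img = x_cam @ K.T                              # camera -> image plane
    return x_img[:, :2] / x_img[:, 2:3].clamp(min=1e-6)

def sample_features(feat_map, pix, height, width):
    """Bilinearly sample a feature map F_j at projected pixel locations.

    feat_map: (C, H, W); pix: (P, 2) in pixel coordinates of that feature map.
    """
    # Normalize pixel coordinates to [-1, 1] for grid_sample (x = width, y = height).
    grid = torch.stack([pix[:, 0] / (width - 1), pix[:, 1] / (height - 1)], dim=-1)
    grid = (grid * 2.0 - 1.0).view(1, 1, -1, 2)      # (1, 1, P, 2)
    sampled = F.grid_sample(feat_map[None], grid, align_corners=True)
    return sampled[0, :, 0].T                        # (P, C), one f_j per point
```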
[The remaining 11 pages of the paper are not included in this extract.]