Greenplum Orca Query Optimizer Explained

The Orca query optimizer is a new optimizer developed by Pivotal to improve the performance of big-data analytical queries. Orca adopts a modular architecture that combines state-of-the-art query optimization techniques with original research, resulting in an efficient, extensible, and portable optimizer.

Why the Query Optimizer Matters
In a data management system, the query optimizer plays a critical role: its quality directly affects query execution speed and resource utilization. With the arrival of the big-data era, this role has become ever more important. An effective optimizer lets an organization analyze and process large volumes of data quickly and turn them into business value.

The Orca Architecture
Orca is built on a modular design. Unlike traditional optimizers that are embedded in a specific host system, Orca runs as a stand-alone optimizer: the host database parses a SQL statement into a query tree, Orca generates an execution plan from that tree, and the host system executes the plan. This separation gives Orca a high degree of flexibility and extensibility.

Key Characteristics of Orca
Orca has the following characteristics:
* Modular architecture: the modular design makes Orca easy to maintain and extend.
* Efficient optimization: state-of-the-art optimization techniques allow Orca to produce efficient execution plans.
* Portability: Orca can be ported to different data management systems, including Greenplum and HAWQ.
* Extensibility: the modular architecture allows Orca to grow with ever-increasing big-data workloads.

Orca's Impact on Greenplum
Orca has had a far-reaching impact on the Greenplum database system. It raises Greenplum's query performance to meet the demands of big-data analytics, and it allows Greenplum to be applied in a wider range of scenarios.

Conclusion
Orca is Pivotal's new query optimizer for big-data analytics. Its modular architecture unites state-of-the-art optimization techniques with original research to deliver an efficient, extensible, and portable optimizer, significantly improving Greenplum's query performance and scalability.

Orca: A Modular Query Optimizer Architecture for Big Data

Mohamed A. Soliman*, Lyublena Antova*, Venkatesh Raghavan*, Amr El-Helw*, Zhongxian Gu*, Entong Shen*, George C. Caragea*, Carlos Garcia-Alvarado*, Foyzur Rahman*, Michalis Petropoulos*, Florian Waas‡, Sivaramakrishnan Narayanan§, Konstantinos Krikellas†, Rhonda Baldwin*

*Pivotal Inc., Palo Alto, USA; ‡Datometry Inc., San Francisco, USA; †Google Inc., Mountain View, USA; §Qubole Inc., Mountain View, USA
ABSTRACT

The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.

In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with our own original research, resulting in a modular and portable optimizer architecture. In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Query processing; Distributed databases

Keywords
Query Optimization, Cost Model, MPP, Parallel Processing
1. INTRODUCTION

Big Data has brought about a renewed interest in query optimization as a new breed of data management systems has pushed the envelope in terms of unprecedented scalability, availability, and processing capabilities (cf. e.g., [9, 18, 20, 21]), which makes large datasets of hundreds of terabytes or even petabytes readily accessible for analysis through SQL or SQL-like interfaces. Differences between good and mediocre optimizers have always been known to be substantial [15]. However, the increased amount of data these systems have to process magnifies optimization mistakes and stresses the importance of query optimization more than ever.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission.
Copyright 2014 ACM 978-1-4503-2376-5/14/06 ...$15.00.
http://dx.doi.org/10.1145/2588555.2595637.
Despite a plethora of research in this area, most existing query optimizers in both commercial and open source projects are still primarily based on technology dating back to the early days of commercial database development [22], and are frequently prone to produce suboptimal results.

Realizing this significant gap between research and practical implementations, we have set out to devise an architecture that meets current requirements, yet promises enough headroom for future developments.
In this paper, we describe Orca, the result of our recent research and development efforts at Greenplum/Pivotal. Orca is a state-of-the-art query optimizer specifically designed for demanding analytics workloads. It is distinguished from other optimizers in several important ways:
Modularity. Using a highly extensible abstraction of metadata and system description, Orca is no longer confined to a specific host system like traditional optimizers. Instead it can be ported to other data management systems quickly through plug-ins supported by its Metadata Provider SDK.

Extensibility. By representing all elements of a query and its optimization as first-class citizens of equal footing, Orca avoids the trap of multi-phase optimization where certain optimizations are dealt with as an afterthought. Multi-phase optimizers are notoriously difficult to extend as new optimizations or query constructs often do not match the previously set phase boundaries.

Multi-core ready. Orca deploys a highly efficient multi-core aware scheduler that distributes individual fine-grained optimization subtasks across multiple cores for speed-up of the optimization process.

Verifiability. Orca has special provisions for ascertaining correctness and performance on the level of built-in mechanisms. Besides improving engineering practices, these tools enable rapid development with high confidence and lead to reduced turnaround time for both new features as well as bug fixes.

Performance. Orca is a substantial improvement over our previous system and in many cases offers query speed-up of 10x up to 1000x.
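The multi-core scheduling idea above can be illustrated with a toy sketch: independent, fine-grained optimization subtasks handed to a worker pool. The task decomposition and cost function below are invented for illustration; the paper describes Orca's scheduler only at a high level, and a Python thread pool merely stands in for it.

```python
from concurrent.futures import ThreadPoolExecutor

def explore_group(group_id):
    # Hypothetical subtask: "explore" one memo group and return a dummy
    # cost. In a real optimizer this would enumerate alternative plans.
    return group_id, group_id * group_id + 1  # invented cost function

def optimize(groups, workers=4):
    # Dispatch one subtask per group across the worker pool and collect
    # the per-group results into a dictionary.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(explore_group, groups))

costs = optimize(range(6))
print(costs)  # {0: 1, 1: 2, 2: 5, 3: 10, 4: 17, 5: 26}
```

The point of the sketch is only that subtasks are small and independent, so adding workers speeds up the search without changing its result.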
We describe the architecture of Orca and highlight some of the advanced features enabled by its design. We provide a blueprint of various components and detail the engineering practices we have pioneered and deployed to realize this project. Lastly, we give performance results based on the TPC-DS benchmark comparing Orca to other systems. In particular, we focus on query processing systems contributed to the open source space.

SIGMOD'14, June 22–27, 2014, Snowbird, UT, USA.

Figure 1: High level GPDB architecture
The remainder of this paper is organized as follows. We give preliminaries on the computing architecture in Section 2. In Section 3, we present the architecture of Orca and describe its components. Section 4 presents the query optimization workflow. Section 5 describes how Orca exchanges metadata with the backend database system. We describe in Section 6 the tools we built to maintain a verifiable query optimizer. Section 7 presents our experimental study, followed by a discussion of related work in Section 8. We summarize this paper with final remarks in Section 9.
2. PRELIMINARIES

We give preliminaries on massively parallel processing databases (Section 2.1) and Hadoop query engines (Section 2.2).
2.1 Massively Parallel Processing

Pivotal's Greenplum Database (GPDB) [20] is a massively parallel processing (MPP) analytics database. GPDB adopts a shared-nothing computing architecture with two or more cooperating processors. Each processor has its own memory, operating system, and disks. GPDB leverages this high-performance system architecture to distribute the load of petabyte data warehouses and to use system resources in parallel to process a given query.
Figure 1 shows a high level architecture of GPDB. Storage and processing of large amounts of data are handled by distributing the load across several servers or hosts to create an array of individual databases, all working together to present a single database image. The master is the entry point to GPDB, where clients connect and submit SQL statements. The master coordinates work with other database instances, called segments, to handle data processing and storage. When a query is submitted to the master, it is optimized and broken into smaller components dispatched to segments to work together on delivering the final results. The interconnect is the networking layer responsible for inter-process communication between the segments. The interconnect uses a standard Gigabit Ethernet switching fabric.
During query execution, data can be distributed to segments in multiple ways, including hashed distribution, where tuples are distributed to segments based on some hash function; replicated distribution, where a full copy of a table is stored at each segment; and singleton distribution, where the whole distributed table is gathered from multiple segments to a single host (usually the master).
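The three distribution strategies can be sketched with plain Python dictionaries standing in for segments. The segment count, routing logic, and function names below are arbitrary choices for the example, not GPDB internals.

```python
# Illustrative sketch of hashed, replicated, and singleton distribution.
NUM_SEGMENTS = 4  # arbitrary segment count for the example

def hashed_distribution(tuples, key):
    """Route each tuple to one segment based on a hash of its key column."""
    segments = {i: [] for i in range(NUM_SEGMENTS)}
    for t in tuples:
        segments[hash(t[key]) % NUM_SEGMENTS].append(t)
    return segments

def replicated_distribution(tuples):
    """Store a full copy of the table on every segment."""
    return {i: list(tuples) for i in range(NUM_SEGMENTS)}

def singleton_distribution(segments):
    """Gather a distributed table from all segments onto a single host."""
    gathered = []
    for part in segments.values():
        gathered.extend(part)
    return gathered

rows = [(1, "order-a"), (2, "order-b"), (5, "order-c")]
segments = hashed_distribution(rows, 0)
print(segments[1])  # keys 1 and 5 both land on segment 1: [(1, 'order-a'), (5, 'order-c')]
```

Note that gathering a hashed table back with `singleton_distribution` recovers exactly the original rows, which is what happens when results are collected at the master.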
2.2 SQL on Hadoop

Processing analytics queries on Hadoop is becoming increasingly popular. Initially, queries were expressed as MapReduce jobs, and Hadoop's appeal was attributed to its scalability and fault-tolerance. Coding, manually optimizing, and maintaining complex queries in MapReduce is hard, however, so SQL-like declarative languages such as Hive [28] were developed on top of Hadoop. HiveQL queries are compiled into MapReduce jobs and executed by Hadoop. HiveQL accelerated the coding of complex queries but also made apparent that an optimizer is needed in the Hadoop ecosystem, since the compiled MapReduce jobs exhibited poor performance.
Pivotal responded to the challenge by introducing HAWQ [21], a massively parallel SQL-compliant engine on top of HDFS. HAWQ employs Orca at its core to devise efficient query plans minimizing the cost of accessing data in Hadoop clusters. The architecture of HAWQ combines an innovative state-of-the-art cost-based optimizer with the scalability and fault-tolerance of Hadoop to enable interactive processing of data at petabyte scale.
Recently, a number of other efforts, including Cloudera's Impala [17] and Facebook's Presto [7], introduced new optimizers to enable SQL processing on Hadoop. Currently, these efforts support only a subset of the SQL standard features, and their optimizations are restricted to rule-based techniques. In comparison, HAWQ has a full-fledged standard-compliant SQL interface and a cost-based optimizer, both of which are unprecedented features in Hadoop query engines. We illustrate in our experimental study in Section 7 the key role that Orca plays in differentiating HAWQ from other Hadoop SQL engines on both the functional and performance sides.
3. ORCA ARCHITECTURE

Orca is the new query optimizer for Pivotal data management products, including GPDB and HAWQ. Orca is a modern top-down query optimizer based on the Cascades optimization framework [13]. While many Cascades optimizers are tightly coupled with their host systems, a unique feature of Orca is its ability to run outside the database system as a stand-alone optimizer. This ability is crucial to supporting products with different computing architectures (e.g., MPP and Hadoop) using one optimizer. It also allows leveraging the extensive legacy of relational optimization in new query processing paradigms like Hadoop [7, 10, 16, 17]. Furthermore, running the optimizer as a stand-alone product enables elaborate testing without going through the monolithic structure of a database system.
DXL. Decoupling the optimizer from the database system requires building a communication mechanism to process queries. Orca includes a framework for exchanging information between the optimizer and the database system called Data eXchange Language (DXL). The framework uses an XML-based language to encode the necessary information
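To make the idea of an XML-based exchange concrete, the sketch below builds a small query message in the spirit of DXL. The element and attribute names (`Query`, `TableScan`, `OutputColumns`, `Column`) are invented for illustration; the real DXL schema is defined by Orca and is not shown in this excerpt.

```python
import xml.etree.ElementTree as ET

def build_query_message(table, columns):
    # Encode a hypothetical "scan this table, project these columns"
    # request as XML, the way a host system might hand a parsed query
    # to a decoupled optimizer. All tag names here are made up.
    root = ET.Element("Query")
    scan = ET.SubElement(root, "TableScan", {"table": table})
    output = ET.SubElement(scan, "OutputColumns")
    for name in columns:
        ET.SubElement(output, "Column", {"name": name})
    return ET.tostring(root, encoding="unicode")

msg = build_query_message("orders", ["id", "amount"])
print(msg)
```

The payoff of such an encoding is that either side can be replaced independently: any system that can emit and consume the agreed XML can talk to the optimizer, which is what makes the stand-alone deployment described above possible.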