Greenplum Orca Query Optimizer Explained

The Orca query optimizer is a new optimizer developed by Pivotal to improve the performance of big-data analytical queries. Orca adopts a modular architecture that combines state-of-the-art query optimization techniques with original research, resulting in an efficient, extensible, and portable optimizer.

Why the Query Optimizer Matters
In a data management system, the query optimizer plays a critical role: its quality directly affects query execution speed and resource utilization. With the arrival of the big-data era, this role has become ever more important. An effective optimizer lets an organization analyze and process large volumes of data quickly and turn them into business value.

The Orca Architecture
Orca is built on a modular design. Unlike traditional optimizers that are embedded in a specific host system, Orca runs as a stand-alone optimizer: the host database parses a SQL statement into a query tree, Orca generates an execution plan from that tree, and the host system executes the plan. This separation gives Orca a high degree of flexibility and extensibility.

Key Characteristics of Orca
Orca has the following characteristics:
* Modular architecture: the modular design makes Orca easy to maintain and extend.
* Efficient optimization: state-of-the-art optimization techniques allow Orca to produce efficient execution plans.
* Portability: Orca can be ported to different data management systems, including Greenplum and HAWQ.
* Extensibility: the modular architecture allows Orca to grow with ever-increasing big-data workloads.

Orca's Impact on Greenplum
Orca has had a far-reaching impact on the Greenplum database system. It raises Greenplum's query performance to meet the demands of big-data analytics, and it allows Greenplum to be applied in a wider range of scenarios.

Conclusion
Orca is Pivotal's new query optimizer for big-data analytics. Its modular architecture unites state-of-the-art optimization techniques with original research to deliver an efficient, extensible, and portable optimizer, significantly improving Greenplum's query performance and scalability.

Orca: A Modular Query Optimizer Architecture for Big Data

Mohamed A. Soliman*, Lyublena Antova*, Venkatesh Raghavan*, Amr El-Helw*, Zhongxian Gu*, Entong Shen*, George C. Caragea*, Carlos Garcia-Alvarado*, Foyzur Rahman*, Michalis Petropoulos*, Florian Waas‡, Sivaramakrishnan Narayanan§, Konstantinos Krikellas†, Rhonda Baldwin*

*Pivotal Inc., Palo Alto, USA; ‡Datometry Inc., San Francisco, USA; †Google Inc., Mountain View, USA; §Qubole Inc., Mountain View, USA
ABSTRACT

The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.

In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with our own original research, resulting in a modular and portable optimizer architecture. In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Query processing; Distributed databases

Keywords
Query Optimization, Cost Model, MPP, Parallel Processing
1. INTRODUCTION

Big Data has brought about a renewed interest in query optimization as a new breed of data management systems has pushed the envelope in terms of unprecedented scalability, availability, and processing capabilities (cf. e.g., [9, 18, 20, 21]), which makes large datasets of hundreds of terabytes or even petabytes readily accessible for analysis through SQL or SQL-like interfaces. Differences between good and mediocre optimizers have always been known to be substantial [15]. However, the increased amount of data these systems have to process magnifies optimization mistakes and stresses the importance of query optimization more than ever.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission.
Copyright 2014 ACM 978-1-4503-2376-5/14/06 ...$15.00.
http://dx.doi.org/10.1145/2588555.2595637.
Despite a plethora of research in this area, most existing query optimizers in both commercial and open source projects are still primarily based on technology dating back to the early days of commercial database development [22], and are frequently prone to produce suboptimal results.

Realizing this significant gap between research and practical implementations, we have set out to devise an architecture that meets current requirements, yet promises enough headroom for future developments.
In this paper, we describe Orca, the result of our recent research and development efforts at Greenplum/Pivotal. Orca is a state-of-the-art query optimizer specifically designed for demanding analytics workloads. It is distinguished from other optimizers in several important ways:
Modularity. Using a highly extensible abstraction of metadata and system description, Orca is no longer confined to a specific host system like traditional optimizers. Instead it can be ported to other data management systems quickly through plug-ins supported by its Metadata Provider SDK.

Extensibility. By representing all elements of a query and its optimization as first-class citizens of equal footing, Orca avoids the trap of multi-phase optimization where certain optimizations are dealt with as an afterthought. Multi-phase optimizers are notoriously difficult to extend as new optimizations or query constructs often do not match the previously set phase boundaries.

Multi-core ready. Orca deploys a highly efficient multi-core aware scheduler that distributes individual fine-grained optimization subtasks across multiple cores for speed-up of the optimization process.

Verifiability. Orca has special provisions for ascertaining correctness and performance on the level of built-in mechanisms. Besides improving engineering practices, these tools enable rapid development with high confidence and lead to reduced turnaround time for both new features as well as bug fixes.

Performance. Orca is a substantial improvement over our previous system and in many cases offers query speed-up of 10x up to 1000x.
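The multi-core scheduling idea above can be illustrated with a toy sketch: independent, fine-grained optimization subtasks handed to a worker pool. The task decomposition and cost function below are invented for illustration; the paper describes Orca's scheduler only at a high level, and a Python thread pool merely stands in for it.

```python
from concurrent.futures import ThreadPoolExecutor

def explore_group(group_id):
    # Hypothetical subtask: "explore" one memo group and return a dummy
    # cost. In a real optimizer this would enumerate alternative plans.
    return group_id, group_id * group_id + 1  # invented cost function

def optimize(groups, workers=4):
    # Dispatch one subtask per group across the worker pool and collect
    # the per-group results into a dictionary.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(explore_group, groups))

costs = optimize(range(6))
print(costs)  # {0: 1, 1: 2, 2: 5, 3: 10, 4: 17, 5: 26}
```

The point of the sketch is only that subtasks are small and independent, so adding workers speeds up the search without changing its result.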
We describe the architecture of Orca and highlight some of the advanced features enabled by its design. We provide a blueprint of various components and detail the engineering practices we have pioneered and deployed to realize this project. Lastly, we give performance results based on the TPC-DS benchmark comparing Orca to other systems. In particular, we focus on query processing systems contributed to the open source space.

SIGMOD'14, June 22–27, 2014, Snowbird, UT, USA.

Figure 1: High level GPDB architecture
The remainder of this paper is organized as follows. We give preliminaries on the computing architecture in Section 2. In Section 3, we present the architecture of Orca and describe its components. Section 4 presents the query optimization workflow. Section 5 describes how Orca exchanges metadata with the backend database system. We describe in Section 6 the tools we built to maintain a verifiable query optimizer. Section 7 presents our experimental study, followed by a discussion of related work in Section 8. We summarize this paper with final remarks in Section 9.
2. PRELIMINARIES

We give preliminaries on massively parallel processing databases (Section 2.1) and Hadoop query engines (Section 2.2).
2.1 Massively Parallel Processing

Pivotal's Greenplum Database (GPDB) [20] is a massively parallel processing (MPP) analytics database. GPDB adopts a shared-nothing computing architecture with two or more cooperating processors. Each processor has its own memory, operating system, and disks. GPDB leverages this high-performance system architecture to distribute the load of petabyte data warehouses and to use system resources in parallel to process a given query.
Figure 1 shows a high level architecture of GPDB. Storage and processing of large amounts of data are handled by distributing the load across several servers or hosts to create an array of individual databases, all working together to present a single database image. The master is the entry point to GPDB, where clients connect and submit SQL statements. The master coordinates work with other database instances, called segments, to handle data processing and storage. When a query is submitted to the master, it is optimized and broken into smaller components dispatched to segments to work together on delivering the final results. The interconnect is the networking layer responsible for inter-process communication between the segments. The interconnect uses a standard Gigabit Ethernet switching fabric.
During query execution, data can be distributed to segments in multiple ways, including hashed distribution, where tuples are distributed to segments based on some hash function; replicated distribution, where a full copy of a table is stored at each segment; and singleton distribution, where the whole distributed table is gathered from multiple segments to a single host (usually the master).
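The three distribution strategies can be sketched with plain Python dictionaries standing in for segments. The segment count, routing logic, and function names below are arbitrary choices for the example, not GPDB internals.

```python
# Illustrative sketch of hashed, replicated, and singleton distribution.
NUM_SEGMENTS = 4  # arbitrary segment count for the example

def hashed_distribution(tuples, key):
    """Route each tuple to one segment based on a hash of its key column."""
    segments = {i: [] for i in range(NUM_SEGMENTS)}
    for t in tuples:
        segments[hash(t[key]) % NUM_SEGMENTS].append(t)
    return segments

def replicated_distribution(tuples):
    """Store a full copy of the table on every segment."""
    return {i: list(tuples) for i in range(NUM_SEGMENTS)}

def singleton_distribution(segments):
    """Gather a distributed table from all segments onto a single host."""
    gathered = []
    for part in segments.values():
        gathered.extend(part)
    return gathered

rows = [(1, "order-a"), (2, "order-b"), (5, "order-c")]
segments = hashed_distribution(rows, 0)
print(segments[1])  # keys 1 and 5 both land on segment 1: [(1, 'order-a'), (5, 'order-c')]
```

Note that gathering a hashed table back with `singleton_distribution` recovers exactly the original rows, which is what happens when results are collected at the master.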
2.2 SQL on Hadoop

Processing analytics queries on Hadoop is becoming increasingly popular. Initially, queries were expressed as MapReduce jobs, and Hadoop's appeal was attributed to its scalability and fault-tolerance. Coding, manually optimizing, and maintaining complex queries in MapReduce is hard, however, so SQL-like declarative languages such as Hive [28] were developed on top of Hadoop. HiveQL queries are compiled into MapReduce jobs and executed by Hadoop. HiveQL accelerated the coding of complex queries but also made apparent that an optimizer is needed in the Hadoop ecosystem, since the compiled MapReduce jobs exhibited poor performance.
Pivotal responded to the challenge by introducing HAWQ [21], a massively parallel SQL-compliant engine on top of HDFS. HAWQ employs Orca at its core to devise efficient query plans minimizing the cost of accessing data in Hadoop clusters. The architecture of HAWQ combines an innovative state-of-the-art cost-based optimizer with the scalability and fault-tolerance of Hadoop to enable interactive processing of data at petabyte scale.
Recently, a number of other efforts, including Cloudera's Impala [17] and Facebook's Presto [7], introduced new optimizers to enable SQL processing on Hadoop. Currently, these efforts support only a subset of the SQL standard features, and their optimizations are restricted to rule-based techniques. In comparison, HAWQ has a full-fledged standard-compliant SQL interface and a cost-based optimizer, both of which are unprecedented features in Hadoop query engines. We illustrate in our experimental study in Section 7 the key role that Orca plays in differentiating HAWQ from other Hadoop SQL engines on both the functional and performance sides.
3. ORCA ARCHITECTURE

Orca is the new query optimizer for Pivotal data management products, including GPDB and HAWQ. Orca is a modern top-down query optimizer based on the Cascades optimization framework [13]. While many Cascades optimizers are tightly coupled with their host systems, a unique feature of Orca is its ability to run outside the database system as a stand-alone optimizer. This ability is crucial to supporting products with different computing architectures (e.g., MPP and Hadoop) using one optimizer. It also allows leveraging the extensive legacy of relational optimization in new query processing paradigms like Hadoop [7, 10, 16, 17]. Furthermore, running the optimizer as a stand-alone product enables elaborate testing without going through the monolithic structure of a database system.
DXL. Decoupling the optimizer from the database system requires building a communication mechanism to process queries. Orca includes a framework for exchanging information between the optimizer and the database system called Data eXchange Language (DXL). The framework uses an XML-based language to encode the necessary information
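To make the idea of an XML-based exchange concrete, the sketch below builds a small query message in the spirit of DXL. The element and attribute names (`Query`, `TableScan`, `OutputColumns`, `Column`) are invented for illustration; the real DXL schema is defined by Orca and is not shown in this excerpt.

```python
import xml.etree.ElementTree as ET

def build_query_message(table, columns):
    # Encode a hypothetical "scan this table, project these columns"
    # request as XML, the way a host system might hand a parsed query
    # to a decoupled optimizer. All tag names here are made up.
    root = ET.Element("Query")
    scan = ET.SubElement(root, "TableScan", {"table": table})
    output = ET.SubElement(scan, "OutputColumns")
    for name in columns:
        ET.SubElement(output, "Column", {"name": name})
    return ET.tostring(root, encoding="unicode")

msg = build_query_message("orders", ["id", "amount"])
print(msg)
```

The payoff of such an encoding is that either side can be replaced independently: any system that can emit and consume the agreed XML can talk to the optimizer, which is what makes the stand-alone deployment described above possible.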