Hadoop Application Architectures
Mark Grover, Ted Malaska,
Jonathan Seidman & Gwen Shapira
Hadoop Application Architectures
by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira
Copyright © 2015 Jonathan Seidman, Gwen Shapira, Ted Malaska, and Mark Grover. All
rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safaribooksonline.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.
Editors: Ann Spencer and Brian Anderson
Production Editor: Nicole Shelby
Copyeditor: Rachel Monaghan
Proofreader: Elise Morrison
Indexer: Ellen Troutman
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest
July 2015: First Edition
Revision History for the First Edition
2015-06-26: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491900086 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop Application
Architectures, the cover image, and related trade dress are trademarks of O’Reilly Media,
Inc.
While the publisher and the authors have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and the
authors disclaim all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work. Use of the
information and instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that
your use thereof complies with such licenses and/or rights.
978-1-491-90008-6
[LSI]
Foreword
Apache Hadoop has blossomed over the past decade.
It started in Nutch as a promising capability — the ability to scalably process petabytes. In
2005 it hadn’t been run on more than a few dozen machines, and had many rough edges. It
was only used by a few folks for experiments. Yet a few saw promise there, that an
affordable, scalable, general-purpose data storage and processing framework might have
broad utility.
By 2007 scalability had been proven at Yahoo!. Hadoop now ran reliably on thousands of
machines. It began to be used in production applications, first at Yahoo! and then at other
Internet companies, like Facebook, LinkedIn, and Twitter. But while it enabled scalable
processing of petabytes, the price of adoption was high, with no security and only a Java
batch API.
Since then Hadoop’s become the kernel of a complex ecosystem. Its gained fine-grained
security controls, high availability (HA), and a general-purpose scheduler (YARN).
A wide variety of tools have now been built around this kernel. Some, like HBase and
Accumulo, provide online key-value stores that can back interactive applications. Others, like
Flume, Sqoop, and Apache Kafka, help route data in and out of Hadoop’s storage.
Improved processing APIs are available through Pig, Crunch, and Cascading. SQL queries
can be processed with Apache Hive and Cloudera Impala. Apache Spark is a superstar,
providing an improved and optimized batch API while also incorporating real-time stream
processing, graph processing, and machine learning. Apache Oozie and Azkaban
orchestrate and schedule many of the above.
Confused yet? This menagerie of tools can be overwhelming. Yet, to make effective use of
this new platform, you need to understand how these tools all fit together and which can
help you. The authors of this book have years of experience building Hadoop-based
systems and can now share with you the wisdom they’ve gained.
In theory there are billions of ways to connect and configure these tools for your use. But
in practice, successful patterns emerge. This book describes best practices, where each
tool shines, and how best to use it for a particular task. It also presents common use cases.
At first users improvised, trying many combinations of tools, but this book describes the
patterns that have proven successful again and again, sparing you much of the exploration.
These authors give you the fundamental knowledge you need to begin using this powerful
new platform. Enjoy the book, and use it to help you build great Hadoop applications.
Doug Cutting
Shed in the Yard, California
Preface
It’s probably not an exaggeration to say that Apache Hadoop has revolutionized data
management and processing. Hadoop’s technical capabilities have made it possible for
organizations across a range of industries to solve problems that were previously
impractical with existing technologies. These capabilities include:
Scalable processing of massive amounts of data
Flexibility for data processing, regardless of the format and structure (or lack of
structure) in the data
Another notable feature of Hadoop is that it’s an open source project designed to run on
relatively inexpensive commodity hardware. Hadoop provides these capabilities at
considerable cost savings over traditional data management solutions.
This combination of technical capabilities and economics has led to rapid growth in
Hadoop and tools in the surrounding ecosystem. The vibrancy of the Hadoop community
has led to the introduction of a broad range of tools to support management and processing
of data with Hadoop.
Despite this rapid growth, Hadoop is still a relatively young technology. Many
organizations are still trying to understand how Hadoop can be leveraged to solve
problems, and how to apply Hadoop and associated tools to implement solutions to these
problems. A rich ecosystem of tools, application programming interfaces (APIs), and
development options provide choice and flexibility, but can make it challenging to
determine the best choices to implement a data processing application.
The inspiration for this book comes from our experience working with numerous
customers and conversations with Hadoop users who are trying to understand how to build
reliable and scalable applications with Hadoop. Our goal is not to provide detailed
documentation on using available tools, but rather to provide guidance on how to combine
these tools to architect scalable and maintainable applications on Hadoop.
We assume readers of this book have some experience with Hadoop and related tools. You
should have a familiarity with the core components of Hadoop, such as the Hadoop
Distributed File System (HDFS) and MapReduce. If you need to come up to speed on
Hadoop, or need refreshers on core Hadoop concepts, Hadoop: The Definitive Guide by
Tom White remains, well, the definitive guide.
The following is a list of other tools and technologies that are important to understand in
using this book, including references for further reading:
YARN
Until recently, the core of Hadoop was commonly considered to be HDFS and
MapReduce. This has been changing rapidly with the introduction of additional
processing frameworks for Hadoop, and the introduction of YARN accelerates the
move toward Hadoop as a big-data platform supporting multiple parallel processing
models. YARN provides a general-purpose resource manager and scheduler for
Hadoop processing, which includes MapReduce, but also extends these services to
other processing models. This facilitates the support of multiple processing
frameworks and diverse workloads on a single Hadoop cluster, and allows these
different models and workloads to effectively share resources. For more on YARN,
see Hadoop: The Definitive Guide, or the Apache YARN documentation.
Java
Hadoop and many of its associated tools are built with Java, and much application
development with Hadoop is done with Java. Although the introduction of new tools
and abstractions increasingly opens up Hadoop development to non-Java developers,
having an understanding of Java is still important when you are working with
Hadoop.
SQL
Although Hadoop opens up data to a number of processing frameworks, SQL
remains very much alive and well as an interface to query data in Hadoop. This is
understandable since a number of developers and analysts understand SQL, so
knowing how to write SQL queries remains relevant when you’re working with
Hadoop. A good introduction to SQL is Head First SQL by Lynn Beighley
(O’Reilly).
Scala
Scala is a programming language that runs on the Java virtual machine (JVM) and
supports a mixed object-oriented and functional programming model. Although
designed for general-purpose programming, Scala is becoming increasingly prevalent
in the big-data world, both for implementing projects that interact with Hadoop and
for implementing applications to process data. Examples of projects that use Scala as
the basis for their implementation are Apache Spark and Apache Kafka. Scala, not
surprisingly, is also one of the languages supported for implementing applications
with Spark. Scala is used for many of the examples in this book, so if you need an
introduction to Scala, see Scala for the Impatient by Cay S. Horstmann (Addison-
Wesley Professional) or for a more in-depth overview see Programming Scala, 2nd
Edition, by Dean Wampler and Alex Payne (O’Reilly).
Apache Hive
Speaking of SQL, Hive, a popular abstraction for modeling and processing data on
Hadoop, provides a way to define structure on data stored in HDFS, as well as write
SQL-like queries against this data. The Hive project also provides a metadata store,
which in addition to storing metadata (i.e., data about data) on Hive structures is also
accessible to other interfaces such as Apache Pig (a high-level parallel programming
abstraction) and MapReduce via the HCatalog component. Further, other open source
projects — such as Cloudera Impala, a low-latency query engine for Hadoop — also
leverage the Hive metastore, which provides access to objects defined through Hive.
To learn more about Hive, see the Hive website, Hadoop: The Definitive Guide, or
Programming Hive by Edward Capriolo, et al. (O’Reilly).
Apache HBase
HBase is another frequently used component in the Hadoop ecosystem. HBase is a
distributed NoSQL data store that provides random access to extremely large
volumes of data stored in HDFS. Although referred to as the Hadoop database,
HBase is very different from a relational database, and requires those familiar with
traditional database systems to embrace new concepts. HBase is a core component in
many Hadoop architectures, and is referred to throughout this book. To learn more
about HBase, see the HBase website, HBase: The Definitive Guide by Lars George
(O’Reilly), or HBase in Action by Nick Dimiduk and Amandeep Khurana (Manning).
Apache Flume
Flume is an often-used component for ingesting event-based data, such as logs, into
Hadoop. We provide an overview and details on best practices and architectures for
leveraging Flume with Hadoop, but for more details on Flume refer to the Flume
documentation or Using Flume (O’Reilly).
Apache Sqoop
Sqoop is another popular tool in the Hadoop ecosystem that facilitates moving data
between external data stores such as a relational database and Hadoop. We discuss
best practices for Sqoop and where it fits in a Hadoop architecture, but for more
details on Sqoop see the Sqoop documentation or the Apache Sqoop Cookbook
(O’Reilly).
Apache ZooKeeper
The aptly named ZooKeeper project is designed to provide a centralized service to
facilitate coordination for the zoo of projects in the Hadoop ecosystem. A number of
the components that we discuss in this book, such as HBase, rely on the services
provided by ZooKeeper, so it’s good to have a basic understanding of it. Refer to the
ZooKeeper site or ZooKeeper by Flavio Junqueira and Benjamin Reed (O’Reilly).
As you may have noticed, the emphasis in this book is on tools in the open source Hadoop
ecosystem. It’s important to note, though, that many of the traditional enterprise software
vendors have added support for Hadoop, or are in the process of adding this support. If
your organization is already using one or more of these enterprise tools, it makes a great
deal of sense to investigate integrating these tools as part of your application development
efforts on Hadoop. The best tool for a task is often the tool you already know. Although
it’s valuable to understand the tools we discuss in this book and how they’re integrated to
implement applications on Hadoop, choosing to leverage third-party tools in your
environment is a completely valid choice.
Again, our aim for this book is not to go into details on how to use these tools, but rather,
to explain when and why to use them, and to balance known best practices with
recommendations on when these practices apply and how to adapt in cases when they
don’t. We hope you’ll find this book useful in implementing successful big data solutions
with Hadoop.
A Note About the Code Examples
Before we move on, a brief note about the code examples in this book. Every effort has
been made to ensure the examples in the book are up-to-date and correct. For the most
current versions of the code examples, please refer to the book’s GitHub repository at
https://github.com/hadooparchitecturebook/hadoop-arch-book.
Who Should Read This Book
Hadoop Application Architectures was written for software developers, architects, and
project leads who need to understand how to use Apache Hadoop and tools in the Hadoop
ecosystem to build end-to-end data management solutions or integrate Hadoop into
existing data management architectures. Our intent is not to provide deep dives into
specific technologies — for example, MapReduce — as other references do. Instead, our
intent is to provide you with an understanding of how components in the Hadoop
ecosystem are effectively integrated to implement a complete data pipeline, starting from
source data all the way to data consumption, as well as how Hadoop can be integrated into
existing data management systems.
We assume you have some knowledge of Hadoop and related tools such as Flume, Sqoop,
HBase, Pig, and Hive, but we’ll refer to appropriate references for those who need a
refresher. We also assume you have experience programming with Java, as well as
experience with SQL and traditional data-management systems, such as relational
database-management systems.
So if you’re a technologist who’s spent some time with Hadoop, and are now looking for
best practices and examples for architecting and implementing complete solutions with it,
then this book is meant for you. Even if you’re a Hadoop expert, we think the guidance
and best practices in this book, based on our years of experience working with Hadoop,
will provide value.
This book can also be used by managers who want to understand which technologies will
be relevant to their organization based on their goals and projects, in order to help select
appropriate training for developers.
Why We Wrote This Book
We have all spent years implementing solutions with Hadoop, both as users and
supporting customers. In that time, the Hadoop market has matured rapidly, along with the
number of resources available for understanding Hadoop. There are now a large number of
useful books, websites, classes, and more available on Hadoop and tools in the Hadoop
ecosystem. However, despite all of these materials, there's still a shortage of
resources available for understanding how to effectively integrate these tools into
complete solutions.
When we talk with users, whether they’re customers, partners, or conference attendees,
we’ve found a common theme: there’s still a gap between understanding Hadoop and
being able to actually leverage it to solve problems. For example, there are a number of
good references that will help you understand Apache Flume, but how do you actually
determine if it’s a good fit for your use case? And once you’ve selected Flume as a
solution, how do you effectively integrate it into your architecture? What best practices
and considerations should you be aware of to optimally use Flume?
This book is intended to bridge this gap between understanding Hadoop and being able to
actually use it to build solutions. We’ll cover core considerations for implementing
solutions with Hadoop, and then provide complete, end-to-end examples of implementing
some common use cases with Hadoop.
Navigating This Book
The organization of chapters in this book is intended to follow the same flow that you
would follow when architecting a solution on Hadoop, starting with modeling data on
Hadoop, moving data into and out of Hadoop, processing the data once it’s in Hadoop, and
so on. Of course, you can always skip around as needed. Part I covers the considerations
around architecting applications with Hadoop, and includes the following chapters:
Chapter 1 covers considerations around storing and modeling data in Hadoop — for
example, file formats, data organization, and metadata management.
Chapter 2 covers moving data into and out of Hadoop. We’ll discuss considerations and
patterns for data ingest and extraction, including using common tools such as Flume,
Sqoop, and file transfers.
Chapter 3 covers tools and patterns for accessing and processing data in Hadoop. We’ll
talk about available processing frameworks such as MapReduce, Spark, Hive, and
Impala, and considerations for determining which to use for particular use cases.
Chapter 4 will expand on the discussion of processing frameworks by describing the
implementation of some common use cases on Hadoop. We’ll use examples in Spark
and SQL to illustrate how to solve common problems such as de-duplication and
working with time series data.
Chapter 5 discusses tools to do large graph processing on Hadoop, such as Giraph and
GraphX.
Chapter 6 discusses tying everything together with application orchestration and
scheduling tools such as Apache Oozie.
Chapter 7 discusses near-real-time processing on Hadoop. We discuss the relatively
new class of tools intended to process streams of data, such as Apache Storm and
Apache Spark Streaming.
In Part II, we cover the end-to-end implementations of some common applications with
Hadoop. The purpose of these chapters is to provide concrete examples of how to use the
components discussed in Part I to implement complete solutions with Hadoop:
Chapter 8 provides an example of clickstream analysis with Hadoop. Storage and
processing of clickstream data is a very common use case for companies running large
websites, but is also applicable to applications processing any type of machine data.
We’ll discuss ingesting data through tools like Flume and Kafka, cover storing and
organizing the data efficiently, and show examples of processing the data.
Chapter 9 will provide a case study of a fraud detection application on Hadoop, an
increasingly common use of Hadoop. This example will cover how HBase can be
leveraged in a fraud detection solution, as well as the use of near-real-time processing.
Chapter 10 provides a case study exploring another very common use case: using
Hadoop to extend an existing enterprise data warehouse (EDW) environment. This
includes using Hadoop as a complement to the EDW, as well as providing functionality
traditionally performed by data warehouses.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined
by context.
NOTE
This icon signifies a tip, suggestion, or general note.
WARNING
This icon indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at
https://github.com/hadooparchitecturebook/hadoop-arch-book.
This book is here to help you get your job done. In general, if example code is offered
with this book, you may use it in your programs and documentation. You do not need to
contact us for permission unless you’re reproducing a significant portion of the code. For
example, writing a program that uses several chunks of code from this book does not
require permission. Selling or distributing a CD-ROM of examples from O’Reilly books
does require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Hadoop Application Architectures by Mark
Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira (O’Reilly). Copyright 2015
Jonathan Seidman, Gwen Shapira, Ted Malaska, and Mark Grover, 978-1-491-90008-6.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
Safari® Books Online
NOTE
Safari Books Online is an on-demand digital library that delivers expert content in both
book and video form from the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative
professionals use Safari Books Online as their primary resource for research, problem
solving, learning, and certification training.
Safari Books Online offers a range of plans and pricing for enterprise, government,
education, and individuals.
Members have access to thousands of books, training videos, and prepublication
manuscripts in one fully searchable database from publishers like O’Reilly Media,
Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que,
Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan
Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders,
McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more
information about Safari Books Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/hadoop_app_arch_1E.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at
http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
We would like to thank the larger Apache community for its work on Hadoop and the
surrounding ecosystem, without which this book wouldn’t exist. We would also like to
thank Doug Cutting for providing this book's foreword, not to mention co-creating
Hadoop.
There are a large number of folks whose support and hard work made this book possible,
starting with Eric Sammer. Eric's early support and encouragement were invaluable in
making this book a reality. Amandeep Khurana, Kathleen Ting, Patrick Angeles, and Joey
Echeverria also provided valuable proposal feedback early on in the project.
Many people provided invaluable feedback and support while writing this book, especially
the following who provided their time and expertise to review content: Azhar Abubacker,
Sean Allen, Ryan Blue, Ed Capriolo, Eric Driscoll, Lars George, Jeff Holoman, Robert
Kanter, James Kinley, Alex Moundalexis, Mac Noland, Sean Owen, Mike Percy, Joe
Prosser, Jairam Ranganathan, Jun Rao, Hari Shreedharan, Jeff Shmain, Ronan Stokes,
Daniel Templeton, Tom Wheeler.
Andre Araujo, Alex Ding, and Michael Ernest generously gave their time to test the code
examples. Akshat Das provided help with diagrams and our website.
Many reviewers helped us out and greatly improved the quality of this book, so any
mistakes left are our own.
We would also like to thank Cloudera management for enabling us to write this book. In
particular, we’d like to thank Mike Olson for his constant encouragement and support
from day one.
We’d like to thank our O’Reilly editor Brian Anderson and our production editor Nicole
Shelby for their help and contributions throughout the project. In addition, we really
appreciate the help from many other folks at O’Reilly and beyond — Ann Spencer,
Courtney Nash, Rebecca Demarest, Rachel Monaghan, and Ben Lorica — at various times
in the development of this book.
Our apologies to those who we may have mistakenly omitted from this list.
Mark Grover’s Acknowledgements
First and foremost, I would like to thank my parents, Neelam and Parnesh Grover. I
dedicate it all to the love and support they continue to shower in my life every single day.
I’d also like to thank my sister, Tracy Grover, who I continue to tease, love, and admire for
always being there for me. Also, I am very thankful to my past and current managers at
Cloudera, Arun Singla and Ashok Seetharaman for their continued support of this project.
Special thanks to Paco Nathan and Ed Capriolo for encouraging me to write a book.
Ted Malaska’s Acknowledgements
I would like to thank my wife, Karen, and TJ and Andrew — my favorite two boogers.
Jonathan Seidman’s Acknowledgements
I’d like to thank the three most important people in my life, Tanya, Ariel, and Madeleine,
for their patience, love, and support during the (very) long process of writing this book.
I’d also like to thank Mark, Gwen, and Ted for being great partners on this journey.
Finally, I’d like to dedicate this book to the memory of my parents, Aaron and Frances
Seidman.
Gwen Shapira’s Acknowledgements
I would like to thank my husband, Omer Shapira, for his emotional support and patience
during the many months I spent writing this book, and my dad, Lior Shapira, for being my
best marketing person and telling all his friends about the “big data book.” Special thanks
to my manager Jarek Jarcec Cecho for his support for the project, and thanks to my team
over the last year for handling what was perhaps more than their fair share of the work.
Part I. Architectural Considerations for
Hadoop Applications
Chapter 1. Data Modeling in Hadoop
At its core, Hadoop is a distributed data store that provides a platform for implementing
powerful parallel processing frameworks. The reliability of this data store when it comes
to storing massive volumes of data, coupled with its flexibility in running multiple
processing frameworks, makes it an ideal choice for your data hub. This characteristic of
Hadoop means that you can store any type of data as is, without placing any constraints on
how that data is processed.
A common term one hears in the context of Hadoop is Schema-on-Read. This simply
refers to the fact that raw, unprocessed data can be loaded into Hadoop, with the structure
imposed at processing time based on the requirements of the processing application.
This is different from Schema-on-Write, which is generally used with traditional data
management systems. Such systems require the schema of the data store to be defined
before the data can be loaded. This leads to lengthy cycles of analysis, data modeling, data
transformation, loading, testing, and so on before data can be accessed. Furthermore, if a
wrong decision is made or requirements change, this cycle must start again. When the
application or structure of data is not as well understood, the agility provided by the
Schema-on-Read pattern can provide invaluable insights on data not previously accessible.
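To make this concrete, here is a minimal sketch, not taken from the book's examples, of Schema-on-Read with a plain MapReduce mapper: raw comma-delimited lines are loaded into HDFS as is, and the structure is imposed only when a job reads them. The assumed field layout of timestamp, user ID, and URL is purely illustrative.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SchemaOnReadMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Text url = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The "schema" lives in the reading code, not in the storage layer;
    // the assumed layout is timestamp,userId,url.
    String[] fields = value.toString().split(",");
    if (fields.length < 3) {
      return; // skip malformed records instead of rejecting them at load time
    }
    url.set(fields[2]);
    context.write(url, ONE);
  }
}

If requirements change, only this parsing code changes; the raw data already in HDFS does not need to be reloaded or transformed in advance.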
Relational databases and data warehouses are often a good fit for well-understood and
frequently accessed queries and reports on high-value data. Increasingly, though, Hadoop
is taking on many of these workloads, particularly for queries that need to operate on
volumes of data that are not economically or technically practical to process with
traditional systems.
Although being able to store all of your raw data is a powerful feature, there are still many
factors that you should take into consideration before dumping your data into Hadoop.
These considerations include:
Data storage formats
There are a number of file formats and compression formats supported on Hadoop.
Each has particular strengths that make it better suited to specific applications.
Additionally, although Hadoop provides the Hadoop Distributed File System (HDFS)
for storing data, there are several commonly used systems implemented on top of
HDFS, such as HBase for additional data access functionality and Hive for additional
data management functionality. Such systems need to be taken into consideration as
well.
Multitenancy
It’s common for clusters to host multiple users, groups, and application types.
Supporting multitenant clusters involves a number of important considerations when
you are planning how data will be stored and managed.
Schema design
Despite the schema-less nature of Hadoop, there are still important considerations to
take into account around the structure of data stored in Hadoop. This includes
directory structures for data loaded into HDFS as well as the output of data
processing and analysis. This also includes the schemas of objects stored in systems
such as HBase and Hive.
Metadata management
As with any data management system, metadata related to the stored data is often as
important as the data itself. Understanding and making decisions related to metadata
management are critical.
We’ll discuss these items in this chapter. Note that these considerations are fundamental to
architecting applications on Hadoop, which is why we’re covering them early in the book.
Another important factor when you’re making storage decisions with Hadoop, but one
that’s beyond the scope of this book, is security and its associated considerations. This
includes decisions around authentication, fine-grained access control, and encryption —
both for data on the wire and data at rest. For a comprehensive discussion of security with
Hadoop, see Hadoop Security by Ben Spivey and Joey Echeverria (O’Reilly).
Data Storage Options
One of the most fundamental decisions to make when you are architecting a solution on
Hadoop is determining how data will be stored in Hadoop. There is no such thing as a
standard data storage format in Hadoop. Just as with a standard filesystem, Hadoop allows
for storage of data in any format, whether it’s text, binary, images, or something else.
Hadoop also provides built-in support for a number of formats optimized for Hadoop
storage and processing. This means users have complete control and a number of options
for how data is stored in Hadoop. This applies to not just the raw data being ingested, but
also intermediate data generated during data processing and derived data that’s the result
of data processing. This, of course, also means that there are a number of decisions
involved in determining how to optimally store your data. Major considerations for
Hadoop data storage include:
File format
There are multiple formats that are suitable for data stored in Hadoop. These include
plain text or Hadoop-specific formats such as SequenceFile. There are also more
complex but more functionally rich options, such as Avro and Parquet. These
different formats have different strengths that make them more or less suitable
depending on the application and source-data types. It’s possible to create your own
custom file format in Hadoop, as well.
Compression
This will usually be a more straightforward task than selecting file formats, but it’s
still an important factor to consider. Compression codecs commonly used with
Hadoop have different characteristics; for example, some codecs compress and
uncompress faster but don’t compress as aggressively, while other codecs create
smaller files but take longer to compress and uncompress, and not surprisingly
require more CPU. The ability to split compressed files is also a very important
consideration when you’re working with data stored in Hadoop — we’ll discuss
splittability considerations further later in the chapter.
Data storage system
While all data in Hadoop rests in HDFS, there are decisions around what the
underlying storage manager should be — for example, whether you should use
HBase or HDFS directly to store the data. Additionally, tools such as Hive and
Impala allow you to define additional structure around your data in Hadoop.
Before beginning a discussion on data storage options for Hadoop, we should note a
couple of things:
We’ll cover different storage options in this chapter, but more in-depth discussions on
best practices for data storage are deferred to later chapters. For example, when we talk
about ingesting data into Hadoop we’ll talk more about considerations for storing that
data.
Although we focus on HDFS as the Hadoop filesystem in this chapter and throughout
the book, we’d be remiss in not mentioning work to enable alternate filesystems with
Hadoop. This includes open source filesystems such as GlusterFS and the Quantcast
File System, and commercial alternatives such as Isilon OneFS and NetApp. Cloud-
based storage systems such as Amazon’s Simple Storage System (S3) are also
becoming common. The filesystem might become yet another architectural
consideration in a Hadoop deployment. This should not, however, have a large impact
on the underlying considerations that we’re discussing here.
Standard File Formats
We’ll start with a discussion on storing standard file formats in Hadoop — for example,
text files (such as comma-separated value [CSV] or XML) or binary file types (such as
images). In general, it’s preferable to use one of the Hadoop-specific container formats
discussed next for storing data in Hadoop, but in many cases you’ll want to store source
data in its raw form. As noted before, one of the most powerful features of Hadoop is the
ability to store all of your data regardless of format. Having online access to data in its
raw, source form — “full fidelity” data — means it will always be possible to perform
new processing and analytics with the data as requirements change. The following
discussion provides some considerations for storing standard file formats in Hadoop.
Text data
A very common use of Hadoop is the storage and analysis of logs such as web logs and
server logs. Such text data, of course, also comes in many other forms: CSV files, or
unstructured data such as emails. A primary consideration when you are storing text data
in Hadoop is the organization of the files in the filesystem, which we’ll discuss more in
the section “HDFS Schema Design”. Additionally, you’ll want to select a compression
format for the files, since text files can very quickly consume considerable space on your
Hadoop cluster. Also, keep in mind that there is an overhead of type conversion associated
with storing data in text format. For example, storing 1234 in a text file and using it as an
integer requires a string-to-integer conversion during reading, and vice versa during
writing. It also takes up more space to store 1234 as text than as an integer. This overhead
adds up when you do many such conversions and store large amounts of data.
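As a rough illustration, here is a tiny sketch, not from the book, of the conversions paid on every read and write when a numeric value is stored as text:

public class TextConversionOverhead {
  public static void main(String[] args) {
    String stored = "1234";               // the value as it sits in a text file

    int n = Integer.parseInt(stored);     // conversion paid on every read
    n += 1;
    String toWrite = Integer.toString(n); // conversion paid on every write

    // The text form grows with the number of digits (plus any field
    // delimiters), while a binary int stays at a fixed 4 bytes.
    System.out.println(toWrite + " takes " + toWrite.length() + " bytes as text");
  }
}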
Selection of compression format will be influenced by how the data will be used. For
archival purposes you may choose the most compact compression available, but if the data
will be used in processing jobs such as MapReduce, you’ll likely want to select a splittable
format. Splittable formats enable Hadoop to split files into chunks for processing, which is
critical to efficient parallel processing. We’ll discuss compression types and
considerations, including the concept of splittability, later in this chapter.
Note also that in many, if not most cases, the use of a container format such as
SequenceFiles or Avro will provide advantages that make it a preferred format for most
file types, including text; among other things, these container formats provide
functionality to support splittable compression. We’ll also be covering these container
formats later in this chapter.
Structured text data
A more specialized form of text files is structured formats such as XML and JSON. These
types of formats can present special challenges with Hadoop since splitting XML and
JSON files for processing is tricky, and Hadoop does not provide a built-in InputFormat
for either. JSON presents even greater challenges than XML, since there are no tokens to
mark the beginning or end of a record. In the case of these formats, you have a couple of
options:
Use a container format such as Avro. Transforming the data into Avro can provide a
compact and efficient way to store and process the data.
Use a library designed for processing XML or JSON files. Examples of this for XML
include XMLLoader in the PiggyBank library for Pig. For JSON, the Elephant Bird
project provides the LzoJsonInputFormat. For more details on processing these
formats, see the book Hadoop in Practice by Alex Holmes (Manning), which provides
several examples for processing XML and JSON files with MapReduce.
Binary data
Although text is typically the most common source data format stored in Hadoop, you can
also use Hadoop to process binary files such as images. For most cases of storing and
processing binary files in Hadoop, using a container format such as SequenceFile is
preferred. If the splittable unit of binary data is larger than 64 MB, you may consider
putting the data in its own file, without using a container format.
Hadoop File Types
There are several Hadoop-specific file formats that were specifically created to work well
with MapReduce. These Hadoop-specific file formats include file-based data structures
such as sequence files, serialization formats like Avro, and columnar formats such as
RCFile and Parquet. These file formats have differing strengths and weaknesses, but all
share the following characteristics that are important for Hadoop applications:
Splittable compression
These formats support common compression formats and are also splittable. We’ll
discuss splittability more in the section “Compression”, but note that the ability to
split files can be a key consideration for storing data in Hadoop because it allows
large files to be split for input to MapReduce and other types of jobs. The ability to
split a file for processing by multiple tasks is of course a fundamental part of parallel
processing, and is also key to leveraging Hadoop’s data locality feature.
Agnostic compression
The file can be compressed with any compression codec, without readers having to
know the codec. This is possible because the codec is stored in the header metadata
of the file format.
We’ll discuss the file-based data structures in this section, and subsequent sections will
cover serialization formats and columnar formats.
File-based data structures
The SequenceFile format is one of the most commonly used file-based formats in Hadoop,
but other file-based formats are available, such as MapFiles, SetFiles, ArrayFiles, and
BloomMapFiles. Because these formats were specifically designed to work with
MapReduce, they offer a high level of integration for all forms of MapReduce jobs,
including those run via Pig and Hive. We’ll cover the SequenceFile format here, because
that’s the format most commonly employed in implementing Hadoop jobs. For a more
complete discussion of the other formats, refer to Hadoop: The Definitive Guide.
SequenceFiles store data as binary key-value pairs. There are three formats available for
records stored within SequenceFiles:
Uncompressed
For the most part, uncompressed SequenceFiles don’t provide any advantages over
their compressed alternatives, since they’re less efficient for input/output (I/O) and
take up more space on disk than the same data in compressed form.
Record-compressed
This format compresses each record as it’s added to the file.
Block-compressed
This format waits until data reaches block size to compress, rather than as each
record is added. Block compression provides better compression ratios compared to
record-compressed SequenceFiles, and is generally the preferred compression option
for SequenceFiles. Also, the reference to block here is unrelated to the HDFS or
filesystem block. A block in block compression refers to a group of records that are
compressed together within a single HDFS block.
Regardless of format, every SequenceFile uses a common header format containing basic
metadata about the file, such as the compression codec used, key and value class names,
user-defined metadata, and a randomly generated sync marker. This sync marker is also
written into the body of the file to allow for seeking to random points in the file, and is
key to facilitating splittability. For example, in the case of block compression, this sync
marker will be written before every block in the file.
SequenceFiles are well supported within the Hadoop ecosystem; however, their support
outside of the ecosystem is limited. They are also only supported in Java. A common use
case for SequenceFiles is as a container for smaller files. Storing a large number of small
files in Hadoop can cause a couple of issues. One is excessive memory use for the
NameNode, because metadata for each file stored in HDFS is held in memory. Another
potential issue is in processing data in these files — many small files can lead to many
processing tasks, causing excessive overhead in processing. Because Hadoop is optimized
for large files, packing smaller files into a SequenceFile makes the storage and processing
of these files much more efficient. For a more complete discussion of the small files
problem with Hadoop and how SequenceFiles provide a solution, refer to Hadoop: The
Definitive Guide.
Figure 1-1 shows an example of the file layout for a SequenceFile using block
compression. An important thing to note in this diagram is the inclusion of the sync
marker before each block of data, which allows readers of the file to seek to block
boundaries.
Figure 1-1. An example of a SequenceFile using block compression
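To illustrate the small files use case described above, the following sketch, which is not from the book, packs the contents of a local staging directory into a single block-compressed SequenceFile, with the filename as the key and the raw bytes as the value. The paths are hypothetical, and the Snappy codec assumes the native Snappy libraries are installed.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    File[] smallFiles = new File("/data/staging/small-files").listFiles();
    if (smallFiles == null) {
      return; // nothing to pack
    }

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/data/packed/small-files.seq")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        // Block compression buffers records and compresses them together,
        // writing a sync marker before each block to keep the file splittable.
        SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))) {
      for (File f : smallFiles) {
        byte[] contents = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(contents));
      }
    }
  }
}

Because the codec and the key and value class names are recorded in the file header, a downstream job can read this file without knowing in advance how it was written.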
Serialization Formats
Serialization refers to the process of turning data structures into byte streams either for
storage or transmission over a network. Conversely, deserialization is the process of
converting a byte stream back into data structures. Serialization is core to a distributed
processing system such as Hadoop, since it allows data to be converted into a format that
can be efficiently stored as well as transferred across a network connection. Serialization
is commonly associated with two aspects of data processing in distributed systems:
interprocess communication (remote procedure calls, or RPC) and data storage. For
purposes of this discussion we’re not concerned with RPC, so we’ll focus on the data
storage aspect in this section.
The main serialization format utilized by Hadoop is Writables. Writables are compact and
fast, but not easy to extend or use from languages other than Java. There are, however,
other serialization frameworks seeing increased use within the Hadoop ecosystem,
including Thrift, Protocol Buffers, and Avro. Of these, Avro is the best suited, because it
was specifically created to address limitations of Hadoop Writables. We’ll examine Avro
in more detail, but let’s first briefly cover Thrift and Protocol Buffers.
Thrift
Thrift was developed at Facebook as a framework for implementing cross-language
interfaces to services. Thrift uses an Interface Definition Language (IDL) to define
interfaces, and uses an IDL file to generate stub code to be used in implementing RPC
clients and servers that can be used across languages. Using Thrift allows us to implement
a single interface that can be used with different languages to access different underlying
systems. The Thrift RPC layer is very robust, but for this chapter, we’re only concerned
with Thrift as a serialization framework. Although sometimes used for data serialization
with Hadoop, Thrift has several drawbacks: it does not support internal compression of
records, it’s not splittable, and it lacks native MapReduce support. Note that there are
externally available libraries such as the Elephant Bird project to address these drawbacks,
but Hadoop does not provide native support for Thrift as a data storage format.
Protocol Buffers
The Protocol Buffer (protobuf) format was developed at Google to facilitate data exchange
between services written in different languages. Like Thrift, protobuf structures are
defined via an IDL, which is used to generate stub code for multiple languages. Also like
Thrift, Protocol Buffers do not support internal compression of records, are not splittable,
and have no native MapReduce support. But also like Thrift, the Elephant Bird project can
be used to encode protobuf records, providing support for MapReduce, compression, and
splittability.
Avro
Avro is a language-neutral data serialization system designed to address the major
downside of Hadoop Writables: lack of language portability. Like Thrift and Protocol
Buffers, Avro data is described through a language-independent schema. Unlike Thrift and
Protocol Buffers, code generation is optional with Avro. Since Avro stores the schema in
the header of each file, it’s self-describing and Avro files can easily be read later, even
from a different language than the one used to write the file. Avro also provides better
native support for MapReduce since Avro data files are compressible and splittable.
Another important feature of Avro that makes it superior to SequenceFiles for Hadoop
applications is support for schema evolution; that is, the schema used to read a file does
not need to match the schema used to write the file. This makes it possible to add new
fields to a schema as requirements change.
Avro schemas are usually written in JSON, but may also be written in Avro IDL, which is
a C-like language. As just noted, the schema is stored as part of the file metadata in the
file header. In addition to metadata, the file header contains a unique sync marker. Just as
with SequenceFiles, this sync marker is used to separate blocks in the file, allowing Avro
files to be splittable. Following the header, an Avro file contains a series of blocks
containing serialized Avro objects. These blocks can optionally be compressed, and within
those blocks, types are stored in their native format, providing an additional boost to
compression. At the time of writing, Avro supports Snappy and Deflate compression.
While Avro defines a small number of primitive types such as Boolean, int, float, and
string, it also supports complex types such as array, map, and enum.
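The following is a minimal sketch of writing an Avro data file through the Java generic API. The user_event schema is a hypothetical example, and the Snappy codec assumes the snappy-java library is on the classpath.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
  private static final String SCHEMA_JSON =
      "{ \"type\": \"record\", \"name\": \"user_event\", \"fields\": ["
      + " { \"name\": \"user_id\", \"type\": \"long\" },"
      + " { \"name\": \"url\", \"type\": \"string\" },"
      + " { \"name\": \"ts\", \"type\": \"long\" } ] }";

  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    GenericRecord event = new GenericData.Record(schema);
    event.put("user_id", 42L);
    event.put("url", "/checkout");
    event.put("ts", System.currentTimeMillis());

    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.setCodec(CodecFactory.snappyCodec());
      // The schema is written into the file header, which is what makes the
      // file self-describing and readable later, even from other languages.
      writer.create(schema, new File("user_events.avro"));
      writer.append(event);
    }
  }
}

Because the writer schema travels with the file, a later reader can supply an evolved reader schema, for example one that adds a field with a default value, and Avro will resolve the difference between the two.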
Columnar Formats
Until relatively recently, most database systems stored records in a row-oriented fashion.
This is efficient for cases where many columns of the record need to be fetched. For
example, if your analysis heavily relied on fetching all fields for records that belonged to a
particular time range, row-oriented storage would make sense. This option can also be
more efficient when you’re writing data, particularly if all columns of the record are
available at write time because the record can be written with a single disk seek. More
recently, a number of databases have introduced columnar storage, which provides several
benefits over earlier row-oriented systems:
Skips I/O and decompression (if applicable) on columns that are not a part of the query.
Works well for queries that only access a small subset of columns. If many columns are
being accessed, then row-oriented is generally preferable.
Is generally very efficient in terms of compression on columns because entropy within
a column is lower than entropy within a block of rows. In other words, data is more
similar within the same column than it is in a block of rows. This can make a huge
difference, especially when the column has few distinct values.
Is often well suited for data-warehousing-type applications where users want to
aggregate certain columns over a large collection of records.
Not surprisingly, columnar file formats are also being utilized for Hadoop applications.
Columnar file formats supported on Hadoop include the RCFile format, which has been
popular for some time as a Hive format, as well as newer formats such as the Optimized
Row Columnar (ORC) and Parquet, which are described next.
RCFile
The RCFile format was developed specifically to provide efficient processing for
MapReduce applications, although in practice it’s only seen use as a Hive storage format.
The RCFile format was developed to provide fast data loading, fast query processing, and
highly efficient storage space utilization. The RCFile format breaks files into row splits,
then within each split uses column-oriented storage.
Although the RCFile format provides advantages in terms of query and compression
performance compared to SequenceFiles, it also has some deficiencies that prevent
optimal performance for query times and compression. Newer columnar formats such as
ORC and Parquet address many of these deficiencies, and for most newer applications,
they will likely replace the use of RCFile. RCFile is still a fairly common format used
with Hive storage.
ORC
The ORC format was created to address some of the shortcomings with the RCFile format,
specifically around query performance and storage efficiency. The ORC format provides
the following features and benefits, many of which are distinct improvements over
RCFile:
Provides lightweight, always-on compression via type-specific readers and
writers. ORC also supports the use of zlib, LZO, or Snappy to provide further
compression.
Allows predicates to be pushed down to the storage layer so that only required data is
brought back in queries.
Supports the Hive type model, including new primitives such as decimal and complex
types.
Is a splittable storage format.
A drawback of ORC as of this writing is that it was designed specifically for Hive, and so
is not a general-purpose storage format that can be used with non-Hive MapReduce
interfaces such as Pig or Java, or other query engines such as Impala. Work is under way
to address these shortcomings, though.
Parquet
Parquet shares many of the same design goals as ORC, but is intended to be a general-
purpose storage format for Hadoop. In fact, ORC came after Parquet, so some could say
that ORC is a Parquet wannabe. As such, the goal is to create a format that’s suitable for
different MapReduce interfaces such as Java, Hive, and Pig, and also suitable for other
processing engines such as Impala and Spark. Parquet provides the following benefits,
many of which it shares with ORC:
Similar to ORC files, Parquet allows for returning only required data fields, thereby
reducing I/O and increasing performance.
Provides efficient compression; compression can be specified on a per-column level.
Is designed to support complex nested data structures.
Stores full metadata at the end of files, so Parquet files are self-documenting.
Fully supports reading and writing with the Avro and Thrift APIs.
Uses efficient and extensible encoding schemes, such as bit packing and run-length
encoding (RLE).
Avro and Parquet
Over time, we have learned that there is great value in having a single interface to all the
files in your Hadoop cluster. And if you are going to pick one file format, you will want to
pick one with a schema because, in the end, most data in Hadoop will be structured or
semistructured data.
So if you need a schema, Avro and Parquet are great options. However, we don’t want to
have to worry about making an Avro version of the schema and a Parquet version.
Thankfully, this isn’t an issue because Parquet can be read and written to with Avro APIs
and Avro schemas.
This means we can have our cake and eat it too. We can meet our goal of having one
interface to interact with our Avro and Parquet files, and we can have both block and
columnar options for storing our data.
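As a sketch of what this looks like in code, assuming the parquet-avro module is on the classpath and reusing the hypothetical user_event schema from earlier, the same Avro schema and GenericRecord objects can be written out as a columnar, Snappy-compressed Parquet file:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class AvroToParquetExample {
  public static void main(String[] args) throws IOException {
    Schema schema = new Schema.Parser().parse(
        "{ \"type\": \"record\", \"name\": \"user_event\", \"fields\": ["
        + " { \"name\": \"user_id\", \"type\": \"long\" },"
        + " { \"name\": \"url\", \"type\": \"string\" } ] }");

    GenericRecord event = new GenericData.Record(schema);
    event.put("user_id", 42L);
    event.put("url", "/checkout");

    // Same Avro schema and record type, but stored in columnar Parquet form.
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("user_events.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .build()) {
      writer.write(event);
    }
  }
}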
COMPARING FAILURE BEHAVIOR FOR DIFFERENT FILE FORMATS
An important aspect of the various file formats is failure handling; some formats handle corruption better than
others:
Columnar formats, while often efficient, do not work well in the event of failure, since this can lead to
incomplete rows.
Sequence files will be readable up to the first failed row, but will not be recoverable after that row.
Avro provides the best failure handling; in the event of a bad record, the read will continue at the next sync
point, so failures only affect a portion of a file.
Compression
Compression is another important consideration for storing data in Hadoop, not just in
terms of reducing storage requirements, but also to improve data processing performance.
Because a major overhead in processing large amounts of data is disk and network I/O,
reducing the amount of data that needs to be read and written to disk can significantly
decrease overall processing time. This includes compression of source data, but also the
intermediate data generated as part of data processing (e.g., MapReduce jobs). Although
compression adds CPU load, for most cases this is more than offset by the savings in I/O.
Although compression can greatly optimize processing performance, not all compression
formats supported on Hadoop are splittable. Because the MapReduce framework splits
data for input to multiple tasks, having a nonsplittable compression format is an
impediment to efficient processing. If files cannot be split, that means the entire file needs
to be passed to a single MapReduce task, eliminating the advantages of parallelism and
data locality that Hadoop provides. For this reason, splittability is a major consideration in
choosing a compression format as well as file format. We’ll discuss the various
compression formats available for Hadoop, and some considerations in choosing between
them.
Snappy
Snappy is a compression codec developed at Google that provides fast compression speeds with
reasonable compression ratios. Although Snappy doesn’t offer the best compression sizes, it
does provide a good trade-off between speed and size. Processing performance with
Snappy can be significantly better than with other compression formats. It’s important to note
that Snappy is intended to be used with a container format like SequenceFiles or Avro,
since it’s not inherently splittable.
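As a sketch of that pairing, the following writes a block-compressed SequenceFile with the Snappy codec using Hadoop’s standard SequenceFile.Writer options; the path and key/value types are illustrative, and the native Snappy libraries must be available on the nodes.

```java
// Sketch: Snappy inside a splittable container (a block-compressed SequenceFile).
// Path and key/value classes are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SnappySequenceFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path("/data/clicks/clicks.seq");

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(file),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        // Block compression compresses batches of records together, so the
        // file stays splittable even though Snappy itself is not.
        SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))) {
      writer.append(new LongWritable(1L), new Text("first record"));
      writer.append(new LongWritable(2L), new Text("second record"));
    }
  }
}
```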
LZO
LZO is similar to Snappy in that it’s optimized for speed as opposed to size. Unlike
Snappy, LZO compressed files are splittable, but this requires an additional indexing step.
This makes LZO a good choice for things like plain-text files that are not being stored as
part of a container format. It should also be noted that LZO’s license prevents it from
being distributed with Hadoop and requires a separate install, unlike Snappy, which can be
distributed with Hadoop.
Gzip
Gzip provides very good compression performance (on average, about 2.5 times the
compression that’d be offered by Snappy), but its write speed performance is not as good
as Snappy’s (on average, it’s about half of Snappy’s). Gzip usually performs almost as
well as Snappy in terms of read performance. Gzip is also not splittable, so it should be
used with a container format. Note that one reason Gzip is sometimes slower than Snappy
for processing is that Gzip compressed files take up fewer blocks, so fewer tasks are
required for processing the same data. For this reason, using smaller blocks with Gzip can
lead to better performance.
bzip2
bzip2 provides excellent compression performance, but can be significantly slower than
other compression codecs such as Snappy in terms of processing performance. Unlike
Snappy and Gzip, bzip2 is inherently splittable. In the examples we have seen, bzip2 will
normally compress around 9% better than Gzip in terms of storage space. However, this
extra compression comes with a significant read/write performance cost. This performance
difference will vary with different machines, but in general bzip2 is about 10 times slower
than Gzip. For this reason, it’s not an ideal codec for Hadoop storage, unless your primary
need is reducing the storage footprint. One example of such a use case would be using
Hadoop mainly for active archival purposes.
Compression recommendations
In general, any compression format can be made splittable when used with container file
formats (Avro, SequenceFiles, etc.) that compress blocks of records or each record
individually. If you are doing compression on the entire file without using a container file
format, then you have to use a compression format that inherently supports splitting (e.g.,
bzip2, which inserts synchronization markers between blocks).
Here are some recommendations on compression in Hadoop:
Enable compression of MapReduce intermediate output. This will improve
performance by decreasing the amount of intermediate data that needs to be written to
and read from disk (a configuration sketch follows Figure 1-2).
Pay attention to how data is ordered. Often, ordering data so that like data is close
together will provide better compression levels. Remember, data in Hadoop file
formats is compressed in chunks, and it is the entropy of those chunks that will
determine the final compression. For example, if you have stock ticks with the columns
timestamp, stock ticker, and stock price, then ordering the data by a repeated field, such
as stock ticker, will provide better compression than ordering by a unique field, such as
time or stock price.
Consider using a compact file format with support for splittable compression, such as
Avro. Figure 1-2 illustrates how Avro or SequenceFiles support splittability with
otherwise nonsplittable compression formats. A single HDFS block can contain
multiple Avro or SequenceFile blocks. Each of the Avro or SequenceFile blocks can be
compressed and decompressed individually and independently of any other
Avro/SequenceFile blocks. This, in turn, means that each of the HDFS blocks can be
compressed and decompressed individually, thereby making the data splittable.
Figure 1-2. An example of compression with Avro
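As promised above, here is a minimal sketch of enabling intermediate (map output) compression, and optionally final output compression, for a MapReduce job. The property names are the standard Hadoop 2 (MRv2) keys; the job name and codec choice are illustrative.

```java
// Sketch: compressing intermediate map output (and, optionally, final job
// output) for a MapReduce job. Job name and codec choice are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedIntermediateOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress map output that is spilled to disk and shuffled over the network.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    // Optionally compress the job's final output as well.
    conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
    conf.set("mapreduce.output.fileoutputformat.compress.codec",
        SnappyCodec.class.getName());

    Job job = Job.getInstance(conf, "compressed-intermediate-output");
    // ... set mapper, reducer, input/output formats and paths as usual ...
  }
}
```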
HDFS Schema Design
As pointed out in the previous section, HDFS and HBase are two very commonly used
storage managers. Depending on your use case, you can store your data in HDFS or
HBase (which internally stores it on HDFS).
In this section, we will describe the considerations for good schema design for data that
you decide to store in HDFS directly. As mentioned earlier, Hadoop’s Schema-on-Read
model does not impose any requirements when loading data into Hadoop. Data can be
simply ingested into HDFS by one of many methods (which we will discuss further in
Chapter 2) without our having to associate a schema or preprocess the data.
While many people use Hadoop for storing and processing unstructured data (such as
images, videos, emails, or blog posts) or semistructured data (such as XML documents
and logfiles), some order is still desirable. This is especially true since Hadoop often
serves as a data hub for the entire organization, and the data stored in HDFS is intended to
be shared among many departments and teams. Creating a carefully structured and
organized repository of your data will provide many benefits. To list a few:
A standard directory structure makes it easier to share data between teams working
with the same data sets.
It also allows for enforcing access and quota controls to prevent accidental deletion or
corruption.
Oftentimes, you’ll want to “stage” data in a separate location before all of it is ready to be
processed. Conventions regarding staging data will help ensure that partially loaded
data does not get accidentally processed as if it were complete.
Standardized organization of data allows for reuse of some code that may process the
data.
Some tools in the Hadoop ecosystem make assumptions regarding the
placement of data. It is often simpler to match those assumptions when you are initially
loading data into Hadoop.
The details of the data model will be highly dependent on the specific use case. For
example, data warehouse implementations and other event stores are likely to use a
schema similar to the traditional star schema, including structured fact and dimension
tables. Unstructured and semistructured data, on the other hand, are likely to focus more
on directory placement and metadata management.
The important points to keep in mind when designing the schema, regardless of the project
specifics, are:
Develop standard practices and enforce them, especially when multiple teams are
sharing the data.
Make sure your design will work well with the tools you are planning to use. For
example, the version of Hive you are planning to use may only support table partitions
on directories that are named a certain way. This will impact the schema design in
general and how you name your table subdirectories, in particular.
Keep usage patterns in mind when designing a schema. Different data processing and
querying patterns work better with different schema designs. Understanding the main
use cases and data retrieval requirements will result in a schema that will be easier to
maintain and support in the long term as well as improve data processing performance.
Location of HDFS Files
To talk in more concrete terms, the first decision to make when you’re designing an
HDFS schema is the location of the files. Standard locations make it easier to find and
share data between teams. The following is an example HDFS directory structure that we
recommend. This directory structure simplifies the assignment of permissions to various
groups and users:
/user/<username>
Data, JARs, and configuration files that belong only to a specific user. This is usually
scratch type data that the user is currently experimenting with but is not part of a
business process. The directories under /user will typically only be readable and
writable by the users who own them.
/etl
Data in various stages of being processed by an ETL (extract, transform, and load)
workflow. The /etl directory will be readable and writable by ETL processes (they
typically run under their own user) and members of the ETL team. The /etl directory
tree will have subdirectories for the various groups that own the ETL processes, such
as business analytics, fraud detection, and marketing. The ETL workflows are
typically part of a larger application, such as clickstream analysis or recommendation
engines, and each application should have its own subdirectory under the /etl
directory. Within each application-specific directory, you would have a directory for
each ETL process or workflow for that application. Within the workflow directory,
there are subdirectories for each of the data sets. For example, if your Business
Intelligence (BI) team has a clickstream analysis application and one of its processes
is to aggregate user preferences, the recommended name for the directory that
contains the data would be /etl/BI/clickstream/aggregate_preferences. In some cases,
you may want to go one level further and have directories for each stage of the
process: input for the landing zone where the data arrives, processing for the
intermediate stages (there may be more than one processing directory), output for the
final result, and bad where records or files that are rejected by the ETL process land
for manual troubleshooting. In such cases, the final structure will look similar to
/etl/<group>/<application>/<process>/{input,processing,output,bad}
/tmp
Temporary data generated by tools or shared between users. This directory is
typically cleaned by an automated process and does not store long-term data. This
directory is typically readable and writable by everyone.
/data
Data sets that have been processed and are shared across the organization. Because
these are often critical data sources for analysis that drive business decisions, there
are often controls around who can read and write this data. Because data in /data is
typically business-critical, user access is very often read-only, and data is written only by
automated (and audited) ETL processes, so changes are controlled and audited. Different business
groups will have read access to different directories under /data, depending on their
reporting and processing needs. Since /data serves as the location for shared
processed data sets, it will contain subdirectories for each data set. For example, if
you were storing all orders of a pharmacy in a table called medication_orders, we
recommend that you store this data set in a directory named /data/medication_orders.
/app
Includes everything required for Hadoop applications to run, except data. This
includes JAR files, Oozie workflow definitions, Hive HQL files, and more. The
application code directory /app is used for application artifacts such as JARs for
Oozie actions or Hive user-defined functions (UDFs). It is not always necessary to
store such application artifacts in HDFS, but some Hadoop applications such as
Oozie and Hive require storing shared code and configuration on HDFS so it can be
used by code executing on any node of the cluster. This directory should have a
subdirectory for each group and application, similar to the structure used in /etl. For a
given application (say, Oozie), you would need a directory for each version of the
artifacts you decide to store in HDFS, possibly tagging, via a symlink in HDFS, the
latest artifact as latest and the currently used one as current. The directories
containing the binary artifacts would be present under these versioned directories.
This will look similar to: /app/<group>/<application>/<version>/<artifact
directory>/<artifact>. To continue our previous example, the JAR for the latest build
of our aggregate preferences process would be in a directory structure like
/app/BI/clickstream/latest/aggregate_preferences/uber-aggregate-preferences.jar.
/metadata
Stores metadata. While most table metadata is stored in the Hive metastore, as
described later in the “Managing Metadata” section, some extra metadata (for example, Avro
schema files) may need to be stored in HDFS. This directory would be the best
location for storing such metadata. This directory is typically readable by ETL jobs
but writable by the user used for ingesting data into Hadoop (e.g., Sqoop user). For
example, the Avro schema file for a data set called movie may exist at a location like
this: /metadata/movielens/movie/movie.avsc. We will discuss this particular example
in more detail in Chapter 10.
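The structure above is usually created once by an administrator. As a sketch only (the owners, groups, and permission modes here are illustrative, not prescriptive), it can be bootstrapped with the HDFS FileSystem API, which is also where ownership and permissions are applied:

```java
// Sketch: creating the recommended top-level HDFS directories and applying
// coarse ownership/permissions. Owners, groups, and modes are illustrative,
// and setOwner requires superuser privileges.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsLayoutBootstrap {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // World-writable scratch space with the sticky bit, cleaned automatically.
    fs.mkdirs(new Path("/tmp"), new FsPermission((short) 01777));

    // Shared, curated data sets: written by ETL, read by analyst groups.
    Path orders = new Path("/data/medication_orders");
    fs.mkdirs(orders, new FsPermission((short) 0750));
    fs.setOwner(orders, "etl", "analysts");

    // ETL working area: one subtree per group/application/process.
    fs.mkdirs(new Path("/etl/BI/clickstream/aggregate_preferences/input"),
        new FsPermission((short) 0770));

    // Application artifacts (Oozie workflows, Hive UDF JARs, and so on).
    fs.mkdirs(new Path("/app/BI/clickstream/latest/aggregate_preferences"),
        new FsPermission((short) 0755));
  }
}
```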
Advanced HDFS Schema Design
Once the broad directory structure has been decided, the next important decision is how
data will be organized into files. While we have already talked about how the format of
the ingested data may not be the optimal format for storing it, it’s important to note
that the default organization of ingested data may not be optimal either. There are a few
strategies to best organize your data. We will talk about partitioning, bucketing, and
denormalizing here.
Partitioning
Partitioning a data set is a very common technique used to reduce the amount of I/O
required to process the data set. When you’re dealing with large amounts of data, the
savings brought by reducing I/O can be quite significant. Unlike traditional data
warehouses, however, HDFS doesn’t store indexes on the data. This lack of indexes plays
a large role in speeding up data ingest, but it also means that every query will have to read
the entire data set even when you’re processing only a small subset of the data (a pattern
called full table scan). When the data sets grow very big, and queries only require access
to subsets of data, a very good solution is to break up the data set into smaller subsets, or
partitions. Each of these partitions would be present in a subdirectory of the directory
containing the entire data set. This will allow queries to read only the specific partitions
(i.e., subdirectories) they require, reducing the amount of I/O and improving query times
significantly.
For example, say you have a data set that stores all the orders for various pharmacies in a
data set called medication_orders, and you’d like to check order history for just one
physician over the past three months. Without partitioning, you’d need to read the entire
data set and filter out all the records that don’t pertain to the query.
However, if we were to partition the entire orders data set so that each partition included only a
single day’s worth of data, a query looking for information from the past three months would
only need to read 90 or so partitions, not the entire data set.
When placing the data in the filesystem, you should use the following directory format for
partitions: <data set name>/<partition_column_name=partition_column_value>/{files}.
In our example, this translates to: medication_orders/date=20131101/{order1.csv,
order2.csv}
This directory structure is understood by various tools, like HCatalog, Hive, Impala, and
Pig, which can leverage partitioning to reduce the amount of I/O required during
processing.
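To illustrate the convention, here is a small sketch (the class and method names are hypothetical) that maps a record’s order date to the partition directory it should land in; the new directory can then be registered as a partition in the table metadata (for example, with Hive’s ALTER TABLE ... ADD PARTITION statement).

```java
// Sketch: computing a partition directory following the
// <data set>/<partition column>=<value>/ convention. Names are hypothetical.
import java.text.SimpleDateFormat;
import java.util.Date;

public class PartitionPath {
  private static final String DATA_SET = "medication_orders";

  // e.g. partitionFor(someOrderDate) -> "medication_orders/date=20131101/"
  public static String partitionFor(Date orderDate) {
    SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
    return DATA_SET + "/date=" + day.format(orderDate) + "/";
  }

  public static void main(String[] args) {
    System.out.println(partitionFor(new Date()));
  }
}
```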
Bucketing
Bucketing is another technique for decomposing large data sets into more manageable
subsets. It is similar to the hash partitions used in many relational databases. In the
preceding example, we could partition the orders data set by date because a large
number of orders are placed daily, so the partitions will contain large enough files, which is
what HDFS is optimized for. However, if we tried to partition the data by physician to
optimize for queries looking for specific physicians, the resulting number of partitions
would be too large and the resulting files too small. This leads to what’s called
the small files problem. As detailed in “File-based data structures”, storing a large number
of small files in Hadoop can lead to excessive memory use for the NameNode, since
metadata for each file stored in HDFS is held in memory. Also, many small files can lead
to many processing tasks, causing excessive overhead in processing.
The solution is to bucket by physician, which will use a hashing function to map
physicians into a specified number of buckets. This way, you can control the size of the
data subsets (i.e., buckets) and optimize for query speed. Files should not be so small that
you’ll need to read and manage a huge number of them, but also not so large that each
query will be slowed down by having to scan through huge amounts of data. A good
average bucket size is a few multiples of the HDFS block size. Having an even
distribution of data when hashed on the bucketing column is important because it leads to
consistent bucketing. Also, having the number of buckets as a power of two is quite
common.
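The bucket assignment itself is just a hash of the bucketing column taken modulo the number of buckets. The sketch below mirrors that idea; it is not guaranteed to produce the same bucket numbers as Hive’s internal hash function, and the names are illustrative.

```java
// Sketch: assigning records to buckets by hashing the bucketing column.
// This mirrors the idea behind Hive bucketing but is not guaranteed to match
// Hive's own hash function; names are illustrative.
public class BucketAssigner {
  private final int numBuckets;

  public BucketAssigner(int numBuckets) {
    this.numBuckets = numBuckets; // commonly a power of two
  }

  public int bucketFor(String physicianId) {
    // Mask the sign bit so the bucket number is always non-negative.
    return (physicianId.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    BucketAssigner assigner = new BucketAssigner(32);
    System.out.println(assigner.bucketFor("physician-12345"));
  }
}
```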
An additional benefit of bucketing becomes apparent when you’re joining two data sets.
The word join here is used to represent the general idea of combining two data sets to
retrieve a result. Joins can be done via SQL-on-Hadoop systems but also in MapReduce,
or Spark, or other programming interfaces to Hadoop.
When both the data sets being joined are bucketed on the join key and the number of
buckets of one data set is a multiple of the other, it is enough to join the corresponding
buckets individually without having to join the entire data sets. This significantly reduces
the cost of a reduce-side join, which is otherwise computationally expensive: instead of
joining the entire data sets together, you join just the corresponding buckets with each
other. Of course, the buckets from both tables can be joined in parallel. Moreover, because the
buckets are typically small enough to easily fit into memory, you can do the entire join in
the map stage of a MapReduce job by loading the smaller of the buckets in memory. This
is called a map-side join, and it improves join performance over a reduce-side join even
further. If you are using Hive for data analysis, it should automatically
recognize that tables are bucketed and apply this optimization.
If the data in the buckets is sorted, it is also possible to use a merge join and not store the
entire bucket in memory when joining. This is somewhat faster than a simple bucket join
and requires much less memory. Hive supports this optimization as well. Note that it is
possible to bucket any table, even when there are no logical partition keys. It is
recommended to use both sorting and bucketing on all large tables that are frequently
joined together, using the join key for bucketing.
As you can tell from the preceding discussion, the schema design is highly dependent on
the way the data will be queried. You will need to know which columns will be used for
joining and filtering before deciding on partitioning and bucketing of the data. In cases
when there are multiple common query patterns and it is challenging to decide on one
partitioning key, you have the option of storing the same data set multiple times, each with
different physical organization. This is considered an anti-pattern in relational databases,
but with Hadoop, this solution can make sense. For one thing, in Hadoop data is typically
write-once, and few updates are expected. Therefore, the usual overhead of keeping
duplicated data sets in sync is greatly reduced. In addition, the cost of storage in Hadoop
clusters is significantly lower, so there is less concern about wasted disk space. These
attributes allow us to trade space for greater query speed, which is often desirable.
Denormalizing
Although we talked about joins in the previous subsections, another method of trading
disk space for query performance is denormalizing data sets so there is less of a need to
join data sets. In relational databases, data is often stored in third normal form. Such a
schema is designed to minimize redundancy and provide data integrity by splitting data
into smaller tables, each holding a very specific entity. This means that most queries will
require joining a large number of tables together to produce final result sets.
In Hadoop, however, joins are often the slowest operations and consume the most
resources from the cluster. Reduce-side joins, in particular, require sending entire tables
over the network. As we’ve already seen, it is important to optimize the schema to avoid
these expensive operations as much as possible. While bucketing and sorting do help
there, another solution is to create data sets that are prejoined — in other words,
preaggregated. The idea is to minimize the amount of work queries have to do by doing as
much as possible in advance, especially for queries or subqueries that are expected to
execute frequently. Instead of running the join operations every time a user tries to look at
the data, we can join the data once and store it in this form.
Looking at the difference between a typical Online Transaction Processing (OLTP)
schema and an HDFS schema of a particular use case, you will see that the Hadoop
schema consolidates many of the small dimension tables into a few larger dimensions by
joining them during the ETL process. In the case of our pharmacy example, we
consolidate frequency, class, admin route, and units into the medications data set, to avoid
repeated joining.
Other types of data preprocessing, like aggregation or data type conversion, can be done to
speed up processes as well. Since data duplication is a lesser concern, almost any type of
processing that occurs frequently in a large number of queries is worth doing once and
reusing. In relational databases, this pattern is often known as Materialized Views. In
Hadoop, you instead have to create a new data set that contains the same data in its
aggregated form.
More Related Content

PDF
Hadoop essentials by shiva achari - sample chapter
Shiva Achari
 
PPTX
Hadoop Innovation Summit 2014
Data Con LA
 
PPTX
Big Data Training in Amritsar
E2MATRIX
 
PPTX
Big Data Training in Mohali
E2MATRIX
 
PPTX
Big Data Training in Ludhiana
E2MATRIX
 
PPT
Introduction to Apache hadoop
Omar Jaber
 
PPTX
Hadoop online training
Keylabs
 
PDF
Hadoop training kit from lcc infotech
lccinfotech
 
Hadoop essentials by shiva achari - sample chapter
Shiva Achari
 
Hadoop Innovation Summit 2014
Data Con LA
 
Big Data Training in Amritsar
E2MATRIX
 
Big Data Training in Mohali
E2MATRIX
 
Big Data Training in Ludhiana
E2MATRIX
 
Introduction to Apache hadoop
Omar Jaber
 
Hadoop online training
Keylabs
 
Hadoop training kit from lcc infotech
lccinfotech
 

Similar to Hadoop Application Architectures Mark Grover Ted Malaska Jonathan Seidman Gwen Shapira (20)

PDF
What is Apache Hadoop and its ecosystem?
tommychauhan
 
PPTX
Hadoop Training in Delhi
APTRON
 
PPTX
Get started with hadoop hive hive ql languages
JanBask Training
 
PPTX
Hadoop Platforms - Introduction, Importance, Providers
Mrigendra Sharma
 
PPT
Hadoop in action
Mahmoud Yassin
 
PPTX
Big Data Hadoop Technology
Rahul Sharma
 
PDF
Hadoop Business Cases
Joey Jablonski
 
PDF
Hadoop .pdf
SudhanshiBakre1
 
PDF
Hadoop Vs Spark — Choosing the Right Big Data Framework
Alaina Carter
 
PPTX
Hadoop
thisisnabin
 
DOCX
Hadoop Report
Nishant Gandhi
 
PPTX
Brief Introduction about Hadoop and Core Services.
Muthu Natarajan
 
PPTX
Hadoop in a Nutshell
Anthony Thomas
 
PPTX
Overview of big data & hadoop version 1 - Tony Nguyen
Thanh Nguyen
 
PPTX
Overview of Big data, Hadoop and Microsoft BI - version1
Thanh Nguyen
 
PDF
Which Hadoop Distribution to use: Apache, Cloudera, MapR or HortonWorks?
Edureka!
 
PPTX
What does the future of Big data look like?How to get a fresher job in data a...
Acutesoft Solutions India Pvt Ltd
 
PDF
Hadoop data-lake-white-paper
Supratim Ray
 
PPTX
View on big data technologies
Krisshhna Daasaarii
 

Recently uploaded (20)

PPTX
Gupta Art & Architecture Temple and Sculptures.pptx
Virag Sontakke
 
DOCX
Unit 5: Speech-language and swallowing disorders
JELLA VISHNU DURGA PRASAD
 
PPTX
CDH. pptx
AneetaSharma15
 
PPTX
Artificial Intelligence in Gastroentrology: Advancements and Future Presprec...
AyanHossain
 
PPTX
Applications of matrices In Real Life_20250724_091307_0000.pptx
gehlotkrish03
 
PPTX
PROTIEN ENERGY MALNUTRITION: NURSING MANAGEMENT.pptx
PRADEEP ABOTHU
 
PPTX
How to Track Skills & Contracts Using Odoo 18 Employee
Celine George
 
PPTX
Continental Accounting in Odoo 18 - Odoo Slides
Celine George
 
PPTX
HEALTH CARE DELIVERY SYSTEM - UNIT 2 - GNM 3RD YEAR.pptx
Priyanshu Anand
 
PPTX
BASICS IN COMPUTER APPLICATIONS - UNIT I
suganthim28
 
PPTX
Sonnet 130_ My Mistress’ Eyes Are Nothing Like the Sun By William Shakespear...
DhatriParmar
 
PPTX
Five Point Someone – Chetan Bhagat | Book Summary & Analysis by Bhupesh Kushwaha
Bhupesh Kushwaha
 
PPTX
Dakar Framework Education For All- 2000(Act)
santoshmohalik1
 
PPTX
Virus sequence retrieval from NCBI database
yamunaK13
 
PPTX
INTESTINALPARASITES OR WORM INFESTATIONS.pptx
PRADEEP ABOTHU
 
PPTX
family health care settings home visit - unit 6 - chn 1 - gnm 1st year.pptx
Priyanshu Anand
 
PPTX
Artificial-Intelligence-in-Drug-Discovery by R D Jawarkar.pptx
Rahul Jawarkar
 
PPTX
How to Close Subscription in Odoo 18 - Odoo Slides
Celine George
 
PDF
Biological Classification Class 11th NCERT CBSE NEET.pdf
NehaRohtagi1
 
PPTX
A Smarter Way to Think About Choosing a College
Cyndy McDonald
 

Hadoop Application Architectures Mark Grover Ted Malaska Jonathan Seidman Gwen Shapira

  • 1. Hadoop Application Architectures Mark Grover Ted Malaska Jonathan Seidman Gwen Shapira download https://ebookbell.com/product/hadoop-application-architectures- mark-grover-ted-malaska-jonathan-seidman-gwen-shapira-56428346 Explore and download more ebooks at ebookbell.com
  • 2. Here are some recommended products that we believe you will be interested in. You can click the link to download. Hadoop Application Architectures Designing Realworld Big Data Applications 1st Edition Rajat Mark Grover https://ebookbell.com/product/hadoop-application-architectures- designing-realworld-big-data-applications-1st-edition-rajat-mark- grover-42543940 Agile Data Science Building Data Analytics Applications With Hadoop 1st Edition Russell Jurney https://ebookbell.com/product/agile-data-science-building-data- analytics-applications-with-hadoop-1st-edition-russell-jurney-38556478 Practical Hadoop Migration How To Integrate Your Rdbms With The Hadoop Ecosystem And Rearchitect Relational Applications To Nosql 1st Edition Bhushan Lakhe https://ebookbell.com/product/practical-hadoop-migration-how-to- integrate-your-rdbms-with-the-hadoop-ecosystem-and-rearchitect- relational-applications-to-nosql-1st-edition-bhushan-lakhe-33349910 Hadoop Essentials Delve Into The Key Concepts Of Hadoop And Get A Thorough Understanding Of The Hadoop Ecosystem 1st Edition Shiva Achari https://ebookbell.com/product/hadoop-essentials-delve-into-the-key- concepts-of-hadoop-and-get-a-thorough-understanding-of-the-hadoop- ecosystem-1st-edition-shiva-achari-50949564
  • 3. Hadoop The Definitive Guide Third White Tom https://ebookbell.com/product/hadoop-the-definitive-guide-third-white- tom-55285128 Hadoop For Finance Essentials 1st Edition Rajiv Tiwari https://ebookbell.com/product/hadoop-for-finance-essentials-1st- edition-rajiv-tiwari-55292446 Hadoop Blueprints Anurag Shrivastava Tanmay Deshpande https://ebookbell.com/product/hadoop-blueprints-anurag-shrivastava- tanmay-deshpande-55292464 Hadoop Mapreduce V2 Cookbook Second Edition Thilina Gunarathne https://ebookbell.com/product/hadoop-mapreduce-v2-cookbook-second- edition-thilina-gunarathne-55292468 Hadoop Cluster Deployment 1st Edition Zburivsky Danil https://ebookbell.com/product/hadoop-cluster-deployment-1st-edition- zburivsky-danil-55292482
  • 7. Hadoop Application Architectures Mark Grover, Ted Malaska, Jonathan Seidman & Gwen Shapira
  • 9. Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira Copyright © 2015 Jonathan Seidman, Gwen Shapira, Ted Malaska, and Mark Grover. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Editors: Ann Spencer and Brian Anderson Production Editor: Nicole Shelby Copyeditor: Rachel Monaghan Proofreader: Elise Morrison Indexer: Ellen Troutman Interior Designer: David Futato Cover Designer: Ellie Volckhausen Illustrator: Rebecca Demarest July 2015: First Edition
  • 10. Revision History for the First Edition 2015-06-26: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781491900086 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop Application Architectures, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-90008-6 [LSI]
  • 12. Foreword Apache Hadoop has blossomed over the past decade. It started in Nutch as a promising capability — the ability to scalably process petabytes. In 2005 it hadn’t been run on more than a few dozen machines, and had many rough edges. It was only used by a few folks for experiments. Yet a few saw promise there, that an affordable, scalable, general-purpose data storage and processing framework might have broad utility. By 2007 scalability had been proven at Yahoo!. Hadoop now ran reliably on thousands of machines. It began to be used in production applications, first at Yahoo! and then at other Internet companies, like Facebook, LinkedIn, and Twitter. But while it enabled scalable processing of petabytes, the price of adoption was high, with no security and only a Java batch API. Since then Hadoop’s become the kernel of a complex ecosystem. Its gained fine-grained security controls, high availability (HA), and a general-purpose scheduler (YARN). A wide variety of tools have now been built around this kernel. Some, like HBase and Accumulo, provide online keystores that can back interactive applications. Others, like Flume, Sqoop, and Apache Kafka, help route data in and out of Hadoop’s storage. Improved processing APIs are available through Pig, Crunch, and Cascading. SQL queries can be processed with Apache Hive and Cloudera Impala. Apache Spark is a superstar, providing an improved and optimized batch API while also incorporating real-time stream processing, graph processing, and machine learning. Apache Oozie and Azkaban orchestrate and schedule many of the above. Confused yet? This menagerie of tools can be overwhelming. Yet, to make effective use of this new platform, you need to understand how these tools all fit together and which can help you. The authors of this book have years of experience building Hadoop-based systems and can now share with you the wisdom they’ve gained. In theory there are billions of ways to connect and configure these tools for your use. But in practice, successful patterns emerge. This book describes best practices, where each tool shines, and how best to use it for a particular task. It also presents common-use cases. At first users improvised, trying many combinations of tools, but this book describes the patterns that have proven successful again and again, sparing you much of the exploration. These authors give you the fundamental knowledge you need to begin using this powerful new platform. Enjoy the book, and use it to help you build great Hadoop applications. Doug Cutting Shed in the Yard, California
  • 14. Preface It’s probably not an exaggeration to say that Apache Hadoop has revolutionized data management and processing. Hadoop’s technical capabilities have made it possible for organizations across a range of industries to solve problems that were previously impractical with existing technologies. These capabilities include: Scalable processing of massive amounts of data Flexibility for data processing, regardless of the format and structure (or lack of structure) in the data Another notable feature of Hadoop is that it’s an open source project designed to run on relatively inexpensive commodity hardware. Hadoop provides these capabilities at considerable cost savings over traditional data management solutions. This combination of technical capabilities and economics has led to rapid growth in Hadoop and tools in the surrounding ecosystem. The vibrancy of the Hadoop community has led to the introduction of a broad range of tools to support management and processing of data with Hadoop. Despite this rapid growth, Hadoop is still a relatively young technology. Many organizations are still trying to understand how Hadoop can be leveraged to solve problems, and how to apply Hadoop and associated tools to implement solutions to these problems. A rich ecosystem of tools, application programming interfaces (APIs), and development options provide choice and flexibility, but can make it challenging to determine the best choices to implement a data processing application. The inspiration for this book comes from our experience working with numerous customers and conversations with Hadoop users who are trying to understand how to build reliable and scalable applications with Hadoop. Our goal is not to provide detailed documentation on using available tools, but rather to provide guidance on how to combine these tools to architect scalable and maintainable applications on Hadoop. We assume readers of this book have some experience with Hadoop and related tools. You should have a familiarity with the core components of Hadoop, such as the Hadoop Distributed File System (HDFS) and MapReduce. If you need to come up to speed on Hadoop, or need refreshers on core Hadoop concepts, Hadoop: The Definitive Guide by Tom White remains, well, the definitive guide. The following is a list of other tools and technologies that are important to understand in using this book, including references for further reading: YARN
  • 15. Up until recently, the core of Hadoop was commonly considered as being HDFS and MapReduce. This has been changing rapidly with the introduction of additional processing frameworks for Hadoop, and the introduction of YARN accelarates the move toward Hadoop as a big-data platform supporting multiple parallel processing models. YARN provides a general-purpose resource manager and scheduler for Hadoop processing, which includes MapReduce, but also extends these services to other processing models. This facilitates the support of multiple processing frameworks and diverse workloads on a single Hadoop cluster, and allows these different models and workloads to effectively share resources. For more on YARN, see Hadoop: The Definitive Guide, or the Apache YARN documentation. Java Hadoop and many of its associated tools are built with Java, and much application development with Hadoop is done with Java. Although the introduction of new tools and abstractions increasingly opens up Hadoop development to non-Java developers, having an understanding of Java is still important when you are working with Hadoop. SQL Although Hadoop opens up data to a number of processing frameworks, SQL remains very much alive and well as an interface to query data in Hadoop. This is understandable since a number of developers and analysts understand SQL, so knowing how to write SQL queries remains relevant when you’re working with Hadoop. A good introduction to SQL is Head First SQL by Lynn Beighley (O’Reilly). Scala Scala is a programming language that runs on the Java virtual machine (JVM) and supports a mixed object-oriented and functional programming model. Although designed for general-purpose programming, Scala is becoming increasingly prevalent in the big-data world, both for implementing projects that interact with Hadoop and for implementing applications to process data. Examples of projects that use Scala as the basis for their implementation are Apache Spark and Apache Kafka. Scala, not surprisingly, is also one of the languages supported for implementing applications with Spark. Scala is used for many of the examples in this book, so if you need an introduction to Scala, see Scala for the Impatient by Cay S. Horstmann (Addison- Wesley Professional) or for a more in-depth overview see Programming Scala, 2nd Edition, by Dean Wampler and Alex Payne (O’Reilly). Apache Hive
  • 16. Speaking of SQL, Hive, a popular abstraction for modeling and processing data on Hadoop, provides a way to define structure on data stored in HDFS, as well as write SQL-like queries against this data. The Hive project also provides a metadata store, which in addition to storing metadata (i.e., data about data) on Hive structures is also accessible to other interfaces such as Apache Pig (a high-level parallel programming abstraction) and MapReduce via the HCatalog component. Further, other open source projects — such as Cloudera Impala, a low-latency query engine for Hadoop — also leverage the Hive metastore, which provides access to objects defined through Hive. To learn more about Hive, see the Hive website, Hadoop: The Definitive Guide, or Programming Hive by Edward Capriolo, et al. (O’Reilly). Apache HBase HBase is another frequently used component in the Hadoop ecosystem. HBase is a distributed NoSQL data store that provides random access to extremely large volumes of data stored in HDFS. Although referred to as the Hadoop database, HBase is very different from a relational database, and requires those familiar with traditional database systems to embrace new concepts. HBase is a core component in many Hadoop architectures, and is referred to throughout this book. To learn more about HBase, see the HBase website, HBase: The Definitive Guide by Lars George (O’Reilly), or HBase in Action by Nick Dimiduk and Amandeep Khurana (Manning). Apache Flume Flume is an often used component to ingest event-based data, such as logs, into Hadoop. We provide an overview and details on best practices and architectures for leveraging Flume with Hadoop, but for more details on Flume refer to the Flume documentation or Using Flume (O’Reilly). Apache Sqoop Sqoop is another popular tool in the Hadoop ecosystem that facilitates moving data between external data stores such as a relational database and Hadoop. We discuss best practices for Sqoop and where it fits in a Hadoop architecture, but for more details on Sqoop see the Sqoop documentation or the Apache Sqoop Cookbook (O’Reilly). Apache ZooKeeper The aptly named ZooKeeper project is designed to provide a centralized service to facilitate coordination for the zoo of projects in the Hadoop ecosystem. A number of the components that we discuss in this book, such as HBase, rely on the services provided by ZooKeeper, so it’s good to have a basic understanding of it. Refer to the ZooKeeper site or ZooKeeper by Flavio Junqueira and Benjamin Reed (O’Reilly). As you may have noticed, the emphasis in this book is on tools in the open source Hadoop ecosystem. It’s important to note, though, that many of the traditional enterprise software vendors have added support for Hadoop, or are in the process of adding this support. If your organization is already using one or more of these enterprise tools, it makes a great
  • 17. deal of sense to investigate integrating these tools as part of your application development efforts on Hadoop. The best tool for a task is often the tool you already know. Although it’s valuable to understand the tools we discuss in this book and how they’re integrated to implement applications on Hadoop, choosing to leverage third-party tools in your environment is a completely valid choice. Again, our aim for this book is not to go into details on how to use these tools, but rather, to explain when and why to use them, and to balance known best practices with recommendations on when these practices apply and how to adapt in cases when they don’t. We hope you’ll find this book useful in implementing successful big data solutions with Hadoop.
  • 18. A Note About the Code Examples Before we move on, a brief note about the code examples in this book. Every effort has been made to ensure the examples in the book are up-to-date and correct. For the most current versions of the code examples, please refer to the book’s GitHub repository at https://github.com/hadooparchitecturebook/hadoop-arch-book.
  • 19. Who Should Read This Book Hadoop Application Architectures was written for software developers, architects, and project leads who need to understand how to use Apache Hadoop and tools in the Hadoop ecosystem to build end-to-end data management solutions or integrate Hadoop into existing data management architectures. Our intent is not to provide deep dives into specific technologies — for example, MapReduce — as other references do. Instead, our intent is to provide you with an understanding of how components in the Hadoop ecosystem are effectively integrated to implement a complete data pipeline, starting from source data all the way to data consumption, as well as how Hadoop can be integrated into existing data management systems. We assume you have some knowledge of Hadoop and related tools such as Flume, Sqoop, HBase, Pig, and Hive, but we’ll refer to appropriate references for those who need a refresher. We also assume you have experience programming with Java, as well as experience with SQL and traditional data-management systems, such as relational database-management systems. So if you’re a technologist who’s spent some time with Hadoop, and are now looking for best practices and examples for architecting and implementing complete solutions with it, then this book is meant for you. Even if you’re a Hadoop expert, we think the guidance and best practices in this book, based on our years of experience working with Hadoop, will provide value. This book can also be used by managers who want to understand which technologies will be relevant to their organization based on their goals and projects, in order to help select appropriate training for developers.
  • 20. Why We Wrote This Book We have all spent years implementing solutions with Hadoop, both as users and supporting customers. In that time, the Hadoop market has matured rapidly, along with the number of resources available for understanding Hadoop. There are now a large number of useful books, websites, classes, and more on Hadoop and tools in the Hadoop ecosystem available. However, despite all of the available materials, there’s still a shortage of resources available for understanding how to effectively integrate these tools into complete solutions. When we talk with users, whether they’re customers, partners, or conference attendees, we’ve found a common theme: there’s still a gap between understanding Hadoop and being able to actually leverage it to solve problems. For example, there are a number of good references that will help you understand Apache Flume, but how do you actually determine if it’s a good fit for your use case? And once you’ve selected Flume as a solution, how do you effectively integrate it into your architecture? What best practices and considerations should you be aware of to optimally use Flume? This book is intended to bridge this gap between understanding Hadoop and being able to actually use it to build solutions. We’ll cover core considerations for implementing solutions with Hadoop, and then provide complete, end-to-end examples of implementing some common use cases with Hadoop.
  • 21. Navigating This Book The organization of chapters in this book is intended to follow the same flow that you would follow when architecting a solution on Hadoop, starting with modeling data on Hadoop, moving data into and out of Hadoop, processing the data once it’s in Hadoop, and so on. Of course, you can always skip around as needed. Part I covers the considerations around architecting applications with Hadoop, and includes the following chapters: Chapter 1 covers considerations around storing and modeling data in Hadoop — for example, file formats, data organization, and metadata management. Chapter 2 covers moving data into and out of Hadoop. We’ll discuss considerations and patterns for data ingest and extraction, including using common tools such as Flume, Sqoop, and file transfers. Chapter 3 covers tools and patterns for accessing and processing data in Hadoop. We’ll talk about available processing frameworks such as MapReduce, Spark, Hive, and Impala, and considerations for determining which to use for particular use cases. Chapter 4 will expand on the discussion of processing frameworks by describing the implementation of some common use cases on Hadoop. We’ll use examples in Spark and SQL to illustrate how to solve common problems such as de-duplication and working with time series data. Chapter 5 discusses tools to do large graph processing on Hadoop, such as Giraph and GraphX. Chapter 6 discusses tying everything together with application orchestration and scheduling tools such as Apache Oozie. Chapter 7 discusses near-real-time processing on Hadoop. We discuss the relatively new class of tools that are intended to process streams of data such as Apache Storm and Apache Spark Streaming. In Part II, we cover the end-to-end implementations of some common applications with Hadoop. The purpose of these chapters is to provide concrete examples of how to use the components discussed in Part I to implement complete solutions with Hadoop: Chapter 8 provides an example of clickstream analysis with Hadoop. Storage and processing of clickstream data is a very common use case for companies running large websites, but also is applicable to applications processing any type of machine data. We’ll discuss ingesting data through tools like Flume and Kafka, cover storing and organizing the data efficiently, and show examples of processing the data. Chapter 9 will provide a case study of a fraud detection application on Hadoop, an increasingly common use of Hadoop. This example will cover how HBase can be leveraged in a fraud detection solution, as well as the use of near-real-time processing.
  • 22. Chapter 10 provides a case study exploring another very common use case: using Hadoop to extend an existing enterprise data warehouse (EDW) environment. This includes using Hadoop as a complement to the EDW, as well as providing functionality traditionally performed by data warehouses.
  • 23. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values determined by context. NOTE This icon signifies a tip, suggestion, or general note. WARNING This icon indicates a warning or caution.
  • 24. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/hadooparchitecturebook/hadoop-arch-book. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira (O’Reilly). Copyright 2015 Jonathan Seidman, Gwen Shapira, Ted Malaska, and Mark Grover, 978-1-491-90008-6.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].
  • 25. Safari® Books Online NOTE Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business. Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training. Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals. Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online.
  • 26. How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/hadoop_app_arch_1E. To comment or ask technical questions about this book, send email to [email protected]. For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia
  • 27. Acknowledgments We would like to thank the larger Apache community for its work on Hadoop and the surrounding ecosystem, without which this book wouldn’t exist. We would also like to thank Doug Cutting for providing this book’s forward, and not to mention for co-creating Hadoop. There are a large number of folks whose support and hard work made this book possible, starting with Eric Sammer. Eric’s early support and encouragement was invaluable in making this book a reality. Amandeep Khurana, Kathleen Ting, Patrick Angeles, and Joey Echeverria also provided valuable proposal feedback early on in the project. Many people provided invaluable feedback and support while writing this book, especially the following who provided their time and expertise to review content: Azhar Abubacker, Sean Allen, Ryan Blue, Ed Capriolo, Eric Driscoll, Lars George, Jeff Holoman, Robert Kanter, James Kinley, Alex Moundalexis, Mac Noland, Sean Owen, Mike Percy, Joe Prosser, Jairam Ranganathan, Jun Rao, Hari Shreedharan, Jeff Shmain, Ronan Stokes, Daniel Templeton, Tom Wheeler. Andre Araujo, Alex Ding, and Michael Ernest generously gave their time to test the code examples. Akshat Das provided help with diagrams and our website. Many reviewers helped us out and greatly improved the quality of this book, so any mistakes left are our own. We would also like to thank Cloudera management for enabling us to write this book. In particular, we’d like to thank Mike Olson for his constant encouragement and support from day one. We’d like to thank our O’Reilly editor Brian Anderson and our production editor Nicole Shelby for their help and contributions throughout the project. In addition, we really appreciate the help from many other folks at O’Reilly and beyond — Ann Spencer, Courtney Nash, Rebecca Demarest, Rachel Monaghan, and Ben Lorica — at various times in the development of this book. Our apologies to those who we may have mistakenly omitted from this list. Mark Grover’s Acknowledgements First and foremost, I would like to thank my parents, Neelam and Parnesh Grover. I dedicate it all to the love and support they continue to shower in my life every single day. I’d also like to thank my sister, Tracy Grover, who I continue to tease, love, and admire for always being there for me. Also, I am very thankful to my past and current managers at Cloudera, Arun Singla and Ashok Seetharaman for their continued support of this project. Special thanks to Paco Nathan and Ed Capriolo for encouraging me to write a book. Ted Malaska’s Acknowledgements I would like to thank my wife, Karen, and TJ and Andrew — my favorite two boogers.
  • 28. Jonathan Seidman’s Acknowledgements I’d like to thank the three most important people in my life, Tanya, Ariel, and Madeleine, for their patience, love, and support during the (very) long process of writing this book. I’d also like to thank Mark, Gwen, and Ted for being great partners on this journey. Finally, I’d like to dedicate this book to the memory of my parents, Aaron and Frances Seidman. Gwen Shapira’s Acknowledgements I would like to thank my husband, Omer Shapira, for his emotional support and patience during the many months I spent writing this book, and my dad, Lior Shapira, for being my best marketing person and telling all his friends about the “big data book.” Special thanks to my manager Jarek Jarcec Cecho for his support for the project, and thanks to my team over the last year for handling what was perhaps more than their fair share of the work.
  • 29. Part I. Architectural Considerations for Hadoop Applications
  • 31. Chapter 1. Data Modeling in Hadoop At its core, Hadoop is a distributed data store that provides a platform for implementing powerful parallel processing frameworks. The reliability of this data store when it comes to storing massive volumes of data, coupled with its flexibility in running multiple processing frameworks makes it an ideal choice for your data hub. This characteristic of Hadoop means that you can store any type of data as is, without placing any constraints on how that data is processed. A common term one hears in the context of Hadoop is Schema-on-Read. This simply refers to the fact that raw, unprocessed data can be loaded into Hadoop, with the structure imposed at processing time based on the requirements of the processing application. This is different from Schema-on-Write, which is generally used with traditional data management systems. Such systems require the schema of the data store to be defined before the data can be loaded. This leads to lengthy cycles of analysis, data modeling, data transformation, loading, testing, and so on before data can be accessed. Furthermore, if a wrong decision is made or requirements change, this cycle must start again. When the application or structure of data is not as well understood, the agility provided by the Schema-on-Read pattern can provide invaluable insights on data not previously accessible. Relational databases and data warehouses are often a good fit for well-understood and frequently accessed queries and reports on high-value data. Increasingly, though, Hadoop is taking on many of these workloads, particularly for queries that need to operate on volumes of data that are not economically or technically practical to process with traditional systems. Although being able to store all of your raw data is a powerful feature, there are still many factors that you should take into consideration before dumping your data into Hadoop. These considerations include: Data storage formats There are a number of file formats and compression formats supported on Hadoop. Each has particular strengths that make it better suited to specific applications. Additionally, although Hadoop provides the Hadoop Distributed File System (HDFS) for storing data, there are several commonly used systems implemented on top of HDFS, such as HBase for additional data access functionality and Hive for additional data management functionality. Such systems need to be taken into consideration as well. Multitenancy
  • 32. It’s common for clusters to host multiple users, groups, and application types. Supporting multitenant clusters involves a number of important considerations when you are planning how data will be stored and managed. Schema design Despite the schema-less nature of Hadoop, there are still important considerations to take into account around the structure of data stored in Hadoop. This includes directory structures for data loaded into HDFS as well as the output of data processing and analysis. This also includes the schemas of objects stored in systems such as HBase and Hive. Metadata management As with any data management system, metadata related to the stored data is often as important as the data itself. Understanding and making decisions related to metadata management are critical. We’ll discuss these items in this chapter. Note that these considerations are fundamental to architecting applications on Hadoop, which is why we’re covering them early in the book. Another important factor when you’re making storage decisions with Hadoop, but one that’s beyond the scope of this book, is security and its associated considerations. This includes decisions around authentication, fine-grained access control, and encryption — both for data on the wire and data at rest. For a comprehensive discussion of security with Hadoop, see Hadoop Security by Ben Spivey and Joey Echeverria (O’Reilly).
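To make the Schema-on-Read idea above concrete, here is a minimal sketch (not taken from the book's example repository) that reads raw tab-separated text loaded into HDFS as is and only imposes a structure at processing time, using Spark's Scala API. The HDFS path and the PageView field layout are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

// Structure that exists only in this processing job, not in the stored data.
case class PageView(userId: String, url: String, timestamp: Long)

object SchemaOnReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema-on-read"))

    // The raw lines were dropped into HDFS without any upfront modeling;
    // the schema is applied here, at read time (Schema-on-Read).
    val views = sc.textFile("hdfs:///data/raw/pageviews")   // hypothetical path
      .map(_.split("\t"))
      .collect { case Array(user, url, ts) => PageView(user, url, ts.toLong) }

    // A different question later just means a different read-time structure,
    // with no reload or schema migration of the stored data.
    views.map(v => (v.url, 1L)).reduceByKey(_ + _).take(10).foreach(println)

    sc.stop()
  }
}

Contrast this with Schema-on-Write, where the PageView structure would have to be defined, and the incoming data transformed to match it, before a single row could be loaded.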
  • 33. Data Storage Options One of the most fundamental decisions to make when you are architecting a solution on Hadoop is determining how data will be stored in Hadoop. There is no such thing as a standard data storage format in Hadoop. Just as with a standard filesystem, Hadoop allows for storage of data in any format, whether it’s text, binary, images, or something else. Hadoop also provides built-in support for a number of formats optimized for Hadoop storage and processing. This means users have complete control and a number of options for how data is stored in Hadoop. This applies to not just the raw data being ingested, but also intermediate data generated during data processing and derived data that’s the result of data processing. This, of course, also means that there are a number of decisions involved in determining how to optimally store your data. Major considerations for Hadoop data storage include: File format There are multiple formats that are suitable for data stored in Hadoop. These include plain text or Hadoop-specific formats such as SequenceFile. There are also more complex but more functionally rich options, such as Avro and Parquet. These different formats have different strengths that make them more or less suitable depending on the application and source-data types. It’s possible to create your own custom file format in Hadoop, as well. Compression This will usually be a more straightforward task than selecting file formats, but it’s still an important factor to consider. Compression codecs commonly used with Hadoop have different characteristics; for example, some codecs compress and uncompress faster but don’t compress as aggressively, while other codecs create smaller files but take longer to compress and uncompress, and not surprisingly require more CPU. The ability to split compressed files is also a very important consideration when you’re working with data stored in Hadoop — we’ll discuss splittability considerations further later in the chapter. Data storage system While all data in Hadoop rests in HDFS, there are decisions around what the underlying storage manager should be — for example, whether you should use HBase or HDFS directly to store the data. Additionally, tools such as Hive and Impala allow you to define additional structure around your data in Hadoop. Before beginning a discussion on data storage options for Hadoop, we should note a couple of things: We’ll cover different storage options in this chapter, but more in-depth discussions on best practices for data storage are deferred to later chapters. For example, when we talk about ingesting data into Hadoop we’ll talk more about considerations for storing that data.
  • 34. Although we focus on HDFS as the Hadoop filesystem in this chapter and throughout the book, we’d be remiss in not mentioning work to enable alternate filesystems with Hadoop. This includes open source filesystems such as GlusterFS and the Quantcast File System, and commercial alternatives such as Isilon OneFS and NetApp. Cloud- based storage systems such as Amazon’s Simple Storage System (S3) are also becoming common. The filesystem might become yet another architectural consideration in a Hadoop deployment. This should not, however, have a large impact on the underlying considerations that we’re discussing here.
  • 35. Standard File Formats We’ll start with a discussion on storing standard file formats in Hadoop — for example, text files (such as comma-separated value [CSV] or XML) or binary file types (such as images). In general, it’s preferable to use one of the Hadoop-specific container formats discussed next for storing data in Hadoop, but in many cases you’ll want to store source data in its raw form. As noted before, one of the most powerful features of Hadoop is the ability to store all of your data regardless of format. Having online access to data in its raw, source form — “full fidelity” data — means it will always be possible to perform new processing and analytics with the data as requirements change. The following discussion provides some considerations for storing standard file formats in Hadoop. Text data A very common use of Hadoop is the storage and analysis of logs such as web logs and server logs. Such text data, of course, also comes in many other forms: CSV files, or unstructured data such as emails. A primary consideration when you are storing text data in Hadoop is the organization of the files in the filesystem, which we’ll discuss more in the section “HDFS Schema Design”. Additionally, you’ll want to select a compression format for the files, since text files can very quickly consume considerable space on your Hadoop cluster. Also, keep in mind that there is an overhead of type conversion associated with storing data in text format. For example, storing 1234 in a text file and using it as an integer requires a string-to-integer conversion during reading, and vice versa during writing. It also takes up more space to store 1234 as text than as an integer. This overhead adds up when you do many such conversions and store large amounts of data. Selection of compression format will be influenced by how the data will be used. For archival purposes you may choose the most compact compression available, but if the data will be used in processing jobs such as MapReduce, you’ll likely want to select a splittable format. Splittable formats enable Hadoop to split files into chunks for processing, which is critical to efficient parallel processing. We’ll discuss compression types and considerations, including the concept of splittability, later in this chapter. Note also that in many, if not most cases, the use of a container format such as SequenceFiles or Avro will provide advantages that make it a preferred format for most file types, including text; among other things, these container formats provide functionality to support splittable compression. We’ll also be covering these container formats later in this chapter. Structured text data A more specialized form of text files is structured formats such as XML and JSON. These types of formats can present special challenges with Hadoop since splitting XML and JSON files for processing is tricky, and Hadoop does not provide a built-in InputFormat for either. JSON presents even greater challenges than XML, since there are no tokens to mark the beginning or end of a record. In the case of these formats, you have a couple of
  • 36. options: Use a container format such as Avro. Transforming the data into Avro can provide a compact and efficient way to store and process the data. Use a library designed for processing XML or JSON files. Examples of this for XML include XMLLoader in the PiggyBank library for Pig. For JSON, the Elephant Bird project provides the LzoJsonInputFormat. For more details on processing these formats, see the book Hadoop in Practice by Alex Holmes (Manning), which provides several examples for processing XML and JSON files with MapReduce. Binary data Although text is typically the most common source data format stored in Hadoop, you can also use Hadoop to process binary files such as images. For most cases of storing and processing binary files in Hadoop, using a container format such as SequenceFile is preferred. If the splittable unit of binary data is larger than 64 MB, you may consider putting the data in its own file, without using a container format.
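As a small illustration of the type-conversion overhead mentioned in the text-data discussion above, the hypothetical snippet below parses a single CSV record. Every read of a plain-text file pays this string-to-number cost, and the textual representation is usually larger than the binary one.

object TextOverheadSketch {
  def main(args: Array[String]): Unit = {
    // One CSV record as it might sit in a plain-text file in HDFS
    // (hypothetical layout: order_id, order_date, amount).
    val line = "1234567890,2015-07-01,19.99"

    val fields  = line.split(',')
    val orderId = fields(0).toLong    // string-to-number conversion paid on every read
    val amount  = fields(2).toDouble

    // "1234567890" takes 10 bytes as characters but 8 as a binary long
    // (4 as an int for smaller values); the reverse conversion is paid on write.
    println(s"order $orderId, amount $amount")
  }
}

This is one reason the container formats discussed next, which store values in binary form, are generally preferred over raw text for anything beyond archival of source data.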
  • 37. Hadoop File Types There are several Hadoop-specific file formats that were specifically created to work well with MapReduce. These Hadoop-specific file formats include file-based data structures such as sequence files, serialization formats like Avro, and columnar formats such as RCFile and Parquet. These file formats have differing strengths and weaknesses, but all share the following characteristics that are important for Hadoop applications: Splittable compression These formats support common compression formats and are also splittable. We’ll discuss splittability more in the section “Compression”, but note that the ability to split files can be a key consideration for storing data in Hadoop because it allows large files to be split for input to MapReduce and other types of jobs. The ability to split a file for processing by multiple tasks is of course a fundamental part of parallel processing, and is also key to leveraging Hadoop’s data locality feature. Agnostic compression The file can be compressed with any compression codec, without readers having to know the codec. This is possible because the codec is stored in the header metadata of the file format. We’ll discuss the file-based data structures in this section, and subsequent sections will cover serialization formats and columnar formats. File-based data structures The SequenceFile format is one of the most commonly used file-based formats in Hadoop, but other file-based formats are available, such as MapFiles, SetFiles, ArrayFiles, and BloomMapFiles. Because these formats were specifically designed to work with MapReduce, they offer a high level of integration for all forms of MapReduce jobs, including those run via Pig and Hive. We’ll cover the SequenceFile format here, because that’s the format most commonly employed in implementing Hadoop jobs. For a more complete discussion of the other formats, refer to Hadoop: The Definitive Guide. SequenceFiles store data as binary key-value pairs. There are three formats available for records stored within SequenceFiles: Uncompressed For the most part, uncompressed SequenceFiles don’t provide any advantages over their compressed alternatives, since they’re less efficient for input/output (I/O) and take up more space on disk than the same data in compressed form. Record-compressed This format compresses each record as it’s added to the file. Block-compressed
  • 38. This format waits until data reaches block size to compress, rather than as each record is added. Block compression provides better compression ratios compared to record-compressed SequenceFiles, and is generally the preferred compression option for SequenceFiles. Also, the reference to block here is unrelated to the HDFS or filesystem block. A block in block compression refers to a group of records that are compressed together within a single HDFS block. Regardless of format, every SequenceFile uses a common header format containing basic metadata about the file, such as the compression codec used, key and value class names, user-defined metadata, and a randomly generated sync marker. This sync marker is also written into the body of the file to allow for seeking to random points in the file, and is key to facilitating splittability. For example, in the case of block compression, this sync marker will be written before every block in the file. SequenceFiles are well supported within the Hadoop ecosystem, however their support outside of the ecosystem is limited. They are also only supported in Java. A common use case for SequenceFiles is as a container for smaller files. Storing a large number of small files in Hadoop can cause a couple of issues. One is excessive memory use for the NameNode, because metadata for each file stored in HDFS is held in memory. Another potential issue is in processing data in these files — many small files can lead to many processing tasks, causing excessive overhead in processing. Because Hadoop is optimized for large files, packing smaller files into a SequenceFile makes the storage and processing of these files much more efficient. For a more complete discussion of the small files problem with Hadoop and how SequenceFiles provide a solution, refer to Hadoop: The Definitive Guide. Figure 1-1 shows an example of the file layout for a SequenceFile using block compression. An important thing to note in this diagram is the inclusion of the sync marker before each block of data, which allows readers of the file to seek to block boundaries.
  • 39. Figure 1-1. An example of a SequenceFile using block compression
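As a rough illustration of the block-compressed layout in Figure 1-1, the sketch below uses the Hadoop SequenceFile writer API from Scala to pack several small records into a single splittable, Snappy-compressed file. The output path and record contents are hypothetical, and Snappy assumes the native libraries are available on the cluster.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.SnappyCodec

object SmallFilesToSequenceFile {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // BLOCK compression groups many key-value records into each compressed
    // block and writes a sync marker before it, which keeps the file splittable.
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path("hdfs:///data/archive/small-files.seq")), // hypothetical
      SequenceFile.Writer.keyClass(classOf[Text]),
      SequenceFile.Writer.valueClass(classOf[Text]),
      SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))

    try {
      // Packing many small files (keyed by name) into one container avoids the
      // NameNode memory and task-overhead problems described above.
      writer.append(new Text("file-0001.log"), new Text("contents of the first small file"))
      writer.append(new Text("file-0002.log"), new Text("contents of the second small file"))
    } finally {
      writer.close()
    }
  }
}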
  • 40. Serialization Formats Serialization refers to the process of turning data structures into byte streams either for storage or transmission over a network. Conversely, deserialization is the process of converting a byte stream back into data structures. Serialization is core to a distributed processing system such as Hadoop, since it allows data to be converted into a format that can be efficiently stored as well as transferred across a network connection. Serialization is commonly associated with two aspects of data processing in distributed systems: interprocess communication (remote procedure calls, or RPC) and data storage. For purposes of this discussion we’re not concerned with RPC, so we’ll focus on the data storage aspect in this section. The main serialization format utilized by Hadoop is Writables. Writables are compact and fast, but not easy to extend or use from languages other than Java. There are, however, other serialization frameworks seeing increased use within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro. Of these, Avro is the best suited, because it was specifically created to address limitations of Hadoop Writables. We’ll examine Avro in more detail, but let’s first briefly cover Thrift and Protocol Buffers. Thrift Thrift was developed at Facebook as a framework for implementing cross-language interfaces to services. Thrift uses an Interface Definition Language (IDL) to define interfaces, and uses an IDL file to generate stub code to be used in implementing RPC clients and servers that can be used across languages. Using Thrift allows us to implement a single interface that can be used with different languages to access different underlying systems. The Thrift RPC layer is very robust, but for this chapter, we’re only concerned with Thrift as a serialization framework. Although sometimes used for data serialization with Hadoop, Thrift has several drawbacks: it does not support internal compression of records, it’s not splittable, and it lacks native MapReduce support. Note that there are externally available libraries such as the Elephant Bird project to address these drawbacks, but Hadoop does not provide native support for Thrift as a data storage format. Protocol Buffers The Protocol Buffer (protobuf) format was developed at Google to facilitate data exchange between services written in different languages. Like Thrift, protobuf structures are defined via an IDL, which is used to generate stub code for multiple languages. Also like Thrift, Protocol Buffers do not support internal compression of records, are not splittable, and have no native MapReduce support. But also like Thrift, the Elephant Bird project can be used to encode protobuf records, providing support for MapReduce, compression, and splittability. Avro Avro is a language-neutral data serialization system designed to address the major
  • 41. downside of Hadoop Writables: lack of language portability. Like Thrift and Protocol Buffers, Avro data is described through a language-independent schema. Unlike Thrift and Protocol Buffers, code generation is optional with Avro. Since Avro stores the schema in the header of each file, it’s self-describing and Avro files can easily be read later, even from a different language than the one used to write the file. Avro also provides better native support for MapReduce since Avro data files are compressible and splittable. Another important feature of Avro that makes it superior to SequenceFiles for Hadoop applications is support for schema evolution; that is, the schema used to read a file does not need to match the schema used to write the file. This makes it possible to add new fields to a schema as requirements change. Avro schemas are usually written in JSON, but may also be written in Avro IDL, which is a C-like language. As just noted, the schema is stored as part of the file metadata in the file header. In addition to metadata, the file header contains a unique sync marker. Just as with SequenceFiles, this sync marker is used to separate blocks in the file, allowing Avro files to be splittable. Following the header, an Avro file contains a series of blocks containing serialized Avro objects. These blocks can optionally be compressed, and within those blocks, types are stored in their native format, providing an additional boost to compression. At the time of writing, Avro supports Snappy and Deflate compression. While Avro defines a small number of primitive types such as Boolean, int, float, and string, it also supports complex types such as array, map, and enum.
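To ground the Avro discussion, here is a minimal, hypothetical sketch using the Avro Java API from Scala: the schema is declared in JSON, embedded in the file header by the writer, and the data blocks are Snappy-compressed while the file remains splittable. The record layout and file name are illustrative only.

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{CodecFactory, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

object AvroWriteSketch {
  // A hypothetical record layout; Avro schemas are usually written in JSON.
  val pageViewSchema: String =
    """{
      |  "type": "record",
      |  "name": "PageView",
      |  "fields": [
      |    {"name": "user_id",   "type": "string"},
      |    {"name": "url",       "type": "string"},
      |    {"name": "timestamp", "type": "long"},
      |    {"name": "referrer",  "type": ["null", "string"], "default": null}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(pageViewSchema)

    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.setCodec(CodecFactory.snappyCodec())       // compressed, yet still splittable
    writer.create(schema, new File("pageviews.avro")) // the schema travels in the file header

    val record = new GenericData.Record(schema)
    record.put("user_id", "u42")
    record.put("url", "http://example.com/index.html")
    record.put("timestamp", 1435708800000L)
    writer.append(record)                             // referrer stays null; the union allows it

    writer.close()
  }
}

Because readers obtain the writer's schema from the file header, a later consumer can read this file with an evolved schema — for example, one that adds a defaulted field — which is the schema evolution property described above.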
Columnar Formats

Until relatively recently, most database systems stored records in a row-oriented fashion. This is efficient for cases where many columns of the record need to be fetched. For example, if your analysis heavily relied on fetching all fields for records that belonged to a particular time range, row-oriented storage would make sense. This option can also be more efficient when you’re writing data, particularly if all columns of the record are available at write time, because the record can be written with a single disk seek. More recently, a number of databases have introduced columnar storage, which provides several benefits over earlier row-oriented systems:

- Skips I/O and decompression (if applicable) on columns that are not a part of the query.
- Works well for queries that only access a small subset of columns. If many columns are being accessed, then row-oriented is generally preferable.
- Is generally very efficient in terms of compression on columns because entropy within a column is lower than entropy within a block of rows. In other words, data is more similar within the same column than it is in a block of rows. This can make a huge difference, especially when the column has few distinct values.
- Is often well suited for data-warehousing-type applications where users want to aggregate certain columns over a large collection of records.

Not surprisingly, columnar file formats are also being utilized for Hadoop applications. Columnar file formats supported on Hadoop include the RCFile format, which has been popular for some time as a Hive format, as well as newer formats such as Optimized Row Columnar (ORC) and Parquet, which are described next.

RCFile

The RCFile format was developed specifically to provide efficient processing for MapReduce applications, although in practice it’s only seen use as a Hive storage format. The RCFile format was developed to provide fast data loading, fast query processing, and highly efficient storage space utilization. The RCFile format breaks files into row splits, then within each split uses column-oriented storage.

Although the RCFile format provides advantages in terms of query and compression performance compared to SequenceFiles, it also has some deficiencies that prevent optimal performance for query times and compression. Newer columnar formats such as ORC and Parquet address many of these deficiencies, and for most newer applications, they will likely replace the use of RCFile. RCFile is still a fairly common format used with Hive storage.

ORC

The ORC format was created to address some of the shortcomings with the RCFile format, specifically around query performance and storage efficiency. The ORC format provides the following features and benefits, many of which are distinct improvements over RCFile:

- Provides lightweight, always-on compression via type-specific readers and writers. ORC also supports the use of zlib, LZO, or Snappy to provide further compression.
- Allows predicates to be pushed down to the storage layer so that only required data is brought back in queries.
- Supports the Hive type model, including new primitives such as decimal and complex types.
- Is a splittable storage format.

A drawback of ORC as of this writing is that it was designed specifically for Hive, and so is not a general-purpose storage format that can be used with non-Hive MapReduce interfaces such as Pig or Java, or with other query engines such as Impala. Work is under way to address these shortcomings, though.

Parquet

Parquet shares many of the same design goals as ORC, but is intended to be a general-purpose storage format for Hadoop. The two formats emerged at around the same time, but Parquet was designed from the start to work with a range of engines rather than just Hive. As such, the goal is to create a format that’s suitable for different MapReduce interfaces such as Java, Hive, and Pig, and also suitable for other processing engines such as Impala and Spark. Parquet provides the following benefits, many of which it shares with ORC:

- Similar to ORC files, Parquet allows for returning only required data fields, thereby reducing I/O and increasing performance.
- Provides efficient compression; compression can be specified on a per-column level.
- Is designed to support complex nested data structures.
- Stores full metadata at the end of files, so Parquet files are self-documenting.
- Fully supports reading and writing with the Avro and Thrift APIs.
- Uses efficient and extensible encoding schemas, such as bit-packing and run-length encoding (RLE).

Avro and Parquet

Over time, we have learned that there is great value in having a single interface to all the files in your Hadoop cluster. And if you are going to pick one file format, you will want to pick one with a schema because, in the end, most data in Hadoop will be structured or semistructured. So if you need a schema, Avro and Parquet are great options. However, we don’t want to have to worry about maintaining an Avro version of the schema and a Parquet version. Thankfully, this isn’t an issue, because Parquet can be read and written with Avro APIs and Avro schemas. This means we can have our cake and eat it too: we can meet our goal of having one interface to interact with our Avro and Parquet files, and we can have both block-based (row-oriented) and columnar options for storing our data. (A brief sketch of writing Parquet through the Avro API follows the sidebar below.)

COMPARING FAILURE BEHAVIOR FOR DIFFERENT FILE FORMATS

An important aspect of the various file formats is failure handling; some formats handle corruption better than others:

- Columnar formats, while often efficient, do not work well in the event of failure, since this can lead to incomplete rows.
- SequenceFiles will be readable up to the first failed record, but will not be recoverable after that record.
- Avro provides the best failure handling; in the event of a bad record, the read will continue at the next sync point, so failures only affect a portion of a file.
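As a concrete (and hypothetical) illustration of that single interface, the sketch below writes a Parquet file through the Avro API using the parquet-avro module. The schema, record values, output path, and codec choice are assumptions made for the example, and depending on the Parquet release the classes may live under the older parquet.avro and parquet.hadoop package names rather than org.apache.parquet.

    // A minimal sketch of writing Parquet with Avro APIs and an Avro schema.
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetViaAvroExample {
      public static void main(String[] args) throws IOException {
        // The same kind of Avro schema used earlier; only the on-disk layout
        // changes (columnar instead of row-oriented blocks), not the interface.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"ticker\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 1L);
        record.put("ticker", "AAA");

        ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(new Path("orders.parquet"))
                .withSchema(schema)
                // Parquet compresses per column chunk; Snappy is a common choice.
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build();
        writer.write(record);
        writer.close();
      }
    }

A matching AvroParquetReader can return the same records as Avro GenericRecords, so application code need not change when the underlying storage switches between Avro data files and Parquet.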
Compression

Compression is another important consideration for storing data in Hadoop, not just in terms of reducing storage requirements, but also to improve data processing performance. Because a major overhead in processing large amounts of data is disk and network I/O, reducing the amount of data that needs to be read and written to disk can significantly decrease overall processing time. This includes compression of source data, but also the intermediate data generated as part of data processing (e.g., MapReduce jobs). Although compression adds CPU load, for most cases this is more than offset by the savings in I/O.

Although compression can greatly optimize processing performance, not all compression formats supported on Hadoop are splittable. Because the MapReduce framework splits data for input to multiple tasks, having a nonsplittable compression format is an impediment to efficient processing. If files cannot be split, the entire file needs to be passed to a single MapReduce task, eliminating the advantages of parallelism and data locality that Hadoop provides. For this reason, splittability is a major consideration in choosing a compression format as well as a file format. We’ll discuss the various compression formats available for Hadoop, and some considerations in choosing between them.

Snappy

Snappy is a compression codec developed at Google that provides fast compression and decompression with a reasonable compression ratio. Although Snappy doesn’t offer the best compression sizes, it does provide a good trade-off between speed and size, and processing performance with Snappy can be significantly better than with other compression formats. It’s important to note that Snappy is intended to be used with a container format like SequenceFiles or Avro, since it’s not inherently splittable.

LZO

LZO is similar to Snappy in that it’s optimized for speed as opposed to size. Unlike Snappy, LZO compressed files are splittable, but this requires an additional indexing step. This makes LZO a good choice for things like plain-text files that are not being stored as part of a container format. It should also be noted that LZO’s license prevents it from being distributed with Hadoop and requires a separate install, unlike Snappy, which can be distributed with Hadoop.

Gzip

Gzip provides very good compression performance (on average, about 2.5 times better compression than Snappy), but its write speed is not as good as Snappy’s (on average, about half of Snappy’s). Gzip usually performs almost as well as Snappy in terms of read performance. Gzip is also not splittable, so it should be used with a container format. Note that one reason Gzip is sometimes slower than Snappy for processing is that Gzip compressed files take up fewer blocks, so fewer tasks are required for processing the same data. For this reason, using smaller blocks with Gzip can lead to better performance.

bzip2

bzip2 provides excellent compression performance, but can be significantly slower than other compression codecs such as Snappy in terms of processing performance. Unlike Snappy and Gzip, bzip2 is inherently splittable. In the examples we have seen, bzip2 will normally compress around 9% better than Gzip in terms of storage space. However, this extra compression comes with a significant read/write performance cost. This performance difference will vary with different machines, but in general bzip2 is about 10 times slower than Gzip. For this reason, it’s not an ideal codec for Hadoop storage, unless your primary need is reducing the storage footprint. One example of such a use case would be using Hadoop mainly for active archival purposes.

Compression recommendations

In general, any compression format can be made splittable when used with container file formats (Avro, SequenceFiles, etc.) that compress blocks of records or each record individually. If you are doing compression on the entire file without using a container file format, then you have to use a compression format that inherently supports splitting (e.g., bzip2, which inserts synchronization markers between blocks). Here are some recommendations on compression in Hadoop:

- Enable compression of MapReduce intermediate output. This will improve performance by decreasing the amount of intermediate data that needs to be read and written to and from disk. (A minimal driver configuration illustrating this is sketched after Figure 1-2.)
- Pay attention to how data is ordered. Often, ordering data so that like data is close together will provide better compression levels. Remember, data in Hadoop file formats is compressed in chunks, and it is the entropy of those chunks that will determine the final compression. For example, if you have stock ticks with the columns timestamp, stock ticker, and stock price, then ordering the data by a repeated field, such as stock ticker, will provide better compression than ordering by a unique field, such as time or stock price.
- Consider using a compact file format with support for splittable compression, such as Avro. Figure 1-2 illustrates how Avro or SequenceFiles support splittability with otherwise nonsplittable compression formats. A single HDFS block can contain multiple Avro or SequenceFile blocks. Each of the Avro or SequenceFile blocks can be compressed and decompressed individually and independently of any other Avro/SequenceFile blocks. This, in turn, means that each of the HDFS blocks can be compressed and decompressed individually, thereby making the data splittable.
Figure 1-2. An example of compression with Avro
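As a sketch of the first recommendation above, the hypothetical MapReduce driver below compresses intermediate map output with Snappy and writes its final output as block-compressed SequenceFiles so that the compressed result stays splittable. The class name, input and output paths, and key/value types are placeholders, and the property names shown are the standard Hadoop 2 (MRv2) names.

    // A minimal sketch of enabling compression in a MapReduce driver.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class CompressionDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate (shuffle) data with Snappy.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression example");
        job.setJarByClass(CompressionDriver.class);
        // Mapper and reducer are omitted; the identity classes are used by default.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Write block-compressed SequenceFiles so the output stays splittable.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
            SequenceFile.CompressionType.BLOCK);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same two intermediate-compression properties can also be set cluster-wide in mapred-site.xml rather than per job.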
HDFS Schema Design

As pointed out in the previous section, HDFS and HBase are two very commonly used storage managers. Depending on your use case, you can store your data in HDFS or HBase (which internally stores it on HDFS). In this section, we will describe the considerations for good schema design for data that you decide to store in HDFS directly.

As mentioned earlier, Hadoop’s Schema-on-Read model does not impose any requirements when loading data into Hadoop. Data can be simply ingested into HDFS by one of many methods (which we will discuss further in Chapter 2) without our having to associate a schema or preprocess the data. While many people use Hadoop for storing and processing unstructured data (such as images, videos, emails, or blog posts) or semistructured data (such as XML documents and logfiles), some order is still desirable. This is especially true since Hadoop often serves as a data hub for the entire organization, and the data stored in HDFS is intended to be shared among many departments and teams. Creating a carefully structured and organized repository of your data will provide many benefits. To list a few:

- A standard directory structure makes it easier to share data between teams working with the same data sets. It also allows for enforcing access and quota controls to prevent accidental deletion or corruption.
- Often, you will want to “stage” data in a separate location before all of it is ready to be processed. Conventions regarding staging data will help ensure that partially loaded data does not get accidentally processed as if it were complete.
- Standardized organization of data allows for reuse of some code that may process the data.
- Some tools in the Hadoop ecosystem make assumptions regarding the placement of data. It is often simpler to match those assumptions when you are initially loading data into Hadoop.

The details of the data model will be highly dependent on the specific use case. For example, data warehouse implementations and other event stores are likely to use a schema similar to the traditional star schema, including structured fact and dimension tables. Unstructured and semistructured data, on the other hand, are likely to focus more on directory placement and metadata management.

The important points to keep in mind when designing the schema, regardless of the project specifics, are:

- Develop standard practices and enforce them, especially when multiple teams are sharing the data.
- Make sure your design will work well with the tools you are planning to use. For example, the version of Hive you are planning to use may only support table partitions on directories that are named a certain way. This will impact the schema design in general, and how you name your table subdirectories in particular.
- Keep usage patterns in mind when designing a schema. Different data processing and querying patterns work better with different schema designs. Understanding the main use cases and data retrieval requirements will result in a schema that will be easier to maintain and support in the long term, as well as improve data processing performance.
Location of HDFS Files

To talk in more concrete terms, the first decision to make when you’re designing an HDFS schema is the location of the files. Standard locations make it easier to find and share data between teams. The following is an example HDFS directory structure that we recommend. This directory structure simplifies the assignment of permissions to various groups and users:

/user/<username>
Data, JARs, and configuration files that belong only to a specific user. This is usually scratch-type data that the user is currently experimenting with, but that is not part of a business process. The directories under /user will typically only be readable and writable by the users who own them.

/etl
Data in various stages of being processed by an ETL (extract, transform, and load) workflow. The /etl directory will be readable and writable by ETL processes (they typically run under their own user) and members of the ETL team. The /etl directory tree will have subdirectories for the various groups that own the ETL processes, such as business analytics, fraud detection, and marketing. The ETL workflows are typically part of a larger application, such as clickstream analysis or recommendation engines, and each application should have its own subdirectory under the /etl directory. Within each application-specific directory, you would have a directory for each ETL process or workflow for that application. Within the workflow directory, there are subdirectories for each of the data sets. For example, if your Business Intelligence (BI) team has a clickstream analysis application and one of its processes is to aggregate user preferences, the recommended name for the directory that contains the data would be /etl/BI/clickstream/aggregate_preferences. In some cases, you may want to go one level further and have directories for each stage of the process: input for the landing zone where the data arrives, processing for the intermediate stages (there may be more than one processing directory), output for the final result, and bad where records or files that are rejected by the ETL process land for manual troubleshooting. In such cases, the final structure will look similar to /etl/<group>/<application>/<process>/{input,processing,output,bad}.

/tmp
Temporary data generated by tools or shared between users. This directory is typically cleaned by an automated process and does not store long-term data. This directory is typically readable and writable by everyone.

/data
Data sets that have been processed and are shared across the organization. Because these are often critical data sources for analysis that drive business decisions, there are often controls around who can read and write this data. Very often user access is read-only, and because data in /data is typically business-critical, only automated (and audited) ETL processes are typically allowed to write to it, so changes are controlled and audited. Different business groups will have read access to different directories under /data, depending on their reporting and processing needs. Since /data serves as the location for shared processed data sets, it will contain subdirectories for each data set. For example, if you were storing all orders of a pharmacy in a table called medication_orders, we recommend that you store this data set in a directory named /data/medication_orders.

/app
Includes everything required for Hadoop applications to run, except data. This includes JAR files, Oozie workflow definitions, Hive HQL files, and more. The application code directory /app is used for application artifacts such as JARs for Oozie actions or Hive user-defined functions (UDFs). It is not always necessary to store such application artifacts in HDFS, but some Hadoop applications such as Oozie and Hive require storing shared code and configuration on HDFS so it can be used by code executing on any node of the cluster. This directory should have a subdirectory for each group and application, similar to the structure used in /etl. For a given application (say, Oozie), you would need a directory for each version of the artifacts you decide to store in HDFS, possibly tagging, via a symlink in HDFS, the latest artifact as latest and the currently used one as current. The directories containing the binary artifacts would be present under these versioned directories. This will look similar to /app/<group>/<application>/<version>/<artifact directory>/<artifact>. To continue our previous example, the JAR for the latest build of our aggregate preferences process would be in a directory structure like /app/BI/clickstream/latest/aggregate_preferences/uber-aggregate-preferences.jar.

/metadata
Stores metadata. While most table metadata is stored in the Hive metastore, as described later in “Managing Metadata”, some extra metadata (for example, Avro schema files) may need to be stored in HDFS. This directory would be the best location for storing such metadata. This directory is typically readable by ETL jobs but writable by the user used for ingesting data into Hadoop (e.g., the Sqoop user). For example, the Avro schema file for a data set called movie may exist at a location like this: /metadata/movielens/movie/movie.avsc. We will discuss this particular example in more detail in Chapter 10.
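To show how such a layout might be bootstrapped, here is a minimal, hypothetical sketch that creates a few of the recommended directories through the HDFS Java API. The paths follow the layout described above, but the permission modes are illustrative assumptions (and remain subject to the configured umask); in practice these directories are just as often created with hdfs dfs commands or a configuration management tool.

    // A minimal sketch of creating the standard top-level directories.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class HdfsLayout {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Shared, audited data sets: world-readable, written only by ETL users.
        fs.mkdirs(new Path("/data"), new FsPermission((short) 0755));
        // ETL working area: one subtree per group/application/process/stage.
        fs.mkdirs(new Path("/etl/BI/clickstream/aggregate_preferences/input"),
            new FsPermission((short) 0770));
        // Scratch space: readable and writable by everyone.
        fs.mkdirs(new Path("/tmp"), new FsPermission((short) 0777));
        // Application artifacts (Oozie workflows, Hive UDF JARs, and so on).
        fs.mkdirs(new Path("/app/BI/clickstream/latest/aggregate_preferences"),
            new FsPermission((short) 0755));

        fs.close();
      }
    }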
Advanced HDFS Schema Design

Once the broad directory structure has been decided, the next important decision is how data will be organized into files. While we have already talked about how the format of the ingested data may not be the most optimal format for storing it, it’s important to note that the default organization of ingested data may not be optimal either. There are a few strategies to best organize your data. We will talk about partitioning, bucketing, and denormalizing here.

Partitioning

Partitioning a data set is a very common technique used to reduce the amount of I/O required to process the data set. When you’re dealing with large amounts of data, the savings brought by reducing I/O can be quite significant. Unlike traditional data warehouses, however, HDFS doesn’t store indexes on the data. This lack of indexes plays a large role in speeding up data ingest, but it also means that every query will have to read the entire data set, even when you’re processing only a small subset of the data (a pattern called full table scan). When the data sets grow very big, and queries only require access to subsets of data, a very good solution is to break up the data set into smaller subsets, or partitions. Each of these partitions would be present in a subdirectory of the directory containing the entire data set. This will allow queries to read only the specific partitions (i.e., subdirectories) they require, reducing the amount of I/O and improving query times significantly.

For example, say you have a data set that stores all the orders for various pharmacies in a data set called medication_orders, and you’d like to check order history for just one physician over the past three months. Without partitioning, you’d need to read the entire data set and filter out all the records that don’t pertain to the query. However, if we were to partition the entire orders data set so each partition included only a single day’s worth of data, a query looking for information from the past three months will only need to read 90 or so partitions and not the entire data set.

When placing the data in the filesystem, you should use the following directory format for partitions: <data set name>/<partition_column_name=partition_column_value>/{files}. In our example, this translates to: medication_orders/date=20131101/{order1.csv, order2.csv}

This directory structure is understood by various tools, like HCatalog, Hive, Impala, and Pig, which can leverage partitioning to reduce the amount of I/O required during processing. (A sketch of declaring such a partitioned, bucketed table in Hive follows the bucketing discussion below.)

Bucketing

Bucketing is another technique for decomposing large data sets into more manageable subsets. It is similar to the hash partitions used in many relational databases. In the preceding example, we could partition the orders data set by date because there are a large number of orders done daily and the partitions will contain large enough files, which is what HDFS is optimized for. However, if we tried to partition the data by physician to optimize for queries looking for specific physicians, the resulting number of partitions may be too large and the resulting files may be too small in size. This leads to what’s called the small files problem. As detailed in “File-based data structures”, storing a large number of small files in Hadoop can lead to excessive memory use for the NameNode, since metadata for each file stored in HDFS is held in memory. Also, many small files can lead to many processing tasks, causing excessive overhead in processing.

The solution is to bucket by physician, which will use a hashing function to map physicians into a specified number of buckets. This way, you can control the size of the data subsets (i.e., buckets) and optimize for query speed. Files should not be so small that you’ll need to read and manage a huge number of them, but also not so large that each query will be slowed down by having to scan through huge amounts of data. A good average bucket size is a few multiples of the HDFS block size. Having an even distribution of data when hashed on the bucketing column is important because it leads to consistently sized buckets. Also, having the number of buckets as a power of two is quite common.

An additional benefit of bucketing becomes apparent when you’re joining two data sets. The word join here is used to represent the general idea of combining two data sets to retrieve a result. Joins can be done via SQL-on-Hadoop systems, but also in MapReduce, or Spark, or other programming interfaces to Hadoop. When both the data sets being joined are bucketed on the join key, and the number of buckets of one data set is a multiple of the other, it is enough to join the corresponding buckets individually without having to join the entire data sets. Because a reduce-side join over the entire data sets is computationally expensive, joining just the corresponding buckets significantly reduces the cost of the join, and the buckets from both tables can of course be joined in parallel. Moreover, because the buckets are typically small enough to easily fit into memory, you can do the entire join in the map stage of a MapReduce job by loading the smaller of the buckets in memory. This is called a map-side join, and it improves the join performance as compared to a reduce-side join even further. If you are using Hive for data analysis, it should automatically recognize that tables are bucketed and apply this optimization.

If the data in the buckets is sorted, it is also possible to use a merge join and not store the entire bucket in memory when joining. This is somewhat faster than a simple bucket join and requires much less memory. Hive supports this optimization as well. Note that it is possible to bucket any table, even when there are no logical partition keys. It is recommended to use both sorting and bucketing on all large tables that are frequently joined together, using the join key for bucketing.
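The sketch below shows how the partitioning and bucketing choices just described might be declared for the medication_orders example, submitted from Java through the HiveServer2 JDBC driver. The column names, bucket count, storage format, connection URL, and credentials are illustrative assumptions, and the SET statements reflect Hive releases contemporary with this discussion (hive.enforce.bucketing, for instance, is always on in newer Hive versions).

    // A minimal sketch of declaring a partitioned, bucketed Hive table via JDBC.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionedBucketedTable {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "etl_user", "");
             Statement stmt = conn.createStatement()) {

          // Partition by day so date-range queries read only the relevant
          // subdirectories; bucket and sort by physician_id so joins on that
          // key can use bucketed map-side or sort-merge joins.
          stmt.execute(
              "CREATE TABLE medication_orders ("
            + "  order_id BIGINT, physician_id INT, medication_id INT, quantity INT) "
            + "PARTITIONED BY (order_date STRING) "
            + "CLUSTERED BY (physician_id) SORTED BY (physician_id) INTO 64 BUCKETS "
            + "STORED AS ORC");

          // Honor bucketing on insert and enable bucketed join optimizations.
          stmt.execute("SET hive.enforce.bucketing=true");
          stmt.execute("SET hive.optimize.bucketmapjoin=true");
          stmt.execute("SET hive.optimize.bucketmapjoin.sortedmerge=true");
        }
      }
    }

On disk, each partition then appears as a subdirectory such as medication_orders/order_date=20131101, holding one file per bucket.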
As you can tell from the preceding discussion, the schema design is highly dependent on the way the data will be queried. You will need to know which columns will be used for joining and filtering before deciding on partitioning and bucketing of the data. In cases where there are multiple common query patterns and it is challenging to decide on one partitioning key, you have the option of storing the same data set multiple times, each with a different physical organization. This is considered an anti-pattern in relational databases, but with Hadoop, this solution can make sense. For one thing, in Hadoop data is typically write-once, and few updates are expected. Therefore, the usual overhead of keeping duplicated data sets in sync is greatly reduced. In addition, the cost of storage in Hadoop clusters is significantly lower, so there is less concern about wasted disk space. These attributes allow us to trade space for greater query speed, which is often desirable.

Denormalizing

Although we talked about joins in the previous subsections, another method of trading disk space for query performance is denormalizing data sets so there is less of a need to join data sets. In relational databases, data is often stored in third normal form. Such a schema is designed to minimize redundancy and provide data integrity by splitting data into smaller tables, each holding a very specific entity. This means that most queries will require joining a large number of tables together to produce final result sets.

In Hadoop, however, joins are often the slowest operations and consume the most resources from the cluster. Reduce-side joins, in particular, require sending entire tables over the network. As we’ve already seen, it is important to optimize the schema to avoid these expensive operations as much as possible. While bucketing and sorting do help there, another solution is to create data sets that are prejoined (in other words, preaggregated). The idea is to minimize the amount of work queries have to do by doing as much as possible in advance, especially for queries or subqueries that are expected to execute frequently. Instead of running the join operations every time a user tries to look at the data, we can join the data once and store it in this form.

Looking at the difference between a typical Online Transaction Processing (OLTP) schema and an HDFS schema of a particular use case, you will see that the Hadoop schema consolidates many of the small dimension tables into a few larger dimensions by joining them during the ETL process. In the case of our pharmacy example, we consolidate frequency, class, admin route, and units into the medications data set, to avoid repeated joining.

Other types of data preprocessing, like aggregation or data type conversion, can be done to speed up processes as well. Since data duplication is a lesser concern, almost any type of processing that occurs frequently in a large number of queries is worth doing once and reusing. In relational databases, this pattern is often known as Materialized Views. In Hadoop, you instead have to create a new data set that contains the same data in its aggregated form. (A brief sketch of materializing such a prejoined data set follows.)
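Finally, here is a minimal, hypothetical sketch of that last step: materializing a prejoined medications data set once during ETL, again through the HiveServer2 JDBC driver. The table and column names extend the pharmacy example but are assumptions made for illustration, not a schema from this book.

    // A minimal sketch of prejoining dimension tables into one data set.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PrejoinMedications {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "etl_user", "");
             Statement stmt = conn.createStatement()) {
          // Join the small dimension tables once and store the result, so
          // frequent queries read the denormalized copy instead of rejoining.
          stmt.execute(
              "CREATE TABLE medications_denorm STORED AS PARQUET AS "
            + "SELECT m.medication_id, m.name, f.frequency_desc, "
            + "       c.class_name, r.route_desc, u.unit_desc "
            + "FROM medications m "
            + "JOIN frequency f ON m.frequency_id = f.frequency_id "
            + "JOIN med_class c ON m.class_id = c.class_id "
            + "JOIN admin_route r ON m.route_id = r.route_id "
            + "JOIN units u ON m.unit_id = u.unit_id");
        }
      }
    }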