Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetInData

Elephants in The Cloud
or How to Become Cloud Ready
Krzysztof Adamski, GetInData

So You Say You Don’t Use Cloud?
HR System, Online Documents, Mobile Phone, Email Server

Trust as a Key Factor
Image source: https://www.forbes.com/sites/louiscolumbus/2017/04/23/2017-state-of-cloud-adoption-and-security

More Secure or Not
In the end, do you really think you can provide better infrastructure security than cloud providers?

Migration Questions?
1. How fast can you start or expand your analytics initiative?
2. How often is your cluster fully busy while your employees want more computing power right now?
3. How much time do you spend on maintaining your infrastructure?
4. How long does it take you to gracefully apply all the security patches in your Hadoop cluster?
5. Do you need hardware that you don’t have in your data center, e.g. GPUs or very large amounts of RAM?

Migration Goals
1. Transition from infrastructure engineering towards data engineering
2. Use the best possible technology stack in the world
3. Free your time
4. Attract the best engineers
5. Ultimate world domination ;)

Before You Start
● Be smart about which service you choose: technology choices
● Avoid lock-in: yet another migration
● Try to estimate the costs: hardware, engineering, legal
● See what others are doing: Netflix, Spotify, Etsy

What’s Different in the Cloud?

Decoupled storage and processing

Different Technologies
Category | Hadoop Ecosystem | Google Cloud Platform
File System | HDFS | Google Cloud Storage
Key Value Store | HBase, Cassandra | BigTable
SQL | Hive, SparkSQL, Presto | BigQuery
Messaging Queue | Kafka | PubSub
Geo-Replicated RDBMS | CockroachDB | Spanner

Strong Global Consistency
Google Cloud Storage provides strong global consistency for the following
operations, including both data and metadata:
● Read-after-write
● Read-after-metadata-update
● Read-after-delete
● Bucket listing, Object listing
● Granting access to resources

Eventual Consistency
● Revoking access from resources
It typically takes about a minute for an access revocation to take effect; in some cases it may take longer.
Beware of caching, though.

Pricing
● Pay-per-second billing
Keep in mind that if you often run analyses shorter than 10 minutes on VMs, serverless options may be better suited: VMs are relatively slow to boot, while serverless functions are billed in 100 ms increments.
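The billing granularity point can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names, the 60-second minimum charge for a VM and all prices passed in are assumptions, not figures from the talk or from any provider's price list.

```python
def billed_vm_seconds(runtime_s, boot_s, min_billed_s=60):
    """Seconds charged for one VM run under per-second billing.

    boot_s is included because the VM is billed while it boots.
    min_billed_s is an assumed minimum charge; check your provider's terms.
    """
    return max(runtime_s + boot_s, min_billed_s)


def billed_function_ms(runtime_ms, increment_ms=100):
    """Milliseconds charged for one function call, rounded up to the increment."""
    increments = -(-runtime_ms // increment_ms)  # integer ceiling division
    return increments * increment_ms


def vm_run_cost(runtime_s, boot_s, price_per_hour):
    """Cost of one VM run; price_per_hour is a hypothetical input."""
    return billed_vm_seconds(runtime_s, boot_s) * price_per_hour / 3600


def function_run_cost(runtime_ms, price_per_100ms):
    """Cost of one function call; price_per_100ms is a hypothetical input."""
    return billed_function_ms(runtime_ms) // 100 * price_per_100ms
```

For short jobs the boot time and any minimum charge dominate the VM bill, which is the effect the slide warns about.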

I want to start. What’s next?

Data repository in good shape

Find the best candidates for migration
● Isolated / self-contained applications
● With mainly external (public data) dependencies
● Global use case

Baby Steps
1. Prepare your Hadoop cluster to interact with object storage.
2. Look for existing operators for popular tools like Apache Airflow.
3. Make a copy of your critical datasets to the cloud.
4. Use both BigQuery for fast analytics and GCS output for more advanced trials.
5. Audit costs per query.
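Auditing costs per query can start with a dry run, which reports how many bytes a query would scan without executing it. The following is a minimal sketch, assuming the google-cloud-bigquery client library is installed and credentials are configured; the helper names and the price per TiB are hypothetical and must be replaced with your own values.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def query_cost(bytes_processed, price_per_tib):
    """Estimated on-demand cost for a query; price_per_tib is an assumed input."""
    return bytes_processed / TIB * price_per_tib


def dry_run_bytes(sql):
    """Ask BigQuery how many bytes a query would process, without running it."""
    # Imported lazily so the cost arithmetic above works without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed
```

Calling `query_cost(dry_run_bytes(sql), price_per_tib)` before a query is scheduled gives a rough per-query figure that can be logged next to the job name.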

Networking
● High-bandwidth, low-latency and consistent network connectivity is critical.
● Pay attention to things like choosing the right region, the number of cores or even the TCP window size.
● Multiple VPN tunnels are a good starting point to increase bandwidth.
● But to get the full speed, dedicated interconnect / direct peering is the way to go.
● Transfer appliances for offline data migration.
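Why the TCP window size matters follows from the bandwidth-delay product: a single connection cannot move more than one window of data per round trip. The sketch below works through that relation; the function names and the example numbers in the usage note are illustrative assumptions.

```python
def max_throughput_mbit(window_bytes, rtt_ms):
    """Upper bound on single-connection TCP throughput in Mbit/s (window / RTT)."""
    bits_per_second = window_bytes * 8 * 1000 / rtt_ms
    return bits_per_second / 1_000_000


def required_window_bytes(bandwidth_mbit, rtt_ms):
    """Bandwidth-delay product: window needed to keep a link of this speed full."""
    return bandwidth_mbit * 1_000_000 * rtt_ms / 8 / 1000
```

For example, `required_window_bytes(1000, 20)` gives the window needed to fill a 1 Gbit/s link at 20 ms round-trip time, and `max_throughput_mbit(65535, 20)` shows how far below that a 64 KiB window without scaling falls.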

Package Your Deployments
● Containers (Docker) for tooling.
● Deployment artifacts (Spark / MR jars).
● Tools like Spydra can help you execute your packages in both worlds.

$ cat example.json
{
  "client_id": "simple-spydra-test",
  "cluster_type": "dataproc",
  "log_bucket": "spydra-test-logs",
  "region": "europe-west1",
  "cluster": {
    "options": {
      "project": "spydra-test"
    }
  },
  "submit": {
    "job_args": [
      "pi",
      "8",
      "100"
    ],
    "options": {
      "jar": "hadoop-mapreduce-examples.jar"
    }
  }
}
$ spydra submit --spydra-json example.json
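When the same job has to be submitted to more than one environment, the JSON file can be generated rather than hand-edited. The sketch below mirrors only the keys shown on the slide; the helper name `spydra_config` is hypothetical, and any `cluster_type` value other than "dataproc" should be taken from the Spydra documentation rather than from this example.

```python
import json


def spydra_config(client_id, cluster_type, log_bucket, region, project, jar, job_args):
    """Build a dict with the same structure as the example.json on the slide."""
    return {
        "client_id": client_id,
        "cluster_type": cluster_type,
        "log_bucket": log_bucket,
        "region": region,
        "cluster": {"options": {"project": project}},
        "submit": {"job_args": list(job_args), "options": {"jar": jar}},
    }


def write_config(config, path):
    """Write the config as indented JSON for `spydra submit --spydra-json`."""
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2)


# Values copied from the slide's example.
config = spydra_config(
    client_id="simple-spydra-test",
    cluster_type="dataproc",
    log_bucket="spydra-test-logs",
    region="europe-west1",
    project="spydra-test",
    jar="hadoop-mapreduce-examples.jar",
    job_args=["pi", "8", "100"],
)
```

Calling `write_config(config, "example.json")` reproduces the file shown above.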

Other Important Features
● Cluster pooling - using init actions to kill old clusters
● Autoscaling - based on the workload
● Preemptible instances:
○ A reasonable choice for your cluster
○ Design jobs for resilience (idempotence), since instances can be reclaimed at any time
○ Also available with GPUs
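Idempotence on preemptible instances usually means that a task can be re-run after an interruption without duplicating or corrupting its output. A minimal sketch of one common pattern follows, using the local filesystem as a stand-in for the real output store; the function name `run_idempotent` is hypothetical.

```python
import os
import tempfile


def run_idempotent(output_path, compute):
    """Run compute() only if output_path is missing, then publish atomically.

    Returns True if the work was done now, False if a previous attempt
    had already produced the output.
    """
    if os.path.exists(output_path):
        return False  # an earlier attempt finished; nothing to redo

    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(compute())
        # Atomic rename on the same filesystem: readers never see partial output.
        os.replace(tmp_path, output_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

If the instance is reclaimed mid-task, only a temporary file is left behind and the retry starts cleanly; if it is reclaimed after the rename, the retry is a no-op.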

No Long-Lived Services
● No patching! - YAY
● No wasting resources
● Latest security patches applied automatically

Predictions
Forrester predicts:
“SaaS vendors will de-prioritize their platform efforts to attain global scale. They will compete more at the platform level by running portions of their services on AWS, Azure, GCP or Oracle Cloud in 2018.”

Future
Interesting projects:
● Spark on k8s
● dA Platform 2

Should I Stay or Should I Go?
There is no right answer: it’s a trade-off that depends on many variables.
