Deep Learning Images For Google Compute Engine, The Definitive Guide

12 min read · Jul 3, 2018

Co-author of the article: Mike Cheng

Google Cloud Platform now provides machine learning images designed for deep learning practitioners. This article covers the fundamentals of the Google Deep Learning images: what they are, how they benefit the developer, how to create a deep learning instance, and common workflows.

Disclaimers:

* At the time of writing, the product is still in Beta, so it is not covered by any SLA.
* We will update this guide as new features become available. Please see the bottom of the page to see when the article was last updated.

What are the Google Deep Learning images?

The Google Deep Learning images are a set of prepackaged VM images with a deep learning framework ready to be run out of the box. Currently, there are images supporting TensorFlow, PyTorch, and generic high-performance computing, with versions for both CPU-only and GPU-enabled workflows. To understand which set of images is right for your use case, consult the graph below.

We also have experimental families:

All of the images are based on Debian 9.

What do the images contain?

All images come with Python 2.7/3.5 and pre-installed core packages:

numpy
sklearn
scipy
pandas
nltk
pillow
many others

Jupyter environments (Lab and Notebook) for doing quick prototyping.

Nvidia packages (GPU images only):

CUDA 9.0 / 9.1 / 9.2 / 10.0
cuDNN 7.* (up to 7.4)
NCCL 2.* (up to 2.3)
latest Nvidia Driver

The list is constantly growing, so I would advise the reader to keep an eye out for any updates.

Why would I use these deep learning images?

Let’s say that you want to train some models with Keras and TensorFlow. You care about performance, so you want to attach GPUs. To see any benefit from the GPUs, you will need to install the Nvidia stack (Nvidia driver + CUDA + cuDNN). Not only is this tricky in and of itself, but you will also need to consider compatibility with the ML framework binaries. For example, the official TensorFlow 1.10 binary is compiled against CUDA 9.0, so a machine with CUDA 9.2 or 10.0 will NOT work with the official TensorFlow binary. As anyone who has set up this stack can tell you, matching dependencies is a nontrivial pain.

Now, let’s say that after some sleepless nights you’ve finally installed the required CUDA stack. Next comes the question of optimization. Performance is key, since it means less time to convergence (less thumb-twiddling and a lower total cost). Is the CUDA + framework combination the fastest stack for the hardware that you will be using? For example, will a GCE instance with a Skylake CPU, one Volta V100 GPU, and CUDA 9.0 show the highest possible performance for TensorFlow 1.10? With improvements coming in constantly from Nvidia, the framework, and even the platform itself, it’s hard to know for certain.

In order to be sure, you will have to compile TensorFlow yourself with different compilation flags and different versions of the Nvidia stack, run measurements, and pick the best one. All of this trial and error will require GPU instances, which are quite pricey. To top it off, you will have to do this all over again each time a new version of the Nvidia stack or TensorFlow is released. Safe to say, this process is not something you really want to handle yourself.

This is where the Google Deep Learning images come into play. The TensorFlow images have a custom build of TensorFlow that is optimized precisely for the hardware that we have on Google Compute Engine. We’ve also tested each configuration of the Nvidia stack and packaged the one that has the best possible speed. And, on top of this, almost all of the important packages you will need over the course of your research come pre-baked in the image.

How do I create a VM instance from these images?

There are two ways of creating an instance with our image:

from the UI
from the CLI

From The UI

Our UI is very simple, so I will just show you what it looks like:

You can start using it right away by going here: https://console.cloud.google.com/mlengine/notebooks/instances

Since this part is simple, I will leave you with the official documentation for the UI, and we will move on to the CLI for users who need either automation or more flexibility.

From The CLI

Before starting, be sure that you have installed the gcloud CLI on your local machine. Alternatively, you can use Google Cloud Shell; however, beware that WebPreview from Google Cloud Shell is currently not supported.

Now, you will need to pick the family of images you want for your VM instance. For ease of reference, I’ve duplicated the graph of non-experimental image families:

Let’s say you want a TensorFlow GPU image. You would select the family tf-latest-gpu, which references an up-to-date image with the most recent release of TensorFlow. We also have the family tf-latest-cu100, which has the latest TensorFlow with CUDA 10.0. We will be using this family later in the article. However, beware that after the migration to the newest CUDA (like CUDA 10) this family will no longer be supported, so I would suggest using tf-latest-gpu.

What if I need a specific version of TensorFlow?

Let’s say you have a project that requires TensorFlow 1.11, but TensorFlow 1.12 has already come out, and the tf-latest images have already moved to 1.12. For this scenario, we provide image families that refer directly to the framework version (in this case, tf-1-11-cpu and tf-1-11-gpu). These images will still be updated with the necessary security patches, but the framework version is fixed.

What if I want to create a cluster or use the same image in the future? How can I guarantee that the image has not changed?

I understand that there are plenty of cases where you might want to reuse the exact same image again and again, and in many of them this is actually preferred. For example, if you are spinning up a cluster, it is NOT recommended to reference images by image family in your scripts: if the family is updated while your script is running, you will end up with different images (and therefore different versions of the software) on some instances in the cluster. In these cases, you’ll want to get the name of the latest image in the family directly:
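The command that originally followed here did not survive in this copy of the article. A minimal sketch of such a lookup, assuming the tf-latest-gpu family and the public project deeplearning-platform-release mentioned in the text (the variable names are mine, chosen for illustration):

```shell
IMAGE_FAMILY="tf-latest-gpu"                    # family name from the text
IMAGE_PROJECT="deeplearning-platform-release"   # public project hosting the images

if command -v gcloud >/dev/null 2>&1; then
  # value(name) prints only the image name rather than the full description.
  gcloud compute images describe-from-family "$IMAGE_FAMILY" \
    --project="$IMAGE_PROJECT" \
    --format="value(name)"
else
  echo "gcloud CLI not found; install it before running this lookup" >&2
fi
```

The printed name can then be passed to the creation command in place of the family.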

Unfortunately, I do not have a fancy bash function for you for this one, simply because it is not something that I use very often. All the deep learning images are public and can be listed from the project “deeplearning-platform-release” where they are hosted.
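Listing every image in that project is a single gcloud call. This is a sketch rather than the command from the original article, and it assumes you only want the images from deeplearning-platform-release without the stock OS images:

```shell
IMAGE_PROJECT="deeplearning-platform-release"   # project named in the text

if command -v gcloud >/dev/null 2>&1; then
  # --no-standard-images hides the default public OS images,
  # so only the images from IMAGE_PROJECT are listed.
  gcloud compute images list \
    --project="$IMAGE_PROJECT" \
    --no-standard-images
else
  echo "gcloud CLI not found; install it before listing images" >&2
fi
```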

Now you have the EXACT image name (e.g. tf-latest-gpu-20181023), which you can reuse wherever you want.

Now, let’s create some instances!

To create an instance from an image family:
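The original command is missing from this copy, so here is a hedged sketch of what such a call can look like. The helper name create_gpu_vm, the default zone, the 120GB disk and the single V100 are my assumptions for illustration; the machine type and the metadata key come from the notes that follow.

```shell
# Hypothetical helper: create a GPU VM from a Deep Learning image family.
# Assumes the gcloud CLI is installed and you have GPU quota in the zone.
create_gpu_vm() {
  instance_name="${1:?usage: create_gpu_vm INSTANCE_NAME [ZONE] [IMAGE_FAMILY]}"
  zone="${2:-us-west1-b}"             # assumed default zone
  image_family="${3:-tf-latest-gpu}"  # family suggested in the text

  # GPU instances cannot live-migrate, hence --maintenance-policy=TERMINATE.
  gcloud compute instances create "$instance_name" \
    --zone="$zone" \
    --image-family="$image_family" \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --accelerator="type=nvidia-tesla-v100,count=1" \
    --machine-type=n1-standard-8 \
    --boot-disk-size=120GB \
    --metadata="install-nvidia-driver=True"
}

# Usage: create_gpu_vm my-instance us-west1-b tf-latest-gpu
```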

If using an image name, you should replace “--image-family=image-family” with “--image=image-name” in the command.

Several things to note here:

Pick the right instance type. The example command uses “n1-standard-8”, which has 8 vCPUs and 30 GB of RAM. You might want a cheaper instance, or a more powerful one. All available instance types can be found here.

Pick the right disk size. You probably do not want to discover that the disk is too small to hold all your training data, so decide on the size up front.

If you are using an instance with GPUs, there are a few points that you should be aware of:

Pick a valid zone. If you are creating an instance with a certain GPU, be sure that the GPU is available in the zone you’ve selected. See the documentation that gives a list of zones with GPUs. As you can see, us-west1-b is the ONLY zone that has all 3 different GPUs.

Verify that you have quota to create an instance with GPUs. Even if you have picked the right region, it does not mean that you have quota to create a GPU instance. By default, GPU quotas are zero, so any attempt to create an instance with GPUs will fail. A good explanation of how to request a quota increase can be found here.

Verify that the zone has enough GPUs to fulfill your request. Even if you have picked the right region and you have quota, it does not mean that there are GPUs available in the zone that you have picked. Unfortunately, I’m not aware of any simple way to check the availability of the resource other than actually trying to create it.

Pick the right GPU type and amount. The “accelerator” flag in the command controls the GPU type and the number of GPUs that will be attached to the instance: e.g. “--accelerator='type=nvidia-tesla-v100,count=8'”. Each GPU type has certain valid counts. Here are the supported types and the counts that can be used with them (in order of least to most powerful):

nvidia-tesla-k80, can have counts: 1, 2, 4, 8
nvidia-tesla-p100, can have counts: 1, 2, 4
nvidia-tesla-v100, can have counts: 1, 8
nvidia-tesla-p4, can have counts: 1, 2, 4
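Since an invalid type and count pair only fails once the create call is made, it can be handy to check the pair locally first. The sketch below encodes exactly the list above; the function names valid_gpu_count and accelerator_flag are hypothetical helpers, not part of gcloud.

```shell
# Return 0 if COUNT ($2) is a valid number of GPUs for TYPE ($1),
# according to the list in this article.
valid_gpu_count() {
  case "$1:$2" in
    nvidia-tesla-k80:1|nvidia-tesla-k80:2|nvidia-tesla-k80:4|nvidia-tesla-k80:8) return 0 ;;
    nvidia-tesla-p100:1|nvidia-tesla-p100:2|nvidia-tesla-p100:4) return 0 ;;
    nvidia-tesla-v100:1|nvidia-tesla-v100:8) return 0 ;;
    nvidia-tesla-p4:1|nvidia-tesla-p4:2|nvidia-tesla-p4:4) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the --accelerator flag only for a valid combination.
accelerator_flag() {
  if valid_gpu_count "$1" "$2"; then
    echo "--accelerator=type=$1,count=$2"
  else
    echo "unsupported GPU combination: $1 x $2" >&2
    return 1
  fi
}

# Usage: accelerator_flag nvidia-tesla-v100 8
```

The table is hard-coded on purpose, so it has to be updated by hand whenever the supported counts change.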

Give permission for Google Cloud to install the Nvidia driver on your behalf. The Nvidia driver is required for your instance to interface with the GPUs correctly. Due to reasons that are out of the scope of this article, images do NOT come preinstalled with the Nvidia driver. However, it is possible to authorize Google Cloud to install this driver on your behalf. This is done via the flag “--metadata='install-nvidia-driver=True'”. If you do not opt into the automatic install using the metadata flag, the VM will prompt you to install the driver on your first SSH.

Unfortunately, the process of installing the driver affects the startup time for the first boot, since it must download the files from Nvidia and install the driver. This should take no more than 1 minute, but it might affect certain use cases. We will discuss how to reduce boot time later in this article.

If you are not planning to use a GPU, you might want to add the following key to your command:

This will guarantee that you get an instance with the fastest CPU available at the time of writing. However, you also need to be sure that this CPU is available in your region. You can check it here. So the overall command for creating a CPU-only instance would be:
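Neither the key nor the full command survived in this copy. My best guess is that the key is gcloud's “--min-cpu-platform” flag set to Skylake, the CPU mentioned earlier in the article, so treat the sketch below as an assumption. The helper name, the tf-latest-cpu family, the zone and the disk size are also illustrative choices of mine.

```shell
# Hypothetical helper: create a CPU-only VM pinned to a minimum CPU platform.
create_cpu_vm() {
  instance_name="${1:?usage: create_cpu_vm INSTANCE_NAME [ZONE]}"
  zone="${2:-us-west1-b}"   # assumed default; Skylake must be offered in this zone

  gcloud compute instances create "$instance_name" \
    --zone="$zone" \
    --image-family=tf-latest-cpu \
    --image-project=deeplearning-platform-release \
    --machine-type=n1-standard-8 \
    --boot-disk-size=120GB \
    --min-cpu-platform="Intel Skylake"
}

# Usage: create_cpu_vm my-cpu-instance us-west1-b
```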

Create Instance With Simplified Jupyter Access Feature. Beta.

There is a way to create a Deep Learning VM with a special HTTPS link. The link gives you access to the Jupyter Lab that is running on the VM. With the link, you do not have to use SSH (unless you want to) or port mapping.

This feature is currently in Beta and is supported ONLY for instances in US/EU and Asia.

In order to use this feature, you need to change the following line:

--metadata='install-nvidia-driver=True'

to:

--scopes=https://www.googleapis.com/auth/cloud-platform \
--metadata='install-nvidia-driver=True,proxy-mode=project_editors'

So the final command will look like this:
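The final command itself is missing from this copy. A sketch that combines the two changed lines above with a GPU create call could look like this; the helper name, the zone, the single K80 and the disk size are assumptions for illustration:

```shell
# Hypothetical helper: create a GPU VM with the Jupyter proxy link enabled.
create_notebook_vm() {
  instance_name="${1:?usage: create_notebook_vm INSTANCE_NAME [ZONE]}"
  zone="${2:-us-west1-b}"   # assumed default zone

  gcloud compute instances create "$instance_name" \
    --zone="$zone" \
    --image-family=tf-latest-gpu \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --accelerator="type=nvidia-tesla-k80,count=1" \
    --machine-type=n1-standard-8 \
    --boot-disk-size=120GB \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --metadata="install-nvidia-driver=True,proxy-mode=project_editors"
}

# Usage: create_notebook_vm my-notebook us-west1-b
```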

This will show your instance in the new Notebooks UI that we showed earlier. You can either go directly to the UI to start using your instance (https://console.cloud.google.com/mlengine/notebooks/instances) or, if you prefer, get the URL for accessing Jupyter on the VM by calling the following command:
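The original command is not preserved here. One way to look for the link, assuming the proxy publishes it in the instance metadata under a key named proxy-url (that key name is my assumption and may differ), is to grep the instance description:

```shell
# Hypothetical helper: print the metadata entry that holds the Jupyter URL.
# Assumes the key is called "proxy-url"; adjust the pattern if it differs.
notebook_url() {
  instance_name="${1:?usage: notebook_url INSTANCE_NAME ZONE}"
  zone="${2:?zone required}"

  # The default output lists metadata as "key:" / "value:" pairs,
  # so print the matching key line plus the line after it.
  gcloud compute instances describe "$instance_name" --zone="$zone" \
    | grep -A1 "key: proxy-url"
}

# Usage: notebook_url my-notebook us-west1-b
```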

If everything is OK, you will see a URL that can be used in the browser to access Jupyter Lab.

It might take a minute for the URL to show up.

Access your instance using SSH

This can be done via a simple command:
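The command is missing from this copy; with the standard gcloud CLI it would be along these lines. The wrapper name dlvm_ssh and the default zone are assumptions of mine:

```shell
# Hypothetical wrapper around "gcloud compute ssh".
dlvm_ssh() {
  instance_name="${1:?usage: dlvm_ssh INSTANCE_NAME [ZONE]}"
  zone="${2:-us-west1-b}"   # assumed default zone

  gcloud compute ssh "$instance_name" --zone="$zone"
}

# Usage: dlvm_ssh my-instance us-west1-b
```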

gcloud will propagate your SSH keys and create your user, so you don’t have to worry about that part. If you want to make the process even easier, I actually have several bash helper functions that simplify things for me. I prefer to ssh like this:

BTW, you can find many of my gcloud related functions here. Before we jump to questions like how fast is the image, or what can be done with it, let me address one last question related to image start time.

How can I reduce the startup time?

This is very simple. You can do the following:

Create an n1-standard-1 (default size) instance with one K80 (the cheapest GPU).
Wait until boot is completed.
Verify that the Nvidia driver is installed (you can do this by running “nvidia-smi” after SSHing to your instance).
Stop the instance.
Create your own image.
Profit — all VMs created from your own image will have a very fast time to boot, since the driver is all ready to go.
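Steps 2 and 3 can be scripted by polling “nvidia-smi” over SSH until it succeeds. This is a sketch under my own assumptions: the helper name, the number of attempts and the POLL_SECONDS interval are made up for illustration.

```shell
# Hypothetical helper: wait until nvidia-smi succeeds on the instance.
wait_for_driver() {
  instance_name="${1:?usage: wait_for_driver INSTANCE_NAME ZONE [ATTEMPTS]}"
  zone="${2:?zone required}"
  attempts="${3:-10}"

  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --command runs a single command over SSH and returns its exit status.
    if gcloud compute ssh "$instance_name" --zone="$zone" \
         --command="nvidia-smi" >/dev/null 2>&1; then
      echo "driver ready"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-15}"
  done

  echo "driver not ready after $attempts attempts" >&2
  return 1
}

# Usage: wait_for_driver my-instance us-west1-b
```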

We have already covered how to create the instance, SSH to it, and verify that the driver is installed. Now, let me tell you how to stop the instance from the CLI and how to create your own image.

To stop your instance, you can use the following command (from your local machine, not on the instance):
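The command is missing from this copy; a sketch with the standard gcloud CLI, where the wrapper name stop_vm is my own:

```shell
# Hypothetical wrapper: stop (not delete) an instance so its disk can be imaged.
stop_vm() {
  instance_name="${1:?usage: stop_vm INSTANCE_NAME ZONE}"
  zone="${2:?zone required}"

  gcloud compute instances stop "$instance_name" --zone="$zone"
}

# Usage: stop_vm my-instance us-west1-b
```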

To create your own image:
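Again the original command is not preserved. The sketch below assumes the boot disk has the same name as the instance, which is the default when an instance is created without an explicit disk name; the helper and image names are hypothetical.

```shell
# Hypothetical helper: build a reusable image from a stopped instance's boot disk.
create_image_from_vm() {
  instance_name="${1:?usage: create_image_from_vm INSTANCE_NAME ZONE IMAGE_NAME}"
  zone="${2:?zone required}"
  image_name="${3:?image name required}"

  # Assumes the boot disk is named after the instance (the gcloud default).
  gcloud compute images create "$image_name" \
    --source-disk="$instance_name" \
    --source-disk-zone="$zone"
}

# Usage: create_image_from_vm my-instance us-west1-b my-dl-image
```

New VMs can then be created with “--image=my-dl-image” instead of an image family.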

After this command finishes, you will have your own image with Nvidia drivers pre-installed that can be used to start new VMs.

What About Preemptible Instances?

First of all, if you do not know what “preemptible” means, familiarize yourself with the official documentation here. And now the good part: to create a preemptible instance, the ONLY change you need to make to the creation command is to add the following flag at the end: “--preemptible”.
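As a small illustration, a wrapper can pass through whatever creation arguments you already use and append the flag. The name create_preemptible_vm is hypothetical:

```shell
# Hypothetical wrapper: same arguments as "gcloud compute instances create",
# with --preemptible appended at the end.
create_preemptible_vm() {
  gcloud compute instances create "$@" --preemptible
}

# Usage:
#   create_preemptible_vm my-instance --zone=us-west1-b \
#     --image-family=tf-latest-gpu --image-project=deeplearning-platform-release
```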

What About Jupyter Lab?

I would strongly advise you to use the access method from the section “Create Instance With Simplified Jupyter Access Feature. Beta.”. Use this section only if that method did not work or can NOT be used.

As soon as your VM is up, the next step is probably to start Jupyter Lab and do some DL :) This is very simple with the images. Jupyter Lab is actually already running (there is also a user “jupyter” that is created in the system and is used by Jupyter Lab); you just need to connect to the instance with port mapping. By default, it will be on port 8080.
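Port mapping can be done by passing an SSH tunnel option through gcloud. This is a sketch; the wrapper name is my own and it assumes local port 8080 is free:

```shell
# Hypothetical wrapper: SSH to the VM and forward local port 8080
# to port 8080 on the instance, where Jupyter Lab listens.
jupyter_tunnel() {
  instance_name="${1:?usage: jupyter_tunnel INSTANCE_NAME ZONE}"
  zone="${2:?zone required}"

  # Everything after the bare "--" is handed to the underlying ssh client.
  gcloud compute ssh "$instance_name" --zone="$zone" -- -L 8080:localhost:8080
}

# Usage: jupyter_tunnel my-instance us-west1-b
```

Keep the SSH session open for as long as you want the tunnel to stay up.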

Now you can open your browser on your local machine at http://localhost:8080 and that is it!

How Fast Is It?

This is a very important question. However, the answer to this question probably would be even longer than everything written here up to this point. Therefore, my friends, you will have to wait for the next article :)

However, a very quick estimate shows a training speed on ImageNet of around 6100 images/sec (ResNet-50). I did not have the personal budget to finish the training, but I guess it is safe to assume that at such a speed one can train the model to 75% accuracy in slightly more than 5 hours.

How To Get Help?

If you need anything, please do not hesitate to file a question on Stack Overflow with the tag google-dl-platform. Even if you think that the question is small, just file it :)

Also, you can now write to the public Google Group.

Any Possible Feedback Is Welcome

If you have any feedback, please do not hesitate to contact me by any possible means. In fact, if you want to say thank you for the article, the best way is to play with the images and give your feedback! It would be really-really nice to hear what can be improved!

Conclusion

These new images are bridging the gap between the software stack for deep learning (TensorFlow) and the GCE hardware (like Nvidia Volta). And since Google is behind TensorFlow they definitely have the expertise required to provide the best possible performance on the hardware.
