Automating Your Snowflake Cloning Strategy with AWS


Serving refreshed data on a daily basis with CloudWatch and Lambda

This topic again?

I’ve written a number of blogs recently on why you shouldn’t be developing with data the old-world way. What I mean by that is that in a cloud-based ecosystem there are plenty of ways to get data parity across Dev, Test, and Production environments with little effort. Yet few companies are taking this type of approach in their data strategy.

I’ve already written about how to accomplish this on GCP for BigQuery and on GCP for Snowflake; this time I’ll show how to do it on AWS for Snowflake.

How can we accomplish this?

Since we’ve established that we’re using Snowflake, we can take advantage of its Zero Copy Clone functionality. This means we can easily clone an entire database, at zero additional storage cost as long as we don’t edit the data in the cloned copy. We’ve also established that we’re using AWS, a platform many Snowflake customers are comfortable with, and Lambda and CloudWatch are a great fit for this problem. These two services will allow us to:

  1. Execute a command to tell Snowflake what to clone
  2. Tell that command when to run

Put together, the solution architecture is straightforward: a scheduled CloudWatch event triggers the Lambda function, which connects to Snowflake and runs the clone command.


Let’s get down to business.

Setting up the Lambda Function

I’ll be using Node.js for this example, but that doesn’t mean you couldn’t do this in Python or any other language with a supported Snowflake driver.

The function’s code

Since we are going to use CloudWatch Events to trigger our function, we need to structure it with the handler signature Lambda expects for those events. Your index.js file should look something like this:

//import the necessary package
var snowflake = require('snowflake-sdk');

exports.handler = async (event, context) => {
  console.log(event);

  //create the connection
  var connection = snowflake.createConnection({
    account: '<your account name here>',
    username: '<your username here>',
    password: '<your password here>'
  });

  //connect, waiting for the callback so the handler doesn't return early
  await new Promise(function(resolve, reject) {
    connection.connect(function(err, conn) {
      if (err) {
        console.error('Unable to connect: ' + err.message);
        reject(err);
      } else {
        console.log('Successfully connected as id: ' + connection.getId());
        resolve(conn);
      }
    });
  });

  //run the sql statement, again waiting for the complete callback to fire
  await new Promise(function(resolve, reject) {
    connection.execute({
      sqlText: 'create or replace database phils_dev_database clone demo_db;',
      complete: function(err, stmt, rows) {
        if (err) {
          console.error('Failed to execute statement due to the following error: ' + err.message);
          reject(err);
        } else {
          console.log('Successfully executed statement: ' + stmt.getSqlText());
          resolve(rows);
        }
      }
    });
  });

  return 'Finished';
};

This function establishes a connection to Snowflake and then runs the SQL command you specify. In this case, that command is:

create or replace database phils_dev_database clone demo_db;
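One small tweak worth considering (this is a sketch of my own, not part of the original function): rather than hard-coding the database names, you could build the statement from environment variables set on the Lambda function, say hypothetical SOURCE_DB and CLONE_DB variables, and pass the result into connection.execute() as the sqlText.

//a sketch assuming SOURCE_DB and CLONE_DB environment variables are set on the function
var sourceDb = process.env.SOURCE_DB || 'demo_db';
var cloneDb = process.env.CLONE_DB || 'phils_dev_database';
var sqlText = 'create or replace database ' + cloneDb + ' clone ' + sourceDb + ';';

That way the same function code can refresh different environments without being redeployed.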

You will need a package.json file to accompany this index.js file, and you’ll need to zip it all up. If you’re not sure how to create one, it’s pretty simple: in your function’s root folder, run npm init to generate the package.json, then npm install snowflake-sdk --save to pull the dependency into the node_modules folder.
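For reference, a minimal package.json for this function might look something like the following. The package name is whatever you like, and the snowflake-sdk version shown is illustrative, so pin whatever npm actually installs for you:

{
  "name": "snowflake-clone-function",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "snowflake-sdk": "^1.1.0"
  }
}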

Pro tip on zipping: if you are on macOS and zipping folders for Lambda, run the command below from inside your function’s folder. Lambda can get a bit picky about folder structure, and this will avoid the issue.

zip -r ../<your_folder_name_here>.zip *

Once your zip archive is ready, upload it to Lambda, and be sure to select the Node.js 8.10 runtime.
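If you prefer the CLI to the console, a command along these lines uploads the archive and sets the runtime in one step. This is a sketch: the function name, role ARN, and zip file name are placeholders, not values from the original setup.

# a sketch of creating the function from the CLI; substitute your own names and ARNs
aws lambda create-function \
  --function-name snowflake-clone-function \
  --runtime nodejs8.10 \
  --handler index.handler \
  --role arn:aws:iam::<your_account_id>:role/<your_lambda_role> \
  --zip-file fileb://<your_folder_name_here>.zip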


Function configuration

In the spirit of AWS’s poor documentation, I’ll skip how I set up my function… Just kidding! I’ll help you guys and gals out.

Be sure to select CloudWatch Events as the trigger when setting up the function. We’ll configure the CloudWatch rule soon, but it is important that Lambda is listening to the correct service. Other than that, the rest of the function setup will be the same as any other Lambda function in your AWS environment; in general, follow your organization’s recommended settings.

My setup worked fine with a new service role for this function and the minimum amount of memory allocated. I tested this both inside a VPC (with security groups and rules) and without one, and performance was fine either way. Like I said above, the most important thing is having the correct trigger; beyond that, follow what you usually do for function setups. If you get stuck, this page is a great resource that mirrors what we’re trying to accomplish here.

Be sure to save and test your function once it’s all configured. 

Setting up CloudWatch

This is also relatively simple. Navigate over to CloudWatch in the console and create a rule, either from the landing page or by selecting “Rules” in the left-hand menu and creating one on the next page. From there, it’s a few simple clicks to set up the event.


Set up the event to run on a schedule, and edit the schedule to fit your needs. Then select the function we just created as the target, and leave the other settings at their defaults.

Then all we need to do is give the rule a name and click Create.
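If you’d rather script this step too, the same wiring can be done from the CLI. Again, this is a sketch: the rule name, schedule expression, region, account ID, and function name are placeholders.

# create a daily schedule rule
aws events put-rule \
  --name daily-snowflake-clone \
  --schedule-expression "rate(1 day)"

# allow CloudWatch Events to invoke the function
aws lambda add-permission \
  --function-name snowflake-clone-function \
  --statement-id daily-snowflake-clone-event \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:<region>:<account_id>:rule/daily-snowflake-clone

# attach the function as the rule's target
aws events put-targets \
  --rule daily-snowflake-clone \
  --targets "Id"="1","Arn"="arn:aws:lambda:<region>:<account_id>:function:snowflake-clone-function"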


Once your function runs for the first time, be sure to head over to Snowflake to see your cloned database!
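A quick way to verify the result (a small check of my own, not from the original post) is to run the following in a Snowflake worksheet:

-- confirms the cloned database from the statement above exists
show databases like 'phils_dev_database';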

Wrap Up

Working with fresh data is a no-brainer when it’s as easy to deliver as setting up a Lambda function and a CloudWatch event. Implementing this kind of solution has a great deal of ROI, not only in the form of code quality, but also in terms of time to market and team morale.



Phil Goerdt is the founder of Erteso, a Snowflake-partner consultancy that focuses on cloud and data solutions. He is certified in Snowflake, and is a GCP Certified Professional Cloud Architect and a GCP Certified Professional Data Engineer. You can contact him at phil.goerdt@erteso.com.
