Taming The Data Load/Unload in Snowflake
Sample Code and Best Practice
(Faysal Shaarani)

Loading Data Into Your Snowflake Database(s) from Raw Data Files
	
[1. CREATE YOUR APPLICABLE FILE FORMAT]:
The syntax in this section allows you to create your CSV file format if you are loading
data from CSV files. Please note that the backslash escapes are coded for use from the
sfsql command line, not from the UI. If this FILE FORMAT below is to be used from the
Snowflake UI, the double-backslash escape occurrences (e.g., \\n) should be changed to their
single-backslash form (e.g., \n).
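As a small illustration of that note (only the backslash convention differs between the two clients), the record delimiter in the file format would be written as follows:
-- From the sfsql command line:
RECORD_DELIMITER = '\\n'
-- From the Snowflake UI:
RECORD_DELIMITER = '\n'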
[CSV FILE FORMAT]
CREATE OR REPLACE FILE FORMAT DEMO_DB.PUBLIC.CSV
-- Comma field delimiter and \\n record terminator
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = false
ERROR_ON_COLUMN_COUNT_MISMATCH = true
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('');
	
The syntax in the section below allows you to create a JSON file format if you are loading
JSON data into Snowflake.
	
[JSON FILE FORMAT]
CREATE OR REPLACE FILE FORMAT DEMO_DB.PUBLIC.JSON
TYPE ='JSON'
COMPRESSION = 'AUTO'
ENABLE_OCTAL = false
ALLOW_DUPLICATE = false
STRIP_OUTER_ARRAY = false;
[2. CREATE YOUR DESTINATION TABLE]:
Pre-create your table before loading the CSV data into that table.
CREATE OR REPLACE TABLE exhibit
(Id STRING
,Title STRING
,Year NUMBER
,Collection_Id STRING
,Timeline_Id STRING
,Depth INT);
CREATE OR REPLACE TABLE timelines
(Id STRING
,Title VARCHAR
,Regime STRING
,FromYear NUMBER
,ToYear NUMBER
,Height NUMBER
,Timeline_Id STRING
,Collection_Id STRING
,ForkNode NUMBER
,Depth NUMBER
,SubtreeSize NUMBER);
	
If you are using data files that have been staged on the Snowflake Customer
Account S3 bucket assigned to your company:
	
Run COPY Command To Load Data From Raw CSV Files
Load the data from your CSV file into the pre-created EXHIBIT table. If a data error is
encountered on any of the records, continue loading what you can. If you do not specify
ON_ERROR, the default is to skip the file on the first error it encounters on any of the
records in that file. The example below loads whatever it can, skipping any bad
records in the file.
	
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = CSV ON_ERROR='continue';
OR
	
Load the data into the pre-created EXHIBIT table from several CSV files matching the file
name regular expression shown in the sample code below:
	
COPY INTO exhibit
FROM @~/errorsExhibit
PATTERN='.*0[1-2].txt.gz'
FILE_FORMAT = CSV ON_ERROR='continue';
To list the files under the CleanData subdirectory of the @~ staging area for
your Snowflake Beta Customer account, while in the sfsql command line, use the following
command:
ls @~/CleanData
To list all files whose names match the regular expression specified in
the PATTERN parameter, use the command below:
ls @~/CleanData PATTERN='.*0[1-2].txt.gz';
	
Verify that the data was loaded successfully into the EXHIBIT table.
	
	 Select * from EXHIBIT;
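Because the loads above use ON_ERROR='continue', any bad records are silently skipped. As a hedged follow-up check (assuming the VALIDATE table function is available in your Snowflake release), you can list the records rejected by the most recent COPY into EXHIBIT:
SELECT * FROM TABLE(VALIDATE(exhibit, JOB_ID => '_last'));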
Run COPY Command to Load/Parse JSON Data from Raw Staged Files
[1. Upload JSON File Into The Customer Account's S3 Staging Area]
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/json/json_sample_data @~/json/
[2. Create an External Table with a VARIANT Column To Contain The JSON Data]
CREATE OR REPLACE TABLE public.json_data_table_ext
(json_data variant)
STAGE_LOCATION='@~'
STAGE_FILE_FORMAT=demo_db.public.json
COMMENT='json Data preview table';
[3. COPY the JSON Raw Data Into the Table]
COPY INTO json_data_table_ext
FROM @~/json/json_sample_data
FILE_FORMAT = 'JSON' ON_ERROR='continue';
JSON FILE CONTENT:
Validate that the data in the JSON raw file got loaded into the table
select * from public.json_data_table_ext;
Output:
{ "root": [ { "age": 22, "children": [ { "age": "6", "gender": "Female", "name": "Jane" }, { "age": "15", ...
select json_data:root[0].kind, json_data:root[0].fullName,
json_data:root[0].age from public.json_data_table_ext ;
Output:
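To drill further into the nested JSON, a LATERAL FLATTEN query can expand the children array. This is a minimal sketch based on the sample document shown above; the element names (name, age) are taken from that sample:
select f.value:name::string as child_name,
       f.value:age::string  as child_age
from public.json_data_table_ext,
     lateral flatten(input => json_data:root[0].children) f;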
If you are using data files that have been staged on your own company’s Amazon S3
bucket:
Run COPY Command To Load Data From Raw CSV Files
The syntax below is needed to create a stage ONLY if you are using your own company’s
Amazon S3 bucket. If you are using your Snowflake-assigned bucket, you do not need to
create a stage object.
Create the staging database object.
	
CREATE OR REPLACE STAGE DEMO_DB.PUBLIC.SAMPLE_STAGE
URL = 'S3://<YOUR_COMPANY_NAME>/<SUBFOLDER NAME>/'
CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY VALUE'
AWS_SECRET_KEY = 'SECRET KEY VALUE')
COMMENT = 'Stage object pointing to the customer''s own AWS S3 bucket. Independent of Snowflake';
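After creating the stage, a quick sanity check (assuming your files are already in the bucket) is to list what Snowflake can see through the stage:
ls @DEMO_DB.PUBLIC.SAMPLE_STAGE;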
	
Load the data from your CSV file into the pre-created EXHIBIT table.
COPY INTO exhibit
FROM @DEMO_DB.PUBLIC.SAMPLE_STAGE/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = CSV ON_ERROR='continue';
OR
	
Load the data into the pre-created EXHIBIT table from several CSV files matching the file
name pattern regular expression stated in the sample code below.
	
COPY INTO exhibit
FROM @DEMO_DB.PUBLIC.SAMPLE_STAGE/errorsExhibit
PATTERN='.*0[1-2].txt.gz'
FILE_FORMAT = CSV ON_ERROR='continue';
Verify that the data was loaded successfully into the EXHIBIT table.
Select * from EXHIBIT;
Run COPY Command to Load/Parse JSON Data from Raw Staged Files
	
[1. Create a Stage Database Object Pointing to The Location of The JSON File]
The syntax below is needed to create a stage ONLY if you are using your own company’s
Amazon S3 bucket. If you are using your Snowflake-assigned bucket, you do not need to
create a stage object.
CREATE OR REPLACE STAGE DEMO_DB.PUBLIC.STAGE_JSON
URL = 'S3://<YOUR_COMPANY_NAME>/<SUBFOLDER NAME>/'
CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY VALUE'
AWS_SECRET_KEY = 'SECRET KEY VALUE')
COMMENT = 'Stage object pointing to the customer''s own AWS S3 bucket. Independent of Snowflake';
Place your file in the staging location defined by the above command:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/json/json_sample_data @stage_JSON/json/
[2. Create an External Table with a VARIANT Column To Contain The JSON Data]
CREATE OR REPLACE TABLE public.json_data_table_ext
(json_data variant)
STAGE_LOCATION=demo_db.public.stage_json
STAGE_FILE_FORMAT= demo_db.public.json
COMMENT='json Data preview table';
[3. COPY the JSON Raw Data Into the Table]
COPY INTO json_data_table_ext
FROM @stage_json/json/json_sample_data
FILE_FORMAT = 'JSON' ON_ERROR='continue';
JSON FILE CONTENT:
Validate that the data in the JSON raw file got loaded into the table
select * from public.json_data_table_ext;
Output:
{ "root": [ { "age": 22, "children": [ { "age": "6", "gender": "Female", "name": "Jane" }, { "age": "15", ...
select json_data:root[0].kind, json_data:root[0].fullName,
json_data:root[0].age from public.json_data_table_ext ;
Output:
	
Using Snowflake to Validate Your Data Files
	
In this section, we will go over validating the raw data files before performing the actual data
load. To illustrate this, we will attempt to load raw data files that intentionally contain errors
and therefore fail to load into Snowflake. The VALIDATION_MODE option
on the COPY command processes the data without loading it into the destination table in
Snowflake.
In the following example, the PUT command stages the exhibit*.txt files
at the default S3 staging location set up for the Beta Customer Account in Snowflake, and
illustrates how files can be loaded from a sub-directory below the root staging directory of a
Snowflake Beta Customer Account.
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData/exhibit*.txt @~/errorsExhibit/;
	
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
source | target | source_size | target_size | source_compression | target_compression | status | details |
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
exhibit_03.txt | exhibit_03.txt.gz | 8353 | 3734 | NONE | GZIP | UPLOADED | |
exhibit_01.txt | exhibit_01.txt.gz | 14733 | 6207 | NONE | GZIP | UPLOADED | |
exhibit_02.txt | exhibit_02.txt.gz | 14730 | 6106 | NONE | GZIP | UPLOADED | |
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
3 rows in result (first row: 1.501 sec; total: 1.504 sec)	
Below are three possible raw data validation scenarios and sample code:
	
1. The following example previews 10 records from the first raw data
file, exhibit_01.txt. This file does not contain any errors.
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_01.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_10_rows';
2. The following example simulates the scenario of having an extra delimiter in
a record and shows how the resulting errors would be displayed.
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
	
3. The following example simulates the scenario of having a column value of
the wrong data type and shows how the error is reported after running
the COPY command below:
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
	
Using Snowflake to Unload Your Snowflake Data to Files
	
To create a data file on S3 from a table in the Snowflake database (here, the EXHIBIT table), use the command below:
COPY INTO @~/giant_file/ from exhibit;
OR to overwrite the existing files in the same directory, use the OVERWRITE option as in the
command below:
COPY INTO @~/giant_file/ from exhibit overwrite=true;
Please note that, by default, Snowflake unloads the data from the table into multiple files of
about 16 MB each. If you want your data to be unloaded to a single file, you need to
use the SINGLE option on the COPY command, as in the example below:
COPY INTO @~/giant_file/ from exhibit
Single=true
overwrite=true;
Please note that AWS S3 has a limit of 5 GB on the size of any file you stage on S3. You can
use the optional MAX_FILE_SIZE option (in bytes) to change the Snowflake default file size. Use
the command below if you want to specify bigger or smaller file sizes than the Snowflake
default, as long as you do not exceed the AWS S3 maximum file size. For example, the
command below unloads the data in the EXHIBIT table into files of 50 MB each:
COPY INTO @~/giant_file/ from exhibit
max_file_size= 50000000 overwrite=true;
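You can also unload the result of a query rather than an entire table. The sketch below assumes the same EXHIBIT table and stage path used above; the column list and WHERE filter are only illustrative:
COPY INTO @~/giant_file/
FROM (SELECT id, title, year FROM exhibit WHERE year >= 1900)
max_file_size= 50000000 overwrite=true;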
Using Snowflake to Split Your Data Files Into Smaller Files
If you are using data files that have been staged on the Snowflake Customer
Account S3 bucket assigned to your company:
When loading data into Snowflake, it is recommended that the raw data be split into as many
files as possible to maximize the parallelization of the data loading process and thus
complete the data load in the shortest amount of time possible. If your raw data is in one
raw data file, you can use Snowflake to split that large data file into multiple files before
loading the data into Snowflake. Below are the steps for achieving this:
• Place the Snowflake sample giant_file from your local machine's directory into the
@~/giant_file/ S3 bucket using the following command:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/CleanData/giant_file_sample.csv.gz @~/giant_file/;
	
• Create a single-column file format for examining the data in the
data file.
CREATE OR REPLACE FILE FORMAT single_column_rows
TYPE='CSV'
SKIP_HEADER=0
RECORD_DELIMITER='\\n'
TRIM_SPACE=false
DATE_FORMAT='AUTO'
TIMESTAMP_FORMAT='AUTO'
FIELD_DELIMITER='NONE'
FIELD_OPTIONALLY_ENCLOSED_BY='NONE'
ESCAPE_UNENCLOSED_FIELD='\\134'
NULL_IF=('')
COMMENT='copy each line into single-column row';
	
• Create an external table in the Snowflake database specifying the staging area and file
format to be used:
	
CREATE OR REPLACE TABLE GiantFile_ext
(fullrow varchar(4096) )
STAGE_LOCATION=@~/giant_file/
STAGE_FILE_FORMAT= single_column_rows
COMMENT='GiantFile preview table';
	
• Run the COPY command below to create small files while limiting the file size to 2 MB.
This splits the data from a single original data file across multiple small files of
2 MB each.
	
COPY INTO @~/giant_file_parts/
FROM (SELECT * FROM
table(stage(GiantFile_ext ,
pattern => '.*giant_file_sample.csv.gz')))
max_file_size= 2000000;
• Verify the files of the data you unloaded:
ls @~/giant_file_parts;
To place a copy of the S3 giant file parts onto your local machine after they have been split
into several files of 2 MB each, use the below command:
get @~/giant_file_parts/ file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/CleanData/
To remove all the files at the staging bucket location you want to clean up, use the following
command:
	
remove @~/giant_file_parts;
	
To remove a specific set of files from the giant file directory whose names match a regular
expression (e.g., remove all the files whose names end with '.csv.gz'), use the following
command:
remove @~/giant_file pattern='.*.csv.gz';
	
	
Recommended Approach to Debug and Resolve Data Load Issues
Related to Data Problems
	
[WHAT IF YOU HAVE DATA FILES THAT HAVE PROBLEMS]:
This section suggests a recommended flow for iterating between fixing the data
file and loading the data into Snowflake via the COPY command. Snowflake’s COPY command
syntax supports several parameters that are helpful in debugging or bypassing bad data files
that cannot be loaded due to various potential data problems, which may need to be
fixed before the data file can be read and loaded.
[FIRST PASS] LOAD DATA WITH ONE OF THE THREE OPTIONS BELOW:
SKIPPING BAD DATA FILES:
1. Attempt to load with the ON_ERROR = 'SKIP_FILE' error handling parameter. With
this error handling parameter setting, files with errors will be skipped and will not be
loaded.
	
[ON_ERROR=’SKIP_FILE’]	
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='skip_file';
OR
	
SKIPPING BAD DATA FILES IF ERRORS EXCEED A SPECIFIED LIMIT:
1. Attempt to load with more tolerant error handling
ON_ERROR=SKIP_FILE_[error_limit]. With this option for error handling, files with
errors could be partially loaded as long as the number of errors does not exceed
the stated limit. The file is skipped when the number of errors exceeds the stated
error limit.
[ON_ERROR=’SKIP_FILE_[error_limit]’]
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='skip_file_10';
OR
	
PERFORM PARTIAL LOAD FROM THE BAD DATA FILES:
1. Attempt to load with more tolerant error handling using ON_ERROR=’CONTINUE’.
With this option for error handling, files with errors could be partially loaded.
[ON_ERROR=’CONTINUE’]
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='continue';
	
[SECOND PASS] RETURN THE DATA ERRORS:
Validate the files, which were skipped and failed to load from the first pass. This time,
attempt to load the bad data files with VALIDATION_MODE='RETURN_ERRORS'. This
allows the COPY command to return the list of errors within each data file and the position of
those errors.
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
[FIX THE BAD RECORDS]
A. Download the bad data files containing the bad records from the staging area to
your local drive:
get @~/errorsExhibit/exhibit_02.txt.gz file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData
[PREVIEW RECORDS FROM YOUR BAD DATA FILE(S)]
To get visibility into the data file, use an external
table and read a few records from the file to see what a sample record looks like.
CREATE OR REPLACE TABLE PreviewFile_ext
(fullrow varchar(4096) )
STAGE_LOCATION=@~/errorsExhibit/
STAGE_FILE_FORMAT= single_column_rows
COMMENT='Bad data file preview table';
SELECT *
FROM table(stage(PreviewFile_ext ,
pattern => '.*exhibit_02.txt.gz')) LIMIT 10;
Fix the bad records manually and write them to a new data file, or regenerate a new
data file from the data source containing only the bad records that did not load (as
applicable).
B. Upload the fixed bad data file(s) into the staging area for re-loading and attempt
reloading from that fixed file:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData/exhibit_02.txt.gz @~/errorsExhibit/
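Finally, re-run the COPY against the fixed file. This is a minimal sketch assuming the same EXHIBIT table and file format options used earlier; if the file had already been partially loaded, you may also need the FORCE=TRUE option so Snowflake reloads it despite its load history:
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1);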