Taming The Data Load/Unload in Snowflake
Sample Code and Best Practice
(Faysal Shaarani)

Loading Data Into Your Snowflake Database(s) from Raw Data Files
	
[1. CREATE YOUR APPLICABLE FILE FORMAT]:
The syntax in this section allows you to create your CSV file format if you are loading
data from CSV files. Please note that the backslash escapes are coded for use from the
sfsql command line, not from the UI. If this FILE FORMAT below is to be used from the
Snowflake UI, the double-backslash escape occurrences (e.g., \\n) should be changed to their
single-backslash form (e.g., \n).
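As a small illustration of that note (only the backslash convention differs between the two clients), the record delimiter in the file format would be written as follows:
-- From the sfsql command line:
RECORD_DELIMITER = '\\n'
-- From the Snowflake UI:
RECORD_DELIMITER = '\n'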
[CSV FILE FORMAT]
CREATE OR REPLACE FILE FORMAT DEMO_DB.PUBLIC.CSV
-- Comma field delimiter and \\n record terminator
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
TRIM_SPACE = false
ERROR_ON_COLUMN_COUNT_MISMATCH = true
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('');
	
The syntax in the section below allows you to create a JSON file format if you are loading
JSON data into Snowflake.
	
[JSON FILE FORMAT]
CREATE OR REPLACE FILE FORMAT DEMO_DB.PUBLIC.JSON
TYPE ='JSON'
COMPRESSION = 'AUTO'
ENABLE_OCTAL = false
ALLOW_DUPLICATE = false
STRIP_OUTER_ARRAY = false;
[2. CREATE YOUR DESTINATION TABLE]:
Pre-create your table before loading the CSV data into that table.
CREATE OR REPLACE TABLE exhibit
(Id STRING
,Title STRING
,Year NUMBER
,Collection_Id STRING
,Timeline_Id STRING
,Depth INT);
CREATE OR REPLACE TABLE timelines
(Id STRING
,Title VARCHAR
,Regime STRING
,FromYear NUMBER
,ToYear NUMBER
,Height NUMBER
,Timeline_Id STRING
,Collection_Id STRING
,ForkNode NUMBER
,Depth NUMBER
,SubtreeSize NUMBER);
	
If you are using data files that have been staged on the Snowflake Customer
Account S3 bucket assigned to your company:
	
Run COPY Command To Load Data From Raw CSV Files
Load the data from your CSV file into the pre-created EXHIBIT table. If a data error is
encountered on any of the records, continue loading what you can. If you do not specify
ON_ERROR, the default is to skip the file on the first error it encounters on any of the
records in that file. The example below loads whatever it can, skipping any bad
records in the file.
	
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = CSV ON_ERROR='continue';
OR
	
Load the data into the pre-created EXHIBIT table from several CSV files matching the file
name regular expression shown in the sample code below:
	
COPY INTO exhibit
FROM @~/errorsExhibit
PATTERN='.*0[1-2].txt.gz'
FILE_FORMAT = CSV ON_ERROR='continue';
To list the files under the CleanData subdirectory of the @~ staging area for
your Snowflake Beta Customer account, while in the sfsql command line, use the following
command:
ls @~/CleanData
To list all files whose names match the regular expression specified in
the PATTERN parameter, use the command below:
ls @~/CleanData PATTERN='.*0[1-2].txt.gz';
	
Verify that the data was loaded successfully into the EXHIBIT table.
	
	 Select * from EXHIBIT;
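Because the loads above use ON_ERROR='continue', any bad records are silently skipped. As a hedged follow-up check (assuming the VALIDATE table function is available in your Snowflake release), you can list the records rejected by the most recent COPY into EXHIBIT:
SELECT * FROM TABLE(VALIDATE(exhibit, JOB_ID => '_last'));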
Run COPY Command to Load/Parse JSON Data from Raw Staged Files
[1. Upload JSON File Into The Customer Account's S3 Staging Area]
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/json/json_sample_data @~/json/
[2. Create an External Table with a VARIANT Column To Contain The JSON Data]
CREATE OR REPLACE TABLE public.json_data_table_ext
(json_data variant)
STAGE_LOCATION='@~'
STAGE_FILE_FORMAT=demo_db.public.json
COMMENT='json Data preview table';
[3. COPY the JSON Raw Data Into the Table]
COPY INTO json_data_table_ext
FROM @~/json/json_sample_data
FILE_FORMAT = 'JSON' ON_ERROR='continue';
JSON FILE CONTENT:
Validate that the data in the JSON raw file got loaded into the table
select * from public.json_data_table_ext;
Output:
{ "root": [ { "age": 22, "children": [ { "age": "6", "gender": "Female", "name": "Jane" }, { "age": "15", ...
select json_data:root[0].kind, json_data:root[0].fullName,
json_data:root[0].age from public.json_data_table_ext ;
Output:
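To drill further into the nested JSON, a LATERAL FLATTEN query can expand the children array. This is a minimal sketch based on the sample document shown above; the element names (name, age) are taken from that sample:
select f.value:name::string as child_name,
       f.value:age::string  as child_age
from public.json_data_table_ext,
     lateral flatten(input => json_data:root[0].children) f;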
If you are using data files that have been staged on your own company’s Amazon S3
bucket:
Run COPY Command To Load Data From Raw CSV Files
The syntax below is needed to create a stage ONLY if you are using your own company’s
Amazon S3 bucket. If you are using your Snowflake-assigned bucket, you do not need to
create a stage object.
Create the staging database object.
	
CREATE OR REPLACE STAGE DEMO_DB.PUBLIC.SAMPLE_STAGE
URL = 'S3://<YOUR_COMPANY_NAME>/<SUBFOLDER NAME>/'
CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY VALUE'
AWS_SECRET_KEY = 'SECRET KEY VALUE')
COMMENT = 'Stage object pointing to the customer''s own AWS S3 bucket. Independent of Snowflake';
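After creating the stage, a quick sanity check (assuming your files are already in the bucket) is to list what Snowflake can see through the stage:
ls @DEMO_DB.PUBLIC.SAMPLE_STAGE;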
	
Load the data from your CSV file into the pre-created EXHIBIT table.
COPY INTO exhibit
FROM @DEMO_DB.PUBLIC.SAMPLE_STAGE/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = CSV ON_ERROR='continue';
OR
	
Load the data into the pre-created EXHIBIT table from several CSV files matching the file
name pattern regular expression stated in the sample code below.
	
COPY INTO exhibit
FROM @DEMO_DB.PUBLIC.SAMPLE_STAGE/errorsExhibit
PATTERN='.*0[1-2].txt.gz'
FILE_FORMAT = CSV ON_ERROR='continue';
Verify that the data was loaded successfully into the EXHIBIT table.
Select * from EXHIBIT;
Run COPY Command to Load/Parse JSON Data from Raw Staged Files
	
[1. Create a Stage Database Object Pointing to The Location of The JSON File]
The syntax below is needed to create a stage ONLY if you are using your own company’s
Amazon S3 bucket. If you are using your Snowflake-assigned bucket, you do not need to
create a stage object.
CREATE OR REPLACE STAGE DEMO_DB.PUBLIC.STAGE_JSON
URL = 'S3://<YOUR_COMPANY_NAME>/<SUBFOLDER NAME>/'
CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY VALUE'
AWS_SECRET_KEY = 'SECRET KEY VALUE')
COMMENT = 'Stage object pointing to the customer''s own AWS S3 bucket. Independent of Snowflake';
Place your file in the staging location defined by the above command:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/json/json_sample_data @stage_JSON/json/
[2. Create an External Table with a VARIANT Column To Contain The JSON Data]
CREATE OR REPLACE TABLE public.json_data_table_ext
(json_data variant)
STAGE_LOCATION=demo_db.public.stage_json
STAGE_FILE_FORMAT= demo_db.public.json
COMMENT='json Data preview table';
[3. COPY the JSON Raw Data Into the Table]
COPY INTO json_data_table_ext
FROM @stage_json/json/json_sample_data
FILE_FORMAT = 'JSON' ON_ERROR='continue';
JSON FILE CONTENT:
Validate that the data in the JSON raw file got loaded into the table
select * from public.json_data_table_ext;
Output:
{ "root": [ { "age": 22, "children": [ { "age": "6", "gender": "Female", "name": "Jane" }, { "age": "15", ...
select json_data:root[0].kind, json_data:root[0].fullName,
json_data:root[0].age from public.json_data_table_ext ;
Output:
	
Using Snowflake to Validate Your Data Files
	
In this section, we will go over validating the raw data files before performing the actual data
load. To illustrate this, we will attempt to load raw data files that intentionally contain errors
and therefore fail to load into Snowflake. The VALIDATION_MODE option
on the COPY command processes the data without loading it into the destination table in
Snowflake.
In the following example, the PUT command stages the exhibit*.txt files
at the default S3 staging location set up for the Beta Customer Account in Snowflake, and
illustrates how files can be loaded from a sub-directory below the root staging directory of a
Snowflake Beta Customer Account.
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData/exhibit*.txt @~/errorsExhibit/;
	
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
source | target | source_size | target_size | source_compression | target_compression | status | details |
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
exhibit_03.txt | exhibit_03.txt.gz | 8353 | 3734 | NONE | GZIP | UPLOADED | |
exhibit_01.txt | exhibit_01.txt.gz | 14733 | 6207 | NONE | GZIP | UPLOADED | |
exhibit_02.txt | exhibit_02.txt.gz | 14730 | 6106 | NONE | GZIP | UPLOADED | |
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
3 rows in result (first row: 1.501 sec; total: 1.504 sec)	
Below are three possible raw data validation scenarios and sample code:
	
1. The following example previews 10 records from the first raw data
file, exhibit_01.txt. This file does not contain any errors.
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_01.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_10_rows';
2. The following example simulates the scenario of having an extra delimiter in
a record and shows how the resulting errors would be displayed.
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
	
3. The following example simulates the scenario of having a column value of
the wrong data type and shows how the error is reported after running
the COPY command below:
COPY INTO exhibit FROM @~/errorsExhibit/exhibit_03.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
	
Using Snowflake to Unload Your Snowflake Data to Files
	
To create a data file on S3 from a table in the Snowflake database (here, the EXHIBIT table), use the command below:
COPY INTO @~/giant_file/ from exhibit;
OR to overwrite the existing files in the same directory, use the OVERWRITE option as in the
command below:
COPY INTO @~/giant_file/ from exhibit overwrite=true;
Please note that, by default, Snowflake unloads the data from the table into multiple files of
about 16 MB each. If you want your data to be unloaded to a single file, you need to
use the SINGLE option on the COPY command, as in the example below:
COPY INTO @~/giant_file/ from exhibit
Single=true
overwrite=true;
Please note that AWS S3 has a limit of 5 GB on the size of any file you stage on S3. You can
use the optional MAX_FILE_SIZE option (in bytes) to change the Snowflake default file size. Use
the command below if you want to specify bigger or smaller file sizes than the Snowflake
default, as long as you do not exceed the AWS S3 maximum file size. For example, the
command below unloads the data in the EXHIBIT table into files of 50 MB each:
COPY INTO @~/giant_file/ from exhibit
max_file_size= 50000000 overwrite=true;
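You can also unload the result of a query rather than an entire table. The sketch below assumes the same EXHIBIT table and stage path used above; the column list and WHERE filter are only illustrative:
COPY INTO @~/giant_file/
FROM (SELECT id, title, year FROM exhibit WHERE year >= 1900)
max_file_size= 50000000 overwrite=true;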
Using Snowflake to Split Your Data Files Into Smaller Files
If you are using data files that have been staged on the Snowflake Customer
Account S3 bucket assigned to your company:
When loading data into Snowflake, it is recommended that the raw data be split into as many
files as possible to maximize the parallelization of the data loading process and thus
complete the data load in the shortest amount of time possible. If your raw data is in one
raw data file, you can use Snowflake to split that large data file into multiple files before
loading the data into Snowflake. Below are the steps for achieving this:
• Place the Snowflake sample giant_file from your local machine's directory into the
@~/giant_file/ S3 bucket using the following command:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/CleanData/giant_file_sample.csv.gz @~/giant_file/;
	
• Create a single-column file format for examining the data in the
data file.
CREATE OR REPLACE FILE FORMAT single_column_rows
TYPE='CSV'
SKIP_HEADER=0
RECORD_DELIMITER='\\n'
TRIM_SPACE=false
DATE_FORMAT='AUTO'
TIMESTAMP_FORMAT='AUTO'
FIELD_DELIMITER='NONE'
FIELD_OPTIONALLY_ENCLOSED_BY='NONE'
ESCAPE_UNENCLOSED_FIELD='\\134'
NULL_IF=('')
COMMENT='copy each line into single-column row';
	
• Create an external table in the Snowflake database specifying the staging area and file
format to be used:
	
CREATE OR REPLACE TABLE GiantFile_ext
(fullrow varchar(4096) )
STAGE_LOCATION=@~/giant_file/
STAGE_FILE_FORMAT= single_column_rows
COMMENT='GiantFile preview table';
	
• Run the COPY command below to create small files while limiting the file size to 2 MB.
This splits the data from a single original data file across multiple small files of
2 MB each.
	
COPY INTO @~/giant_file_parts/
FROM (SELECT * FROM
table(stage(GiantFile_ext ,
pattern => '.*giant_file_sample.csv.gz')))
max_file_size= 2000000;
• Verify the files of the data you unloaded:
ls @~/giant_file_parts;
To place a copy of the S3 giant file parts onto your local machine after they have been split
into several files of 2 MB each, use the below command:
get @~/giant_file_parts/ file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/CleanData/
To remove all the files at the staging bucket location you want to clean up, use the following
command:
	
remove @~/giant_file_parts;
	
To remove a specific set of files from the giant file directory whose names match a regular
expression (e.g., remove all the files whose names end with '.csv.gz'), use the following
command:
remove @~/giant_file pattern='.*.csv.gz';
	
	
Recommended Approach to Debug and Resolve Data Load Issues
Related to Data Problems
	
[WHAT IF YOU HAVE DATA FILES THAT HAVE PROBLEMS]:
This section suggests a recommended flow for iterating between fixing the data
file and loading the data into Snowflake via the COPY command. Snowflake’s COPY command
syntax supports several parameters that are helpful in debugging or bypassing bad data files
that cannot be loaded due to various potential data problems, which may need to be
fixed before the data file can be read and loaded.
[FIRST PASS] LOAD DATA WITH ONE OF THE THREE OPTIONS BELOW:
SKIPPING BAD DATA FILES:
1. Attempt to load with the ON_ERROR = 'SKIP_FILE' error handling parameter. With
this error handling parameter setting, files with errors will be skipped and will not be
loaded.
	
[ON_ERROR=’SKIP_FILE’]	
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='skip_file';
OR
	
SKIPPING BAD DATA FILES IF ERRORS EXCEED A SPECIFIED LIMIT:
1. Attempt to load with more tolerant error handling
ON_ERROR=SKIP_FILE_[error_limit]. With this option for error handling, files with
errors could be partially loaded as long as the number of errors does not exceed
the stated limit. The file is skipped when the number of errors exceeds the stated
error limit.
[ON_ERROR=’SKIP_FILE_[error_limit]’]
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='skip_file_10';
OR
	
PERFORM PARTIAL LOAD FROM THE BAD DATA FILES:
1. Attempt to load with more tolerant error handling using ON_ERROR=’CONTINUE’.
With this option for error handling, files with errors could be partially loaded.
[ON_ERROR=’CONTINUE’]
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
ON_ERROR='continue';
	
[SECOND PASS] RETURN THE DATA ERRORS:
Validate the files, which were skipped and failed to load from the first pass. This time,
attempt to load the bad data files with VALIDATION_MODE='RETURN_ERRORS'. This
allows the COPY command to return the list of errors within each data file and the position of
those errors.
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT =
(FIELD_DELIMITER = '|' SKIP_HEADER = 1)
VALIDATION_MODE='return_errors';
[FIX THE BAD RECORDS]
A. Download the bad data files containing the bad records from the staging area to
your local drive:
get @~/errorsExhibit/exhibit_02.txt.gz file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData
[PREVIEW RECORDS FROM YOUR BAD DATA FILE(S)]
To get visibility into the data file, use an external
table and read a few records from the file to see what a sample record looks like.
CREATE OR REPLACE TABLE PreviewFile_ext
(fullrow varchar(4096) )
STAGE_LOCATION=@~/errorsExhibit/
STAGE_FILE_FORMAT= single_column_rows
COMMENT='Bad data file preview table';
SELECT *
FROM table(stage(PreviewFile_ext ,
pattern => '.*exhibit_02.txt.gz')) LIMIT 10;
Fix the bad records manually and write them to a new data file, or regenerate a new
data file from the data source containing only the bad records that did not load (as
applicable).
B. Upload the fixed bad data file(s) into the staging area for re-loading and attempt
reloading from that fixed file:
PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData/exhibit_02.txt.gz @~/errorsExhibit/
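Finally, re-run the COPY against the fixed file. This is a minimal sketch assuming the same EXHIBIT table and file format options used earlier; if the file had already been partially loaded, you may also need the FORCE=TRUE option so Snowflake reloads it despite its load history:
COPY INTO exhibit
FROM @~/errorsExhibit/exhibit_02.txt.gz
FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1);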