Load data from GCS to BigTable with GCP Dataproc Serverless
Recently, I needed to transfer data from Google Cloud Storage (GCS) into BigTable using Dataproc Serverless Spark. The combination of Spark and BigTable provides a potent data processing solution. Dataproc Serverless Spark is a fully managed, robust offering within the Google Cloud Platform that lets you build Spark applications without worrying about platform management. BigTable, also part of Google Cloud, is an enterprise-grade, low-latency NoSQL database service.
Problem Statement:
The primary challenge lies in establishing a mapping between Spark structured dataframes and BigTable column families. BigTable can encompass multiple column families, each potentially housing distinct columns.
For illustration purposes, here are the BigTable column family structure and a sketch of the corresponding Spark dataframe structure.
BigTable side:
- Row key: _Key
- Location column family with columns LatD, LatM, LatS, LonD, LonM, LonS
- Address column family with columns NS, EW, City, State
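To make the mapping concrete, here is a minimal PySpark sketch of the Spark dataframe side. The bucket path and the use of a header CSV are hypothetical placeholders, not part of the template itself.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("addressbook-sample").getOrCreate()

# Hypothetical input path; the real location is whatever you ingest from GCS.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("gs://<your-bucket>/addressbook/*.csv"))

# Dataframe-side columns expected by the catalog mapping shown later:
# key, LatD, LatM, LatS, LonD, LonM, LonS, NS, EW, City, State
df.printSchema()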
Solution:
Traditionally, BigTable has provided client libraries that let you iterate through dataframes and prepare rows for ingestion into BigTable. However, this approach involves lengthy, verbose code that can be challenging to maintain over time. Recognizing the need for a native, Spark-friendly solution, we looked for an alternative approach: the Spark BigTable connector. That is where the GCSToBigTable Dataproc template comes in.
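For context, this is roughly what a connector-based write looks like in PySpark. It is a minimal sketch, assuming the Spark BigTable connector is available on the classpath; exact format and option names may vary by connector version, and the placeholder paths and IDs are assumptions. The template wires all of this up for you.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigtable-write-sketch").getOrCreate()

# Hypothetical input; in the template this comes from gcs.bigtable.input.location.
df = spark.read.option("header", True).csv("gs://<your-bucket>/addressbook/*.csv")

# The Catalog JSON (shown later in this post) maps dataframe columns to column families.
catalog = open("/path/to/catalog.json").read()

(df.write
   .format("bigtable")
   .option("catalog", catalog)
   .option("spark.bigtable.project.id", "<bigtable-project-id>")
   .option("spark.bigtable.instance.id", "<bigtable-instance-id>")
   .save())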
GCSToBigTable Configuration:
- Clone the dataproc-templates repository, where you will find the GCSToBigTable template.
- It is assumed that you already have Dataproc Serverless enabled with the required APIs.
- You are required to configure certain properties:
  - GCP_PROJECT: The GCP project in which to run the Dataproc Serverless job.
  - REGION: The region in which to run the Dataproc Serverless job.
  - GCS_STAGING_LOCATION: A GCS location where Dataproc and the template will stage jars/configs.
- You are required to prepare a Catalog JSON file that maps your dataframe columns to BigTable column families. Make sure to pass the table name exactly as it appears in BigTable.
- Provide the necessary template properties.
Sample Catalog JSON file:
{
"table": {"name": "addressbook"},
"rowkey": "id_rowkey",
"columns": {
"key": {"cf": "rowkey", "col": "id_rowkey", "type": "string"},
"LatD": {"cf": "location", "col": "latd", "type": "long"},
"LatM": {"cf": "location", "col": "latm", "type": "long"},
"LatS": {"cf": "location", "col": "lats", "type": "long"},
"LonD": {"cf": "location", "col": "lond", "type": "long"},
"LonM": {"cf": "location", "col": "lonm", "type": "long"},
"LonS": {"cf": "location", "col": "lons", "type": "long"},
"NS": {"cf": "address", "col": "ns", "type": "string"},
"EW": {"cf": "address", "col": "ew", "type": "string"},
"City": {"cf": "address", "col": "city", "type": "string"},
"State": {"cf": "address", "col": "state", "type": "string"}
}
}

Sample execution command. It uses the following template properties:
- project.id : GCP Project ID
- gcs.bigtable.input.location : GCS location of the input files you would like to ingest into BigTable
- gcs.bigtable.input.format : Input file format; can be csv, parquet, or avro
- gcs.bigtable.output.project.id : BigTable project ID
- gcs.bigtable.output.instance.id : BigTable instance ID
- gcs.bigtable.catalog.location : GCS location of the Catalog JSON file
bin/start.sh \
-- --template GCSTOBIGTABLE \
--templateProperty project.id=<gcp-project-id> \
--templateProperty gcs.bigtable.input.location=<gcs file location> \
--templateProperty gcs.bigtable.input.format=<csv|parquet|avro> \
--templateProperty gcs.bigtable.output.instance.id=<bigtable instance Id> \
--templateProperty gcs.bigtable.output.project.id=<bigtable project Id> \
--templateProperty gcs.bigtable.catalog.location="gs://dataproc-templates/conf/employeecatalog.json"

Note: The service account used to submit the job requires BigTable read/write access. Please refer to IAM policies for more details.
After submitting the job, you can refer to the Dataproc Batches UI for your logs and metrics.
You can use BigTable Studio, which is in preview, to analyze the data. Below is the output of my sample data stored in BigTable. Please design a proper row key for your BigTable; refer to the public documentation for more details.
Note: Table data with datatypes such as int and long is stored in a serialized format, but when you read the data back using the Spark BigTable connector, it is shown correctly deserialized.
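As a quick check, here is a sketch of reading the same table back with the Spark BigTable connector, with the same caveats as the earlier write sketch about connector versions, option names, and placeholder IDs. The numeric columns come back as proper longs rather than serialized bytes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigtable-read-sketch").getOrCreate()

# The same Catalog JSON used for the write.
catalog = open("/path/to/catalog.json").read()

df_read = (spark.read
           .format("bigtable")
           .option("catalog", catalog)
           .option("spark.bigtable.project.id", "<bigtable-project-id>")
           .option("spark.bigtable.instance.id", "<bigtable-instance-id>")
           .load())

# Numeric columns (e.g., LatD) are deserialized back to longs here.
df_read.show(5)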
For any queries/suggestions reach out to: dataproc-templates-support-external@googlegroups.com

