From the course: Big Data Analytics with Hadoop and Apache Spark


Pushing down projections

- [Instructor] In this chapter, we will review some techniques that can be used during data processing to optimize Spark and HDFS performance. The code for this chapter is available in the notebook code 05 XX Optimizing Data Processing. We first create a Spark session to use throughout the rest of the chapter, setting the default parallelism to eight. We start with pushing down projections. A projection here means the set or subset of columns selected from a dataset. Typically, we read an entire file with all of its columns into memory, but later use only a subset of those columns for computations. During lazy evaluation, Spark is smart enough to identify the subset of columns that will actually be used and fetch only those into memory. This is called projection pushdown. In this example, we read the entire Parquet file into the sales data DataFrame. Later, we select only the product and quantity columns. Spark identifies this and fetches only those two columns into memory. Let's…
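A minimal PySpark sketch of the steps described above. The file path `data/sales.parquet` and the DataFrame name `sales_data` are assumptions based on the narration; the actual notebook is not shown here.

```python
from pyspark.sql import SparkSession

# Create the Spark session for the chapter, with default parallelism
# set to eight as described in the narration.
spark = (
    SparkSession.builder
    .appName("OptimizingDataProcessing")
    .config("spark.default.parallelism", 8)
    .getOrCreate()
)

# Read the entire Parquet file. Because of lazy evaluation, no data is
# actually fetched into memory at this point.
sales_data = spark.read.parquet("data/sales.parquet")  # hypothetical path

# Select only the two columns that the computation actually uses.
product_quantity = sales_data.select("product", "quantity")

# Inspect the physical plan: the Parquet scan's ReadSchema lists only
# product and quantity, confirming the projection was pushed down to
# the file reader.
product_quantity.explain()
```

Because Parquet is a columnar format, the reader can skip the unselected columns entirely on disk, which is what makes this pushdown worthwhile.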
