From the course: Big Data Analytics with Hadoop and Apache Spark
Pushing down projections
- [Instructor] In this chapter, we will review some of the techniques that can be used during data processing to optimize Spark and HDFS performance. The code for this chapter is available in the notebook, code 05 XX Optimizing Data Processing. We create a Spark session first to use in the rest of the chapter, and we set the default parallelism to eight. We start with pushing down projections. A projection here means the set or subset of columns selected from a dataset. Typically, we read an entire file with all its columns into memory and then use only a subset of those columns for later computations. During lazy evaluation, Spark is smart enough to identify the subset of columns that will actually be used and fetch only those into memory. This is called projection pushdown. In this example, we read the entire Parquet file into the sales data DataFrame. Later, we select only the product and quantity columns. Spark identifies this and fetches only those two columns into memory. Let's…
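The sketch below illustrates the pattern the transcript describes. It assumes a Parquet file at data/sales.parquet with product and quantity columns; the path, the sales_data variable name, and the column names are hypothetical stand-ins for the course's exercise data, not the instructor's exact code.

```python
from pyspark.sql import SparkSession

# Create the Spark session and set default parallelism to eight,
# as described in the transcript.
spark = (
    SparkSession.builder
    .appName("OptimizingDataProcessing")
    .config("spark.default.parallelism", 8)
    .getOrCreate()
)

# Read the entire Parquet file. Because of lazy evaluation,
# no data is actually fetched yet.
sales_data = spark.read.parquet("data/sales.parquet")  # hypothetical path

# Later, select only the two columns we actually need.
product_qty = sales_data.select("product", "quantity")

# Inspect the physical plan: the scan's ReadSchema lists only
# product and quantity, showing the projection was pushed down
# to the Parquet reader.
product_qty.explain(mode="formatted")
```

Parquet's columnar layout is what makes this effective: each column is stored contiguously, so Spark can read just the selected columns from disk instead of whole rows.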