From the course: Big Data Analytics with Hadoop and Apache Spark


Pushing down projections

- [Instructor] In this chapter, we will review some techniques that can be used during data processing to optimize Spark and HDFS performance. The code for this chapter is available in the notebook code 05 XX Optimizing Data Processing. We first create a Spark session to use throughout the rest of the chapter, setting the default parallelism to eight. We start with pushing down projections. A projection here means the set or subset of columns selected from a dataset. Typically, we read an entire file with all of its columns into memory, but later use only a subset of those columns for computations. During lazy evaluation, Spark is smart enough to identify the subset of columns that will actually be used and fetch only those into memory. This is called projection pushdown. In this example, we read the entire Parquet file into the sales data DataFrame. Later, we select only the product and quantity columns. Spark identifies this and fetches only those two columns into memory. Let's…
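A minimal PySpark sketch of the steps described above. The file path `data/sales.parquet` and the DataFrame name `sales_data` are assumptions based on the narration; the actual notebook is not shown here.

```python
from pyspark.sql import SparkSession

# Create the Spark session for the chapter, with default parallelism
# set to eight as described in the narration.
spark = (
    SparkSession.builder
    .appName("OptimizingDataProcessing")
    .config("spark.default.parallelism", 8)
    .getOrCreate()
)

# Read the entire Parquet file. Because of lazy evaluation, no data is
# actually fetched into memory at this point.
sales_data = spark.read.parquet("data/sales.parquet")  # hypothetical path

# Select only the two columns that the computation actually uses.
product_quantity = sales_data.select("product", "quantity")

# Inspect the physical plan: the Parquet scan's ReadSchema lists only
# product and quantity, confirming the projection was pushed down to
# the file reader.
product_quantity.explain()
```

Because Parquet is a columnar format, the reader can skip the unselected columns entirely on disk, which is what makes this pushdown worthwhile.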
