Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Summit East 2015
About Today’s Talk
• About Me:
• Vida Ha - Solutions Engineer at Databricks.
• Goal:
• For beginning and early-intermediate Spark developers.
• Motivate you to start writing more apps in Spark.
• Share some tips I’ve learned along the way.
Today’s Applications Covered
• Web Logs Analysis
• Basic Data Pipeline - Spark & Spark SQL
• Wikipedia Dataset
• Machine Learning
• Facebook API
• Graph Algorithms
Application 1: Web Log Analysis
Web Logs
• Why?
• Most organizations have web log data.
• The dataset is often too expensive to store in a database.
• Awesome, easy way to learn Spark!
• What?
• Standard Apache Access Logs.
• Web logs flow in each day from a web server.
Reading in Log Files
access_logs = (sc.textFile(DBFS_SAMPLE_LOGS_FOLDER)
               # Call parse_apache_log_line on each line.
               .map(parse_apache_log_line)
               # Caches the objects in memory.
               .cache())
# Call an action on the RDD to actually populate the cache.
access_logs.count()
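The parser itself isn't shown on the slide; here is a hedged sketch of what parse_apache_log_line might look like, assuming standard Apache access log lines and keeping only the fields this talk uses:

import re
from pyspark.sql import Row

# Hypothetical parser; the regex is a simplified take on the
# Apache Common Log Format.
APACHE_LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) \S+" (\d{3}) (\S+)')

def parse_apache_log_line(line):
    match = APACHE_LOG_RE.match(line)
    return Row(ipAddress=match.group(1),
               endpoint=match.group(2),
               responseCode=int(match.group(3)),
               # A '-' content size means no body was returned.
               contentSize=int(match.group(4)) if match.group(4) != '-' else 0)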
Calculate Content Size Stats
content_sizes = (access_logs
                 .map(lambda row: row.contentSize)
                 .cache())  # Cache since multiple queries.
average_content_size = content_sizes.reduce(
    lambda x, y: x + y) / content_sizes.count()
min_content_size = content_sizes.min()
max_content_size = content_sizes.max()
Frequent IP Addresses - Key/Value Pairs
ip_addresses_rdd = (access_logs
                    .map(lambda log: (log.ipAddress, 1))
                    .reduceByKey(lambda x, y: x + y)
                    .filter(lambda s: s[1] > n)
                    .map(lambda s: Row(ip_address=s[0],
                                       count=s[1])))
# Alternately, could just collect() the values.
sqlContext.inferSchema(ip_addresses_rdd) \
          .registerTempTable("ip_addresses")  # Spark 1.2-era API.
Other Statistics to Compute
• Response Code Count.
• Top Endpoints & Distribution.
• …and more.
A great way to learn various Spark transformations and actions, and how
to chain them together. The response code count, for instance, is one
more map/reduceByKey, as sketched below.
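A sketch, assuming the responseCode field from the parser earlier:

response_code_counts = (access_logs
                        .map(lambda log: (log.responseCode, 1))
                        .reduceByKey(lambda x, y: x + y)
                        .collect())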
* BUT Spark SQL makes this much easier!
Better: Register Logs as a Spark SQL Table
sqlContext.sql("""CREATE EXTERNAL TABLE access_logs
    ( ipaddress STRING … contentsize INT … )
    ROW FORMAT
    SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = '^(\\S+) (\\S+) (\\S+) …'
    )
    LOCATION '/tmp/sample_logs'
""")
Content Sizes with Spark SQL
sqlContext.sql("""SELECT
    (SUM(contentsize) / COUNT(*)),  -- average
    MIN(contentsize),
    MAX(contentsize)
  FROM access_logs""")
Frequent IP Addresses with Spark SQL
sqlContext.sql("""SELECT
    ipaddress,
    COUNT(*) AS total
  FROM access_logs
  GROUP BY ipaddress
  HAVING total > N""")
Tip: Use Partitioning
• Only analyze files from days you care about.
sqlContext.sql("""ALTER TABLE access_logs
    ADD PARTITION (date='20150318')
    LOCATION '/logs/2015/3/18'""")

• If your data rolls over between days, perhaps those few missed
logs don't matter.
Tip: Define Last-N-Days Tables for Caching
• Create another table with a similar format.
• Only register partitions for the last N days.
• Each night:
  • Uncache the table.
  • Update the partition definitions.
  • Recache (a refresh sketch follows):
    sqlContext.sql("CACHE TABLE access_logs_last_7_days")
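A hedged sketch of that nightly refresh, assuming date-named partitions and a hypothetical last_n_days() helper that returns the date strings to keep:

sqlContext.sql("UNCACHE TABLE access_logs_last_7_days")
for date in last_n_days(7):  # hypothetical helper, e.g. ['20150312', ..., '20150318']
    sqlContext.sql(
        "ALTER TABLE access_logs_last_7_days "
        "ADD IF NOT EXISTS PARTITION (date='%s') "
        "LOCATION '/logs/%s'" % (date, date))
# Expired partitions would be removed similarly with DROP PARTITION.
sqlContext.sql("CACHE TABLE access_logs_last_7_days")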
Tip: Monitor the Pipeline with Spark SQL
• Detect if your batch jobs are taking too long.
• Programmatically create a temp table with stats from
one run.
sqlContext.sql("""CREATE TABLE IF NOT EXISTS
  pipelineStats (runStart INT, runDuration INT)""")
sqlContext.sql("""INSERT INTO TABLE pipelineStats
  SELECT runStart, runDuration FROM oneRun LIMIT 1""")
• Coalesce the table from time to time.
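Flagging slow runs is then a simple query (the one-hour threshold is purely illustrative):

slow_runs = sqlContext.sql(
    "SELECT runStart, runDuration FROM pipelineStats "
    "WHERE runDuration > 3600")  # seconds; pick your own threshold
slow_runs.collect()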
Demo: Web Log Analysis
Application: Wikipedia
Tip: Use Spark to Parallelize Downloading
Wikipedia can be downloaded in one giant file, or you can
download the 27 parts.
val articlesRDD = sc.parallelize(articlesToRetrieve.toList, 4)
val retrieveInPartitions = (iter: Iterator[String]) => {
  iter.map(article => retrieveArticleAndWriteToS3(article)) }
val fetchedArticles =
  articlesRDD.mapPartitions(retrieveInPartitions).collect()
Processing XML data
• Excessively large (> 1GB) compressed XML data is hard
to process.
• Not easily splittable.
• Solution: Break the dump into text files with one XML element
per line, as sketched below.
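A hedged sketch of the parsing side, assuming the dump has already been split so each line holds one complete <page> element (the folder name and fields are illustrative):

import xml.etree.ElementTree as ET

pages = sc.textFile(WIKIPEDIA_PAGES_PER_LINE_FOLDER)  # hypothetical path

def parse_page(line):
    # Each line is one well-formed <page>...</page> element.
    root = ET.fromstring(line)
    return (root.findtext("title"), root.findtext(".//text"))

titles_and_text = pages.map(parse_page)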
ETL-ing your data with Spark
• Use an XML parser to pull out fields of interest in the
XML document.
• Save the data in Parquet format for faster querying.
• Register the Parquet files as a Spark SQL table, since
there is a clearly defined schema.
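A sketch of both steps with the Spark 1.2-era APIs (paths are illustrative; titles_and_text is the parsed RDD from the previous sketch):

from pyspark.sql import Row

articles = titles_and_text.map(
    lambda pair: Row(title=pair[0], text=pair[1]))
schema_rdd = sqlContext.inferSchema(articles)
schema_rdd.saveAsParquetFile("/tmp/wikipedia_parquet")

# Later, point Spark SQL at the Parquet files and register a table.
sqlContext.parquetFile("/tmp/wikipedia_parquet") \
          .registerTempTable("wiki")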
Using Spark for Fast Data Exploration
• CACHE the dataset for faster querying.
• Interactive programming experience.
• Use a mix of Python or Scala combined with SQL to
analyze the dataset.
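For instance, against the hypothetical wiki table above (LENGTH may require a HiveContext, depending on the Spark version):

sqlContext.sql("CACHE TABLE wiki")
longest = sqlContext.sql(
    "SELECT title, LENGTH(text) AS len "
    "FROM wiki ORDER BY len DESC LIMIT 10").collect()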
Tip: Use MLlib to Learn from the Dataset
• Wikipedia articles are a rich dataset for the English
language.
• Word2Vec is a simple algorithm for learning synonyms and
can be applied to the Wikipedia articles.
• Try out your favorite ML/NLP algorithms!
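A minimal Word2Vec sketch with MLlib (the whitespace tokenizer is just for illustration; real text needs more cleanup):

from pyspark.mllib.feature import Word2Vec

corpus = titles_and_text.map(lambda pair: pair[1].split())
model = Word2Vec().fit(corpus)
model.findSynonyms("computer", 5)  # top 5 (word, similarity) pairs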
Demo: Wikipedia App
Application: Facebook API
Tip: Use Spark to Scrape Facebook Data
• Use Spark to make Facebook API requests for friends of
friends in parallel.
• NOTE: The latest Facebook API will only show friends who
have also enabled the app.
• If you build a Facebook App and get more users to
accept it, you can build a more complete picture of the
social graph!
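A hedged sketch of the fan-out, where my_friend_ids is the seed list and fetch_friends_of is a hypothetical helper wrapping the Graph API HTTP call:

friend_ids = sc.parallelize(my_friend_ids, 8)  # 8 partitions of parallel requests

friends_of_friends = (friend_ids
                      .flatMap(lambda fid: fetch_friends_of(fid))
                      .distinct()
                      .collect())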
Tip: Use GraphX to Learn from the Data
• Use the PageRank algorithm to determine who's the
most popular**.
• Output user data: Facebook user id to name.
• Output edges: user id to user id.

** In this case it's my friends, so I'm clearly the most
popular.
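GraphX's built-in PageRank is essentially a one-liner in Scala; to stay in Python here, this is a plain-RDD PageRank sketch over hypothetical (userId, friendId) edge pairs rather than the GraphX API itself:

links = edges.groupByKey().cache()      # userId -> iterable of friend ids
ranks = links.mapValues(lambda _: 1.0)  # start every user at rank 1.0

def contributions(pair):
    neighbors, rank = pair[1]
    return [(n, rank / len(neighbors)) for n in neighbors]

for _ in range(10):  # 10 iterations is a common default
    contribs = links.join(ranks).flatMap(contributions)
    ranks = (contribs.reduceByKey(lambda x, y: x + y)
                     .mapValues(lambda r: 0.15 + 0.85 * r))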
Demo: Facebook API
Conclusion
• I hope this talk has inspired you to want to write Spark
applications on your favorite dataset.
• Hacking (and making mistakes) is the best way to learn.
• If you want to walk through some examples, see the
Databricks Spark Reference Applications:
• https://github.com/databricks/reference-apps
THE END
