Guide: Mrs. Juhi Singh
Submitted by: Hitesh Dua, CSE 4th Year, 05510402711
Apache Spark
Sustained exponential growth, as one of the most active Apache projects
•Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.
•Apache Spark can process data from a variety of data repositories. It supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory.
●Open Source 
●Alternative to MapReduce for certain applications
●A low latency cluster computing system for very large data sets
●Higher level library for stream processing, through Spark Streaming
●Can be up to 100 times faster than MapReduce for
–Iterative algorithms 
–Interactive data mining
•Started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010.
•After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home.
•Codebase size 
Spark : 20,000 LOC 
Hadoop 1.0 : 90,000 LOC
•MapReduce greatly simplified big data analysis. 
•But as soon as it got popular, users wanted more: 
»More complex, multi‐stage applications (e.g. iterative graph algorithms and machine learning) 
»More interactive ad-hoc queries. 
•Both multi‐stage and interactive apps require faster data sharing across parallel jobs.
Spark : Programming Model
•Resilient Distributed Datasets (RDDs) are the basic building block.
–Distributed collections of objects that can be cached in memory across cluster nodes.
–Automatically rebuilt on failure.
•RDD operations
–Transformations: Create a new dataset from an existing one, e.g. map.
–Actions: Return a value to the driver program after running a computation on the dataset, e.g. reduce.
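The transformation/action split and the "rebuilt on failure" behaviour can be sketched with a toy, single-machine model. This is not Spark's API: `ToyRDD`, `lose_partition`, and the round-robin partitioning are hypothetical names and choices made only for illustration, and the toy caches every computed partition, whereas caching in Spark is something the user can choose.

```python
from functools import reduce as fold


class ToyRDD:
    """Single-machine stand-in for an RDD: partitions plus the lineage to rebuild them."""

    def __init__(self, source=None, num_partitions=2, parent=None, fn=None):
        self._source = source      # raw data, only set on the root dataset
        self._n = num_partitions
        self._parent = parent      # lineage: dataset this one was derived from
        self._fn = fn              # lineage: per-partition function to apply
        self._cache = {}           # partition index -> materialized list

    def map(self, f):
        # Transformation: describes a new dataset, computes nothing yet.
        return ToyRDD(num_partitions=self._n, parent=self,
                      fn=lambda part: [f(x) for x in part])

    def filter(self, keep):
        # Transformation: also lazy.
        return ToyRDD(num_partitions=self._n, parent=self,
                      fn=lambda part: [x for x in part if keep(x)])

    def _partition(self, i):
        # Compute partition i on demand, walking the lineage if it is missing.
        if i not in self._cache:
            if self._parent is None:
                self._cache[i] = list(self._source[i::self._n])
            else:
                self._cache[i] = self._fn(self._parent._partition(i))
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a node failure that drops one cached partition.
        self._cache.pop(i, None)

    def reduce(self, op):
        # Action: triggers the computation and returns a value to the caller.
        items = [x for i in range(self._n) for x in self._partition(i)]
        return fold(op, items)


numbers = ToyRDD(source=list(range(1, 11)))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
nothing_computed_yet = (even_squares._cache == {})   # transformations are lazy

total = even_squares.reduce(lambda a, b: a + b)      # the action runs the pipeline

even_squares.lose_partition(0)                       # drop a partition...
total_after_failure = even_squares.reduce(lambda a, b: a + b)  # ...and rebuild it
```

The point of the design is that each dataset remembers how it was derived, so a lost partition can be recomputed from its parent instead of being restored from a replica.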
Spark Stack Extension
Spark powers a stack of high-level tools including
•Spark SQL
•Spark Streaming
•MLlib for machine learning
•GraphX
You can combine these frameworks seamlessly in the same application.
•Spark Streaming is a Spark component that enables processing live streams of data.
•Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
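One common way to process a live stream is to cut it into small batches and run an ordinary batch computation on each one. The sketch below illustrates that idea in plain Python on web-server log lines; it does not use Spark Streaming's API, and the log format, the `micro_batches` helper, and the batch size are assumptions made for the example.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an (in principle unbounded) iterator into small lists."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def status_code(log_line):
    # Hypothetical log format: "<method> <path> <status>"
    return log_line.split()[-1]


# Stand-in for a live feed of web-server log lines.
log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
    "GET /admin 500",
]

running_totals = Counter()   # state carried across batches
per_batch = []               # result of each small batch job

for batch in micro_batches(log_lines, batch_size=2):
    counts = Counter(status_code(line) for line in batch)
    per_batch.append(dict(counts))
    running_totals.update(counts)
```

Here `per_batch` holds the status-code counts for each slice of the stream, while `running_totals` accumulates them over the whole stream seen so far.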
GraphX is a library added in Spark 0.9 that provides an API for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations.
•Allows us to create a directed graph with arbitrary properties attached to each vertex and edge.
•GraphX also provides a set of operators for manipulating graphs.
•Includes a library of common graph algorithms (e.g., PageRank and triangle counting).
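To make "a directed graph with arbitrary properties" and PageRank concrete, here is a small plain-Python sketch. It uses a textbook PageRank formulation rather than GraphX's implementation or API; the vertex and edge data, the `page_rank` function, and its parameters are all hypothetical.

```python
# Property graph: every vertex and every edge carries an arbitrary dict of properties.
vertices = {
    1: {"name": "alice"},
    2: {"name": "bob"},
    3: {"name": "carol"},
}
edges = [
    (1, 2, {"relation": "follows"}),
    (1, 3, {"relation": "follows"}),
    (2, 3, {"relation": "follows"}),
    (3, 1, {"relation": "follows"}),
]


def page_rank(vertices, edges, damping=0.85, iterations=50):
    """Textbook PageRank: ranks form a probability distribution over vertices."""
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _dst, _props in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        incoming = {v: 0.0 for v in vertices}
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(ranks[v] for v in vertices if out_degree[v] == 0)
        for src, dst, _props in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {
            v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return ranks


ranks = page_rank(vertices, edges)
most_important = vertices[max(ranks, key=ranks.get)]["name"]
```

Each iteration sends a vertex's rank along its outgoing edges and sums what arrives at each destination, which is the kind of per-edge, per-vertex step that graph-parallel systems distribute across a cluster.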
MLlib provides multiple types of machine learning algorithms, including binary classification, regression, clustering and collaborative filtering.
•Supports functionality such as model evaluation and data import.
•Designed to scale out across a cluster.
•MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
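The benefit of iterating over the same data can be illustrated with gradient descent on a one-parameter linear model. This is a plain-Python sketch, not MLlib code: the sample points, the learning rate, and the function names are assumptions chosen for the example.

```python
# Hypothetical (x, y) samples that roughly follow y = 2x.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]


def mean_squared_error(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def gradient_step(w, data, learning_rate):
    # One full pass over the data: gradient of the mean squared error w.r.t. w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - learning_rate * grad


def fit(data, passes, learning_rate=0.01):
    w = 0.0
    for _ in range(passes):
        w = gradient_step(w, data, learning_rate)
    return w


w_one_pass = fit(points, passes=1)
w_many_passes = fit(points, passes=200)

error_one_pass = mean_squared_error(w_one_pass, points)
error_many_passes = mean_squared_error(w_many_passes, points)
```

Every pass rescans the whole dataset, so keeping that dataset in memory between passes, as described for RDDs earlier in the deck, is what makes many passes affordable.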
Spark SQL provides support for interacting with Spark via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HiveQL). 
•Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations. 
•Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark. 
•Spark SQL includes a server mode with industry standard JDBC and ODBC connectivity.
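The idea of translating a SQL query into dataset operations can be sketched by running the same query two ways: once through a SQL engine and once as a hand-written filter, map, and per-key reduce. The example uses Python's built-in `sqlite3` purely as a stand-in SQL engine; it is not Spark SQL, and the `employees` table and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical table rows: (name, dept, salary)
rows = [
    ("alice", "engineering", 120),
    ("bob", "engineering", 100),
    ("carol", "sales", 90),
    ("dave", "sales", 70),
]

# 1) The query expressed in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
sql_result = conn.execute(
    "SELECT dept, SUM(salary) FROM employees "
    "WHERE salary >= 80 GROUP BY dept ORDER BY dept"
).fetchall()
conn.close()

# 2) The same query hand-translated into dataset operations.
filtered = [r for r in rows if r[2] >= 80]                    # WHERE  -> filter
pairs = [(dept, salary) for _name, dept, salary in filtered]  # SELECT -> map
totals = {}
for dept, salary in pairs:                                    # GROUP BY + SUM -> reduce by key
    totals[dept] = totals.get(dept, 0) + salary
ops_result = sorted(totals.items())                           # ORDER BY -> sort
```

Both paths produce the same per-department totals, which is the property a SQL-to-operations translation has to preserve.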