Apache Spark


Apache Spark is an open-source parallel processing framework for large-scale data analytics, significantly outperforming MapReduce in speed for certain applications. It supports both in-memory and disk-based processing, and offers a range of high-level libraries for tasks such as machine learning (MLlib), streaming (Spark Streaming), and graph processing (GraphX). Spark SQL facilitates querying structured data through SQL and integrates various Spark components seamlessly.


Guide: Mrs. Juhi Singh
Submitted by: Hitesh Dua, CSE 4th Year, 05510402711

Spark has seen sustained exponential growth and is one of the most active Apache projects.

• Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.
• Apache Spark can process data from a variety of data repositories. It supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory.

● Open source
● Alternative to MapReduce for certain applications
● A low-latency cluster computing system for very large data sets
● Higher-level library for stream processing, through Spark Streaming
● May be up to 100 times faster than MapReduce for:
– Iterative algorithms
– Interactive data mining

• Started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010.
• After being released, Spark grew a developer community on GitHub and moved to the Apache Software Foundation in 2013, which became its permanent home.
• Codebase size:
– Spark: 20,000 LOC
– Hadoop 1.0: 90,000 LOC

• MapReduce greatly simplified big data analysis.
• But as soon as it became popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
• Both multi-stage and interactive apps require faster data sharing across parallel jobs.

Spark: Programming Model
• Resilient Distributed Datasets (RDDs) are the basic building block.
– Distributed collections of objects that can be cached in memory across cluster nodes.
– Automatically rebuilt on failure.
• RDD operations
– Transformations: create a new dataset from an existing one, e.g. map.
– Actions: return a value to the driver program after running a computation on the dataset, e.g. reduce.
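
The split between transformations and actions can be sketched in plain Python. This is a toy model only, not Spark's API or implementation: the `ToyRDD` class, its partition layout, and the sample numbers are all assumptions made for illustration. Transformations only record a step in a lineage; actions replay that lineage over each partition, which is also what would let a lost partition be rebuilt.

```python
import functools


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, for illustration only)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions   # source data, split into partitions
        self._lineage = lineage         # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: returns a new dataset and only records the step.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, nothing is computed here.
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute(self, partition):
        # Replays the recorded lineage over one partition. Because the recipe
        # is kept, a lost partition could be rebuilt by running this again.
        items = iter(partition)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        # Action: runs the computation and returns all elements to the caller.
        return [x for p in self._partitions for x in self._compute(p)]

    def reduce(self, fn):
        # Action: combines all elements into a single value.
        return functools.reduce(fn, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])   # two partitions
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = squares_of_evens.reduce(lambda a, b: a + b)
print(total)
```

Nothing is computed when `filter` and `map` are called; work happens only once `reduce` or `collect` asks for a result.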

Spark Stack Extension
Spark powers a stack of high-level tools including:
• Spark SQL
• Spark Streaming
• MLlib for machine learning
• GraphX
You can combine these frameworks seamlessly in the same application.

• Spark Streaming is a Spark component that enables processing live streams of data.
• Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.
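
As a rough illustration of processing a live stream in small batches, the sketch below uses only the Python standard library. It is not Spark Streaming's API: the log lines, the `log_stream` and `micro_batches` helpers, and the count-based batch size are assumptions (Spark Streaming groups records by time interval rather than by count).

```python
from collections import Counter
from itertools import islice


def log_stream():
    """Hypothetical web server log lines arriving over time."""
    yield from [
        "GET /index.html 200",
        "GET /missing 404",
        "POST /login 200",
        "GET /index.html 200",
        "GET /broken 500",
    ]


def micro_batches(stream, batch_size):
    """Groups an incoming stream into small batches, handled one at a time."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


running_counts = Counter()
for batch in micro_batches(log_stream(), batch_size=2):
    statuses = [line.split()[-1] for line in batch]   # per-batch transformation
    running_counts.update(statuses)                   # state carried across batches
    print(dict(running_counts))
```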

GraphX is a library added in Spark 0.9 that provides an API for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations.
• Allows us to create a directed graph with arbitrary properties attached to each vertex and edge.
• Provides a set of operators for manipulating graphs.
• Includes a library of common graph algorithms (e.g., PageRank and triangle counting).
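
To make the idea of a property graph and a graph algorithm concrete, here is a minimal single-machine sketch of PageRank by power iteration. It does not use GraphX: the vertex and edge data, the `pagerank` function, and the damping factor of 0.85 are assumptions chosen for illustration, and the sketch assumes every vertex has at least one outgoing edge.

```python
# Hypothetical toy graph: vertices and edges carry arbitrary properties.
vertices = {
    1: {"name": "alice"},
    2: {"name": "bob"},
    3: {"name": "carol"},
}
edges = [
    (1, 2, {"relation": "follows"}),
    (2, 3, {"relation": "follows"}),
    (3, 1, {"relation": "follows"}),
    (1, 3, {"relation": "follows"}),
]


def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed graph with no dangling vertices."""
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _dst, _props in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Each vertex sends its rank, split evenly, along its outgoing edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst, _props in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {v: (1 - damping) / n + damping * incoming[v] for v in vertices}
    return ranks


ranks = pagerank(vertices, edges)
top = max(ranks, key=ranks.get)
print(vertices[top]["name"], round(ranks[top], 3))
```

Each iteration reads the whole edge list again, which is the kind of repeated pass over the same data that benefits from keeping it cached in memory.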

MLlib provides multiple types of machine learning algorithms, including binary classification, regression, clustering and collaborative filtering.
• Supports functionality such as model evaluation and data import.
• Designed to scale out across a cluster.
• MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
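
The benefit of iteration can be sketched with a small gradient-descent fit in plain Python. This is not MLlib code: the data points, the learning rate, and the helper names are assumptions, and a single gradient step stands in, loosely, for a one-pass approach.

```python
def mean_squared_error(points, w, b):
    """Average squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)


def gradient_step(points, w, b, lr):
    """One full pass over the data: compute gradients and update w and b."""
    n = len(points)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
    return w - lr * grad_w, b - lr * grad_b


def fit(points, passes, lr=0.05):
    """Runs the given number of passes starting from w = 0, b = 0."""
    w, b = 0.0, 0.0
    for _ in range(passes):
        w, b = gradient_step(points, w, b, lr)
    return w, b


points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # generated from y = 2x + 1

one_pass = fit(points, passes=1)
many_passes = fit(points, passes=500)

print("1 pass    :", mean_squared_error(points, *one_pass))
print("500 passes:", mean_squared_error(points, *many_passes))
```

Every pass re-reads the same data set, so the cost of each pass matters far more here than in a job that touches the data once.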

Spark SQL provides support for interacting with Spark via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HiveQL).
• Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations.
• Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark.
• Spark SQL includes a server mode with industry-standard JDBC and ODBC connectivity.
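
The idea that a SQL query corresponds to a chain of dataset operations can be illustrated without Spark. In the sketch below, Python's built-in `sqlite3` module is only a stand-in SQL engine, and the `people` table, its rows, and the query are assumptions; the point is that the same result comes from a filter followed by a map over the rows.

```python
import sqlite3

# Hypothetical structured data.
rows = [
    {"name": "alice", "age": 34, "city": "Delhi"},
    {"name": "bob", "age": 25, "city": "Mumbai"},
    {"name": "carol", "age": 41, "city": "Delhi"},
]

# SQL view of the data (sqlite3 stands in for a SQL engine here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany("INSERT INTO people VALUES (:name, :age, :city)", rows)
sql_names = [
    r[0]
    for r in conn.execute("SELECT name FROM people WHERE age > 30 ORDER BY name")
]
conn.close()

# The same query written as dataset operations:
# WHERE -> filter, SELECT -> map, ORDER BY -> sort.
older = filter(lambda r: r["age"] > 30, rows)
op_names = sorted(map(lambda r: r["name"], older))

print(sql_names)
print(op_names)
```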

