Integrating Apache Spark and NiFi for Data Lakes

This document discusses using Apache Spark and Apache NiFi together for data lakes. It outlines the goals of a data lake: a central data repository, lower costs, and easier data discovery and prototyping. It also covers what a Hadoop data lake needs, including pipeline automation, governance, and interactive data discovery. The document then presents an example ingestion project and describes using Apache Spark for functions such as cleansing, validating, and profiling data, and Apache NiFi for pipeline design with drag-and-drop functionality. Finally, it demonstrates data ingestion and preparation, data self-service and transformation, data discovery, and operational monitoring.
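The cleanse, validate, and profile steps mentioned above can be illustrated with a minimal sketch. It is written in plain Python rather than against the Spark API, so it only shows the shape of the logic a Spark job might apply to each record. The sample rows, column names, validation rules, and function names are all assumptions made for illustration; none of them come from the slides.

```python
# Hypothetical raw records, standing in for what an ingestion feed might deliver.
RAW_ROWS = [
    {"id": "1", "email": " Alice@Example.com ", "age": "34"},
    {"id": "2", "email": "bob@example.com", "age": "-5"},
    {"id": "3", "email": "", "age": "41"},
    {"id": "4", "email": "carol@example.com", "age": "abc"},
]


def cleanse(row):
    """Trim whitespace from every field and lowercase the email."""
    cleaned = {key: value.strip() for key, value in row.items()}
    cleaned["email"] = cleaned["email"].lower()
    return cleaned


def validate(row):
    """Return a list of rule violations; an empty list means the row is valid."""
    errors = []
    if "@" not in row["email"]:
        errors.append("email: missing or malformed")
    if not row["age"].isdigit():
        errors.append("age: not a non-negative integer")
    return errors


def profile(rows):
    """Count non-empty and distinct values per column."""
    stats = {}
    for column in rows[0]:
        values = [r[column] for r in rows if r[column] != ""]
        stats[column] = {"non_empty": len(values), "distinct": len(set(values))}
    return stats


def run_pipeline(rows):
    """Cleanse all rows, split them into valid and invalid, and profile them."""
    cleaned = [cleanse(r) for r in rows]
    valid, invalid = [], []
    for row in cleaned:
        errors = validate(row)
        if errors:
            invalid.append((row, errors))
        else:
            valid.append(row)
    return valid, invalid, profile(cleaned)


if __name__ == "__main__":
    valid, invalid, stats = run_pipeline(RAW_ROWS)
    print("valid rows:", valid)
    print("invalid rows:", invalid)
    print("profile:", stats)
```

Keeping invalid rows together with their error messages, instead of dropping them, is one possible design choice: it lets a downstream monitoring step report why records were rejected.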
