Data Engineering: A Deep Dive into Databricks
Presenters: Mohika Rastogi, Sant Singh
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
• Punctuality: Join the session 5 minutes before the start time. We start on time and conclude on time!
• Feedback: Submit constructive feedback for every session; it is very helpful for the presenter.
• Silent Mode: Keep your mobile devices in silent mode, and feel free to step out of the session if you need to attend an urgent call.
• Avoid Disturbance: Avoid unwanted chit-chat during the session.
1. Introduction
o What is Data Engineering
o Data Engineer vs Analyst vs Scientist
2. Central Repository
o Data Warehouse
o Data Lake
o Data Lakehouse
3. Databricks
o What is Databricks?
o Use cases
o Managed Integration
o Delta Lake
o Delta Sharing
4. Apache Spark
5. Databricks Workspace
o Workspace Terminologies
6. Demo
Data Engineering
• Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.
• It is the complex task of making raw data usable to data scientists and other groups within an organization.
Data Engineer vs Data Analyst vs Data Scientist
Data Scientist
A data scientist is someone who uses their knowledge of statistics, machine learning, and programming to extract meaning from data. They use their skills to solve complex problems, identify trends, and make predictions.
Data Analyst
A data analyst is someone who collects, cleans, and analyzes data to help businesses make better decisions. They use their skills to identify patterns in data, and to create reports and visualizations that help others understand the data.
Data Engineer
A data engineer is someone who builds and maintains the systems that data scientists and data analysts use to collect, store, and analyze data. They use their skills to design and build data pipelines, and to ensure that data is stored in a secure and efficient way.
Central Repositories
Data Warehouse
A data warehouse is a central repository of business data stored in a structured format to help organizations gain insights. The schema must be known before data is written into the warehouse (schema-on-write).
Data Lake
A data lake is a large storage repository that can hold structured, semi-structured, and raw data. The schema does not need to be known in advance, because it is applied when the data is read (schema-on-read).
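The difference between schema-on-write (warehouse) and schema-on-read (lake) can be sketched in plain Python. This is a minimal illustration, not how any real warehouse or lake is implemented; `ORDER_SCHEMA`, `Warehouse`, and `Lake` are hypothetical names introduced only for this example.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def validate(record, schema):
    """Raise ValueError if the record does not match the schema exactly."""
    if set(record) != set(schema):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    return record


class Warehouse:
    """Schema-on-write: a record is validated before it is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        self.rows.append(validate(record, self.schema))


class Lake:
    """Schema-on-read: raw text is stored as-is and parsed only when read."""

    def __init__(self):
        self.raw = []

    def write(self, raw_text):
        self.raw.append(raw_text)  # nothing is checked at write time

    def read(self, schema):
        rows = []
        for raw_text in self.raw:
            try:
                rows.append(validate(json.loads(raw_text), schema))
            except (ValueError, TypeError):
                continue  # skip records that do not fit the schema chosen at read time
        return rows
```

In this sketch the warehouse rejects a bad record at write time, while the lake accepts anything and only discovers the mismatch when a reader applies a schema.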
Data Lakehouse
• The data lakehouse is a relatively new architecture that combines the best of both worlds: data warehouses and data lakes.
• It serves as a single platform for data warehousing and data lake workloads. It offers data management features such as ACID transactions, as in a warehouse, along with low-cost storage, as in a data lake.
Databricks
A unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.
• An interactive analytics platform that enables data engineers, data scientists, and businesses to collaborate on notebooks, experiments, models, data, libraries, and jobs.
• Databricks was founded in 2013 by the creators of Apache Spark.
• A one-stop product for data requirements such as storage and analysis.
• Databricks is integrated with Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
What is Databricks used for?
The Databricks workspace provides a unified interface and tools for most data tasks, including:
• Scheduling and managing data processing workflows
• Working in SQL
• Generating dashboards and visualizations
• Data ingestion
• Managing security, governance, and high availability/disaster recovery (HA/DR)
• Data discovery, annotation, and exploration
• Compute management
• Machine learning (ML) modeling and tracking
• ML model serving
• Source control with Git
Databricks for Data Engineering
Databricks excels in data engineering with its unified platform, leveraging Apache Spark for efficient processing and scalability.
• Simplified data ingestion
• Automated ETL processing
• Reliable workflow orchestration
• End-to-end observability and monitoring
• Next-generation data processing engine
• Foundation of governance, reliability, and performance
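The ETL (extract, transform, load) pattern mentioned above can be sketched on a single machine with the Python standard library. This is only an illustration of the three stages, not Databricks tooling; the sample data and the function names `extract`, `transform`, and `load` are assumptions made for the example.

```python
import csv
import io

# Hypothetical raw input: one row has a missing amount and mixed-case country codes.
RAW_CSV = """user_id,country,amount
1,us,10.50
2,IN,
3,us,4.25
"""


def extract(text):
    """Extract: parse raw CSV text into a list of string-valued dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, fix types, normalize country codes."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # incomplete record, skipped in this sketch
        cleaned.append(
            {
                "user_id": int(row["user_id"]),
                "country": row["country"].upper(),
                "amount": float(row["amount"]),
            }
        )
    return cleaned


def load(rows, target):
    """Load: aggregate the amount per country into the target dictionary."""
    for row in rows:
        target[row["country"]] = target.get(row["country"], 0.0) + row["amount"]
    return target


totals = load(transform(extract(RAW_CSV)), {})
print(totals)
```

A production pipeline would read from and write to durable storage and run on a schedule; the stage boundaries stay the same.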
Managed integration with open source
The following technologies are open source projects founded by Databricks employees:
• Delta Lake: the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform.
• Delta Sharing: an open standard for secure data sharing.
• Apache Spark: Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
• MLflow: an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
Delta Lake
• Delta Lake is the default storage format for all operations on Databricks.
• Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.
• Delta Lake is fully compatible with Apache Spark APIs and was developed for tight integration with Structured Streaming, so a single copy of the data can serve both batch and streaming operations and support incremental processing at scale.
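The idea of a file-based transaction log can be sketched as a list of commits, each holding "add" and "remove" actions on data files; replaying the commits in order yields the set of files that currently make up the table. This is a deliberately simplified toy model, not Delta Lake's actual log format or API, and the file names and the `snapshot` function are hypothetical.

```python
def snapshot(commits, version=None):
    """Replay commits 0..version and return the sorted list of live data files.

    Each commit is a list of actions that take effect together, so a reader
    sees either all of a commit's changes or none of them.
    """
    if version is None:
        version = len(commits) - 1  # default to the latest commit
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


# Hypothetical history: two appends, then a compaction that rewrites both files into one.
COMMITS = [
    [{"add": {"path": "part-0000.parquet"}}],
    [{"add": {"path": "part-0001.parquet"}}],
    [
        {"remove": {"path": "part-0000.parquet"}},
        {"remove": {"path": "part-0001.parquet"}},
        {"add": {"path": "part-0002.parquet"}},
    ],
]

print(snapshot(COMMITS))     # state after the latest commit
print(snapshot(COMMITS, 1))  # state as of commit 1, before the compaction
```

Because the table state is derived from the log rather than from a directory listing, a half-written data file that was never committed is simply invisible to readers in this model.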
Delta Sharing
• Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to share data with other organizations regardless of which computing platforms they use.
• Databricks and the Linux Foundation developed Delta Sharing to provide the first open source approach to data sharing across data, analytics, and AI. Customers can share live data across platforms, clouds, and regions with strong security and governance.
Apache Spark
• Apache Spark is a cluster computing technology designed for fast computation. It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
• The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
• PySpark is the Python interface for Apache Spark. With PySpark, you can write Python and SQL-like commands to manipulate and analyze data in a distributed processing environment.
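The MapReduce model that Spark extends can be illustrated with a word count in plain Python. This sketch runs on a single machine and does not use Spark or PySpark; it only shows the map, shuffle, and reduce phases that a cluster engine would distribute across workers. All function names here are introduced for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into one result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


print(word_count(["spark is fast", "Spark is in memory"]))
```

In a cluster engine the map and reduce functions run in parallel on partitions of the data; keeping intermediate results in memory between such stages is what the in-memory computing point above refers to.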
Databricks Workspace
The Databricks workspace is an environment for accessing all Databricks assets. It organizes objects such as notebooks, libraries, and experiments into folders, and provides access to data and to computational resources such as clusters and jobs.
The Databricks workspace can be managed using:
1. Workspace UI
2. Databricks CLI
3. Databricks REST API
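As a sketch of the REST API option, the snippet below builds (but does not send) a request that lists clusters, using only the Python standard library. The workspace URL and token are placeholders, the helper name is hypothetical, and the `/api/2.0/clusters/list` path with bearer-token authentication is assumed from the public Databricks REST API reference and should be verified against the current documentation.

```python
import urllib.request


def build_list_clusters_request(host, token):
    """Build a GET request for the clusters list endpoint without sending it."""
    url = f"{host.rstrip('/')}/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},  # personal access token
        method="GET",
    )


# Placeholder values for illustration only; substitute a real workspace URL and token.
request = build_list_clusters_request(
    "https://example.cloud.databricks.com", "dapi-EXAMPLE-TOKEN"
)
print(request.get_method(), request.full_url)
# To actually call the API: urllib.request.urlopen(request)
```

The Workspace UI and the Databricks CLI expose the same operations; the REST API is the option suited to automation.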
Databricks Workspace Terminology
• Cluster: A set of computational resources and configurations on which an organization’s data engineering workloads are run.
• Job: A way of running a notebook or a JAR, either immediately or on a schedule.
• Hive metastore: Every Databricks deployment has a central Hive metastore, accessible by all clusters, to persist table metadata.
• Notebook: A web-based interface composed of a group of cells in which you can execute code.
• DBFS: A distributed file system mounted into each Databricks workspace. DBFS contains directories, which in turn contain data files, libraries, and other directories.
• Delta table: By default, all tables created in Databricks are Delta tables. Delta tables are based on the Delta Lake open source project.
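To make the job, notebook, and cluster terms concrete, the sketch below assembles a job definition that runs a notebook on an existing cluster, optionally on a schedule. The field names are assumed to follow the Databricks Jobs API 2.1 and should be checked against the official reference; the helper `notebook_job`, the notebook path, and the cluster ID are hypothetical.

```python
import json


def notebook_job(name, notebook_path, cluster_id, cron=None, timezone="UTC"):
    """Return a job definition dict; include a schedule only when cron is given."""
    job = {
        "name": name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                "existing_cluster_id": cluster_id,
            }
        ],
    }
    if cron is not None:
        # Quartz cron syntax; without a schedule the job runs only when triggered.
        job["schedule"] = {"quartz_cron_expression": cron, "timezone_id": timezone}
    return job


# Placeholder values for illustration only.
nightly = notebook_job(
    "nightly-ingest",
    "/Users/someone@example.com/ingest",
    "0000-000000-example",
    cron="0 0 2 * * ?",
)
print(json.dumps(nightly, indent=2))
```

Leaving out `cron` corresponds to the "run immediately" case in the definition of a job; supplying it corresponds to the scheduled case.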