MLlib with MLFlow.pdf


The document outlines the use of MLlib with MLflow for end-to-end machine learning processes in PySpark, covering data preparation, model training, evaluation, and performance logging. It details the essential components of PySpark ML workflows, such as dataframes, transformers, estimators, and pipelines, as well as how to use MLflow for tracking model parameters and metrics. Additionally, it provides instructions for setting up and viewing MLflow tracking through its user interface.


MLlib with MLFlow
Michelle Hoogenhout
July 17th 2021

What I’ll cover
Use MLlib with MLflow end-to-end to:
● Prepare data in PySpark for use with MLlib
● Train and evaluate several classifier models
● Log model performance with MLflow Tracking

What you’ll need
● PySpark / Docker
● MLflow
https://github.com/michellehoog/mllib-example

Why PySpark?
Enables scalable analysis (without having to know Scala!)
Allows distributed processing
Creates ML pipelines
Interacts with Pandas

What algorithms are available in PySpark MLlib?
● A variety of classification and regression models, incl.
○ Linear & Logistic Regression
○ Tree-based models
○ Multilayer Perceptron
○ Naive Bayes
● Clustering
● Collaborative filtering
● Frequent pattern mining

Spark workflow
● DataFrame: Spark ML uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
● Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
● Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
● Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
● Parameter: All Transformers and Estimators share a common API for specifying parameters.

Things to note
Data format
● Dense format
● Numeric, zero-indexed labels (feature values must be non-negative for Naive Bayes)
● Columns named ‘label’ and ‘features’
Pipelines

MLflow
pip install mlflow[extras]
https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
Open-source tracking and deployment of ML models
Not specific to Spark / MLlib

MLflow can log:
● Git commit hash
● Start & end time
● Source
● Parameters
● Metrics
● Artifacts (output)

MLflow tracking overview
Step 1. Create experiment
Step 2. Add runs to your code
Step 3. View logs

MLflow tracking overview
All MLflow runs are logged to the active experiment, which can be set in any of the following ways:
● Use the mlflow.set_experiment() command.
● Use the experiment_id parameter in the mlflow.start_run() command.
● Set one of the MLflow environment variables MLFLOW_EXPERIMENT_NAME or MLFLOW_EXPERIMENT_ID.
If no active experiment is set, runs are logged to the notebook experiment on Databricks; open-source MLflow outside a notebook logs them to the experiment named Default.

Viewing the MLflow Tracking UI
The tracking API writes data to a local ./mlruns directory.
To view:
Start the MLflow UI with mlflow ui
MLflow’s Tracking UI: http://localhost:5000/#/

