Tips and tricks for data
science projects with Python
José Manuel Ortega
Python Developer
Jose Manuel Ortega
Software engineer,
Freelance
1. Introducing Python for machine learning projects
2. Stages of a machine learning project
3. Selecting the best Python library for each stage of your project
4. Python tools for deep learning in data science projects
Introducing Python for machine learning projects
● Simple and consistent
● Understandable by humans
● General-purpose programming language
● Extensive selection of libraries and
frameworks
Introducing Python for machine learning projects
● Spam filters
● Recommendation systems
● Search engines
● Personal assistants
● Fraud detection systems
Introducing Python for machine learning projects
● Machine learning: Keras, TensorFlow, and Scikit-learn
● High-performance scientific computing: NumPy, SciPy
● Computer vision: OpenCV
● Data analysis: NumPy, Pandas
● Natural language processing: NLTK, spaCy
Introducing Python for machine learning projects
● Reading/writing many different data formats
● Selecting subsets of data
● Calculating across rows and down columns
● Finding and filling missing data
● Applying operations to independent groups within the data
● Reshaping data into different forms
● Combining multiple datasets together
● Advanced time-series functionality
● Visualization through Matplotlib and Seaborn
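A minimal sketch of a few of the operations listed above, using pandas on two small made-up tables. The column names and values are illustrative assumptions, not data from the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [10, np.nan, 7, 5],
})
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Madrid", "Alicante"]})

# Finding and filling missing data
sales["units"] = sales["units"].fillna(0)

# Selecting a subset of the rows
busy = sales[sales["units"] > 5]

# Applying an operation to independent groups within the data
units_per_store = sales.groupby("store")["units"].sum()

# Combining multiple datasets together
merged = sales.merge(stores, on="store", how="left")

print(units_per_store)
print(merged)
```

A left merge keeps every row of the sales table, which is usually the safer default when the lookup table might be incomplete.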
Introducing Python for machine learning projects
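import pandas as pd
import pandas_profiling  # newer releases of this package are published as ydata-profiling

# Read the dataset ('your-data' is a placeholder for the path to a CSV file)
data = pd.read_csv('your-data')
# Build an exploratory profiling report and save it as an HTML file
prof = pandas_profiling.ProfileReport(data)
prof.to_file(output_file='output.html')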
Stages of a machine learning project
Python libraries
● Supervised and unsupervised machine learning
● Classification, regression, Support Vector Machine
● Clustering, K-means, DBSCAN
● Random Forest
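The list above can be illustrated with one supervised and one unsupervised model from scikit-learn. This is a sketch under assumptions: the iris dataset, the split ratio, and the hyperparameters are choices made here for illustration, not taken from the slides.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Supervised learning: the classifier is fitted on labelled examples.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Unsupervised learning: K-means groups the samples without seeing the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(f"Random Forest test accuracy: {accuracy:.2f}")
print("Cluster sizes:", [int((cluster_labels == k).sum()) for k in range(3)])
```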
Python libraries
● Pipelines
● Grid-search
● Validation curves
● One-hot encoding of categorical data
● Dataset generators
● Principal Component Analysis (PCA)
Python libraries
Pipelines
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB())
Pipeline(steps=[('binarizer', Binarizer()),
('multinomialnb', MultinomialNB())])
http://scikit-learn.org/stable/modules/pipeline.html
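The pipeline above is only constructed, not trained. A small sketch of fitting it and predicting with it follows; the word-count matrix and labels are hypothetical values invented for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

# Hypothetical word-count features (rows = documents, columns = words).
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 2, 0]])
y = np.array([0, 0, 1, 1])

# Binarizer turns counts into 0/1 presence flags before the classifier sees them.
pipe = make_pipeline(Binarizer(), MultinomialNB())
pipe.fit(X, y)

# The same preprocessing is applied automatically at prediction time.
pred = pipe.predict(np.array([[5, 0, 0], [0, 1, 0]]))
print(pred)
print(list(pipe.named_steps))
```

Because the preprocessing step lives inside the pipeline, it is refitted on each training fold during cross-validation, which avoids leaking information from the validation data.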
Python libraries
Grid-search
estimator.get_params()
A search consists of:
● an estimator (regressor or classifier such as
sklearn.svm.SVC())
● a parameter space
● a method for searching or sampling candidates
● a cross-validation scheme
● a score function
https://scikit-learn.org/stable/modules/grid_search.html#grid-search
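The five ingredients listed above map onto the arguments of scikit-learn's GridSearchCV. A minimal sketch, assuming the iris dataset and a small illustrative parameter grid (both chosen here, not in the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

estimator = SVC()                                   # the estimator
param_grid = {"C": [0.1, 1, 10],                    # the parameter space
              "kernel": ["linear", "rbf"]}

# GridSearchCV is the search method (exhaustive), cv=5 the cross-validation
# scheme, and scoring="accuracy" the score function.
search = GridSearchCV(estimator, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# get_params() lists every parameter name that can appear in the grid.
print(sorted(estimator.get_params()))
print(search.best_params_, round(search.best_score_, 3))
```

For larger parameter spaces, RandomizedSearchCV from the same module samples candidates instead of trying every combination.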
Python libraries
Validation curves
https://scikit-learn.org/stable/modules/learning_curve.html
Python libraries
Validation curves
>>> import numpy as np
>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import validation_curve
>>> # X, y: feature matrix and target vector, loaded beforehand
>>> train_scores, valid_scores = validation_curve(
...     Ridge(), X, y, param_name="alpha", param_range=np.logspace(-7, 3, 3),
...     cv=5)
>>> train_scores
array([[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
       [0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
       [0.51..., 0.52..., 0.49..., 0.47..., 0.49...]])
>>> valid_scores
array([[0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
       [0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
       [0.46..., 0.25..., 0.50..., 0.49..., 0.52...]])
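Each row of the arrays above holds the per-fold scores for one value of alpha. A self-contained sketch of how those rows might be summarised to pick a value follows; the synthetic regression data is an assumption made so the example can run on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Synthetic regression data, generated only for this illustration.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

param_range = np.logspace(-7, 3, 3)  # alpha = 1e-7, 1e-2, 1e3
train_scores, valid_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=param_range, cv=5
)

# Average over the 5 folds to get one training and one validation score per alpha.
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
best_alpha = param_range[np.argmax(mean_valid)]

for alpha, tr, va in zip(param_range, mean_train, mean_valid):
    print(f"alpha={alpha:g}  train={tr:.3f}  valid={va:.3f}")
print("best alpha:", best_alpha)
```

When both scores drop together, as they do for the largest alpha, the model is underfitting; a high training score with a much lower validation score would point to overfitting instead.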
Python libraries
One-hot encoding
https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
# import the scikit-learn one-hot encoder
from sklearn.preprocessing import OneHotEncoder
# initialize the encoder
encoding = OneHotEncoder()
# fit the encoder on the 'Status' column of an existing DataFrame `data` and encode it
transformed_data = encoding.fit_transform(data[['Status']])
# fit_transform returns a sparse matrix; print it as a dense array
print(transformed_data.toarray())
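The snippet above relies on a DataFrame named data that is not shown. A self-contained sketch, assuming a hypothetical 'Status' column with three made-up categories:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data; only the 'Status' column name comes from the snippet above.
data = pd.DataFrame({"Status": ["single", "married", "single", "divorced"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(data[["Status"]]).toarray()

# Columns follow the sorted category order stored in categories_.
print(encoder.categories_[0])
print(encoded)

unseen = encoder.transform(pd.DataFrame({"Status": ["widowed"]})).toarray()
print(unseen)
```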
Python libraries
Dataset generators
https://scikit-learn.org/stable/datasets/sample_generators.html
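As a rough illustration of the dataset generators, three of the functions in sklearn.datasets are sketched here. The sample counts and other arguments are arbitrary choices for the example.

```python
from sklearn.datasets import make_blobs, make_classification, make_moons

# Labelled data for testing classifiers.
X_cls, y_cls = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=2, random_state=0
)

# Gaussian blobs, convenient for trying out clustering algorithms.
X_blobs, y_blobs = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

# Two interleaving half circles: a non-linearly separable toy problem.
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=0)

print(X_cls.shape, X_blobs.shape, X_moons.shape)
```

Fixing random_state makes the generated data reproducible, which helps when comparing models on the same synthetic problem.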
Python libraries
Principal Component Analysis (PCA)
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Python libraries
Principal Component Analysis (PCA)
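from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features: fit on the training set only, then reuse on the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Project the standardized data onto the first two principal components
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)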
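The PCA snippet assumes X_train and X_test already exist. A runnable sketch of the same steps, using the iris dataset and an 80/20 split as assumptions, and also reporting how much variance the two components retain:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale first: PCA is sensitive to the units of each feature.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train_scaled)
X_test_2d = pca.transform(X_test_scaled)

# Fraction of the total variance captured by each retained component.
print(pca.explained_variance_ratio_)
print(X_train_2d.shape, X_test_2d.shape)
```

Checking explained_variance_ratio_ is a simple way to judge whether two components are enough or whether n_components should be raised.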
Python tools for deep learning
Criterion       TensorFlow              Keras                      PyTorch
API level       High and low            High                       Low
Architecture    Not easy to use         Simple, concise, readable  Complex, less readable
Speed           Fast, high-performance  Slow, low performance      Fast, high-performance
Trained models  Yes                     Yes                        Yes
Python tools for deep learning
● Tight integration with NumPy – use numpy.ndarray in Theano-compiled functions.
● Transparent use of a GPU – perform data-intensive computations much faster than on a CPU.
● Efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
● Speed and stability optimizations – get the right answer for log(1+x) even when x is really tiny.
● Dynamic C code generation – evaluate expressions faster.
● Extensive unit-testing and self-verification – detect and diagnose many types of error.
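The log(1+x) point can be reproduced without Theano. This sketch uses Python's standard math module (a substitution made here, not Theano itself) to show why a naive evaluation fails for tiny x and what a numerically stable version returns.

```python
import math

x = 1e-20

# In double precision 1 + x rounds to exactly 1.0, so the naive form loses x entirely.
naive = math.log(1 + x)

# log1p evaluates log(1 + x) without forming 1 + x first, so the result stays close to x.
stable = math.log1p(x)

print(naive)
print(stable)
```

A graph-optimizing framework can apply this kind of rewrite automatically when it recognises the log(1 + x) pattern in an expression.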
Python tools for deep learning
● Synkhronos: extension to Theano for multi-GPU data parallelism.
● Theano-MPI: a distributed framework for training models built in Theano, based on data parallelism.
● Platoon: multi-GPU mini-framework for Theano, single node.
● Elephas: distributed deep learning with Keras & Spark.
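The tools above share the idea of data parallelism: each worker computes gradients on its own slice of the batch and the results are averaged. The following NumPy sketch simulates that idea on a single machine. It does not use any of the listed libraries, and the linear model, data, and worker count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # hypothetical mini-batch of 8 samples
y = rng.normal(size=8)
w = np.zeros(3)               # current weights of a linear model


def mse_gradient(X_part, y_part, w):
    """Gradient of the mean squared error of a linear model on one shard."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)


# Split the batch into equal shards, one per simulated worker.
n_workers = 4
shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
worker_grads = [mse_gradient(X_s, y_s, w) for X_s, y_s in shards]

# Averaging the per-worker gradients reproduces the full-batch gradient
# when all shards have the same size.
avg_grad = np.mean(worker_grads, axis=0)
full_grad = mse_gradient(X, y, w)
print(np.allclose(avg_grad, full_grad))
```

In a real multi-GPU or multi-node setup the averaging step is a communication operation between devices, which is the part these frameworks take care of.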
Tips and tricks for data
science projects with Python
@jmortegac
https://www.linkedin.com/in/jmortega1
