#datapopupseattle
Understanding Feature Space
in Machine Learning
Alice Zheng
Director of Data Science, Dato
@RainyData, @DatoInc
Data Science POP-UP in Seattle
www.dominodatalab.com
Produced by Domino Data Lab
Understanding Feature Space
in Machine Learning
Alice Zheng, Dato
October, 2015
My journey so far
Applied machine learning (data science)
Build ML tools
Shortage of experts and good tools.
Why machine learning?
Model data.
Make predictions.
Build intelligent
applications.
The machine learning pipeline
Example raw data: “I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …”
Raw data → Features → Models → Predictions → Deploy in production
Three things to know about ML
• Feature = numeric representation of raw data
• Model = mathematical “summary” of features
• Making something that works = choose the right model
and features, given data and task
Feature = Numeric representation of raw data
Representing natural text
Raw text: “It is a puppy and it is extremely cute.”
What’s important? Phrases? Specific words? Ordering? Subject, object, verb?
Classify: puppy or not?
Bag of words: {“it”: 2, “is”: 2, “a”: 1, “puppy”: 1, “and”: 1, “extremely”: 1, “cute”: 1}
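As a minimal sketch of the bag-of-words idea, the counts on this slide can be reproduced with the Python standard library. The function name `bag_of_words` and the simple regular-expression tokenizer are assumptions made for illustration; the talk does not specify how the text is tokenized.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences, ignoring punctuation and word order."""
    # Assumption: a word is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


bow = bag_of_words("It is a puppy and it is extremely cute.")
print(dict(bow))
# {'it': 2, 'is': 2, 'a': 1, 'puppy': 1, 'and': 1, 'extremely': 1, 'cute': 1}
```

Word order is discarded on purpose: the representation keeps only which words occur and how often.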
Representing natural text
Raw text: “It is a puppy and it is extremely cute.”
Classify: puppy or not?
Bag of words:
it 2
they 0
I 0
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
cute 1
extremely 1
… …
Sparse vector representation
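The sparse-vector view can be sketched as follows, assuming a small fixed vocabulary built from the words listed on the slide (lowercased here, so “I” becomes “i”). The names `VOCABULARY`, `to_vector`, and `to_sparse` are hypothetical helpers, not part of any library.

```python
import re
from collections import Counter

# Assumption: a tiny fixed vocabulary taken from the words shown on the slide.
VOCABULARY = ["it", "they", "i", "am", "how", "puppy", "and", "cat",
              "aardvark", "cute", "extremely"]


def to_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[word] for word in vocabulary]


def to_sparse(vector):
    """Keep only the nonzero entries as {index: count}."""
    return {index: value for index, value in enumerate(vector) if value != 0}


vector = to_vector("It is a puppy and it is extremely cute.", VOCABULARY)
sparse = to_sparse(vector)
print(vector)  # [2, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]
print(sparse)  # {0: 2, 5: 1, 6: 1, 9: 1, 10: 1}
```

With a realistic vocabulary of many thousands of words, almost every entry is zero, which is why storing only the nonzero entries is attractive.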
Representing images
Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005-2009.
Raw image: millions of RGB triplets, one for each pixel
Classify: person or animal?
Raw image → bag of visual words
Representing images
Classify: person or animal?
Raw image → deep learning features
Dense vector representation: [3.29, -15, -5.24, 48.3, 1.36, 47.1, -1.92, 36.5, 2.83, 95.4, -19, -89, 5.09, 37.8]
Feature space in machine learning
• Raw data → high-dimensional vectors
• Collection of data points → point cloud in feature space
• Feature engineering = creating features of the appropriate
granularity for the task
Visualizing Feature Space
Crudely speaking, mathematicians fall into two categories:
the algebraists, who find it easiest to reduce all problems
to sets of numbers and variables, and the geometers, who
understand the world through shapes.



-- Masha Gessen, “Perfect Rigor”
Visualizing bag-of-words
Figure: the document “I have a puppy and it is extremely cute” plotted as a point at 1 on the “puppy” axis and 1 on the “cute” axis.
it 1
they 0
I 1
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
zebra 0
cute 1
extremely 1
… …
Visualizing bag-of-words
Figure: three documents plotted on the axes “puppy,” “cute,” and “extremely” (tick mark at 1 on each axis):
• I have a puppy and it is extremely cute
• I have an extremely cute cat
• I have a cute puppy
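To make the picture concrete, here is a small sketch that maps the three documents to coordinates on the “puppy,” “cute,” and “extremely” axes and measures how far apart the points are. The helper `to_point` and the use of Euclidean distance are illustrative assumptions; the slide only shows the points.

```python
import math
import re

AXES = ("puppy", "cute", "extremely")

DOCUMENTS = [
    "I have a puppy and it is extremely cute",
    "I have an extremely cute cat",
    "I have a cute puppy",
]


def to_point(text, axes):
    """Project a document onto the chosen word axes using raw counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return tuple(tokens.count(axis) for axis in axes)


points = [to_point(doc, AXES) for doc in DOCUMENTS]
print(points)  # [(1, 1, 1), (0, 1, 1), (1, 1, 0)]

# Euclidean distance between the first two documents (requires Python 3.8+).
print(math.dist(points[0], points[1]))  # 1.0
```

Each document becomes one point, and a collection of documents becomes a point cloud in this three-dimensional slice of the full feature space.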
Document point cloud
Figure: documents as a point cloud on the axes “word 1” and “word 2.”
Model = Mathematical “summary” of features
What is a summary?
• Data → point cloud in feature space
• Model = a geometric shape that best “fits” the point cloud
Classification model
Decide between two classes
Figure axes: Feature 1, Feature 2
Clustering model
Group data points tightly
Figure axes: Feature 1, Feature 2
Regression model
Fit the target values
Figure axes: Feature, Target
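One way to read “a geometric shape that best fits the point cloud” for regression is a straight line fitted by ordinary least squares. The sketch below is only an illustration of that reading: the function `fit_line` and the toy data are made up for this example and do not come from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the feature, and how the feature and target vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up toy data lying exactly on the line y = 2x + 1.
features = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(features, targets)
print(slope, intercept)  # 2.0 1.0
```

The two fitted numbers are the whole model: they summarize the point cloud as a line, which can then be used to predict the target for a new feature value.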
Visualizing Feature Engineering

When does bag-of-words fail?
Figure: four documents plotted on the axes “puppy,” “cat,” and “have”:
• I have a puppy
• I have a cat
• I have a kitten
• I have a dog and I have a pen
Task: find a surface that separates documents about dogs vs. cats
Problem: the word “have” adds fluff instead of information
Improving on bag-of-words
• Idea: “normalize” word counts so that popular words are
discounted
• Term frequency (tf) = number of times a term appears in a document
• Inverse document frequency of a word (idf) = log(N / number of documents that contain the word)
• N = total number of documents
• Tf-idf count = tf × idf
From BOW to tf-idf
Figure: four documents plotted on the axes “puppy,” “cat,” and “have”:
• I have a puppy
• I have a cat
• I have a kitten
• I have a dog and I have a pen
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
From BOW to tf-idf
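tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
Figure: the documents replotted with tf-idf values on the axes “puppy,” “cat,” and “have.” “I have a puppy” sits at log 4 on the “puppy” axis and “I have a cat” at log 4 on the “cat” axis; “I have a dog and I have a pen” and “I have a kitten” share a single point. A decision surface separates the points.
Tf-idf flattens uninformative dimensions in the BOW point cloud.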
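The worked example can be checked with a short sketch. It assumes idf = log(N / number of documents containing the word), which is the form consistent with the values on the slides (log 4 for “puppy” and “cat,” log 1 = 0 for “have”). The helper names are hypothetical, and library implementations of tf-idf often use smoothed variants that give different numbers.

```python
import math
import re
from collections import Counter

DOCUMENTS = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
AXES = ["puppy", "cat", "have"]


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def idf(word, documents):
    """log(N / number of documents containing the word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for doc in documents if word in tokenize(doc))
    return math.log(len(documents) / containing)


def bow_point(doc, axes):
    counts = Counter(tokenize(doc))
    return [counts[word] for word in axes]


def tfidf_point(doc, axes, documents):
    counts = Counter(tokenize(doc))
    return [counts[word] * idf(word, documents) for word in axes]


bow_points = [bow_point(doc, AXES) for doc in DOCUMENTS]
tfidf_points = [tfidf_point(doc, AXES, DOCUMENTS) for doc in DOCUMENTS]

print(bow_points)    # [[1, 0, 1], [0, 1, 1], [0, 0, 1], [0, 0, 2]]
print(tfidf_points)  # "have" coordinate is 0.0 for every document
```

Because “have” appears in all four documents, its idf is zero and that axis collapses, leaving only the “puppy” and “cat” directions to separate the documents.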
That’s not all, folks!
• Geometry is the key to understanding feature space and
machine learning
• Many other fun topics:
- Feature normalization
- Feature transformations
- Model regularization
• Dato is hiring! jobs@dato.com
@RainyData, @DatoInc
@datapopup