Large Scale Kernel Learning using Block
Coordinate Descent
Shaleen Kumar Gupta, Research Assistant3
Authors:
Stephen Tu1 Rebecca Roelofs1 Shivaram Venkataraman1
Benjamin Recht1,2
1Department of Electrical Engineering and Computer Science
UC Berkeley, Berkeley, CA
2Department of Statistics
UC Berkeley, Berkeley, CA
3Nanyang Technological University, 2016
Outline
1 Overview
Introduction
Background
2 Datasets
TIMIT
Yelp Reviews
CIFAR-10
3 Experimental Results
4 Performance and Scalability
5 Conclusion
Outline
1 Overview
Introduction
Background
2 Datasets
TIMIT
Yelp Reviews
CIFAR-10
3 Experimental Results
4 Performance and Scalability
5 Conclusion
Overview
Kernel methods are a powerful tool in machine learning,
allowing one to discover non-linear structure by mapping data
into a higher dimensional, possibly infinite, feature space.
Problem: They do not scale well.
This paper exploits distributed computation in block coordinate
descent (Block CD) and presents results.
Moreover, the paper studies the performance of Random Features
and Nystrom approximations on three large datasets from the
speech (TIMIT), text (Yelp) and image classification (CIFAR-10)
domains. (A minimal Block CD sketch follows this slide.)
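As a concrete illustration of the block coordinate descent idea, here is a minimal sketch for kernel ridge regression in Python with NumPy. It is a toy version of mine, not the authors' implementation: the ridge penalty is placed on ||alpha||^2 for simplicity, and the block size, regularization strength, bandwidth, and synthetic data are illustrative assumptions. The key property it shares with the paper is that only one (block_size x n) slice of the kernel matrix is ever materialized at a time.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.05):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def block_cd_krr(X, y, lam=1e-2, block_size=256, epochs=5, gamma=0.05, seed=0):
    """Block coordinate descent for kernel ridge regression.

    Minimizes 0.5*a^T K a + 0.5*lam*||a||^2 - y^T a by exactly solving for
    one block of dual coefficients at a time, generating the corresponding
    kernel block on the fly instead of storing the full n x n matrix.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    alpha = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, block_size):
            idx = order[start:start + block_size]
            K_block = rbf_kernel(X[idx], X, gamma)   # one b x n kernel block
            K_bb = K_block[:, idx]                   # b x b diagonal sub-block
            # Right-hand side of the block optimality condition, with the
            # contribution of the current block's old coefficients removed.
            resid = y[idx] - (K_block @ alpha - K_bb @ alpha[idx])
            alpha[idx] = np.linalg.solve(K_bb + lam * np.eye(len(idx)), resid)
    return alpha

# Tiny usage example on synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)
alpha = block_cd_krr(X, y)
pred = rbf_kernel(X, X) @ alpha
print("train MSE:", np.mean((pred - y) ** 2))
```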
Outline
1 Overview
Introduction
Background
2 Datasets
TIMIT
Yelp Reviews
CIFAR-10
3 Experimental Results
4 Performance and Scalability
5 Conclusion
Kernel Methods
https://www.reddit.com/r/MachineLearning/comments/15zrpp/please_explain_support_vector_machines_
svm_like_i/c7rkwce
If our data can’t be separated by a straight line we might need
to use a curvy line.
A straight line in a higher dimensional space can be a curvy
line when projected onto a lower dimensional space.
So what we are really doing is using the kernel to put our data
into a high dimensional space, then finding a hyperplane to
separate the data in that high dimensional space.
That separating hyperplane looks like a curvy line when we bring it
back down to the lower dimensional space (see the toy sketch below).
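As a minimal toy illustration of that intuition (my own example, not from the slides): points inside versus outside a circle cannot be split by a straight line in two dimensions, but after lifting each point (x1, x2) to (x1, x2, x1^2 + x2^2), a flat plane separates the two classes perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # label: inside the circle

# No straight line in the original 2-D space separates the classes, but
# adding the coordinate z = x1^2 + x2^2 makes them separable by the flat
# plane z = 0.5 -- the "straight line in a higher dimensional space".
Z = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
plane_pred = (Z[:, 2] < 0.5).astype(int)
print("accuracy of the flat separator in the lifted space:",
      (plane_pred == y).mean())
```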
Kernel Approximation Techniques (1/2)
Kernel Trick: the essence of the kernel trick is that if you can
describe an algorithm using only inner products, then you never
need to compute the feature mapping explicitly, as long as you can
compute the inner product in the feature space (a minimal sketch
follows this slide).
While many kernels can be paired with the kernel trick, one
prominent choice is the Gaussian (RBF) kernel.
The paper also analyzes two kernel approximation techniques,
namely the Nystrom method and the random features technique.
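A minimal sketch of that statement (an illustrative example of mine, not from the paper): for the degree-2 polynomial kernel k(x, z) = (x . z)^2, the kernel value equals the inner product of the explicit quadratic feature maps, so an algorithm written purely in terms of inner products never needs the feature map itself.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, evaluated without any feature map."""
    return float(x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
print(phi(x) @ phi(z))      # inner product in the 25-dimensional lifted space
print(poly2_kernel(x, z))   # same number, computed directly from x and z
```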
Kernel Approximation Techniques (2/2)
If we used all data points, we would map into an N-dimensional
feature space (one coordinate per training point) and run into the
scaling problems; we would also need to store all N x N kernel values.
The Nystrom method says that we don't need to go to the full space
spanned by all N training points; we can use just a subset of them.
This yields only an approximate embedding, but if we keep the
number of sampled points fixed, the resulting embedding is
independent of the dataset size, and we can choose the complexity
to suit our problem (a minimal Nystrom sketch follows this slide).
Random feature based methods instead use an element-wise
approximation of the kernel function.
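Here is a minimal sketch of the Nystrom construction for an RBF kernel (the landmark count, bandwidth, and eigenvalue floor are my own illustrative choices): sample m landmark points and build an m-dimensional embedding whose inner products approximate the full kernel.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.05):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def nystrom_features(X, m=200, gamma=0.05, seed=0):
    """Rank-m Nystrom embedding Phi such that Phi @ Phi.T approximates K."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(X.shape[0], size=m, replace=False)]
    K_nm = rbf_kernel(X, landmarks, gamma)           # n x m
    K_mm = rbf_kernel(landmarks, landmarks, gamma)   # m x m
    # Multiply by the inverse square root of K_mm via its eigendecomposition.
    w, V = np.linalg.eigh(K_mm)
    w = np.maximum(w, 1e-10)                         # guard against round-off
    return K_nm @ (V / np.sqrt(w)) @ V.T

X = np.random.default_rng(1).normal(size=(1000, 10))
Phi = nystrom_features(X, m=100)
err = np.abs(Phi @ Phi.T - rbf_kernel(X, X)).mean()
print("mean absolute kernel approximation error:", err)
```

The embedding dimension is m regardless of the dataset size n, which is exactly the "complexity chosen to suit the problem" point above.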
Outline
1 Overview
Introduction
Background
2 Datasets
TIMIT
Yelp Reviews
CIFAR-10
3 Experimental Results
4 Performance and Scalability
5 Conclusion
TIMIT
A phone classification task was performed on the TIMIT dataset,
which consists of spoken audio from 462 speakers.
The authors applied a Gaussian (RBF) kernel for the Nystrom and
exact methods and used random cosines for the random feature
method (a random-cosine feature sketch follows this slide).
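For concreteness, here is a minimal sketch of such random cosine features (the Rahimi-Recht random Fourier feature construction) for the Gaussian kernel; the feature count and bandwidth below are illustrative choices, not the values used in the paper.

```python
import numpy as np

def random_cosine_features(X, D=2000, gamma=0.05, seed=0):
    """Random cosine features z(x) with E[z(x) @ z(y)] = exp(-gamma*||x-y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Quick check on one pair of points: the feature inner product should be
# close to the exact RBF kernel value.
x, y = np.random.default_rng(1).normal(size=(2, 20))
zx, zy = random_cosine_features(np.vstack([x, y]), D=20000)
print(zx @ zy, np.exp(-0.05 * np.sum((x - y) ** 2)))
```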
Outline
1 Overview
Introduction
Background
2 Datasets
TIMIT
Yelp Reviews
CIFAR-10
3 Experimental Results
4 Performance and Scalability
5 Conclusion
Yelp Reviews
The goal was to predict a rating from one to five stars from
the text of a review.
The usual 80:20 training:test split was applied.
NLTK was used for tokenization and stemming, and n-gram
modeling was done with n = 3.
For the exact and Nystrom experiments, they apply a linear
kernel.
For random features, they apply a hash kernel using
MurmurHash3 as their hash function.
Since they were predicting ratings for a review, they measured
accuracy using the root mean square error (RMSE) of the predicted
rating compared to the actual rating (a hashed n-gram sketch
follows this slide).
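A minimal sketch of such a pipeline using scikit-learn, whose HashingVectorizer is likewise built on MurmurHash3; the toy reviews, feature dimension, and ridge regressor below are my own illustrative stand-ins for the authors' actual setup.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Ridge

# Toy stand-ins for Yelp review text and their 1-5 star labels.
reviews = ["great food and friendly staff",
           "terrible service, never again",
           "average place, decent prices",
           "absolutely loved the desserts"]
stars = np.array([5.0, 1.0, 3.0, 5.0])

# Word n-grams up to n = 3, hashed into a fixed-width sparse feature space.
vectorizer = HashingVectorizer(ngram_range=(1, 3), n_features=2 ** 18)
X = vectorizer.fit_transform(reviews)

model = Ridge(alpha=1.0).fit(X, stars)
rmse = np.sqrt(np.mean((model.predict(X) - stars) ** 2))
print("train RMSE:", rmse)
```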
Outline
1 Overview
Introduction
Background
2 Datasets
TIMIT
Yelp Reviews
CIFAR-10
3 Experimental Results
4 Performance and Scalability
5 Conclusion
CIFAR-10
The task was image classification on the CIFAR-10 dataset.
The dataset contained 500,000 training images and 4096
features per image.
The authors started with these 4096 features in the dataset as
input and used the RBF kernel for the exact and Nystrom
method and random cosines for the random features method.
Experimental Results (1/3)
Figure: Classification Error against Time using different methods on the
TIMIT, Yelp and CIFAR-10 datasets. The little black stars denote the
end of an epoch
Experimental Results (2/3)
Figure: Classification Error against number of features for Nystrom and
Random Features on the TIMIT, Yelp and CIFAR-10 datasets
Experimental Results (3/3)
Performance
Figure: Breakdown of time to compute a single block of coordinate
descent in the first epoch on the TIMIT, Yelp and CIFAR-10 datasets
From the figure, we see that the choice of the kernel
approximation can significantly impact performance since
different kernels take different amounts of time to generate.
For example, the hash random feature used for the Yelp
dataset is much cheaper to compute than the string kernel.
However, computing a block of the RBF kernel is similar in
cost to computing a block of random cosine features.
Scalability of RBF Kernel Generation
Figure: Time taken to compute one block of the RBF kernel as they
scale the number of examples and the number of machines used
Here, ideal scaling implies that the time to generate a block of the kernel
matrix remains constant as they increase both the data and the number
of machines.
However, computing a block of the RBF kernel involves broadcasting a
b x d matrix to all the machines in the cluster. This causes a slight
decrease in performance as they go from 8 to 128 machines. Still, they
believe that the kernel block generation methods will continue to scale
well for larger datasets, since broadcast routines scale as O(log M)
(a rough cost-model sketch follows this slide).
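As a rough back-of-the-envelope illustration of that O(log M) claim (a cost model of my own, not the authors' measurement): a tree broadcast of a b x d block needs about ceil(log2 M) communication rounds, so going from 8 to 128 machines adds only a handful of extra rounds.

```python
import math

def tree_broadcast_time(num_machines, block_bytes,
                        latency_s=1e-3, bandwidth_bytes_s=1e9):
    """Rough tree-broadcast cost: one (latency + transfer) per tree level."""
    rounds = math.ceil(math.log2(num_machines)) if num_machines > 1 else 0
    return rounds * (latency_s + block_bytes / bandwidth_bytes_s)

block_bytes = 2048 * 4096 * 8   # a hypothetical b x d block of 8-byte doubles
for m in (8, 32, 128):
    print(m, "machines:", round(tree_broadcast_time(m, block_bytes), 3), "seconds")
```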
Conclusion
This paper shows that scalable kernel machines are feasible
with distributed computation.
Results suggest that the Nystrom method generally achieves
better statistical accuracy than random features.
However, it can require significantly more iterations of
optimization.
On the theoretical side, a limitation of this analysis is that one
cannot hope for convergence rates better than those of gradient
descent.
References and Further Reading I
Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman,
Benjamin Recht
Large Scale Kernel Learning using Block Coordinate Descent
February 18, 2016
Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin,
Zhi-Hua Zhou
Nystrom Method vs Random Fourier Features: A Theoretical
and Empirical Comparison
Advances in Neural Information Processing Systems 25 (NIPS
2012)