Machine Learning with Spark MLlib (Part 2)

GatsbyNewton

License: CC 4.0 BY-SA

Categories: Spark, Machine Learning. Tags: Spark, machine learning, MLlib

Article link: https://blog.csdn.net/u010376788/article/details/48876991

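The introduction in the previous post gave us a basic understanding of Spark MLlib. Starting with this post, we move on to hands-on practice. Because the focus is on applying Spark MLlib, I will not walk through the derivation of each algorithm here; I plan to cover those in a separate series when I find the time. This post covers the supervised learning algorithms in Spark MLlib: Logistic Regression, Naive Bayes, SVM (Support Vector Machine), Decision Tree, and Linear Regression.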

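It is worth noting that although Spark MLlib already provides interfaces for the common algorithms, you can still implement them yourself if, after reading its source code, you find that its performance or stability falls short of your own implementation, or for any other reason.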

1.Supervised Learning

First, let's look at the definition of supervised learning. Wikipedia defines it as follows:

"Supervised learning is the machine learning task of inferring a function from labeled training data."

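Put simply, supervised learning looks for a pattern (that is, a function) in data. Finding the pattern in a sequence of numbers is something we first met in middle school. For example, in 1, 2, 4, 8, ... the pattern is 2^x, where x is a non-negative integer.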
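As a tiny illustration of inferring a function from labeled data, the sequence above can be written as labeled pairs (x, y) and a candidate rule checked against them. The object and method names are made up for this sketch, and checking one hand-picked candidate is of course far simpler than what a real learning algorithm does.

```scala
object SequenceRule {
  // Labeled training data: the input x and the observed value y from the sequence 1, 2, 4, 8.
  val labeled: Seq[(Int, Int)] = Seq(0 -> 1, 1 -> 2, 2 -> 4, 3 -> 8)

  // Candidate rule: f(x) = 2^x, computed with a bit shift.
  def candidate(x: Int): Int = 1 << x

  // A rule "fits" if it reproduces every label in the training data.
  def fits(f: Int => Int, data: Seq[(Int, Int)]): Boolean =
    data.forall { case (x, y) => f(x) == y }
}
```

Once a rule fits the labeled examples, it can be applied to an unseen input such as x = 4.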

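Of the supervised learning algorithms covered in this post, Logistic Regression, Naive Bayes, SVM (Support Vector Machine), and Decision Tree are classification algorithms. More formally, a classification algorithm assigns an item (a piece of text, for example) to one of a set of existing categories based on its features or attributes. In short: sorting things into categories.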

2.Linear Regression

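Linear Regression is a statistical method for determining the quantitative relationship between two or more interdependent variables. Among the interfaces Spark MLlib provides for linear regression are LinearRegressionWithSGD and LassoWithSGD. LassoWithSGD can be seen as an enhanced version of LinearRegressionWithSGD: it adds an L1 penalty that shrinks the coefficients, which helps to "make sense" of the data when there are more features than samples, that is, when the input matrix X does not have full rank. The example here uses LinearRegressionWithSGD: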

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object LinearRegression {

  def main(args: Array[String]): Unit ={

    val length = args.length
    if(length != 2 && length != 3){
      System.err.println("Usage: <input file> <iteration number> <step size(optional)>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val data = sc.textFile(args(0))
    //Iteration number
    val iteration = args(1).toInt
    //Step size, default value is 0.01
    val stepSize = if(length == 3) args(2).toDouble else 0.01

    //Parse the data into LabeledPoint
    val parseData = data.map{line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    //Train model
    val model = LinearRegressionWithSGD.train(parseData, iteration, stepSize)
    //Check its coefficients
    val weight = model.weights

    println(weight)

    sc.stop()
  }
}
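To make the input layout and the two tuning arguments more concrete, here is a plain-Scala sketch with no Spark dependency. It parses lines in the "label:f1 f2 f3" layout that the program above reads, then runs batch gradient descent on the squared error. The object name, the constant step size, and the absence of an intercept are assumptions of this sketch; it is not a reproduction of MLlib's internal update rule.

```scala
object LinearRegressionSketch {
  // Parse one line in the "label:f1 f2 f3" layout into (label, features).
  def parseLine(line: String): (Double, Array[Double]) = {
    val elem = line.split(":")
    (elem(0).toDouble, elem(1).trim.split(" ").map(_.toDouble))
  }

  // Dot product of two equally sized vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Batch gradient descent on the mean squared error, without an intercept.
  // Assumption: a constant step size, chosen here for simplicity.
  def fit(data: Seq[(Double, Array[Double])], iterations: Int, stepSize: Double): Array[Double] = {
    val dim = data.head._2.length
    var w = Array.fill(dim)(0.0)
    for (_ <- 1 to iterations) {
      val grad = Array.fill(dim)(0.0)
      for ((y, x) <- data) {
        val err = dot(w, x) - y
        // Accumulate the average gradient over all samples.
        for (j <- 0 until dim) grad(j) += err * x(j) / data.size
      }
      // Move the weights against the gradient.
      w = w.zip(grad).map { case (wj, gj) => wj - stepSize * gj }
    }
    w
  }
}
```

With the three lines "2.0:1.0", "4.0:2.0" and "6.0:3.0" as input, 100 iterations and a step size of 0.1, the single weight settles very close to 2. A step size that is too large makes the weights diverge, which is why it is exposed as a parameter.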

3.Logistic Regression

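Logistic Regression is mainly used for binary classification, such as deciding whether an email is spam; its y value is either 0 or 1. For a detailed description of the algorithm, see my other post, Logistic Regression Notes. Spark MLlib provides two interfaces for Logistic Regression: LogisticRegressionWithSGD and LogisticRegressionWithLBFGS. LogisticRegressionWithLBFGS is the preferred one because it removes the need to tune a step size. The two are used in much the same way, but to make the step size visible, the example here uses LogisticRegressionWithSGD: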

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LogisticRegression {

  def main (args: Array[String]): Unit = {

    val length = args.length
    if(length != 2 && length != 3){
      System.err.println("Usage: <input file> <iteration number> <step size(optional)>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val data = sc.textFile(args(0))
    //Iteration number
    val iteration = args(1).toInt
    //Step size, default value is 0.01
    val stepSize = if(length == 3) args(2).toDouble else 0.01

    //Parse the data into LabeledPoint
    val parseData = data.map{line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(x => x.toDouble)))
    }
    //Train a model
    val model = LogisticRegressionWithSGD.train(parseData, iteration, stepSize)
    //Check its coefficients
    val weight = model.weights

    println(weight)

    sc.stop()
  }
}
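The program above only prints the learned weights. As a rough sketch of how such weights turn into a 0/1 prediction, the plain-Scala code below applies the logistic (sigmoid) function to the weighted sum of the features. The object name, the explicit intercept argument, and the 0.5 threshold are choices made for this sketch rather than a description of the library's internals.

```scala
object LogisticSketch {
  // Logistic (sigmoid) function: maps any real score to a value in (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Estimated probability of class 1 for features x under weights w and intercept b.
  def probability(w: Array[Double], b: Double, x: Array[Double]): Double =
    sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum + b)

  // Turn the probability into a 0/1 label by comparing it with a threshold.
  def predict(w: Array[Double], b: Double, x: Array[Double], threshold: Double = 0.5): Double =
    if (probability(w, b, x) > threshold) 1.0 else 0.0
}
```

Raising the threshold makes the classifier more reluctant to answer 1, which can be useful when false positives are costly, as in spam filtering.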

4.Naive Bayes

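Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent given the class. For a detailed description of the algorithm, see my other post, Naive Bayes Notes. In Spark MLlib the interface is simply called NaiveBayes: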

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object NaiveBayesDemo {

  def main (args: Array[String]): Unit = {

    val length = args.length
    if(length != 1 && length != 2){
      System.err.println("Usage: <input file> <lambda(optional)>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val data = sc.textFile(args(0))
    //Lambda (additive smoothing parameter), default value is 1.0
    val lambda = if(length == 2) args(1).toDouble else 1.0
    //Parse the data into LabeledPoint
    val parseData = data.map{line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    //Split the data half and half into the training and test datasets
    val splits = parseData.randomSplit(Array(0.5, 0.5), seed = 11L)
    val training = splits(0)
    val test = splits(1)

    //Train a model with the training dataset
    val model = NaiveBayes.train(training, lambda)
    //Predict the label of the test dataset
    val prediction = test.map(p => (model.predict(p.features), p.label))

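    //Print (predicted label, actual label) pairs; collect() assumes the test set fits in driver memory
    prediction.collect().foreach(println)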

    sc.stop()
  }
}
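The lambda argument is the additive smoothing parameter. A minimal plain-Scala sketch of what smoothing does to the per-feature counts of a single class is shown below, assuming count-style (multinomial) features. The object and method names are hypothetical, and this is only meant to illustrate the formula, not to mirror MLlib's implementation.

```scala
object SmoothingSketch {
  // Additive (Laplace) smoothing of the feature counts observed for one class:
  //   p_j = (count_j + lambda) / (sum(counts) + lambda * numFeatures)
  def smoothed(counts: Array[Double], lambda: Double): Array[Double] = {
    val denom = counts.sum + lambda * counts.length
    counts.map(c => (c + lambda) / denom)
  }
}
```

With counts (3, 0, 1) and lambda = 1, the estimates become 4/7, 1/7 and 2/7. Without smoothing (lambda = 0) the second feature would get probability 0, and any sample containing it would be ruled out for this class no matter what its other features say.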

5.SVM (Support Vector Machine)

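An SVM finds a hyperplane that separates the data into two classes, 1 and -1; the points closest to the separating hyperplane are called support vectors. Note that although the mathematical formulation uses -1 for the negative class, spark.mllib expects the negative class to be labeled 0 in the input data. In Spark MLlib the interface is SVMWithSGD: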

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object SVM {

  def main(args: Array[String]): Unit ={

    if(args.length != 2){
      System.err.println("Usage: <input file> <iteration number>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val data = sc.textFile(args(0))
    //Iteration number
    val iteration = args(1).toInt

    //Parse the data into LabeledPoint
    val parseData = data.map{line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(x => x.toDouble)))
    }
    //Train a model
    val model = SVMWithSGD.train(parseData, iteration)
    //Check its coefficients
    val weight = model.weights

    println(weight)

    sc.stop()
  }
}
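As with the earlier examples, the program prints only the weights. The plain-Scala sketch below shows how a linear SVM uses them: the sign of w·x + b picks the side of the hyperplane, and the hinge loss measures how badly a training point violates the margin. It uses the textbook -1/+1 labels from the description above; the object name and the explicit intercept are assumptions of this sketch.

```scala
object SvmSketch {
  // Signed score of x relative to the hyperplane w.x + b = 0.
  def margin(w: Array[Double], b: Double, x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

  // Predict +1 or -1 depending on which side of the hyperplane x falls.
  def predict(w: Array[Double], b: Double, x: Array[Double]): Int =
    if (margin(w, b, x) >= 0) 1 else -1

  // Hinge loss for a label y in {-1, +1}: zero once the point is on the
  // correct side with a score of at least 1, growing linearly otherwise.
  def hingeLoss(w: Array[Double], b: Double, x: Array[Double], y: Int): Double =
    math.max(0.0, 1.0 - y * margin(w, b, x))
}
```

Points with a nonzero hinge loss are the ones that still push the weights during training, which is the intuition behind the points nearest the hyperplane mattering most.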

6.Decision Tree (to be written)