ROC曲线及facenet中的使用

最新推荐文章于 2022-01-20 14:39:20 发布

原创最新推荐文章于 2022-01-20 14:39:20 发布 · 5k 阅读

15 ·

CC 4.0 BY-SA版权

文章标签：

#facenetroc

深度学习（deep learning）同时被 3 个专栏收录

43 篇文章

订阅专栏

机器学习（machine learning）

38 篇文章

订阅专栏

python

14 篇文章

订阅专栏

ROC（Receiver Operating Characteristic）曲线，以及AUC（Area Under Curve），常用来评价一个二值分类器的优劣，ROC的横轴为false positive rate，FPR，也就是误判为正确的比例，纵轴是true positive rate，也就是正确的判断为正确的比例，定义如下：
这里写图片描述
接下来我们考虑ROC曲线图中的四个点和一条线。第一个点，(0,1)，即FPR=0, TPR=1，这意味着FN（false negative）=0，并且FP（false positive）=0。Wow，这是一个完美的分类器，它将所有的样本都正确分类。第二个点，(1,0)，即FPR=1，TPR=0，类似地分析可以发现这是一个最糟糕的分类器，因为它成功避开了所有的正确答案。第三个点，(0,0)，即FPR=TPR=0，即FP（false positive）=TP（true positive）=0，可以发现该分类器预测所有的样本都为负样本（negative）。类似的，第四个点（1,1），分类器实际上预测所有的样本都为正样本。经过以上的分析，我们可以断言，ROC曲线越接近左上角，该分类器的性能越好。
这里写图片描述
下面考虑ROC曲线图中的虚线y=x上的点。这条对角线上的点其实表示的是一个采用随机猜测策略的分类器的结果，例如(0.5,0.5)，表示该分类器随机对于一半的样本猜测其为正样本，另外一半的样本为负样本。

如何画ROC曲线

对于一个特定的分类器和测试数据集，显然只能得到一个分类结果，即一组FPR和TPR结果，而要得到一个曲线，我们实际上需要一系列FPR和TPR的值，这又是如何得到的呢？我们先来看一下Wikipedia上对ROC曲线的定义：

In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied.

问题在于“as its discrimination threashold is varied”。如何理解这里的“discrimination threashold”呢？我们忽略了分类器的一个重要功能“概率输出”，即表示分类器认为某个样本具有多大的概率属于正样本（或负样本）。通过更深入地了解各个分类器的内部机理，我们总能想办法得到一种概率输出。通常来说，是将一个实数范围通过某个变换映射到(0,1)区间[^3]。
假如我们已经得到了所有样本的概率输出（属于正样本的概率），现在的问题是如何改变“discrimination threashold”？我们根据每个测试样本属于正样本的概率值从大到小排序。每次选取一个不同的threshold，我们就可以得到一组FPR和TPR，即ROC曲线上的一点。
当我们将threshold设置为1和0时，分别可以得到ROC曲线上的(0,0)和(1,1)两个点。将这些(FPR,TPR)对连接起来，就得到了ROC曲线。当threshold取值越多，ROC曲线越平滑。

其实，我们并不一定要得到每个测试样本是正样本的概率值，只要得到这个分类器对该测试样本的“评分值”即可（评分值并不一定在(0,1)区间）。评分越高，表示分类器越肯定地认为这个测试样本是正样本，而且同时使用各个评分值作为threshold。我认为将评分值转化为概率更易于理解一些。

AUC值的计算

AUC（Area Under Curve）被定义为ROC曲线下的面积，显然这个面积的数值不会大于1。又由于ROC曲线一般都处于y=x这条直线的上方，所以AUC的取值范围在0.5和1之间。使用AUC值作为评价标准是因为很多时候ROC曲线并不能清晰的说明哪个分类器的效果更好，而作为一个数值，对应AUC更大的分类器效果更好。

为什么使用ROC曲线

既然已经这么多评价标准，为什么还要使用ROC和AUC呢？因为ROC曲线有个很好的特性：当测试集中的正负样本的分布变化的时候，ROC曲线能够保持不变。在实际的数据集中经常会出现类不平衡（class imbalance）现象，即负样本比正样本多很多（或者相反），而且测试数据中的正负样本的分布也可能随着时间变化。下图是ROC曲线和Precision-Recall曲线[^5]的对比：
这里写图片描述
在上图中，(a)和(c)为ROC曲线，(b)和(d)为Precision-Recall曲线。(a)和(b)展示的是分类其在原始测试集（正负样本分布平衡）的结果，(c)和(d)是将测试集中负样本的数量增加到原来的10倍后，分类器的结果。可以明显的看出，ROC曲线基本保持原貌，而Precision-Recall曲线则变化较大。

下面给出facenet中的计算roc代码：

#embeddings：输入端的特征向量，由于这份代码是针对lfw数据的，因此特征是每两个特征之间是一对
#actual_issame，实际上是不是同一类，或者说是不是同一个人
#nrof_folds，交叉验证的fold数
#下面有个thresholds，是从0开始，间隔0.01，一直到4，总共400个阈值，这里就对应了前面提到的，多个阈值下，得到多个TPR和FPR值，从而可以画出一个ROC曲线
#返回的tpr，fpr，即可用于绘制ROC曲线
def evaluate(embeddings, actual_issame, nrof_folds=10):
    # Calculate evaluation metrics
    thresholds = np.arange(0, 4, 0.01)
    embeddings1 = embeddings[0::2]
    embeddings2 = embeddings[1::2]
    tpr, fpr, accuracy = facenet.calculate_roc(thresholds, embeddings1, embeddings2,
        np.asarray(actual_issame), nrof_folds=nrof_folds)
    thresholds = np.arange(0, 4, 0.001)
    val, val_std, far = facenet.calculate_val(thresholds, embeddings1, embeddings2,
        np.asarray(actual_issame), 1e-3, nrof_folds=nrof_folds)
    return tpr, fpr, accuracy, val, val_std, far



def calculate_roc(thresholds, embeddings1, embeddings2, actual_issame, nrof_folds=10):
    assert(embeddings1.shape[0] == embeddings2.shape[0])
    assert(embeddings1.shape[1] == embeddings2.shape[1])
    #lfw中点对的个数
    nrof_pairs = min(len(actual_issame), embeddings1.shape[0])
    #这里thresholds是一个长度为400的数组，从0开始，间隔0.01
    nrof_thresholds = len(thresholds)
    k_fold = KFold(n_splits=nrof_folds, shuffle=False)
    #tprs是一个10x400的数组
    tprs = np.zeros((nrof_folds,nrof_thresholds))
    fprs = np.zeros((nrof_folds,nrof_thresholds))
    accuracy = np.zeros((nrof_folds))

    #求点对之间的差值
    diff = np.subtract(embeddings1, embeddings2)
    #求点对之间的差值的平方和，再相加，相当于是求完欧式距离
    dist = np.sum(np.square(diff),1)
    indices = np.arange(nrof_pairs)

    #将数据分成10份，一份是测试，9份是训练
    for fold_idx, (train_set, test_set) in enumerate(k_fold.split(indices)):

        # Find the best threshold for the fold
        acc_train = np.zeros((nrof_thresholds))
        #在0到4，每间隔0.01的数字作为阈值，然后再选择其中值最大的作为阈值
        for threshold_idx, threshold in enumerate(thresholds):
            #在当前阈值下，求训练数据的准确度
            _, _, acc_train[threshold_idx] = calculate_accuracy(threshold, dist[train_set], actual_issame[train_set])
        #获取0-4之间阈值下的最佳值
        best_threshold_index = np.argmax(acc_train)
        #获取0-4之间阈值下的测试集的TPR和FPR
        for threshold_idx, threshold in enumerate(thresholds):
            tprs[fold_idx,threshold_idx], fprs[fold_idx,threshold_idx], _ = calculate_accuracy(threshold, dist[test_set], actual_issame[test_set])
        #根据上面获取的最佳阈值，求测试数据集的准确度，作为最终的准确度
        _, _, accuracy[fold_idx] = calculate_accuracy(thresholds[best_threshold_index], dist[test_set], actual_issame[test_set])
        #由于还有交叉验证，因此还需要求10次tpr和fpr均值
        tpr = np.mean(tprs,0)
        fpr = np.mean(fprs,0)
    return tpr, fpr, accuracy

def calculate_accuracy(threshold, dist, actual_issame):
    predict_issame = np.less(dist, threshold)
    #判断为同一个人的，与实际上也为同一个人的个数，真真率
    tp = np.sum(np.logical_and(predict_issame, actual_issame))
    #判断为同一个人的，实际上不是同一个人的个数，真假率
    fp = np.sum(np.logical_and(predict_issame, np.logical_not(actual_issame)))
    #判断不为同一个人的个数，和实际上也不是同一个的个数，假假率
    tn = np.sum(np.logical_and(np.logical_not(predict_issame), np.logical_not(actual_issame)))
    #判断不为同一个人，实际上是同一个人的个数，假真率
    fn = np.sum(np.logical_and(np.logical_not(predict_issame), actual_issame))

    tpr = 0 if (tp+fn==0) else float(tp) / float(tp+fn)
    fpr = 0 if (fp+tn==0) else float(fp) / float(fp+tn)
    acc = float(tp+tn)/dist.size
    return tpr, fpr, acc