*Spark/MLlib/Evaluation : Evaluating Performance Of A Classifier*

Sunday, March 20, 2016

One of the major criteria in selecting a model for a classifier is the predictive performance of the model. Two metrics that can assist in making this decision are the PR curve and the ROC curve. A brief description of these metrics is given below.

The Precision-Recall (PR) curve plots precision against recall. Precision is the number of true positives divided by the sum of true positives and false positives, while recall is the number of true positives divided by the sum of true positives and false negatives. The area under the PR curve is referred to as the average precision, and a value of 100% for the area under the curve is considered perfect.

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR). TPR is the number of true positives divided by the sum of true positives and false negatives, while FPR is the number of false positives divided by the sum of false positives and true negatives. As with the PR curve, an area under the curve (AUC) of 100% is considered perfect.
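To make the definitions above concrete, here is a small self-contained sketch in plain Scala that computes precision, recall/TPR and FPR from a toy confusion matrix. The counts are made-up values for illustration only:

```scala
// Hypothetical confusion-matrix counts (for illustration only)
val tp = 80.0 // true positives
val fp = 20.0 // false positives
val fn = 40.0 // false negatives
val tn = 60.0 // true negatives

val precision = tp / (tp + fp) // 80 / 100 = 0.8
val recall    = tp / (tp + fn) // 80 / 120 ~ 0.667 (recall is the same quantity as TPR)
val fpr       = fp / (fp + tn) // 20 / 80  = 0.25

println(f"precision=$precision%.3f recall/TPR=$recall%.3f FPR=$fpr%.3f")
```

A perfect classifier would drive precision and TPR to 1.0 and FPR to 0.0, which is what pushes the areas under both curves toward 100%.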

Spark comes with a class, BinaryClassificationMetrics, which can be used to calculate the above metrics. Execution of this class for an attack classifier over a dataset of events is given below. The metrics are used to evaluate the effectiveness of Logistic Regression, SVM, Naive Bayes and Decision Tree against the same dataset, with logregModel, svmModel, naivebayesModel and dectreeModel representing the individual model outputs.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Logistic Regression and SVM: pair each prediction with the true label
val lrsvm = Seq(logregModel, svmModel).map { model =>
  val labelval = event.map { point =>
    (model.predict(point.features), point.label)
  }
  val bcm = new BinaryClassificationMetrics(labelval)
  (model.getClass.getSimpleName, bcm.areaUnderPR, bcm.areaUnderROC)
}

// Naive Bayes: threshold the predicted value at 0.5 to get a 0/1 label
val naivebayes = Seq(naivebayesModel).map { model =>
  val labelval = event.map { point =>
    val values = model.predict(point.features)
    (if (values > 0.5) 1.0 else 0.0, point.label)
  }
  val bcm = new BinaryClassificationMetrics(labelval)
  (model.getClass.getSimpleName, bcm.areaUnderPR, bcm.areaUnderROC)
}

// Decision Tree: same thresholding as for Naive Bayes
val decisiontree = Seq(dectreeModel).map { model =>
  val labelval = event.map { point =>
    val values = model.predict(point.features)
    (if (values > 0.5) 1.0 else 0.0, point.label)
  }
  val bcm = new BinaryClassificationMetrics(labelval)
  (model.getClass.getSimpleName, bcm.areaUnderPR, bcm.areaUnderROC)
}

// Combine the results and print both areas as percentages
val allbcm = lrsvm ++ naivebayes ++ decisiontree
allbcm.foreach { case (model, pr, roc) =>
  println(f"$model, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}

The output of the program is:

LogisticRegressionModel, Area under PR: 78.8976%, Area under ROC: 53.2875%

SVMModel, Area under PR: 79.4210%, Area under ROC: 54.6198%

NaiveBayesModel, Area under PR: 71.3652%, Area under ROC: 62.1087%

DecisionTreeModel, Area under PR: 76.3567%, Area under ROC: 66.1256%

With a ROC AUC of around 66%, the Decision Tree appears to be the best-performing model of the four for this dataset.
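One caveat worth noting: for Logistic Regression and SVM, predict with the default threshold returns hard 0/1 labels, so the curves above are built from only those two score values. MLlib's clearThreshold makes predict return the raw score instead, which lets BinaryClassificationMetrics trace the full curves and typically gives a more informative ROC AUC. A minimal sketch, assuming the same logregModel, svmModel and event values as above:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// clearThreshold() switches predict() from 0/1 labels to raw scores
val scored = Seq(logregModel.clearThreshold(), svmModel.clearThreshold()).map { model =>
  val scoreAndLabel = event.map { point =>
    (model.predict(point.features), point.label) // (raw score, true label)
  }
  val bcm = new BinaryClassificationMetrics(scoreAndLabel)
  (model.getClass.getSimpleName, bcm.areaUnderPR, bcm.areaUnderROC)
}
```

This does not change which model wins here, but it makes the PR and ROC numbers for the two linear models directly comparable to the thresholded ones.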