Classification using Support Vector Machines and K-Nearest Neighbours algorithms in RStudio.

Breast Cancer Classification Data Preparation

A glimpse at our data

The dataset used can be downloaded at: https://data.world/health/breast-cancer-wisconsin.

Breast cancer is the fourth most common cause of cancer death in the UK, accounting for around thirty-two deaths every day according to 2017-2019 data from Cancer Research UK. The better news is that the survival rate is now 78%, up from 40% over the past 40 years. That improvement is largely due to new technology in detection and treatment: medical imaging procedures such as Magnetic Resonance Imaging (MRI), mammography, and ultrasound. Some of the latest developments, however, are coming from machine learning. Machine learning methods are widely recognized as valuable contributors to breast cancer pattern classification, and they also support clinical decision-making and diagnosis.

The required libraries.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
#Read our Dataset
df = read.csv("C:\\Users\\Desktop\\Statistical Analysis\\dataset.csv")

#Having a glimpse of our dataset
glimpse(df)
## Rows: 569
## Columns: 32
## $ id                      <int> 842302, 842517, 84300903, 84348301, 84358402, …
## $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ concave.points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ concave.points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ concave.points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…
#Check for missing values
anyNA(df)
## [1] FALSE
#Remove the ID column & convert the target column "diagnosis" into a factor
df$id = NULL
df$diagnosis <- as.factor(df$diagnosis)

#Have a look at the class balance of the diagnosis variable
round(prop.table(table(df$diagnosis)), 2)
## 
##    B    M 
## 0.63 0.37
#Have a glimpse after the edits
glimpse(df)
## Rows: 569
## Columns: 31
## $ diagnosis               <fct> M, M, M, M, M, M, M, M, M, M, M, M, M, M, M, M…
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450…
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9…
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, …
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, …
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0…
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0…
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0…
## $ concave.points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0…
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087…
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0…
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345…
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902…
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18…
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.…
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114…
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246…
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0…
## $ concave.points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188…
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0…
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051…
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8…
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6…
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,…
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, …
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791…
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249…
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0…
## $ concave.points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0…
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985…
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0…

Exploratory Plots

A correlation heatmap of the features (not reproduced here) makes it clear from the dark cells that some variables are highly correlated.
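A minimal sketch of how such a heatmap can be generated, assuming the corrplot package is installed:

#Correlation heatmap of the 30 numeric features (after removing id, they sit in columns 2:31)
library(corrplot)
corrplot(cor(df[, 2:31]), method = "color", tl.cex = 0.5)

Let's also see the spread of the variables via a box plot.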

#Visualize box plots for all 30 parameters
boxplot(df[, 2:31], 
        main = 'Box Plots of the Parameters')

#Let's also try a Random Forest to see the importance of the features in the model
fitControl <- trainControl(method = "cv",
                           number = 5,
                           preProcOptions = list(thresh = 0.99), # PCA variance threshold; only used if 'pca' is added to preProcess
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

randomf <- train(diagnosis~.,
                 data = df,
                 method="rf",
                 metric="ROC",
                 #tuneLength=10,
                 preProcess = c('center', 'scale'),
                 trControl=fitControl)

plot(varImp(randomf), top = 30, main = "Random forest")
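The importance ranking can also be inspected numerically; a small sketch (for rf models caret reports a single Overall score per feature):

#Top 10 features by random forest importance
imp <- varImp(randomf)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)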

Model Creation: 80/20 Split

set.seed(40)
df[,'train'] <- ifelse(runif(nrow(df))<0.8,1,0)

#separate the training and testing sets
trainset <- df[df$train==1,]
testset <- df[df$train==0,]
#gets the index of the train column
trainColNum <- grep('train',names(trainset))
#remove train flag column from train and test sets
trainset <- trainset[,-trainColNum]
testset <- testset[,-trainColNum]
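As a quick sanity check (output not shown), the random split should preserve the roughly 63/37 class balance observed earlier in both sets:

#Class proportions in the training and test sets
round(prop.table(table(trainset$diagnosis)), 2)
round(prop.table(table(testset$diagnosis)), 2)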

# Check data types
print(class(trainset$diagnosis))
## [1] "factor"
print(class(testset$diagnosis))
## [1] "factor"
# Convert to factor if needed
if (!is.factor(trainset$diagnosis)) {
  trainset$diagnosis <- as.factor(trainset$diagnosis)
}

if (!is.factor(testset$diagnosis)) {
  testset$diagnosis <- as.factor(testset$diagnosis)
}

# Handle missing values
if (anyNA(trainset$diagnosis)) {
  trainset <- trainset[complete.cases(trainset), ]
}

if (anyNA(testset$diagnosis)) {
  testset <- testset[complete.cases(testset), ]
}

Create Support Vector Machine Linear Model

Support vector machines (SVMs) are supervised learning models used particularly for classification, regression analysis, and novelty detection. The method is grounded in statistical learning theory and was developed by Vladimir Vapnik and colleagues at AT&T Bell Laboratories.
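For context, the soft-margin SVM solves the following optimization problem, where the regularization constant C is exposed as the cost argument of e1071::svm that we tune below:

  minimize over w, b, ξ:   (1/2)·||w||² + C·Σᵢ ξᵢ
  subject to:              yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0  for every training point i.

A small cost widens the margin at the price of more training errors; a large cost fits the training data more tightly.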

library(e1071)

# Train the SVM model
svm_model <- svm(diagnosis ~ ., data = trainset, type = 'C-classification', kernel = 'linear')
print(svm_model)
## 
## Call:
## svm(formula = diagnosis ~ ., data = trainset, type = "C-classification", 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  36
# Predictions on the training set
pred_train <- predict(svm_model, trainset)
train_accuracy <- mean(pred_train == trainset$diagnosis)
print(paste("Training Set Accuracy:", train_accuracy))
## [1] "Training Set Accuracy: 0.991266375545852"
# Predictions on the test set
pred_test <- predict(svm_model, testset)
test_accuracy <- mean(pred_test == testset$diagnosis)
print(paste("Test Set Accuracy:", test_accuracy))
## [1] "Test Set Accuracy: 0.981981981981982"
# Confusion matrix
confusion <- table(pred_test, testset$diagnosis)
print(confusion)
##          
## pred_test  B  M
##         B 72  2
##         M  0 37
#build scatter plot of training dataset
scatter_plot <- ggplot(data = trainset, aes(x = concave.points_worst, y = area_worst, color = diagnosis)) + 
  geom_point() + 
  scale_color_manual(values = c("red", "blue"))
scatter_plot

#Mark out the support vectors on the plot using their indices from the SVM model.
layered_plot <- 
  scatter_plot + geom_point(data = trainset[svm_model$index, ], aes(x = concave.points_worst, y = area_worst), color = "purple", size = 4, alpha = 0.5)
layered_plot

#Find the optimal cost parameter via a cross-validated grid search
tune.out <- tune(svm,diagnosis~., data=trainset, kernel = "linear",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
# extract the best model
(bestmod <- tune.out$best.model)
## 
## Call:
## best.tune(METHOD = svm, train.x = diagnosis ~ ., data = trainset, 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  54
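e1071 also keeps the full cross-validated grid, which is worth a glance before refitting; a small optional check:

#Cross-validated error for every candidate cost value
summary(tune.out)
#The same information as a data frame
tune.out$performances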
#Refit the model with the tuned cost value and run the predictions again

svm_model <- svm(diagnosis~., data=trainset, type='C-classification', kernel='linear', cost = 0.1)


pred_train <- predict(svm_model, trainset) #predicting with SVM
mean(pred_train==trainset$diagnosis) #percentage of trainset predicted correctly
## [1] 0.9847162
pred_test <- predict(svm_model, testset) #predicting with the new svm model
mean(pred_test==testset$diagnosis) #percentage of testset predicted correctly
## [1] 0.972973

Create Support Vector Machine Radial Model

#Create an SVM with the radial (RBF) kernel
svm_model <- svm(diagnosis~., data=trainset, type='C-classification', kernel='radial')


pred_train <- predict(svm_model, trainset) #predicting with SVM
mean(pred_train==trainset$diagnosis) #percentage of trainset predicted correctly
## [1] 0.989083
pred_test <- predict(svm_model, testset) #predicting with the new svm model
mean(pred_test==testset$diagnosis) #percentage of testset predicted correctly
## [1] 0.981982
table(pred_test, testset$diagnosis) #confusion matrix of the predictions of the svm and the test data
##          
## pred_test  B  M
##         B 72  2
##         M  0 37
#Tune model to find optimal cost, gamma values
tune.out <- tune(svm,diagnosis~., data=trainset, kernel = "radial",
                 ranges = list(cost = c(0.1,1,10,100,1000),
                               gamma = c(0.5,1,2,3,4)))
# show best model
tune.out$best.model
## 
## Call:
## best.tune(METHOD = svm, train.x = diagnosis ~ ., data = trainset, 
##     ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 
##         1, 2, 3, 4)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  433
#Run again with the tuned value of cost
#(note: only cost is carried over here; the tuned gamma is not, see the sketch after the confusion matrix below)
svm_model <- svm(diagnosis~., data=trainset, type='C-classification', kernel='radial', cost=10)


pred_train <- predict(svm_model, trainset) #predicting with SVM
mean(pred_train==trainset$diagnosis) #percentage of trainset predicted correctly
## [1] 0.9912664
pred_test <- predict(svm_model, testset) #predicting with the new svm model
mean(pred_test==testset$diagnosis) #percentage of testset predicted correctly
## [1] 0.972973
table(pred_test, testset$diagnosis)
##          
## pred_test  B  M
##         B 71  2
##         M  1 37
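As noted above, the refit only carried over the tuned cost, leaving gamma at its default of 1/(number of features). A sketch of a refit that would use both tuned values (results not shown):

#Refit using both tuned hyperparameters from the grid search
best_params <- tune.out$best.parameters  #one-row data frame with columns cost and gamma
svm_tuned <- svm(diagnosis ~ ., data = trainset, type = 'C-classification',
                 kernel = 'radial', cost = best_params$cost, gamma = best_params$gamma)
mean(predict(svm_tuned, testset) == testset$diagnosis)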

KNN Classification

KNN is a supervised learning algorithm, meaning it learns from labelled examples: pairs of input vectors and output labels. Like SVM, KNN is used for both classification and regression. For classification, the algorithm typically relies on the Euclidean distance to determine the class of a point, so it is highly recommended to normalize the data beforehand. KNN is called a lazy learner because it simply stores the training set and defers all computation until a prediction is actually needed.
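Concretely, the Euclidean distance between two feature vectors x and y is d(x, y) = √(Σᵢ (xᵢ − yᵢ)²), so features on large scales dominate the distance. In this dataset area_mean runs into the hundreds and thousands while smoothness_mean sits around 0.1, which a quick check of the standard deviations makes obvious (output not shown):

#Standard deviations differ by several orders of magnitude, so unscaled
#Euclidean distances would be driven almost entirely by the large-scale features
sapply(df[, c("area_mean", "smoothness_mean")], sd)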

library(caret)
df <- read.csv("C:\\Users\\Desktop\\Statistical Analysis\\dataset.csv")
df$diagnosis <- as.factor(df$diagnosis)
df$id = NULL

#divide the data into 80% train and 20% test
train <- createDataPartition(df[,"diagnosis"], p=0.8, list = FALSE)
help("createDataPartition")
## starting httpd help server ... done
dftrain <- df[train,]
dftest <- df[-train,]

#define a 10-fold cross-validation control and run the knn algorithm with k=10
#(note: ctrl is defined but never passed to train() below, so caret silently falls
#back to its default bootstrap resampling, as the print output later confirms)
ctrl <- trainControl(method = "cv", number = 10)

fit.cv <- train(diagnosis ~., data = dftrain, method = "knn", 
                preProcess = c("center","scale"),
                tuneGrid = data.frame(k=10))

#class predictions, plus the probability of each level of the factor diagnosis
pred <- predict(fit.cv, dftest)
pred.prob <- predict(fit.cv, dftest, type ="prob")

confusionMatrix(table(dftest[,"diagnosis"],pred))
## Confusion Matrix and Statistics
## 
##    pred
##      B  M
##   B 70  1
##   M  6 36
##                                           
##                Accuracy : 0.9381          
##                  95% CI : (0.8765, 0.9747)
##     No Information Rate : 0.6726          
##     P-Value [Acc > NIR] : 9.858e-12       
##                                           
##                   Kappa : 0.8641          
##                                           
##  Mcnemar's Test P-Value : 0.1306          
##                                           
##             Sensitivity : 0.9211          
##             Specificity : 0.9730          
##          Pos Pred Value : 0.9859          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.6726          
##          Detection Rate : 0.6195          
##    Detection Prevalence : 0.6283          
##       Balanced Accuracy : 0.9470          
##                                           
##        'Positive' Class : B               
## 
print(fit.cv)
## k-Nearest Neighbors 
## 
## 456 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 456, 456, 456, 456, 456, 456, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9695791  0.9351966
## 
## Tuning parameter 'k' was held constant at a value of 10
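pred.prob is computed above but never used; one natural use for it (a sketch, assuming the pROC package is installed) is an ROC curve for the malignant class:

#ROC curve and AUC from the predicted class probabilities
library(pROC)
roc_obj <- roc(response = dftest$diagnosis, predictor = pred.prob$M)
plot(roc_obj)
auc(roc_obj)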
#perform a cross-validation and run the knn algorithm over a sequence of k values
train <- createDataPartition(df[,"diagnosis"], p=0.8, list = FALSE)
dftrain <- df[train,]
dftest <- df[-train,]

ctrl <- trainControl(method = "cv", number = 10)

fit.cv <- train(diagnosis ~., data = dftrain, method = "knn", 
                trControl = ctrl,
                preProcess = c("center","scale"),
                tuneGrid = data.frame(k=seq(5,100,by=5)))

#class predictions, plus the probability of each level of the factor diagnosis
pred <- predict(fit.cv, dftest)
pred.prob <- predict(fit.cv, dftest, type ="prob")

confusionMatrix(table(dftest[,"diagnosis"],pred))
## Confusion Matrix and Statistics
## 
##    pred
##      B  M
##   B 69  2
##   M  3 39
##                                           
##                Accuracy : 0.9558          
##                  95% CI : (0.8998, 0.9855)
##     No Information Rate : 0.6372          
##     P-Value [Acc > NIR] : 6.935e-16       
##                                           
##                   Kappa : 0.9048          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9583          
##             Specificity : 0.9512          
##          Pos Pred Value : 0.9718          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.6372          
##          Detection Rate : 0.6106          
##    Detection Prevalence : 0.6283          
##       Balanced Accuracy : 0.9548          
##                                           
##        'Positive' Class : B               
## 
print(fit.cv)
## k-Nearest Neighbors 
## 
## 456 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 411, 410, 410, 410, 411, 410, ... 
## Resampling results across tuning parameters:
## 
##   k    Accuracy   Kappa    
##     5  0.9606280  0.9137205
##    10  0.9604831  0.9134293
##    15  0.9562319  0.9023719
##    20  0.9517874  0.8928080
##    25  0.9561353  0.9021400
##    30  0.9540097  0.8977047
##    35  0.9540097  0.8978169
##    40  0.9540097  0.8969780
##    45  0.9561836  0.9015872
##    50  0.9496618  0.8876689
##    55  0.9518357  0.8922552
##    60  0.9496618  0.8876474
##    65  0.9409179  0.8671689
##    70  0.9409179  0.8669361
##    75  0.9409179  0.8669361
##    80  0.9387440  0.8620955
##    85  0.9322222  0.8474483
##    90  0.9321739  0.8477422
##    95  0.9278261  0.8376894
##   100  0.9212560  0.8228762
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(fit.cv)

#From the plot and the printed results, the best value in this run is k=5
#perform a cross-validation with additional performance metrics via multiClassSummary
train <- createDataPartition(df[,"diagnosis"], p=0.8, list = FALSE)
dftrain <- df[train,]
dftest <- df[-train,]
ctrl <- trainControl(method = "cv", number = 10, summaryFunction = multiClassSummary)

fit.cv <- train(diagnosis ~., data = dftrain, method = "knn", 
                trControl = ctrl,
                preProcess = c("center","scale"),
                tuneGrid = data.frame(k=seq(5,100,by=5)),
                metric = "Sensitivity")

#class predictions, plus the probability of each level of the factor diagnosis
pred <- predict(fit.cv, dftest)
pred.prob <- predict(fit.cv, dftest, type ="prob")

confusionMatrix(table(dftest[,"diagnosis"],pred))
## Confusion Matrix and Statistics
## 
##    pred
##      B  M
##   B 70  1
##   M 13 29
##                                           
##                Accuracy : 0.8761          
##                  95% CI : (0.8009, 0.9306)
##     No Information Rate : 0.7345          
##     P-Value [Acc > NIR] : 0.000204        
##                                           
##                   Kappa : 0.7183          
##                                           
##  Mcnemar's Test P-Value : 0.003283        
##                                           
##             Sensitivity : 0.8434          
##             Specificity : 0.9667          
##          Pos Pred Value : 0.9859          
##          Neg Pred Value : 0.6905          
##              Prevalence : 0.7345          
##          Detection Rate : 0.6195          
##    Detection Prevalence : 0.6283          
##       Balanced Accuracy : 0.9050          
##                                           
##        'Positive' Class : B               
## 
plot(fit.cv)

KNN with the class Library

library(class)
#Load the data (read.csv already returns a data frame)
df <- read.csv("C:\\Users\\Desktop\\Statistical Analysis\\dataset.csv")


#Randomly sample 80% of the row indices for the training set
#(a set.seed() call here would make the split reproducible)
ran <- sample(nrow(df), 0.8*nrow(df))

#min-max normalization function
nor <- function(x) { (x - min(x))/(max(x) - min(x)) }

#apply the normalization to the feature columns
#(note: columns 3:31 cover only 29 of the 30 features; 3:32 would also include fractal_dimension_worst)
df_nor <- as.data.frame(lapply(df[, 3:31], nor))
#create train set, test set
df_train <- df_nor[ran,]
df_test <- df_nor[-ran,]

#the second column (diagnosis) is our target variable; convert the training labels to a factor
df_target <- as.factor(df[ran,2])

#do the same for the test-set labels
test_target <- as.factor(df[-ran,2])


#Run the KNN algorithm from the class library (loaded above)
pr <- knn(df_train, df_test, cl=df_target, k=10)

#confusion matrix
tb <- table(pr, test_target)
tb
##    test_target
## pr   B  M
##   B 78  1
##   M  3 32
#accuracy function: correct predictions (the diagonal) over the total count
accuracy <- function(x) {
  sum(diag(x)) / sum(x) * 100
}
accuracy(tb)
## [1] 96.49123
confusionMatrix(table(pr,test_target))
## Confusion Matrix and Statistics
## 
##    test_target
## pr   B  M
##   B 78  1
##   M  3 32
##                                           
##                Accuracy : 0.9649          
##                  95% CI : (0.9126, 0.9904)
##     No Information Rate : 0.7105          
##     P-Value [Acc > NIR] : 2.42e-12        
##                                           
##                   Kappa : 0.9162          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.9630          
##             Specificity : 0.9697          
##          Pos Pred Value : 0.9873          
##          Neg Pred Value : 0.9143          
##              Prevalence : 0.7105          
##          Detection Rate : 0.6842          
##    Detection Prevalence : 0.6930          
##       Balanced Accuracy : 0.9663          
##                                           
##        'Positive' Class : B               
## 

#Search for the best value of k between 1 and 30
k.optm <- numeric(30)
for(i in 1:30){
  knn.mod <- knn(df_train, df_test, cl=df_target, k=i)
  k.optm[i] <- 100*sum(test_target==knn.mod)/NROW(test_target)
  cat(i,'=',k.optm[i],'')
}
## 1 = 92.98246 2 = 95.61404 3 = 94.73684 4 = 96.49123 5 = 94.73684 6 = 93.85965 7 = 95.61404 8 = 95.61404 9 = 95.61404 10 = 97.36842 11 = 95.61404 12 = 94.73684 13 = 94.73684 14 = 94.73684 15 = 93.85965 16 = 93.85965 17 = 94.73684 18 = 94.73684 19 = 94.73684 20 = 94.73684 21 = 94.73684 22 = 94.73684 23 = 94.73684 24 = 94.73684 25 = 94.73684 26 = 94.73684 27 = 94.73684 28 = 94.73684 29 = 94.73684 30 = 94.73684
plot(k.optm, type = "b", xlab="K Value", ylab="Accuracy")
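Rather than reading the best k off the plot, it can also be extracted programmatically; given the accuracies printed above this returns 10:

#Index of the maximum accuracy = best k (ties go to the smallest index)
which.max(k.optm)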


Conclusion

We reached satisfyingly high accuracy with both algorithms. The best SVM configuration was the linear kernel with cost set at 1, which achieved high accuracy on both the training set (0.9912664) and the test set (0.981982); the radial kernel at cost=10 was a close second but suffered slightly on the test set. The KNN algorithm reached 0.9649 accuracy with k=10 using the class library, peaking at 0.9737 over the k values tried in the tuning loop. The best confusion matrix overall came from the SVM, with only 2 misclassifications on the test set. SVM and KNN both work very well on data classification, reaching high accuracy particularly when the data is cleaned and normalized. Having tried several of the options each method offers, the highest test accuracy, roughly 98%, came from the support vector machine with a linear kernel.

If you found this useful, please consider making a donation to Cancer Research UK.
