Quick AUC calculation and plotting function in R

UPDATE 2016.02.05: Added ggplot2 code.
UPDATE 2016.10.26: Added code to colour the line in ggplot2.

NOTE: I’m having some issues with WordPress randomly deleting or changing some of the code text now and then. I don’t know why this happens, but I’ve pasted the entire code into a comment at the bottom as a backup. Also, if you’re having issues with copy/paste of the code on this page, you can now download the R source file here.

The area under the curve (AUC) of the receiver operating characteristic (ROC) is often used (for better or worse) as a validation statistic for species distribution models. In short, it compares predicted values to true values of binary classification (e.g. model predictions to observed presence-absence values for a species presence). It is a threshold-independent measure, so one needs not define a cutoff predicted value at which to define a predicted “presence” or “absence”. The ROC is created by plotting false presences against true presences for a continuum of threshold values (conceptually an infinite number of values, though this is obviously not necessary to calculate the AUC). There’s a decent Cross-Validated thread on ROC/AUC here.

Assumptions and issues with this approach aside, the ROCR package for R provides some nice commands to process this. But it took some time to figure out an efficient way to extract all the statistics that I wanted from the ROCR commands. With the help of some of Laura Gray‘s original code, I put together a (not so) quick function that produces some basic ROC/AUC statistics, as well as a quick AUC plot.

# Build some example data
# Observed (truth) data as presence-absence (1-0)
observations <- predicted data as values ranging from 0 to 1
# Predicted data as values ranging from 0 to 1
predictions <- abs(c(rnorm(250,mean=0),rnorm(250,mean=1)))
predictions <- round((predictions-min(predictions))/max(predictions),3)

# Install the ROCR package
install.packages('ROCR')
library('ROCR')

# AUC function
fun.auc <- function(pred,obs){
  # Run the ROCR functions for AUC calculation
  ROC_perf <- performance(prediction(pred,obs),"tpr","fpr")
  ROC_sens <- performance(prediction(pred,obs),"sens","spec")
  ROC_err <- performance(prediction(pred, labels=obs),"err")
  ROC_auc <- performance(prediction(pred,obs),"auc")
  # AUC value
  AUC <- ROC_auc@y.values[[1]] # AUC
  # Mean sensitivity across all cutoffs
  x.Sens <- mean(as.data.frame(ROC_sens@y.values)[,1])
  # Mean specificity across all cutoffs
  x.Spec <- mean(as.data.frame(ROC_sens@x.values)[,1])
  # Sens-Spec table to estimate threshold cutoffs
  SS <- data.frame(SENS=as.data.frame(ROC_sens@y.values)[,1],SPEC=as.data.frame(ROC_sens@x.values)[,1])
  # Threshold cutoff with min difference between Sens and Spec
  SS_min_dif <- ROC_perf@alpha.values[[1]][which.min(abs(SS$SENS-SS$SPEC))]
  # Threshold cutoff with max sum of Sens and Spec
  SS_max_sum <- ROC_perf@alpha.values[[1]][which.max(rowSums(SS[c("SENS","SPEC")]))]
  # Min error rate
  Min_Err <- min(ROC_err@y.values[[1]])
  # Threshold cutoff resulting in min error rate
  Min_Err_Cut <- ROC_err@x.values[[1]][which(ROC_err@y.values[[1]]==Min_Err)][1]
  # Kick out the values
  round(cbind(AUC,x.Sens,x.Spec,SS_min_dif,SS_max_sum,Min_Err,Min_Err_Cut),3)
}

# Run the function with the example data
fun.auc(predictions, observations)
       AUC x.Sens x.Spec SS_min_dif SS_max_sum Min_Err Min_Err_Cut
[1,] 0.637  0.518  0.622      0.232       0.32   0.382        0.32

Now to make the AUC plot. Note that this function also runs the AUC calculations, so the above function (auc.fun) need not be run prior (in case you just want the plot and not the numbers).

# AUC plot function
fun.aucplot   # Run the AUC calculations
  ROC_perf 
  # Spawn a new plot window (Windows OS)
  graphics.off(); x11(h=6,w=6)
  # Plot the curve
  plot(ROC_perf,colorize=T,print.cutoffs.at=seq(0,1,by=0.1),lwd=3,las=1,main=title)
  # Add some statistics to the plot
  text(1,0.25,labels=paste("Npres = ",sum(obs==1),sep=""),adj=1)
  text(1,0.20,labels=paste("Nabs = ",sum(obs==0),sep=""),adj=1)
  text(1,0.15,labels=paste("AUC = ",round(ROC_auc@y.values[[1]],digits=2),sep=""),adj=1)
  text(1,0.10,labels=paste("Sens = ",round(mean(as.data.frame(ROC_sens@y.values)[,1]),digits=2),sep=""),adj=1)
  text(1,0.05,labels=paste("Spec = ",round(mean(as.data.frame(ROC_sens@x.values)[,1]),digits=2),sep=""),adj=1)
}

# Run the function
fun.aucplot(predictions, observations, "My AUC Plot")

Et viola! A lovely and colourful plot of a likely-deficient validation statistic that no doubt will be taken as gospel when you present your models.

AUC_Plot

At some point, I would like to dissect the plot function a bit to rebuild it in ggplot and make it, well, better. But that’s a task for another day and another pot of coffee. (But please let me know if you have already done this.)

Notes:

  • The Joy of Data has a nice post on AUC plotting here (reposted by R-bloggers here, if you prefer their layout).

UPDATE (Feb. 5, 2016):

So I recently updated the plotting function to use ggplot, because all the cool kids are doing that these days.

# Install the ggplot2 package
install.packages('ggplot2')
library('ggplot2')

# New plot window (Windows OS)
graphics.off(); x11(h=6,w=6)
# AUC ggplot function
fun.auc.ggplot <- function(pred, obs, title){
 
  # pred = predicted values
  # obs = observed values (truth)
  # title = plot title

  # Run the AUC calculations
  ROC_perf <- performance(prediction(pred,obs),"tpr","fpr")
  ROC_sens <- performance(prediction(pred,obs),"sens","spec")
  ROC_auc <- performance(prediction(pred,obs),"auc")
 
  # Make plot data
  plotdat <- data.frame(FP=ROC_perf@x.values[[1]],TP=ROC_perf@y.values[[1]],CUT=ROC_perf@alpha.values[[1]],POINT=NA)
  plotdat[unlist(lapply(seq(0,1,0.1),function(x){which.min(abs(plotdat$CUT-x))})),"POINT"] <- seq(0,1,0.1)
  
# Plot the curve
  ggplot(plotdat, aes(x=FP,y=TP)) + 
  geom_abline(intercept=0,slope=1) +
  geom_line(lwd=1) + 
  geom_point(data=plotdat[!is.na(plotdat$POINT),], aes(x=FP,y=TP,fill=POINT), pch=21, size=3) +
  geom_text(data=plotdat[!is.na(plotdat$POINT),], aes(x=FP,y=TP,fill=POINT), label=seq(1,0,-0.1), hjust=1, vjust=0) +
  scale_fill_gradientn("Threhsold Cutoff",colours=rainbow(14)[1:11]) +
  scale_x_continuous("False Positive Rate", limits=c(0,1)) +
  scale_y_continuous("True Positive Rate", limits=c(0,1)) +
  geom_polygon(aes(x=X,y=Y), data=data.frame(X=c(0.7,1,1,0.7),Y=c(0,0,0.29,0.29)), fill="white") +
  annotate("text",x=0.97,y=0.25,label=paste("Npres = ",sum(obs==1),sep=""),hjust=1) +
  annotate("text",x=0.97,y=0.20,label=paste("Nabs = ",sum(obs==0),sep=""),hjust=1) +
  annotate("text",x=0.97,y=0.15,label=paste("AUC = ",round(ROC_auc@y.values[[1]],digits=2),sep=""),hjust=1) +
  annotate("text",x=0.97,y=0.10,label=paste("Sens = ",round(mean(as.data.frame(ROC_sens@y.values)[,1]),digits=2),sep=""),hjust=1) +
  annotate("text",x=0.97,y=0.05,label=paste("Spec = ",round(mean(as.data.frame(ROC_sens@x.values)[,1]),digits=2),sep=""),hjust=1) +
  theme(legend.position="none", plot.title=element_text(vjust=2)) +
  ggtitle(title)
}

# Run the function
fun.auc.ggplot(predictions, observations, "My AUC Plot")

It will give you something like this. You can clean it up as you like.

AUC_ggplot


UPDATE (Oct. 26, 2016):

To colour the line in ggplot (akin to the ROCR output), you can simply add a colour call to the aesthetics of the first ggplot line, such as to colour the plot by True Positives:

ggplot(plotdat, aes(x=FP,y=TP,col=TP)) + 

You can then define the colour gradient as you like:

scale_colour_gradientn("",colours=rainbow(14)[1:11]) +

Note that you also have to add the inherits.aes=F option to the geom_polygon() line, otherwise it also searches for a TP column (or whatever you colour by) in the data frame. Also, you may want to add col=”black” to your geom_point() and geom_text() lines, otherwise they’ll also be in the same gradient colours:

geom_point(data=plotdat[!is.na(plotdat$POINT),], aes(x=FP,y=TP,fill=POINT), pch=21, size=3, col="black") +
geom_text(data=plotdat[!is.na(plotdat$POINT),], aes(x=FP,y=TP,fill=POINT), label=seq(1,0,-0.1), hjust=1, vjust=0, col="black") +
geom_polygon(aes(x=X,y=Y), inherit.aes=F, data=data.frame(X=c(0.7,1,1,0.7),Y=c(0,0,0.29,0.29)), fill="white") +

The ROCR package colours by True Presence rate (TP), but you can use whichever variable you consider more important, I suppose.

Advertisements
This entry was posted in R and tagged , , , , , , , , , . Bookmark the permalink.

14 Responses to Quick AUC calculation and plotting function in R

  1. Sam says:

    Hi, looks like you’ve got some coding problems in the
    plotdat[unlist(lapply(seq(0,1,0.1),function(x){which.min(abs(plotdat$CUT-x))})),”POINT seq(0,1,0.1)
    line

    Like

    • roder1 says:

      Thanks Sam, but I couldn’t find an error here (works when I try it, even line by line). Are you using your own data? If you can’t solve it, let me know what the error message says.

      Like

  2. doussau2.adelaide@gmail.com says:

    Hello! I have tried your updated function and I get some errors. I thought I would let you know. Adelaide

    > # AUC ggplot function
    > fun.auc.ggplot <- function(pred, obs, title){
    + # pred = predicted values
    + # obs = observed values (truth)
    + # title = plot title
    +
    + # Run the AUC calculations
    + ROC_perf <- performance(prediction(pred,obs),"tpr","fpr")
    + ROC_sens <- performance(prediction(pred,obs),"sens","spec")
    + ROC_auc <- performance(prediction(pred,obs),"auc")
    +
    + # Spawn a new plot window (Windows OS)
    + graphics.off(); x11(h=6,w=6)
    +
    + # Make plot data
    + plotdat scale_x_continuous(“False Positive Rate”, limits=c(0,1)) +
    + scale_y_continuous(“True Positive Rate”, limits=c(0,1)) +
    + geom_polygon(aes(x=X,y=Y), data=data.frame(X=c(0.7,1,1,0.7),Y=c(0,0,0.29,0.29)), fill=”white”) +
    + annotate(“text”,x=0.97,y=0.25,label=paste(“Npres = “,sum(obs==1),sep=””),hjust=1) +
    + annotate(“text”,x=0.97,y=0.20,label=paste(“Nabs = “,sum(obs==0),sep=””),hjust=1) +
    + annotate(“text”,x=0.97,y=0.15,label=paste(“AUC = “,round(ROC_auc@y.values[[1]],digits=2),sep=””),hjust=1) +
    + annotate(“text”,x=0.97,y=0.10,label=paste(“Sens = “,round(mean(as.data.frame(ROC_sens@y.values)[,1]),digits=2),sep=””),hjust=1) +
    + annotate(“text”,x=0.97,y=0.05,label=paste(“Spec = “,round(mean(as.data.frame(ROC_sens@x.values)[,1]),digits=2),sep=””),hjust=1) +
    + theme(legend.position=”none”, plot.title=element_text(vjust=2)) +
    + ggtitle(title)
    Erreur dans scale_x_continuous(“False Positive Rate”, limits = c(0, 1)) + :
    argument non numérique pour un opérateur binaire
    > }
    Erreur : ‘}’ inattendu(e) dans “}”
    >

    Like

    • roder1 says:

      Thanks Adelaide, but I couldn’t get the code to reproduce this error. I did, however, find another error, which I have now corrected. Let me know if you can’t sort this out. Sorry!

      Like

  3. Kylie says:

    Hi David, I’ve been using your original code from 22 Sept 2015, and it’s been working great. However yesterday suddenly the fun.auc function stopped working. Any ideas?

    > fun.auc(predictions, d$Delichon_urbica)
    Error: could not find function “fun.auc”

    Thanks, Kylie.

    Like

  4. markus says:

    Hi David, thanks for sharing your code!
    But the code for the plots is not working like you posted it. Aside from missing “<- function(…){" some things are missing. I could make the function fun.aucplot work with replacing ROC_perf with some lines from the first function. But I have no idea, how plotdat should be build in the fun.auc.ggplot function, which I'd love to use as well. I'd appreciate a hint.
    Thanks, Markus

    Like

  5. roder1 says:

    Thanks Markus. I’ve been having some issues with WordPress randomly deleting lines of code in this post. I seem to have to replace the code every so often just to keep it correct. No idea what’s going on, maybe reverting to an autosave version or something? Anyway, I’ll have a look and fix it up today if I can.

    Like

    • roder1 says:

      Should be all fixed now. Sorry about that.

      Like

      • markus says:

        Almost working… 😉
        I still had to add
        “] <-
        near the end of the second line where you build plotdat.
        Additionally, I changed some details, because my data does not allow for 10 cut-offs, which resulted in errors.
        I added
        max_threshold <- max(plotdat[!is.na(plotdat$POINT), 'POINT'])
        min_threshold <- min(plotdat[!is.na(plotdat$POINT), 'POINT'])
        and changed

        scale_colour_gradientn("",colours=rainbow(14)[1:sum(!is.na(plotdat$POINT))]) +
        geom_text(data=plotdat[!is.na(plotdat$POINT),],
        aes(x=FP,y=TP),
        label=seq(max_threshold, min_threshold,-0.1),
        hjust=1, vjust=0, col="black")
        Now, it's working for me. Thanks again.

        Like

      • Stephanie says:

        Hi David,
        I also would like to thank you for sharing your code!

        I understand you are having difficulties with WordPress deleting your code.

        If possible, please reply to this message with the appropriate line of code for the second line of plotdat, which is supposed to supply the cut-off points. It would be much appreciated.

        Like

  6. roder1 says:

    Thanks Stephanie. I’ve updated the page AGAIN. But here’s the entire code in case it still doesn’t work. (Time for a nasty email to WordPress.)

    # Build some example data
    # Observed (truth) data as presence-absence (1-0)
    observations <- c(rep(0,250),rep(1,250))
    # predicted data as values ranging from 0 to 1
    predictions <- abs(c(rnorm(250,mean=0),rnorm(250,mean=1)))
    predictions <- round((predictions-min(predictions))/max(predictions),3)

    # Install the ROCR package
    install.packages('ROCR')
    library('ROCR')

    ### AUC function
    fun.auc <- function(pred,obs){
    # Run the ROCR functions for AUC calculation
    ROC_perf <- performance(prediction(pred,obs),"tpr","fpr")
    ROC_sens <- performance(prediction(pred,obs),"sens","spec")
    ROC_err <- performance(prediction(pred, labels=obs),"err")
    ROC_auc <- performance(prediction(pred,obs),"auc")
    # AUC value
    AUC <- ROC_auc@y.values[[1]] # AUC
    # Mean sensitivity across all cutoffs
    x.Sens <- mean(as.data.frame(ROC_sens@y.values)[,1])
    # Mean specificity across all cutoffs
    x.Spec <- mean(as.data.frame(ROC_sens@x.values)[,1])
    # Sens-Spec table to estimate threshold cutoffs
    SS <- data.frame(SENS=as.data.frame(ROC_sens@y.values)[,1],SPEC=as.data.frame(ROC_sens@x.values)[,1])
    # Threshold cutoff with min difference between Sens and Spec
    SS_min_dif <- ROC_perf@alpha.values[[1]][which.min(abs(SS$SENS-SS$SPEC))]
    # Threshold cutoff with max sum of Sens and Spec
    SS_max_sum <- ROC_perf@alpha.values[[1]][which.max(rowSums(SS[c("SENS","SPEC")]))]
    # Min error rate
    Min_Err <- min(ROC_err@y.values[[1]])
    # Threshold cutoff resulting in min error rate
    Min_Err_Cut <- ROC_err@x.values[[1]][which(ROC_err@y.values[[1]]==Min_Err)][1]
    # Kick out the values
    round(cbind(AUC,x.Sens,x.Spec,SS_min_dif,SS_max_sum,Min_Err,Min_Err_Cut),3)
    }

    # Run the function with the example data
    fun.auc(predictions, observations)

    ### AUC plot function
    fun.aucplot <- function(pred, obs, title){
    # Run the AUC calculations
    ROC_perf <- performance(prediction(pred,obs),"tpr","fpr")
    ROC_sens <- performance(prediction(pred,obs),"sens","spec")
    ROC_auc <- performance(prediction(pred,obs),"auc")
    # Spawn a new plot window (Windows OS)
    graphics.off(); x11(h=6,w=6)
    # Plot the curve
    plot(ROC_perf,colorize=T,print.cutoffs.at=seq(0,1,by=0.1),lwd=3,las=1,main=title)
    # Add some statistics to the plot
    text(1,0.25,labels=paste("Npres = ",sum(obs==1),sep=""),adj=1)
    text(1,0.20,labels=paste("Nabs = ",sum(obs==0),sep=""),adj=1)
    text(1,0.15,labels=paste("AUC = ",round(ROC_auc@y.values[[1]],digits=2),sep=""),adj=1)
    text(1,0.10,labels=paste("Sens = ",round(mean(as.data.frame(ROC_sens@y.values)[,1]),digits=2),sep=""),adj=1)
    text(1,0.05,labels=paste("Spec = ",round(mean(as.data.frame(ROC_sens@x.values)[,1]),digits=2),sep=""),adj=1)
    }

    # Run the function
    fun.aucplot(predictions, observations, "My AUC Plot")

    ### AUC ggplot function
    # install.packages('ggplot2')
    library('ggplot2')
    library('grid')

    # Spawn a new plot window (Windows OS)
    graphics.off(); x11(h=6,w=6)

    fun.auc.ggplot <- function(pred, obs, title){

    # pred = predicted values
    # obs = observed values (truth)
    # title = plot title

    # Run the AUC calculations
    ROC_perf <- performance(prediction(pred,obs),"tpr","fpr")
    ROC_sens <- performance(prediction(pred,obs),"sens","spec")
    ROC_auc <- performance(prediction(pred,obs),"auc")

    # Make plot data
    plotdat <- data.frame(FP=ROC_perf@x.values[[1]],TP=ROC_perf@y.values[[1]],CUT=ROC_perf@alpha.values[[1]],POINT=NA)
    plotdat[unlist(lapply(seq(0,1,0.1),function(x){which.min(abs(plotdat$CUT-x))})),"POINT"] <- seq(0,1,0.1)

    # Plot the curve
    ggplot(plotdat, aes(x=FP,y=TP)) +
    geom_abline(intercept=0,slope=1) +
    geom_line(lwd=1) +
    geom_point(data=plotdat[!is.na(plotdat$POINT),], aes(x=FP,y=TP,fill=POINT), pch=21, size=3) +
    geom_text(data=plotdat[!is.na(plotdat$POINT),], aes(x=FP,y=TP,fill=POINT), label=seq(1,0,-0.1), hjust=1, vjust=0) +
    scale_fill_gradientn("Threhsold Cutoff",colours=rainbow(14)[1:11]) +
    scale_x_continuous("False Positive Rate", limits=c(0,1)) +
    scale_y_continuous("True Positive Rate", limits=c(0,1)) +
    geom_polygon(aes(x=X,y=Y), data=data.frame(X=c(0.7,1,1,0.7),Y=c(0,0,0.29,0.29)), fill="white") +
    annotate("text",x=0.97,y=0.25,label=paste("Npres = ",sum(obs==1),sep=""),hjust=1) +
    annotate("text",x=0.97,y=0.20,label=paste("Nabs = ",sum(obs==0),sep=""),hjust=1) +
    annotate("text",x=0.97,y=0.15,label=paste("AUC = ",round(ROC_auc@y.values[[1]],digits=2),sep=""),hjust=1) +
    annotate("text",x=0.97,y=0.10,label=paste("Sens = ",round(mean(as.data.frame(ROC_sens@y.values)[,1]),digits=2),sep=""),hjust=1) +
    annotate("text",x=0.97,y=0.05,label=paste("Spec = ",round(mean(as.data.frame(ROC_sens@x.values)[,1]),digits=2),sep=""),hjust=1) +
    theme(legend.position="none", plot.title=element_text(vjust=2)) +
    ggtitle(title)
    }

    fun.auc.ggplot(predictions, observations, "My AUC Plot")

    Like

    • Stephanie says:

      Thank you for the above! One question relating to: plotdat[unlist(lapply(seq(0,1,0.1),function(x){which.min(abs(plotdat$CUT-x))})),”POINT seq(0,1,0.1)

      Is there supposed to be another set of quotes and bracket? If so, and where?

      It appears my program is failing at this point.

      Like

  7. roder1 says:

    Ugh, it’s not your program that is failing Stephanie. It’s the WordPress block quote formatting for the code. >:-| Oddly, when I look at the page and copy/paste from it, everything works fine, but many people have commented that characters are missing when they do it.

    So, a new approach. I’m adding a link at the top of the page to an external file with the *working* code in it. If you have any other issues, please comment again or email me. Sorry for the hassle.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s