There are a variety of cross-validation (CV) methods to deal with things like spatial autocorrelation, including the spatial leave-one-out (SLOO) approach. This is essentially a normal leave-one-out method where the area around the withheld point is buffered in space to remove effects of residual autocorrelation (RAC).

Leave-one-out CV goes back a ways, and the spatially buffered version of SLOO has been around for a few years itself (Bahn 2009, Le Rest et al. 2014). It’s a method that we’re finding pretty useful around the lab, so I wrote a quick script to run it.

R code for the SLOO approach is provided with Le Rest et al. (2014), but I wanted to take things a bit further and also figure out whether it's necessary to run an exhaustive SLOO, where every point is withheld once, or whether a reasonable error estimate can be achieved with a much smaller subset of the data (and, if so, how to decide how large that subset should be).

**Spatial leave-one-out script**

The first bit of code here provides a function for the SLOO approach. It takes several settings/parameters in the function call. I've run the function on a simulated data set with both a linear and a quadratic GLM. It would be no problem to use whatever model structure you want by subbing in the appropriate model formula and code (for model building and prediction).

First, we'll create some data on a regular grid. The method works fine with irregular spacing as well, since it uses a simple random point selection and then a Euclidean distance to buffer. The data created here have no spatial structure (even though spatial structure would, presumably, be the point of running the spatially buffered version of LOO).

```r
library('plyr')

# Make some data (it can be as large as you want)
datgrid <- expand.grid(x=1:50, y=1:50)
datrows <- nrow(datgrid)
dat <- data.frame(ID=1:datrows,
                  X_COORD=datgrid$x, Y_COORD=datgrid$y,
                  VAR1=rnorm(datrows), VAR2=rnorm(datrows))
dat$RESP <- with(dat, VAR1 + VAR1^2 + (VAR2*2) + rnorm(datrows)*3)
head(dat)
```

That should give us a simulated data set that looks pretty random.

Now we'll build a quick model just to make sure we have data that's going to work. We want data that give a good but not perfect explanatory model, and for which the quadratic model fits a bit better (which it should, since we built a VAR1^2 term into the response).

```r
# Build the linear and quadratic model formulae
# You'll need these for the function call below
lin.glm.form <- as.formula(RESP ~ VAR1 + VAR2)
quad.glm.form <- as.formula(RESP ~ VAR1 + I(VAR1^2) + VAR2 + I(VAR2^2))

# Check out the model fit
lin.mod <- glm(lin.glm.form, data=dat)
summary(lin.mod)
```

```
Deviance Residuals:
     Min       1Q   Median       3Q      Max
-11.3727  -2.2015  -0.0067   2.1050  14.6573

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.04927    0.06508   16.12   <2e-16 ***
VAR1         0.99645    0.06619   15.05   <2e-16 ***
VAR2         2.04084    0.06390   31.94   <2e-16 ***

AIC: 12999
```

```r
quad.mod <- glm(quad.glm.form, data=dat)
summary(quad.mod)
```

```
Deviance Residuals:
     Min       1Q   Median       3Q      Max
-11.2597  -2.0790   0.0588   2.0352  10.4712

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.12271    0.08442   1.454    0.146
VAR1         1.03062    0.06087  16.933   <2e-16 ***
I(VAR1^2)    0.97722    0.04552  21.466   <2e-16 ***
VAR2         2.04489    0.05874  34.811   <2e-16 ***
I(VAR2^2)   -0.01850    0.03987  -0.464    0.643

AIC: 12579
```

Have a look at the residuals. Not bad.
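Since leftover spatial structure in the residuals is the whole motivation for buffering, it's worth eyeballing them in space as well. Here's a small helper for doing that (a sketch, not part of the original script; `map.residuals` is my own name):

```r
# Sketch (not in the original script): map GLM residuals back onto the
# grid to check for leftover spatial structure; visible light/dark
# patches would suggest residual spatial autocorrelation. Assumes
# coordinate columns named like those in `dat` above.
map.residuals <- function(mod, dat, x_coord="X_COORD", y_coord="Y_COORD"){
  r <- resid(mod)
  # Rescale residuals to [0, 1] for a grey-scale colour ramp
  g <- (r - min(r)) / (max(r) - min(r))
  plot(dat[, x_coord], dat[, y_coord], pch=15, cex=0.8,
       col=grey(g), xlab="X", ylab="Y",
       main="Residuals in space")
  invisible(r)
}

# e.g. map.residuals(lin.mod, dat)
```

With the simulated data here the map should look like featureless noise, since we built in no spatial structure.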

Alright, let's build the function for the SLOO. All of the notes for the function (e.g. what parameters are required) are listed as commented text within the script.

```r
# Spatially buffered leave-one-out function
sloo.fun <- function(dat, x_coord, y_coord, resp, ss, rad, modform){
  # dat     = complete data frame
  # x_coord (y_coord) = name of the x (y) coordinate column in dat
  # resp    = name of the response column in dat
  # ss      = sample size to process (number of LOO runs)
  # rad     = radius of the spatial buffer around each test point
  # modform = formula for the GLM

  # Select the testing points
  test <- dat[sample(nrow(dat), ss), ]

  # Vector of predictions
  p <- c()
  for(i in 1:nrow(test)){
    # Training data (test point & buffer removed)
    train <- dat[sqrt((dat[, x_coord] - test[i, x_coord])^2 +
                      (dat[, y_coord] - test[i, y_coord])^2) > rad, ]
    # Build the model
    m <- glm(modform, data=train)
    # Predict on the test point
    p <- c(p, predict(m, test[i, ], type="response"))
  }

  # Residuals
  p.res <- test[, resp] - p
  # RMSE
  p.rmse <- sqrt(mean(p.res^2))

  list(SampleRows=as.numeric(rownames(test)),
       Truth=test[, resp], Predictions=p,
       Residuals=p.res, RMSE=p.rmse)
}
```

Start by checking the function on a small sample of the data.

```r
# SLOO CV run with subset of the whole data
# Sample Size = 10, Buffer Radius = 5
# Linear GLM
sloo.fun(dat, "X_COORD", "Y_COORD", "RESP", 10, 5, lin.glm.form)
```

```
$SampleRows
 [1] 1964  527   45 2036  648  564  655 1731 1636   59

$Truth
 [1]  0.5653736 -0.2235177  5.1335534  1.0022845  2.7925327
 [6] -5.9553933  2.9715118 -5.0665749  0.4152819 -3.2315351

$Predictions
 [1] -0.6918514 -0.2429583  3.1671638 -2.7396959  0.7165788
 [6] -2.3046204  2.3104736 -0.2855816  1.4095483 -0.6574948

$Residuals
 [1]  1.25722502  0.01944063  1.96638959  3.74198039  2.07595396
 [6] -3.65077294  0.66103820 -4.78099330 -0.99426631 -2.57404025

$RMSE
[1] 2.607622
```

```r
# Quadratic GLM
sloo.fun(dat, "X_COORD", "Y_COORD", "RESP", 10, 5, quad.glm.form)
```

```
$SampleRows
 [1] 1310 2212 1393  247 2368  174  398 1503  273 2244

$Truth
 [1]  2.85618762  2.02076514 -0.06488073  0.20280548 -3.74737823
 [6] -7.89045572  1.24585702 -2.88988530 -1.71646285  3.80513914

$Predictions
 [1]  0.1063494  0.7004177 -0.6539341 -1.9697705 -0.2487980
 [6] -3.0641812  2.0693811 -3.6280424  1.9594685  1.6304830

$Residuals
 [1]  2.7498382  1.3203475  0.5890534  2.1725760 -3.4985802
 [6] -4.8262745 -0.8235241  0.7381571 -3.6759313  2.1746561

$RMSE
[1] 2.633812
```

OK, a couple of things to note. First, each run will give you a different random selection of points, because the sample is drawn with sample() inside the function. A single set.seed() at the top of your script won't make two calls consistent, because the second call picks up the random number stream where the first left off. If you want a consistent selection of points between, say, your two model structures, you can embed both models in the function, re-seed immediately before each call, or place the point sample outside the function.
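That last option could look something like this (a sketch, not part of the original script; `sloo.fun.fixed` is my own name for a hypothetical variant of sloo.fun that takes the test-row indices directly instead of a sample size):

```r
# Sketch: draw the test rows once, outside the function, so both model
# structures are scored on exactly the same points. sloo.fun.fixed is a
# hypothetical variant of sloo.fun taking row indices instead of ss.
sloo.fun.fixed <- function(dat, x_coord, y_coord, resp, rows, rad, modform){
  test <- dat[rows, ]
  p <- c()
  for(i in 1:nrow(test)){
    # Remove the test point and its buffer from the training data
    train <- dat[sqrt((dat[, x_coord] - test[i, x_coord])^2 +
                      (dat[, y_coord] - test[i, y_coord])^2) > rad, ]
    m <- glm(modform, data=train)
    p <- c(p, predict(m, test[i, ], type="response"))
  }
  p.res <- test[, resp] - p
  list(Truth=test[, resp], Predictions=p,
       Residuals=p.res, RMSE=sqrt(mean(p.res^2)))
}

# Same 10 points for both model structures:
# rows <- sample(nrow(dat), 10)
# sloo.fun.fixed(dat, "X_COORD", "Y_COORD", "RESP", rows, 5, lin.glm.form)
# sloo.fun.fixed(dat, "X_COORD", "Y_COORD", "RESP", rows, 5, quad.glm.form)
```

The two calls then return RMSEs that are directly comparable, point for point.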

Second, the RMSE is pretty high and, more importantly, varies greatly if we run this over and over again. That’s because we’re only using 10 random points, which isn’t much out of a data set of 2,500 (0.4%). We can solve both issues above by simply making the sample size equal to the number of rows in the data (exhaustive LOO). But this may be computationally intensive, especially with larger models, larger data, or other methods like Random Forest.

The RMSE will most likely stabilise somewhere close to its true value (i.e. the exhaustive-SLOO value) at a sample size much smaller than your entire data set. Hopefully. The trick, of course, is figuring out when that happens and how good the estimated value is. The script below takes all the computed points in your SLOO and uses them cumulatively to plot the convergence of RMSE with increasing sample size.

```r
# TEST how many sample points are required for stable RMSE estimates
# Runs pretty fast, so we'll compute it for the whole data set
max.size <- nrow(dat)
buffer <- 10

# Linear GLM
ss.test.lin <- data.frame(N=1:max.size)
test.out.lin <- sloo.fun(dat, "X_COORD", "Y_COORD", "RESP",
                         max.size, buffer, lin.glm.form)
ss.test.lin[c("Truth","Pred","Res")] <-
  with(test.out.lin, cbind(Truth, Predictions, Residuals))

# Quadratic GLM
ss.test.quad <- data.frame(N=1:max.size)
test.out.quad <- sloo.fun(dat, "X_COORD", "Y_COORD", "RESP",
                          max.size, buffer, quad.glm.form)
ss.test.quad[c("Truth","Pred","Res")] <-
  with(test.out.quad, cbind(Truth, Predictions, Residuals))
```

```r
# Add the cumulative RMSE by simply calculating RMSE in sequence
# Add the random RMSE values by iteratively sampling from the data
for(i in 1:max.size){
  # Random RMSE
  # Samples i points out of all the data and calculates RMSE
  # Indication of the potential uncertainty in RMSE with a sample size of i
  ss.test.lin[i, "Ran_RMSE"] <- with(ss.test.lin[sample(1:max.size, i), ],
                                     sqrt(mean((Truth-Pred)^2)))
  ss.test.quad[i, "Ran_RMSE"] <- with(ss.test.quad[sample(1:max.size, i), ],
                                      sqrt(mean((Truth-Pred)^2)))

  # Cumulative RMSE
  # Gives RMSE of all points, had you only run i points
  ss.test.lin[i, "Cum_RMSE"] <- sqrt(mean((ss.test.lin$Res[1:i])^2))
  ss.test.quad[i, "Cum_RMSE"] <- sqrt(mean((ss.test.quad$Res[1:i])^2))
}

# Note that both RMSEs start low/high and stabilise to the same value
# The same happens with the quadratic model (not shown)
head(ss.test.lin); tail(ss.test.lin)
```

```
  N     Truth       Pred       Res Cum_RMSE Ran_RMSE
1 1  4.786860  2.6741284  2.112732 2.112732 5.964937
2 2 -1.896122 -0.7114234 -1.184699 1.712768 1.384025
3 3  7.896991  4.2167499  3.680241 2.543706 3.454678
4 4 -3.972843 -0.7861822 -3.186661 2.718737 4.467720
5 5 -5.575135 -2.0295995 -3.545535 2.902997 4.063304
6 6  5.398863  2.2979080  3.100955 2.936917 1.031987

        N     Truth       Pred       Res Cum_RMSE Ran_RMSE
2495 2495  1.092225  0.3802105  0.712014 3.352516 3.354941
2496 2496 -0.437340  1.4749980 -1.912338 3.352063 3.354735
2497 2497 -2.625840  0.0248902 -2.650730 3.351812 3.351536
2498 2498 -2.980456 -0.1654503 -2.815005 3.351614 3.351941
2499 2499  5.256142 -0.8757664  6.131908 3.353188 3.353530
2500 2500  7.894435  5.2038651  2.690570 3.352949 3.352949
```

We can plot the convergence between the cumulative RMSE and the randomly sampled RMSE to see how they compare across various sample sizes.

The points are the RMSE at each iterative sample size (n): each point is the RMSE from a random sample of points of size n (Ran_RMSE in the code), giving an indication of the variability in RMSE to expect at a given sample size. The lines are the calculated RMSE of all the points up to and including n (Cum_RMSE in the code), showing the convergence on the exhaustive RMSE (the last value in either Ran_RMSE or Cum_RMSE).
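For reference, here is a minimal base-R sketch of that plot (my own plotting choices, not the original figure code), assuming data frames with the N, Ran_RMSE and Cum_RMSE columns built above:

```r
# Sketch: convergence plot of random-sample RMSE (points) versus
# cumulative RMSE (lines) for the two model structures
plot.convergence <- function(lin, quad){
  plot(lin$N, lin$Ran_RMSE, pch=16, cex=0.4, col=rgb(1, 0, 0, 0.3),
       xlab="Sample size (n)", ylab="RMSE")
  points(quad$N, quad$Ran_RMSE, pch=16, cex=0.4, col=rgb(0, 0, 1, 0.3))
  lines(lin$N, lin$Cum_RMSE, col="red", lwd=2)
  lines(quad$N, quad$Cum_RMSE, col="blue", lwd=2)
  legend("topright", legend=c("Linear", "Quadratic"),
         col=c("red", "blue"), lwd=2)
}

# e.g. plot.convergence(ss.test.lin, ss.test.quad)
```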

In multiple runs of this code, I've found that the convergence onto the exhaustive RMSE can vary somewhat, given the data we produced. In the end, the only way to ensure you have the exhaustive RMSE is to run the SLOO exhaustively. However, if your tolerance for uncertainty in the RMSE is a bit higher, running a smaller subset of the data might work. And you can always look at the variability in the RMSE at each subsequent sample size (i.e. the width of the pink and blue point clouds) to see how much uncertainty remains.
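To put a number on that width (again, a sketch rather than part of the original script; `rmse.spread` is my own name): at a fixed sample size n, repeatedly resample n of the already-computed SLOO residuals and look at the spread of the resulting RMSEs.

```r
# Sketch: estimate the uncertainty in RMSE at sample size n by
# resampling n of the already-computed SLOO residuals many times
rmse.spread <- function(res, n, reps=1000){
  rmses <- replicate(reps, sqrt(mean(sample(res, n)^2)))
  quantile(rmses, c(0.025, 0.5, 0.975))
}

# e.g. a 95% interval for the RMSE you'd expect from a 100-point SLOO:
# rmse.spread(ss.test.lin$Res, 100)
```

If the interval at your candidate n is narrow enough for your purposes, a subset SLOO of that size is probably defensible.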

**References**

Bahn V. (2009), A new method for evaluating species distribution models. *94th Ecological Society of America Annual Meeting*, August 2-7, 2009, Albuquerque, NM, USA.

Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J. and Bretagnolle, V. (2014), Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. *Global Ecology and Biogeography*, 23: 811–820. doi: 10.1111/geb.12161