Spatial leave-one-out (SLOO) cross-validation

There are a variety of cross-validation (CV) methods to deal with things like spatial autocorrelation, including the spatial leave-one-out (SLOO) approach. This is essentially a normal leave-one-out method where the area around the withheld point is buffered in space to remove effects of residual autocorrelation (RAC).
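As a minimal sketch of the buffering idea (with made-up coordinates, not the data used below): withhold one point, then drop every observation within a buffer radius of it before training.

```r
# Toy illustration of spatial buffering (hypothetical coordinates)
pts <- data.frame(x = c(0, 1, 2, 10), y = c(0, 0, 0, 0))
test.i <- 1   # withhold the first point
rad <- 3      # buffer radius

# Euclidean distance from the withheld point to every observation
d <- sqrt((pts$x - pts$x[test.i])^2 + (pts$y - pts$y[test.i])^2)

# Training set: everything outside the buffer
# (the test point itself has d = 0, so it is removed too)
train <- pts[d > rad, ]
nrow(train)  # only the point at x = 10 survives
```

The model is then fit on `train` and scored on the single withheld point, and this repeats for each test point.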

Leave-one-out CV goes back a ways, and the spatially buffered version of SLOO has been around for a few years itself (Bahn 2009, Le Rest et al. 2014). It’s a method that we’re finding pretty useful around the lab, so I wrote a quick script to run it.

There are R codes for the SLOO approach provided with Le Rest et al. (2014), but I wanted to take things a bit further and also figure out whether it was necessary to run an exhaustive SLOO, where every point is withheld once, or if a reasonable error estimate could be achieved with a much smaller subset of data (and, if so, how to figure out how large that subset should be).

Spatial leave-one-out script

The first bit of code here provides a function for the SLOO approach. It requires an input of several settings/parameters in the function call. I’ve run the function on a simulated data set, incorporating a linear GLM and a quadratic GLM. It would be no problem to put whatever model structure you want into the function by subbing in the correct model formula and script (for model building and prediction).

First, we’ll create some data on a regular grid. The method works fine with irregular spacing as well, since it uses simple random point selection and a Euclidean distance to define the buffer. The data I create here have no spatial structure (even though dealing with spatial structure is, presumably, the whole point of running the spatially buffered version of the LOO).

# Make some data (it can be as large as you want)
datgrid <- expand.grid(x=1:50,y=1:50)
datrows <- nrow(datgrid)
dat <- data.frame(ID=1:datrows, X_COORD=datgrid$x, Y_COORD=datgrid$y, VAR1=rnorm(datrows), VAR2=rnorm(datrows))
dat$RESP <- with(dat, VAR1+VAR1^2+(VAR2*2)+rnorm(datrows)*3)
head(dat)

That should give us a simulated data set that looks pretty random.

1) Spatial Leave-One-Out Sample Data

Now we’ll build a quick model just to make sure the data are going to work. We want data that give a good but not perfect explanatory model, and for which the quadratic model fits a bit better (since RESP was built with a VAR1^2 term, it should).

# Build the linear and quadratic model formulae
# You'll need these for the function call below
lin.glm.form <- as.formula(RESP~VAR1+VAR2)
quad.glm.form <- as.formula(RESP~VAR1+I(VAR1^2)+VAR2+I(VAR2^2))

# Check out the model fit
lin.mod <- glm(lin.glm.form, data=dat)
summary(lin.mod)

Deviance Residuals: 
     Min        1Q    Median       3Q       Max 
-11.3727   -2.2015   -0.0067   2.1050   14.6573 

Coefficients:
                Estimate   Std. Error   t value   Pr(>|t|) 
(Intercept)      1.04927      0.06508     16.12     <2e-16 ***
VAR1             0.99645      0.06619     15.05     <2e-16 ***
VAR2             2.04084      0.06390     31.94     <2e-16 ***

AIC: 12999

quad.mod <- glm(quad.glm.form, data=dat)
summary(quad.mod)

Deviance Residuals: 
     Min        1Q   Median       3Q       Max 
-11.2597   -2.0790   0.0588   2.0352   10.4712 

Coefficients:
             Estimate Std. Error t value Pr(>|t|) 
(Intercept)   0.12271    0.08442   1.454    0.146 
VAR1          1.03062    0.06087  16.933   <2e-16 ***
I(VAR1^2)     0.97722    0.04552  21.466   <2e-16 ***
VAR2          2.04489    0.05874  34.811   <2e-16 ***
I(VAR2^2)    -0.01850    0.03987  -0.464    0.643 

AIC: 12579

Have a look at the residuals. Not bad.

2) Spatial Leave-One-Out Model Fit

Alright, let’s build the function for the SLOO. All of the notes for the function (e.g. which parameters are required) are in the #commented text within the script.

# Spatially Buffered Leave-One-Out Function
sloo.fun <- function(dat, x_coord, y_coord, resp, ss, rad, modform){
  # dat = complete data frame
  # x_coord (y_coord) = name of the x (y) coordinate column in dat
  # resp = name of the response column in dat
  # ss = sample size to process (number of LOO runs)
  # rad = radius of the spatial buffer around each test point
  # modform = formula for the GLM

  # Select the testing points
  test <- dat[sample(nrow(dat), ss), ]
  # Vector of predictions
  p <- numeric(nrow(test))
  for(i in 1:nrow(test)){
    # Training data (test point and everything within the buffer removed)
    train <- dat[sqrt((dat[,x_coord]-test[i,x_coord])^2 +
                      (dat[,y_coord]-test[i,y_coord])^2) > rad, ]
    # Build the model
    m <- glm(modform, data=train)
    # Predict on the withheld test point
    p[i] <- predict(m, test[i,], type="response")
  }
  # Residuals
  p.res <- test[,resp] - p
  # RMSE
  p.rmse <- sqrt(mean(p.res^2))
  list(SampleRows=as.numeric(rownames(test)), Truth=test[,resp],
       Predictions=p, Residuals=p.res, RMSE=p.rmse)
}

Start by checking the function on a small sample of the data.

# SLOO CV run with subset of the whole data
# Sample Size = 10, Buffer Radius = 5
# Linear GLM
sloo.fun(dat, "X_COORD", "Y_COORD", "RESP", 10, 5, lin.glm.form)
$SampleRows
 [1] 1964 527 45 2036 648 564 655 1731 1636 59
$Truth
 [1]  0.5653736 -0.2235177  5.1335534  1.0022845  2.7925327
 [6] -5.9553933  2.9715118 -5.0665749  0.4152819 -3.2315351
$Predictions
 [1] -0.6918514 -0.2429583  3.1671638 -2.7396959  0.7165788
 [6] -2.3046204  2.3104736 -0.2855816  1.4095483 -0.6574948
$Residuals
 [1]   1.25722502   0.01944063  1.96638959  3.74198039  2.07595396
 [6]  -3.65077294   0.66103820 -4.78099330 -0.99426631 -2.57404025
$RMSE
 [1] 2.607622
# Quadratic GLM
sloo.fun(dat, "X_COORD", "Y_COORD", "RESP", 10, 5, quad.glm.form)

$SampleRows
 [1] 1310 2212 1393 247 2368 174 398 1503 273 2244
$Truth
 [1]  2.85618762  2.02076514 -0.06488073  0.20280548 -3.74737823
 [6] -7.89045572  1.24585702 -2.88988530 -1.71646285  3.80513914
$Predictions
 [1]  0.1063494  0.7004177 -0.6539341 -1.9697705 -0.2487980
 [6] -3.0641812  2.0693811 -3.6280424  1.9594685  1.6304830
$Residuals
 [1]  2.7498382  1.3203475  0.5890534  2.1725760 -3.4985802
 [6] -4.8262745 -0.8235241  0.7381571 -3.6759313  2.1746561
$RMSE
 [1] 2.633812 

OK, a couple of things to note. First, each run will give you a different random selection of points because it’s based on the sample() command, so calling set.seed() once at the top of your script will not keep two separate runs consistent with each other. If you want a consistent selection of points between, say, your two model structures, you can reset the seed immediately before each call, embed both models in the function, or draw the point sample outside the function.
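For example, because sample() is the only random draw involved in selecting test points, resetting the seed immediately before each draw gives identical selections (a quick self-contained sketch; the same trick applies right before each sloo.fun() call):

```r
# Resetting the seed immediately before each draw gives identical test points,
# so two model structures can be scored on exactly the same rows.
n <- 2500; ss <- 10

set.seed(123)
rows.lin <- sample(n, ss)    # rows used for, say, the linear model run

set.seed(123)
rows.quad <- sample(n, ss)   # rows used for the quadratic model run

identical(rows.lin, rows.quad)  # TRUE
```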

Second, the RMSE is pretty high and, more importantly, varies greatly if we run this over and over again. That’s because we’re only using 10 random points, which isn’t much out of a data set of 2,500 (0.4%). We can solve both issues above by simply making the sample size equal to the number of rows in the data (exhaustive LOO). But this may be computationally intensive, especially with larger models, larger data, or other methods like Random Forest.

The RMSE will most likely stabilise somewhere close to the true value (i.e. the value you’d get from an exhaustive SLOO) at a sample size much smaller than your entire data set. Hopefully. The trick, of course, is figuring out when this happens and how good the estimated value is. The script below takes all the computed points in your SLOO and uses them cumulatively to plot the convergence of RMSE with increasing sample size.

# TEST how many sample points required for stable RMSE estimates
# Runs pretty fast, so we'll compute it for the whole data set
max.size <- nrow(dat)
buffer <- 10
# Linear GLM
ss.test.lin <- data.frame(N=1:max.size)
test.out.lin <- sloo.fun(dat, "X_COORD", "Y_COORD", "RESP", max.size, buffer, lin.glm.form)
ss.test.lin[c("Truth","Pred","Res")] <- with(test.out.lin, cbind(Truth,Predictions,Residuals))
# Quadratic GLM
ss.test.quad <- data.frame(N=1:max.size)
test.out.quad <- sloo.fun(dat, "X_COORD", "Y_COORD", "RESP", max.size, buffer, quad.glm.form)
ss.test.quad[c("Truth","Pred","Res")] <- with(test.out.quad, cbind(Truth,Predictions,Residuals))
# Add the cumulative RMSE by simply calculating RMSE in sequence
# Add the random RMSE values by iteratively sampling from the data
for(i in 1:max.size){
  # Random RMSE
  # Samples i points out of all the data and calculates RMSE
  # Indication of the potential uncertainty in RMSE with a sample size i
  ss.test.lin[i,"Ran_RMSE"] <- with(ss.test.lin[sample(1:max.size,i),], sqrt(mean((Truth-Pred)^2)))
  ss.test.quad[i,"Ran_RMSE"] <- with(ss.test.quad[sample(1:max.size,i),], sqrt(mean((Truth-Pred)^2)))
  # Cumulative RMSE
  # Gives RMSE of all points, had you only run i points
  ss.test.lin[i,"Cum_RMSE"] <- sqrt(mean((ss.test.lin$Res[1:i])^2))
  ss.test.quad[i,"Cum_RMSE"] <- sqrt(mean((ss.test.quad$Res[1:i])^2))
}
# Note that both RMSEs start out unstable and converge to the same value
# The same happens with the quadratic model (not shown)
head(ss.test.lin); tail(ss.test.lin)

N        Truth       Pred       Res  Cum_RMSE  Ran_RMSE
1     4.786860  2.6741284  2.112732  2.112732  5.964937
2    -1.896122 -0.7114234 -1.184699  1.712768  1.384025
3     7.896991  4.2167499  3.680241  2.543706  3.454678
4    -3.972843 -0.7861822 -3.186661  2.718737  4.467720
5    -5.575135 -2.0295995 -3.545535  2.902997  4.063304
6     5.398863  2.2979080  3.100955  2.936917  1.031987
   N     Truth       Pred       Res  Cum_RMSE  Ran_RMSE
2495  1.092225  0.3802105  0.712014  3.352516  3.354941
2496 -0.437340  1.4749980 -1.912338  3.352063  3.354735
2497 -2.625840  0.0248902 -2.650730  3.351812  3.351536
2498 -2.980456 -0.1654503 -2.815005  3.351614  3.351941
2499  5.256142 -0.8757664  6.131908  3.353188  3.353530
2500  7.894435  5.2038651  2.690570  3.352949  3.352949

We can plot the convergence between the cumulative RMSE and the randomly sampled RMSE to see how they compare across various sample sizes.

3a) Spatial Leave-One-Out cumulative RMSE plot by sample size

The points are the RMSE at each iterative sample size (n): each point is the RMSE from a random sample of n points (Ran_RMSE in the code), giving an indication of the variability in RMSE to expect at that sample size. The lines are the RMSE calculated from all the points up to and including n (Cum_RMSE in the code), showing the convergence on the exhaustive RMSE (the last value of either Ran_RMSE or Cum_RMSE).

In multiple runs of this code, I’ve found that the convergence onto the exhaustive RMSE can vary somewhat, given the data we produced. In the end, the only way to guarantee the exhaustive RMSE is to run the SLOO exhaustively. However, if your tolerance for uncertainty in the RMSE is a bit higher, running a smaller subset might do. And you can always look at the variability in the RMSE at each sample size (i.e. the width of the pink and blue point clouds) to see how much uncertainty there is.
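The post doesn’t include the plotting code for the figure above; a rough base-graphics sketch might look like the following. (To keep the snippet self-contained, the ss.test.lin and ss.test.quad data frames are mocked here with simulated residuals; in the real script you’d plot those data frames directly.)

```r
# Mock stand-ins for ss.test.lin / ss.test.quad (assumption: in the real
# script, use the data frames built above instead of mock())
set.seed(1)
n <- 500
mock <- function() {
  res <- rnorm(n, sd = 3)
  data.frame(N = 1:n,
             Ran_RMSE = sapply(1:n, function(i) sqrt(mean(sample(res, i)^2))),
             Cum_RMSE = sapply(1:n, function(i) sqrt(mean(res[1:i]^2))))
}
ss.lin <- mock(); ss.quad <- mock()

# Points: RMSE of a random sample of size N; lines: cumulative RMSE
plot(Ran_RMSE ~ N, data = ss.lin, col = "pink", pch = 16, cex = 0.5,
     xlab = "Sample size (n)", ylab = "RMSE")
points(Ran_RMSE ~ N, data = ss.quad, col = "lightblue", pch = 16, cex = 0.5)
lines(Cum_RMSE ~ N, data = ss.lin, col = "red", lwd = 2)
lines(Cum_RMSE ~ N, data = ss.quad, col = "blue", lwd = 2)
legend("topright", c("Linear", "Quadratic"), col = c("red", "blue"), lwd = 2)
```

At n equal to the full data size, the random-sample RMSE is just a permutation of all residuals, so the points and lines necessarily meet at the exhaustive value.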

References

Bahn V. (2009), A new method for evaluating species distribution models. 94th Ecological Society of America Annual Meeting, August 2-7, 2009, Albuquerque, NM, USA.

Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J. and Bretagnolle, V. (2014), Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Global Ecology and Biogeography, 23: 811–820. doi: 10.1111/geb.12161
