Spatially balanced subsampling in R (retaining maximum sample size)

While working on a spatial autocorrelation analysis, I needed an R function to sub-sample spatial data points (i.e. an xyz table, not a raster), retaining only those at least a given distance from one another. But I'm also a bit data-limited, so I wanted to keep as much of the data as possible. I'm sure there are good packages for this, but Google couldn't turn up quite what I was looking for, so I wrote my own inelegant, brute-force version.

Here’s the function:

## Function to buffer points in XY space:
## Returns the original data table with buffered points removed.
## Runs several repetitions, since the random selection order affects how many points survive;
## the repetition that retains the most points is returned.
# 1) Randomly select a single point
# 2) Remove all points within the buffer distance of that point
# 3) Randomly select one of the remaining points
# 4) ...and repeat until no points are left
# foo - a data.frame to select from with columns x, y
# buffer - the minimum distance between output points
# reps - the number of repetitions for the points selection
buffer.f <- function(foo, buffer, reps){
  # Make list of suitable vectors
  suitable <- list()
  for(k in 1:reps){
    # Make the output vector
    outvec <- numeric(0)
    # Make the vector of dropped (buffered-out) point indices
    dropvec <- integer(0)
    for(i in 1:nrow(foo)){
      # Skip the remaining iterations once every point has been buffered out
      if(length(dropvec)<nrow(foo)){
        # Set the rows to sample from
        if(i>1){
          rowsleft <- (1:nrow(foo))[-c(dropvec)]
        } else {
          rowsleft <- 1:nrow(foo)
        }
        # Randomly select point
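        # (note: sample(x, 1) on a length-one numeric x samples from 1:x,
        # so convert to character first to avoid that behaviour)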
        outpoint <- as.numeric(sample(as.character(rowsleft),1))
        outvec[i] <- outpoint
        # Remove points within buffer
        outcoord <- foo[outpoint,c("x","y")]
        dropvec <- c(dropvec, which(sqrt((foo$x-outcoord$x)^2 + (foo$y-outcoord$y)^2)<buffer))
        # Drop duplicate indices from the buffered-out set
        dropvec <- unique(dropvec)
      } 
    } 
    # Populate the suitable points list
    suitable[[k]] <- outvec
  }
  # Pick the repetition that retained the most points
  best <- suitable[[which.max(lengths(suitable))]]
  foo[best,]
}

And here’s a working example:

# Create some fake spatial data in, say, kilometres.
mydat <- data.frame(x=sample(1:10000,1000), y=sample(1:10000,1000))
plot(mydat)

# With 1000 random points over a 10,000 x 10,000 km area, the expected
# nearest-neighbour distance is roughly 160 km.
# We want to remove points so that no two are closer than 500 km,
# but keep as much of the data as we can.

# Run the function on the spatial data with a 500 km buffer and 1000 repetitions
mydat2 <- buffer.f(mydat, 500, 1000)

# Have a look at the buffered data
plot(mydat2)

[Figures: plot of the original data; plot of the buffered data]
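
As a quick sanity check, the smallest pairwise distance among the retained points should be at least the buffer:

# Minimum pairwise distance in the buffered output (should be >= 500)
min(dist(mydat2))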

Given enough repetitions, the iterative process should find a combination that retains a pretty good number of points. Of course, there's no way to guarantee it's the maximum possible number (short of trying every combination), but it's close enough for my purposes.
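
If you don't need the randomised search, here's a sketch of a deterministic one-pass alternative (greedy.buffer is just an illustrative name, not from a package): keep each point only if it is at least the buffer distance from every point already kept. It's much faster, but the result depends on row order and isn't guaranteed to be maximal either.

# Sketch: one-pass greedy filter. Keeps a point only if it is at least
# `buffer` away from all previously kept points.
greedy.buffer <- function(foo, buffer){
  keep <- integer(0)
  for(i in 1:nrow(foo)){
    d <- sqrt((foo$x[keep] - foo$x[i])^2 + (foo$y[keep] - foo$y[i])^2)
    if(all(d >= buffer)) keep <- c(keep, i)   # all() on an empty d is TRUE
  }
  foo[keep, ]
}

# e.g. mydat3 <- greedy.buffer(mydat, 500)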

As usual, if there's an egregious error here, or if there's an easier way (i.e. a good package), please let me know.
