Re-scaling tricks in R

Not rocket science. Just parking this here for easy future access.

R has the very useful scale() command for scaling vectors/matrices. It takes a group of numbers, re-centres the mean to 0 and rescales the standard deviation to 1. Essentially it’s a z-score conversion. But I often want a tool that simply redistributes the values over the range 0 to 1. I also need a quick way to scale a second data set based on the scalings of the first.
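As a quick check of the z-score claim, here is a minimal sketch comparing scale() against the textbook (x - mean) / sd calculation. The vector x and the seed are made up for illustration.

```r
# Made-up data, purely for illustration
set.seed(1)
x <- rnorm(20) * 3 + 10

# scale() returns a one-column matrix; c() drops it back to a plain vector
z_scale  <- c(scale(x))

# The same thing by hand: subtract the mean, divide by the SD
z_manual <- (x - mean(x)) / sd(x)

# Compare the two results (up to floating-point tolerance)
all.equal(z_scale, z_manual)
```

The c() wrapper matters mostly when you store the result in a data frame column, since it strips the matrix dimensions and the attributes that scale() attaches.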

(PART 1)  The first task, straight re-scaling between 0 and 1, can be completed easily with this function (which came mostly out of this thread on StackOverflow):

# Redistribution function
# (range01 is a stand-in name; call it whatever you like)
range01 <- function(x){(x-min(x))/diff(range(x))}

# Sample data
foo <- data.frame(VAR1=rnorm(5), VAR2=rnorm(5)*2+50)
foo
         VAR1     VAR2
1  -0.1629179 47.32685
2  -0.1142152 50.40980
3  -0.4446594 50.07057
4   0.2569592 49.12217
5  -1.1001371 50.80081

# Used on a single vector
range01(foo$VAR1)
[1] 0.6906063 0.7264937 0.4830002 1.0000000 0.0000000

# Used with apply (across columns)
apply(foo, 2, range01)
          VAR1      VAR2
[1,] 0.6906063 0.0000000
[2,] 0.7264937 0.8874443
[3,] 0.4830002 0.7897972
[4,] 1.0000000 0.5167949
[5,] 0.0000000 1.0000000
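Two small wrinkles with the approach above: apply() hands back a matrix rather than a data frame, and min()/range() return NA as soon as a column has a missing value. The sketch below is one way around both, not part of the original recipe. The helper name rescale01 and the data frame bar are hypothetical, invented for this example.

```r
# Hypothetical helper: same 0-1 formula, but ignoring NAs when finding the range
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / diff(rng)
}

# Made-up data with one missing value
bar <- data.frame(A = c(2, 4, NA, 10), B = c(50, 47, 53, 51))

# Assigning into bar_scaled[] keeps the data frame structure and column names
bar_scaled <- bar
bar_scaled[] <- lapply(bar, rescale01)
bar_scaled
```

A column where every value is identical still divides by zero and comes back as NaN, so that case needs its own handling if it can occur in your data.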

(PART 2)  The second thing I need to do now and then is scale a second set of values based on the scaling rules of a first. For this, you can use the scale() command on the first data, then use a linear model with lm() to apply the same scalings to a second data set.

# Sample data with different means and SDs
foo1 <- data.frame(VAR=rnorm(100)*10+50)
foo2 <- data.frame(VAR=rnorm(100)*12+65)

# Scaled data for foo1
foo1$SCALED <- c(scale(foo1$VAR))
summary(foo1)
          VAR              SCALED 
Min.   :25.93    Min.   :-2.18614 
1st Qu.:43.46    1st Qu.:-0.56339 
Median :49.31    Median :-0.02175 
Mean   :49.55    Mean   : 0.00000 
3rd Qu.:57.00    3rd Qu.: 0.68975 
Max.   :78.10    Max.   : 2.64345

So the first data set has its mean centred on 0 and its SD scaled to 1. The second data set should have the same adjustments applied, but its scaled mean and SD won’t be exactly 0 and 1, because it comes from a different distribution in the first place. To apply the same scaling to foo2 as in foo1, we can fit a quick linear model to foo1 and use the coefficients to define foo2$SCALED.

# Make the linear model
lm.scale <- lm(foo1$SCALED~foo1$VAR)
lm.scale

Call:
lm(formula = foo1$SCALED ~ foo1$VAR)

Coefficients:
(Intercept)     foo1$VAR 
   -4.58653      0.09257

# Scale foo2 based on the coefficients from the model
foo2$SCALED <- foo2$VAR*lm.scale$coefficients[2]+lm.scale$coefficients[1]
summary(foo2)
          VAR            SCALED 
Min.   :38.15   Min.   :-1.0545 
1st Qu.:58.15   1st Qu.: 0.7965 
Median :63.79   Median : 1.3182 
Mean   :63.97   Mean   : 1.3357 
3rd Qu.:70.89   3rd Qu.: 1.9761 
Max.   :88.19   Max.   : 3.5771 
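The same result can also be reached without a model, because scale() stores the centre and scale it used as the attributes "scaled:center" and "scaled:scale" on its return value. The sketch below is an alternative to the lm() route, not the method used above. The names a1, a2, s1, ctr and scl, and the seed, are made up for the example.

```r
# Made-up data with different means and SDs
set.seed(42)
a1 <- data.frame(VAR = rnorm(100) * 10 + 50)
a2 <- data.frame(VAR = rnorm(100) * 12 + 65)

# Scale the first set, keeping the matrix so its attributes survive
s1  <- scale(a1$VAR)
ctr <- attr(s1, "scaled:center")  # mean of a1$VAR
scl <- attr(s1, "scaled:scale")   # SD of a1$VAR
a1$SCALED <- c(s1)

# Apply the first set's centre and scale to the second set
a2$SCALED <- c(scale(a2$VAR, center = ctr, scale = scl))

# Cross-check against the lm() route
fit   <- lm(SCALED ~ VAR, data = a1)
a2_lm <- unname(coef(fit)[1] + coef(fit)[2] * a2$VAR)
all.equal(a2$SCALED, a2_lm)
```

Both routes encode the same straight line: the slope is 1/SD and the intercept is -mean/SD of the first data set. Pulling the attributes just skips the model fit.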

And so it goes.

Update Feb. 19, 2016:

I recently also found the squish() command from the {scales} package, which can be used to squish values into a range. But it doesn’t do quite what I expected. It clamps any value outside the range to the nearest limit rather than rescaling everything. I found it when looking for a way to assign legend colours to z-values outside the z-limits in a ggplot (if you want a cropped continuous legend, using this command will assign the highest/lowest colour to the out-of-bounds values).
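To make the difference from 0-1 rescaling concrete, here is a small sketch of what squishing does to a vector. The z values are made up, and the base-R pmin()/pmax() line is my own stand-in for the clamping behaviour on finite values, not code taken from {scales}.

```r
# Made-up z-values, some of them outside the limits 0 to 1
z <- c(-0.5, 0, 0.25, 1, 1.8)

# Base-R clamp: values below 0 become 0, values above 1 become 1,
# everything in between is left untouched
clamped <- pmin(pmax(z, 0), 1)
clamped

# The same thing via {scales}, if the package is installed
if (requireNamespace("scales", quietly = TRUE)) {
  print(scales::squish(z, range = c(0, 1)))
}
```

In ggplot2 the function is usually passed as the oob argument of a continuous scale, for example scale_fill_gradient(limits = c(0, 1), oob = scales::squish), which is what gives the out-of-bounds values the end colours instead of the NA colour.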


