Monday, 10 July 2017

Linear Mixed Effects Models in Agriculture

This post was originally part of my previous post about linear models. However, I later decided to split it into several texts because it was effectively too long and complex to navigate.
If you struggle to follow the code in this page please refer to this post (for example for the necessary packages): Linear Models (lm, ANOVA and ANCOVA) in Agriculture

Linear Mixed-Effects Models

This class of models is used to account for more than one source of random variation. For example, assume we have a dataset where we are trying to model yield as a function of nitrogen levels. However, the data were collected in many different farms. In this case, each farm would need to be consider a cluster and the model would need to take this clustering into account. Another common set of experiments where linear mixed-effects models are used is repeated measures, where time provides an additional source of correlation between measures. For these models we do not need to worry about the assumptions from previous models, since these are very robust against all of them. For example, for unbalanced design with blocking, probably these methods should be used instead of the standard ANOVA.

At the beginning on this tutorial we explored the equation that supports linear model:

This equation can be divided into two components, the fixed and random effects. For fixed effect we refer to those variables we are using to explain the model. These may be factorial (in ANOVA), continuous or a mixed of the two (ANCOVA) and they can also be the blocks used in our design. The other component in the equation is the random effect, which provides a level of uncertainty that it is difficult to account for in the model. For example, when we work with yield we might see differences between plants grown from similar soils and conditions. These may be related to the seeds or to other factors and are part of the within-subject variation that we cannot explain.

There are times however where in the data there are multiple sources of random variation. For example, data may be clustered in separate fields or separate farms. This will provide an additional source of random variation that needs to be taken into account in the model. To do so the standard equation can be amended in the following way:

This is referred to as a random intercept model, where the random variation is split into a cluster specific variation u and the standard error term. Effectively, this model assumes that each cluster will only have an effect on the slope of the linear model. In other word, we assume that data collected at different farm will have the same correlation pattern but will be shifted, see image below (source: Mixed Models):

A more complex form, that is normally used for repeated measures is the random slope and intercept model:

Where we add a new source of random variation v related to time T. In this case we assume that the random variation happends not only by changing the intercept of the linear model, but also its slope. The image below from the same website should again clarify things:

As a general rule we can use plotting to determine if and what random effects to use for modelling our data. In the examples above, a simple xy plot with colour would provide a lot of information. Alternatively, we could use the plotting method with ggplot2 and the function facet_wrap to divide our scatterplots by factors and see if there are changes only in intercept or also slope.

Random Intercept Model for Clustered Data

In the following examples we will use the function lme in the package nlme, so please install and/or load the package first. For this example we are using the same dataset lasrosas.corn from package agridat we used in the previous post Linear Models in Agriculture

Just to explain the syntax to use linear mixed-effects model in R for cluster data, we will assume that the factorial variable rep in our dataset describes some clusters. To fit a mixed-effects model we are going to use the function lme from the package nlme. This function can work with unbalanced designs:

 lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat)  

The syntax is very similar to all the models we fitted before, with a general formula describing our target variable yield and all the treatments, which are the fixed effects of the model. Then we have the option random, which allows us to include an additional random component for the clustering factor rep. In this case the ~1 indicates that the random effect will be associated with the intercept.

Once again we can use the function summary to explore our results:

 > summary(lme1)  
 Linear mixed-effects model fit by REML  
  Data: dat   
     AIC   BIC  logLik  
  27648.36 27740.46 -13809.18  
 Random effects:  
  Formula: ~1 | rep  
     (Intercept) Residual  
 StdDev:  0.798407 13.3573  
 Fixed effects: yield ~ nf + bv * topo   
         Value Std.Error  DF  t-value p-value  
 (Intercept) 327.3304 14.782524 3428 22.143068    0  
 nfN1      3.9643 0.788049 3428  5.030561    0  
 nfN2      5.2340 0.790104 3428  6.624449    0  
 nfN3      5.4498 0.789084 3428  6.906496    0  
 nfN4      7.5286 0.789551 3428  9.535320    0  
 nfN5      7.7254 0.789111 3428  9.789976    0  
 bv      -1.4685 0.085507 3428 -17.173569    0  
 topoHT   -233.3675 17.143956 3428 -13.612232    0  
 topoLO   -251.9750 20.967003 3428 -12.017693    0  
 topoW    -146.4066 16.968453 3428 -8.628162    0  
 bv:topoHT   1.1945 0.097696 3428 12.226279    0  
 bv:topoLO   1.4961 0.123424 3428 12.121624    0  
 bv:topoW    0.7873 0.097865 3428  8.044485    0  

We can also use the function Anova to display the ANOVA table:

 > Anova(lme2, type=c("III"))  
 Analysis of Deviance Table (Type III tests)  
 Response: yield  
        Chisq Df Pr(>Chisq)    
 (Intercept) 752.25 1 < 2.2e-16 ***  
 nf     155.57 5 < 2.2e-16 ***  
 bv     291.49 1 < 2.2e-16 ***  
 topo    236.52 3 < 2.2e-16 ***  
 year    797.13 1 < 2.2e-16 ***  
 bv:topo   210.38 3 < 2.2e-16 ***  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  

We might be interested in understanding if fitting a more complex model provides any advantage in terms of accuracy, compared with a model where no additional random effect is included. To do so we can compare this new model with mod6, which we created with the gls function and includes the same treatment structure.  We can do that with the function anova , specifying both models:

 > anova(lme1, mod6)  
    Model df   AIC   BIC  logLik  Test L.Ratio p-value  
 lme1   1 15 27648.36 27740.46 -13809.18              
 mod6   2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329 0.0275  

As you can see there is a decrease in AIC for the model fitted with lme, and the difference is significant (p-value below 0.05). Therefore this new model where clustering is accounted for is better than the one without an additional random effect, even though only slightly. In this case we would need to decide if fitting a more complex model (which is probably more difficult to explain to readers) is the best way.

Another way to assess the accuracy of GLS and Mixed Effects Models is through the use of pseudo R squared, which are indexes that can be interpreted as the normal R-Squared but calculated in different ways, since in these more complex models we do not calculate the sum of squares.
There are two important functions for this, both included in the package MuMIn:

 > r.squaredLR(mod6)  
 [1] 0.5469906  
 [1] 0.5470721  
 > r.squaredGLMM(lme1)  
    R2m    R2c   
 0.5459845 0.5476009   

The first function r.squaredLR can be used for GLS models and provides both and R-Squared and an Adjusted R-Squared. The second function, r.squaredGLMM, is specific for mixed-effects models and provides two measures: R2m and R2c. The first reports the R2 of the model with just fixed effects, while the second the R squared of the full model.
In this case we can see again that the R squared are similar between models, and most importantly R2c is only slightly different compared to R2m, which means that including random effects does not improve the accuracy.

Random Intercept and Slope for repeated measures

If we collected data at several time steps we are looking at a repeated measures analysis, which is most cases can be treated as mixed random slope and intercept model. Again, we cannot just assume that because we have collected data over time we have a random slope and intercept, we always need to do some plotting first and take a closer look at our data. In cases like this where we are dealing with a factorial variable, we may be forced to rely on barcharts, divided by years. In such cases it may be difficult to understand whether we need a model this complex. So it may be that the only way is to just compare different models with anova (this function can be used with more than 2 models if needed).

The code to create such a model is the following:

 lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat)  
 Anova(lme2, type=c("III"))  

The syntax is very similar to what we wrote before, except that now the random component includes both time and clusters. Again we can use summary to get more info about the model. We can also use again the function anova to compare this with the previous model:

 > anova(lme1, lme2)  
    Model df   AIC   BIC  logLik  Test L.Ratio p-value  
 lme1   1 15 27648.36 27740.46 -13809.18              
 lme2   2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247 <.0001  
 Warning message:  
 In anova.lme(lme1, lme2) :  
  fitted objects with different fixed effects. REML comparisons are not meaningful.  

From this output it is clear that the new model is better that the one before and their difference in highly significant. If this happens it is generally better to adopt the more complex model.

We can extract only the effects for the random components using the function ranef:

 > ranef(lme2)  
   (Intercept)     year  
 R1 -0.3468601 -1.189799e-07  
 R2 -0.5681688 -1.973702e-07  
 R3  0.9150289 3.163501e-07   

This tells us the changes in yield for each cluster and time step.

We can also do the same for the fixed effects, and this will return the coefficients of the model:

 > fixef(lme2)  
  (Intercept)     nfN1     nfN2     nfN3     nfN4   
 -1.133614e+04 3.918006e+00 5.132136e+00 5.368513e+00 7.464542e+00   
      nfN5      bv    topoHT    topoLO     topoW   
  7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02   
      year   bv:topoHT   bv:topoLO   bv:topoW   
  5.818826e+00 1.027686e+00 1.383705e+00 5.998379e-01   

To have an idea of their confidence interval we can use the function intervals:

 > intervals(lme2, which = "fixed")  
 Approximate 95% confidence intervals  
  Fixed effects:  
           lower     est.     upper  
 (Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04  
 nfN1     2.526139e+00 3.918006e+00 5.309873e+00  
 nfN2     3.736625e+00 5.132136e+00 6.527648e+00  
 nfN3     3.974809e+00 5.368513e+00 6.762216e+00  
 nfN4     6.070018e+00 7.464542e+00 8.859065e+00  
 nfN5     6.245584e+00 7.639337e+00 9.033089e+00  
 bv     -1.469793e+00 -1.318391e+00 -1.166989e+00  
 topoHT   -2.353450e+02 -2.049979e+02 -1.746508e+02  
 topoLO   -2.692026e+02 -2.321431e+02 -1.950836e+02  
 topoW    -1.436741e+02 -1.136168e+02 -8.355954e+01  
 year     5.414742e+00 5.818826e+00 6.222911e+00  
 bv:topoHT  8.547273e-01 1.027686e+00 1.200644e+00  
 bv:topoLO  1.165563e+00 1.383705e+00 1.601846e+00  
 bv:topoW   4.264933e-01 5.998379e-01 7.731826e-01  
 [1] "Fixed effects:"   

Syntax with lme4

Another popular package to perform mixed-effects models we could also use the package lme4 and the function lmer.
For example, to fit the model with random intercept (what we called lme1) we would use the following syntax in lme4:

 > lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat)  
 > summary(lmer1)  
 Linear mixed model fit by REML ['lmerMod']  
 Formula: yield ~ nf + bv * topo + (1 | rep)  
   Data: dat  
 REML criterion at convergence: 27618.4  
 Scaled residuals:   
   Min   1Q Median   3Q   Max   
 -3.4267 -0.7767 -0.1109 0.7196 3.6892   
 Random effects:  
  Groups  Name    Variance Std.Dev.  
  rep   (Intercept)  0.6375 0.7984   
  Residual       178.4174 13.3573   
 Number of obs: 3443, groups: rep, 3  
 Fixed effects:  
        Estimate Std. Error t value  
 (Intercept) 327.33043  14.78252 22.143  
 nfN1      3.96433  0.78805  5.031  
 nfN2      5.23400  0.79010  6.624  
 nfN3      5.44980  0.78908  6.906  
 nfN4      7.52862  0.78955  9.535  
 nfN5      7.72537  0.78911  9.790  
 bv      -1.46846  0.08551 -17.174  
 topoHT   -233.36750  17.14396 -13.612  
 topoLO   -251.97500  20.96700 -12.018  
 topoW    -146.40655  16.96845 -8.628  
 bv:topoHT   1.19446  0.09770 12.226  
 bv:topoLO   1.49609  0.12342 12.122  
 bv:topoW    0.78727  0.09786  8.044  
 Correlation matrix not shown by default, as p = 13 > 12.  
 Use print(x, correlation=TRUE) or  
      vcov(x)     if you need it  

There are several differences between nlme and lme4 and I am not sure which is actually better. What I found is that probably lme4 is the most popular, but nlme is used for example to fit generalized addictive mixed effects models in the package mgcv. For this reason probably the best thing would be to know how to use both packages.

As you can see from the summary above, in this table there are no p-values, so it is a bit difficult to know which levels are significant for the model. We can solve this by installing and/or loading the package lmerTest. If we load lmerTest and run again the same function we would obtain the following summary table:

 > lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat)  
 > summary(lmer1)  
 Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of freedom [  
 Formula: yield ~ nf + bv * topo + (1 | rep)  
   Data: dat  
 REML criterion at convergence: 27618.4  
 Scaled residuals:   
   Min   1Q Median   3Q   Max   
 -3.4267 -0.7767 -0.1109 0.7196 3.6892   
 Random effects:  
  Groups  Name    Variance Std.Dev.  
  rep   (Intercept)  0.6375 0.7984   
  Residual       178.4174 13.3573   
 Number of obs: 3443, groups: rep, 3  
 Fixed effects:  
        Estimate Std. Error     df t value Pr(>|t|)    
 (Intercept) 327.33043  14.78252 3411.00000 22.143 < 2e-16 ***  
 nfN1      3.96433  0.78805 3428.00000  5.031 5.14e-07 ***  
 nfN2      5.23400  0.79010 3428.00000  6.624 4.03e-11 ***  
 nfN3      5.44980  0.78908 3428.00000  6.906 5.90e-12 ***  
 nfN4      7.52862  0.78955 3428.00000  9.535 < 2e-16 ***  
 nfN5      7.72537  0.78911 3428.00000  9.790 < 2e-16 ***  
 bv      -1.46846  0.08551 3428.00000 -17.174 < 2e-16 ***  
 topoHT   -233.36750  17.14396 3429.00000 -13.612 < 2e-16 ***  
 topoLO   -251.97500  20.96700 3430.00000 -12.018 < 2e-16 ***  
 topoW    -146.40655  16.96845 3430.00000 -8.628 < 2e-16 ***  
 bv:topoHT   1.19446  0.09770 3429.00000 12.226 < 2e-16 ***  
 bv:topoLO   1.49609  0.12342 3430.00000 12.122 < 2e-16 ***  
 bv:topoW    0.78727  0.09786 3430.00000  8.044 1.33e-15 ***  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  
 Correlation matrix not shown by default, as p = 13 > 12.  
 Use print(x, correlation=TRUE) or  
      vcov(x)     if you need it  

As you can see now the p-values are showing and we can assess the significance for each term.

Clearly, all the functions we used above for the function lme can be used also with the package lme4 and lmerTest. For example, we can produce the anova table with the function anova or compute the R Squared with the function r.squaredGLMM.

When we are dealing with random slope and intercept we would use the following syntax:

 lmer2 = lmer(yield ~ nf + bv * topo + year + (year|rep), data=dat)  

No comments:

Post a Comment