In this tutorial, I used two popular machine learning algorithms: Random Forest and GLMnet for Genomic Prediction of a quantitative trait. There are several machine learning R packages available, however, in this tutorial i used caret package. The objective was to develop two models: Random forest and glmnet using real data sets consisting genetic markers, that were used as predictors to predict a quantitative trait (names of the variables were disguised in this tutorial), and finally, comparing the performance of the two models by testing them on a validation set.

Data preparation

Step 1. libraries and set seed

library(randomForest)
library(mlbench)
library(caret)
library(e1071)

set.seed(06252020)

Step 2. Load input file and explore

The genetic marker data was converted into numeric format. One can use TASSEL software to convert their data into numeric format, as well as numerically impute any sites with missing data. Please remember, its curicial to impute the data or else it will result an error while executing the model and or result biased model output. Additionally, I removed any missing phenotypic values from the data set.

Download the dataset i used in this tutorial here at this link: Click here

Code:

phenoGeno <- read.table("rhGeno_numericImpu_malateBlues_HI.txt", header = T, na.strings = "NA")

## Explore the data
head(phenoGeno)
tail(phenoGeno)

Output:

   phenotype marker1 marker2 marker3 marker4 marker5 marker6 marker7 marker8
3.301974     0.5     0.0     0.5     0.0     0.0   0.500       1     0.0
3.314499     0.0     0.0     1.0     0.0     0.0   1.000       1     0.0
4.701069     0.5     0.0     1.0     0.0     0.0   0.500       1     0.0
4.943334     1.0     0.5     0.5     0.5     0.5   0.252       1     0.5
5.061315     0.0     0.0     1.0     0.0     0.0   0.500       1     0.0
5.144740     0.5     0.0     0.5     0.0     0.0   1.000       1     0.0

Step 2.1 Distribution of the phenotype data

hist(phenoGeno$phenotype)

The phenotype data is normally distributed, and no preprocessing was performed. However, I do suggest transforming the data if required.

Step 3. Training and Test set

Step 3.1 Splitting the data into train and test set

In the next step, the imported data was split into 60/40 split, since the dataset i have used in this tutorial is small and, therefore, 60/40 gives a larger and more reliable test set.

Step 3.2 Randomizing the order of the dataset

rows <- sample(nrow(phenoGeno))

phenoGeno <- phenoGeno[rows, ]

head(phenoGeno)

Step 3.3 Find the rows to split

For this tutorial, data will be split into 60/40, which provides a larger and more reliable test/validation set.

split <- round(nrow(phenoGeno) * 0.60)

train_set <- phenoGeno[1:split,]

test_set <- phenoGeno[(split + 1):nrow(phenoGeno),]

Step 3.4 Confirm train and test set sizes

# Check the ratios of training and test sets
nrow(train_set)/nrow(phenoGeno)

nrow(test_set)/nrow(phenoGeno)

# plot the two data sets
par(mfrow=c(1,2))
hist(train_set$phenotype)
hist(test_set$phenotype)
par(mfrow=c(1,1))

Random forest - rf function

The name Random forest is a popular name in the area of machine learning. This alogrithm is often acurrate than other algorithms, easier to tune, require less little preprocessing, and capture threshold effects and variale interactions, however, it is less interpretable and often quite slow.

TrainControl object

Before creating and running the models, its a good idea to create a trainControl object to tune parameters and further control how models are created.

myTrainingControl <- trainControl(method = "cv", 
                              number = 10, 
                              savePredictions = TRUE, 
                              classProbs = TRUE, 
                              verboseIter = TRUE)

Creating the model

Markers in the column from 2:1942 were used as predictors (x) and the phenotype column as the y variable. Similarly, the performance of the model was evaluated using root mean square error (RMSE) (One can choose other method such as ROC, AUC etc as well), and finally choosing random forest rf as the method in the model.

Code:

# Create tunegrid for mtry to tune the model. 
tunegrid <- data.frame(.mtry=c(2,4,6, 10, 15, 20, 100))

randomForestFit = train(x = train_set[2:1942], 
                        y = train_set$phenotype, 
                        method = "rf",
                        metric = "RMSE",
                        tuneGrid = tunegrid
                        )

# print the model output 
randomForestFit

Output:

 Random Forest 
 
  101 samples
 1941 predictors
 
 No pre-processing
 Resampling: Bootstrapped (25 reps) 
 Summary of sample sizes: 101, 101, 101, 101, 101, 101, ... 
 Resampling results across tuning parameters:
 
   mtry  RMSE      Rsquared   MAE     
     2   2.754033  0.1057086  2.233488
     4   2.736635  0.1023974  2.221954
     6   2.717262  0.1151134  2.203889
    10   2.710155  0.1121776  2.200923
    15   2.700049  0.1159634  2.197863
    20   2.700378  0.1109506  2.194062
   100   2.668136  0.1296488  2.159293
 
 RMSE was used to select the optimal model using the smallest value.
 The final value used for the model was mtry = 100.

Plotting the randomForetFit model.

plot(randomForestFit, main = "Random Forest")

The best mtry with lowest RMSE for this model was 10.

Next, we can plot the important predictors based on their calculated importance scores using varImp function.

plot(varImp(randomForestFit), top = 20)

After building the supevised random forest learning model, The top 20 predictors can be ranked by their importance, and from the above plot, we tell that the marker703 was the most important variable.

Validation of Random Forest model

Now we can test the robustness of the model by testing it on test_set data set, and regressing the predicted and measured values.

prediction_rf <- predict(randomForestFit, test_set)

plot(test_set$phenotype, prediction_rf,
     main = "Random Forest", pch=17, col=c("red"))
abline(lm(prediction_rf~test_set$phenotype))

And we can also calculate the RMSE between the predicted and measured values.

error_rf <- prediction_rf - test_set$phenotype
rmse_rf <- sqrt(mean(error_rf^2))
rmse_rf

Glmnet model - glmnet function

This is another popular machine learning algorithm, which has some benefits over the Random forest model. It is a extension of glm models with a built-in variable selection that also help in dealing with collinearity and small sizes. The glmnet has two primary froms: 1) LASSO regression, which penalizes number of non-zero coefficients, and 2) Ridge regression, which penalizes absolute magnitude of coefficients. It also attempts to find a parsimonious aka simple model and pairs well with random forest models.

First, we create the glmnet models, which is a combination of lasso and ridge regression, and we can fit the mix of the two models by selecting the values for alpha and lambda, as shown in the below code snippet.

glmnet model

Code:

glmnetFit = train(x = train_set[2:1942], 
            y = train_set$phenotype, 
            method = "glmnet",
            metric = "RMSE",
            trainControl = myControl,
            tuneGrid = expand.grid(
              alpha=0:1,
              lambda= 0:10 / 10
            )
          )

print(glmnetFit)

Output:

 glmnet 
 
samples
predictors
 
 No pre-processing
 Resampling: Bootstrapped (25 reps) 
 Summary of sample sizes: 101, 101, 101, 101, 101, 101, ... 
 Resampling results across tuning parameters:
 
   alpha  lambda  RMSE      Rsquared    MAE     
    0.0     2.724706  0.13509467  2.256743
    0.1     2.724706  0.13509467  2.256743
    0.2     2.724706  0.13509467  2.256743
    0.3     2.724706  0.13509467  2.256743
    0.4     2.724706  0.13509467  2.256743
    0.5     2.724706  0.13509467  2.256743
    0.6     2.724706  0.13509467  2.256743
    0.7     2.724706  0.13509467  2.256743
    0.8     2.724706  0.13509467  2.256743
    0.9     2.724706  0.13509467  2.256743
    1.0     2.724706  0.13509467  2.256743
    0.0     3.005565  0.09225936  2.467820
    0.1     2.897531  0.09590490  2.359915
    0.2     2.833422  0.09475887  2.279513
    0.3     2.817838  0.08354247  2.255736
    0.4     2.798377  0.08085184  2.229348
    0.5     2.776585  0.08554465  2.205043
    0.6     2.767135  0.08753894  2.193391
    0.7     2.767016  0.08620681  2.193196
    0.8     2.773281  0.08476073  2.200689
    0.9     2.784696  0.08616074  2.215916
    1.0     2.800997  0.08520355  2.235143
 
 RMSE was used to select the optimal model using the smallest value.
 The final values used for the model were alpha = 0 and lambda = 1.

From the above output of the model, we see that alpha = 1, indicating lasso regression was performed, where, the size of the penalty was 0.7

Comparing models: Ridge vs LASSO

plot(glmnetFit, main = "Glmnet")

From the above comparison model, we can say that the lamda of 0.9 has lowest RMSE.

Full regularization path

plot(glmnetFit$finalModel)

Next, we can plot the important predictors in this model based on their calculated importance scores using varImp function.

plot(varImp(glmnetFit), top = 20)

marker703 was calculated as the most important variable by glmnet model, which is consistent with random forest model. However, the glmnet has very few variables listed in comparison to rf.

Validation of glmnet model

Now we can test the robustness of the glmnet model by testing it on test_set data set, and regressing the predicted and measured values.

prediction_glmnet <- predict(glmnetFit, test_set)

plot(test_set$phenotype, prediction_glmnet,
     main = "GLMnet", pch=17, col=c("red"))
abline(lm(prediction_glmnet~test_set$phenotype))

From the above plot, we see that the most important variables only classify the high and low phenotypic scores as categorical variables, which is not the nature of this phenotype data used in this tutorial.

And we can also calculate the RMSE between the predicted and measured values.

error_glm <- prediction_glmnet - test_set$phenotype
rmse_glm <- sqrt(mean(error_glm^2))
rmse_glm

Scatter plots of the predicted values from two models

plot(test_set$phenotype, prediction_glmnet,
      pch=17, col=c("red"), ylab = "Predicted", xlab = "Measured")
points(test_set$phenotype, prediction_rf, pch=20, col=("blue"))
abline(lm(prediction_glmnet~test_set$phenotype), lwd=1, col="red")
abline(lm(prediction_rf~test_set$phenotype), lwd=1, col="blue")
legend(14,9, legend=c("GLMnet", "Random Forest"),
       col=c("red", "blue"), lty=1, cex=0.8)

Comparing the Random Forest and Glmnet models

We can compare the performance of the models by studying their MAE, RMSE and R-squared values side-by-side, making it very convenient.

# Create model_list
model_list <- list(random_forest = randomForestFit, glmnet = glmnetFit)

# Pass model_list to resamples(): resamples
resamples <- resamples(model_list)
resamples

# Summarize the results
summary(resamples)

Output:

Call:
 summary.resamples(object = resamples)
 
 Models: random_forest, glmnet 
 Number of resamples: 25 
 
 MAE 
                   Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
 random_forest 1.827043 1.955185 2.227830 2.159293 2.325698 2.462633    0
 glmnet        1.906341 2.115096 2.219258 2.256743 2.358562 2.839599    0
 
 RMSE 
                   Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
 random_forest 2.251769 2.464010 2.697926 2.668136 2.850605 3.100536    0
 glmnet        2.323055 2.544948 2.697235 2.724706 2.819970 3.509943    0
 
 Rsquared 
                       Min.    1st Qu.   Median      Mean   3rd Qu.      Max.
 random_forest 4.784201e-02 0.07672433 0.117394 0.1296488 0.1714565 0.3358213
 glmnet        8.322920e-07 0.06500280 0.126212 0.1350947 0.1709628 0.4315585
               NA's
 random_forest    0
 glmnet           0

From above table, we can tell that the random forest model appears to be better model incoparison to glmnet.

plot models by MAE, R-squrared and RMSE

Similarly, we can visually inspect the models accuracies.

bwplot(resamples, metric = "RMSE")
dotplot(resamples, metric = "RMSE")

Conclusion:

In this tutorial, real genetic markers and phenotype data were used in machine learning models: random forest and glmnet(lasso and ridge regression). Marker data were converted into numeric format and numerically imputed, prior to performing the supervised machine learning. Both models had a very close performance for this data set, but random forest did perform slightly better in comparison to glmet, and most importantly both models listed marker703 as the most important variable, which is astoudingly consistent with my QTL analysis, that is the peak QTL marker was in LD with marker703 explaining highest phenotypic variance.