In this tutorial, I used two popular machine learning algorithms: `Random Forest`

and `GLMnet`

for `Genomic Prediction`

of a quantitative trait. There are several machine learning R packages available, however, in this tutorial i used `caret`

package. The objective was to develop two models: `Random forest`

and `glmnet`

using real data sets consisting genetic markers, that were used as `predictors`

to predict a quantitative trait (names of the variables were disguised in this tutorial), and finally, comparing the performance of the two models by testing them on a validation set.

# Data preparation

## Step 1. libraries and set seed

```
library(randomForest)
library(mlbench)
library(caret)
library(e1071)
set.seed(06252020)
```

## Step 2. Load input file and explore

The genetic marker data was converted into numeric format. One can use `TASSEL`

software to convert their data into numeric format, as well as numerically impute any sites with missing data. Please remember, its curicial to impute the data or else it will result an error while executing the model and or result biased model output. Additionally, I removed any missing phenotypic values from the data set.

Download the dataset i used in this tutorial here at this link: Click here

Code:

```
phenoGeno <- read.table("rhGeno_numericImpu_malateBlues_HI.txt", header = T, na.strings = "NA")
## Explore the data
head(phenoGeno)
tail(phenoGeno)
```

Output:

```
phenotype marker1 marker2 marker3 marker4 marker5 marker6 marker7 marker8
1 3.301974 0.5 0.0 0.5 0.0 0.0 0.500 1 0.0
2 3.314499 0.0 0.0 1.0 0.0 0.0 1.000 1 0.0
3 4.701069 0.5 0.0 1.0 0.0 0.0 0.500 1 0.0
4 4.943334 1.0 0.5 0.5 0.5 0.5 0.252 1 0.5
5 5.061315 0.0 0.0 1.0 0.0 0.0 0.500 1 0.0
6 5.144740 0.5 0.0 0.5 0.0 0.0 1.000 1 0.0
```

### Step 2.1 Distribution of the phenotype data

```
hist(phenoGeno$phenotype)
```

The phenotype data is normally distributed, and no preprocessing was performed. However, I do suggest transforming the data if required.

## Step 3. Training and Test set

### Step 3.1 Splitting the data into train and test set

In the next step, the imported data was split into 60/40 split, since the dataset i have used in this tutorial is small and, therefore, 60/40 gives a larger and more reliable test set.

### Step 3.2 Randomizing the order of the dataset

```
rows <- sample(nrow(phenoGeno))
phenoGeno <- phenoGeno[rows, ]
head(phenoGeno)
```

### Step 3.3 Find the rows to split

For this tutorial, data will be split into 60/40, which provides a larger and more reliable test/validation set.

```
split <- round(nrow(phenoGeno) * 0.60)
train_set <- phenoGeno[1:split,]
test_set <- phenoGeno[(split + 1):nrow(phenoGeno),]
```

### Step 3.4 Confirm train and test set sizes

```
# Check the ratios of training and test sets
nrow(train_set)/nrow(phenoGeno)
nrow(test_set)/nrow(phenoGeno)
# plot the two data sets
par(mfrow=c(1,2))
hist(train_set$phenotype)
hist(test_set$phenotype)
par(mfrow=c(1,1))
```

# Random forest - rf function

The name `Random forest`

is a popular name in the area of machine learning. This alogrithm is often acurrate than other algorithms, easier to tune, require less little preprocessing, and capture threshold effects and variale interactions, however, it is less interpretable and often quite slow.

## TrainControl object

Before creating and running the models, its a good idea to create a `trainControl`

object to tune parameters and further control how models are created.

```
myTrainingControl <- trainControl(method = "cv",
number = 10,
savePredictions = TRUE,
classProbs = TRUE,
verboseIter = TRUE)
```

## Creating the model

Markers in the column from 2:1942 were used as predictors (`x`

) and the `phenotype`

column as the `y`

variable. Similarly, the performance of the model was evaluated using root mean square error (`RMSE`

) (One can choose other method such as `ROC`

, `AUC`

etc as well), and finally choosing random forest `rf`

as the method in the model.

Code:

```
# Create tunegrid for mtry to tune the model.
tunegrid <- data.frame(.mtry=c(2,4,6, 10, 15, 20, 100))
randomForestFit = train(x = train_set[2:1942],
y = train_set$phenotype,
method = "rf",
metric = "RMSE",
tuneGrid = tunegrid
)
# print the model output
randomForestFit
```

Output:

```
Random Forest
101 samples
1941 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 2.754033 0.1057086 2.233488
4 2.736635 0.1023974 2.221954
6 2.717262 0.1151134 2.203889
10 2.710155 0.1121776 2.200923
15 2.700049 0.1159634 2.197863
20 2.700378 0.1109506 2.194062
100 2.668136 0.1296488 2.159293
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 100.
```

Plotting the `randomForetFit`

model.

```
plot(randomForestFit, main = "Random Forest")
```

The best mtry with lowest RMSE for this model was 10.

Next, we can plot the important predictors based on their calculated importance scores using `varImp`

function.

```
plot(varImp(randomForestFit), top = 20)
```

After building the supevised random forest learning model, The top 20 predictors can be ranked by their importance, and from the above plot, we tell that the `marker703`

was the most important variable.

## Validation of Random Forest model

Now we can test the robustness of the model by testing it on `test_set`

data set, and regressing the predicted and measured values.

```
prediction_rf <- predict(randomForestFit, test_set)
plot(test_set$phenotype, prediction_rf,
main = "Random Forest", pch=17, col=c("red"))
abline(lm(prediction_rf~test_set$phenotype))
```

And we can also calculate the `RMSE`

between the predicted and measured values.

```
error_rf <- prediction_rf - test_set$phenotype
rmse_rf <- sqrt(mean(error_rf^2))
rmse_rf
```

# Glmnet model - glmnet function

This is another popular machine learning algorithm, which has some benefits over the Random forest model. It is a extension of `glm`

models with a built-in variable selection that also help in dealing with `collinearity`

and small sizes. The `glmnet`

has two primary froms: 1) `LASSO`

regression, which penalizes number of non-zero coefficients, and 2) `Ridge`

regression, which penalizes absolute magnitude of coefficients. It also attempts to find a parsimonious aka simple model and pairs well with random forest models.

First, we create the glmnet models, which is a combination of lasso and ridge regression, and we can fit the mix of the two models by selecting the values for `alpha`

and `lambda`

, as shown in the below code snippet.

## glmnet model

Code:

```
glmnetFit = train(x = train_set[2:1942],
y = train_set$phenotype,
method = "glmnet",
metric = "RMSE",
trainControl = myControl,
tuneGrid = expand.grid(
alpha=0:1,
lambda= 0:10 / 10
)
)
print(glmnetFit)
```

Output:

```
glmnet
101 samples
1941 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 101, 101, 101, 101, 101, 101, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0 0.0 2.724706 0.13509467 2.256743
0 0.1 2.724706 0.13509467 2.256743
0 0.2 2.724706 0.13509467 2.256743
0 0.3 2.724706 0.13509467 2.256743
0 0.4 2.724706 0.13509467 2.256743
0 0.5 2.724706 0.13509467 2.256743
0 0.6 2.724706 0.13509467 2.256743
0 0.7 2.724706 0.13509467 2.256743
0 0.8 2.724706 0.13509467 2.256743
0 0.9 2.724706 0.13509467 2.256743
0 1.0 2.724706 0.13509467 2.256743
1 0.0 3.005565 0.09225936 2.467820
1 0.1 2.897531 0.09590490 2.359915
1 0.2 2.833422 0.09475887 2.279513
1 0.3 2.817838 0.08354247 2.255736
1 0.4 2.798377 0.08085184 2.229348
1 0.5 2.776585 0.08554465 2.205043
1 0.6 2.767135 0.08753894 2.193391
1 0.7 2.767016 0.08620681 2.193196
1 0.8 2.773281 0.08476073 2.200689
1 0.9 2.784696 0.08616074 2.215916
1 1.0 2.800997 0.08520355 2.235143
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 1.
```

From the above output of the model, we see that alpha = 1, indicating lasso regression was performed, where, the size of the penalty was 0.7

## Comparing models: Ridge vs LASSO

```
plot(glmnetFit, main = "Glmnet")
```

From the above comparison model, we can say that the lamda of 0.9 has lowest RMSE.

## Full regularization path

```
plot(glmnetFit$finalModel)
```

Next, we can plot the important predictors in this model based on their calculated importance scores using `varImp`

function.

```
plot(varImp(glmnetFit), top = 20)
```

`marker703`

was calculated as the most important variable by glmnet model, which is consistent with random forest model. However, the glmnet has very few variables listed in comparison to rf.

## Validation of glmnet model

Now we can test the robustness of the glmnet model by testing it on `test_set`

data set, and regressing the predicted and measured values.

```
prediction_glmnet <- predict(glmnetFit, test_set)
plot(test_set$phenotype, prediction_glmnet,
main = "GLMnet", pch=17, col=c("red"))
abline(lm(prediction_glmnet~test_set$phenotype))
```

From the above plot, we see that the most important variables only classify the high and low phenotypic scores as categorical variables, which is not the nature of this phenotype data used in this tutorial.

And we can also calculate the `RMSE`

between the predicted and measured values.

```
error_glm <- prediction_glmnet - test_set$phenotype
rmse_glm <- sqrt(mean(error_glm^2))
rmse_glm
```

## Scatter plots of the predicted values from two models

```
plot(test_set$phenotype, prediction_glmnet,
pch=17, col=c("red"), ylab = "Predicted", xlab = "Measured")
points(test_set$phenotype, prediction_rf, pch=20, col=("blue"))
abline(lm(prediction_glmnet~test_set$phenotype), lwd=1, col="red")
abline(lm(prediction_rf~test_set$phenotype), lwd=1, col="blue")
legend(14,9, legend=c("GLMnet", "Random Forest"),
col=c("red", "blue"), lty=1, cex=0.8)
```

## Comparing the Random Forest and Glmnet models

We can compare the performance of the models by studying their MAE, RMSE and R-squared values side-by-side, making it very convenient.

```
# Create model_list
model_list <- list(random_forest = randomForestFit, glmnet = glmnetFit)
# Pass model_list to resamples(): resamples
resamples <- resamples(model_list)
resamples
# Summarize the results
summary(resamples)
```

Output:

```
Call:
summary.resamples(object = resamples)
Models: random_forest, glmnet
Number of resamples: 25
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
random_forest 1.827043 1.955185 2.227830 2.159293 2.325698 2.462633 0
glmnet 1.906341 2.115096 2.219258 2.256743 2.358562 2.839599 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
random_forest 2.251769 2.464010 2.697926 2.668136 2.850605 3.100536 0
glmnet 2.323055 2.544948 2.697235 2.724706 2.819970 3.509943 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max.
random_forest 4.784201e-02 0.07672433 0.117394 0.1296488 0.1714565 0.3358213
glmnet 8.322920e-07 0.06500280 0.126212 0.1350947 0.1709628 0.4315585
NA's
random_forest 0
glmnet 0
```

From above table, we can tell that the random forest model appears to be better model incoparison to glmnet.

## plot models by MAE, R-squrared and RMSE

Similarly, we can visually inspect the models accuracies.

```
bwplot(resamples, metric = "RMSE")
dotplot(resamples, metric = "RMSE")
```

## Conclusion:

In this tutorial, real genetic markers and phenotype data were used in machine learning models: random forest and glmnet(lasso and ridge regression). Marker data were converted into numeric format and numerically imputed, prior to performing the supervised machine learning. Both models had a very close performance for this data set, but random forest did perform slightly better in comparison to glmet, and most importantly both models listed `marker703`

as the most important variable, which is astoudingly consistent with my `QTL`

analysis, that is the peak QTL marker was in LD with `marker703`

explaining highest phenotypic variance.