Multiple Quantitative traits
were evaluated with varying heritabilties to study how the inheritance of a trait affect the performance of genomic selection and prediction models. Machine learning algorithms Random Forest
and GLMnet: Lasso and Elastic-Net Regularized Generalized Linear Models
were deployed to develop Genomic selection
models using the publicly available dataset from the Maize genomes to fields (G2F) initiative. In the analysis, data from 40 maize hybrid experiments across 34 unique locations in 19 states in the U.S. and one Canadian province from years 2016 and 2017 experiments were evaluated.
From the data set, five important agronomic traits were selected: Days to Anthesis
, Days to Silking
, Plant height
, Ear height
and Grain yield
, and exploratory data analysis of each trait by each year and across locations were explored as well as their heritability
and BLUPs
were calculated. Further, genotype-by-sequencing (GBS)
for each hybrid in the dataset were filtered, thinned and imputed in command line TASSEL v5
, prior to using them as predictors of the two quantitative traits. The two data sets: phenotype and genotype data were intersected by taxa and partioned into training and testing sets, and finally, training GS models were cross-validated on testing data set and evaluated by comparing the RMSE
and R-Squared
.
Table of Contents
- 1.0 Phenotypic data assessment
- 2.0 Genotypic data
- 3.0 Genomic selection algorithms
- 4.0 Summary of genomic selection algorithms on selected traits
1.0 Phenotypic data assessment
Download G2F phenotypic data from year 2016 and 2017
#install.packages("RCurl")
library(RCurl)
## download 2016 g2f hybrid clean (outliers removed) phenotypic data
download.file("https://de.cyverse.org/anon-files//iplant/home/shared/commons_repo/curated/GenomesToFields_2014_2017_v1/G2F_Planting_Season_2016_v2/a._2016_hybrid_phenotypic_data/g2f_2016_hybrid_data_clean.csv",destfile="pheno_2016_g2f_clean.csv",method="libcurl")
## download 2017 g2f hybrid clean (outliers removed) phenotypic data
download.file("https://de.cyverse.org/anon-files//iplant/home/shared/commons_repo/curated/GenomesToFields_2014_2017_v1/G2F_Planting_Season_2017_v1/a._2017_hybrid_phenotypic_data/g2f_2017_hybrid_data_clean.csv",destfile="pheno_2017_g2f_clean.csv",method="libcurl")
Import the downloaded two data sets in R environment
library(dplyr)
# import 2016 phenotype data
pheno_2016 <- read.csv(file = "pheno_2016_g2f_clean.csv", header = T, sep = ",",
na.strings = c("","NaN"), fileEncoding="UTF-8-BOM")
# import 2017 phenotype data
pheno_2017 <- read.csv(file = "pheno_2017_g2f_clean.csv", header = T, sep = ",",
na.strings = c("","NaN"), fileEncoding="UTF-8-BOM")
Checking the imported phenotype data
Code:
glimpse(pheno_2016)
View(pheno_2016)
glimpse(pheno_2017)
#View(pheno_2017)
Output:
Rows: 16,377
Columns: 38
$ Year <int> 2016, 2016, 2016, 2016, 201...
$ Field.Location <chr> "ARH1", "ARH1", "ARH1", "AR...
$ RecId <int> 3094939, 3095402, 3093386, ...
$ Source <chr> "15SFG:2004", "15SJWE:G2F:1...
$ Pedigree <chr> "2369/PHZ51", "LH195/PHN37"...
$ Replicate <int> 2, 2, 2, 2, 1, 2, 1, 1, 2, ...
$ Block <int> 2, 2, 2, 2, 1, 2, 1, 1, 2, ...
$ Plot <int> 2, 53, 123, 137, 172, 217, ...
$ Range <int> 2, NA, 14, 15, 3, 17, 6, 3,...
$ Pass <int> 4, NA, 8, 7, 2, 7, 5, 9, 1,...
$ LOCAL_CHECK..Yes..No.Blank.. <chr> NA, NA, NA, NA, NA, NA, NA,...
$ Plot.Length.Field <dbl> 25, 25, 25, 25, 25, 25, 25,...
$ Alley.Length <dbl> 5, 5, 5, 5, 5, 5, 3, 5, 5, ...
$ Row.Spacing <int> 38, 38, 38, 38, 38, 38, 30,...
$ Plot.Area <dbl> 126.67, 126.67, 126.67, 126...
$ Rows.Plot <lgl> NA, NA, NA, NA, NA, NA, NA,...
$ Packet.Plot <lgl> NA, NA, NA, NA, NA, NA, NA,...
$ Kernels.Packet <lgl> NA, NA, NA, NA, NA, NA, NA,...
$ X..Seed <int> 98, 98, 98, 98, 98, 98, 100...
$ Date.Planted <chr> "4/7/16", "4/7/16", "4/7/16...
$ Date.Harvested <chr> "8/30/16", "8/30/16", "8/30...
Select five traits from the imported .csv data set from 2016 and 2017, and merging both data sets.
Code:
library(dplyr)
## 2016 selected phenotypes and creating new object
pheno_2016_selected <- pheno_2016 %>%
dplyr::select(Pedigree, Year, Field.Location, Replicate, Block, Plot, Pollen.DAP..days., Silk.DAP..days., Plant.Height..cm., Ear.Height..cm., Grain.Yield..bu.A.)
head(pheno_2016_selected)
## 2017 selected phenotypes and creating new object
pheno_2017_selected <- pheno_2017 %>%
dplyr::select(Pedigree, Year, Field.Location, Replicate, Block, Plot, Pollen.DAP..days., Silk.DAP..days., Plant.Height..cm., Ear.Height..cm., Grain.Yield..bu.A.)
head(pheno_2017_selected)
## merge selected 2016 and 2017 phenotype datya
merged_pheno <- rbind(pheno_2016_selected, pheno_2017_selected)
head(merged_pheno)
tail(merged_pheno)
#View(merged_pheno)
Output:
Pedigree Year Field.Location Replicate Block Plot Pollen.DAP..days.
1 2369/PHZ51 2016 ARH1 2 2 2 72
2 LH195/PHN37 2016 ARH1 2 2 53 69
3 PHHB9/PHM57 2016 ARH1 2 2 123 72
4 PHW53/LH123HT 2016 ARH1 2 2 137 70
5 B119/3IIH6 2016 ARH1 1 1 172 67
6 BSSSC0_038/3IIH6 2016 ARH1 2 2 217 69
Silk.DAP..days. Plant.Height..cm. Ear.Height..cm. Grain.Yield..bu.A.
1 73 188 107 111.83608
2 69 205 122 124.12257
3 75 170 100 84.60683
4 72 200 97 130.43523
5 69 175 125 119.27595
6 70 145 84 99.01134
Summary statistics table of the selected phenotpe data across years and locations
Code:
library(dplyr)
library(tidyr)
df_mergePheno <- tbl_df(merged_pheno)
df_mergePheno.summary <- df_mergePheno %>%
dplyr::select(Pollen.DAP..days., Silk.DAP..days., Plant.Height..cm., Ear.Height..cm., Grain.Yield..bu.A.) %>% # select variables to summarise
summarise_all(funs(min = min,
#q25 = quantile(., 0.25),
median = median,
#q75 = quantile(., 0.75),
max = max,
mean = mean,
sd = sd), na.rm = TRUE)
df_mergePheno.summary.tidy <- df_mergePheno.summary %>% gather(stat, val) %>%
separate(stat, into = c("Phenotype", "stat"), sep = "_") %>%
spread(stat, val) %>%
dplyr::select(Phenotype, min, max, mean, median, sd ) # reorder columns
print(df_mergePheno.summary.tidy)
Output:
# A tibble: 5 x 6
Phenotype min max mean median sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ear.Height..cm. 20 266 106. 107 30.2
2 Grain.Yield..bu.A. 7.97 323. 143. 145. 47.7
3 Plant.Height..cm. 71 347 218. 225 49.2
4 Pollen.DAP..days. 29 96 68.2 68 8.94
5 Silk.DAP..days. 29 99 69.3 70 9.01
From the above table, Plant height
and grain yield
has the highest variance, while flowering time data
has the lowest variance.
Data visualization
Box plots of each phenotype by year and multiple location.
Flowering time data
library(ggplot2)
## Pollen DAP days
ggplot(merged_pheno, aes(x=Pollen.DAP..days., fill = as.factor(Year))) +
geom_boxplot(alpha=0.6) + facet_wrap(~Field.Location) +
scale_x_log10() +
ggtitle("Pollen DAP Days")
Output:
## Silk.DAP..days.
ggplot(merged_pheno, aes(x=Silk.DAP..days., fill = as.factor(Year))) +
geom_boxplot(alpha=0.6) + facet_wrap(~Field.Location) +
scale_y_log10() +
ggtitle("Silk DAP Days")
Output:
From the above plot, both flowering time data are consistent among two years and across years, however, with a couple of exceptions such as data from ONH2 (Canada), which appears to early flowering in 2017 and late in 2016, possibly due to difference in the weather conditions. Similarly, there are locations were no data were collected and or with only single year data.
Plant and ear height
# Plant height
ggplot(merged_pheno, aes(x=Plant.Height..cm., fill = as.factor(Year))) +
geom_boxplot(alpha=0.6) + facet_wrap(~Field.Location) +
scale_x_log10() +
ggtitle("Plant height (cm)")
# Ear height
ggplot(merged_pheno, aes(x=Ear.Height..cm., fill = as.factor(Year))) +
geom_boxplot(alpha=0.6) + facet_wrap(~Field.Location) +
scale_x_log10() +
ggtitle("Ear height (cm)")
Output:
Both plant and ear height data across years and location are considerably consisent, with few outliers in the data set.
Grain Yield
# Grain Yield
ggplot(merged_pheno, aes(x=Grain.Yield..bu.A., fill = as.factor(Year))) +
geom_boxplot(alpha=0.6) + facet_wrap(~Field.Location) +
scale_x_log10() +
ggtitle("Grain Yield (bu/Acre)")
Output:
The grain yield that appears to fairly consistent between two years however, this data set has a high number of outliers in comparison to other traits.
Heritability
Calculating variance componets and Heritability on a line mean basis
for each phenotype using lme4
package in R.
Checking data structure
attach(merged_pheno)
# checking data structure
str(merged_pheno)
# renaming variables
YEAR = as.factor(Year)
LINE = as.factor(Pedigree)
REP = as.factor(Replicate)
BLOCK = as.factor(Block)
LOC = as.factor(Field.Location)
POLLEN = as.numeric(Pollen.DAP..days.)
SILK = as.numeric(Silk.DAP..days.)
EAR_HT = as.numeric(Ear.Height..cm.)
PLNT_HT = as.numeric(Plant.Height..cm.)
GRAIN_YLD = as.numeric(Grain.Yield..bu.A.)
output:
'data.frame': 34758 obs. of 11 variables:
$ Pedigree : chr "2369/PHZ51" "LH195/PHN37" "PHHB9/PHM57" "PHW53/LH123HT" ...
$ Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
$ Field.Location : chr "ARH1" "ARH1" "ARH1" "ARH1" ...
$ Replicate : int 2 2 2 2 1 2 1 1 2 2 ...
$ Block : int 2 2 2 2 1 2 1 1 2 2 ...
$ Plot : int 2 53 123 137 172 217 104 179 170 174 ...
$ Pollen.DAP..days. : int 72 69 72 70 67 69 62 67 67 67 ...
$ Silk.DAP..days. : int 73 69 75 72 69 70 63 69 69 69 ...
$ Plant.Height..cm. : num 188 205 170 200 175 145 244 183 177 146 ...
$ Ear.Height..cm. : num 107 122 100 97 125 84 124 114 109 96 ...
$ Grain.Yield..bu.A.: num 111.8 124.1 84.6 130.4 119.3 ...
Pollen DAP data
library(lme4)
## Model variance component analysis
pollen_varcomp = lmer(POLLEN ~ (1|LINE) + (1|LOC) + (1|YEAR) + (1|LINE:LOC) + (1|LINE:YEAR))
summary(pollen_varcomp)
## formula calucalting heritability on mean basis
# H2 = var(LINE)/[var(LINE) + var(LINE:LOC)/N] + var(LINE:YEAR)/N + var(RESIDUAL)/N]
H2_pollen = 10.47035 /(10.47035 + 0.05196/34 + 0.13156/2 + 20.16058/36)
cat("Heritabilty of Pollen DAP days is", H2_pollen, "\n")
Linear mixed model fit by REML ['lmerMod']
Formula: POLLEN ~ (1 | LINE) + (1 | LOC) + (1 | YEAR) + (1 | LINE:LOC) +
(1 | LINE:YEAR)
REML criterion at convergence: 151506.8
Scaled residuals:
Min 1Q Median 3Q Max
-5.1637 -0.4649 0.0036 0.4833 7.4742
Random effects:
Groups Name Variance Std.Dev.
LINE:LOC (Intercept) 0.05186 0.2277
LINE:YEAR (Intercept) 0.13160 0.3628
LINE (Intercept) 10.47045 3.2358
LOC (Intercept) 78.26891 8.8470
YEAR (Intercept) 0.02846 0.1687
Residual 20.16066 4.4901
Number of obs: 25515, groups:
LINE:LOC, 9242; LINE:YEAR, 1305; LINE, 815; LOC, 30; YEAR, 2
Fixed effects:
Estimate Std. Error t value
(Intercept) 67.377 1.624 41.48
Silk DAP data
silk_varcomp = lmer(SILK ~ (1|LINE) + (1|LOC) + (1|YEAR) + (1|LINE:LOC) + (1|LINE:YEAR))
summary(silk_varcomp)
# formula calucalting heritability on mean basis
# H2 = var(LINE)/[var(LINE) + var(LINE:LOC)/N] + var(LINE:YEAR)/N + var(RESIDUAL)/N]
H2_silk = 12.2271 /(12.2271 + 0.0155/34 + 0.2316/2 + 20.7834/36)
cat("Heritabilty of Silk DAP days is", H2_silk, "\n")
Plant height data
plantHT_varcomp = lmer(PLNT_HT ~ (1|LINE) + (1|LOC) + (1|YEAR) + (1|LINE:LOC) + (1|LINE:YEAR))
summary(plantHT_varcomp)
# formula calucalting heritability on mean basis
# H2 = var(LINE)/[var(LINE) + var(LINE:LOC)/N] + var(LINE:YEAR)/N + var(RESIDUAL)/N]
H2_plHT = 287.302 /(287.302 + 2.628/34 + 2.589/2 + 604.888/36)
cat("Heritabilty of ear height data is", H2_plHT, "\n")
Ear height data
earHT_varcomp = lmer(EAR_HT ~ (1|LINE) + (1|LOC) + (1|YEAR) + (1|LINE:LOC) + (1|LINE:YEAR))
summary(earHT_varcomp)
## formula calucalting heritability on mean basis
# H2 = var(LINE)/[var(LINE) + var(LINE:LOC)/N] + var(LINE:YEAR)/N + var(RESIDUAL)/N]
H2_earHT = 166.764 /(166.764 + 4.028/34 + 2.589/2 + 257.965/36)
cat("Heritabilty of ear height data is", H2_earHT, "\n")
Grain yield data
Yield_varcomp = lmer(GRAIN_YLD ~ (1|LINE) + (1|LOC) + (1|YEAR) + (1|LINE:LOC) + (1|LINE:YEAR))
summary(Yield_varcomp)
# formula calucalting heritability on mean basis
# H2 = var(LINE)/[var(LINE) + var(LINE:LOC)/N] + var(LINE:YEAR)/N + var(RESIDUAL)/N]
H2_yield = 187.64 /(187.64 + (92.85/34) + (54.95/2) + (1000.21/36))
cat("Heritabilty of ear height data is", H2_yield, "\n")
Summary table of the heritablity of each trait
Trait | H2 |
---|---|
Pollen DAP | 0.94 |
Silk DAP | 0.95 |
Plant height | 0.94 |
Ear height | 0.95 |
Grain yield | 0.76 |
All traits have high broad-sense heritabilty except grain yield, which implies that the grain yield is significantly influenced by environmental and other unknown factors, and using only genetic markers for GS could result spurious results.
Estimating Best Linear UnBiased Predictors (BLUP) for each Line by trait
BLUPs shrink the estimates toward the mean and allow us to account for environmental, year-to-year variance factors in the model as well as the missing data.
Pollen DAP days
# estimate BLUPS
pollen_blup = ranef(pollen_varcomp)
# look at output structure
str(pollen_blup)
# extract blup for line
pollen_lineblup = pollen_blup$LINE
# see the structure of the blup for each line
str(pollen_lineblup)
# save the brixlineblup output to a separate .csv file
write.csv(pollen_lineblup, file="Pollen_LineBLUPS.csv")
LINEBLUP_pollen = pollen_lineblup[,1]
Silk DAP days
# estimate BLUPS
silk_blup = ranef(silk_varcomp)
# look at output structure
str(silk_blup)
# extract blup for line
silk_lineblup = silk_blup$LINE
# see the structure of the blup for each line
str(silk_lineblup)
# save the brixlineblup output to a separate .csv file
write.csv(silk_lineblup, file="silkDAP_LineBLUPS.csv")
LINEBLUP_silk = silk_lineblup[,1]
Plant height
# estimate BLUPS
plntHT_blup = ranef(plantHT_varcomp)
# look at output structure
str(plntHT_blup)
# extract blup for line
plntHT_lineblup = plntHT_blup$LINE
# see the structure of the blup for each line
str(plntHT_lineblup)
# save the brixlineblup output to a separate .csv file
write.csv(plntHT_lineblup, file="plntHT_LineBLUPS.csv")
LINEBLUP_plntHT = plntHT_lineblup[,1]
Ear height
# estimate BLUPS
earHT_blup = ranef(earHT_varcomp)
# look at output structure
str(earHT_blup)
# extract blup for line
earHT_lineblup = earHT_blup$LINE
# see the structure of the blup for each line
str(earHT_lineblup)
# save the brixlineblup output to a separate .csv file
write.csv(earHT_lineblup, file="earHT_LineBLUPS.csv")
LINEBLUP_earHT = earHT_lineblup[,1]
Grain Yield
# estimate BLUPS
yield_blup = ranef(Yield_varcomp)
# look at output structure
str(yield_blup)
# extract blup for line
yield_lineblup = yield_blup$LINE
# see the structure of the blup for each line
str(yield_lineblup)
# save the brixlineblup output to a separate .csv file
write.csv(yield_lineblup, file="yield_lineblup.csv")
LINEBLUP_yield = yield_lineblup[,1]
Histograms of the BLUPs for each trait
par(mfrow=c(2,3))
hist(LINEBLUP_pollen, col="grey", main = "Pollen DAP BLUPs")
hist(LINEBLUP_silk, col="grey", main = "Silk DAP BLUPs")
hist(LINEBLUP_plntHT, col="grey" , main = "Plant height BLUPs")
hist(LINEBLUP_earHT, col="grey", main = "Ear height BLUPs")
hist(LINEBLUP_yield, col="grey", main = "Yield BLUPs")
par(mfrow=c(1,1))
2.0 Genotypic data
The imputed genotpe data was downloaded from G2F_Planting_Season_2017_v1
, with filename: g2f_2017_ZeaGBSv27_Imputed_AGPv4.h5
and analyzed on Cornell University BioHPC Cluster in Linux OS using command line version of TASSEL v5
.
The genotype summary of the raw genetic data set was calculated in TASSEL command line.
Plot Minor and Major Allele Frequency raw marker data:
rawMarkerSum <- read.table("raw_geno_summary3.txt", header = T)
df_rawMF <- data.frame(MinorAlleleF=rawMarkerSum$Minor_Allele_Frequency,
MajorAlleleF=rawMarkerSum$Major_Allele_Frequency)
library(ggplot2);library(reshape2)
rawdata<- melt(df_rawMF)
ggplot(rawdata, aes(x=value, fill=variable)) + geom_histogram(bins = 100) + ggtitle("Raw genotype allele freq.")
Minor allele frequency in the raw data is extremely high (<5e+05), indicating presence of rare variants in disproportionate number, and therefore needs to be filtered prior to any downstream analysis to avoid any bias in the result.
Filtering genotype calls of MAF and minimum site count
In TASSEL, SNP calls were filtered stringently for MAF threshold of 0.1 and minimum count of 1000 to remove extremely rare variants and the number of markers with high missing data. Similarly, taxa with greater 80% missing genotype calls were also filtered using the command line below.
$ ./tassel-5-standalone/run_pipeline.pl -log log.txt -Xms2g -Xmx10g -fork1 -h5 genotype_imputed_2017.h5 -filterAlign -filterAlignMinFreq--exit-with-session 0.05 -filterAlignMinCount 200 -FilterTaxaPropertiesPlugin -minNotMissing 0.2 -endPlugin -export -w
genotype_imputed_2017_filtered -exportType VCF -runfork1
Next, genotype summary of the filtered file was generated in TASSEL using -GenotypeSummaryPlugin and using the below command line:
$ ./tassel-5-standalone/run_pipeline.pl -Xmx20g -importGuess genotype_imputed_2017_filteredMAF01min1000.vcf -GenotypeSummaryPlugin -endPlugin -export GenoSummary_filtered
Summary of the 2017 genotypic data after filteration of markers and taxa
Stat Type | Value |
---|---|
Number of Taxa | 1512 |
Number of Sites | 232681 |
Sites x Taxa | 3.5181E8 |
Number Not Missing | 3.1629E8 |
Proportion Not Missing | 0.89901 |
Number Missing | 3.5528E7 |
Proportion Missing | 0.10099 |
Number Gametes | 7.0363E8 |
Gametes Not Missing | 6.3257E8 |
Proportion Gametes Not Missing | 0.89901 |
Gametes Missing | 7.1057E7 |
Proportion Gametes Missing | 0.10099 |
Number Heterozygous | 8.114E6 |
Proportion Heterozygous | 0.02306 |
Average Minor Allele Frequency | 0.28261 |
Even after stringent filtering, there are about 10% missing data in filtered data set. But before that, the number of filtered markers or predictors are on higher end which could possibly be in high linkage disequilibrium (LD)
with one another and not providing any additional importance in the analysis but adding computational burdern. Therefore, this marker set was further thinned by its physical position and imputed.
Thinning of markers by position
232,681 markers were obtained after post-filtering the raw marker set. To reduce the computational workload and redundant markers in strong LD, filtered marker set were thinned by their physical position using _ThinSitesByPositionPlugin with minimum physical distance between markers of 10,000 bp.
$ ./tassel-5-standalone/run_pipeline.pl -log log_thinMarkers.txt -Xmx20g -importGuess genotype_imputed_2017_filteredMAF01min1000.vcf -ThinSitesByPositionPlugin -o thinned10k_geno_2017.vcf -minDist 10000 -endPlugin
Summary of the markers after thinning them:
Stat Type | Value |
---|---|
Number of Taxa | 1512 |
Number of Sites | 28595 |
Sites x Taxa | 4.3236E7 |
Number Not Missing | 3.8758E7 |
Proportion Not Missing | 0.89643 |
Number Missing | 4.4779E6 |
Proportion Missing | 0.10357 |
Number Gametes | 8.6471E7 |
Gametes Not Missing | 7.7515E7 |
Proportion Gametes Not Missing | 0.89643 |
Gametes Missing | 8.9558E6 |
Proportion Gametes Missing | 0.10357 |
Number Heterozygous | 1.0066E6 |
Proportion Heterozygous | 0.02328 |
Average Minor Allele Frequency | 0.27872 |
Proportion of missing genotype calls is same after thinning.
Plot MAF of thinned marker set:
thinSiteSum <- read.table("thinnedGeno_summary3.txt", header = T)
x <- data.frame(MinorAlleleF=thinSiteSum$Minor_Allele_Frequency,
MajorAlleleF=thinSiteSum$Major_Allele_Frequency)
library(ggplot2);library(reshape2)
data<- melt(x)
ggplot(data,aes(x=value, fill=variable)) + geom_histogram(alpha=0.5, bins = 100) + ggtitle("Thinned genotype allele freq.")
MAF post-filtering and thinning has removed most of the rare variants in the data set.
Plot Multidimenisonal Scaling (MDS) to explore genetic structure of the genotypes
MDS with elucidean distance was calcuated using post-filtered and thinned marker data to detect the underlying dimensions of the observed genetic distance i.e Identity-By-Descent (IBS)
among the lines in the data set.
mds <- read.table("MDS_thinned10k_geno.txt", header = T)
head(mds)
ggplot(mds, aes(x=PC1, y=PC2, color = PC1))+
geom_point() +
ggtitle("MDS plot")
Converting thinned genotype data into numerical format and imputing for Machine Learning
28,595 markers post-filteration were converetd into numeric format using -NumericalGenotypePlugin in TASSEL (see log_numeric.txt file for details), where, for each marker homozygous major is 1.0, homozygous minor is 0.0, and heterozygous is 0.5 that will be used as predictors in the genomic selection analysis.
command line:
$ ./tassel-5-standalone/run_pipeline.pl -Xmx20g -log 10g_numeric.txt -importGuess thinned10k_geno_2017.vcf -NumericalGenotypePlugin -endPlugin -export Numeric_thinned_2017_geno -exportType ReferenceProbability
Running Numerical Imputation by means from TASSEL command line:
The numeric imputation was performed by computing the mean of the respective marker columns.
$ ./tassel-5-standalone/run_pipeline.pl -Xms10g -Xmx30g -log log_numericImpute.txt -importGuess Numeric_thinned_2017_geno.txt -ImputationPlugin -ByMean -endPlugin -export Numeric_thinned_2017_geno_Imputed -exportType ReferenceProbability
Intersecting genotype and phenotype data
pheno_BlUPs_noMissing <- read.table("pheno_BLUPs_allTraits_wGBStaxaNames_noMissingData.txt", header = T, na.strings = "NA")
geno_numericThinImpu <- read.table("Numeric_thinned_2017_geno_Imputed.txt", header = T)
## as data frames
pheno_BlUPs_noMissing <- as.data.frame(pheno_BlUPs_noMissing)
geno_numericThinImpu <- as.data.frame(geno_numericThinImpu)
head(pheno_BlUPs_noMissing)
#head(geno_numericThinImpu)
The phenotype and numerical genotype data were intersected by the Taxa names in each data set, and any Taxa with missing data were removed from the GS analysis.
## return all rows from x where there are matching values in y, and all columns from x and y
intrsct_phenoGeno <- inner_join(pheno_BlUPs_noMissing, geno_numericThinImpu)
head(intrsct_phenoGeno)
Output:
## Taxa earBLUPs PlantHTBLUPs yield_BLUPs PollenDAP_BLUPs
## 1 PI601318:250033667 -29.39624 -23.67643 1.503655 -5.0852823
## 2 PI601170:250032540 -27.66371 -37.79167 1.988569 -5.2087454
## 3 PHJ89:100000663 -26.73055 -17.46497 6.775845 -5.4622321
## 4 S8324:100000684 -26.71990 -30.55353 -23.851084 -9.3000384
## 5 CG60:100001134 -26.41626 -40.89895 -16.331693 -6.9434702
## 6 PI601575:250035073 -26.06640 -31.77753 8.341925 -0.3779614
## silkDAP_BLUPs
## 1 -5.1033176
## 2 -5.3709531
## 3 -5.5588561
## 4 -9.2906753
## 5 -7.8424272
## 6 -0.2195389
3.0 Genomic selection algorithms
In the GS analysis, BLUPs only two traits – Pollen DAP (high H2) and Grain yield (low H2) were evaluated, to evaluated the performance of GS alogrithms (Random Forest and Glmnet) on traits with significantly different heritability.
libraries and set seed
library(randomForest)
library(mlbench)
library(caret)
library(e1071)
set.seed(07232020)
Randomizing and Partitioning the dataset to build training and testing sets
The imported data was split into 70/30 split, since this dataset gives a larger and more reliable test set.
Randomizing the order of the dataset
rows <- sample(nrow(intrsct_phenoGeno))
intrsct_phenoGeno <- intrsct_phenoGeno[rows, ]
Find the rows to split
split <- round(nrow(intrsct_phenoGeno) * 0.70)
train_set <- intrsct_phenoGeno[1:split,]
test_set <- intrsct_phenoGeno[(split + 1):nrow(intrsct_phenoGeno),]
Confirm train and test set sizes
# Check the ratios of training and test sets
nrow(train_set)/nrow(intrsct_phenoGeno)
nrow(test_set)/nrow(intrsct_phenoGeno)
# plot the two data sets
par(mfrow=c(2,2))
hist(train_set$PollenDAP_BLUPs, main = "Train set - Pollen DAP", xlab = "Pollen DAP BLUPs")
hist(test_set$PollenDAP_BLUPs, main = "Test set - Pollen DAP", xlab = "Pollen DAP BLUPs")
hist(train_set$yield_BLUPs, main = "Train set - Grain Yield", xlab = "Grain Yield BLUPs")
hist(test_set$yield_BLUPs, main = "Test set - Grain Yield", xlab = "Grain Yield BLUPs")
par(mfrow=c(1,1))
As needed, the distribution of training and testing set is about the same, which is import to prevent any bias in the result.
Pollen DAP BLUPs
glmnet model
It is a extension of glm
models with a built-in variable selection that also help in dealing with collinearity
and small sizes. The glmnet
has two primary froms: 1) LASSO
regression, which penalizes number of non-zero coefficients, and 2) Ridge
regression, which penalizes absolute magnitude of coefficients. It also attempts to find a parsimonious aka simple model and pairs well with random forest models.
trainControl
Before creating and running the models, its a good idea to create a trainControl
object to tuning parameters and further control how models are created.
myControl <- trainControl(
method = "cv",
number =10,
summaryFunction = twoClassSummary,
classProbs = T,
verboseIter = T
)
fitting the glmnet model:
First, we create the glmnet models, which is a combination of lasso and ridge regression, and we can fit the mix of the two models by selecting the values for alpha
and lambda
.
glmnetFit_pollen = train(x = train_set[7:28601],
y = train_set$PollenDAP_BLUPs,
method = "glmnet",
metric = "RMSE",
trainControl = myControl,
tuneGrid = expand.grid(
alpha=0:1,
lambda= 0:10 / 10
)
)
print(glmnetFit_pollen)
From the above output of the model, we see that alpha = 0, indicating ridge regression was performed in the final model.
Comparing models: Ridge vs LASSO
plot(glmnetFit_pollen, main = "GlmnetFit Pollen DAP")
Full regularization path
plot(glmnetFit_pollen$finalModel)
Plot top 20 important variables for Pollen DAP
plot(varImp(glmnetFit_pollen), top = 20)
Marker S7_161300485
on chromosome 7 was calculated to be the most important variable by glmnet model for Pollen DAP, which coinsides with one of the major QTL for flowering time in NAM population, that is responsible to vegtative growth and transition from vegitative to reproductive stage.
Validation of glmnet model
Now we can test the robustness of the glmnet model by testing it on test_set
data set, and regressing the predicted and measured values.
prediction_glmnet_pollen <- predict(glmnetFit_pollen, test_set)
df_pollen_mes_glmnet <- data.frame(test_set$PollenDAP_BLUPs)
df_pollen_pred_glmnet <- data.frame(prediction_glmnet_pollen)
library(ggplot2)
library(reshape2)
pollen_glmnet_bind <- cbind(df_pollen_mes_glmnet, df_pollen_pred_glmnet)
ggplot(pollen_glmnet_bind, aes(x=test_set.PollenDAP_BLUPs, y=prediction_glmnet_pollen)) +
geom_point(color="red") + ggtitle("Test set - GLMnet Pollen DAP") +
geom_smooth(method="lm", se=TRUE, fullrange=FALSE, level=0.95)
In the above plot, we see that glmnet model pollen predicting the phenotype considerably well.
The RMSE
and R-Sq
between the predicted and measured values are below:
error_glm_pollen <- prediction_glmnet_pollen - test_set$PollenDAP_BLUPs
rmse_glm_pollen <- sqrt(mean(error_glm_pollen^2))
cat("RMSE of glmnet model for Pollen DAP is", rmse_glm_pollen)
R-squared:
# creat Rsq function
rsq <- function(x, y) summary(lm(y~x))$r.squared
Rsq_Pollen_glmnet <- rsq(test_set$PollenDAP_BLUPs, prediction_glmnet_pollen)
cat("R-squared of glmnet model for Pollen DAP is", Rsq_Pollen_glmnet)
Random forest - rf function
This alogrithm is often acurrate than other algorithms, easier to tune, require less little preprocessing, and capture threshold effects and variable interactions, however, it is less interpretable and often quite slow.
Random forest model
Markers in the column from 2:1942 were used as predictors (x
) and the phenotype
column as the y
variable. Similarly, the performance of the model was evaluated using root mean square error (RMSE
) (One can choose other method such as ROC
, AUC
etc as well), and finally choosing random forest rf
as the method in the model.
# Create tunegrid for mtry to tune the model.
tunegrid <- data.frame(.mtry=c(2,4,6, 10, 15, 20, 100))
randomForestFit_pollen = train(x = train_set[7:28601],
y = train_set$PollenDAP_BLUPs,
method = "rf",
metric = "RMSE",
tuneGrid = tunegrid
)
# print the model output
randomForestFit_pollen
Plotting the randomForestFit
for Pollen BLUPs model.
plot(randomForestFit_pollen, main = "Pollen DAP - Random Forest")
The best mtry with lowest RMSE for this model was 6.
Next, plotting the important predictors based on their calculated importance scores using varImp
function.
plot(varImp(randomForestFit_pollen), top = 20)
After building the supevised random forest learning model, The top 20 predictors can be ranked by their importance, and from the above plot, we tell that the marker S2_53055530, along with other 6 markers contributing in expalining the variance of the phenotype.
Validation of Random Forest model
Now we can test the robustness of the model by testing it on test_set
data set, and regressing the predicted and measured values.
prediction_pollen_rf <- predict(randomForestFit_pollen, test_set)
df_pollen_mes_rf <- data.frame(test_set$PollenDAP_BLUPs)
df_pollen_pred_rf <- data.frame(prediction_pollen_rf)
library(ggplot2);library(reshape2)
pollen_rf_bind <- cbind(df_pollen_mes_rf, df_pollen_pred_rf)
ggplot(pollen_rf_bind, aes(x=test_set.PollenDAP_BLUPs, y=prediction_pollen_rf)) +
geom_point(color="red") + ggtitle("Test set - Random Forest - Pollen DAP") +
geom_smooth(method="lm", se=TRUE, fullrange=FALSE, level=0.95)
And we can also calculate the RMSE
between the predicted and measured values.
error_pollen_rf <- prediction_pollen_rf - test_set$PollenDAP_BLUPs
rmse_pollen_rf <- sqrt(mean(error_pollen_rf^2))
cat("RMSE of Random Forest model for Pollen DAP is", rmse_pollen_rf)
R-squared:
# creat Rsq function
rsq <- function(x, y) summary(lm(y~x))$r.squared
Rsq_Pollen_rf <- rsq(test_set$PollenDAP_BLUPs, prediction_pollen_rf)
cat("R-squared of Random Forest model for Pollen DAP is", Rsq_Pollen_rf)
Comparing the GS models by their type
We can compare the performance of the models by studying their MAE, RMSE and R-squared values side-by-side, making it very convenient.
# Create model_list
model_list_pollen <- list(random_forest = randomForestFit_pollen, glmnet = glmnetFit_pollen)
# Pass model_list to resamples(): resamples
resamples_pollen <- resamples(model_list_pollen)
resamples_pollen
# Summarize the results
summary(resamples_pollen)
plot both GS models by RMSE
Similarly, we can visually inspect the models accuracies.
par(mfrow=c(1,2))
bwplot(resamples_pollen, metric = "RMSE")
dotplot(resamples_pollen, metric = "RMSE")
par(mfrow=c(1,1))
From above table, we can tell that the glmnet model appears to be better model for GS selection of pollen DAP data.
Grain Yield BLUPs
glmnet model
making custom trainControl
myControl <- trainControl(
method = "cv",
number =10,
summaryFunction = twoClassSummary,
classProbs = T,
verboseIter = T
)
fitting the model for grain yield using the glmnet model
glmnetFit_yield = train(x = train_set[7:28601],
y = train_set$yield_BLUPs,
method = "glmnet",
metric = "RMSE",
trainControl = myControl,
tuneGrid = expand.grid(
alpha=0:1,
lambda= 0:10 / 10
)
)
print(glmnetFit_yield)
Comparing models: Ridge vs LASSO
plot(glmnetFit_yield, main = "GlmnetFit Grain Yield")
From the above comparison model, we can tell that the alpha and lamda of 1 has lowest RMSE.
Full regularization path
plot(glmnetFit_yield$finalModel)
Top predictors
plot(varImp(glmnetFit_yield), top = 20)
Markers S3_210537524
was calculated as the most important variable by glmnet model.
Validation of glmnet model
Now we can test the robustness of the glmnet model by testing it on test_set
data set, and regressing the predicted and measured values.
prediction_glmnet_yield <- predict(glmnetFit_yield, test_set)
df_yield_mes_glmnet <- data.frame(test_set$yield_BLUPs)
df_yield_pred_glmnet <- data.frame(prediction_glmnet_yield)
library(ggplot2);library(reshape2)
yield_glmnet_bind <- cbind(df_yield_mes_glmnet, df_yield_pred_glmnet)
ggplot(yield_glmnet_bind, aes(x=test_set.yield_BLUPs, y=prediction_glmnet_yield)) +
geom_point(color="red") + ggtitle("Test set - Glmnet - Grain Yield") +
geom_smooth(method="lm", se=TRUE, fullrange=FALSE, level=0.95)
From the above plot, we see poor correlation between the predicted and observed values for grain yield data set.
RMSE
and R-Sq
between the predicted and measured values.
error_glm_yield <- prediction_glmnet_yield - test_set$yield_BLUPs
rmse_glm_yield <- sqrt(mean(error_glm_yield^2))
cat("RMSE of Glmnet model for Grain yield is", rmse_glm_yield)
R-squared:
# creat Rsq function
rsq <- function(x, y) summary(lm(y~x))$r.squared
Rsq_grain_glmnet <- rsq(test_set$yield_BLUPs, prediction_glmnet_yield)
cat("R-squared of Glmnet model for Grain yield is", Rsq_grain_glmnet)
Random forest model
# Create tunegrid for mtry to tune the model.
tunegrid <- data.frame(.mtry=c(2,4,6, 10, 15, 20, 100))
randomForestFit_yield = train(x = train_set[7:28601],
y = train_set$yield_BLUPs,
method = "rf",
metric = "RMSE",
tuneGrid = tunegrid
)
# print the model output
randomForestFit_yield
Plotting the randomForetFit
model.
plot(randomForestFit_yield, main = "Grain yield - Random Forest")
The best mtry with lowest RMSE for this model was 2.
Next, we can plot the important predictors based on their calculated importance scores using varImp
function.
plot(varImp(randomForestFit_yield), top = 20)
From the above plot, we tell that the S6_153099472
was the most important variable, whch is different from the glmnet model for grain yield.
Validation of Random Forest model
Now we can test the robustness of the model by testing it on test_set
data set, and regressing the predicted and measured values.
prediction_yield_rf <- predict(randomForestFit_yield, test_set)
df_yield_mes_rf <- data.frame(test_set$yield_BLUPs)
df_yield_pred_rf <- data.frame(prediction_yield_rf)
library(ggplot2);library(reshape2)
yield_rf_bind <- cbind(df_yield_mes_rf, df_yield_pred_rf)
ggplot(yield_rf_bind, aes(x=test_set.yield_BLUPs, y=prediction_yield_rf)) +
geom_point(color="red") + ggtitle("Test set - Random Forest - Grain Yield") +
geom_smooth(method="lm", se=TRUE, fullrange=FALSE, level=0.95)
Similar to glmnet model, random forest fail to predict the grain yield.
And we can also calculate the RMSE
between the predicted and measured values.
error_yield_rf <- prediction_yield_rf - test_set$yield_BLUPs
rmse_yield_rf <- sqrt(mean(error_yield_rf^2))
cat("RMSE of Random Forest model for Grain yield is", rmse_yield_rf)
R-squared:
# creat Rsq function
rsq <- function(x, y) summary(lm(y~x))$r.squared
Rsq_grain_rf <- rsq(test_set$yield_BLUPs, prediction_yield_rf)
cat("R-squared of Random Forest model for Grain yield is", Rsq_grain_rf)
Comparing the GS models by their type
We can compare the performance of the models by studying their MAE, RMSE and R-squared values side-by-side, making it very convenient.
# Create model_list
model_list_yield <- list(random_forest = randomForestFit_yield, glmnet = glmnetFit_yield)
# Pass model_list to resamples(): resamples
resamples_yield<- resamples(model_list_yield)
resamples_yield
# Summarize the results
summary(resamples_yield)
plot both GS models by RMSE
Similarly, we can visually inspect the models accuracies.
par(mfrow=c(1,2))
bwplot(resamples_yield, metric = "RMSE")
dotplot(resamples_yield, metric = "RMSE")
par(mfrow=c(1,1))
From above table, we can tell that the random forest model appears to be slighlty better model incoparison to glmnet for grain yield.
4.0 Summary of genomic selection algorithms on selected traits
Trait | Model | Set | RMSE | R-Sq |
---|---|---|---|---|
Pollen DAP | Glmnet | Training set | 2.01 | 0.54 |
Test set | 2.28 | 0.39 | ||
Random forest | Training set | 2.04 | 0.50 | |
Test set | 2.37 | 0.40 | ||
Grain yield | Glmnet | Training set | 10.3 | 0.25 |
Test set | 10.75 | 0.16 | ||
Random forest | Training set | 10.2 | 0.23 | |
Test set | 11.08 | 0.12 |
In summary, Pollen DAP trait had a high heritability, and both GS models had a relatively statisfactory prediction accuracy in both training and test data sets. In contrast, grain yield had moderate heriatblity indicating significant influence of genetic and environmental componets on this trait, and both GS models built with only genomic markers failed, which tells us that heritability of a trait is an important characteristics as so are the environmental components. Therefore, its strongly advised to calculate the heritability of trait of interest prior to building any GS model, so that one has idea of what to expect from the GS model performance.
References
McFarland, B.A., AlKhalifah, N., Bohn, M., Bubert, J., Buckler, E.S., Ciampitti, I., Edwards, J., Ertl, D., Gage, J.L., Falcon, C.M. and Flint-Garcia, S., 2020. Maize genomes to fields (G2F) : 2014–2017 field seasons: genotype, phenotype, climatic, soil, and inbred ear image datasets. BMC Research Notes, 13(1), pp.1-6.