# Pruning genetic markers based on their physical distance and linkage disequilibrium (LD)

Markers at very high density add little independent information, because closely spaced markers are often in strong linkage disequilibrium (LD) and are therefore largely redundant. Such markers can be pruned based on the physical distance between adjacent markers or on LD. This tutorial shows how to prune markers by physical position in TASSEL and by LD in PLINK.

# Pruning markers based on the physical distances in TASSEL

For detailed information on how to use TASSEL, please consult the user’s guide and further documentation at: https://www.maizegenetics.net/tassel

Download and install the latest version of the TASSEL software at this link: https://www.maizegenetics.net/tassel

## Genotype file

TASSEL accepts several genotype file formats, including VCF (variant call format), HapMap (.hmp.txt), and PLINK. This tutorial uses the .hmp.txt version of the genotype file. The screenshot below shows the hmp.txt genotype file.
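As a text-only complement to the screenshot, the layout of a HapMap file can be sketched in a few lines of Python. This is a minimal illustration rather than TASSEL's own parser: the two-marker excerpt, the sample names, and the `read_hmp` helper are all hypothetical, and the sketch assumes the standard layout of 11 metadata columns (`rs#` through `QCcode`) followed by one column per sample.

```python
import csv
import io

# Hypothetical two-marker, two-sample HapMap excerpt (tab-separated).
HMP_TEXT = (
    "rs#\talleles\tchrom\tpos\tstrand\tassembly#\tcenter\tprotLSID\t"
    "assayLSID\tpanelLSID\tQCcode\tSample1\tSample2\n"
    "snp1\tA/G\t1\t1000\t+\tNA\tNA\tNA\tNA\tNA\tNA\tA\tG\n"
    "snp2\tC/T\t1\t52000\t+\tNA\tNA\tNA\tNA\tNA\tNA\tC\tY\n"
)

N_META = 11  # fixed metadata columns that precede the sample columns


def read_hmp(handle):
    """Return (sample_names, markers) from a tab-separated hmp.txt handle."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[N_META:]
    markers = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        markers.append({
            "id": row[0],            # rs# column
            "chrom": row[2],         # chromosome name
            "pos": int(row[3]),      # physical position in bp
            "calls": row[N_META:],   # one genotype call per sample
        })
    return samples, markers


samples, markers = read_hmp(io.StringIO(HMP_TEXT))
print(samples)
print([(m["id"], m["chrom"], m["pos"]) for m in markers])
```

The `chrom` and `pos` columns are the ones that matter for distance-based pruning.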

## Step 1.2: Importing files

Import the files by following the steps shown below. Tip: both files can be opened at the same time by holding CTRL while clicking the file names.

## Command line pruning in TASSEL

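Use the sample command line below to thin sites by position. The -minDist option sets the minimum distance, in base pairs, between adjacent retained sites (40,000 bp here):

./run_pipeline.pl -importGuess /Users/lcj34/genotype.hmp.txt -ThinSitesByPositionPlugin -o /Users/lcj34/thin40000.vcf -minDist 40000 -endPlugin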
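To make the effect of a minimum distance concrete, here is a minimal Python sketch of distance-based thinning. It is an assumption about the general idea, not TASSEL's actual implementation: the `thin_by_position` function and the example sites are hypothetical, the input is assumed to be sorted by position within each chromosome, and the rule used (keep a site if it lies at least `min_dist` bp beyond the last kept site on the same chromosome) may differ in detail from what the plugin does.

```python
def thin_by_position(sites, min_dist):
    """Greedily thin (chrom, pos) pairs so kept sites are >= min_dist bp apart.

    Assumes sites are sorted by position within each chromosome.
    """
    last_kept = {}  # chromosome -> position of the most recently kept site
    kept = []
    for chrom, pos in sites:
        prev = last_kept.get(chrom)
        # Keep the first site on each chromosome, then any site far enough
        # from the last one that was kept.
        if prev is None or pos - prev >= min_dist:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept


# Hypothetical marker positions on two chromosomes.
sites = [
    ("1", 1000), ("1", 20000), ("1", 45000), ("1", 80000),
    ("2", 500), ("2", 30000),
]
kept = thin_by_position(sites, 40000)
print(kept)
```

Note that the distance is measured from the last kept site, not from the previous site in the file, so the site at 80,000 bp is dropped because it is only 35,000 bp from the kept site at 45,000 bp.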


## Pruning in GUI of TASSEL

Alternatively, in the TASSEL GUI, you can use the “Thin Sites by Position” plugin.

# 2.0 LD-based pruning in PLINK

Download and install the latest version of the PLINK software at this link: http://zzz.bwh.harvard.edu/plink/download.shtml

If your genotype data are in VCF format, you will first need to convert them to PLINK format with VCFtools, using the command below:

vcftools --vcf myvcf.vcf --plink --out myplink


Note: the input genotype file must be in VCF format for this conversion to PLINK format to work.

## 2.2 Extracting markers for pruning based on LD in PLINK

PLINK has two options for LD-based pruning. The --indep option uses the variance inflation factor (VIF), obtained by regressing each SNP on all other SNPs in the window simultaneously. The --indep-pairwise option uses the pairwise correlation (r²) between SNPs. Because VIF = 1/(1 − R²), where R² is the multiple correlation coefficient from that regression, a VIF threshold of 2 corresponds to R² = 0.5. The pairwise option is used here:

plink --noweb --file data_in --indep-pairwise 50 5 0.5 --out data_out


The parameters 50 5 0.5 in the command above tell PLINK to

• consider a window of 50 SNPs
• calculate LD (r²) between each pair of SNPs in the window
• remove one SNP of a pair if their r² is greater than 0.5
• shift the window 5 SNPs forward and repeat the procedure
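The window procedure listed above can be sketched in Python. This is a simplified illustration under stated assumptions, not PLINK's actual algorithm: the `r_squared` and `prune_pairwise` functions and the toy genotypes are hypothetical, genotypes are assumed to be coded as 0/1/2 allele counts with no missing data, and the sketch always drops the later SNP of a correlated pair, whereas PLINK applies its own rule for choosing which SNP to remove.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: correlation undefined, treat as unlinked
    return sxy * sxy / (sxx * syy)


def prune_pairwise(genotypes, window, step, threshold):
    """Return (kept_ids, pruned_ids) after sliding-window pairwise pruning.

    genotypes maps SNP id -> list of 0/1/2 allele counts, in map order.
    """
    ids = list(genotypes)
    removed = set()
    start = 0
    while start < len(ids):
        # SNPs in the current window that have not been pruned yet.
        in_window = [s for s in ids[start:start + window] if s not in removed]
        for i, a in enumerate(in_window):
            if a in removed:
                continue
            for b in in_window[i + 1:]:
                if b in removed:
                    continue
                if r_squared(genotypes[a], genotypes[b]) > threshold:
                    removed.add(b)  # simplification: always drop the later SNP
        start += step  # shift the window forward
    kept = [s for s in ids if s not in removed]
    pruned = [s for s in ids if s in removed]
    return kept, pruned


# Hypothetical genotypes for four SNPs in six individuals.
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # identical to snp1, so r² = 1
    "snp3": [2, 0, 1, 1, 2, 0],  # weakly correlated with snp1, r² = 0.25
    "snp4": [0, 1, 2, 0, 1, 1],  # strongly correlated with snp1
}
kept, pruned = prune_pairwise(genotypes, window=50, step=5, threshold=0.5)
print("kept:", kept)
print("pruned:", pruned)
```

With a threshold of 0.5, snp2 and snp4 are pruned because each is highly correlated with snp1, while snp3 is retained.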

The command above writes two lists of SNP IDs, each named after the --out prefix:

• data_out.prune.in (SNPs that are retained)
• data_out.prune.out (SNPs that are pruned out)
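Before building the pruned data set, it can be useful to check how many markers ended up in each of the two lists. The Python sketch below is one way to do that, assuming each list is a plain text file with one SNP ID per line. The file names, the marker IDs, and the `read_snp_list` helper are hypothetical stand-ins written to a temporary directory, so substitute the paths of your own PLINK output.

```python
import tempfile
from pathlib import Path


def read_snp_list(path):
    """Read a SNP list with one marker ID per line, ignoring blank lines."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical stand-ins for the two lists written by PLINK.
workdir = Path(tempfile.mkdtemp())
(workdir / "example.prune.in").write_text("snp1\nsnp3\nsnp7\n")
(workdir / "example.prune.out").write_text("snp2\nsnp4\n")

kept = read_snp_list(workdir / "example.prune.in")
pruned = read_snp_list(workdir / "example.prune.out")

# A marker should never appear in both lists.
overlap = set(kept) & set(pruned)
total = len(kept) + len(pruned)
print(f"kept {len(kept)} of {total} markers; overlap: {sorted(overlap)}")
```

If very few markers are kept, a higher r² threshold or a smaller window may be worth trying.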


## 2.3 Make a new, pruned file

Next, make a new, pruned file using the command below:

plink --file data_in --extract data_out.prune.in --make-bed --out pruneddata


## 2.4 Convert the pruned output file into VCF

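The pruned file can be converted back to VCF format using the command below. The --recode vcf option requires PLINK 1.9 or later; plink109 appears to be the name of the PLINK 1.9 executable on the author's system, so substitute the name used on yours: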

plink109 --bfile pruneddata --recode vcf --out vcf_pruned


Once the final pruned data are converted to a VCF file, they can be opened in TASSEL for further analysis.