Pruning genetic markers based on their physical distance and linkage disequilibrium (LD)
Very dense marker sets often carry redundant information, because closely spaced markers tend to be in strong LD with one another. Such markers can therefore be pruned based on the physical distance between adjacent markers or on linkage disequilibrium (LD). In this tutorial, I will show how to prune markers based on their physical position in
TASSEL software, and based on LD in PLINK software.
Pruning markers based on the physical distances in TASSEL
For detailed information on how to use
TASSEL software, please consult the user’s guide and further documentation at:
1.1 Download and install TASSEL software
Download and install the latest version of the TASSEL software at this link:
TASSEL accepts various genotype file formats such as HapMap (hmp.txt),
VCF (variant call format), and
PLINK. In this tutorial, I am using the
hmp.txt version of the genotype file. Below is a screenshot of the hmp.txt genotype file.
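For readers who want to inspect the file from the command line, here is a minimal shell sketch that writes a tiny hmp.txt file and summarizes it. The marker names, positions, and sample names (S1, S2) are made up for illustration; the 11 leading metadata columns follow the usual HapMap layout as I understand it.

```shell
# Build a toy HapMap file (tab-delimited); every value here is invented.
tmpdir=$(mktemp -d)
hmp="$tmpdir/toy.hmp.txt"
{
  printf 'rs#\talleles\tchrom\tpos\tstrand\tassembly#\tcenter\tprotLSID\tassayLSID\tpanelLSID\tQCcode\tS1\tS2\n'
  printf 'm1\tA/G\t1\t1000\t+\tNA\tNA\tNA\tNA\tNA\tNA\tAA\tAG\n'
  printf 'm2\tC/T\t1\t25000\t+\tNA\tNA\tNA\tNA\tNA\tNA\tCC\tCT\n'
  printf 'm3\tG/T\t2\t5000\t+\tNA\tNA\tNA\tNA\tNA\tNA\tGG\tTT\n'
} > "$hmp"

# Number of samples = total columns minus the 11 metadata columns.
n_samples=$(awk -F'\t' 'NR == 1 { print NF - 11 }' "$hmp")
# Number of markers = number of lines after the header.
n_markers=$(awk 'NR > 1' "$hmp" | wc -l | tr -d ' ')
echo "samples: $n_samples, markers: $n_markers"
```

Running the same two awk lines on your own hmp.txt file is a quick sanity check before importing it into TASSEL.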
1.2 Importing files
Import the files by following the steps shown below.
Tip! Both files can be opened at the same time by holding
CTRL while clicking the file names.
Command line pruning in TASSEL
Use the sample command line below to thin sites by position:
./run_pipeline.pl -importGuess /Users/lcj34/genotype.hmp.txt -ThinSitesByPositionPlugin -o /Users/lcj34/thin40000.vcf -minDist 40000 -endPlugin
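To make the effect of -minDist more concrete, here is a rough awk sketch of distance-based thinning on a toy list of positions. It assumes the rule is "keep a site only if it lies at least minDist bp after the last kept site on the same chromosome", which is my reading of the plugin and not TASSEL's actual code. The chromosomes and positions are invented, and the input must be sorted by position within each chromosome.

```shell
# Toy input: chromosome and position, sorted by position within chromosome.
tmpdir=$(mktemp -d)
sites="$tmpdir/sites.txt"
printf '1 1000\n1 20000\n1 45000\n1 60000\n1 90000\n2 500\n2 30000\n' > "$sites"

# Keep the first site on each chromosome, then only sites that are at least
# min_dist bp away from the most recently kept site.
kept=$(awk -v min_dist=40000 '
  NR == 1 || $1 != chr || $2 - last >= min_dist {
    print $1 ":" $2
    chr = $1
    last = $2
  }
' "$sites")
echo "$kept"
```

With a minimum distance of 40000, the sites at 20000 and 60000 on chromosome 1 and at 30000 on chromosome 2 are dropped because they sit too close to a site that was already kept.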
Pruning in GUI of TASSEL
Alternatively, in the TASSEL GUI, you can use the “Thin Sites by Position” plugin:
2.0 LD based pruning in PLINK software
1.1 Download and install PLINK software
Download and install the latest version of the PLINK software at this link:
When downloading the software, make sure you choose the correct platform!
2.1 Converting your genotype data into PLINK format
If your genotype data is in a format such as VCF, you will need to convert it to PLINK format with
VCFtools using the command line below:
vcftools --vcf myvcf.vcf --plink --out myplink
Note: Your genotype file has to be in VCF format for this conversion into PLINK format to work.
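Before running the conversion, it can help to confirm that the input really is a VCF file. The sketch below checks whether the first line starts with ##fileformat=VCF, which VCF files are expected to do. The toy file and its single variant are invented for illustration.

```shell
# Write a minimal toy VCF file; the contents are made up.
tmpdir=$(mktemp -d)
vcf="$tmpdir/myvcf.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n'
  printf '1\t1000\tm1\tA\tG\t.\tPASS\t.\tGT\t0/1\n'
} > "$vcf"

# A VCF file should declare its format on the very first line.
if head -n 1 "$vcf" | grep -q '^##fileformat=VCF'; then
  is_vcf=yes
else
  is_vcf=no
fi
echo "looks like VCF: $is_vcf"
```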
2.2 Extracting markers for pruning based on LD in PLINK
PLINK has two options for LD thinning/pruning: based on
variance inflation factor (by regressing a SNP on all other SNPs in the window simultaneously) and based on
pairwise correlation (r²). These are the
--indep and --indep-pairwise options, respectively.
Below is the code:
plink --noweb --file data_in --indep-pairwise 50 5 0.5 --out data_out
The command above, which specifies 50 5 0.5, would
- consider a window of 50 SNPs
- calculate LD between each pair of SNPs in the window
- remove one of a pair of SNPs if the LD is greater than 0.5
- shift the window 5 SNPs forward and repeat the procedure
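To illustrate the pairwise LD step in the list above, here is a small awk sketch that computes r² for one pair of SNPs. It assumes genotypes are coded as 0/1/2 allele counts and that r² is the squared Pearson correlation between the two columns, which is how I understand the pairwise measure for unphased genotypes. The six genotype pairs are toy data, and this is not PLINK's own implementation.

```shell
# Each input line holds the allele counts of one individual at SNP A and SNP B.
r2=$(printf '0 0\n1 1\n2 2\n0 0\n1 2\n2 1\n' | awk '
  { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
  END {
    cov = n * sxy - sx * sy          # n^2 times the covariance
    var_x = n * sxx - sx * sx        # n^2 times the variance of SNP A
    var_y = n * syy - sy * sy        # n^2 times the variance of SNP B
    printf "%.4f", (cov * cov) / (var_x * var_y)
  }')
echo "r2 = $r2"
```

For this toy pair, r² comes out at 0.5625, so with a threshold of 0.5 one of the two SNPs would be removed.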
The command above writes two lists of SNPs: those that are pruned out (data_out.prune.out) and those that are retained (data_out.prune.in). See below:
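A quick way to see how many markers survived is to count the lines in the two lists, since each list holds one SNP ID per line. In this sketch, the two toy files stand in for real PLINK output, and the data_out prefix is assumed to match the --out value used in the pruning command.

```shell
# Toy stand-ins for the two lists written by the pruning step.
tmpdir=$(mktemp -d)
printf 'snp1\nsnp4\nsnp7\n' > "$tmpdir/data_out.prune.in"
printf 'snp2\nsnp3\nsnp5\nsnp6\n' > "$tmpdir/data_out.prune.out"

# One SNP ID per line, so the line count is the number of SNPs.
n_in=$(wc -l < "$tmpdir/data_out.prune.in" | tr -d ' ')
n_out=$(wc -l < "$tmpdir/data_out.prune.out" | tr -d ' ')
echo "kept: $n_in, pruned out: $n_out, total: $((n_in + n_out))"
```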
2.3 Make a new, pruned file
Next, make a new, pruned file using the command below:
plink --file data_in --extract data_out.prune.in --make-bed --out pruneddata
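After this step, you may want to confirm that every SNP in the retained list actually ended up in the new file. The sketch below compares a retained list against the variant IDs in column 2 of a .bim file. Both files are toy stand-ins with invented contents, and the file names are assumed to follow the prefixes used in the commands above.

```shell
# Toy stand-ins for the retained list and the .bim file of the pruned data.
tmpdir=$(mktemp -d)
printf 'snp1\nsnp4\n' > "$tmpdir/data_out.prune.in"
printf '1\tsnp1\t0\t1000\tA\tG\n1\tsnp4\t0\t90000\tC\tT\n' > "$tmpdir/pruneddata.bim"

# Read the .bim IDs first, then count retained IDs that are absent from it.
missing=$(awk 'NR == FNR { seen[$2] = 1; next } !($1 in seen)' \
  "$tmpdir/pruneddata.bim" "$tmpdir/data_out.prune.in" | wc -l | tr -d ' ')
echo "retained SNPs missing from pruned file: $missing"
```

A count of zero means the pruned file contains every SNP from the retained list.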
2.4 Convert the pruned output file into VCF
The pruned file can be converted back to VCF format using the command below:
plink109 --bfile pruneddata --recode vcf --out vcf_pruned
Please remember that this option (--recode vcf) is only available in PLINK 1.9 (v1.90) and later.
Once the final pruned data is converted to a
VCF file, it can be opened in TASSEL software for further analysis.
--- End of Tutorial ---
Thank you for reading this tutorial. If you have any questions or comments, please let me know in the comment section below or send me an email.
Package: PLINK (including version number). Author: Shaun Purcell. URL: http://pngu.mgh.harvard.edu/purcell/plink/
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.