Cluster based methods such as
multidimensional scaling (MDS) and
priniciple component analysis (PCA) are traditionally used in identifying samples with genotypic inconsistencies, however, it is important to identify genotypes with high mendelian inconsistencies prior to any genetic or statistical analysis. In this tutorial, I would like to share how one can quickly check samples with high mendelian inconsistencies using genotypic data and pedigree information in
bcftools is only available for
Linux operating system. For detailed information on how to install it please consult user’s guide and further documentation at below link: https://samtools.github.io/bcftools/howtos/install.html
Two files are required as:
- Genotype file in VCF format
- Pedigree file in text (.txt) version
Pedigree file format
The pedigree file should be formatted as shown below:
mother,father,child1 mother,father,child2 mother,father,child3 mother,father,child4 mother,father,child5 mother,father,child6 mother,father,child7 mother,father,child8 mother,father,child9 mother,father,child10 mother,father,child11 mother,father,child12 mother,father,child13 mother,father,child14 mother,father,child15 mother,father,child16
Important! Make sure the parent names are sperated by comma (,) without any spaces in the text file.
running the bcftools
Use below sample command line to run the program:
$ bcftools +mendelian file.vcf -T pedigree.txt -c 2> output
The out file contains the following headers:
# nOK nBad nSkipped Trio 334518 4621 996 mother,father,child
- nOK .. number of genotypes with no missing sites and no Mendelian error
- nBad .. number of genotypes with a Mendelian error
- nSkipped .. number of genotypes at which at least one individual missing and therefore could not be considered 4
The output file can be opened directly in MS Excel and can be interpreted by plotting it.
In the above plot, it can be noted that there are four samples (in red) that have the highest mendelian error in comparison to the rest of samples, which should be excluded from any further analysis.
--- End of Tutorial ---
Thank you for reading this tutorial. If you have any questions or comments, please let me know in the comment section below or send me an email.
*Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]