- Category: Variant calling
- Coordination: Gwenola Tosser-Klopp
- Project date: 2018 - 2021
- Project URL: www.goatgenome.org
Identification of variations in Goat genomes related to domestication and adaptation
The sigenae team was in charge of all bioinformatics treatments, from the raw data (fastq files) to the final files (VCF).
Read alignment and variant calling
Each sample was processed using a pipeline based on the domain reference tools and the Genome Analysis Toolkit (GATK) best practices: BWA version 0.7.15 for alignment (Li and Durbin 2009), SAMtools version 1.6 for handling SAM/BAM formats and calling variants (Li et al. 2009), Picard tools version 2.1.1 for labelling duplicated reads (http://broadinstitute.github.io/picard), as well as GATK version 3.6 for Insertion/Deletion (INDEL) realignment, base recalibration and calling variants (McKenna et al. 2010), BCFtools version 1.6 for handling VCF/BCF formats, Freebayes version 1.1.0 for calling variants (Garrison and Marth 2012), and snpEff version 4.3t for VCF annotation (Cingolani et al. 2012).
Reads were mapped to the latest ARS1 genome version (Genbank accession GCA_001704415.1) of the C. hircus species (Bickhart et al. 2017) using BWA-MEM software with default parameters except for “-t 14 -M” and “-R” to add read groups. The SAM output files were converted to sorted BAM using SAMtools.
Pre-processing steps (marking duplicates, INDEL-based realignment and base quality score recalibration - BQSR) were done using Picard-MarkDuplicates and GATK. The “known variants” file needed for the BQSR step was computed on a subset of 13 samples (see details in the next paragraph). Variants fulfilling the following conditions were included in the “known variants” file: (1) presenting at least 6 genotypes harbouring an alternative allele ("snpSift filter 'countVariant()>6'") (2) being called by both Freebayes and GATK-HaplotypeCaller.
Variant calling per sample was done with GATK-HaplotypeCaller in ERC mode with a minimum read mapping quality of 30 (this is required to consider a read as valid) and a minimum phred-scaled confidence threshold of 30 (“-stand_call_conf 30.0 -mmq 30 -ERC GVCF -variant_index_type LINEAR - variant_index_parameter 128000”).Due to the large number of samples, GVCF files were combined (CombineGVCFs) before the joint genotyping step (GenotypeGVCFs) to produce the raw VCF files by chromosome/scaffold.
A Variant Quality Score Recalibration (VQSR) step was performed on the raw VCF files.
In order to set up training resource sets for VQSR calibration, 13 goats, representing 13 breeds were chosen out of the 248 animals sequenced in the first batch. They belong to 11 out of the 15 gene pools determined by Colli et al. (Colli et al. 2018) with the software Admixture (Alexander et al. 2009). We added one individual from an inbred breed (Palmera) and an animal of the Creole breed (different from the Creole individuals genotyped in the ADAPTmap dataset), as a representative of the American gene pool.
Two true sites training resources were built:
- The first one, corresponding to the highest quality calls, used the variants consistently identified with GATK, Mpileup and Freebayes (“known=false, training=true, truth=true, prior=15.0”).
- The second one used the 60,000 SNPs selected by Tosser-Klopp et al. (2014) to generate the set of polymorphisms included in the Goat SNP50 BeadChip (Illumina) (“known=false, training=true, truth=true, prior=12.0”).
The non-true sites training resource was built using the variants exclusively called by GATK (“known=false, training=true, truth=false, prior=10.0”).
We only retained biallelic SNPs with a GATK quality score (QUAL) over 100 and with at least two individuals carrying the alternative allele.
Filtering resulted in a high confidence set of:
- 74,274,427 SNPs
- 13,607,850 INDELs