This README.txt file was generated on 2023-11-19 by Chris Toomajian

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Variant call format (VCF) file corresponding to called single nucleotide polymorphisms (SNPs) from a set of 454 clone-corrected isolates of Fusarium graminearum (Dhakal et al., submitted).

2. Author Information

   First Author Contact Information
      Name: Upasana Dhakal
      Institution: Kansas State University
      Email: upasanad@ksu.edu

   Corresponding Author Contact Information
      Name: Christopher Toomajian
      Institution: Kansas State University
      Address: 4024 Throckmorton, Department of Plant Pathology
      Email: toomajia@ksu.edu

   Author Contact Information
      Name: John F. Leslie
      Institution: Kansas State University
      Email: jfl@ksu.edu

   Author Contact Information
      Name: Wei Yue
      Email: weiyuebio@gmail.com

---------------------
DATA & FILE OVERVIEW
---------------------

Directory of Files

   A. Filename: selected.vqsr.454.biallelic.vcf

      Short description: VCF file (fileformat=VCFv4.2, readable without specialized software) with genotyping-by-sequencing SNPs from a set of 454 clone-corrected isolates of Fusarium graminearum, from the US and Uruguay. Produced by GATK software v. 4.1.8.1; only biallelic SNPs were retained. The SNPs in this file segregated in a larger sample of over 500 isolates (some from closely related species); those that remain in this file were selected after variant quality score recalibration (VQSR).

-----------------------------------------
DATA DESCRIPTION FOR: selected.vqsr.454.biallelic.vcf
-----------------------------------------

1. Number of variables: see below

2. Number of cases/rows: 90059 rows in the file, of which 90011 are entries for SNPs (though some are not polymorphic in this subsample of 454). The remaining 48 rows are part of the metadata section, which describes the GATK software parameters and filtering.

3. Missing data codes: Please consult other sources for all abbreviations and conventions used in VCF files.
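
As one illustration of those conventions: in VCF text, lines beginning with "#" are header lines, and a missing genotype is written with "." in place of each allele (for example "." or "./."). The Python sketch below is not part of the original analysis. It only shows one way to count header rows, variant rows, and missing genotype calls. The function name summarize_vcf and the DEMO records are hypothetical and do not come from the dataset.

```python
from typing import Dict, Iterable


def summarize_vcf(lines: Iterable[str]) -> Dict[str, int]:
    """Count header rows, variant rows, and missing genotype calls in VCF text."""
    counts = {"header_rows": 0, "variant_rows": 0, "calls": 0, "missing_calls": 0}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # "##" meta-information lines plus the single "#CHROM" column header
            counts["header_rows"] += 1
            continue
        fields = line.split("\t")
        counts["variant_rows"] += 1
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this row
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        # Columns 0-8 are fixed (CHROM ... FORMAT); samples start at index 9.
        for sample in fields[9:]:
            values = sample.split(":")
            genotype = values[gt_index] if gt_index < len(values) else "."
            alleles = genotype.replace("|", "/").split("/")
            counts["calls"] += 1
            if all(allele == "." for allele in alleles):
                counts["missing_calls"] += 1
    return counts


# Hypothetical records for demonstration only (not taken from the dataset).
DEMO = [
    "##fileformat=VCFv4.2",
    "##source=demo",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0:12\t.:0",
    "1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1:8\t1:5",
    "2\t300\t.\tG\tA\t70\tPASS\t.\tGT\t./.\t0/1",
]

if __name__ == "__main__":
    print(summarize_vcf(DEMO))
```

To apply the same function to the dataset, pass an open file handle, e.g. `with open("selected.vqsr.454.biallelic.vcf") as handle: print(summarize_vcf(handle))`. If the row counts reported in item 2 hold and the file has no blank lines, header_rows and variant_rows should correspond to the 48 and 90011 given there.
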

The numerous variables and data descriptors that are standard in VCF files produced by the software GATK will not be described here. See, e.g., https://www.genomoncology.com/blog/what-is-a-variant-call-format-vcf-file

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Software-specific information:

   Name: Genome Analysis Toolkit (GATK)
   Version: 4.1.8.1

   Additional Notes:

   Trimmed DNA reads were aligned to the PH-1 reference genome (NCBI BioProject: PRJEB5475) using default settings on BWA mem (Li and Durbin, 2009). The sequencing library was generated using restriction digestion, so the start and end of sequencing reads often were not random but rather fixed by restriction cut sites. This process should cause many reads to appear duplicated, and thus duplicates were neither marked nor removed.

   GATK was used for variant calling. GATK HaplotypeCaller was used to call variants in GVCF mode. Briefly, variants were called separately for each sample, and the genomic variant call format (GVCF) files were consolidated. This was followed by joint genotyping using “GenotypeGVCFs”.

   No community-curated standard SNP set is available for F. graminearum, so the SNPs obtained from joint genotyping were filtered to create a set of known SNPs for base quality score recalibration (BQSR). Sites with QD < 20, MQ < 55, MQRankSum <= -10.0 && MQRankSum >= 10.0 and ReadPosRankSum <= -10.0 && ReadPosRankSum >= 10.0 were filtered out. Three rounds of BQSR were performed. At the end of each round, variants were called on recalibrated BAM files using GATK HaplotypeCaller, followed by joint genotyping. Variants were filtered to select SNPs, and SNPs were hard filtered based on the above criteria. The SNP set obtained after hard filtering was used as the known set for the subsequent round of BQSR.
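
   The hard-filter criteria above can be sketched as a small Python predicate. This is an illustration only: the filtering itself was done with GATK, and the function name fails_hard_filter and its parameters are hypothetical. Two assumptions are built in. First, the rank-sum criteria are written above with "&&", which no single value can satisfy, so the sketch assumes the intent was to exclude values outside the open interval (-10, 10). Second, an annotation that is absent (None) is treated as passing that criterion, which is a choice made for this sketch and not a statement about how GATK handles missing annotations.

```python
from typing import Optional


def fails_hard_filter(
    qd: Optional[float],
    mq: Optional[float],
    mq_rank_sum: Optional[float],
    read_pos_rank_sum: Optional[float],
    qd_min: float = 20.0,
    mq_min: float = 55.0,
    rank_sum_limit: float = 10.0,
) -> bool:
    """Return True if a site would be filtered out under the assumed criteria."""
    # Low quality-by-depth or low mapping quality removes the site.
    if qd is not None and qd < qd_min:
        return True
    if mq is not None and mq < mq_min:
        return True
    # ASSUMPTION: rank-sum values at or beyond +/- rank_sum_limit are removed,
    # i.e. "value <= -10.0 or value >= 10.0".
    for value in (mq_rank_sum, read_pos_rank_sum):
        if value is not None and abs(value) >= rank_sum_limit:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical annotation values, not taken from the dataset.
    print(fails_hard_filter(qd=25.0, mq=60.0, mq_rank_sum=0.5, read_pos_rank_sum=-0.3))
    print(fails_hard_filter(qd=15.0, mq=60.0, mq_rank_sum=0.0, read_pos_rank_sum=0.0))
```

   The QD cutoff is exposed as the qd_min parameter so that the same predicate can be reused with a different threshold.
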

   For variant quality score recalibration (VQSR), a set of SNPs common to the output files created after applying the hard filter on VCF files from the first, second, and third rounds of BQSR was used as the truth set. The training set was obtained by performing hard filtering (QD < 10, MQ < 55, MQRankSum <= -10.0 && MQRankSum >= 10.0 and ReadPosRankSum <= -10.0 && ReadPosRankSum >= 10.0) on the SNP set obtained after the third round of BQSR. GATK VariantRecalibrator was used for variant quality score recalibration. The SNP set (no filters applied) obtained after running GenotypeGVCFs on BAM files after the third round of BQSR was used as the input for variant quality score recalibration. The model obtained using VariantRecalibrator was used to select SNPs after setting truth-sensitivity-filter-level to 99.0.

2. Equipment-specific information: GBS libraries were prepared using the F. graminearum isolates as described previously (Fumero et al., 2021). All GBS libraries were sequenced using 100 bp single-end reads on an Illumina HiSeq2000 (Fumero et al., 2021).

3. Date of data collection (single date, range, approximate date): Data were collected and processed over a broad time range. The first DNA sequence data from these isolates were generated in 2014. The GATK SelectVariants command that produced the final VCF file was executed on May 25, 2021, at 6:36:14 PM CDT.

4. References

   Fumero, M.V., Yue, W., Chiotta, M.L., Chulze, S.N., Leslie, J.F., Toomajian, C., 2021. Divergence and gene flow between Fusarium subglutinans and F. temperatum isolated from maize in Argentina. Phytopathology 111, 170–183. https://doi.org/10.1094/PHYTO-09-20-0434-FI

   Li, H., Durbin, R., 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760. https://doi.org/10.1093/bioinformatics/btp324

   McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A., 2010.
   The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303. https://doi.org/10.1101/gr.107524.110