This README.txt file was generated on 09/21/2020 by Christopher Toomajian ------------------- GENERAL INFORMATION ------------------- 1. Title of Dataset: Population genomics of Fusarium subglutinans and Fusarium temperatum from Argentina 2. Author Information First Author Contact Information Name: Maria Veronica Fumero Institution: Research Institute on Mycology and Mycotoxicology (IMICO), National Scientific and Technical Research Council - National University of Rio Cuarto (CONICET-UNRC) Address: X5800, Rio Cuarto, Cordoba, Argentina Email: mariaveronicafumero@gmail.com Corresponding Author Contact Information Name: Christopher Toomajian Institution: Kansas State University Address: Department of Plant Pathology, Manhattan, KS 66506 Email: toomajia@ksu.edu Phone: 1-785-532-0879 Author Contact Information Name: Wei Yue Institution: Kansas State University Address: Department of Plant Pathology, Manhattan, KS 66506 Email: weiyuebio@gmail.com Author Contact Information Name: Maria Laura Chiotta Institution: Research Institute on Mycology and Mycotoxicology (IMICO), National Scientific and Technical Research Council - National University of Rio Cuarto (CONICET-UNRC) Address: X5800, Rio Cuarto, Cordoba, Argentina Email: mchiotta@gmail.com Author Contact Information Name: Sofia N. Chulze Institution: Research Institute on Mycology and Mycotoxicology (IMICO), National Scientific and Technical Research Council - National University of Rio Cuarto (CONICET-UNRC) Address: X5800, Rio Cuarto, Cordoba, Argentina Email: snchulze@gmail.com Author Contact Information Name: John F. Leslie Institution: Kansas State University Address: Department of Plant Pathology, Manhattan, KS 66506 Email: jfl@ksu.edu --------------------- DATA & FILE OVERVIEW --------------------- Directory of Files A. Data files 1a. Filename: 1a_results_complete.vcf Short description: VCF file output of GATK v. 
3.1- HaplotypeCaller tool run on BAM file of GBS reads from 96 samples (including 1 negative control) mapped to F. temperatum CMWF389 genome. 1b. Filename: 1b_results.table Short description: Tab separated variant table exported from vcf file 1a, including all variants. 1c. Filename: 1c_resultsedit3.table Short description: Tab separated modification of file 1b with negative control sample removed. Serves as input to a_summarize_results_table.pl. 2a. Filename: 2a_summarize_results_table.out Short description: Tab separated output of Perl script a_summarize_results_table.pl. Includes genotypes from 95 strains plus additional information on each SNP (allele counts in each species). 2b. Filename: 2b_FsFt22817B.tsv Short description: Tab separated values file, derived from file 2a, including only the 73 strains used in subsequent analyses and the SNPs that are polymorphic in that strain set. Serves as input to the following Perl scripts: b_pairwise_SNPcompare.pl, d1_makeSTRUCT73input.pl, e1_make_smartpcas_bothspp.pl, e2_make_smartpcas_Fs.pl, e3_make_smartpcas_Ft.pl, h_createSweepFinder_input.pl. 2c. Filename: 2c_FsFt22817cc.tsv Short description: Tab separated values file, derived from file 2b with no SNPs removed but with strain columns removed so that genotypes remain for only the 33 strains in the clone-corrected sample (see the data description for this file). Serves as input to the following Perl scripts: c_filter_clonecorrectedSNPs.pl, e4_make_smartpcas_bothsppCC.pl, e5_make_smartpcas_FsCC.pl, e6_make_smartpcas_FtCC.pl. 3. Filename: 3_filter_clonecorrectedSNPs33.out Short description: Tab separated output of Perl script c_filter_clonecorrectedSNPs.pl. This file contains only the 12,021 SNPs that are polymorphic in the clone-corrected set of samples (after dropping singletons and filtering for amount of missing data). Serves as input to Perl script d2_makeSTRUCT33input.pl. 4a. 
Filename: 4a_VSNP24.map Short description: Tab separated values file containing position information for 5099 SNPs used as input for Perl script f_summarizeLD.pl to summarize LD decay in clone-corrected Fs population. 4b. Filename: 4b_VSNP9.map Short description: Tab separated values file containing position information for 4254 SNPs used as input for Perl script f_summarizeLD.pl to summarize LD decay in clone-corrected Ft population. 5a. Filename: 5a_plink24matrix.ld Short description: Pairwise linkage disequilibrium measure (r^2) for all pairs of SNPs used in Fs LD analysis. The list of those SNPs is given in file 4a_VSNP24.map. Consecutive values within each row are separated by a single space. Output from PLINK software v. 1.07. Serves as input for Perl script f_summarizeLD.pl to summarize LD decay in clone-corrected Fs population. 5b. Filename: 5b_plink9matrix.ld Short description: Pairwise linkage disequilibrium measure (r^2) for all pairs of SNPs used in Ft LD analysis. The list of those SNPs is given in file 4b_VSNP9.map. Consecutive values within each row are separated by a single space. Output from PLINK software v. 1.07. Serves as input for Perl script f_summarizeLD.pl to summarize LD decay in clone-corrected Ft population. 6a. Filename: 6a_FsFtcc2.evec Short description: Tab separated values file with first 3 principal components of clone-corrected sample of 33 strains from both species. Serves as input for R code g_PCA_plottingFsFt.R which plots first 3 principal components of these strains. 6b. Filename: 6b_Fscc2.evec Short description: Tab separated values file with first 3 principal components of clone-corrected sample of 24 Fs strains. Serves as input for R code g_PCA_plottingFsFt.R which plots first 3 principal components of these strains. 6c. Filename: 6c_Ftcc3.evec Short description: Tab separated values file with first 3 principal components of clone-corrected sample of 9 Ft strains. 
Serves as input for R code g_PCA_plottingFsFt.R which plots first 3 principal components of these strains. 6d. Filename: 6d_PCAC_Vsmartpca3 Short description: Tab separated values file with first 3 principal components of full sample of 73 strains from both species. Serves as input for R code g_PCA_plottingFsFt.R which plots first 3 principal components of these strains. 6e. Filename: 6e_PCAFs_Vsmartpca Short description: Tab separated values file with first 3 principal components of full sample of 46 Fs strains. Serves as input for R code g_PCA_plottingFsFt.R which plots first 3 principal components of these strains. 6f. Filename: 6f_PCAFt_Vsmartpca Short description: Tab separated values file with first 3 principal components of full sample of 27 Ft strains. Serves as input for R code g_PCA_plottingFsFt.R which plots first 3 principal components of these strains. 7. Filename: 7_dadi_input_full Short description: Tab separated values file with allele counts in each species of the 9845 SNPs for which the ancestral state could be inferred. Serves as input to Perl script h_createSweepFinder_input.pl as well as dadi software. 8. Filename: 8_batch_4321_seqinput_log.csv Short description: Comma separated values file of position, sample size, and reference sequence for 40589 GBS loci. Serves as input to Perl script i1_batch_compute_totalsample.pl, which attempts to compute nucleotide diversity statistics for alignments of sequences from both species for each locus. 9. Filename: 9_locusnucdiv_summary Short description: Tab separated values file containing partial nucleotide diversity output per GBS locus from 'compute' tool of 'analysis' package. Serves as input for Perl script j_pi_slidewin100k.pl. B. Perl or R code 10. Filename: a_summarize_results_table.pl Short description: Perl script which filters out SNPs with > 2 alleles and adds to the SNP genotypes a summary of the count of each allele and missing genotype values in the Fs and Ft species samples. 
Input file(s): 1c_resultsedit3.table Output file(s): 2a_summarize_results_table.out; VSNP_sum_multi 11. Filename: b_pairwise_SNPcompare.pl Short description: Perl script to perform all pairwise comparisons of strains, to aid in computing the proportion of SNPs at which the strains differ. Input file(s): 2b_FsFt22817B.tsv Output file(s): FsFt_pairwise 12. Filename: c_filter_clonecorrectedSNPs.pl Short description: Perl script to output only those SNPs polymorphic in the clone-corrected set of samples (after dropping singletons and filtering for amount of missing data). Creates output file for SNPs polymorphic in the 2-species sample as well as files for the SNPs polymorphic within each species. Input file(s): 2c_FsFt22817cc.tsv Output file(s): 3_filter_clonecorrectedSNPs33.out; VSNP_sumFs24; VSNP_sumFt9 13. Filename: d1_makeSTRUCT73input.pl Short description: Perl script for transforming a SNP genotype matrix from the full set of samples into an input file suitable for the software STRUCTURE. Input file(s): 2b_FsFt22817B.tsv Output file(s): VSNP73_str.in 14. Filename: d2_makeSTRUCT33input.pl Short description: Perl script for transforming a SNP genotype matrix from the set of clone-corrected samples into an input file suitable for the software STRUCTURE. Input file(s): 3_filter_clonecorrectedSNPs33.out Output file(s): VSNP33_str.in 15. Filename: e1_make_smartpcas_bothspp.pl Short description: Perl script to read a SNP genotype matrix from the full set of strains from both species and create input files suitable for the smartpca program from the EIGENSOFT software package. It filters out SNPs with >2 alleles, singleton SNPs, and those with genotype calls in fewer than half of the strains. Input file(s): 2b_FsFt22817B.tsv Output file(s): Vsmartpca3.geno; Vsmartpca3.snp; Vsmartpca3.ind 16. 
Filename: e2_make_smartpcas_Fs.pl Short description: Perl script to read a SNP genotype matrix from the set of all Fs strains and create input files suitable for the smartpca program from the EIGENSOFT software package. It filters out SNPs with >2 alleles, singleton SNPs, and those with genotype calls in fewer than half of the strains. Input file(s): 2b_FsFt22817B.tsv Output file(s): VsmartpcaFs.geno; VsmartpcaFs.snp; VsmartpcaFs.ind 17. Filename: e3_make_smartpcas_Ft.pl Short description: Perl script to read a SNP genotype matrix from the set of all Ft strains and create input files suitable for the smartpca program from the EIGENSOFT software package. It filters out SNPs with >2 alleles, singleton SNPs, and those with genotype calls in fewer than half of the strains. Input file(s): 2b_FsFt22817B.tsv Output file(s): VsmartpcaFt.geno; VsmartpcaFt.snp; VsmartpcaFt.ind 18. Filename: e4_make_smartpcas_bothsppCC.pl Short description: Perl script to read a SNP genotype matrix from the clone-corrected set of strains from both species and create input files suitable for the smartpca program from the EIGENSOFT software package. It filters out SNPs with >2 alleles, singleton SNPs, and those with genotype calls in fewer than half of the strains. Input file(s): 2c_FsFt22817cc.tsv Output file(s): Vsmartpca3cc.geno; Vsmartpca3cc.snp; Vsmartpca3cc.ind 19. Filename: e5_make_smartpcas_FsCC.pl Short description: Perl script to read a SNP genotype matrix from the clone-corrected set of Fs strains and create input files suitable for the smartpca program from the EIGENSOFT software package. It filters out SNPs with >2 alleles, singleton SNPs, and those with genotype calls in fewer than half of the strains. Input file(s): 2c_FsFt22817cc.tsv Output file(s): VsmartpcaFscc.geno; VsmartpcaFscc.snp; VsmartpcaFscc.ind 20. 
Filename: e6_make_smartpcas_FtCC.pl Short description: Perl script to read a SNP genotype matrix from the clone-corrected set of Ft strains and create input files suitable for the smartpca program from the EIGENSOFT software package. It filters out SNPs with >2 alleles, singleton SNPs, and those with genotype calls in fewer than half of the strains. Input file(s): 2c_FsFt22817cc.tsv Output file(s): VsmartpcaFtcc.geno; VsmartpcaFtcc.snp; VsmartpcaFtcc.ind 21. Filename: f_summarizeLD.pl Short description: Perl script to summarize PLINK software output containing the linkage disequilibrium measure r^2 for all pairs of SNPs in a dataset. It computes the average of r^2 for all SNP pairs separated by a particular distance. Input file(s): 4a_VSNP24.map; 5a_plink24matrix.ld OR alternately 4b_VSNP9.map; 5b_plink9matrix.ld Output file(s): Fs_LDdecay.out OR alternately Ft_LDdecay.out 22. Filename: g_PCA_plottingFsFt.R Short description: R code for creating '3-D' PCA scatterplots for 6 separate datasets: PCA of combined Fs and Ft strains (inputfile=6d_PCAC_Vsmartpca3), PCA of Fs strains separately (inputfile=6e_PCAFs_Vsmartpca), PCA of Ft strains separately (inputfile=6f_PCAFt_Vsmartpca), PCA of combined Fs and Ft strains after clone-correction (inputfile=6a_FsFtcc2.evec), PCA of Fs strains after clone-correction (inputfile=6b_Fscc2.evec), and PCA of Ft strains after clone-correction (inputfile=6c_Ftcc3.evec). 23. Filename: h_createSweepFinder_input.pl Short description: Perl script that takes SNP genotype matrix from sample and from it prepares species-specific input files for the software SweepFinder2 with SNP position, allele count, genotype call count, and derived allele state (when available) for each SNP that is polymorphic within that species or for all substitutions fixed in that species relative to the ancestral allele. 
Input file(s): 2b_FsFt22817B.tsv; 7_dadi_input_full Output file(s): 2 sets, of the form SFinFsB_1 and SFinFtB_1, with 1 file per species (Fs or Ft) for each scaffold of the reference genome. 24. Filename: i1_batch_compute_totalsample.pl Short description: Perl script that calls analysis package tool 'compute' for each alignment at a GBS locus that includes sequences from more than one strain. Input file(s): 8_batch_4321_seqinput_log.csv; all alignments of sequences at GBS loci in fasta format Output file(s): Files of the form x_com.out and x.comlog, where x is integer representing each GBS locus. 25. Filename: i2_batch_compute_separate_species.pl Short description: Perl script that calls analysis package tool 'compute' for each species-specific alignment at a GBS locus. Input file(s): All alignments of species-specific sequences at GBS loci in fasta format Output file(s): Files of the form x_Fs_com.out and x_Ft_com.out, where x is integer representing each GBS locus. 26. Filename: j_pi_slidewin100k.pl Short description: Perl script that computes average nucleotide diversity values for sliding windows along chromosomes, with average across GBS loci within a window being weighted by the length of each GBS locus. Computes nucleotide diversity within each species (Fs, Ft) as well as between species. Input file(s): 9_locusnucdiv_summary Output file(s): subpop_sliding100k Additional Notes on File Relationships, Context, or Content: See "short description" of each data file above and "input file(s)" and "output file(s)" of each computer code file for information on file relationships. Specifically, they list which scripts produced a specific data file as output, which use a specific data file as input, and otherwise how different data files may be derived from other data files. File Naming Convention (if not included above): All scripts (written in Perl or R) start with a lower case letter and end in .pl or .R, respectively. 
All data files begin with a number, and all are simple text files. Data files are tab separated values (tsv) files unless otherwise stated (only 3 data files are NOT tsv: 1 csv and 2 where values are separated by single spaces). Both scripts and data files are ordered (alphabetically or numerically) to correspond roughly to the order in which they should be used or were created following the analysis workflow as presented in the manuscript describing this work. Fs and Ft stand for F. subglutinans and F. temperatum, respectively, and cc stands for clone-corrected. 33 corresponds to the number of strains in the clone-corrected sample, 73 corresponds to the total number of strains used in the analyses. 9 corresponds to the number of Ft strains in the clone-corrected sample, 24 corresponds to the number of Fs strains in the clone-corrected sample. ----------------------------------------- DATA DESCRIPTION FOR: 1a_results_complete.vcf ----------------------------------------- As indicated in the first row, this file follows the standard format for VCFv4.1. See the following link for full information on the VCFv4.1 specification: https://samtools.github.io/hts-specs/VCFv4.1.pdf ----------------------------------------- DATA DESCRIPTION FOR: 1b_results.table ----------------------------------------- 1. Number of variables: 6, plus genotype calls for 95 distinct fungal strains plus 1 negative control (water solution rather than DNA sample) 2. Number of cases/rows: 1 header row plus 48649 rows, 1 for each SNP 3. Missing data codes: '.' Indicates no genotype call made for a particular strain at the current SNP. 4. Variable List A. Name: CHROM Description: scaffold where SNP located (string) B. Name: POS Description: position of SNP on the scaffold (integer >0) C. Name: REF Description: Nucleotide base found in reference sequence at this SNP. Possible values = A, C, G, T D. Name: ALT Description: Nucleotide base(s) differing from REF at this SNP that are found in at least one strain. 
Possible values = A, C, G, T E. Name: AC Description: Count of times present in set of genotyped strains, for each ALT allele (integer(s)) F. Name: AF Description: Allele frequency for each ALT allele (positive number <=1) G. Name: x.GT (where x is the strain name) Description: Genotype Possible values = A, C, G, T, or when data is missing, "." ----------------------------------------- DATA DESCRIPTION FOR: 1c_resultsedit3.table ----------------------------------------- 1. Number of variables: 6, plus genotype calls for 95 distinct fungal strains 2. Number of cases/rows: 1 header row plus 48649 rows, 1 for each SNP 3. Missing data codes: '.' Indicates no genotype call made for a particular strain at the current SNP. 4. Variable List A. Name: CHROM Description: scaffold where SNP located (string) B. Name: POS Description: position of SNP on the scaffold (integer >0) C. Name: REF Description: Nucleotide base found in reference sequence at this SNP. Possible values = A, C, G, T D. Name: ALT Description: Nucleotide base(s) differing from REF at this SNP that are found in at least one strain. Possible values = A, C, G, T E. Name: AC Description: Count of times present in set of genotyped strains, for each ALT allele (integer(s)) F. Name: AF Description: Allele frequency for each ALT allele (positive number <=1) G. Name: x.GT (where x is the strain name) Description: Genotype Possible values = A, C, G, T, or when data is missing, "." ----------------------------------------- DATA DESCRIPTION FOR: 2a_summarize_results_table.out ----------------------------------------- 1. Number of variables: 15, plus genotype calls for 95 distinct fungal strains 2. Number of cases/rows: 1 header row plus 48468 rows, 1 for each biallelic SNP 3. Missing data codes: '.' Indicates no genotype call made for a particular strain at the current SNP. 4. Variable List A. Name: snp_id Description: Unique ID for each SNP (string) B. 
Name: allele1 Description: Nucleotide state of 1 of 2 alleles found at SNP Possible values = A, C, G, T C. Name: allele2 Description: Nucleotide state of 1 of 2 alleles found at SNP Possible values = A, C, G, T D. Name: Fs_missingcount Description: Count of Fs strains with missing genotype calls at SNP. (integer) E. Name: Fs_allele1count Description: Count of Fs strains with allele1 at SNP (integer) F. Name: Fs_allele2count Description: Count of Fs strains with allele2 at SNP (integer) G. Name: Ft_missingcount Description: Count of Ft strains with missing genotype calls at SNP. (integer) H. Name: Ft_allele1count Description: Count of Ft strains with allele1 at SNP (integer) I. Name: Ft_allele2count Description: Count of Ft strains with allele2 at SNP (integer) J. Name: CHROM Description: scaffold where SNP located (string) K. Name: POS Description: position of SNP on the scaffold (integer >0) L. Name: REF Description: Nucleotide base found in reference sequence at this SNP. Possible values = A, C, G, T M. Name: ALT Description: Nucleotide base(s) differing from REF at this SNP that are found in other strains. Possible values = A, C, G, T N. Name: AC Description: Count of times present in set of genotyped strains, for each ALT allele (integer(s)) O. Name: AF Description: Allele frequency for each ALT allele (positive number <=1) P. Name: x.GT (where x is the strain name) Description: Genotype Possible values = A, C, G, T, or when data is missing, "." ----------------------------------------- DATA DESCRIPTION FOR: 2b_FsFt22817B.tsv ----------------------------------------- 1. Number of variables: 5, plus genotype calls for 73 distinct fungal strains 2. Number of cases/rows: 1 header row plus 22817 rows, 1 for each SNP 3. Missing data codes: '.' Indicates no genotype call made for a particular strain at the current SNP. 4. Variable List A. Name: ID Description: Unique ID for each SNP (string) B. 
Name: allele1 Description: Nucleotide state of 1 of 2 alleles found at SNP Possible values = A, C, G, T C. Name: allele2 Description: Nucleotide state of 1 of 2 alleles found at SNP Possible values = A, C, G, T D. Name: CHROM Description: Scaffold where SNP located (string) E. Name: POS Description: Position of SNP on the scaffold (integer >0) F. Name: x.GT (where x is the strain name) Description: Genotype Possible values = A, C, G, T, or when data is missing, "." ----------------------------------------- DATA DESCRIPTION FOR: 2c_FsFt22817cc.tsv ----------------------------------------- 1. Number of variables: 6, plus genotype calls for 33 distinct fungal strains 2. Number of cases/rows: 1 header row plus 22817 rows, 1 for each SNP 3. Missing data codes: '.' Indicates no genotype call made for a particular strain at the current SNP. 4. Variable List A. Name: CHROM Description: scaffold where SNP located (string) B. Name: POS Description: position of SNP on the scaffold (integer >0) C. Name: REF Description: Nucleotide base found in reference sequence at this SNP. Possible values = A, C, G, T D. Name: ALT Description: Nucleotide base(s) differing from REF at this SNP that are found in at least one strain. Possible values = A, C, G, T E. Name: AC Description: Count of times present in set of genotyped strains, for each ALT allele (integer(s)) F. Name: AF Description: Allele frequency for each ALT allele (positive number <=1) G. Name: x.GT (where x is the strain name) Description: Genotype Possible values = A, C, G, T, or when data is missing, "." ----------------------------------------- DATA DESCRIPTION FOR: 3_filter_clonecorrectedSNPs33.out ----------------------------------------- 1. Number of variables: 8, plus genotype calls for 33 distinct fungal strains 2. Number of cases/rows: 1 header row plus 12021 rows, 1 for each SNP 3. Missing data codes: '.' Indicates no genotype call made for a particular strain at the current SNP. 4. Variable List A. 
Name: Scaffold Description: scaffold where SNP located (integer) B. Name: position Description: position of SNP on the scaffold (integer >0) C. Name: SNP_ID Description: Unique ID for each SNP (string) D. Name: allele1 Description: Nucleotide state of 1 of 2 alleles found at SNP Possible values = A, C, G, T E. Name: allele2 Description: Nucleotide state of 1 of 2 alleles found at SNP Possible values = A, C, G, T F. Name: missing genotype calls Description: Count of strains with missing genotype calls at SNP. (integer) G. Name: allele1 count Description: Count of strains with allele1 at SNP (integer) H. Name: allele2 count Description: Count of strains with allele2 at SNP (integer) I. Name: x (where x is the strain name) Description: Genotype Possible values = A, C, G, T, or when data is missing, "." ----------------------------------------- DATA DESCRIPTION FOR: 4a_VSNP24.map ----------------------------------------- 1. Number of variables: 3 2. Number of cases/rows: 5099, 1 for each SNP 3. Missing data codes: None 4. Variable List A. Name: Column 1 (Scaffold) Description: scaffold where SNP located (string) B. Name: Column 2 (SNP_ID) Description: Unique ID for each SNP (string) C. Name: Column 3 (position) Description: position of SNP on the scaffold (integer >0) ----------------------------------------- DATA DESCRIPTION FOR: 4b_VSNP9.map ----------------------------------------- 1. Number of variables: 3 2. Number of cases/rows: 4254, 1 for each SNP 3. Missing data codes: None 4. Variable List A. Name: Column 1 (Scaffold) Description: scaffold where SNP located (string) B. Name: Column 2 (SNP_ID) Description: Unique ID for each SNP (string) C. Name: Column 3 (position) Description: position of SNP on the scaffold (integer >0) ----------------------------------------- DATA DESCRIPTION FOR: 5a_plink24matrix.ld ----------------------------------------- This file is a symmetrical square matrix of r^2 values for all possible pairs of SNPs in file 4a_VSNP24.map 1. 
Number of variables: 5099 columns, 1 for each SNP. 2. Number of cases/rows: 5099, 1 for each SNP 3. Missing data codes: "nan" r^2 undefined between 2 SNPs due to pattern of missing data 4. Variable List A. Name: r^2 Description: r^2 measure of linkage disequilibrium, the square of the correlation coefficient of allele state between 2 loci. ----------------------------------------- DATA DESCRIPTION FOR: 5b_plink9matrix.ld ----------------------------------------- This file is a symmetrical square matrix of r^2 values for all possible pairs of SNPs in file 4b_VSNP9.map 1. Number of variables: 4254 columns, 1 for each SNP 2. Number of cases/rows: 4254, 1 for each SNP 3. Missing data codes: "nan" r^2 undefined between 2 SNPs due to pattern of missing data 4. Variable List A. Name: r^2 Description: r^2 measure of linkage disequilibrium, the square of the correlation coefficient of allele state between 2 loci. ----------------------------------------- DATA DESCRIPTION FOR: 6a_FsFtcc2.evec ----------------------------------------- 1. Number of variables: 5 2. Number of cases/rows: 1 header row plus 99 rows (3 x 33 strains) 3. Missing data codes: None 4. Variable List A. Name: Column 1 (strain name) Description: Name of strain, or dummy value for x-y and x-z projection of data points (string) B. Name: Column 2 (PC1) Description: Principal component 1 (number) C. Name: Column 3 (PC2) Description: Principal component 2 (number) D. Name: Column 4 (PC3) Description: Principal component 3 (number) E. Name: Column 5 (population code) Description: Code designating each strain into one of several populations (integer >0). In practice, used to determine color of points in scatterplot, so that 3 distinct colors are used for each strain because each strain is displayed 3 times, in '3D' space as well as projected onto a marginal x-y plane and onto a marginal x-z plane. 
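The LD inputs described above (a .map file such as 4a_VSNP24.map paired with a square r^2 matrix such as 5a_plink24matrix.ld) can be summarized along the lines of the following minimal Python sketch. It is not the deposited Perl script f_summarizeLD.pl. It assumes that the rows and columns of the matrix follow the SNP order of the .map file, that only SNP pairs on the same scaffold are compared, and that distances are grouped into bins; the function names and the bin_width value are hypothetical and chosen only for illustration. The toy input at the end is invented and is not taken from the dataset.

```python
import math
from collections import defaultdict


def read_map(lines):
    """Parse .map rows: scaffold, SNP ID, position (tab separated, no header)."""
    snps = []
    for line in lines:
        if not line.strip():
            continue
        scaffold, snp_id, pos = line.rstrip("\n").split("\t")
        snps.append((scaffold, snp_id, int(pos)))
    return snps


def read_ld_matrix(lines):
    """Parse a square r^2 matrix; values are space separated, 'nan' = undefined."""
    return [[float(v) for v in line.split()] for line in lines if line.strip()]


def mean_r2_by_distance(snps, matrix, bin_width=1000):
    """Average r^2 over same-scaffold SNP pairs, grouped into distance bins.

    bin_width is a hypothetical choice made for this sketch only.
    Returns a dict mapping the lower edge of each bin to the mean r^2.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(snps)):
        for j in range(i + 1, len(snps)):  # upper triangle: matrix is symmetric
            if snps[i][0] != snps[j][0]:
                continue  # distance is only meaningful within one scaffold
            r2 = matrix[i][j]
            if math.isnan(r2):
                continue  # skip pairs where r^2 is undefined ("nan")
            b = abs(snps[i][2] - snps[j][2]) // bin_width
            sums[b] += r2
            counts[b] += 1
    return {b * bin_width: sums[b] / counts[b] for b in sorted(counts)}


# Toy input (invented values) in the same layout as the real files.
toy_map = ["s1\tsnpA\t100", "s1\tsnpB\t600", "s1\tsnpC\t2100", "s2\tsnpD\t50"]
toy_ld = ["1 0.8 0.2 nan", "0.8 1 0.4 0.1", "0.2 0.4 1 0.3", "nan 0.1 0.3 1"]
decay = mean_r2_by_distance(read_map(toy_map), read_ld_matrix(toy_ld))
```

For the full matrices (several thousand SNPs each) the pure-Python double loop visits millions of pairs, so it may be slow; it is kept here for readability.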
----------------------------------------- DATA DESCRIPTION FOR: 6b_Fscc2.evec ----------------------------------------- 1. Number of variables: 5 2. Number of cases/rows: 1 header row plus 72 rows (3 x 24 strains) 3. Missing data codes: None 4. Variable List A. Name: Column 1 (strain name) Description: Name of strain, or dummy value for x-y and x-z projection of data points (string) B. Name: Column 2 (PC1) Description: Principal component 1 (number) C. Name: Column 3 (PC2) Description: Principal component 2 (number) D. Name: Column 4 (PC3) Description: Principal component 3 (number) E. Name: Column 5 (population code) Description: Code designating each strain into one of several populations (integer >0). In practice, used to determine color of points in scatterplot, so that 3 distinct colors are used for each strain because each strain is displayed 3 times, in '3D' space as well as projected onto a marginal x-y plane and onto a marginal x-z plane. ----------------------------------------- DATA DESCRIPTION FOR: 6c_Ftcc3.evec ----------------------------------------- 1. Number of variables: 5 2. Number of cases/rows: 1 header row plus 27 rows (3 x 9 strains) 3. Missing data codes: None 4. Variable List A. Name: Column 1 (strain name) Description: Name of strain, or dummy value for x-y and x-z projection of data points (string) B. Name: Column 2 (PC1) Description: Principal component 1 (number) C. Name: Column 3 (PC2) Description: Principal component 2 (number) D. Name: Column 4 (PC3) Description: Principal component 3 (number) E. Name: Column 5 (population code) Description: Code designating each strain into one of several populations (integer >0). In practice, used to determine color of points in scatterplot, so that 3 distinct colors are used for each strain because each strain is displayed 3 times, in '3D' space as well as projected onto a marginal x-y plane and onto a marginal x-z plane. 
----------------------------------------- DATA DESCRIPTION FOR: 6d_PCAC_Vsmartpca3 ----------------------------------------- 1. Number of variables: 5 2. Number of cases/rows: 1 header row plus 219 rows (3 x 73 strains) 3. Missing data codes: None 4. Variable List A. Name: Column 1 (strain name) Description: Name of strain, or dummy value for x-y and x-z projection of data points (string) B. Name: Column 2 (PC1) Description: Principal component 1 (number) C. Name: Column 3 (PC2) Description: Principal component 2 (number) D. Name: Column 4 (PC3) Description: Principal component 3 (number) E. Name: Column 5 (population code) Description: Code designating each strain into one of several populations (integer >0). In practice, used to determine color of points in scatterplot, so that 3 distinct colors are used for each strain because each strain is displayed 3 times, in '3D' space as well as projected onto a marginal x-y plane and onto a marginal x-z plane. ----------------------------------------- DATA DESCRIPTION FOR: 6e_PCAFs_Vsmartpca ----------------------------------------- 1. Number of variables: 5 2. Number of cases/rows: 1 header row plus 138 rows (3 x 46 strains) 3. Missing data codes: None 4. Variable List A. Name: Column 1 (strain name) Description: Name of strain, or dummy value for x-y and x-z projection of data points (string) B. Name: Column 2 (PC1) Description: Principal component 1 (number) C. Name: Column 3 (PC2) Description: Principal component 2 (number) D. Name: Column 4 (PC3) Description: Principal component 3 (number) E. Name: Column 5 (population code) Description: Code designating each strain into one of several populations (integer >0). In practice, used to determine color of points in scatterplot, so that 3 distinct colors are used for each strain because each strain is displayed 3 times, in '3D' space as well as projected onto a marginal x-y plane and onto a marginal x-z plane. 
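The PCA files listed in this README (6a through 6f) share one layout: a header row, then 5 tab separated columns, with each strain appearing 3 times under different population codes. The following minimal Python sketch shows one way to read such a file and group its rows by population code, for example before assigning plot colors. It is not the deposited R code g_PCA_plottingFsFt.R. The function names are hypothetical, and the header text, strain name, dummy names, and coordinates in the toy input are invented, since this README does not give the actual header labels or row order.

```python
import csv
import io
from collections import defaultdict


def read_pca_table(handle):
    """Read a 5-column, tab separated PCA table that has one header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    rows = []
    for rec in reader:
        if not rec:
            continue  # tolerate blank lines
        name, pc1, pc2, pc3, code = rec
        rows.append((name, float(pc1), float(pc2), float(pc3), int(code)))
    return rows


def group_by_population_code(rows):
    """Collect (name, PC1, PC2, PC3) tuples under each population code."""
    groups = defaultdict(list)
    for name, pc1, pc2, pc3, code in rows:
        groups[code].append((name, pc1, pc2, pc3))
    return dict(groups)


# Toy input (invented): one strain shown 3 times, as a '3D' point and as
# two projected points carrying dummy names and different population codes.
toy = (
    "strain\tPC1\tPC2\tPC3\tpop\n"
    "A1\t0.1\t0.2\t0.3\t1\n"
    "xy\t0.1\t0.2\t-0.5\t2\n"
    "xz\t0.1\t-0.5\t0.3\t3\n"
)
rows = read_pca_table(io.StringIO(toy))
groups = group_by_population_code(rows)
```

With a real file, the number of data rows should be a multiple of 3 (for example 219 rows for the 73 strains in 6d_PCAC_Vsmartpca3), which gives a quick sanity check after loading.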
----------------------------------------- DATA DESCRIPTION FOR: 6f_PCAFt_Vsmartpca ----------------------------------------- 1. Number of variables: 5 2. Number of cases/rows: 1 header row plus 81 rows (3 x 27 strains) 3. Missing data codes: None 4. Variable List A. Name: Column 1 (strain name) Description: Name of strain, or dummy value for x-y and x-z projection of data points (string) B. Name: Column 2 (PC1) Description: Principal component 1 (number) C. Name: Column 3 (PC2) Description: Principal component 2 (number) D. Name: Column 4 (PC3) Description: Principal component 3 (number) E. Name: Column 5 (population code) Description: Code designating each strain into one of several populations (integer >0). In practice, used to determine color of points in scatterplot, so that 3 distinct colors are used for each strain because each strain is displayed 3 times, in '3D' space as well as projected onto a marginal x-y plane and onto a marginal x-z plane. ----------------------------------------- DATA DESCRIPTION FOR: 7_dadi_input_full ----------------------------------------- 1. Number of variables: 10 2. Number of cases/rows: 1 header row plus 9845 rows, 1 for each SNP 3. Missing data codes: None 4. Variable List A. Name: Ref Description: Reference nucleotide at SNP position and 2 flanking positions. B. Name: Out Description: Inferred ancestral nucleotide at SNP position (based on 2 outgroup species) and 2 flanking positions. C. Name: Allele1 Description: Nucleotide state of 1 of 2 alleles found at SNP Possible values = A, C, G, T D. Name: fs (allele1) Description: Count of Fs strains with allele1 at SNP (integer) E. Name: ft (allele1) Description: Count of Ft strains with allele1 at SNP (integer) F. Name: Allele2 Description: Nucleotide state of 1 of 2 alleles found at SNP Possible values = A, C, G, T G. Name: fs (allele2) Description: Count of Fs strains with allele2 at SNP (integer) H. 
       Name: ft (allele2)
       Description: Count of Ft strains with allele2 at SNP (integer)

    I. Name: scaff
       Description: Scaffold where SNP is located (integer)

    J. Name: pos
       Description: Position of SNP on the scaffold (integer >0)

-----------------------------------------
DATA DESCRIPTION FOR: 8_batch_4321_seqinput_log.csv
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 40589 rows, 1 for each GBS locus

3. Missing data codes: None

4. Variable List

    A. Name: Column 1 (LocusID)
       Description: Number corresponding to ID of GBS locus (integer)

    B. Name: Column 2 (Scaffold)
       Description: Scaffold where GBS locus is found (string)

    C. Name: Column 3 (Start position)
       Description: Position on scaffold at which GBS locus starts (integer)

    D. Name: Column 4 (Orientation)
       Description: Orientation of GBS reads
                    '+' = reference strand (such that position of start of read is smaller than position of end of read)
                    '-' = reverse complement of reference strand (such that position of start of read is larger than position of end of read)

    E. Name: Column 5 (Ft_count)
       Description: Count of Ft strains with sequence at this locus (integer)

    F. Name: Column 6 (Fs_count)
       Description: Count of Fs strains with sequence at this locus (integer)

    G. Name: Column 7 (Refseq)
       Description: Sequence of reference genome at GBS locus (string)

-----------------------------------------
DATA DESCRIPTION FOR: 9_locusnucdiv_summary
-----------------------------------------

1. Number of variables: 15

2. Number of cases/rows: 1 header row plus 11880 rows, one for each locus where nucleotide diversity could be computed

3. Missing data codes: None

4. Variable List

    A. Name: Column 1 (locus_id)
       Description: Unique ID for each locus (integer)

    B. Name: Column 2 (Scaffold)
       Description: Scaffold where GBS locus is found (string)

    C. Name: Column 3 (Chromosome)
       Description: Chromosome (from an assumed 12 for the reference genome) where GBS locus is found (integer)

    D.
       Name: Column 4 (Scaffold position)
       Description: Position on scaffold at which GBS locus starts (integer)

    E. Name: Column 5 (Chromosome position)
       Description: Estimated position on chromosome at which GBS locus starts (integer)

    F. Name: Column 6 (Orientation)
       Description: Orientation of GBS reads
                    '+' = reference strand (such that position of start of read is smaller than position of end of read)
                    '-' = reverse complement of reference strand (such that position of start of read is larger than position of end of read)

    G. Name: Ntot
       Description: Number of total strains with sequence at GBS locus (integer)

    H. Name: nogap (total)
       Description: Length in bp of alignment of all strains excluding gap positions (integer)

    I. Name: thetapi (total)
       Description: Per bp nucleotide diversity of locus for sample that includes all strains of both species (number)

    J. Name: NFt
       Description: Number of Ft strains with sequence at GBS locus (integer)

    K. Name: nogap (Ft)
       Description: Length in bp of alignment of Ft strains excluding gap positions (integer)

    L. Name: thetapi (Ft)
       Description: Per bp nucleotide diversity of locus for sample that includes all Ft strains (number)

    M. Name: NFs
       Description: Number of Fs strains with sequence at GBS locus (integer)

    N. Name: nogap (Fs)
       Description: Length in bp of alignment of Fs strains excluding gap positions (integer)

    O. Name: thetapi (Fs)
       Description: Per bp nucleotide diversity of locus for sample that includes all Fs strains (number)

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Software-specific information:
See the peer-reviewed publications related to this dataset for all software-specific information on the generation and analysis of the datasets included here. All data files here are in simple text formats and need no specialized software to interpret them.

2. Equipment-specific information:
See the peer-reviewed publications related to this dataset for any equipment-specific information.

3.
Date of data collection (single date, range, approximate date): Next-generation sequencing data was collected during Summer 2016. All analyses reported here were conducted between Summer 2016 and September 2020.
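
Because the data files are simple text, they can be processed without specialized software. As a minimal sketch, the Python below summarizes the per-locus nucleotide diversity columns of 9_locusnucdiv_summary as a length-weighted mean (each locus weighted by its nogap length). This summary is an illustration, not an analysis reported by the authors. It assumes whitespace-delimited columns in the order listed in the data description; the function name weighted_mean_thetapi is hypothetical, and the sample rows are invented for the example.

```python
def weighted_mean_thetapi(lines, nogap_col, thetapi_col):
    """Length-weighted mean of per-bp nucleotide diversity across loci.

    Column indexes are 0-based. Assumption: whitespace-delimited fields,
    with the first line being the single header row.
    """
    total_bp = 0
    total_diff = 0.0
    for i, line in enumerate(lines):
        fields = line.split()
        if i == 0 or not fields:  # skip header row and blank lines
            continue
        nogap = int(fields[nogap_col])
        thetapi = float(fields[thetapi_col])
        total_bp += nogap
        total_diff += thetapi * nogap  # per-bp diversity times aligned length
    return total_diff / total_bp if total_bp else float("nan")

# Hypothetical rows for illustration only; not values from the dataset.
# Field order follows the 15 variables A-O of 9_locusnucdiv_summary.
sample = [
    "header row (skipped)",
    "1 scaffold_1 1 100 100 + 70 100 0.01 25 100 0.0 45 100 0.005",
    "2 scaffold_1 1 900 900 - 72 300 0.02 26 300 0.0 46 300 0.010",
]
# 0-based indexes: 7 = nogap (total), 8 = thetapi (total)
mean_total = weighted_mean_thetapi(sample, nogap_col=7, thetapi_col=8)
# 0-based indexes: 13 = nogap (Fs), 14 = thetapi (Fs)
mean_fs = weighted_mean_thetapi(sample, nogap_col=13, thetapi_col=14)
```

Weighting by aligned length keeps short loci from dominating the average; an unweighted mean of the thetapi column is a simpler alternative if all loci should count equally.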