This README.txt file was generated on 09/21/2020 by Christopher Toomajian


-------------------
GENERAL INFORMATION
-------------------


1. Title of Dataset:
Population genomics of Fusarium subglutinans and Fusarium temperatum from Argentina



2. Author Information

  First Author Contact Information
        Name: Maria Veronica Fumero
           Institution: Research Institute on Mycology and
        Mycotoxicology (IMICO), National Scientific and Technical
        Research Council - National University of Rio Cuarto (CONICET-UNRC)
           Address: X5800, Rio Cuarto, Cordoba, Argentina
           Email: mariaveronicafumero@gmail.com


  Corresponding Author Contact Information
        Name: Christopher Toomajian
           Institution: Kansas State University
           Address: Department of Plant Pathology, Manhattan, KS 66506
           Email: toomajia@ksu.edu
           Phone: 1-785-532-0879


  Author Contact Information
           Name:  Wei Yue
           Institution: Kansas State University
           Address: Department of Plant Pathology, Manhattan, KS 66506
           Email: weiyuebio@gmail.com


  Author Contact Information
           Name: Maria Laura Chiotta
           Institution: Research Institute on Mycology and
        Mycotoxicology (IMICO), National Scientific and Technical
        Research Council - National University of Rio Cuarto (CONICET-UNRC)
           Address: X5800, Rio Cuarto, Cordoba, Argentina
           Email: mchiotta@gmail.com


  Author Contact Information
           Name: Sofia N. Chulze
           Institution: Research Institute on Mycology and
        Mycotoxicology (IMICO), National Scientific and Technical
        Research Council - National University of Rio Cuarto (CONICET-UNRC)
           Address: X5800, Rio Cuarto, Cordoba, Argentina
           Email: snchulze@gmail.com


  Author Contact Information
           Name: John F. Leslie
           Institution: Kansas State University
           Address: Department of Plant Pathology, Manhattan, KS 66506
           Email: jfl@ksu.edu


---------------------
DATA & FILE OVERVIEW
--------------------- 

Directory of Files
   A. Data files

      1a. Filename:        1a_results_complete.vcf
      Short description: VCF file output of GATK v. 3.1- HaplotypeCaller tool
      run on BAM file of GBS reads from 96 samples (including 1
      negative control) mapped to F. temperatum CMWF389 genome.

      1b. Filename:        1b_results.table
      Short description: Tab separated variant table exported from vcf file 1a,
      including all variants.

      1c. Filename:        1c_resultsedit3.table
      Short description: Tab separated modification of file 1b with negative control
      sample removed. Serves as input to a_summarize_results_table.pl.

      2a. Filename:        2a_summarize_results_table.out
      Short description: Tab separated output of Perl script
      a_summarize_results_table.pl. Includes genotypes from 95 strains
      plus additional information on each SNP (allele counts in each species).

      2b. Filename:        2b_FsFt22817B.tsv
      Short description: Tab separated values file, derived from file 2a, including only the 73
      strains used in subsequent analyses and the SNPs that are polymorphic in
      that strain set. Serves as input to the following Perl scripts:
      b_pairwise_SNPcompare.pl, d1_makeSTRUCT73input.pl,
      e1_make_smartpcas_bothspp.pl, e2_make_smartpcas_Fs.pl,
      e3_make_smartpcas_Ft.pl, h_createSweepFinder_input.pl.

      2c. Filename:        2c_FsFt22817cc.tsv
      Short description: Tab separated values file, derived from file 2b with no SNPs removed but
      with strain columns removed so that only genotypes from the 33
      clone-corrected strains remain. Serves as input to the
      following Perl scripts: c_filter_clonecorrectedSNPs.pl,
      e4_make_smartpcas_bothsppCC.pl, e5_make_smartpcas_FsCC.pl,
      e6_make_smartpcas_FtCC.pl.

      3. Filename:        3_filter_clonecorrectedSNPs33.out
      Short description: Tab separated output of Perl script
      c_filter_clonecorrectedSNPs.pl. This file contains only the
      12,021 SNPs that are polymorphic in the clone-corrected set of
      samples (after dropping singletons and filtering for amount of
      missing data). Serves as input to Perl script
      d2_makeSTRUCT33input.pl.

      4a. Filename:        4a_VSNP24.map
      Short description: Tab separated values file containing position information for 5099 SNPs used as input for
      Perl script f_summarizeLD.pl to summarize LD decay in
      clone-corrected Fs population.

      4b. Filename:        4b_VSNP9.map
      Short description: Tab separated values file containing position information for 4254 SNPs used as input for
      Perl script f_summarizeLD.pl to summarize LD decay in
      clone-corrected Ft population.

      5a. Filename:        5a_plink24matrix.ld
      Short description: Pairwise linkage
      disequilibrium measure (r^2) for all pairs of SNPs used in Fs LD
      analysis. The list of those SNPs is given in file
      4a_VSNP24.map. Consecutive values within each row are separated by a
      single space. Output from PLINK
      software v. 1.07. Serves as input for Perl script f_summarizeLD.pl to summarize LD decay in
      clone-corrected Fs population.

      5b. Filename:        5b_plink9matrix.ld
      Short description: Pairwise linkage
      disequilibrium measure (r^2) for all pairs of SNPs used in Ft LD
      analysis. The list of those SNPs is given in file 4b_VSNP9.map.
      Consecutive values within each row are separated by a
      single space. Output from PLINK
      software v. 1.07. Serves as input for Perl script f_summarizeLD.pl to summarize LD decay in
      clone-corrected Ft population.

      6a. Filename:        6a_FsFtcc2.evec
      Short description: Tab separated values file with first 3 principal components of
      clone-corrected sample of 33 strains from both species. Serves as input for R code
      g_PCA_plottingFsFt.R which plots first 3 principal components of
      these strains.

      6b. Filename:        6b_Fscc2.evec
      Short description: Tab separated values file with first 3 principal components of
      clone-corrected sample of 24 Fs strains. Serves as input for R code
      g_PCA_plottingFsFt.R which plots first 3 principal components of
      these strains.

      6c. Filename:        6c_Ftcc3.evec
      Short description: Tab separated values file with first 3 principal components of
      clone-corrected sample of 9 Ft strains. Serves as input for R code
      g_PCA_plottingFsFt.R which plots first 3 principal components of
      these strains.

      6d. Filename:        6d_PCAC_Vsmartpca3
      Short description: Tab separated values file with first 3 principal components of
      full sample of 73 strains from both species. Serves as input for R code
      g_PCA_plottingFsFt.R which plots first 3 principal components of
      these strains.

      6e. Filename:        6e_PCAFs_Vsmartpca
      Short description: Tab separated values file with first 3 principal components of
      full sample of 46 Fs strains. Serves as input for R code
      g_PCA_plottingFsFt.R which plots first 3 principal components of
      these strains.

      6f. Filename:        6f_PCAFt_Vsmartpca
      Short description: Tab separated values file with first 3 principal components of
      full sample of 27 Ft strains. Serves as input for R code
      g_PCA_plottingFsFt.R which plots first 3 principal components of
      these strains.

      7. Filename:        7_dadi_input_full
      Short description: Tab separated values file with allele counts in each species of
      the 9845 SNPs for which the ancestral state could be inferred. Serves as input to Perl script
      h_createSweepFinder_input.pl as well as dadi software.

      8. Filename:        8_batch_4321_seqinput_log.csv
      Short description: Comma separated values file of position, sample size, and reference
      sequence for 40589 GBS loci. Serves as input to
      Perl script i1_batch_compute_totalsample.pl, which attempts to
      compute nucleotide diversity statistics for alignments
      of sequences from both species for each locus.

      9. Filename:        9_locusnucdiv_summary
      Short description: Tab separated values file containing partial nucleotide diversity
      output per GBS locus from 'compute' tool of 'analysis'
      package. Serves as input for Perl script j_pi_slidewin100k.pl.
        
   B. Perl or R code

      10. Filename:        a_summarize_results_table.pl
      Short description: Perl script which filters out SNPs with > 2
      alleles and adds to the SNP genotypes a summary of the count of
      each allele and missing genotype values in the Fs and Ft species samples.
      Input file(s): 1c_resultsedit3.table
      Output file(s): 2a_summarize_results_table.out; VSNP_sum_multi

      11. Filename:        b_pairwise_SNPcompare.pl
      Short description: Perl script to perform all pairwise
      comparisons of strains, to aid in computing the proportion of SNPs at
      which the strains differ.
      Input file(s): 2b_FsFt22817B.tsv
      Output file(s): FsFt_pairwise

      12. Filename:        c_filter_clonecorrectedSNPs.pl
      Short description: Perl script to output only those SNPs polymorphic in the clone-corrected set of
      samples (after dropping singletons and filtering for amount of
      missing data). Creates an output file for SNPs polymorphic in the
      2-species sample as well as files for the SNPs polymorphic
      within each species.
      Input file(s): 2c_FsFt22817cc.tsv
      Output file(s): 3_filter_clonecorrectedSNPs33.out; VSNP_sumFs24; VSNP_sumFt9
        
      13. Filename:        d1_makeSTRUCT73input.pl
      Short description: Perl script for transforming a SNP genotype
      matrix from the full set of samples into an input file suitable
      for the software STRUCTURE.
      Input file(s): 2b_FsFt22817B.tsv
      Output file(s): VSNP73_str.in

      14. Filename:        d2_makeSTRUCT33input.pl
      Short description: Perl script for transforming a SNP genotype
      matrix from the set of clone-corrected samples into an input
      file suitable for the software STRUCTURE.
      Input file(s): 3_filter_clonecorrectedSNPs33.out
      Output file(s): VSNP33_str.in

      15. Filename:        e1_make_smartpcas_bothspp.pl
      Short description: Perl script to read a SNP genotype matrix
      from the full set of strains from both species and create input files suitable
      for the smartpca program from the EIGENSOFT software package. It
      filters out SNPs with >2 alleles, singleton SNPs, and those with
      genotype calls in fewer than half of the strains.
      Input file(s): 2b_FsFt22817B.tsv
      Output file(s): Vsmartpca3.geno; Vsmartpca3.snp; Vsmartpca3.ind

      16. Filename:        e2_make_smartpcas_Fs.pl
      Short description: Perl script to read a SNP genotype matrix
      from the set of all Fs strains and create input files suitable
      for the smartpca program from the EIGENSOFT software package. It
      filters out SNPs with >2 alleles, singleton SNPs, and those with
      genotype calls in fewer than half of the strains.
      Input file(s): 2b_FsFt22817B.tsv
      Output file(s): VsmartpcaFs.geno; VsmartpcaFs.snp; VsmartpcaFs.ind

      17. Filename:        e3_make_smartpcas_Ft.pl
      Short description: Perl script to read a SNP genotype matrix
      from the set of all Ft strains and create input files suitable
      for the smartpca program from the EIGENSOFT software package. It
      filters out SNPs with >2 alleles, singleton SNPs, and those with
      genotype calls in fewer than half of the strains.
      Input file(s): 2b_FsFt22817B.tsv
      Output file(s): VsmartpcaFt.geno; VsmartpcaFt.snp; VsmartpcaFt.ind

      18. Filename:        e4_make_smartpcas_bothsppCC.pl
      Short description: Perl script to read a SNP genotype matrix
      from the clone-corrected set of strains from both species and create input files suitable
      for the smartpca program from the EIGENSOFT software package. It
      filters out SNPs with >2 alleles, singleton SNPs, and those with
      genotype calls in fewer than half of the strains.
      Input file(s): 2c_FsFt22817cc.tsv
      Output file(s): Vsmartpca3cc.geno; Vsmartpca3cc.snp; Vsmartpca3cc.ind

      19. Filename:        e5_make_smartpcas_FsCC.pl
      Short description: Perl script to read a SNP genotype matrix
      from the clone-corrected set of Fs strains and create input files suitable
      for the smartpca program from the EIGENSOFT software package. It
      filters out SNPs with >2 alleles, singleton SNPs, and those with
      genotype calls in fewer than half of the strains.
      Input file(s): 2c_FsFt22817cc.tsv
      Output file(s): VsmartpcaFscc.geno; VsmartpcaFscc.snp; VsmartpcaFscc.ind

      20. Filename:        e6_make_smartpcas_FtCC.pl
      Short description: Perl script to read a SNP genotype matrix
      from the clone-corrected set of Ft strains and create input files suitable
      for the smartpca program from the EIGENSOFT software package. It
      filters out SNPs with >2 alleles, singleton SNPs, and those with
      genotype calls in fewer than half of the strains.
      Input file(s): 2c_FsFt22817cc.tsv
      Output file(s): VsmartpcaFtcc.geno; VsmartpcaFtcc.snp; VsmartpcaFtcc.ind

      21. Filename:        f_summarizeLD.pl
      Short description: Perl script to summarize PLINK software
      output containing the linkage disequilibrium measure r^2 for all
      pairs of SNPs in a dataset. It computes the average of r^2 for
      all SNP pairs separated by a particular distance.
      Input file(s): 4a_VSNP24.map; 5a_plink24matrix.ld OR alternately 4b_VSNP9.map; 5b_plink9matrix.ld
      Output file(s): Fs_LDdecay.out OR alternately Ft_LDdecay.out

      22. Filename:        g_PCA_plottingFsFt.R
      Short description: R code for creating '3-D' PCA scatterplots
      for 6 separate datasets: PCA of combined Fs and Ft strains (inputfile=6d_PCAC_Vsmartpca3), PCA
      of Fs strains separately (inputfile=6e_PCAFs_Vsmartpca), PCA of
      Ft strains separately (inputfile=6f_PCAFt_Vsmartpca), PCA of
      combined Fs and Ft strains after clone-correction (inputfile=6a_FsFtcc2.evec), PCA of Fs
      strains after clone-correction (inputfile=6b_Fscc2.evec), and PCA of Ft strains after
      clone-correction (inputfile=6c_Ftcc3.evec).       

      23. Filename:        h_createSweepFinder_input.pl
      Short description: Perl script that takes SNP genotype matrix
      from sample and from it prepares species-specific input files for the software SweepFinder2 with SNP position,
      allele count, genotype call count, and derived allele state
      (when available) for each SNP that is polymorphic within that
      species or for all substitutions fixed in that species relative
      to the ancestral allele.
      Input file(s): 2b_FsFt22817B.tsv; 7_dadi_input_full
      Output file(s): 2 sets, of the form SFinFsB_1 and SFinFtB_1,
      with 1 file per species (Fs or Ft) for each scaffold of the reference genome.

      24. Filename:        i1_batch_compute_totalsample.pl
      Short description: Perl script that calls analysis package
      tool 'compute' for each alignment at a GBS locus that includes
      sequences from more than one strain.
      Input file(s): 8_batch_4321_seqinput_log.csv; all alignments of
      sequences at GBS loci in fasta format
      Output file(s): Files of the form x_com.out and x.comlog, where
      x is integer representing each GBS locus.

      25. Filename:        i2_batch_compute_separate_species.pl
      Short description: Perl script that calls analysis package
      tool 'compute' for each species-specific alignment at a GBS locus.
      Input file(s): All alignments of species-specific sequences at GBS loci in fasta format
      Output file(s): Files of the form x_Fs_com.out and x_Ft_com.out,
      where x is integer representing each GBS locus.

      26. Filename:        j_pi_slidewin100k.pl
      Short description: Perl script that computes average nucleotide
      diversity values for sliding windows along chromosomes, with
      average across GBS loci within a window being weighted by the
      length of each GBS locus. Computes nucleotide diversity within
      each species (Fs, Ft) as well as between species.
      Input file(s): 9_locusnucdiv_summary
      Output file(s): subpop_sliding100k
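
The length-weighted window average described for j_pi_slidewin100k.pl
can be sketched as follows. This is a minimal Python illustration, not
the Perl script itself. The (position, length, pi) tuples, the function
name, and the 50 kb step are assumptions made for the example; only the
idea of weighting each GBS locus by its length comes from the
description above, and the 100 kb window is inferred from the script
name.

```python
def weighted_window_means(loci, window=100_000, step=50_000):
    """Length-weighted mean diversity per sliding window.

    loci: list of (position, length, pi) tuples for one scaffold
          (hypothetical layout, not the columns of 9_locusnucdiv_summary).
    Returns a list of (window_start, weighted_mean_pi) for every window
    that contains at least one locus.
    """
    if not loci:
        return []
    last_position = max(pos for pos, _length, _pi in loci)
    results = []
    start = 0
    while start <= last_position:
        end = start + window
        inside = [(length, pi) for pos, length, pi in loci if start <= pos < end]
        total_length = sum(length for length, _pi in inside)
        if total_length > 0:
            # Each locus contributes in proportion to its length.
            weighted = sum(length * pi for length, pi in inside) / total_length
            results.append((start, weighted))
        start += step
    return results


# Invented example values, for illustration only.
example_loci = [(10_000, 100, 0.01), (60_000, 300, 0.02), (120_000, 100, 0.04)]
windows = weighted_window_means(example_loci)
```

In this toy input the longer locus at position 60,000 dominates the
first window, which is the intended effect of weighting by locus length.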


 Additional Notes on File Relationships, Context, or Content:        

See "short description" of each data file above and "input file(s)"
and "output file(s)" of each computer code file for information on file
relationships. Specifically, they list which scripts produced a
specific data file as output, which use a specific data file as input,
and otherwise how different data files may be derived from other data
files.



File Naming Convention (if not included above):

All scripts (written in Perl or R) start with a lower case letter and
end in .pl or .R, respectively.

All data files begin with a number, and all are simple text
files. Most are tab separated values
(tsv) files unless otherwise stated (only 3 data files are NOT tsv: 1
csv and 2 where values are separated by single spaces).

Both types of files are ordered (alphabetically or numerically) to
correspond roughly to the order in which they should be used or were
created following the analysis workflow as presented in the manuscript
describing this work.

Fs and Ft stand for F. subglutinans and F. temperatum, respectively,
and cc stands for clone-corrected.
33 corresponds to the number of strains in the clone-corrected sample,
73 corresponds to the total number of strains used in the analyses.
9 corresponds to the number of Ft strains in the clone-corrected
sample, 24 corresponds to the number of Fs strains in the
clone-corrected sample.



-----------------------------------------
DATA DESCRIPTION FOR: 1a_results_complete.vcf
-----------------------------------------

As indicated in the first row, this file follows the standard format
for VCFv4.1.

See the following link for full information on the VCFv4.1
specification: https://samtools.github.io/hts-specs/VCFv4.1.pdf


-----------------------------------------
DATA DESCRIPTION FOR: 1b_results.table
-----------------------------------------

1. Number of variables: 6, plus genotype calls for 95 distinct fungal
strains plus 1 negative control (water solution rather than DNA sample)


2. Number of cases/rows: 1 header row plus 48649 rows, 1 for each SNP


3. Missing data codes:
       '.'        Indicates no genotype call made for a particular
                 strain at the current SNP.

4. Variable List

    A. Name: CHROM
       Description: scaffold where SNP located (string)

    B. Name: POS
       Description: position of SNP on the scaffold (integer >0)

    C. Name: REF
       Description: Nucleotide base found in reference sequence at
       this SNP.
       Possible values = A, C, G, T

    D. Name: ALT
       Description: Nucleotide base(s) differing from REF at this SNP that are
       found in at least one strain.
       Possible values = A, C, G, T

    E. Name: AC
       Description: Count of times present in set of genotyped
       strains, for each ALT allele (integer(s))

    F. Name: AF
       Description: Allele frequency for each ALT allele (positive
       number <=1)

    G. Name: x.GT (where x is the strain name)
       Description: Genotype
       Possible values = A, C, G, T, or when data is missing, "."
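
A table with this layout could be read along the following lines. This
is a minimal Python sketch using only the standard library. The
scaffold names, strain names (s1, s2), and values in the sample text are
invented for illustration; the column names and the '.' missing data
code are taken from the variable list above.

```python
import csv
import io

# Hypothetical two-SNP excerpt in the layout described above.
SAMPLE = (
    "CHROM\tPOS\tREF\tALT\tAC\tAF\ts1.GT\ts2.GT\n"
    "scaf1\t101\tA\tG\t1\t0.5\tA\tG\n"
    "scaf1\t250\tC\tT\t1\t1.0\t.\tT\n"
)


def read_variant_table(handle):
    """Return one dict per SNP, mapping '.' genotype calls to None."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        row = {}
        for key, value in record.items():
            if key.endswith(".GT"):
                # '.' is the documented missing data code for genotypes.
                row[key] = None if value == "." else value
            else:
                row[key] = value
        row["POS"] = int(row["POS"])
        rows.append(row)
    return rows


snps = read_variant_table(io.StringIO(SAMPLE))
# Number of strains without a genotype call at each SNP.
missing_per_snp = [
    sum(value is None for key, value in row.items() if key.endswith(".GT"))
    for row in snps
]
```

To read the real file, the io.StringIO(SAMPLE) argument would be
replaced by an open file handle, for example
open("1b_results.table", newline="").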


-----------------------------------------
DATA DESCRIPTION FOR: 1c_resultsedit3.table 
-----------------------------------------

1. Number of variables: 6, plus genotype calls for 95 distinct fungal
strains

2. Number of cases/rows: 1 header row plus 48649 rows, 1 for each SNP


3. Missing data codes:
       '.'        Indicates no genotype call made for a particular
                 strain at the current SNP.

4. Variable List

    A. Name: CHROM
       Description: scaffold where SNP located (string)

    B. Name: POS
       Description: position of SNP on the scaffold (integer >0)

    C. Name: REF
       Description: Nucleotide base found in reference sequence at
       this SNP.
       Possible values = A, C, G, T

    D. Name: ALT
       Description: Nucleotide base(s) differing from REF at this SNP that are
       found in at least one strain.
       Possible values = A, C, G, T

    E. Name: AC
       Description: Count of times present in set of genotyped
       strains, for each ALT allele (integer(s))

    F. Name: AF
       Description: Allele frequency for each ALT allele (positive
       number <=1)

    G. Name: x.GT (where x is the strain name)
       Description: Genotype
       Possible values = A, C, G, T, or when data is missing, "."


-----------------------------------------
DATA DESCRIPTION FOR: 2a_summarize_results_table.out 
-----------------------------------------

1. Number of variables: 15, plus genotype calls for 95 distinct fungal
strains

2. Number of cases/rows: 1 header row plus 48468 rows, 1 for each
biallelic SNP

3. Missing data codes:
       '.'        Indicates no genotype call made for a particular
                 strain at the current SNP.

4. Variable List

    A. Name: snp_id
       Description: Unique ID for each SNP (string)

    B. Name: allele1
       Description: Nucleotide state of 1 of 2 alleles found at SNP 
       Possible values = A, C, G, T

    C. Name: allele2
       Description: Nucleotide state of 1 of 2 alleles found at SNP
       Possible values = A, C, G, T

    D. Name: Fs_missingcount
       Description: Count of Fs strains with missing genotype calls at
       SNP. (integer)

    E. Name: Fs_allele1count
       Description: Count of Fs strains with allele1 at SNP (integer)

    F. Name: Fs_allele2count
       Description: Count of Fs strains with allele2 at SNP (integer)

    G. Name: Ft_missingcount
       Description: Count of Ft strains with missing genotype calls at
       SNP. (integer)

    H. Name: Ft_allele1count
       Description: Count of Ft strains with allele1 at SNP (integer)

    I. Name: Ft_allele2count
       Description: Count of Ft strains with allele2 at SNP (integer)
       
    J. Name: CHROM
       Description: scaffold where SNP located (string)

    K. Name: POS
       Description: position of SNP on the scaffold (integer >0)

    L. Name: REF
       Description: Nucleotide base found in reference sequence at
       this SNP.
       Possible values = A, C, G, T

    M. Name: ALT
       Description: Nucleotide base(s) differing from REF at this SNP that are
       found in other strains.
       Possible values = A, C, G, T

    N. Name: AC
       Description: Count of times present in set of genotyped
       strains, for each ALT allele (integer(s))

    O. Name: AF
       Description: Allele frequency for each ALT allele (positive
       number <=1)

    P. Name: x.GT (where x is the strain name)
       Description: Genotype
       Possible values = A, C, G, T, or when data is missing, "."


-----------------------------------------
DATA DESCRIPTION FOR: 2b_FsFt22817B.tsv
-----------------------------------------

1. Number of variables: 5, plus genotype calls for 73 distinct fungal
strains

2. Number of cases/rows: 1 header row plus 22817 rows, 1 for each SNP

3. Missing data codes:
       '.'        Indicates no genotype call made for a particular
                 strain at the current SNP.

4. Variable List

    A. Name: ID
       Description: Unique ID for each SNP (string)

    B. Name: allele1
       Description: Nucleotide state of 1 of 2 alleles found at SNP 
       Possible values = A, C, G, T

    C. Name: allele2
       Description: Nucleotide state of 1 of 2 alleles found at SNP
       Possible values = A, C, G, T

    D. Name: CHROM
       Description: Scaffold where SNP located (string)

    E. Name: POS
       Description: Position of SNP on the scaffold (integer >0)

    F. Name: x.GT (where x is the strain name)
       Description: Genotype
       Possible values = A, C, G, T, or when data is missing, "."
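
The pairwise strain comparison that uses this file as input
(b_pairwise_SNPcompare.pl) can be illustrated with a short Python
sketch. It is not the Perl script. The strain names and genotypes are
invented, and skipping SNPs where either strain has a missing call is an
assumption made here, since the description above does not say how
missing data are handled.

```python
from itertools import combinations


def pairwise_differences(genotypes, missing="."):
    """Count differing SNPs for every pair of strains.

    genotypes: dict mapping strain name to a list of genotype calls, all
               lists in the same SNP order.
    Returns a dict mapping (strain_a, strain_b) to
    (number of differing SNPs, number of SNPs compared).
    """
    result = {}
    for a, b in combinations(sorted(genotypes), 2):
        compared = 0
        differ = 0
        for call_a, call_b in zip(genotypes[a], genotypes[b]):
            # Assumption: a SNP is skipped if either call is missing.
            if call_a == missing or call_b == missing:
                continue
            compared += 1
            if call_a != call_b:
                differ += 1
        result[(a, b)] = (differ, compared)
    return result


# Invented genotypes at four SNPs for three hypothetical strains.
example = {
    "s1": ["A", "C", ".", "T"],
    "s2": ["A", "T", "G", "T"],
    "s3": ["G", "T", "G", "."],
}
pairs = pairwise_differences(example)
```

Dividing the first number of each pair by the second gives the
proportion of compared SNPs at which the two strains differ.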


-----------------------------------------
DATA DESCRIPTION FOR: 2c_FsFt22817cc.tsv
-----------------------------------------

1. Number of variables: 6, plus genotype calls for 33 distinct fungal
strains

2. Number of cases/rows: 1 header row plus 22817 rows, 1 for each SNP

3. Missing data codes:
       '.'        Indicates no genotype call made for a particular
                 strain at the current SNP.

4. Variable List

    A. Name: CHROM
       Description: scaffold where SNP located (string)

    B. Name: POS
       Description: position of SNP on the scaffold (integer >0)

    C. Name: REF
       Description: Nucleotide base found in reference sequence at
       this SNP.
       Possible values = A, C, G, T

    D. Name: ALT
       Description: Nucleotide base(s) differing from REF at this SNP that are
       found in at least one strain.
       Possible values = A, C, G, T

    E. Name: AC
       Description: Count of times present in set of genotyped
       strains, for each ALT allele (integer(s))

    F. Name: AF
       Description: Allele frequency for each ALT allele (positive
       number <=1)

    G. Name: x.GT (where x is the strain name)
       Description: Genotype
       Possible values = A, C, G, T, or when data is missing, "."


-----------------------------------------
DATA DESCRIPTION FOR: 3_filter_clonecorrectedSNPs33.out
-----------------------------------------

1. Number of variables: 8, plus genotype calls for 33 distinct fungal
strains

2. Number of cases/rows: 1 header row plus 12021 rows, 1 for each SNP

3. Missing data codes:
       '.'        Indicates no genotype call made for a particular
                 strain at the current SNP.

4. Variable List

    A. Name: Scaffold
       Description: scaffold where SNP located (integer)

    B. Name: position
       Description: position of SNP on the scaffold (integer >0)

    C. Name: SNP_ID
       Description: Unique ID for each SNP (string)
       
    D. Name: allele1
       Description: Nucleotide state of 1 of 2 alleles found at SNP 
       Possible values = A, C, G, T

    E. Name: allele2
       Description: Nucleotide state of 1 of 2 alleles found at SNP
       Possible values = A, C, G, T

    F. Name: missing genotype calls
       Description: Count of strains with missing genotype calls at
       SNP. (integer)

    G. Name: allele1 count
       Description: Count of strains with allele1 at SNP (integer)

    H. Name: allele2 count
       Description: Count of strains with allele2 at SNP (integer)

    I. Name: x (where x is the strain name)
       Description: Genotype
       Possible values = A, C, G, T, or when data is missing, "."


-----------------------------------------
DATA DESCRIPTION FOR: 4a_VSNP24.map
-----------------------------------------

1. Number of variables: 3

2. Number of cases/rows: 5099, 1 for each SNP

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (Scaffold)
       Description: scaffold where SNP located (string)

    B. Name: Column 2 (SNP_ID)
       Description: Unique ID for each SNP (string)

    C. Name: Column 3 (position)
       Description: position of SNP on the scaffold (integer >0)


-----------------------------------------
DATA DESCRIPTION FOR: 4b_VSNP9.map
-----------------------------------------

1. Number of variables: 3

2. Number of cases/rows: 4254, 1 for each SNP

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (Scaffold)
       Description: scaffold where SNP located (string)

    B. Name: Column 2 (SNP_ID)
       Description: Unique ID for each SNP (string)

    C. Name: Column 3 (position)
       Description: position of SNP on the scaffold (integer >0)


-----------------------------------------
DATA DESCRIPTION FOR: 5a_plink24matrix.ld
-----------------------------------------
This file is a symmetrical square matrix of r^2 values for all
possible pairs of SNPs in file 4a_VSNP24.map

1. Number of variables: 5099 columns, 1 for each SNP. 

2. Number of cases/rows: 5099, 1 for each SNP

3. Missing data codes:
        "nan"        r^2 undefined between 2 SNPs due to pattern of
        missing data

4. Variable List

    A. Name: r^2
       Description: r^2 measure of linkage disequilibrium, the square
       of the correlation coefficient of allele state between 2 loci.


-----------------------------------------
DATA DESCRIPTION FOR: 5b_plink9matrix.ld
-----------------------------------------
This file is a symmetrical square matrix of r^2 values for all
possible pairs of SNPs in file 4b_VSNP9.map

1. Number of variables: 4254 columns, 1 for each SNP

2. Number of cases/rows: 4254, 1 for each SNP

3. Missing data codes:
        "nan"        r^2 undefined between 2 SNPs due to pattern of
        missing data
	
4. Variable List

    A. Name: r^2
       Description: r^2 measure of linkage disequilibrium, the square
       of the correlation coefficient of allele state between 2 loci.
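
The way a .map file and its matching .ld matrix fit together can be
sketched in Python as below. This illustrates the general idea of
averaging r^2 by the distance between SNPs and is not the
f_summarizeLD.pl script. The 1000 bp bin size, the restriction to SNP
pairs on the same scaffold, and all example values are assumptions; the
column order of the map file, the space-separated matrix rows, and the
"nan" code come from the descriptions above.

```python
import math
from collections import defaultdict


def mean_r2_by_distance(map_rows, matrix_lines, bin_size=1000):
    """Average r^2 over SNP pairs, grouped into distance bins.

    map_rows:     list of (scaffold, snp_id, position) in matrix order.
    matrix_lines: one string per matrix row, values separated by spaces.
    Returns a dict mapping the lower edge of each distance bin (bp) to
    the mean r^2 of the SNP pairs that fall in that bin.
    """
    snps = [(scaffold, int(position)) for scaffold, _snp_id, position in map_rows]
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, line in enumerate(matrix_lines):
        values = line.split()
        # The matrix is symmetric, so only the upper triangle is used.
        for j in range(i + 1, len(values)):
            if snps[i][0] != snps[j][0]:
                continue  # assumption: skip pairs on different scaffolds
            r2 = float(values[j])
            if math.isnan(r2):
                continue  # "nan" marks an undefined r^2 value
            distance_bin = abs(snps[j][1] - snps[i][1]) // bin_size
            sums[distance_bin] += r2
            counts[distance_bin] += 1
    return {b * bin_size: sums[b] / counts[b] for b in sorted(counts)}


# Invented four-SNP example.
example_map = [
    ("1", "snpA", "100"),
    ("1", "snpB", "600"),
    ("1", "snpC", "2100"),
    ("2", "snpD", "50"),
]
example_matrix = [
    "1 0.8 0.2 0.1",
    "0.8 1 nan 0.3",
    "0.2 nan 1 0.5",
    "0.1 0.3 0.5 1",
]
decay = mean_r2_by_distance(example_map, example_matrix)
```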
       

-----------------------------------------
DATA DESCRIPTION FOR: 6a_FsFtcc2.evec
-----------------------------------------

1. Number of variables: 5

2. Number of cases/rows: 1 header row plus 99 rows (3 x 33 strains)

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (strain name)
       Description: Name of strain, or dummy value for x-y and x-z
       projection of data points (string)

    B. Name: Column 2 (PC1)
       Description: Principal component 1 (number)

    C. Name: Column 3 (PC2)
       Description: Principal component 2 (number)

    D. Name: Column 4 (PC3)
       Description: Principal component 3 (number)

    E. Name: Column 5 (population code)
       Description: Code assigning each strain to one of several
       populations (integer >0). In practice, used to determine color
       of points in scatterplot, so that 3 distinct colors are used for
       each strain because each strain is displayed 3 times, in '3D'
       space as well as projected onto a marginal x-y plane and onto a
       marginal x-z plane.


-----------------------------------------
DATA DESCRIPTION FOR: 6b_Fscc2.evec
-----------------------------------------

1. Number of variables: 5

2. Number of cases/rows: 1 header row plus 72 rows (3 x 24 strains)

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (strain name)
       Description: Name of strain, or dummy value for x-y and x-z
       projection of data points (string)

    B. Name: Column 2 (PC1)
       Description: Principal component 1 (number)

    C. Name: Column 3 (PC2)
       Description: Principal component 2 (number)

    D. Name: Column 4 (PC3)
       Description: Principal component 3 (number)

    E. Name: Column 5 (population code)
       Description: Code assigning each strain to one of several
       populations (integer >0). In practice, used to determine color
       of points in scatterplot, so that 3 distinct colors are used for
       each strain because each strain is displayed 3 times, in '3D'
       space as well as projected onto a marginal x-y plane and onto a
       marginal x-z plane.


-----------------------------------------
DATA DESCRIPTION FOR: 6c_Ftcc3.evec
-----------------------------------------

1. Number of variables: 5

2. Number of cases/rows: 1 header row plus 27 rows (3 x 9 strains)

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (strain name)
       Description: Name of strain, or dummy value for x-y and x-z
       projection of data points (string)

    B. Name: Column 2 (PC1)
       Description: Principal component 1 (number)

    C. Name: Column 3 (PC2)
       Description: Principal component 2 (number)

    D. Name: Column 4 (PC3)
       Description: Principal component 3 (number)

    E. Name: Column 5 (population code)
       Description: Code assigning each strain to one of several
       populations (integer >0). In practice, used to determine the
       color of points in the scatterplot: 3 distinct colors are used
       for each strain because each strain is displayed 3 times, in
       '3D' space as well as projected onto a marginal x-y plane and
       onto a marginal x-z plane.


-----------------------------------------
DATA DESCRIPTION FOR: 6d_PCAC_Vsmartpca3
-----------------------------------------

1. Number of variables: 5

2. Number of cases/rows: 1 header row plus 219 rows (3 x 73 strains)

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (strain name)
       Description: Name of strain, or dummy value for x-y and x-z
       projection of data points (string)

    B. Name: Column 2 (PC1)
       Description: Principal component 1 (number)

    C. Name: Column 3 (PC2)
       Description: Principal component 2 (number)

    D. Name: Column 4 (PC3)
       Description: Principal component 3 (number)

    E. Name: Column 5 (population code)
       Description: Code assigning each strain to one of several
       populations (integer >0). In practice, used to determine the
       color of points in the scatterplot: 3 distinct colors are used
       for each strain because each strain is displayed 3 times, in
       '3D' space as well as projected onto a marginal x-y plane and
       onto a marginal x-z plane.


-----------------------------------------
DATA DESCRIPTION FOR: 6e_PCAFs_Vsmartpca
-----------------------------------------

1. Number of variables: 5

2. Number of cases/rows: 1 header row plus 138 rows (3 x 46 strains)

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (strain name)
       Description: Name of strain, or dummy value for x-y and x-z
       projection of data points (string)

    B. Name: Column 2 (PC1)
       Description: Principal component 1 (number)

    C. Name: Column 3 (PC2)
       Description: Principal component 2 (number)

    D. Name: Column 4 (PC3)
       Description: Principal component 3 (number)

    E. Name: Column 5 (population code)
       Description: Code assigning each strain to one of several
       populations (integer >0). In practice, used to determine the
       color of points in the scatterplot: 3 distinct colors are used
       for each strain because each strain is displayed 3 times, in
       '3D' space as well as projected onto a marginal x-y plane and
       onto a marginal x-z plane.


-----------------------------------------
DATA DESCRIPTION FOR: 6f_PCAFt_Vsmartpca
-----------------------------------------

1. Number of variables: 5

2. Number of cases/rows: 1 header row plus 81 rows (3 x 27 strains)

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (strain name)
       Description: Name of strain, or dummy value for x-y and x-z
       projection of data points (string)

    B. Name: Column 2 (PC1)
       Description: Principal component 1 (number)

    C. Name: Column 3 (PC2)
       Description: Principal component 2 (number)

    D. Name: Column 4 (PC3)
       Description: Principal component 3 (number)

    E. Name: Column 5 (population code)
       Description: Code assigning each strain to one of several
       populations (integer >0). In practice, used to determine the
       color of points in the scatterplot: 3 distinct colors are used
       for each strain because each strain is displayed 3 times, in
       '3D' space as well as projected onto a marginal x-y plane and
       onto a marginal x-z plane.


-----------------------------------------
DATA DESCRIPTION FOR: 7_dadi_input_full
-----------------------------------------

1. Number of variables: 10

2. Number of cases/rows: 1 header row plus 9845 rows, 1 for each SNP

3. Missing data codes:
 None

4. Variable List

    A. Name: Ref
       Description: Reference nucleotide at SNP position and 2
       flanking positions.

    B. Name: Out
       Description: Inferred ancestral nucleotide at SNP position
       (based on 2 outgroup species) and 2 flanking positions.

    C. Name: Allele1
       Description: Nucleotide state of 1 of 2 alleles found at SNP 
       Possible values = A, C, G, T
       
    D. Name: fs (allele1)
       Description: Count of Fs strains with allele1 at SNP (integer)

    E. Name: ft (allele1)
       Description: Count of Ft strains with allele1 at SNP (integer)

    F. Name: Allele2
       Description: Nucleotide state of 1 of 2 alleles found at SNP
       Possible values = A, C, G, T

    G. Name: fs (allele2)
       Description: Count of Fs strains with allele2 at SNP (integer)

    H. Name: ft (allele2)
       Description: Count of Ft strains with allele2 at SNP (integer)

    I. Name: scaff
       Description: Scaffold where SNP is located (integer)

    J. Name: pos
       Description: Position of SNP on the scaffold (integer >0)
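
To make the relationship between the Out column and the two allele
columns concrete, the Python sketch below picks out the derived
(non-ancestral) allele counts for one SNP. It assumes the columns are
whitespace-delimited in the order listed above, and that the Ref and
Out fields are 3 characters long with the SNP position in the middle;
these details are not stated in this README. The function name
derived_counts and the sample lines are made up for the example.

```python
def derived_counts(line):
    """Return (Fs count, Ft count) of the derived allele, or None.

    None is returned when the inferred ancestral state matches neither
    allele, so the SNP cannot be polarized.
    """
    fields = line.split()  # ASSUMPTION: whitespace-delimited columns
    ref, out, a1, fs1, ft1, a2, fs2, ft2, scaff, pos = fields
    ancestral = out[1]  # ASSUMPTION: SNP is the middle of 3 characters
    if ancestral == a1:
        return int(fs2), int(ft2)  # allele2 is derived
    if ancestral == a2:
        return int(fs1), int(ft1)  # allele1 is derived
    return None


# Made-up rows: Ref, Out, Allele1, fs, ft, Allele2, fs, ft, scaff, pos
polarized = derived_counts("ACG ATG C 20 2 T 4 7 12 3456")
unpolarized = derived_counts("ACG A-G C 20 2 T 4 7 12 3456")
```

In the first made-up row the ancestral state is T (allele2), so the
counts returned are those of allele1.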


-----------------------------------------
DATA DESCRIPTION FOR: 8_batch_4321_seqinput_log.csv
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 40589 rows, 1 for each GBS locus

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (LocusID)
       Description: Number corresponding to ID of GBS locus (integer)

    B. Name: Column 2 (Scaffold)
       Description: Scaffold where GBS locus is found (string)

    C. Name: Column 3 (Start position)
       Description: Position on scaffold at which GBS locus starts (integer)
       
    D. Name: Column 4 (Orientation)
       Description: Orientation of GBS reads
       '+' = reference strand (such that position of start of read is smaller than
       position of end of read)
       '-' = reverse complement of reference strand (such that
       position of start of read is larger than position of end of read)

    E. Name: Column 5 (Ft_count)
       Description: Count of Ft strains with sequence at this locus (integer)

    F. Name: Column 6 (Fs_count)
       Description: Count of Fs strains with sequence at this locus (integer)

    G. Name: Column 7 (Refseq)
       Description: Sequence of reference genome at GBS locus (string)
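
As one possible use of the columns above, the Python sketch below
lists the loci that were sequenced in at least one strain of each
species. It assumes the file is comma-delimited (suggested by the
.csv extension) and has no header row; the function name shared_loci
and the sample rows are made up for the example.

```python
import csv
import io


def shared_loci(handle):
    """Return IDs of loci with sequence in both Ft and Fs strains."""
    shared = []
    for row in csv.reader(handle):  # ASSUMPTION: comma-delimited
        if len(row) != 7:
            continue  # skip blank or malformed lines
        locus_id, scaffold, start, orient, ft_count, fs_count, refseq = row
        if int(ft_count) > 0 and int(fs_count) > 0:
            shared.append(int(locus_id))
    return shared


# Made-up rows in the column order described above.
sample = io.StringIO(
    "1,scaffold_1,1045,+,9,24,TGCAGAAC\n"
    "2,scaffold_1,2210,-,0,17,TGCAGTTG\n"
    "3,scaffold_2,88,+,5,0,TGCAGCCA\n"
)
shared = shared_loci(sample)
```

With a real file, open it with open(path, newline="") and pass the
handle to shared_loci in place of the in-memory sample.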


-----------------------------------------
DATA DESCRIPTION FOR: 9_locusnucdiv_summary
-----------------------------------------

1. Number of variables: 15

2. Number of cases/rows: 1 header row plus 11880 rows, one for each
locus where nucleotide diversity could be computed

3. Missing data codes:
 None

4. Variable List

    A. Name: Column 1 (locus_id)
       Description: Unique ID for each locus (integer)

    B. Name: Column 2 (Scaffold)
       Description: Scaffold where GBS locus is found (string)

    C. Name: Column 3 (Chromosome)
       Description: Chromosome (from an assumed 12 for the reference
       genome) where GBS locus is found (integer)

    D. Name: Column 4 (Scaffold position)
       Description: Position on scaffold at which GBS locus starts (integer)

    E. Name: Column 5 (Chromosome position)
       Description: Estimated position on chromosome at which GBS locus starts (integer)

    F. Name: Column 6 (Orientation)
       Description: Orientation of GBS reads
       '+' = reference strand (such that position of start of read is smaller than
       position of end of read)
       '-' = reverse complement of reference strand (such that
       position of start of read is larger than position of end of read)

    G. Name: Ntot
       Description: Number of total strains with sequence at GBS locus (integer)

    H. Name: nogap (total)
       Description: Length in bp of alignment of all strains excluding
       gap positions (integer)
       
    I. Name: thetapi (total)
       Description: Per bp nucleotide diversity of locus for sample
       that includes all strains of both species (number)

    J. Name: NFt
       Description: Number of Ft strains with sequence at GBS locus (integer)

    K. Name: nogap (Ft)
       Description: Length in bp of alignment of Ft strains excluding
       gap positions (integer)

    L. Name: thetapi (Ft)
       Description: Per bp nucleotide diversity of locus for sample
       that includes all Ft strains (number)

    M. Name: NFs
       Description: Number of Fs strains with sequence at GBS locus (integer)

    N. Name: nogap (Fs)
       Description: Length in bp of alignment of Fs strains excluding
       gap positions (integer)

    O. Name: thetapi (Fs)
       Description: Per bp nucleotide diversity of locus for sample
       that includes all Fs strains (number)
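
Because thetapi is given per bp and loci differ in ungapped length,
one way to summarize diversity across loci is a length-weighted mean.
The Python sketch below does this for the Fs columns. It assumes the
data rows are whitespace-delimited with the 15 columns in the order
listed above, and it is not necessarily the summary used in the
related publications. The function name weighted_thetapi and the
sample rows are made up for the example.

```python
def weighted_thetapi(lines, nogap_col=13, theta_col=14):
    """Length-weighted mean per-bp diversity across loci.

    Defaults point at columns 14 and 15 (0-based 13 and 14), i.e.
    nogap (Fs) and thetapi (Fs). The first line is skipped as header.
    """
    total_len = 0
    total_diff = 0.0
    for line in lines[1:]:
        fields = line.split()  # ASSUMPTION: whitespace-delimited rows
        if len(fields) != 15:
            continue  # skip blank or malformed lines
        nogap = int(fields[nogap_col])
        total_len += nogap
        total_diff += nogap * float(fields[theta_col])
    return total_diff / total_len if total_len else float("nan")


# Made-up rows: 15 columns in the order described above.
sample = [
    "header",
    "1 scaffold_1 1 1045 1045 + 33 100 0.015 9 100 0.002 24 100 0.010",
    "2 scaffold_1 1 2210 2210 - 30 50 0.030 8 50 0.001 22 50 0.040",
]
fs_mean = weighted_thetapi(sample)
ft_mean = weighted_thetapi(sample, nogap_col=10, theta_col=11)
```

Passing nogap_col=7 and theta_col=8 would give the same summary for
the combined sample of both species.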


--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Software-specific information:

See the peer-reviewed publications related to this dataset for all
software-specific information related to the generation and analysis
of the datasets included here.
All data files here are in simple text formats and need no
specialized software to read them.


2. Equipment-specific information:

See the peer-reviewed publications related to this dataset for any
equipment-specific information.


3. Date of data collection (single date, range, approximate date):

Next-generation sequencing data were collected during Summer 2016.
All analyses reported here were conducted between Summer 2016 and
September 2020.