signatures of selection analysis¶
This sub-workflow is intended for genome-wide detection of signatures of
selection based on phased and unphased data. It can be enabled with the
argument: sig_sel = true . It has following methods :
pairwise Wright’s fixation index (Fst) as implemented in VCFtools (v 0.1.16)
Tajima’s D as implemented in VCFtools (v 0.1.16)
nucleotide diversity (pi) as implemented in VCFtools (v 0.1.16)
composite likelihood ratio (CLR) as implemented in SweepFinder2 (v 1.0)
integrated haplotype score (iHS) as implemented in selscan (v 1.2.0a)
cross-population extended haplotype homozygosity (XP-EHH) as implemented in selscan (v 1.2.0a)
Description of the options and parameters:¶
Following options require boolean parameter as an input, meaning, either “true” or “false” (default is false).
sig_sel : setting this to “false” will not run this entire workflow.
skip_sel_outgroup: an option to skip the outgroup when applying
signatures of selection.
tajimas_d: whether or not to calculate Tajima’s D.
pi: whether or not to calculate nucleotide diversity measure, pi.
pairwise_fst: whether or not to calculate pairwise Fst in windows or
for each SNP in every possible pair of populations.
single_vs_all_fst : if this option is set to true, then the workflow
will take one population at a time as one member of the pair and the
remaining samples in the dataset as another member of the pair to
calculate pairwise Fst.
skip_chrmwise: an option relevant only for the methods implemented
in VCFtools; setting this to true will calculate Fst, Tajima’s D and pi
for each population after concatenating all chromosome-wise vcf files
into one genome-wide vcf file per population.
clr: whether or not to calculate CLR.
use_precomputed_afs: selective sweeps with pre-computed empirical
spectrum, option “-l” in SweepFinder2.
skip_phasing: setting this to false will skip phasing the data for
the analyses with selscan, meaning that the workflow assumed that the
supplied vcf files are already phased.
impute_status: an option to impute the missing genotypes (the
reference vcf file need to be provided for that), option “impute” in
Beagle.
ihs: whether or not to calculate iHS.
xpehh : whether or not to calculate XP-EHH.
Following options require file path or string as an input (default is none).
skip_pop: the path to the text file containing population IDs that
will be skipped from all the analyses implemented in this sub-workflow.
ref_vcf: the path to the csv file with these two columns: chromosome
ID and the path to its respective reference file in vcf format which
will be used for imputation and phasing, option “ref” in Beagle. For
reference, refer to this example file
test_files/test_reference_imputation_panel.csv.
anc_files: in case of SweepFinder2 or selscan, including the
information about ancestral and derived allele increases power of the
analyses, by invoking this option it is possible to include this
information. It takes any of these three parameters: (i) “create”:
supplying this string as parameter will invoke the process (described in
the next section) of detecting ancestral alleles using the outgroup
samples present in the vcf files; (ii). “none”: supplying this string as
parameter will perform selscan and SweepFinder2 analyses without
considering information about ancestral alleles; (iii). if none of the
“create” or “none” is supplied, then the parameter is assumed to be the
path to the csv file having these two columns: chromosome ID and the
path to its respective space separated file containing information about
ancestral alleles. For reference, refer to these example files:
ihs_args: optional
parameters
that can be applied to iHS computation.
xpehh_args : optional
parameters
that can be applied to XP-EHH computation.
selscan_map: by invoking this option, it is possible to include
recombination map in SweepFinder2 analysis. It takes any of these three
parameters: (i). path to the csv file having these two columns:
chromosome ID and the path to its respective recombination map; (ii).
“default”: create map file with genetic and physical positions for each
variant site using default conversion (1 cM = 1 Mbp); (iii). “none”: the
information about recombination map will not be considered in
SweepFinder2 analyses.
cm_map: an option to provide a path to the csv file having these two
columns: chromosome ID and the path to its respective PLINK formated
genetic map with cM units, option “map” in Beagle.
use_recomb_map: selective sweeps with pre-computed empirical
spectrum and recombination map, option “-lr” in SweepFinder2. This
argument takes any of these three options: (i). if set to “default”, the
workflow will create an input file with recombination rates assuming 1cM
= 1Mbp; (ii). if set to “none”, the SweepFinder2 analysis will be run
without recombination map file; (iii), if it is neither “default” nor
“none”, the third option is path to the csv file having these two
columns: chromosome ID and the path to its respective recombination map
file in the format as recognized by SweepFinder2.
Following arguments require integer as an input.
min_samples_per_pop: minimum sample size of the population to be
used in the analyses. Default: 2
tajimasd_window_size: the desired window size for Tajima’s D
calculation. Default: 50000
fst_window_size: the desired window size for pairwise Fst
calculation. Default: 50000
fst_step_size: the desired step size between windows for Fst
calculation. Default: -9. Any value greater or equal to zero means that
window and step size are equal.
pi_window_size: the desired window size for pi calculation. Default:
50000.
pi_step_size: the desired step size between windows for pi
calculation. Default: -9. Any value greater or equal to zero means that
window and step size are equal.
grid_space: the spacing in number of nucleotides between grid
points, option “g” in SweepFinder2. Default: 50000
grid_points: the number of points equally spaced across the genome,
option “G” in SweepFinder2. Default: -9. Any value greater than zero
will carry out SweepFinder2 analyses with grid points and ignore the
value specified in grid space.
burnin_val: the maximum number of burnin iterations, option “burnin”
in Beagle. Default: 3
iterations_val: the number of iterations, option “iterations” in
Beagle. Default: 12
ne_val: the effective population size, option “ne” in Beagle.
Default: 10000000
Overview of the processed carried out in this sub-workflow:¶
Computing iHS and XP-EHH requires phased input data while calculation of Fst, Tajima’s D, pi and CLR don’t. Following is the brief summary of the processes carried out by this sub-workflow:
1. splitting individuals’ IDs by population into each separate file according to information provided in the sample map file and removing the populations that don’t satisfy the threshold of minimum sample size.
Using VCFtools to calculate Tajima’s D, Weir’s Fst and pi
2. calculating Tajima’s D
3. calculating pi
4. calculating Fst for each population pair combination
5. calculating Fst for pair combinations of population versus all other
Detection of ancestral alleles usingest-sfs
Analyses implemented in VCFtools do not require outgroup/ancestral alleles. In case of CLR as implemented in SweepFinder2 as well as in case of iHS and XP-EHH as implemented in selscan, using the information of ancestral allele vs. derived allele increases the power. Therefore, if the outgroup is present in the vcf files, the following processes are carried out before applying SweepFinder2 and selscan analyses:
6. if the outgroup samples are present in the vcf file, ancestral alleles will be detected using est-sfs.
**Note:** Porgram est-sfs detect ancestral alleles only for the sites, where all samples are genotyped and not a single sample has a missing genotype at this position.
7. create a new vcf file by extracting the sites for which the program est-sfs has detected ancestral alleles.
Using SweepFinder2 to calculate CLR
8. preparing input files for SweepFinder2: splitting sample map (the same as in step 1)
9. preparing input files for SweepFinder2: genome wide allele frequency and recombination files with in-house Python scripts
10. computing the empirical frequency spectrum with SweepFinder2
11. calculating CLR with SweepFinder2
There are several ways of running SweepFinder2 analysis: (i). to run
without recombination map and pre-computed empirical frequency spectrum
(section 5.1 of SweepFinder2 manual), provide “none” to parameter
use_recomb_map and set use_precompute_afsto false; (ii). to
run without recombination map but with pre-computed empirical frequency
spectrum (section 5.2 of the SweepFinder2 manual), provide “none” to
parameter use_recomb_mapbut set use_precompute_afsto true;
(iii). to run with both recombination map and pre-computed empirical
frequency spectrum (section 5.3 of SweepFinder2 manual), either provide
“default” or file path to the use_recomb_mapand set
use_precompute_afsto true.
Using selscan to calculate iHS and XP-EHH
12. preparing input files for selscan: phasing genotypes with the program Beagle
13. preparing input files for selscan: a map file specifying physical distances
14. preparing input files for selscan: splitting the phased vcf files by each population
15. calculating iHS
16. calculating XP-EHH for each population pair
Description of the output files and directory-structure generated by this sub-workflow:¶
If the pipeline has completed successfully, results of it will be stored in ${output directory}/selection/. Inside this directory, following directories will be created (depending on the parameters set):
plot¶
The directory structure of “input_pop” is shown below:
plot¶
Each directory contains the list of population and samples included in the respective analysis.
The directory structure of “vcftools” is shown below:
plot¶
There will be one directory for each analysis performed. The screenshot
above shows the directory for “tajima_d”, which stores the results
of calculated Tajima’s D values. Inside, there will be a directory for
each chromosome, in which are then results for each population.
Likewise, if the option pi is set to true, there will be another
directory for “pi” besides “tajima_d”.
The directory structure of “ancestral_alleles_determination” is shown below:
plot¶
There will be 12 files for each chromosome, of these six files are inputs and outputs of “ests-sfs” tool:
1). *_config.txt : the configuration file containing parameters to run est-sfs
2). *_data.txt : the data file
3). *_seed.txt: the text file containing positive integer value
4). *_out_sfs.txt: the output file containing estimated uSFS vector.
5). *_out_pvalue.txt: the output file containing the estimated ancestral state probabilities for each site.
6). *_estsfs.log: the log file of est-sfs.
For detailed description of these files, refer to the manual of est-sfs.
The description of the remaining six files are as follows:
7). *_non_missing_sites.map: this text file contains three columns: chromsome ID, position, and the information about the major allele. If the major allele is based on the reference, then code in the third column will be 0 else it will be 1 (the major allele is alternative allele).
8). *_outgroup.txt: information about population used as outgroup
9). *.anc: the text file containing information about ancestral alleles. This text file will be used in the processes of selscan and SweepFinder2 analyses. It contains four columns: chromosome ID, position, ancestral allele, derived allele. Number 0 refers to the reference allele and 1 refers to the alternative allele.
10). *_pos_with_anc_alleles.vcf.gz: This vcf file contains only those positons where ancestral and derived alleles were determined. It will also be used for SweepFinder2 and selscan analyses.
11). *_pos_with_anc_alleles.vcf.gz.tbi: index file of the above file
12). *_pos_with_anc_alleles.log: the log file containing the commands used to generate file in steps 10 and 11.
The directory structure of “sweepfinder2” is shown below:
plot¶
It contains two directories: “input_files” and “results”. Inside the “input_files” directory, there will be input files used to run SweepFinder2 analysis. Inside the “results” directory, there will be a directory for each chromosome. Inside this directory, there will be results for each population.
The directory structure of “selscan” is shown below:
plot¶
There will be one directory for each performed analysis. In this example, only iHS were computed. Inside directory “iHS” are two more sub-directories: “input_files” and “results”. In the “input_files” directory are input files used to run iHS analysis for each population as well for each chromosome. Within the “results” each chromosome has a separate directory, where are two files for every population:
1). *vcf.iHS.out: this is the raw output file generated by selscan. This output is not normalized.
2). *vcf_anc.iHS.out: this is the output file using the information of ancestral allele. This output is also not normalized.
Validation of the sub-workflow:¶
For workflow validation, we have downloaded publicly available samples (see map below) with whole genome sequences from NCBI database (Alberto et al., 2018; Grossen et al., 2020; Henkel et al., 2019). We included domestic goats (Capra hircus) represented by various breeds from Switzerland. In addition to them, we also included Alpine ibex (C. ibex) and Bezoar wild goat (C. aegagrus). Since we need an outgroup when performing some of the analyses, we also added Urial sheep (Ovis vignei). We will use variants from chromosome 28 and 29 of, all together, 85 animals.
Geographic map of samples used for the testing and validation
purpose
1. Required input data files¶
The input data should be in the VCF or PLINK binary format files.
All VCF files need to be splitted by the chromosomes and indexed with tabix. Please check *test_files/test_input_vcf.csv* or the example below, where, in our case, we inserted the link to the cloud stored data. The first information in each row of input file is chromosome id, next is path/to/the/file.vcf.gz and the last is path/to/the/file.vcf.gz.tbi. Please note that the chromosome id must not contain any punctuation marks.
chr28,https://data.cyverse.org/dav-anon/iplant/home/maulik88/28\_filt\_samples.vcf.gz,https://data.cyverse.org/dav-anon/iplant/home/maulik88/28\_filt\_samples.vcf.gz.tbi
chr29,https://data.cyverse.org/dav-anon/iplant/home/maulik88/29\_filt\_samples.vcf.gz,https://data.cyverse.org/dav-anon/iplant/home/maulik88/29\_filt\_samples.vcf.gz.tbi
In addition to the VCF input format, it is also necessary to prepare a sample map file of individuals and populations. Sample map has two tab-delimited columns: in the first column are individual IDs and in the second are population IDs as demonstrated on the example below. It is also important that the name of the file ends with “.map”.
SRR8437780ibex AlpineIbex
SRR8437782ibex AlpineIbex
SRR8437783ibex AlpineIbex
SRR8437791ibex AlpineIbex
SRR8437793ibex AlpineIbex
SRR8437799ibex AlpineIbex
SRR8437809ibex AlpineIbex
SRR8437810ibex AlpineIbex
SRR8437811ibex AlpineIbex
SRX5250055\_SRR8442974 Appenzell
SRX5250057\_SRR8442972 Appenzell
SRX5250124\_SRR8442905 Appenzell
SRX5250148\_SRR8442881 Appenzell
SRX5250150\_SRR8442879 Appenzell
SRX5250151\_SRR8442878 Appenzell
SRX5250153\_SRR8442876 Appenzell
SRX5250155\_SRR8442874 Appenzell
SRX5250156\_SRR8442873 Appenzell
SRX5250157\_SRR8442872 Appenzell
340330\_T1 Bezoar
340331\_T1 Bezoar
340334\_T1 Bezoar
340340\_T1 Bezoar
340345\_T1 Bezoar
340347\_T1 Bezoar
340426\_T1 Bezoar
470100\_T1 Bezoar
470104\_T1 Bezoar
470106\_T1 Bezoar
...
454948\_T1 Urial
ERR454947urial Urial
SRR12396950urial Urial
For the Plink binary input, user need to specify the path to the
BED/BIM/FAM files in the section of general parameters:
input= "path/to/the/files/\*.{bed,bim,fam}" ### 2. Optional input
data files
In this sub-workflow, the user can list population IDs that should be
excluded in the analyses (skip_pop). For example, as we are only
interested to investigate signatures of selection in domestic goat
breeds, we excluded Alpine ibexes and Bezoar wild goats. We provided a
text file with population IDs in one column:
AlpineIbex
Bezoar
3. Setting the parameters¶
At the beginning, we have to specify some of the general parameters, which can be found in the first tab of GUI (general_param):
input: path to the .csv input file for the VCF format or names of
the PLINK binary files;
outDir: the name of the output folder;
sample_map: path to the file with the suffix “.map” that have listed
individuals and populations as addition to VCF input;
concate_vcf_prefix: file prefix of the genome-wise merged vcf files;
geo_plot_yml: path to the yaml file containing parameters for
plotting the samples on a geographical map;
tile_yml: path to the yaml file containing parameters for the
geographical map to be used for plotting;
f_chrom_len: path to the file with chromosomes’ length for the Plink
binary inputs;
f_pop_cord: path to the file with geographical locations for map
plotting;
f_pop_color: path to the file with specified colors for map
plotting;
fasta: the name of the reference genome fasta file that will be used
for converting in case of PLINK input;
allow_extra_chrom: set to true if the input contains chromosome name
in the form of string;
max_chrom: maximum number of chromosomes;
outgroup: the population ID of the outgroup;
cm_to_bp: the number of base pairs that corresponds to one cM
When we have filled in all the general parameters, we can move to the tab **sig_sel_params**, which is intended for analysing signatures of selection. Specify here the parameters described at the beginning of this documentation. At the end, save the parameters as yml file.
After setting all parameters and exporting them as yml file, we are ready to start the workflow. Choose any profile, we prefer mamba, and set the maximum number of processes, 10 in our case, that can be executed in parallel by each executor. From within the **scalepopgen** folder, execute the following command:
nextflow run scalepopgen.nf -params-file sig\_sel.yml -profile mamba -qs 10
You can check all the other command running options with the option help :
nextflow run scalepopgen.nf -help
If the analyses processed successfully, the command line output is looking like this:
N E X T F L O W ~ version 23.04.1
Launching `scalepopgen.nf` [astonishing\_torricelli] DSL2 - revision: b2755eec5b
WARN: Access to undefined parameter `help` -- Initialise it to a default value eg. `params.help = some\_value`
executor > local (44)
[2b/18f285] process > GENERATE\_POP\_COLOR\_MAP (generating pop color map) [100%] 1 of 1, cached: 1 ✔
[87/2886a0] process > FILTER\_SITES (filter\_sites\_CHR29) [100%] 2 of 2, cached: 2 ✔
[48/f09ccf] process > RUN\_SEL\_VCFTOOLS:SPLIT\_MAP\_FOR\_VCFTOOLS (splitting\_idfile\_by\_pop) [100%] 1 of 1, cached: 1 ✔
[a6/85c03d] process > RUN\_SEL\_VCFTOOLS:CONCAT\_VCF (concate\_vcf) [100%] 1 of 1, cached: 1 ✔
[bc/c52fb5] process > RUN\_SEL\_VCFTOOLS:CALC\_TAJIMA\_D (calculating\_tajima\_d) [100%] 8 of 8, cached: 8 ✔
[71/d9eac6] process > RUN\_SEL\_VCFTOOLS:MANHATTAN\_TAJIMAS\_D (generating\_mahnattan\_plot) [100%] 8 of 8 ✔
[6e/53befa] process > RUN\_SEL\_VCFTOOLS:CALC\_PI (calculating\_pi) [100%] 8 of 8, cached: 8 ✔
[26/e512bd] process > RUN\_SEL\_VCFTOOLS:MANHATTAN\_PI (generating\_mahnattan\_plot) [100%] 8 of 8 ✔
[d4/2b7e7c] process > RUN\_SEL\_VCFTOOLS:CALC\_WFST (calculating\_pairwise\_fst) [100%] 28 of 28, cached: 26 ✔
[8a/46f2d5] process > RUN\_SEL\_VCFTOOLS:CALC\_WFST\_ONE\_VS\_REMAINING (calculating\_one\_vs\_remaining\_fst) [100%] 8 of 8, cached: 8 ✔
[24/aa8283] process > RUN\_SEL\_VCFTOOLS:MANHATTAN\_FST (generating\_mahnattan\_plot) [100%] 8 of 8 ✔
[08/153d29] process > RUN\_SEL\_SWEEPFINDER2:SPLIT\_FOR\_SWEEPFINDER2 (splitting\_idfile\_by\_pop) [100%] 1 of 1, cached: 1 ✔
[d5/ae7ef7] process > RUN\_SEL\_SWEEPFINDER2:PREPARE\_SWEEPFINDER\_INPUT (sweepfinder\_input\_CHR28) [100%] 16 of 16, cached: 16 ✔
[a7/aae8e6] process > RUN\_SEL\_SWEEPFINDER2:COMPUTE\_EMPIRICAL\_AFS (sweepfinder\_input\_Grigia) [100%] 8 of 8, cached: 8 ✔
[02/5b5d06] process > RUN\_SEL\_SWEEPFINDER2:RUN\_SWEEPFINDER2 (sweepfinder\_input\_Peacock) [100%] 16 of 16, cached: 16 ✔
[dd/c91de1] process > RUN\_SIG\_SEL\_PHASED\_DATA:SPLIT\_FOR\_SELSCAN (splitting\_idfile\_by\_pop) [100%] 1 of 1, cached: 1 ✔
[4c/b15966] process > RUN\_SIG\_SEL\_PHASED\_DATA:PHASING\_GENOTYPE\_BEAGLE (phasing\_CHR28) [100%] 2 of 2, cached: 2 ✔
[23/35d715] process > RUN\_SIG\_SEL\_PHASED\_DATA:PREPARE\_MAP\_SELSCAN (preparing\_selscan\_map\_CHR29) [100%] 2 of 2, cached: 2 ✔
[fa/936a01] process > RUN\_SIG\_SEL\_PHASED\_DATA:SPLIT\_VCF\_BY\_POP (split\_vcf\_by\_pop\_CHR29) [100%] 2 of 2 ✔
[da/c5f1bf] process > RUN\_SIG\_SEL\_PHASED\_DATA:CALC\_iHS (calculating\_iHS\_CHR28) [100%] 16 of 16, cached: 16 ✔
[cd/67a405] process > RUN\_SIG\_SEL\_PHASED\_DATA:NORM\_iHS [100%] 16 of 16, cached: 16 ✔
[d1/198711] process > RUN\_SIG\_SEL\_PHASED\_DATA:CALC\_XPEHH (calculating\_xpehh\_CHR28) [100%] 56 of 56, cached: 56 ✔
[01/dfc42f] process > RUN\_SIG\_SEL\_PHASED\_DATA:NORM\_XPEHH [100%] 56 of 56, cached: 56 ✔
Completed at: 10-Sep-2023 17:36:51
Duration : 2h 32m 41s
CPU hours :
Succeeded :
4. Description of the output:¶
Results are stored in the specified output folder, more precisely in the
folder **selection**. In the sub-folder
**interactive_manhattan_plots**, we will take a look at interactive
plots of calculated Fst, pi and Tajima’s D. Plots were made separatly
for all included populations. For example, in the case of Saanen breed
from Switzerland, we detected peaks on chromosome 29 (NC_030836) that
have higher Fst values as shown in the figure below:
If we move with the pointer to one of the peaks, it will show us the link to explore this genomic region in Ensembl.
References¶
Please cite the following papers if you use this sub-workflow in your study:
[1] Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R., & 1000 Genomes Project Analysis Group (2011). The variant call format and VCFtools. Bioinformatics (Oxford, England), 27(15), 2156–2158. https://doi.org/10.1093/bioinformatics/btr330
[2] Keightley, P. D., & Jackson, B. C. (2018). Inferring the Probability of the Derived vs. the Ancestral Allelic State at a Polymorphic Site. Genetics, 209(3), 897–906. https://doi.org/10.1534/genetics.118.301120
[3] DeGiorgio, M., Huber, C. D., Hubisz, M. J., Hellmann, I., & Nielsen, R. (2016). SweepFinder2: increased sensitivity, robustness and flexibility. Bioinformatics (Oxford, England), 32(12), 1895–1897. https://doi.org/10.1093/bioinformatics/btw051
[4] Browning, B. L., Tian, X., Zhou, Y., & Browning, S. R. (2021). Fast two-stage phasing of large-scale sequence data. American journal of human genetics, 108(10), 1880–1890. https://doi.org/10.1016/j.ajhg.2021.08
[5] Szpiech, Z. A., & Hernandez, R. D. (2014). selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Molecular biology and evolution, 31(10), 2824–2827. https://doi.org/10.1093/molbev/msu211
[6] Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature biotechnology, 35(4), 316–319. https://doi.org/10.1038/nbt.3820
License¶
MIT