.. _explore-genetic-structure: PCA, IBS-based clustering, Fst and ADMIXTURE analysis ================================================================== This sub-workflow carried out Admixture analysis, two different Principal Component Analyses, and clustering based on Identity by State distances and Fst . The first PCA is done with the function SmartPCA of the `EIGENSOFT `__ software and performed with the argument ``run_smartpca``. The second one is done by using the function snpgdsPCA inside the R package `SNPRelate `__ and can be invoked with the argument ``run_gds_pca``. Analysis with the program `Admixture `__ are carried out with the argument ``admixture``. Both clustering are done with `Plink `__ calculations of Fst and IBS. Description of the parameters: ============================== | ``genetic_structure``: setting this to false will not run this sub-workflow | ``ld_filt``: an option to use LD-based filtering according to parameters specified below, meaning to include or skip step 4 | ``ld_window_size``: a window size in variant count or kilo bases (step 4) | ``ld_step_size``: number of variants to shift the window at the end of each step (step 4) | ``r2_value``: squared correlation threshold; at each step, only pairs of variants with r2 greater than the threshold are recognized (step 4) | ``structure_remove_indi``: the name of a file with listed samples that will be removed in **all of anylses** in this sub-workflow | ``smartpca_param``: the path to the file with additional parameters for SmartPCA (step 6) | ``pop_color_file``: the path to the text file with specified color codes for each population (steps 7, 9, 10 and 11) | ``f_pop_marker``: the path to the text file with specified marker shapes for each population (steps 7 and 9) | ``pca_yml``: the path to the yml file containing the parameters to plot interactive PCA results (steps 7 and 9) | ``starting_k_value``: starting number of clusters for Admixture analysis (step 12) | ``ending_k_value``: maximal number of clusters for Admixture analysis (step 12) | ``cross_validation``: the number of folds for cross-validation (step 13) | ``termination_criteria``: the lowest limit of the log-likelihood change between iterations to stop the process of cross-validation (step 13), option “-C” in program Admixture | ``plot_pop_order``: the path to the text file with population IDs in order they should be plotted on admixture graph (step 14) | ``fst_based_nj_tree``: an option to estimate NJ tree based on pairwise Fst distances between each pair of populations (step 10) | ``fst_nj_yml``: the path to the yml file containing parameters of plotting interactive NJ tree based on Fst (step 10) | ``est_1_min_ibs_based_nj_tree``: an option to estimate NJ tree based on 1-ibs distance between each pair of samples in the dataset (step 11) | ``ibs_nj_yml``: the path to the yml file containing parameters of plotting interactive NJ tree based on 1-ibs (step 11) Overview of the processed carried out in this sub-workflow: =========================================================== | **1.** converting vcf to bed file format using Plink | **2.** merging converted bed files (separated by chromosomes) with Plink | **3.** removing listed individuals | **4.** applying LD-based filtering of bed files with Plink | **5.** updating chromosome names from strings to integer numbers | **6.** running PCA with SmartPCA | **7.** plotting the results of SmartPCA | **8.** running PCA with snpgdsPCA | **9.** plotting the results of snpgdsPCA | **10.** calculating pairwise Fst and plot NJ-based tree | **11.** calculating IBS distances and plot NJ-based tree | **12.** running admixture from the starting and ending k-value selected by the user | **13.** selecting the optimal number of clusters (K) regarding the lowest cross validation error | **14.** plotting the admixture results and generating Pong input **Note:** If your input files are already in Plink format, it will skip step 1) and step 2). Also, if the LD-based filtering is set to true, all the analyses will be carried out using LD-based pruned dataset. Validation and test-run of the sub-workflow:: ============================================= For workflow validation, we have downloaded publicly available samples (see map below) with whole genome sequences from NCBI database (Alberto et al., 2018; Grossen et al., 2020; Henkel et al., 2019). We included domestic goats (*Capra hircus*) represented by various breeds from Switzerland. In addition to them, we also included Alpine ibex (*C. ibex*) and Bezoar wild goat (*C. aegagrus*). Since we need an outgroup when performing some of the analyses, we also added Urial sheep (*Ovis vignei*). We will use variants from chromosome 28 and 29 of, all together, 85 animals. |plot| Geographic map of samples used for the testing and validation purpose | Alberto et al. (2018). Convergent genomic signatures of domestication in sheep and goats. *Nature communications*, https://doi.org/10.1038/s41467-018-03206-y | Grossen et al. (2020). Purging of highly deleterious mutations through severe bottlenecks in Alpine ibex. *Nature communications*, https://doi.org/10.1038/s41467-020-14803-1 | Henkel et al. (2019). Selection signatures in goats reveal copy number variants underlying breed-defining coat color phenotypes. *PLoS genetics*, https://doi.org/10.1371/journal.pgen.1008536 1. Required input data files ---------------------------- The input data should be in the VCF or PLINK binary format files. All VCF files need to be splitted by the chromosomes and indexed with tabix. Please check *test_files/test_input_vcf.csv* or the example below, where, in our case, we inserted the link to the cloud stored data. The first information in each row of input file is chromosome id, next is path/to/the/file.vcf.gz and the last is path/to/the/file.vcf.gz.tbi. Please note that the chromosome id must not contain any punctuation marks. :: chr28,https://data.cyverse.org/dav-anon/iplant/home/maulik88/28_filt_samples.vcf.gz,https://data.cyverse.org/dav-anon/iplant/home/maulik88/28_filt_samples.vcf.gz.tbi chr29,https://data.cyverse.org/dav-anon/iplant/home/maulik88/29_filt_samples.vcf.gz,https://data.cyverse.org/dav-anon/iplant/home/maulik88/29_filt_samples.vcf.gz.tbi In addition to the VCF input format, it is also necessary to prepare a sample map file of individuals and populations. Sample map has two tab-delimited columns: in the first column are individual IDs and in the second are population IDs as demonstrated on the example below. It is also important that the name of the file ends with “.map”. :: SRR8437780ibex AlpineIbex SRR8437782ibex AlpineIbex SRR8437783ibex AlpineIbex SRR8437791ibex AlpineIbex SRR8437793ibex AlpineIbex SRR8437799ibex AlpineIbex SRR8437809ibex AlpineIbex SRR8437810ibex AlpineIbex SRR8437811ibex AlpineIbex SRX5250055_SRR8442974 Appenzell SRX5250057_SRR8442972 Appenzell SRX5250124_SRR8442905 Appenzell SRX5250148_SRR8442881 Appenzell SRX5250150_SRR8442879 Appenzell SRX5250151_SRR8442878 Appenzell SRX5250153_SRR8442876 Appenzell SRX5250155_SRR8442874 Appenzell SRX5250156_SRR8442873 Appenzell SRX5250157_SRR8442872 Appenzell 340330_T1 Bezoar 340331_T1 Bezoar 340334_T1 Bezoar 340340_T1 Bezoar 340345_T1 Bezoar 340347_T1 Bezoar 340426_T1 Bezoar 470100_T1 Bezoar 470104_T1 Bezoar 470106_T1 Bezoar ... 454948_T1 Urial ERR454947urial Urial SRR12396950urial Urial For the Plink binary input, user need to specify the path to the BED/BIM/FAM files in the section of general parameters: ``input= "path/to/the/files/*.{bed,bim,fam}"`` ### 2. Optional input data files In this module, the samples that should be removed in given analyses (``structure_remove_indi``) can be provided. For example, during the filtering (in the earlier step of sub-workflow), the samples of Grigia goat breed were removed (``rem_indi``). Here, for the PCA and admixture, all samples of the outgroup will be excluded. A space/tab-delimited text file with population IDs in the first column and sample IDs in the second column should be provided (*test_files/remove_outgroup.txt*): :: Urial 454948_T1 Urial ERR454947urial Urial SRR12396950urial The desired colors and mark shapes for each population can also be provided for the plotting. For example, ``pop_color_file``, a tab-delimited text file, where the population names are in the first column and specified hex color codes in the second *test_files/pop_color.txt*: :: AlpineIbex #008000 Appenzell #ff5733 Booted #0000FF ChamoisColored #d6b919 Grigia #aee716 Peacock #16e7cc Saanen #75baf3 Urial #A52A2A Toggenburg #da4eed Bezoar #FFA500 Similarly, prepare another file for ``f_pop_marker`` with specified marker shapes that are listed in **./extra/markershapes.txt**. Here is the example of *test_files/pop_shape.txt*: :: AlpineIbex square_default Appenzell circle_default Booted circle_default ChamoisColored circle_default Grigia circle_default Peacock circle_default Saanen circle_default Urial diamond_default Toggenburg circle_default Bezoar triangle_default Additionally, we will like to plot admixture results in a certain order of populations (``plot_pop_order``). For that, we need to prepare a text file with ordered population IDs in one column *test_files/pop_order.txt*: :: Appenzell Booted ChamoisColored Grigia Peacock Saanen Toggenburg Bezoar AlpineIbex In the case of PCA with the program SmartPCA, you can also provide your own file with optional parameters (``smartpca_param``). Please, make one according to the `instructions of the EIGENSOFT software `__. This workflow also has an option to draw a geographic map with samples’ origin. For that, we need to provide two files with coordinates (``f_pop_cord``) and colors (``f_pop_color``). In the first one (*test_files/geo_data.txt*), we write down population IDs in the first column and comma separated latitudes and longitudes in second column. ``Bezoar 32.662864436650814,51.64853259116807 Urial 34.66031157,53.49391737 AlpineIbex 46.48952713,9.832698605 ChamoisColored 46.620927266181674,7.345747305114329 Appenzell 47.33229709563813,9.401363933224248 Booted 47.426361052956736,9.384330852599533 Peacock 46.321661051197026,8.804738507288173 Toggenburg 47.358160245764715,9.01070577172017 Grigia 46.24935612558498,8.700996940189137 Saanen 46.9570926960748,8.205509946726016`` In the second file, we specified the hex codes of colors that will represent each population (*test_files/pop_color.txt*). ``AlpineIbex #008000 Appenzell #ff5733 Booted #0000FF ChamoisColored #d6b919 Grigia #aee716 Peacock #16e7cc Saanen #75baf3 Urial #A52A2A Toggenburg #da4eed Bezoar #FFA500`` The last file is not obligatory as the tool can choose random colors, while the first one with coordinates is necessary for map plotting. 3. Setting the parameters ------------------------- | At the beginning, we have to specify some of the general parameters, which can be found in the first tab of GUI (**general_param**): | ``input``: path to the .csv input file for the VCF format or names of the PLINK binary files; | ``outDir``: the name of the output folder; | ``sample_map``: path to the file with the suffix “.map” that have listed individuals and populations as addition to VCF input; | ``concate_vcf_prefix``: file prefix of the genome-wise merged vcf files; | ``geo_plot_yml``: path to the yaml file containing parameters for plotting the samples on a geographical map; | ``tile_yml``: path to the yaml file containing parameters for the geographical map to be used for plotting; | ``f_chrom_len``: path to the file with chromosomes’ length for the Plink binary inputs; | ``f_pop_cord``: path to the file with geographical locations for map plotting; | ``f_pop_color``: path to the file with specified colors for map plotting; | ``fasta``: the name of the reference genome fasta file that will be used for converting in case of PLINK input; | ``allow_extra_chrom``: set to true if the input contains chromosome name in the form of string; | ``max_chrom``: maximum number of chromosomes; | ``outgroup``: the population ID of the outgroup; | ``cm_to_bp``: the number of base pairs that corresponds to one cM Move forward to the tab named **genstruct_params** dedicated to the PCA, Admixture and both NJ clustering analyses, where we specify parameters described at the begining of this read.me. At the end, save the parameters as yml file. After setting the parameters, choose any profile, we prefer mamba, and set maximum number of processes, 10 in our case, that can be executed in parallel by each executor. From within the **scalepopgen** folder, execute the following command: :: nextflow run scalepopgen.nf -params-file gen_struct.yml -profile mamba -qs 10 You can check all the other command running options with the option help : :: nextflow run scalepopgen.nf -help If the module analyses are processed successfully, the command line output is looking like this: :: N E X T F L O W ~ version 23.04.1 Launching `scalepopgen.nf` [big_swirles] DSL2 - revision: 9f9aaad1d2 executor > slurm (10) [26/3c186e] process > GENERATE_POP_COLOR_MAP (generating pop color map) [100%] 1 of 1 ✔ [33/cca7ac] process > PLOT_GEO_MAP (plotting_sample_on_map) [100%] 1 of 1 ✔ [27/5137ac] process > CONVERT_FILTERED_VCF_TO_PLINK:CONVERT_VCF_TO_BED (converting_vcf_to_bed_CHR29) [100%] 2 of 2 ✔ [0c/af2851] process > CONVERT_FILTERED_VCF_TO_PLINK:MERGE_BED (merging_bed_goats) [100%] 1 of 1 ✔ [d8/740abc] process > EXPLORE_GENETIC_STRUCTURE:REMOVE_INDI_STRUCTURE (remove_indi_pca_goats) [100%] 1 of 1 ✔ [2d/1ba3a8] process > EXPLORE_GENETIC_STRUCTURE:LD_FILTER_STRUCTURE (ld_filtering_goats_rem_indi) [100%] 1 of 1 ✔ [56/e7e488] process > EXPLORE_GENETIC_STRUCTURE:UPDATE_CHROM_IDS (updating_chrom_ids) [100%] 1 of 1 ✔ [7d/fceb05] process > EXPLORE_GENETIC_STRUCTURE:RUN_SMARTPCA (running_smartpca_goats_rem_indi_ld_filtered_upd... [100%] 1 of 1 ✔ [66/231574] process > EXPLORE_GENETIC_STRUCTURE:PLOT_SMARTPCA (plot_interactive_pca) [100%] 1 of 1 ✔ [7b/2561e0] process > EXPLORE_GENETIC_STRUCTURE:RUN_SNPGDSPCA (running_snpgdspca_goats_rem_indi_ld_filtered_u... [100%] 1 of 1 ✔ [6e/5c3380] process > EXPLORE_GENETIC_STRUCTURE:PLOT_SNPGDSPCA (plot_interactive_pca) [100%] 1 of 1 ✔ [75/0f8471] process > EXPLORE_GENETIC_STRUCTURE:CALC_PAIRWISE_FST (ld_filtering_goats_rem_indi_ld_filtered) [100%] 1 of 1 ✔ [cd/67a405] process > EXPLORE_GENETIC_STRUCTURE:CALC_1_MIN_IBS_DIST (1_min_ibs_distance_goats_rem_indi_ld_fil... [100%] 1 of 1 ✔ [01/dfc42f] process > EXPLORE_GENETIC_STRUCTURE:RUN_ADMIXTURE_DEFAULT (run_admixture_5) [100%] 9 of 9 ✔ [7a/ae5112] process > EXPLORE_GENETIC_STRUCTURE:EST_BESTK_PLOT (estimating_bestK) [100%] 1 of 1 ✔ [44/e127c9] process > EXPLORE_GENETIC_STRUCTURE:GENERATE_PONG_INPUT (generating_pong_input) [100%] 1 of 1 ✔ Completed at: 09-Aug-2023 13:56:31 Duration : 5m 52s CPU hours : 2.8 Succeeded : 25 4. Description of the output files generated by this sub-workflow: ------------------------------------------------------------------ The results are stored in the folder **./genetic_structure** and inside we have subfolders of each PCA, Admixture and interactiv plots. In subfolder **/plink** are stored files after modifying and filtering steps. .. figure:: ../../images/genstruct_dir.PNG :alt: folders folders Each PCA has its own folder with eigenvectors and eigenvalues: |folders| .. figure:: ../../images/smartPCA_dir.PNG :alt: folders folders Folder admixture contains Q-matrices for each K value together with interactive plot of optimal K: .. figure:: ../../images/admixture_dir.PNG :alt: folders folders Interactive plots of PCA and both NJ trees are stored here: .. figure:: ../../images/genstruct_plots_dir.PNG :alt: folders folders Let’s take a look at the PCA plots first. On both of them our samples cluster into three groups. In the first one we can found all breeds of domestic goats from Switzerland. In another cluster are Alpine ibexes and in the third one are Bezoar wild goats. .. figure:: ../../images/test_pca_plot.svg :alt: image description image description Figure 1: Interactive plots of both Principal Component Analyses Program Admixture estimated the optimal number of clusters at five. As we can see on Figure 2, four Swiss goats (Booted, Chamois colored, Peacock and Saanen) are showing very similar genomic structures. Appenzell and Toggenburg goat breeds have distinct and more homogeneous structures with some individual goats that share segments with other breeds from Switzerland. Both wild species, Alpine ibex and Bezoar, have uniform genetic structures. |image1| Figure 2: Population structures of our dataset The NJ tree of IBS-based (Figure 3) distances positioned the branches of Swiss goats according to their breeds. Alpine ibexes and Bezoar wild goats formed their own clade, inside of which we can see clear distinction between the two species. According to that distribution was also the layout of Fst-based NJ tree (Figure 4), but, unlike the IBS-based tree, here we have population’ divergence. |image description| Figure 3: A neighbor-joining tree constructed with matrix of IBS distances between individuals |image2| Figure 4: A neighbor-joining tree constructed with Fst distances between populations 5. Generating the interactive plots without running the workflow ---------------------------------------------------------------- For generating the interactive pca plot only (withut re-running the workflow), one can use the following command: :: python3 plot_interactive_pca.py pop_markershape_col.txt pca.yml The python script is located in the bin folder of scalepopgen. **** and **** are located in the respective output folders of smartpca and gds_pca. The yaml file is located in “/parameters/plots/” folder. **pop_markershape_col.txt** is located in the folder of “interactive_plots/pca/” or one can also create this tab-delimited file with this format: the first column is pop_id, the second column is shape_id, the third column is hex color code. Refer to “./extra/markershapes.txt” to see the list of shapes implemented in this bokeh-dependent python script. The parameters of the yaml files are described below: :: plot_width: plot-width size in pixel plot_height: plot-height size in pixel pc_x_to_plot: which pc to plot on the x-axis pc_y_to_plot: which pc to plot on the y-axis fill_alpha: fill-color intensity line_alpha: line-color intensity marker_size : size of the markers to be plotted; input should be boolean; True or False show_sample_label: whether or not to show the sample label for each dot during hovering. **For plotting the large number of samples and population, increase the plot_width and plot_height size, reduce the marker_size and set show_sample_label to false** Next, the admixture plot can be also generated with the following command: :: python3 plot_interactive_q_mat.py -q -f -y admixture.yml -c color.txt -o -s plot_pop_order.txt The python script located in the bin folder of scalepopgen. **** is located in the respective output folder of admixture. The file is located in folder \**./plink/update_chrom_ids/*.fam\ **. Text files**\ color.txt\ **and**\ plot_pop_order.txt\ **can be created by the user. File**\ color.txt*\* has hex color codes listed in one column: :: #FF0000 #00FF00 #0000FF #FFFF00 #FF00FF #00FFFF #FFA500 #800080 #008000 #800000 Similarly, in file **plot_pop_order.txt** population IDs are listed in one column according to the order they should be plotted: :: Appenzell Booted ChamoisColored Peacock Saanen Toggenburg Bezoar AlpineIbex The yaml file **admixture.yml** is located in “/parameters/plots/” and it cointains parameters described below: :: width: plot-width size in pixel height: plot-height size in pixel bar_width: the width of each sample bar sample_label_orientation: degrees of anlge at which the sample labels should be written pop_label_orientation: degrees of anlge at which the population labels should be written space_pop_group: the width of space between populations legend_font_size: font size of the legend num_legend_per_col: number of different K per column label_font_size: font size of the labels fil_alpha: fill-color intensity For generating the IBS-dist interactive NJ trees, one can use the following command: :: python3 make_ibs_dist_nj_tree.py -r -i -m -c pop_sc_color.map -y ibs_nj.yml -o The python script is located in the bin folder of scalepopgen. **** refers to the population to be used for rooting the tree, **** refers to the sqaure matrix of 1-ibs distance between pairwise samples and generated by plink1.9, **** refers to id file generated along with the square matrix by plink1.9, **pop_sc_color.map\* is the tab-delimited file containing the first column as pop_id, second column as sample size and the third column as hex color code. Also generated by the workflow and saved in the output folder as**\ pop_sc_color.map**. The yml file is located in “parameter/plots/” folder. The parameter of the yml files are described below: :: width: plot-width size in pixel height: plot-height size in pixel layout: the tree layout, valid options: 'c','r','d', for circular, right and down layout of the tree tip_label_align: whether or not to align the tip labels; input should be boolean; True or False tip_label_font_size: the font size of the tip labels, default: "12px" edge_widths: the width of edges, default: 1 node_sizes: the size of nodes, default:6 node_hover: whether or not to show details info of the node while hovering; input should be boolean; True or False The python script can also be run directly using the file containing tree in newick format. For more details run “python3 make_ibs_dist_nj_tree.py -h”. As the python script is dependent on Toytree, for more details of these parameters, refer to `Toytree `__ documentation. For generating the fst-based NJ tree, one can use the following command: :: python3 make_fst_dist_nj_tree.py -i -r -o -y fst_nj.yml -c pop_sc_color.map References ========== Please cite the following papers if you use this sub-workflow in your study: [1] Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9), 1655-1664. https://doi.org/10.1101/gr.094052.109 [2] Patterson, N., Price, A. L., & Reich, D. (2006). Population structure and eigenanalysis. PLoS genetics, 2(12), e190. https://doi.org/10.1371/journal.pgen.0020190 [3] Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics, 38(8), 904-909. https://doi.org/10.1038/ng1847 [4] Zheng, X., Levine, D., Shen, J., Gogarten, S. M., Laurie, C., & Weir, B. S. (2012). A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics (Oxford, England), 28(24), 3326-3328. https://doi.org/10.1093/bioinformatics/bts606 [5] Francis, R. M. (2017). pophelper: an R package and web app to analyse and visualize population structure. Mol Ecol Resour, 17: 27–32. doi:10.1111/1755-0998.12509 [6] Huerta-Cepas, J. et al.,(2016). ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data, Molecular Biology and Evolution, Volume 33, Issue 6, June 2016, Pages 1635–1638, https://doi.org/10.1093/molbev/msw046. [7] Eaton, Deren. (2019). Toytree: A minimalist tree visualization and manipulation library for Python. Methods in Ecology and Evolution. 11. 10.1111/2041-210X.13313. [8] Di Tommaso, P., Chatzou, M., Floden, E. et al. Nextflow enables reproducible computational workflows. Nat Biotechnol 35, 316-319 (2017). https://doi.org/10.1038/nbt.3820 License ======= MIT .. |plot| image:: ../../images/Sample_info.png .. |folders| image:: ../../images/gds_PCA_dir.PNG .. |image1| image:: ../../images/admixture_plot.svg .. |image description| image:: ../../images/ibs_tree.svg .. |image2| image:: ../../images/fst_tree.svg