LDhat is a widely used software package for estimating variable recombination rates from population genetic data using composite likelihood methods.
Here is a practical workflow guide to analyzing recombination rates with LDhat.

Phase 1: Data Preparation
LDhat requires specific input formats, typically generated from VCF files using tools like vcftools or custom scripts.
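As an illustration of the custom-script route, here is a minimal Python sketch that builds the text of a .sites and .locs pair from phased haplotypes already held in memory. It assumes the haplotypes are equal-length strings of 0/1 characters, that positions arrive in base pairs and should be written in kilobases, and that the crossing-over flag L is wanted in the .locs header. The names format_sites and format_locs are hypothetical helpers, not part of LDhat.

```python
def format_sites(haplotypes, names=None):
    """Return the text of a .sites file for phased haplotypes.

    haplotypes: equal-length strings, one per sequence (e.g. "0101").
    """
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must cover the same number of sites")
    if names is None:
        names = [f"hap{i + 1}" for i in range(len(haplotypes))]
    # Header: number of sequences, number of sites, data-type flag
    # (assumed here: 1 = phased haplotype data).
    lines = [f"{len(haplotypes)} {n_sites} 1"]
    for name, hap in zip(names, haplotypes):
        lines.append(f">{name}")  # FASTA-style record name
        lines.append(hap)
    return "\n".join(lines) + "\n"


def format_locs(positions_bp, region_length_bp, model="L"):
    """Return the text of a .locs file, converting base pairs to kilobases."""
    positions = list(positions_bp)
    if positions != sorted(positions):
        raise ValueError("positions must be in ascending order")
    kb = [p / 1000 for p in positions]
    # Header: number of sites, region length in kb, model flag (assumed "L").
    header = f"{len(kb)} {region_length_bp / 1000:.3f} {model}"
    return header + "\n" + " ".join(f"{p:.3f}" for p in kb) + "\n"
```

Writing the two returned strings to files such as region.sites and region.locs (hypothetical names) gives inputs for the later phases. It is worth checking the layout against the LDhat manual for your version before committing to a long run.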
Sites File (.sites): Contains the genotype or haplotype data. The first line lists the number of sequences, the number of sites, and a flag for the data type (1 for phased haplotype data, 2 for unphased genotype data).
Locs File (.locs): Contains the positions of the SNPs along the region, typically in kilobases. The first line lists the number of sites, the total length of the region, and a model flag (L for crossing-over, C for gene conversion).

Phase 2: Generating the Lookup Table
LDhat uses a composite likelihood lookup table to estimate the population recombination rate (ρ). Generating this table from scratch is computationally expensive.
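Matching or generating a table requires a per-site estimate of θ. A minimal sketch of Watterson's estimator follows, assuming you already know the number of sequences, the number of segregating sites, and the number of sites surveyed. The function name is hypothetical.

```python
def watterson_theta_per_site(n_sequences, n_segregating, n_sites_surveyed):
    """Watterson's estimator of theta, expressed per site.

    theta_W = S / a_n, where a_n = sum_{i=1}^{n-1} 1/i,
    then divided by the number of sites surveyed.
    """
    if n_sequences < 2:
        raise ValueError("at least two sequences are required")
    if n_sites_surveyed <= 0:
        raise ValueError("the number of sites surveyed must be positive")
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return n_segregating / (a_n * n_sites_surveyed)
```

Since pre-computed tables cover only a handful of θ values, the estimate is typically matched to the closest value for which a table is available.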
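Pre-made Tables: First check whether a pre-calculated lookup table matches your sample size (n) and population mutation rate (θ, commonly estimated with Watterson's estimator). A table computed for a larger sample size can also be reduced to a smaller n with LDhat's lkgen program.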
Custom Tables: If no table matches, use the complete program to generate one.
Command: complete -n [sample_size] -rhomax [max_rho] -n_pts [number_of_grid_points] -theta [value]
Note: This can take days for large sample sizes; split jobs if necessary.

Phase 3: Running the Interval Program
The interval program is the core module of LDhat. It uses a Markov Chain Monte Carlo (MCMC) approach to estimate how recombination rates vary across the region.
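A sketch of how an interval call might be assembled from Python. The -seq, -loc, and -lk flags for the three input files come from the LDhat documentation rather than from this guide, so verify them against your installed version. interval_command and run_interval are hypothetical helpers, and the default values simply echo the ranges quoted in the parameter list that follows.

```python
import subprocess


def interval_command(sites, locs, lookup, its=10_000_000, bpen=5,
                     samp=5000, exe="interval"):
    """Build the argument list for an interval run (nothing is executed here)."""
    if its <= 0 or samp <= 0:
        raise ValueError("its and samp must be positive")
    return [
        exe,
        "-seq", sites,      # .sites file (flag name assumed from the LDhat docs)
        "-loc", locs,       # .locs file (flag name assumed from the LDhat docs)
        "-lk", lookup,      # lookup table (flag name assumed from the LDhat docs)
        "-its", str(its),   # number of MCMC iterations
        "-bpen", str(bpen), # block penalty
        "-samp", str(samp), # keep one sample every `samp` iterations
    ]


def run_interval(cmd):
    """Run the assembled command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to log the exact call, or to submit it to a cluster scheduler instead of running it locally.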
Execution: Run interval by providing the .sites, .locs, and lookup table files.
Key Parameters:
-its: Number of MCMC iterations (typically 1,000,000 to 10,000,000).
-bpen: Block penalty. Controls the smoothness of rate variation (a value of 5 is a common choice; higher values produce smoother rate estimates).
-samp: Sampling interval (e.g., every 2,000 or 5,000 iterations).

Phase 4: Convergence Diagnostics
Before analyzing the output, you must ensure the MCMC chain has reached its stationary distribution.
Burn-in: Discard the first 10% to 20% of the iterations to remove bias from the starting configurations.
Trace Plots: Plot the log-likelihood values across iterations using tools like R to check for a flat, stable plateau.

Phase 5: Processing the Output
The stat program summarizes the raw MCMC output generated by interval.
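Note that the -burn value in the command below counts retained samples, not raw MCMC iterations. A minimal sketch of the conversion follows, assuming interval keeps roughly one sample every -samp iterations. The exact count may differ by one depending on whether the starting state is recorded, so check it against the interval output. The helper names are hypothetical.

```python
def burn_in_samples(its, samp, burn_percent=10):
    """Convert a burn-in percentage into a number of retained samples."""
    if samp <= 0 or its < samp:
        raise ValueError("its must be at least samp, and samp must be positive")
    if not 0 <= burn_percent < 100:
        raise ValueError("burn_percent must be in the range [0, 100)")
    total_samples = its // samp  # approximate number of samples written
    return total_samples * burn_percent // 100


def stat_command(rates_file, burn, exe="stat"):
    """Build the argument list for stat using only -input and -burn."""
    return [exe, "-input", rates_file, "-burn", str(burn)]
```

For example, a run of 10,000,000 iterations sampled every 5,000 iterations yields about 2,000 samples, so a 10% burn-in corresponds to discarding the first 200.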
Execution: Run stat on the output file (usually ends in _rates.txt).
Command: stat -input [interval_output_file] -burn [number_of_samples_to_discard]
Result: This generates a summary file (res.txt) containing the mean, median, and lower and upper 95% bounds for ρ between every pair of adjacent SNPs.

Phase 6: Downstream Analysis and Visualization
With your finalized ρ values, you can now interpret the biological significance.
Recombination Hotspots: Look for sharp peaks where ρ is significantly higher than the background rate.
Plotting: Import the res.txt coordinates and mean ρ values into R or Python to map recombination landscapes against genomic features (like genes or promoters).
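A minimal sketch of reading res.txt and flagging candidate hotspots in Python. It assumes a whitespace-delimited layout with a header line and five columns (position, mean ρ, median, lower bound, upper bound), plus a summary row with a negative position for the whole region; confirm this against your own file. The rule used here, a lower bound exceeding a chosen multiple of the median background rate, is an illustrative heuristic rather than a formal hotspot test, and the function names and the fold threshold are hypothetical.

```python
from statistics import median


def parse_res(text):
    """Parse res.txt content into a list of per-interval dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            pos, mean, med, lower, upper = (float(x) for x in fields[:5])
        except ValueError:
            continue  # non-numeric line, e.g. the column header
        if pos < 0:
            continue  # assumed whole-region summary row
        rows.append({"pos": pos, "mean": mean, "median": med,
                     "lower": lower, "upper": upper})
    return rows


def flag_hotspots(rows, fold=10.0):
    """Return intervals whose lower bound exceeds fold x the background rate.

    The background is taken as the median of the per-interval mean rates.
    """
    if not rows:
        return []
    background = median(r["mean"] for r in rows)
    return [r for r in rows if r["lower"] > fold * background]
```

The parsed rows can be passed directly to a plotting library, with pos on the x-axis and mean on the y-axis, and the flagged intervals overlaid on gene or promoter annotations.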