Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Heterozygous rate from "de novo" sequencing

    Hi,

    I've just tried to calculate the heterozygous rate of my seq data, but no luck using gce or even preqc from sga using kmer information. Any idea about how to directly calculate heterozygous rate by SNPs?

    Bests,

    Carlos

  • #2
    You can do that with BBTools like this:

    kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt

    You can examine the histogram manually, or use the "peaks" file which tells you the number of unique kmers in each peak on the histogram. For a diploid, the first peak will be the het peak, the second will be the homozygous peak, and the rest will be repeat peaks. The peak caller is not perfect, though, so particularly with noisy data I would only rely on it for the first two peaks, and try to quantify the higher-order peaks manually if you need to (which you generally don't).

    I think AllPaths-LG will report het rate, also.

    Comment


    • #3
      Dear Brian,

      Thank you for developing such an useful an versatile tool. I was wondering how to interpretate both files to gain insight into the heterozygosity. My peaks file was generated using information from a highly heterozygous diploid organism, however there is a third peak with a greater volume than any other, why could this be happening?

      Thanks a lot!

      #start center stop max volume
      10 14 18 4633863 36855138
      18 29 36 4913455 85947645
      36 59 103 12178207 347426111
      103 105 106 324995 972461
      106 114 228 331479 18667384
      228 229 230 56194 111376
      230 231 234 55293 217949
      234 235 99931 54052 12688071

      Comment


      • #4
        There are several possible explanations for this, such as organelles, contamination, a symbiont, or mozaicism. But in this case, noting that the peaks are centered at 14, 29, and 59, it very much looks like the organism is tetraploid, or was tetraploid in the past and is now diploid but each pair of chromosomes has a very similar additional pair. How confident are you that it's diploid?

        Organelles is unlikely because they generally have a small, very high copy peak. You can check for contamination/symbionts by plotting a gc-content histogram of the raw reads (reformat.sh in=reads.fq gchist=gchist.txt gcbins=auto) or assembling, mapping, and plotting gc versus coverage (this can be generated by BBMap with the covstats output) or plotting a PCA of tetramer frequencies of the contigs.

        The latest version of BBTools (35.14) does smoothing of the kmer histogram prior to calling peaks, which greatly increases the accuracy, particularly for high-copy-count peaks.

        Comment


        • #5
          Thank you for your prompt reply! A karyotype was performed and it was determined that the organism is diploid, however the possibility of a recent duplication is not difficult to imagine.
          Regarding symbionts or organelles, we would only expect the mitochondria, we are quite confident that this sample is symbiont free.
          Could you please elaborate into why you think this sample is tetraploid? Also, how could I quantify heterozygosity using the output?
          Thanks!

          Comment


          • #6
            Typically, for a polyploid organism, the peak with the largest volume (or the major peak with the greatest depth) represents the homozygous portion of the genome. For a diploid organism with 60x coverage, you would expect two peaks - one at 60x (the homozygous peak) and one at 30x (the heterozygous peak). As long as the het rate is not super high, the homozygous peak will be bigger; though with an extremely high het rate, the heterozygous peak could be bigger. But, instead of homozygous and heterozygous peaks, let's call them the 1-copy and 2-copy peaks.

            With a tetraploid, you would expect 4 peaks - a 1-copy, 2-copy, 3-copy, and 4-copy. The 4-copy will generally be the largest. If you have 60x coverage (of the haploid genome), then the peaks will be at 15, 30, 45, and 60. Presumably, if you look at the kmer histogram, this dataset should have a 3-copy peak at depth 45, but it was probably merged into the 60-depth peak.

            So, generally, if you ever have the biggest peak at depth N, and have smaller peaks at N/4, 2N/4, and ideally 3N/4, you are probably looking at a tetraploid.

            Oh... as for calculating the het rate, you can estimate that the genome size is 1*(volume of 1-copy peak)+2*(volume of 2-copy peak), etc. The latest version of BBMap will do this automatically and report the genome size in the header of the peaks file. That's the number of base pairs in each cell of the organism; if you want the haploid genome size, then for a tetraploid, you'd divide that number by 4. So:

            (1*36855138+2*85947645+4*347426111)/4=399,613,718 - roughly 400Mbp.

            To get the het rate, if you assume that all the differences between ploidy is just SNPs (a useful simplifying assumption), then since KmerCountExact uses 31-mers, each 1-copy SNP would be expected to produce 31 kmers in the 1-copy peak. Therefore, the number of single-copy SNPs would be the volume of the 1-copy peak divided by 31, which is 1188875. The number of 2-copy SNPs would be 2-copy peak over 31, or 2772504. So the total number of SNP locations is probably around 3,961,379. Therefore the het rate is something like 1/100.
            Last edited by Brian Bushnell; 07-08-2015, 11:29 AM.

            Comment


            • #7
              Thanks a lot for your explanations, it is all clear now! And once more thank you for developing BBTools.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              30 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              32 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              28 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              52 views
              0 likes
              Last Post seqadmin  
              Working...
              X