  • Coverage & insert size estimation

    I have Illumina paired-end genomic reads (1.fastq and 2.fastq) sequenced on a HiSeq 2000. The files are large (around 100 Gb each) and my computing power is limited (I am short of memory and disk space on a Dell workstation). I need to know the coverage and insert size for my genome without doing a de novo assembly and mapping the reads to it. I know of some tools that calculate insert size and coverage from SAM/BAM files, such as Qualimap and qatools.

    1. Is there any tool that can approximately estimate coverage and insert size without assembling and mapping?

    2. If no such tool is available, can I extract a random 10% of the reads, assemble them de novo, and map the reads back to find the coverage and insert size?

  • #2
    Depending on the length and insert size of the reads, you can get an insert size histogram via overlap, which is fast and does not require assembly or mapping. You can do that like this:

    bbmerge.sh in1=1.fastq in2=2.fastq ihist=ihist.txt reads=2000000

    ...which will just process the first two million reads. However, if the insert size is long enough that they don't overlap, it won't work and you need to assemble and map. Whether or not you can assemble only 10% of the reads depends on how much coverage you have. Do you know what kind of organism it is, or is it a metagenome?
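Once ihist.txt exists, a few lines of awk can summarize it. This is only a minimal sketch: it assumes the histogram has the insert size in column 1 and the pair count in column 2, and that header lines start with `#`, so check the file before relying on it. `ihist_summary` is a hypothetical helper name, not part of BBTools.

```shell
# Hypothetical helper (not part of BBTools): summarize an insert-size histogram.
# Assumes: column 1 = insert size, column 2 = count, '#' lines are headers.
ihist_summary() {   # usage: ihist_summary ihist.txt
  awk '
    /^#/ || NF < 2 { next }                    # skip header and blank lines
    {
      n += $2; sum += $1 * $2                  # accumulate for the weighted mean
      if ($2 > best) { best = $2; mode = $1 }  # most frequent insert size
    }
    END {
      if (n == 0) { print "no data"; exit 1 }
      printf "pairs=%d mean=%.1f mode=%d\n", n, sum / n, mode
    }
  ' "$1"
}
```

Keep in mind that a histogram built by overlap only contains pairs that actually merged, so if few pairs merge the numbers will be biased toward short inserts.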

    You can estimate coverage via kmer-counting, like this:

    khist.sh in1=1.fastq in2=2.fastq hist=hist.txt


    Then you look at the histogram and find the first major peak, which tells you the approximate coverage. You could also speed it up by running it on only a fraction of the total reads and then dividing the resulting coverage by that fraction.
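One simple way to automate the peak search is to skip the initial spike of low-depth error k-mers and then take the tallest bin after it. The sketch below does that, under loud assumptions: column 1 is the depth, the count column (column 2 unless you pass another) holds the number of k-mers at that depth, and rows are sorted by increasing depth. `kmer_peak` and `scale_depth` are hypothetical helper names, and a noisy or strongly heterozygous histogram can fool this heuristic, so plot the file as well.

```shell
# Hypothetical helper: depth of the tallest histogram bin after the error valley.
# Assumes: column 1 = depth, rows sorted by increasing depth, '#' lines are headers.
kmer_peak() {   # usage: kmer_peak HIST [COUNT_COLUMN]
  awk -v c="${2:-2}" '
    /^#/ || NF < c { next }
    !past_valley {
      if (seen && $c > prev) past_valley = 1   # counts start rising: error region ends
      prev = $c; seen = 1
      if (!past_valley) next
    }
    $c > best { best = $c; peak = $1 }         # tallest bin from the valley onward
    END { if (best == "") exit 1; print peak }
  ' "$1"
}

# Scale a peak measured on a random fraction of the reads back to the full dataset.
scale_depth() {   # usage: scale_depth PEAK FRACTION
  awk -v p="$1" -v f="$2" 'BEGIN { printf "%.1f\n", p / f }'
}
```

The peak is a k-mer depth. For reads of length L and k-mer length k, base coverage is roughly peak × L / (L − k + 1), so the two numbers differ noticeably when k is a large part of the read length.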

    Both of these are in the BBTools package. Note that these command lines are for Linux. If your computer uses Windows, the commands would be slightly different.


    • #3
      @Brian Bushnell: Thanks for your suggestion.

      I am working with the genome of a fruit crop plant. Can I assemble 10% of the reads de novo and map those same reads back to the assembly to estimate coverage, insert size, and heterozygosity? Would that analysis be enough for an approximate estimate of these metrics?
      Also, is there a tool that can estimate the heterozygosity rate from mapped reads?


      • #4
        I would rather do this with the whole dataset. If you have enough coverage (approx. >30-fold), the k-mer graph should give you a hint not only about the genome size and coverage but also about heterozygosity.

        I haven't worked with the BBTools package yet, only with Jellyfish and SOAPec. There is also a tool available for estimating these characteristics (see the attached paper). The figures in it might also help in understanding the k-mer graph:

        Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects
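To make the genome-size hint concrete: a common back-of-the-envelope estimate divides the total number of k-mer occurrences outside the error region by the depth of the main peak. The sketch below is that simple estimate only, not the method of the paper above or of any particular tool. It assumes a two-column histogram (depth, then number of distinct k-mers at that depth), and `genome_size`, the peak depth, and the error cutoff are all values you supply yourself.

```shell
# Hypothetical helper: rough genome size from a k-mer histogram.
# Assumes: column 1 = depth, column 2 = number of distinct k-mers at that depth.
genome_size() {   # usage: genome_size HIST PEAK_DEPTH MIN_DEPTH
  awk -v peak="$2" -v min="$3" '
    /^#/ || NF < 2 { next }
    $1 >= min { total += $1 * $2 }     # k-mer occurrences at or above the cutoff
    END {
      if (peak <= 0) exit 1
      printf "%.0f\n", total / peak    # estimated genome size in bases
    }
  ' "$1"
}
```

MIN_DEPTH should sit at the valley between the error k-mers and the main peak. Heterozygosity shows up as an extra peak at about half the main depth, which this estimate does not model.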


        • #5
          Originally posted by bioman1
          @Brian Bushnell: Thanks for your suggestion.

          I am working with the genome of a fruit crop plant. Can I assemble 10% of the reads de novo and map those same reads back to the assembly to estimate coverage, insert size, and heterozygosity? Would that analysis be enough for an approximate estimate of these metrics?
          Also, is there a tool that can estimate the heterozygosity rate from mapped reads?

          First, try merging the reads by overlap; you will know in under a minute whether the reads overlap or not (based on the percentage merged). If they do, then the insert size question is solved.

          The kmer histogram can give you an estimate of the genome size, repetitiveness, AND the heterozygosity. There's really no way to tell whether 10% is enough for assembly without a genome size estimate. If you have 200 Gbp in total, 10% of that (20 Gbp) would give roughly 30x coverage for a ~700 Mbp organism, which is very small for a tree (even ignoring the ploidy).
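The arithmetic behind that estimate is just (total bases × fraction used) / genome size. A tiny sketch, with `expected_coverage` as a hypothetical helper name and the genome size being whatever estimate you currently have:

```shell
# Expected coverage = (total bases x fraction of reads used) / genome size.
expected_coverage() {   # usage: expected_coverage TOTAL_BP FRACTION GENOME_BP
  awk -v t="$1" -v f="$2" -v g="$3" 'BEGIN {
    if (g <= 0) exit 1
    printf "%.1f\n", t * f / g
  }'
}
```

For example, `expected_coverage 200000000000 0.1 700000000` prints 28.6, which is the "roughly 30x" figure for 10% of 200 Gbp on a 700 Mbp genome. A larger genome lowers the result proportionally.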

          By the way, you can also do normalization and subsampling with BBTools, either of which will reduce the read count. For example, you could normalize to approximately 30x coverage like this:

          bbnorm.sh in1=1.fastq in2=2.fastq hist=hist.txt out=normalized.fq target=30

          ...which will automatically determine how many reads you need to get a uniform 30x coverage. It's slower than sampling, but not too bad. The output from that command would be interleaved.
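For comparison, plain random subsampling keeps each read pair with a fixed probability, regardless of its coverage. The thread does not show the BBTools subsampling command, so the sketch below uses plain awk instead. It assumes both files are uncompressed 4-line-per-record FASTQ with mates in the same order, and `subsample_pairs` is a hypothetical helper name.

```shell
# Hypothetical helper: keep each read pair with probability RATE (0 to 1).
# Assumes: uncompressed FASTQ, exactly 4 lines per record, mates in the same order.
subsample_pairs() {   # usage: subsample_pairs R1 R2 RATE SEED OUT1 OUT2
  awk -v r2="$2" -v rate="$3" -v seed="$4" -v o1="$5" -v o2="$6" '
    BEGIN { srand(seed) }                      # fixed seed makes the sample repeatable
    NR % 4 == 1 { keep = (rand() < rate) }     # decide once per record, on its header line
    {
      if ((getline mate < r2) <= 0) {          # read the matching line of the mate file
        print "R2 ended before R1" > "/dev/stderr"; exit 1
      }
      if (keep) { print > o1; print mate > o2 }
    }
  ' "$1"
}
```

A call such as `subsample_pairs 1.fastq 2.fastq 0.1 42 sub1.fastq sub2.fastq` would keep about 10% of the pairs. Unlike normalization, this preserves the original coverage distribution, just at a lower depth.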
          Last edited by Brian Bushnell; 06-11-2014, 09:22 AM.
