  • How to estimate the error rate for short reads and filter base-calling duplicates?

    (1) How can I estimate the error rate for GAII short reads?
    (2) Some papers, such as the panda genome paper, filter out "base-calling duplicates". I have no idea how this is done. Can someone give me some clues on how to filter out these base-calling duplicates? The relevant passage from the paper:



    "(1) Base-calling duplicate, this is a unique characteristic for each lane,
    caused by the solexa-pipeline, and they are not real sequences. The higher the raw
    cluster density, the more severe this problem is. The redundant reads were filtered at a
    threshold of euclid distance <= 3 and a mismatch rate of <= 0.1. We observed that the
    average rate of base-calling duplicates for each lane was about 0.83%, ranging from
    0.00% to 8.52%. (2) Adapter contamination, another unique characteristic of the
    specific library, is caused by DNA adaptor dimerization, the empty loading or too
    small an insert size (less than the read length)."
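For question (1), one rough starting point is the error rate implied by the base qualities themselves. The sketch below is only that: it assumes the quality strings are Phred-scaled with a known ASCII offset (33 for Sanger-style FASTQ, 64 for Illumina pipeline 1.3+), it does not handle the older Solexa-scaled scores, and the function name is hypothetical. It reports what the base caller claims, not an error rate measured against a reference.

```python
def expected_error_rate(quality_strings, offset=33):
    """Mean per-base error probability implied by Phred quality strings.

    Assumes Phred scaling: P(error) = 10 ** (-Q / 10), with Q = ord(char) - offset.
    """
    total_error = 0.0
    total_bases = 0
    for qual in quality_strings:
        for ch in qual:
            q = ord(ch) - offset
            total_error += 10 ** (-q / 10)
            total_bases += 1
    # NaN signals "no bases seen" rather than a misleading 0.0
    return total_error / total_bases if total_bases else float("nan")
```

An independent check would be to align a control lane to its known reference and count mismatches per aligned base.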

  • #2
    I don't fully understand the definition of a base-calling duplicate. Duplicate filtering for the panda genome assembly would not lose much information, but it may reduce coverage. I guess that is why the coverage for high-GC-content regions (>60%) becomes relatively lower. (High-GC regions are known to have higher sequencing coverage due to PCR.) In an RNA-seq study, this filtering may cause gene expression to be underestimated.
    Xi Wang

    • #3
      Hi Xi,

      Would you know of a reference for high-GC regions having higher sequencing coverage due to PCR?

      • #4
        First of all, we need to know what "base-calling duplicate" means. Does anyone have any ideas?

        • #5
          " Base-calling duplicate .... The higher the raw
          cluster density, the more severe this problem is."

          It must be that the software calls two reads from the same cluster. Look for identical (or nearly identical) reads with nearly identical cluster coordinates, then.
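A minimal sketch of that idea, using the thresholds quoted from the paper (Euclidean distance <= 3, mismatch rate <= 0.1). Several things here are assumptions, not the paper's confirmed method: that the distance is measured between cluster x/y coordinates on the same tile, that the mismatch rate is the fraction of differing bases, and the input layout `(tile, x, y, seq)`. The function names are hypothetical.

```python
from math import hypot

def mismatch_rate(a, b):
    """Fraction of positions that differ over the shorter read's length."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / n

def flag_basecalling_duplicates(reads, max_dist=3.0, max_mismatch=0.1):
    """reads: list of (tile, x, y, seq) tuples.

    Returns the set of read indices flagged as duplicates of a nearby,
    near-identical read on the same tile. Within each group the read
    that comes first in (x, y) order is the one kept.
    """
    by_tile = {}
    for i, (tile, x, y, seq) in enumerate(reads):
        by_tile.setdefault(tile, []).append((x, y, i, seq))

    duplicates = set()
    for items in by_tile.values():
        items.sort()  # sort by x so the inner scan can stop early
        for a in range(len(items)):
            xa, ya, ia, sa = items[a]
            if ia in duplicates:
                continue
            for b in range(a + 1, len(items)):
                xb, yb, ib, sb = items[b]
                if xb - xa > max_dist:
                    break  # every later cluster is even further away in x
                if ib in duplicates:
                    continue
                close = hypot(xb - xa, yb - ya) <= max_dist
                if close and mismatch_rate(sa, sb) <= max_mismatch:
                    duplicates.add(ib)
    return duplicates
```

Reads on different tiles are never compared, since coordinates are only meaningful within one tile.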

          Very GC-rich sequences actually have lower coverage: if you can't amplify a sequence well, you can't sequence it (unless you do single-molecule sequencing, of course).

          • #6
            Originally posted by mattanswers
            Hi Xi,

            Would you know of a reference for high-GC regions having higher sequencing coverage due to PCR?

            It's an "old" paper.

            Code:
            Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in
            ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids
            Res. 2008 Sep;36(16):e105. Epub 2008 Jul 26. PubMed PMID: 18660515; PubMed
            Central PMCID: PMC2532726.
            Xi Wang

            • #7
              Originally posted by Chipper
              Very GC-rich sequences have lower coverage actually, if you can't amplify it well you can't sequence it (unless you do single molecule sequencing of course)

              Do you mean that GC-rich sequences cannot be amplified well?
              Xi Wang

              • #8
                If you check the G-C/A-T ratio of the sequencing results, you will find the two are similar. Also, if you check this ratio along the reads, you will find it is more variable at the ends, which may be due to adapter sequence.
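A small sketch of the per-position check described here, assuming the reads are available as plain strings in sequencing orientation. The function name is hypothetical, and bases other than A/C/G/T (such as N) are simply skipped.

```python
def gc_fraction_by_position(reads):
    """GC fraction at each read position, counting only A/C/G/T calls."""
    if not reads:
        return []
    length = max(len(r) for r in reads)
    gc = [0] * length
    called = [0] * length
    for read in reads:
        for i, base in enumerate(read.upper()):
            if base in "ACGT":
                called[i] += 1
                if base in "GC":
                    gc[i] += 1
    # NaN marks positions where no read had an unambiguous base call
    return [g / c if c else float("nan") for g, c in zip(gc, called)]
```

Plotting the returned list against position should show whether the variability is concentrated at the read ends.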

                • #9
                  Originally posted by lmf_bill
                  if you check the sequencing results of G-C/A-T ratio, you will find they are similar. Also, if you check this ratio along the reads, you will find the more variable in the ends, which maybe be due to adapter.

                  The G-C/A-T ratio is similar for our libraries (ChIP-Seq experiment). However, the GC content of the whole genome is 36%, with exons at 43% GC and intergenic regions at 32% GC, so I was expecting a much different ratio for our libraries.

                  • #10
                    Originally posted by mattanswers
                    The G-C/A-T ratio is similar for our libraries (Chip-Seq experiment). However, the GC ratio for the whole genome is 36%; exons with 43 % GC and intergenic with 32 % GC, so I was expecting a much different ratio for our libraries.

                    I see what you mean. You find ~50% GC in the ChIP-Seq data and think there is a GC bias, maybe due to base-calling duplication. That seems reasonable.

                    Another thing: how do you estimate the GC ratio of the genome, exons, and intergenic regions? Is it all based on the Ensembl annotation? When I once estimated it for the hg19 exome, I found ~50% GC, only slightly lower than the AT content. Maybe I need to check further.
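One way to make such estimates comparable is to compute them the same way for every region class. The sketch below assumes a chromosome sequence held as a string and annotation given as 0-based, half-open `(start, end)` intervals, both hypothetical inputs here. Ns are ignored and soft-masked lowercase bases are counted.

```python
def gc_content(sequence, intervals=None):
    """GC fraction over A/C/G/T bases, optionally restricted to intervals.

    Overlapping intervals are counted twice, so merge them beforehand.
    """
    if intervals is None:
        intervals = [(0, len(sequence))]
    gc = at = 0
    for start, end in intervals:
        chunk = sequence[start:end].upper()
        gc += chunk.count("G") + chunk.count("C")
        at += chunk.count("A") + chunk.count("T")
    return gc / (gc + at) if gc + at else float("nan")
```

Calling it once with exon intervals and once with intergenic intervals gives figures that can be set against the GC fraction observed in the reads.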

                    • #11
                      I am working with the Arabidopsis genome, which has been sequenced. The numbers I wrote come from Table 3 of Town et al., The Plant Cell 18:1351, 2006. They seem to have used the TIGR annotation pipeline.
