  • duplicated reads in fastQC

    Hi, I have some duplication issues, as suggested by FastQC. The duplication levels in my samples average about 60%. I read some old posts. The link below seems to suggest removing duplicated reads, while the sequencing facility suggested otherwise. It seems to me that the duplication will affect the accurate counts of the transcripts. Any thoughts?

    What software can do this duplicate removal? I checked out FASTX, but it doesn't seem to have that functionality. Suggestions?

    thanks!


    Originally posted by GenoMax View Post

  • #2
    Originally posted by JQL View Post
    It seems to me that the duplication will affect the accurate counts of the transcripts.
    To some extent you would expect duplication in a transcriptome (or even small genome) project. It depends on your sequencing coverage and the size of the transcriptome/genome.

    As a thought experiment, let's say that the size of your transcriptome is 100,000,000 bases. That means that at best you can have 100M unique sequences (one per start position). If you sequence 200M reads (cheap to do!) then you would expect a 2x duplication level.

    All sorts of caveats plus 'and-also's in the above but the general idea is that with modern sequencing it is quite easy to overwhelm the uniqueness of reads and start picking up duplicates.
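
The thought experiment above can be put into numbers. This is only a rough sketch under loud assumptions that are not in the original post: every read starts at one of `n_positions` possible start sites, sites are sampled uniformly at random with replacement, and expression is perfectly even (which a real transcriptome never is). The function names are made up for illustration.

```python
import math


def expected_unique(n_reads: int, n_positions: int) -> float:
    """Expected number of distinct start sites hit by n_reads uniform draws.

    Uses the large-n_positions approximation M * (1 - exp(-N / M)).
    """
    return n_positions * -math.expm1(-n_reads / n_positions)


def expected_duplicate_fraction(n_reads: int, n_positions: int) -> float:
    """Expected fraction of reads that repeat an already-seen start site."""
    return 1.0 - expected_unique(n_reads, n_positions) / n_reads


if __name__ == "__main__":
    # Numbers from the thought experiment: 100M sites, 200M reads.
    n_positions = 100_000_000
    n_reads = 200_000_000
    print("mean depth per site:", n_reads / n_positions)
    print("expected duplicate fraction:",
          round(expected_duplicate_fraction(n_reads, n_positions), 3))
```

Under these assumptions the mean depth is 2x, as in the post, and a bit more than half of the reads would be flagged as duplicates purely because of sampling depth. Uneven expression would push that figure higher for abundant transcripts.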



    • #3
      Originally posted by JQL View Post

      What software can do this duplicate removal? I checked out FASTX, but it doesn't seem to have that functionality. Suggestions?

      thanks!
      PRINSEQ (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi) can do removal of duplicate (or n-plicate) sequences.
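
For readers who just want to see what exact-duplicate removal means, here is a minimal sketch. It is not how PRINSEQ is implemented and it ignores everything PRINSEQ offers beyond exact matches; it simply keeps the first read seen for each distinct sequence. It assumes plain 4-line FASTQ records, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it, "")
        next(it, "")  # the '+' separator line is not needed
        qual = next(it, "")
        yield header, seq, qual


def drop_exact_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Keep only the first record for each distinct read sequence."""
    seen = set()
    for header, seq, qual in records:
        if seq in seen:
            continue
        seen.add(seq)
        yield header, seq, qual


if __name__ == "__main__":
    demo = ["@r1", "ACGT", "+", "IIII",
            "@r2", "ACGT", "+", "IIII",
            "@r3", "TTTT", "+", "IIII"]
    for header, seq, qual in drop_exact_duplicates(parse_fastq(demo)):
        print(header, seq)
```

Holding every distinct sequence in a set costs memory roughly proportional to the number of unique reads, so this is only practical as an illustration or for modest files.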



      • #4
        Thanks for your thoughts. I think I would agree with you. I would probably leave the duplicates alone then.

        Originally posted by westerman View Post
        To some extent you would expect duplication in a transcriptome (or even small genome) project. It depends on your sequencing coverage and the size of the transcriptome/genome.

        As a thought experiment, let's say that the size of your transcriptome is 100,000,000 bases. That means that at best you can have 100M unique sequences (one per start position). If you sequence 200M reads (cheap to do!) then you would expect a 2x duplication level.

        All sorts of caveats plus 'and-also's in the above but the general idea is that with modern sequencing it is quite easy to overwhelm the uniqueness of reads and start picking up duplicates.



        • #5
          thanks GenoMax for the link.
          I may experiment a little bit. Remove the duplicates and rerun the fastQC and see what happens.


          Originally posted by GenoMax View Post
          PRINSEQ (http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi) can do removal of duplicate (or n-plicate) sequences.



          • #6
            I just went through this myself with some recent transcriptome data that FastQC showed to be highly redundant.
            Like westerman said it depends on what you are trying to do, but if you are going to use the data for an assembly I'd suggest looking into the digital normalization procedure. This will reduce the amount of redundant data you feed into the assembler and make assembly much more efficient. Of course if you are trying to analyze for differential expression you will ultimately need to retain all of the duplicates.



            • #7
              I am currently only interested in differential expression.

              thanks for sharing your thoughts.

              Originally posted by NRP View Post
              I just went through this myself with some recent transcriptome data that FastQC showed to be highly redundant.
              Like westerman said it depends on what you are trying to do, but if you are going to use the data for an assembly I'd suggest looking into the digital normalization procedure. This will reduce the amount of redundant data you feed into the assembler and make assembly much more efficient. Of course if you are trying to analyze for differential expression you will ultimately need to retain all of the duplicates.



              • #8
                Another related question:

                While I agree it is probably better to leave the duplicated sequences alone for a differential expression study, there are also some over-represented sequences (ORS) in my samples. In the FastQC report, some of the top ORS are shown to be adapter sequences; others have no hits. They probably don't account for a large percentage of the duplicated sequences (5% maybe?). Do you guys remove those adapter sequences?



                • #9
                  Yes, I had that issue as well. I think it is best to trim those. I used trim galore for that & it worked quite well.



                  • #10
                    Hi,

                    I just want to add that we also need to consider the potential sources of the duplication. Is it due to high coverage or to PCR amplification during library prep? It is never clear-cut, but you need to assess which one is more dominant, as they have different impacts on certain quantitation studies.

                    Best regards,
                    Douglas



                    • #11
                      I have looked into fastx clipper which is supposed to trim the adapter sequence. But I have also read some earlier posts here that suggested that fastx clipper didn't work well. http://seqanswers.com/forums/showthr...=fastx+clipper

                      In my case, FastQC suggests that 4.7% of reads (out of 4M sampled) contain the adapter sequence "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCC". But after running fastx_clipper with option -C to remove the above 51-base adapter sequence, I lost 4,752,644 reads. I have a total of ~23M reads -- that's about 20% of reads. It seems either I have done something wrong or the program still has bugs. Any suggestions?

                      I haven't tried trim galore yet.


                      Originally posted by NRP View Post
                      Yes, I had that issue as well. I think it is best to trim those. I used trim galore for that & it worked quite well.



                      • #12
                        I've never tried fastx clipper, but in trim galore you can specify the sequence to trim & adjust the match stringency so that might help.



                        • #13
                          grep -c ADAPTER found about 1M reads containing the adapter, which is about 4.4% of the total and consistent with the FastQC report. Not sure how fastx clipper found and removed 4.7M adapter sequences.

                          I guess Trim Galore is the better option.
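
The grep check above can also be written as a small script, which makes it easy to get both the hit count and the total in one pass. This is a sketch that assumes plain 4-line FASTQ input and, like `grep -c`, counts only exact, full-length matches of the adapter; the function name is made up for illustration.

```python
from itertools import islice
from typing import Iterable, Tuple

# Adapter sequence quoted from the FastQC report earlier in the thread.
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCC"


def count_adapter_reads(lines: Iterable[str],
                        adapter: str = ADAPTER) -> Tuple[int, int]:
    """Return (reads containing the exact adapter, total reads).

    Only the sequence line of each 4-line FASTQ record is inspected.
    """
    hits = total = 0
    for seq in islice(lines, 1, None, 4):  # lines 2, 6, 10, ...
        total += 1
        if adapter in seq:
            hits += 1
    return hits, total


if __name__ == "__main__":
    demo = ["@r1", "AAA" + ADAPTER, "+", "I" * (3 + len(ADAPTER)),
            "@r2", "ACGT", "+", "IIII"]
    hits, total = count_adapter_reads(demo)
    print(f"{hits} of {total} reads contain the full adapter "
          f"({100.0 * hits / total:.1f}%)")
```

Because this counts only exact full-length matches, a tool that also accepts partial adapters at the 3' end or allows mismatches would be expected to flag more reads than this count, which may account for part of the gap between the two numbers.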

                          Originally posted by JQL View Post
                          I have looked into fastx clipper which is supposed to trim the adapter sequence. But I have also read some earlier posts here that suggested that fastx clipper didn't work well. http://seqanswers.com/forums/showthr...=fastx+clipper

                          In my case, FastQC suggests that 4.7% of reads (out of 4M sampled) contain the adapter sequence "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCC". But after running fastx_clipper with option -C to remove the above 51-base adapter sequence, I lost 4,752,644 reads. I have a total of ~23M reads -- that's about 20% of reads. It seems either I have done something wrong or the program still has bugs. Any suggestions?

                          I haven't tried trim galore yet.

