Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Trim/filter or not?

    Hello,
    I have some paired-end RNA-Seq reads (length=100) sequenced using Illumina's HiSeq platform and was wondering whether I should trim/ilter my reads. I've run FastQC with some of my samples and the immages are attached. Here are my questions:
    1) The per_base_n_content plots show many Ns in the last (about 10 or so) positions. Is this normal? Do I need to trim the end of reads if I use these reads for differential expression analysis? What if I use them for variant calling? If trimming is necessary, what is a good strategy? To trim a fixed number of bases, to trim by quality score, or to trim if there's an N and the quality is low?
    2) The duplication levels of my samples are all over 50%. I can use Picard to mark those caused by PCR amplification or something else. But how reliable is that? Is it safe to remove duplicates marked by that software?
    3) There are some adapter sequences in my reads but the percentage is low (from 0.1% to <1%). Is it necessary to remove them?
    I actually have two independent sets of data sequenced by our NGS core facility. The samples were prepared by different people for different purposes at different times, but both data sets have similar FastQC results. If they both look unusual, could it be that there was something worng with the core facility?

    Thanks,
    Sylvia
    Attached Files

  • #2
    The per_base_n_content plots show many Ns in the last (about 10 or so) positions. Is this normal?
    On Illumina machines, base quality decreases as you get farther into the read, so yes.

    Do I need to trim the end of reads if I use these reads for differential expression analysis?
    Any sane aligner should handle these properly, but mild trimming is always a good idea.

    If trimming is necessary, what is a good strategy? To trim a fixed number of bases, to trim by quality score, or to trim if there's an N and the quality is low?
    The only fixed number of bases it makes sense to trim are the six that came from your random hexamer primer, if that's what you did, since those are prone to mismatches. Otherwise set your trim threshold to Q5 or so (see http://genomebio.org/is-trimming-is-...al-in-rna-seq/). This should also remove the N's, which are low-quality by definition.

    The duplication levels of my samples are all over 50%. I can use Picard to mark those caused by PCR amplification or something else. But how reliable is that? Is it safe to remove duplicates marked by that software?
    Do you have some reason to think your library is overamplified? E.g. did you do more than 15 cycles of PCR? Or, did you sequence to high depth? (That would cause "duplicate" reads by chance.)

    Inspect the read distribution in a browser and see if it looks spikey. Or, since you have paired-end reads (unnecessary for differential expression, FYI), confirm that both R1s and R2s are identical, which is very unlikely to happen by chance, before you remove any reads.

    There are some adapter sequences in my reads but the percentage is low (from 0.1% to <1%). Is it necessary to remove them?
    Only if you don't feel like wasting <1% of your data. It's easy to fix with e.g. Trimmomatic so why wouldn't you?

    If they both look unusual, could it be that there was something worng with the core facility?
    Core facilities are consistent but sometimes that means consistently wrong. E.g. mine very consistently selects ~175 bp inserts for 2x100 sequencing, which is at best a creative use of paired-end sequencing that requires special bioinformatics to take advantage of, but more likely a simple oversight.

    Comment


    • #3
      HI jwfoley,
      Thank you so much for the informative reply. So it sounds like I can do a Q5 trimming as well as removing adapters. I don't know the details about library preparation, so I'll ask my colleagues how they did it.
      Is the "random hexamer primer" the barcode? If so, the core facility should have removed them.
      As for the duplication, since my reads are paired-end, if both the left and right duplicated reads are exactly the same, they are unlikely to be real. Is this understanding correct? Also, will Picard or whatever software mark only this kind of duplicates?
      My reads' map rate by TopHat is only around 60%. I don't know why and whether this needs to be improved. I understand that simply imcreasing the map rate doesn't necessarily give a better differential expression result.

      Thank you,
      Sylvia

      Comment


      • #4
        Is the "random hexamer primer" the barcode? If so, the core facility should have removed them.
        No, most likely your barcodes are in separate reads, per the current Illumina system. The random hexamer I'm talking about is the reverse-transcription primer, in the most common RNA-seq library protocols. If you used one of those, it is probably found in the first six bases of both reads if your library is nondirectional ("fr-unstranded" in TopHat) or the first six bases of R1 if it's from the common directional protocol ("fr-firststrand"). http://onetipperday.blogspot.ca/2012...pe-to-use.html

        As for the duplication, since my reads are paired-end, if both the left and right duplicated reads are exactly the same, they are unlikely to be real. Is this understanding correct?
        Yes, that's what I'm saying. True PCR clones will be identical at both ends.

        Also, will Picard or whatever software mark only this kind of duplicates?
        I don't know.

        My reads' map rate by TopHat is only around 60%. I don't know why and whether this needs to be improved.
        It's a bit low but see what happens when you trim low-quality bases and clip off adapter sequence.

        Comment


        • #5
          Originally posted by jwfoley View Post
          On Illumina machines, base quality decreases as you get farther into the read, so yes.
          Any sane aligner should handle these properly, but mild trimming is always a good idea.
          True, but I've never had data come back with that profile at the end of the reads, RNA-Seq or otherwise, not in thousands of samples - so it is not as far as I am concerned 'normal'.

          Comment


          • #6
            Fair enough. It's not normal but it's not unheard of.

            It certainly wouldn't be unreasonable to toss all those positions out of an abundance of caution, but I suspect trimming by quality would have the same effect. The question is whether you believe the non-N bases in those positions.
            Last edited by jwfoley; 02-12-2014, 01:15 PM.

            Comment


            • #7
              I tried the Q5 trimming using FastX with a typical sample, but no reads were trimmed. I think it's because all bases in my reads have a score higher than 5. So why did I get those Ns at the end of reads, and scores of those Ns are not low?
              Last edited by zzhao2; 02-13-2014, 09:20 AM.

              Comment


              • #8
                You could CROP the sequences using trimmomatic to get rid of the N's.

                That said it is odd that you have so many N's at the end of the reads. That almost certainly indicates that there was some technical problem with the flowcell this sample ran on. You may want to check with your sequence provider to see if something happened.

                Comment


                • #9
                  Well, I've used trimmomatic to get rid of adapters, N's and low quality bases for the sample shown in those plots in my initial post. I used the example parameter settings in the trimmomatic's manual for paired-end data i.e.

                  ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

                  Then I've rerun tophat and got an increased map rate at about 70% (it used to be about 60%). I checked the trimmed reads again using FastQC and now the per_base_n_content plot looks normal. However, the per_base_gc_content is still like before i.e. the last (3~5? can't tell exactly from the plot) bases' gc% are very different. Please see attached images. Do I need to crop the last a few bases off?

                  Also, my first 10-15 bases' gc% are fluctuating as well, but it seems that people think this is quite normal. so I don't need to crop them, right?
                  Attached Files

                  Comment


                  • #10
                    Originally posted by zzhao2 View Post
                    Also, my first 10-15 bases' gc% are fluctuating as well, but it seems that people think this is quite normal. so I don't need to crop them, right?
                    That is correct, specially for RNA-seq data.

                    I still feel that there was something technically wrong with this run (see how the A signal just drops to zero). If you are able to, then ask your sequence provider about any problems during the run, before saying anything about what you are observing.

                    Comment


                    • #11
                      Originally posted by GenoMax View Post
                      I still feel that there was something technically wrong with this run (see how the A signal just drops to zero).
                      Agree. And as I have mentioned, I have two independent sets of data and both have similar FastQC plots. Maybe our core facility consistently has some problem. I'll check with them.

                      Thank you all so much for your help!

                      Comment


                      • #12
                        Originally posted by zzhao2 View Post
                        Agree. And as I have mentioned, I have two independent sets of data and both have similar FastQC plots. Maybe our core facility consistently has some problem. I'll check with them.

                        Thank you all so much for your help!
                        The question is are these independent samples that were at different times on separate flowcells? If that was the case and you are still seeing that pattern then it may indicate that the problem may be with your libraries.

                        Comment


                        • #13
                          Originally posted by GenoMax View Post
                          The question is are these independent samples that were at different times on separate flowcells? If that was the case and you are still seeing that pattern then it may indicate that the problem may be with your libraries.
                          My two colleagues did their RNA-Seq experiments at different times, and they study different problems. One had 6 samples sequenced maybe 6 months ago (let me call it exp1), and the other had 18 samples sequenced 2 months ago (let me call it exp2). Honestly I know nothing about the details of their experiments. They sent their samples to our core facility and I just downloaded the data for analysis. Then I randomly checked 3 samples in exp2 and 1 sample in exp1, and their FastQC plots all look similar to those attached in my initial post. Based on this description, can you help identify where the problem could be? Or if more information is needed, please just let me know.

                          Another question is the duplication level.The duplication levels of all the random samples that I've checked are over 50%. Is >50% normal for RNA-Seq? If not, how to avoid it in the future experiment design?

                          Comment


                          • #14
                            I was only referring to the drop in A signal based on the plots you had posted. It would be odd to have that pattern repeat on samples done months apart.

                            Just to confirm: These samples have already been demultiplexed, correct? The only time I remember seeing that sort of pattern at the end of the read is for the "tag".

                            See Simon's note about the duplication plots in FastQC: http://proteo.me.uk/2013/09/a-new-wa...-fastqc-v0-11/ and the original interpretation: http://proteo.me.uk/2011/05/interpre...lot-in-fastqc/

                            Comment


                            • #15
                              I have only just looked at the FastQC plots of your data, and I don't think it looks all that abnormal.

                              Try running FastQC using the --nogroup parameter, it will let you see how many bases at the end of the read are affected, it could be only the last base, which often has a much lower quality than the rest of the read.

                              As for the trimmomatic trimming, how long are the adapter sequences you are using for palindrome trimming?
                              With a threshold score of 30 for palindrome trimming, if each matching base adds 0.6 to the score (see the trimmomatic web page), unless the sequences in your adapter.fasta file are quite long, trimmomatic will not recognise and trim the adapter sequences.

                              You can use grep to see how many adapter sequences are in your reads before and after trimmomatic trimming.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin


                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                                Yesterday, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              39 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              41 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              35 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              55 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X