  • FastQC

    Hello,
    we've done an RNA-Seq analysis (Illumina HiSeq2000, 50 bp, paired end) and I have checked the quality with FastQC. This raised some questions for me:

    1. Is FastQC suitable as a quality check for paired-end reads?
    2. For all four of my samples, the program reports a fail for the per base quality of the second read (only the last three bases have a lower quartile below 5 or a median below 20). Is there a logical explanation?
    3. The sequencing and mapping were done by a company, and they told us they trimmed the adapters. But I get a fail for the duplication level, and the overrepresented sequences I see are all primer or adapter sequences. Did they do a bad job?

    Thanks for your help! Isabelle

  • #2
    FastQC is appropriate for QC of PE reads.

    It would be better if you post screenshots/images of the FastQC results instead of just descriptions. Having something marked as "fail" does not automatically fail the entire sample. It is possible that the analysis done by your provider may not have removed all adapter dimers etc.

    • #3
      Ok, here are the images of FastQC...

      • #4
        Read 2 often has decreased quality at its 3' end. A bit of trimming can easily get rid of that.

        BTW, they likely sent you the untrimmed sequences but aligned the trimmed ones, which is why FastQC is telling you that the raw sequences still have adapter contamination.

        Also, a fail on duplication level is pretty much expected for RNAseq data (that test is really only meant for whole-genome sequencing).

        • #5
          Thanks a lot. So it looks to me like the quality check doesn't really make sense then, and it's more or less only good for the per base quality... ?

          • #6
            Yeah, just do a bit of quality/adapter trimming (e.g., with trimmomatic or trim_galore) and you should be good to go.

            • #7
              But can I be sure that the company used trimmed data for mapping? Maybe they didn't. How can I check this?

              • #8
                Just look at the read lengths in the BAM file:

                Code:
                samtools view some_file.bam | cut -f 10 | awk '{print length($1)}' | uniq | sort | uniq
                If they trimmed the reads prior to alignment, you should get more than one value.
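For anyone who would rather tally the lengths in a script, here is a minimal Python sketch of the same check. It assumes SAM text as input (for example the output of samtools view) and only looks at column 10 (SEQ). The names read_length_counts and toy_record, and the toy records themselves, are made up for illustration.

```python
from collections import Counter

def read_length_counts(sam_lines):
    """Count how often each SEQ length (SAM column 10) occurs."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # header line, not an alignment
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[9] == "*":
            continue                      # malformed record or sequence not stored
        counts[len(fields[9])] += 1
    return counts

def toy_record(name, cigar, length):
    """Build a fake SAM record with a SEQ of the given length (illustration only)."""
    return "\t".join([name, "0", "chr1", "100", "37", cigar,
                      "*", "0", "0", "A" * length, "I" * length])

toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    toy_record("r1", "51M", 51),
    toy_record("r2", "51M", 51),
    toy_record("r3", "44M", 44),
]

counts = read_length_counts(toy_sam)
for length in sorted(counts):
    print(length, counts[length])
```

More than one length in the output suggests the reads were trimmed before alignment; a single length suggests they were not, or that trimmed bases were kept as soft clips.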

                • #9
                  I will, but unfortunately I can't do this from my private computer, so I have to wait until I am back at the institute... but many thanks already at this point.

                  • #10
                    Hello, it's a long time ago, but this is still/again relevant for me... I wasn't able to check the data again at the institute with samtools, but shouldn't I see the same thing (different read sizes) if I look at my data in IGV? That in fact shows the same size of 51 bases for all reads, which means the company didn't trim the data before mapping... am I right? Thanks for your help! Isabelle

                    • #11
                      Yes, it sounds like they didn't trim them then. Scroll through IGV and see if there are any soft-clipped alignments (alignments that appear shorter but where the original sequence is 51). Using an aligner that does soft-clipping alleviates some of the issues surrounding adapter contamination and quality. If, however, they did end-to-end alignment (i.e., there are no soft-clipped alignments) on untrimmed data then I'd say they did a half-ass job.
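Soft clipping can also be spotted without IGV, since it shows up as S operations in the CIGAR string (SAM column 6). The following Python sketch only parses CIGAR strings; the function names soft_clipped_bases and unclipped_read_bases are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_bases(cigar):
    """Total number of soft-clipped (S) bases in a CIGAR string."""
    if cigar == "*":                      # CIGAR unavailable (e.g. unmapped read)
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")

def unclipped_read_bases(cigar):
    """Read bases that take part in the alignment (M, I, = and X operations)."""
    if cigar == "*":
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MI=X")

for cigar in ["51M", "45M6S", "3S40M2I6M"]:
    print(cigar, soft_clipped_bases(cigar), unclipped_read_bases(cigar))
```

A read with CIGAR 45M6S is still stored with 51 bases, but only 45 of them are aligned, which is why it looks shorter in IGV.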

                      • #12
                        Hey, thanks for the fast reply. I found some shorter ones... they did it with the -q option of BWA.
                        When I asked them for the mapping parameters I got the following answer:

                        n NUM max #diff (int) or missing prob under 0.02 err rate
                        t:4 (number of threads)
                        M:3 (mismatch penalty)
                        q: (quality threshold for read trimming down to 35bp 0)

                        I am not sure if I understand this 35bp thing, because I can find reads with a length of less than 35bp (maybe the 0 is a typing error)?
                        Another question is, how can I get alignments like that (see figure)??? If you have n=0.02, shouldn't there be at most 2 mismatches per 50 bp? Isabelle

                        • #13
                          Can't say I'm overly familiar with bwa aln, since most people use bwa mem these days.

                          The -n option has to have one of the more confusing descriptions I've seen. If it's an integer then the explanation is simple. I assume that it uses a Poisson distribution with fractional -n, so a value of 0.02 with 50bp reads would correspond to a maximal edit distance of 3 (in R: qpois(0.98, 50*0.02)).
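The R expression above can be reproduced in plain Python. This is only a sketch of the Poisson quantile under the same assumption made in the post (a 0.02 error rate over 50 bp and a 0.98 cumulative probability), not a statement about what bwa does internally. The function name poisson_quantile is made up for illustration.

```python
import math

def poisson_quantile(p, lam):
    """Smallest k such that P(X <= k) >= p for X ~ Poisson(lam)."""
    k = 0
    term = math.exp(-lam)    # P(X = 0)
    cdf = term
    while cdf < p:
        k += 1
        term *= lam / k      # P(X = k) derived from P(X = k - 1)
        cdf += term
    return k

read_length = 50
error_rate = 0.02
# Expected number of errors per read is 50 * 0.02 = 1.
print(poisson_quantile(0.98, read_length * error_rate))  # prints 3
```

With an expected one error per read, the cumulative probabilities for 0, 1, 2 and 3 errors are roughly 0.37, 0.74, 0.92 and 0.98, so 3 is the first value that reaches the 0.98 cutoff.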

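                          The -q option in bwa aln doesn't really specify a minimum read length. It specifies a value used when determining the trim location. According to the bwa manual, a read is trimmed down to:

                          argmax_x { sum_{i=x+1}^{l} (INT - q_i) }  if q_l < INT, where l is the original read length
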
                          The -q value is INT and the quality at position i is q_i. So, this basically sums the penalties and finds the maximum value. The position with the maximum value is where trimming will occur (essentially, obviously if the penalty is <0 then no trimming should occur).
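The trimming rule described above (sum the penalties INT - q_i from the 3' end and cut where the running sum peaks) can be sketched in Python. This is only an illustration of that description, not bwa's actual code: the function name trim_length is made up, qualities are assumed to be already decoded Phred integers, and any minimum read length that bwa may enforce is ignored.

```python
def trim_length(quals, threshold):
    """Number of bases to keep after 3' quality trimming.

    For every candidate length x, the penalty is the sum of
    (threshold - q_i) over the bases that would be removed.
    The length with the largest positive penalty wins; if no
    penalty is positive, the read is left untrimmed.
    """
    best_len = len(quals)
    best_sum = 0
    running = 0
    # Walk from the 3' end; x is the number of bases that would be kept.
    for x in range(len(quals) - 1, -1, -1):
        running += threshold - quals[x]
        if running > best_sum:
            best_sum = running
            best_len = x
    return best_len

quals = [30, 30, 30, 30, 10, 25, 5, 2]
print(trim_length(quals, 20))   # prints 4
print(trim_length(quals, 0))    # prints 8, a threshold of 0 never trims
```

In the first call the running sums from the 3' end are 18, 33, 28 and 38, so the maximum is reached when the last four bases are removed.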

                          • #14
                            Ok, I think I got the -q option. It's just the information from the company that is strange; maybe they mean a quality threshold of 35...
                            But the -n value is absolutely confusing... I have read a lot of threads about this topic, but I still don't get it. If the maximal edit distance in my case is 3, what does that mean??? Is there any relation to the allowed number of mismatches?

                            • #15
                              They are related, yes. "Edit distance" is a generalization of mismatches. If a read aligns with 3 mismatches then its edit distance is 3. However, mismatches can't describe things like insertions or deletions. So if your read aligns with an insertion of 2 bases then it has an edit distance of 2. If it has a single base mismatch and later a deletion of 3 bases then the edit distance is 4. The Wikipedia article on edit distance is quite good. In short, "edit distance" is the minimum number of single character changes (insertion, deletion, or substitution) needed to convert one sequence to another.
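To make the three cases above concrete, here is a small Python sketch of the textbook dynamic-programming edit distance (Levenshtein distance). The sequences are invented toy examples, and the function name edit_distance is made up for illustration; aligners compute this on the alignment itself rather than with code like this.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

# Three mismatches
print(edit_distance("ACGTACGT", "TCGAACGA"))          # prints 3
# Insertion of two bases
print(edit_distance("ACGTACGT", "ACGTTTACGT"))        # prints 2
# One mismatch plus a deletion of three bases
print(edit_distance("ACGTACGTACGT", "AGGTAACGT"))     # prints 4
```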
