  • FastQC with the ENCODE RNASeq data

    Hello,
    I just downloaded the LHCN RNASeq data generated at Caltech. I merged the fastq files from the 3 runs into a single file and ran FastQC, and I am a little confused by the output. The per-base quality graph shows quality scores going up to 70 (attached), and the per-sequence quality graph shows peaks at approximately 38 and 68 (also attached). According to the ENCODE documentation, the quality scores are Phred+33, so why do the quality score graphs look like this?
    Apologies if my question is silly or if I am misunderstanding how FastQC works.

    Thanks for your help.
    NGSnewbie

  • #2
    That does look odd. Can you tell us where exactly you downloaded this data from? I can have a look to see if I can reproduce this.



    • #3
      I got the 2 X 75 LHCN (cycling and 7 day diff) fastq files from the following link

      Attached is the snapshot of the files that I got


      • #4
        I tried one of the files from the dataset and I am seeing this plot. The data appears to have been submitted on 05/05/2011. Based on that date, it is most likely in Illumina format.
        Last edited by GenoMax; 10-01-2012, 05:08 AM.



        • #5
          Thanks for trying FastQC on one of the datasets. Apart from your plot looking much better, I notice that my FastQC run identified the encoding as Illumina 1.9, whereas your run identified it as Illumina 1.5. I then ran FastQC on the same file that you did and got the same results as you! The plot is attached.
          Could my earlier results be because I combined the data from three runs and ran FastQC on the combined dataset? Or did you specify any parameters during the FastQC run?

          Also, would it be better to keep the runs separate and do the alignment etc. on each run accordingly?

          Thanks a lot for your help.
          NGSnewbie
          Last edited by per_ngs; 10-01-2012, 06:55 AM. Reason: More information



          • #6
            The only explanation seems to be that something happened when you combined the files (did you just "cat" them together?).

            You could keep the lanes separate and then combine the results later.
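One way to check this before concatenating anything is to look at the range of ASCII characters in each file's quality lines. The following is only a rough sketch, not what FastQC itself runs: `quality_char_range` and `guess_offset` are hypothetical helper names, the files are assumed to be standard 4-line-per-record FASTQ, and the rule "anything below ASCII 64 means Phred+33" is a simplified heuristic.

```python
import gzip
from itertools import islice


def quality_char_range(path, max_reads=100_000):
    """Return (lowest, highest) ASCII code seen in the quality lines of a FASTQ file.

    Hypothetical helper: assumes 4 lines per record and only samples
    the first `max_reads` records.
    """
    opener = gzip.open if path.endswith(".gz") else open
    lo, hi = 255, 0
    with opener(path, "rt") as handle:
        # Lines 3, 7, 11, ... (0-based) hold the quality strings.
        for qual in islice(handle, 3, 4 * max_reads, 4):
            codes = [ord(c) for c in qual.strip()]
            if codes:
                lo = min(lo, min(codes))
                hi = max(hi, max(codes))
    return lo, hi


def guess_offset(lowest_code):
    """Simplified heuristic: characters below '@' (ASCII 64) cannot occur in Phred+64 data."""
    return 33 if lowest_code < 64 else 64
```

Running this on each run's file separately and comparing the guessed offsets would show whether the runs disagree. If they do, a plain `cat` produces a file with mixed encodings, and any tool that assumes a single offset for the whole file will misread part of it.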



            • #7
              Illumina 1.3 through 1.7 pipelines use Phred+64, while Illumina 1.8 and later (which FastQC reports as "Illumina 1.9") use Phred+33.
              You can't combine them without adjusting the quality scores to match. You will have to treat each version separately or convert the quality scores.
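For the conversion route, the two encodings differ only in the ASCII offset added to each score, so re-encoding Phred+64 as Phred+33 means shifting every quality character down by 64 - 33 = 31. Below is a minimal sketch under stated assumptions: `phred64_to_phred33` and `convert_fastq` are hypothetical names, the input is assumed to be uncompressed 4-line-per-record FASTQ, and older Solexa-scaled scores (which can be negative) are rejected, not handled.

```python
def phred64_to_phred33(quality):
    """Re-encode one Phred+64 quality string as Phred+33 (hypothetical helper)."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the Phred+64 offset
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 quality character")
        converted.append(chr(score + 33))  # re-encode with the Phred+33 offset
    return "".join(converted)


def convert_fastq(src, dst):
    """Copy a 4-line-per-record FASTQ file, converting only the quality lines."""
    with open(src) as fin, open(dst, "w") as fout:
        for i, line in enumerate(fin):
            if i % 4 == 3:  # every fourth line is the quality string
                line = phred64_to_phred33(line.rstrip("\n")) + "\n"
            fout.write(line)
```

The same arithmetic may explain the odd plots in the first post: a Phred+64 character decoded with the Phred+33 offset comes out 31 points too high, so a true score of 37 would be displayed as 68. That is an inference from the numbers reported in the thread, not something the posters confirmed.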



              • #8
                Thanks GenoMax and pbluescript. I will keep the data separate and process it that way.
