  • FastQC with the ENCODE RNASeq data

    Hello,
    I just downloaded the LHCN RNA-Seq data generated at Caltech. I merged the FASTQ files from the 3 runs into a single file and ran FastQC, and I am a little confused by the output. The per-base quality graph shows quality scores going up to 70 (attached), and the per-sequence quality graph shows peaks at approximately 38 and 68 (also attached). According to the ENCODE documentation, the quality scores are Phred+33, so why do the quality score graphs look like this?
    Apologies if my question is silly and I am misunderstanding the way FastQC works.

    Thanks for help.
    NGSnewbie

  • #2
    That does look odd. Can you tell us where exactly you downloaded this data from? I can have a look to see if I can reproduce this.

    • #3
      I got the 2 X 75 LHCN (cycling and 7 day diff) fastq files from the following link

      Attached is the snapshot of the files that I got

      • #4
        I tried one of the files from the dataset and I am seeing this plot. The data appears to have been submitted on 05/05/2011. Based on that date, it is most likely in the older Illumina (Phred+64) quality encoding.
        Last edited by GenoMax; 10-01-2012, 05:08 AM.


        • #5
          Originally posted by GenoMax
          I tried one of the files from the dataset and I am seeing this plot. The data appears to have been submitted on 05/05/2011. Based on that date, it is most likely in the older Illumina (Phred+64) quality encoding.
          Thanks for trying FastQC on one of the datasets. Apart from your plot looking much better, I notice that my FastQC run identified the encoding as Illumina 1.9, whereas your run identified it as Illumina 1.5. I ran FastQC on the same file that you did and got the same results as you! The plot is attached.
          Could my earlier results have happened because I combined the data from three runs and ran FastQC on the combined dataset? Or did you specify any parameters during the FastQC run?

          Also, would it be better to keep the runs separate and do the alignment and downstream steps per run?

          Thanks a lot for your help.
          NGSnewbie
          Last edited by per_ngs; 10-01-2012, 06:55 AM. Reason: More information


          • #6
            The only explanation seems to be that something happened when you combined the files (did you just "cat" them together?).

            You could keep the lanes separate and then combine the results later.
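
One way to check each file before concatenating is to look at the range of ASCII codes in its quality lines. The following is only a minimal sketch, not how FastQC itself decides: the function names and thresholds are my own assumptions, and it assumes plain 4-line FASTQ records from Illumina-era short reads. Characters below ASCII 59 can only come from Phred+33 data, and characters above 74 ('J') are not expected in Illumina 1.8+ Phred+33 data, so seeing both in one file suggests mixed encodings.

```python
def quality_char_range(lines):
    """Return (lowest, highest) ASCII code found on the quality lines.

    Assumes 4-line FASTQ records, so every 4th line is a quality string.
    """
    lo, hi = 255, 0
    for i, line in enumerate(lines):
        if i % 4 == 3:
            qual = line.rstrip("\r\n")
            if qual:
                codes = [ord(c) for c in qual]
                lo = min(lo, min(codes))
                hi = max(hi, max(codes))
    return lo, hi


def guess_offset(lo, hi):
    """Heuristic guess of the quality encoding from the observed range."""
    if lo < 59 and hi > 74:
        return "mixed"      # both low and high characters: likely merged encodings
    if lo < 59:
        return "phred33"    # too low for any Phred+64 / Solexa+64 file
    if hi > 74:
        return "phred64"    # too high for Illumina 1.8+ Phred+33
    return "ambiguous"      # range fits either encoding


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python check_offset.py run1.fastq run2.fastq ...
    for path in sys.argv[1:]:
        with open(path) as fh:
            print(path, guess_offset(*quality_char_range(fh)))
```

If the individual files report different encodings and the merged file reports "mixed", the concatenation is the likely cause.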

            Originally posted by per_ngs
            Could my earlier results have happened because I combined the data from three runs and performed fastqc on the combined dataset? Or did you specify any parameters during the fastqc run?

            Also, would it be better to keep the runs separate and also do the alignment etc. accordingly?

            Thanks a lot for your help.
            NGSnewbie


            • #7
              Illumina 1.3 through 1.7 use Phred+64, while Illumina 1.8+ (which FastQC reports as "Sanger / Illumina 1.9") uses Phred+33.
              You can't combine files in the two encodings without adjusting the quality scores to match. You will have to treat each version separately or convert the quality scores first.
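
To make the arithmetic concrete: in Phred+64 a score of Q40 is stored as the character 'h' (ASCII 104). If that character is decoded as Phred+33, it reads as 104 - 33 = 71, which is consistent with a plot that runs up to about 70. Below is a minimal conversion sketch in Python. The function names are hypothetical, it assumes 4-line FASTQ records, and it assumes Illumina 1.3 to 1.7 data where scores are never negative (older Solexa data would need a different conversion).

```python
def phred64_to_phred33(qual):
    """Re-encode one Phred+64 quality string as Phred+33."""
    out = []
    for c in qual:
        score = ord(c) - 64
        if score < 0:
            # Not valid Illumina 1.3+ Phred+64; refuse rather than guess.
            raise ValueError(f"character {c!r} is below the Phred+64 range")
        out.append(chr(score + 33))
    return "".join(out)


def convert_fastq(lines):
    """Yield FASTQ lines with every quality line converted to Phred+33.

    Assumes 4-line records, so every 4th line is a quality string.
    """
    for i, line in enumerate(lines):
        if i % 4 == 3:
            yield phred64_to_phred33(line.rstrip("\r\n")) + "\n"
        else:
            yield line
```

Only the Phred+64 files should be passed through this; the converted output can then be concatenated with files that are already Phred+33.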


              • #8
                Thanks GenoMax and pbluescript. I will keep the data separate and process it that way.
