  • lindseykelly
    Junior Member
    • Apr 2012
    • 5

    Initial QC and grooming for Illumina HiSeq2000 paired end RNAseq on Galaxy

    I am trying to do RNA-seq analysis on paired-end data from the HiSeq 2000. I have about 50 files for each sample (25 forward and 25 reverse, although each sample has a different number of files).

    I think that I need to:
    -convert them into FASTQ Sanger format using the FASTQ Groomer tool
    -check the quality using the FastQC tool

    I don't know how to handle this many files. Do I have to groom and run the QC for each file? Should I join the paired files and run both tools on each pair, or should I combine all of the data for each sample (which I don't know how to do) and then groom and run the QC on all of the reads for the sample?

    Thanks in advance for advice
    Lindsey
  • lindseykelly
    Junior Member
    • Apr 2012
    • 5

    #2
    This was the response from the Galaxy team, in case someone else has this question:

    Yes, you have this correct. The general path would be to:

    - join forward and reverse data per run
    - run FASTQ Groomer & FastQC
    (note: if your data is already in Sanger FASTQ format with Phred+33 scaled quality
    values, the datatype '.fastqsanger' can be directly assigned and the FASTQ Groomer
    step skipped. This is likely true if your data is from the latest CASAVA pipeline, but
    please double check.)
    - discard data as needed based on quality
    - split forward and reverse data that passes QC
    - concatenate all forward reads from a sample into one FASTQ file
    - concatenate all reverse reads from a sample into one FASTQ file.
    - for each sample, run TopHat using the two concatenated FASTQ files
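
One way to "double check" the quality encoding mentioned in the note above, outside Galaxy, is to look at the range of characters on the quality lines. The following is only a rough sketch, not part of the Galaxy tools: the function names `quality_lines` and `guess_phred_offset` are made up for illustration, and the heuristic assumes plain 4-line FASTQ records. Characters below ';' (ASCII 59) cannot occur in the 64-based encodings, and characters above 'J' (ASCII 74) are not expected in Phred+33 Illumina output.

```python
def quality_lines(path, max_records=10000):
    """Yield the quality string of the first max_records 4-line FASTQ records."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_records:
                break
            if i % 4 == 3:  # 4th line of each record holds the qualities
                yield line.rstrip("\n")


def guess_phred_offset(qualities):
    """Return 33, 64, or None (ambiguous) from an iterable of quality strings."""
    lowest, highest = 255, 0
    for line in qualities:
        for ch in line.strip():
            code = ord(ch)
            lowest = min(lowest, code)
            highest = max(highest, code)
    if lowest < 59:
        return 33  # too low for any 64-based encoding
    if highest > 74:
        return 64  # too high for typical Phred+33 Illumina data
    return None  # range fits both encodings (or no data was seen)
```

For example, `guess_phred_offset(quality_lines("sample_R1.fastq"))` (a hypothetical file name) returning 33 would suggest the FASTQ Groomer step can be skipped; a result of `None` means the sampled reads are not enough to decide.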

    To manipulate paired end data, please see the tools -> NGS: QC and manipulation: FASTQ splitter & FASTQ joiner.

    To combine data files head-to-tail from multiple runs into a single FASTQ file, please see the tool -> Text Manipulation: Concatenate datasets.
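
For anyone doing the head-to-tail concatenation outside Galaxy, a minimal sketch might look like the following. The function name `concatenate_files` is hypothetical, and the sketch assumes uncompressed files that each end with a newline (otherwise the last record of one file would run into the first record of the next).

```python
import shutil


def concatenate_files(input_paths, output_path):
    """Append each input file, in the order given, to output_path."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, so large files are fine
```

The forward and reverse files of a sample have to be listed in the same run order, so that the Nth read in the combined forward file still matches the Nth read in the combined reverse file.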

    I am not sure of the actual volume of data, but if the files start to get large or TopHat fails with a memory error, a local or cluster instance would be the recommendation: http://getgalaxy.org

    For reference:



    Hopefully this helps. Others are welcome to post comments/suggestions.

    Jen
    Galaxy team


    • mhkiani
      Member
      • Oct 2013
      • 12

      #4
      Broken paired reads

      I got some paired-end 100 bp RNA-seq data, and when I did the RNA-seq analysis with CLC, more than 50% of the reads were reported as broken pairs. I'm not sure why.


      • sugo
        Junior Member
        • Nov 2013
        • 8

        #5
        What is the purpose of joining the forward and reverse reads prior to QC? Couldn't the QC be run on the separate reads?


        • Mike2188
          Member
          • Oct 2013
          • 27

          #6
          If you process each file individually, you run into errors during alignment. For instance, if I had 100,000 paired-end reads in two files, forward.fastq and reverse.fastq, and I performed trimming and quality filtering on each file individually, I might end up with one file with 90,000 reads and one with 89,000. When I go to align, the program will assume the first read in forward.fastq corresponds to the first read in reverse.fastq, but the files are now out of sync, so the alignments won't work correctly.
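
To make the point above concrete, here is a rough sketch of filtering the two files together so they stay in sync: a pair is kept only if both mates pass. This is not what any particular Galaxy tool does internally. The function names, the mean-quality criterion, and the threshold of 20 are all assumptions for illustration, and the sketch expects 4-line FASTQ records with Phred+33 qualities.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (or a truncated final record)
        yield tuple(record)


def mean_quality(record, offset=33):
    """Mean Phred score of one record, assuming the given ASCII offset."""
    qual = record[3]
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def filter_pairs(fwd_in, rev_in, fwd_out, rev_out, min_mean_quality=20):
    """Write only pairs where BOTH mates pass; return the number of pairs kept."""
    kept = 0
    for fwd, rev in zip(read_fastq(fwd_in), read_fastq(rev_in)):
        if (mean_quality(fwd) >= min_mean_quality
                and mean_quality(rev) >= min_mean_quality):
            fwd_out.write("\n".join(fwd) + "\n")
            rev_out.write("\n".join(rev) + "\n")
            kept += 1
    return kept
```

Because a pair is dropped as a unit, the two output files always contain the same number of reads in the same order, which is what the aligner relies on.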
