  • How to keep the raw .fastq.gz files for RNASeq data

    Hello,

    I have 75 bp paired-end RNASeq data generated on an Illumina HiSeq 2000, with a mixture of 7 samples in each lane (lanes 1-7) of each flowcell. Each sample has a 6 bp index associated with it. With this protocol, each sample has ~50 small .fastq.gz files for the left reads and ~50 small .fastq.gz files for the right reads. These small files are generated automatically by the sequencer. This brings up my question about how to combine and keep the raw .fastq.gz files.

    I used the command “cat” to combine these 50 small .fastq.gz files into one large .fastq.gz like the following for sample “2894” (is this the right way?)

    cat 2894_CCTTCA_L00*_R1_*.fastq.gz > 2894_R1.fastq.gz
    cat 2894_CCTTCA_L00*_R2_*.fastq.gz > 2894_R2.fastq.gz

    After this, I have two .fastq.gz files for each sample. I think these are the files I want for analysis (TopHat), and also for uploading to a public repository (SRA) when I publish my results.

    However, the support staff in our sequencing core suggested that it is better to keep the original small .fastq.gz files, for two reasons: 1. They are truly raw, that is to say, they are the files generated automatically by the machine. 2. Bowtie2/TopHat2 can take these small files as input directly.
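
For reference on point 2: TopHat and Bowtie2 accept several read files per mate as a comma-separated list, so the chunks can be passed without merging them first. Below is a minimal sketch of building such lists. The helper `join_by_comma`, the index name `genome_index`, the output directory `tophat_2894`, and the R2 file names (assumed to mirror the R1 names) are illustrative assumptions, and the command is only printed, not run. The R1 and R2 lists must name the chunks in the same order so that mates stay paired.

```shell
# Sketch: join file names into one comma-separated list.
# join_by_comma is a hypothetical helper, not part of TopHat or Bowtie2.
join_by_comma() {
    list=""
    for f in "$@"; do
        if [ -z "$list" ]; then
            list=$f
        else
            list="$list,$f"
        fi
    done
    printf '%s\n' "$list"
}

# Demo with two chunk names per mate (no files are read).
# R2 names are assumed to mirror the R1 names.
r1_list=$(join_by_comma 2894_CCTTCA_L001_R1_001.fastq.gz 2894_CCTTCA_L001_R1_002.fastq.gz)
r2_list=$(join_by_comma 2894_CCTTCA_L001_R2_001.fastq.gz 2894_CCTTCA_L001_R2_002.fastq.gz)

# Print the command that would be run; genome_index and tophat_2894 are placeholders.
echo "tophat2 -o tophat_2894 genome_index $r1_list $r2_list"
```

With real data, the arguments could come from a glob such as `2894_CCTTCA_L00*_R1_*.fastq.gz`, which the shell expands in sorted order, so the matching R2 glob yields the chunks in the same order as long as both sets are complete.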

    Keep in mind that our RNASeq project is big, and we cannot afford to keep both the small .fastq.gz files and the combined .fastq.gz files for each sample. So I would like to ask for your suggestions. If you can only keep one copy of the raw .fastq.gz files, which one do you routinely keep for each sample:

    the combined big .fastq.gz file or
    the original 50 small .fastq.gz files generated by the machine

    Many thanks,
    Shirley

  • #2
    It depends on the experiment. If your samples have different conditions, you cannot combine them. If you keep the original files, you can use them as replicates (more statistical power).

    • #3
      As far as the data files for a single sample (on a flowcell) are concerned, having them in many small pieces or in a single large file is equivalent. One can set up the Illumina CASAVA pipeline to generate a single file (instead of the ~2 M sequence file chunks that are produced by default).
      Last edited by GenoMax; 03-25-2014, 06:08 AM.
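
The equivalence can be checked directly, because the gzip format allows several compressed members in one file: a file made by `cat`-ing .gz chunks decompresses to the chunks' contents in order. The sketch below compares checksums of the decompressed streams. The function name `same_content` and the tiny demo files are made up for illustration, not taken from the thread.

```shell
# Sketch: check that a concatenated .fastq.gz decompresses to exactly the
# same bytes as its chunks decompressed in order.
# Usage: same_content COMBINED.fastq.gz CHUNK1.fastq.gz [CHUNK2.fastq.gz ...]
# same_content is a hypothetical helper name.
same_content() {
    combined=$1
    shift
    # cksum reading from stdin prints "<CRC> <byte count>"
    sum_combined=$(gzip -dc "$combined" | cksum)
    sum_chunks=$(gzip -dc "$@" | cksum)
    [ "$sum_combined" = "$sum_chunks" ]
}

# Self-contained demo on two tiny made-up chunks in a temporary directory
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$demo_dir/demo_L001_R1_001.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$demo_dir/demo_L001_R1_002.fastq.gz"
cat "$demo_dir"/demo_L001_R1_*.fastq.gz > "$demo_dir/demo_R1.fastq.gz"

if same_content "$demo_dir/demo_R1.fastq.gz" "$demo_dir"/demo_L001_R1_*.fastq.gz; then
    echo "combined file matches its chunks"
fi
```

Running the same check on the real combined file before deleting the chunks (or the reverse) is a cheap safeguard when only one copy can be kept.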

      • #4
        Thanks both of you for your quick reply.
        GenoMax, the protocol used in our project is a mixture of 7 samples in each lane, for lanes 1-7 of each flowcell. For each sample, data are generated from lanes 1-7, and within each lane there are multiple small (~300 MB) sequence file chunks, as shown below. Can one still set up the Illumina CASAVA pipeline to generate a single file that is equivalent to the following many small pieces? Thanks a lot!

        300103469 Mar 19 9:48 2894_CCTTCA_L001_R1_001.fastq.gz
        299267851 Mar 19 9:47 2894_CCTTCA_L001_R1_002.fastq.gz
        296812322 Mar 19 9:53 2894_CCTTCA_L001_R1_003.fastq.gz
        298068175 Mar 19 9:56 2894_CCTTCA_L001_R1_004.fastq.gz
        298941666 Mar 19 9:59 2894_CCTTCA_L001_R1_005.fastq.gz
        297368542 Mar 19 10:00 2894_CCTTCA_L001_R1_006.fastq.gz
        295074828 Mar 19 10:02 2894_CCTTCA_L001_R1_007.fastq.gz
        27339550 Mar 19 10:02 2894_CCTTCA_L001_R1_008.fastq.gz
        299788150 Mar 19 9:48 2894_CCTTCA_L002_R1_001.fastq.gz
        297005199 Mar 19 9:49 2894_CCTTCA_L002_R1_002.fastq.gz
        299336456 Mar 19 9:51 2894_CCTTCA_L002_R1_003.fastq.gz
        298957127 Mar 19 9:55 2894_CCTTCA_L002_R1_004.fastq.gz
        298370958 Mar 19 9:57 2894_CCTTCA_L002_R1_005.fastq.gz
        296303213 Mar 19 10:00 2894_CCTTCA_L002_R1_006.fastq.gz
        297318084 Mar 19 10:01 2894_CCTTCA_L002_R1_007.fastq.gz
        56309336 Mar 19 9:48 2894_CCTTCA_L002_R1_008.fastq.gz
        299490670 Mar 19 10:02 2894_CCTTCA_L003_R1_001.fastq.gz
        298204197 Mar 19 9:48 2894_CCTTCA_L003_R1_002.fastq.gz
        298381878 Mar 19 9:52 2894_CCTTCA_L003_R1_003.fastq.gz
        298207558 Mar 19 9:54 2894_CCTTCA_L003_R1_004.fastq.gz
        297211698 Mar 19 9:57 2894_CCTTCA_L003_R1_005.fastq.gz
        296272949 Mar 19 10:00 2894_CCTTCA_L003_R1_006.fastq.gz
        295333326 Mar 19 10:01 2894_CCTTCA_L003_R1_007.fastq.gz
        25252928 Mar 19 9:47 2894_CCTTCA_L003_R1_008.fastq.gz
        298636337 Mar 19 9:46 2894_CCTTCA_L004_R1_001.fastq.gz
        298401494 Mar 19 9:49 2894_CCTTCA_L004_R1_002.fastq.gz
        298056832 Mar 19 9:52 2894_CCTTCA_L004_R1_003.fastq.gz
        297487782 Mar 19 9:55 2894_CCTTCA_L004_R1_004.fastq.gz
        296972912 Mar 19 9:58 2894_CCTTCA_L004_R1_005.fastq.gz
        296600770 Mar 19 9:59 2894_CCTTCA_L004_R1_006.fastq.gz
        296969650 Mar 19 10:01 2894_CCTTCA_L004_R1_007.fastq.gz
        6172325 Mar 19 10:02 2894_CCTTCA_L004_R1_008.fastq.gz
        299219937 Mar 19 9:47 2894_CCTTCA_L005_R1_001.fastq.gz
        299250792 Mar 19 9:51 2894_CCTTCA_L005_R1_002.fastq.gz
        299132778 Mar 19 9:53 2894_CCTTCA_L005_R1_003.fastq.gz
        298451004 Mar 19 9:56 2894_CCTTCA_L005_R1_004.fastq.gz
        297911999 Mar 19 9:58 2894_CCTTCA_L005_R1_005.fastq.gz
        297310880 Mar 19 10:00 2894_CCTTCA_L005_R1_006.fastq.gz
        295327365 Mar 19 10:01 2894_CCTTCA_L005_R1_007.fastq.gz
        59057213 Mar 19 10:02 2894_CCTTCA_L005_R1_008.fastq.gz
        297818921 Mar 19 9:46 2894_CCTTCA_L006_R1_001.fastq.gz
        299471365 Mar 19 9:49 2894_CCTTCA_L006_R1_002.fastq.gz
        299352842 Mar 19 9:53 2894_CCTTCA_L006_R1_003.fastq.gz
        297294165 Mar 19 9:56 2894_CCTTCA_L006_R1_004.fastq.gz
        296796918 Mar 19 9:59 2894_CCTTCA_L006_R1_005.fastq.gz
        297483409 Mar 19 10:00 2894_CCTTCA_L006_R1_006.fastq.gz
        295547701 Mar 19 10:02 2894_CCTTCA_L006_R1_007.fastq.gz
        45004013 Mar 19 10:02 2894_CCTTCA_L006_R1_008.fastq.gz

        • #5
          See page 23 of the CASAVA manual for possible ways to organize the sample files based on the concept of "projects". http://supportres.illumina.com/docum..._15011196d.pdf

          I am inclined to keep sample files organized on a per lane basis, which is what CASAVA will do. Probably faster to feed them into an aligner in parallel than a huge single file. That said, you could cat them across lanes into a big file (since you should be able to figure out what lane a sequence came from by looking at the Fastq ID header) but the files may become too unwieldy to handle.
          Last edited by GenoMax; 03-25-2014, 07:15 AM.
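
As an illustration of recovering the lane from the read ID after merging across lanes, here is a minimal sketch. It assumes four-line FASTQ records and CASAVA 1.8-style headers, in which the lane is the fourth colon-separated field; the function name `reads_per_lane`, the instrument name, and the flowcell ID in the demo are made up.

```shell
# Sketch: count reads per lane in a gzipped FASTQ file.
# Assumes four-line records and CASAVA 1.8-style headers, where the lane is
# the 4th colon-separated field:
#   @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
# reads_per_lane is a hypothetical helper name.
reads_per_lane() {
    gzip -dc "$1" |
        awk -F: 'NR % 4 == 1 { n[$4]++ } END { for (lane in n) print lane, n[lane] }' |
        sort -n
}

# Self-contained demo: three made-up reads, one from lane 1 and two from lane 2
demo_dir=$(mktemp -d)
{
    printf '@HWI-ST999:1:FCDEMO:1:1101:1000:2000 1:N:0:CCTTCA\nACGT\n+\nIIII\n'
    printf '@HWI-ST999:1:FCDEMO:2:1101:1000:2000 1:N:0:CCTTCA\nACGT\n+\nIIII\n'
    printf '@HWI-ST999:1:FCDEMO:2:1101:1500:2500 1:N:0:CCTTCA\nTTGA\n+\nIIII\n'
} | gzip -c > "$demo_dir/demo_R1.fastq.gz"

# Prints one "lane count" pair per line
reads_per_lane "$demo_dir/demo_R1.fastq.gz"
```

Only header lines are inspected (every fourth line starting with the first), so a quality string that happens to begin with `@` does not affect the counts.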

          • #6
            Got it. Thank you!
