SEQanswers

Old 03-25-2014, 05:44 AM   #1
shirley0818
Member
 
Location: MA

Join Date: Apr 2013
Posts: 13
How to keep the raw .fastq.gz files for RNASeq data

Hello,

I have 75 bp paired-end RNASeq data generated on an Illumina HiSeq 2000. The protocol multiplexes 7 samples per lane across lanes 1-7 of each flowcell, and each sample has a 6 bp index associated with it. With this protocol, each sample ends up with ~50 small .fastq.gz files for the left reads and ~50 small .fastq.gz files for the right reads. These small files are generated automatically by the sequencer. My question is how to combine and keep the raw .fastq.gz files.

I used the command “cat” to combine these 50 small .fastq.gz files into one large .fastq.gz file, as follows for sample “2894” (is this the right way?):

cat 2894_CCTTCA_L00*_R1*.fastq.gz > 2894_R1.fastq.gz
cat 2894_CCTTCA_L00*_R2*.fastq.gz > 2894_R2.fastq.gz
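
A note on the approach above: concatenating gzip files with cat yields a valid multi-member gzip stream, which gzip decompresses as if it were a single file. One way to confirm that nothing was lost is to compare the decompressed content of the chunks with that of the combined file. The following is a minimal sketch under stated assumptions: the two chunks are tiny synthetic reads made up for illustration, and only the file naming follows the pattern used in this thread.

```shell
#!/usr/bin/env bash
# Sketch: concatenate gzip chunks and confirm no reads were lost.
# The chunk contents are synthetic reads, made up for illustration.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Create two fake chunks for one sample (one read each; four lines per FASTQ record).
printf '@read1\nACGT\n+\nIIII\n' | gzip > 2894_CCTTCA_L001_R1_001.fastq.gz
printf '@read2\nTTGA\n+\nIIII\n' | gzip > 2894_CCTTCA_L001_R1_002.fastq.gz

# The shell expands the glob in sorted order, so R1 and R2 chunks end up in the
# same order as long as both sets of chunks are complete.
cat 2894_CCTTCA_L00*_R1_*.fastq.gz > 2894_R1.fastq.gz

# Checksum of the decompressed content, before and after combining.
chunks_sum=$(gzip -dc 2894_CCTTCA_L00*_R1_*.fastq.gz | cksum)
merged_sum=$(gzip -dc 2894_R1.fastq.gz | cksum)

# Number of reads in the combined file (FASTQ uses four lines per read).
merged_reads=$(( $(gzip -dc 2894_R1.fastq.gz | wc -l) / 4 ))

echo "reads in combined file: $merged_reads"
if [ "$chunks_sum" = "$merged_sum" ]; then
    echo "decompressed content is identical"
else
    echo "decompressed content differs" >&2
    exit 1
fi
```

On real data the same comparison would be run once for the R1 chunks and once for the R2 chunks before deleting anything.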

After this, I have two .fastq.gz files for each sample. I think these are the files I want for analysis (TopHat), and also for uploading to a public repository (SRA) when I publish my results.

However, the support staff in our sequencing core suggested that it is better to keep the original small .fastq.gz files, for two reasons: 1. They are truly raw, that is, they are the files generated automatically by the machine. 2. Bowtie2/TopHat2 can take these small files as input directly.
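
For the second point, TopHat and Bowtie2 accept comma-separated lists of read files for each mate, so the chunks can be passed without combining them first. The following is a rough sketch of how such lists could be built. The chunk files are empty placeholders that only follow the naming pattern in this thread, and the index name genome_index and output directory tophat_2894 are hypothetical. The command is printed rather than run.

```shell
#!/usr/bin/env bash
# Sketch: build comma-separated chunk lists for paired-end alignment.
# The files are empty placeholders; "genome_index" and "tophat_2894" are hypothetical names.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"
touch 2894_CCTTCA_L001_R1_001.fastq.gz 2894_CCTTCA_L001_R1_002.fastq.gz \
      2894_CCTTCA_L001_R2_001.fastq.gz 2894_CCTTCA_L001_R2_002.fastq.gz

# Join all arguments with commas (assumes file names contain no spaces or commas).
join_by_comma() {
    local IFS=,
    echo "$*"
}

# Globs expand in sorted order, so both lists list the chunks in the same order.
r1_list=$(join_by_comma 2894_CCTTCA_L00*_R1_*.fastq.gz)
r2_list=$(join_by_comma 2894_CCTTCA_L00*_R2_*.fastq.gz)

# Sanity check: swapping _R1_ for _R2_ in the left list must give the right list,
# otherwise a chunk is missing and the mates would be out of step.
if [ "${r1_list//_R1_/_R2_}" != "$r2_list" ]; then
    echo "R1 and R2 chunk lists do not match" >&2
    exit 1
fi

# Print the command instead of running it, so the sketch works without TopHat installed.
echo "tophat2 -o tophat_2894 genome_index $r1_list $r2_list"
```

The pairing check matters more than the join itself: a missing chunk on one side would silently misalign every mate pair after it.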

Keep in mind that our RNASeq project is big, and we cannot afford to keep both the small .fastq.gz files and the combined .fastq.gz files for each sample. So I would like to ask for your suggestions. If you could keep only one copy of the raw .fastq.gz files, which would you routinely keep for each sample:

the combined big .fastq.gz file or
the original 50 small .fastq.gz files generated by the machine

Many thanks,
Shirley
Old 03-25-2014, 05:57 AM   #2
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329

It depends on the experiment. If your samples come from different conditions, you cannot combine them. If you keep the original files, you can use them as replicates (more statistical power).
Old 03-25-2014, 06:06 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,962

As far as the data files for a single sample (on a flowcell) are concerned, having them in many small pieces or in a single large file is equivalent. One can set up the Illumina CASAVA pipeline to generate a single file (instead of the ~2 M sequence file chunks that are produced by default).

Last edited by GenoMax; 03-25-2014 at 06:08 AM.
Old 03-25-2014, 06:39 AM   #4
shirley0818
Member
 
Location: MA

Join Date: Apr 2013
Posts: 13

Thank you both for your quick replies.
GenoMax, the protocol used in our project multiplexes 7 samples per lane across lanes 1-7 of each flowcell. For each sample, data are generated from lanes 1-7, and within each lane there are multiple small (~300 MB) sequence file chunks, as shown below. Can one still set up the Illumina CASAVA pipeline to generate a single file that is equivalent to the following many small pieces? Thanks a lot!

300103469 Mar 19 9:48 2894_CCTTCA_L001_R1_001.fastq.gz
299267851 Mar 19 9:47 2894_CCTTCA_L001_R1_002.fastq.gz
296812322 Mar 19 9:53 2894_CCTTCA_L001_R1_003.fastq.gz
298068175 Mar 19 9:56 2894_CCTTCA_L001_R1_004.fastq.gz
298941666 Mar 19 9:59 2894_CCTTCA_L001_R1_005.fastq.gz
297368542 Mar 19 10:00 2894_CCTTCA_L001_R1_006.fastq.gz
295074828 Mar 19 10:02 2894_CCTTCA_L001_R1_007.fastq.gz
27339550 Mar 19 10:02 2894_CCTTCA_L001_R1_008.fastq.gz
299788150 Mar 19 9:48 2894_CCTTCA_L002_R1_001.fastq.gz
297005199 Mar 19 9:49 2894_CCTTCA_L002_R1_002.fastq.gz
299336456 Mar 19 9:51 2894_CCTTCA_L002_R1_003.fastq.gz
298957127 Mar 19 9:55 2894_CCTTCA_L002_R1_004.fastq.gz
298370958 Mar 19 9:57 2894_CCTTCA_L002_R1_005.fastq.gz
296303213 Mar 19 10:00 2894_CCTTCA_L002_R1_006.fastq.gz
297318084 Mar 19 10:01 2894_CCTTCA_L002_R1_007.fastq.gz
56309336 Mar 19 9:48 2894_CCTTCA_L002_R1_008.fastq.gz
299490670 Mar 19 10:02 2894_CCTTCA_L003_R1_001.fastq.gz
298204197 Mar 19 9:48 2894_CCTTCA_L003_R1_002.fastq.gz
298381878 Mar 19 9:52 2894_CCTTCA_L003_R1_003.fastq.gz
298207558 Mar 19 9:54 2894_CCTTCA_L003_R1_004.fastq.gz
297211698 Mar 19 9:57 2894_CCTTCA_L003_R1_005.fastq.gz
296272949 Mar 19 10:00 2894_CCTTCA_L003_R1_006.fastq.gz
295333326 Mar 19 10:01 2894_CCTTCA_L003_R1_007.fastq.gz
25252928 Mar 19 9:47 2894_CCTTCA_L003_R1_008.fastq.gz
298636337 Mar 19 9:46 2894_CCTTCA_L004_R1_001.fastq.gz
298401494 Mar 19 9:49 2894_CCTTCA_L004_R1_002.fastq.gz
298056832 Mar 19 9:52 2894_CCTTCA_L004_R1_003.fastq.gz
297487782 Mar 19 9:55 2894_CCTTCA_L004_R1_004.fastq.gz
296972912 Mar 19 9:58 2894_CCTTCA_L004_R1_005.fastq.gz
296600770 Mar 19 9:59 2894_CCTTCA_L004_R1_006.fastq.gz
296969650 Mar 19 10:01 2894_CCTTCA_L004_R1_007.fastq.gz
6172325 Mar 19 10:02 2894_CCTTCA_L004_R1_008.fastq.gz
299219937 Mar 19 9:47 2894_CCTTCA_L005_R1_001.fastq.gz
299250792 Mar 19 9:51 2894_CCTTCA_L005_R1_002.fastq.gz
299132778 Mar 19 9:53 2894_CCTTCA_L005_R1_003.fastq.gz
298451004 Mar 19 9:56 2894_CCTTCA_L005_R1_004.fastq.gz
297911999 Mar 19 9:58 2894_CCTTCA_L005_R1_005.fastq.gz
297310880 Mar 19 10:00 2894_CCTTCA_L005_R1_006.fastq.gz
295327365 Mar 19 10:01 2894_CCTTCA_L005_R1_007.fastq.gz
59057213 Mar 19 10:02 2894_CCTTCA_L005_R1_008.fastq.gz
297818921 Mar 19 9:46 2894_CCTTCA_L006_R1_001.fastq.gz
299471365 Mar 19 9:49 2894_CCTTCA_L006_R1_002.fastq.gz
299352842 Mar 19 9:53 2894_CCTTCA_L006_R1_003.fastq.gz
297294165 Mar 19 9:56 2894_CCTTCA_L006_R1_004.fastq.gz
296796918 Mar 19 9:59 2894_CCTTCA_L006_R1_005.fastq.gz
297483409 Mar 19 10:00 2894_CCTTCA_L006_R1_006.fastq.gz
295547701 Mar 19 10:02 2894_CCTTCA_L006_R1_007.fastq.gz
45004013 Mar 19 10:02 2894_CCTTCA_L006_R1_008.fastq.gz
Old 03-25-2014, 07:10 AM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,962

See page 23 of the CASAVA manual for possible organization of the sample files based on concept of "projects". http://supportres.illumina.com/docum..._15011196d.pdf

I am inclined to keep sample files organized on a per-lane basis, which is what CASAVA will do. It is probably faster to feed them into an aligner in parallel than as one huge file. That said, you could cat them across lanes into one big file (you should be able to figure out which lane a sequence came from by looking at the FASTQ ID header), but the files may become too unwieldy to handle.
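
To illustrate the point about the header: in CASAVA 1.8-style FASTQ files the read ID has the form @instrument:run:flowcell:lane:tile:x:y, so the lane is the fourth colon-separated field. The following is a minimal sketch that tallies reads per lane in a file combined across lanes. It assumes that header layout, and the three reads (including the instrument name HWI-ST1 and flowcell name FLOWCELL1) are synthetic, made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count reads per lane in a FASTQ file that was concatenated across lanes.
# Assumes CASAVA 1.8-style headers, where the lane is the 4th colon-separated field:
#   @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
# The reads below are synthetic and made up for illustration.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

{
    printf '@HWI-ST1:10:FLOWCELL1:1:1101:1000:2000 1:N:0:CCTTCA\nACGT\n+\nIIII\n'
    printf '@HWI-ST1:10:FLOWCELL1:2:1101:1500:2100 1:N:0:CCTTCA\nTTGA\n+\nIIII\n'
    printf '@HWI-ST1:10:FLOWCELL1:2:1102:1700:2300 1:N:0:CCTTCA\nGGCA\n+\nIIII\n'
} | gzip > 2894_R1.fastq.gz

# Every 4th line starting at line 1 is a header; split it on ":" and tally field 4.
lane_counts=$(gzip -dc 2894_R1.fastq.gz |
    awk -F: 'NR % 4 == 1 { n[$4]++ } END { for (lane in n) print lane, n[lane] }' |
    sort -n)

# One line per lane: lane number, then number of reads.
echo "$lane_counts"
```

Selecting header lines by position (every fourth line) rather than by a leading @ avoids miscounting quality strings that happen to start with @.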

Last edited by GenoMax; 03-25-2014 at 07:15 AM.
Old 03-25-2014, 09:15 AM   #6
shirley0818
Member
 
Location: MA

Join Date: Apr 2013
Posts: 13

Got it. Thank you!