SEQanswers

06-24-2015, 09:01 AM   #1
imsharmanitin
Postdoc Cancer Bioinformatics
 
Location: Oslo, Norway

Join Date: Dec 2014
Posts: 17
Processing FASTQ files from HiSeq2000 single end and RNAseq analysis

Hello all,

I apologise in advance for asking some basic questions.

I have recently started working on RNA-Seq. I have FASTQ files in zipped format.

I have read a lot of threads, and the one that came closest to my queries is
http://seqanswers.com/forums/showthread.php?t=21331

As far as I understand, I have to do the following steps:

1) Merge the FASTQ files for each sample (3 files per sample in my case).
-> Do I just need to concatenate the files, or is there specific software for this?

2) Remove barcodes and adapter sequences.
-> How can I know whether I have barcode and adapter sequences?
-> Should I use cutadapt before FastQC, as FastQC gives results on the first 200,000 sequences?
-> Are there any adapters specific to RNA-seq?

3) Check the quality with FastQC and discard data based on quality.

* What is FASTQ grooming and why do we need to do it?
As far as I know, since February 2011 Illumina's CASAVA pipeline (version 1.8) directly produces FASTQ in Sanger format (Phred+33). Hence, I don't need to use FASTQ Groomer.

4) Align to the reference genome.
-> Should I use the human assembly GRCh37 or GRCh38? I am inclined to use GRCh38, as it should be the most up-to-date version.


Some more basic questions:

In the HiSeq2000 FASTQ format:

@HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG
GGGAGGCTGTTCTGCTTTACGCATCTGAGAACTACATAGGAGAGNAANNN
+
CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###

What is the use of the index sequence?
06-24-2015, 09:19 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,982

Quote:
Originally Posted by imsharmanitin
1) Merge the FASTQ files for each sample (3 files per sample in my case).
-> Do I just need to concatenate the files, or is there specific software for this?

No special software is needed. You could do something like this:

Code:
$ zcat L001_R1.fastq.gz L002_R1.fastq.gz L003_R1.fastq.gz L004_R1.fastq.gz | gzip -c - > R1.fastq.gz
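
For anyone who prefers to script this step, here is a minimal Python sketch of the same merge. It relies on the fact that gzip files can be concatenated byte-for-byte into a valid multi-member gzip stream, so nothing has to be decompressed. The function names `merge_fastq_gz` and `count_reads` are made up for illustration, and the read count assumes plain four-line FASTQ records.

```python
import gzip
import shutil


def merge_fastq_gz(inputs, output):
    """Concatenate gzipped FASTQ files byte-for-byte into one .gz file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

A quick sanity check after merging is that `count_reads` on the merged file equals the sum of `count_reads` over the per-lane files.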
Quote:
Originally Posted by imsharmanitin
2) Remove barcodes and adapter sequences.
-> How can I know whether I have barcode and adapter sequences?
-> Should I use cutadapt before FastQC, as FastQC gives results on the first 200,000 sequences?
-> Are there any adapters specific to RNA-seq?

I would suggest scanning your reads with a trimming program that is paired-end aware, such as Trimmomatic or bbduk.sh (from BBMap). Search this forum for threads on these programs. You should not have barcodes in your reads, since in Illumina technology they are never part of the actual read. Both Trimmomatic and bbduk.sh include all standard Illumina adapter sequences (generally in the resources directory). If your sequences have no adapter contamination, they will come through unchanged when passed through the trimming program. Running the trimmer therefore ensures that you carry no extraneous sequences forward into your analysis.
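
To get a rough idea of whether adapters are present before trimming, one option is to count how many reads contain the start of the adapter. The sketch below is only an illustration and not a replacement for a real trimmer: it assumes a TruSeq-style library, whose adapters begin with `AGATCGGAAGAGC`, and the function name `adapter_fraction` and the `max_reads` cutoff are hypothetical choices.

```python
import gzip

# First 13 bases shared by the Illumina TruSeq adapters.
# Assumption: the library was prepared with a TruSeq-style kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(path, adapter=ADAPTER_PREFIX, max_reads=100_000):
    """Return the fraction of the first max_reads reads containing adapter."""
    seen = 0
    hits = 0
    with gzip.open(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 != 1:  # the sequence is the 2nd line of each record
                continue
            seen += 1
            hits += adapter in line
            if seen >= max_reads:
                break
    return hits / seen if seen else 0.0
```

A fraction near zero suggests little adapter read-through. Exact matching misses adapters that are truncated at the read end or contain sequencing errors, which is why a dedicated trimmer is still the safer route.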

Quote:
Originally Posted by imsharmanitin
3) Check the quality with FastQC and discard data based on quality.

* What is FASTQ grooming and why do we need to do it?
As far as I know, since February 2011 Illumina's CASAVA pipeline (version 1.8) directly produces FASTQ in Sanger format (Phred+33). Hence, I don't need to use FASTQ Groomer.

If your data is of recent vintage (from the last ~2 years), you may not need any grooming/Q-score conversion. Older data may be in "Illumina" FASTQ format, which used a different offset for the scores (Phred+64). More here: http://en.wikipedia.org/wiki/FASTQ_format
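
If in doubt, the offset can usually be guessed from the quality strings themselves. This is a heuristic sketch, with the hypothetical function name `guess_phred_offset`: any character with an ASCII code below 59 cannot occur in the older +64 encodings, so its presence implies Phred+33. The opposite conclusion is weaker, because a Phred+33 file containing only very high scores would look the same.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from FASTQ quality lines.

    Returns None when the input is empty or the evidence is ambiguous.
    """
    lowest = min(
        (ord(ch) for q in quality_strings for ch in q.strip()),
        default=None,
    )
    if lowest is None:
        return None
    if lowest < 59:   # below ';', impossible in the +64 encodings
        return 33
    if lowest >= 64:  # consistent with Phred+64, but not proof
        return 64
    return None       # 59-63: could be old Solexa+64 or Phred+33


# The quality line from the read posted in this thread contains '#' (ASCII 35).
example = "CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###"
print(guess_phred_offset([example]))
```

For the read in this thread the guess is 33, which matches the CASAVA 1.8 style header.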

Quote:
Originally Posted by imsharmanitin
4) Align to the reference genome.
-> Should I use the human assembly GRCh37 or GRCh38? I am inclined to use GRCh38, as it should be the most up-to-date version.

That is up to you. hg19 (GRCh37) generally has fuller annotations available.

Quote:
Originally Posted by imsharmanitin
Some more basic questions:

In the HiSeq2000 FASTQ format:

@HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG
GGGAGGCTGTTCTGCTTTACGCATCTGAGAACTACATAGGAGAGNAANNN
+
CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###

What is the use of the index sequence?

Index/barcode sequences are used to tag samples so that multiple samples can be pooled together in a single run. After a run is demultiplexed using Illumina software (CASAVA/bcl2fastq), the tag read sequence is moved to the header of each FASTQ read as the reads get binned. In the example you posted above, the tag sequence is the last field of the header (NTTTCG). Within a file, all reads from a single sample will have an identical tag sequence in that position.
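
To make the header layout concrete, here is a small parser sketch for the CASAVA 1.8 style header shown above (instrument, run, flowcell, lane, tile, x, y, then read number, filter flag, control bits and index). The names `ReadHeader` and `parse_header` are made up for illustration, and the sketch assumes the header has exactly this two-part, colon-separated layout.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    is_filtered: bool
    control: str
    index: str


def parse_header(line):
    """Split a CASAVA 1.8 style FASTQ header into its fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    name, comment = line[1:].strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return ReadHeader(
        instrument, run, flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", control, index,
    )


header = parse_header(
    "@HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG"
)
print(header.lane, header.index)
```

Collecting `parse_header(...).index` over all headers in a file is one way to confirm that a demultiplexed file carries a single tag.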

Last edited by GenoMax; 06-24-2015 at 09:25 AM.

Tags
adapter trimming, barcodes, fastq format, hiseq2000, rna-seq advice
