06-24-2015, 09:19 AM   #2

Originally Posted by imsharmanitin
Hello all,

I apologise in advance for asking some basic questions.

I have recently started working on RNA-Seq. I have fastq files in zipped format.

I read lot of threads and the one that came very close to my queries is

As far as I understood, I have to do the following steps:

1) Merge the fastq files for each sample (3 files per sample in my case).
-> Do I just need to concatenate the files, or is there some specific software to achieve this?
No special software is needed. You could do something like this:

$ zcat L001_R1.fastq.gz L002_R1.fastq.gz L003_R1.fastq.gz L004_R1.fastq.gz | gzip -c - > R1.fastq.gz
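If you would rather script the merge, here is a minimal Python sketch of the same idea. It relies on the fact that several gzip files appended back to back still form a valid gzip file, so nothing has to be decompressed. The function name `merge_gzipped_fastq` is made up for illustration.

```python
import shutil


def merge_gzipped_fastq(lane_files, merged_path):
    """Concatenate gzipped FASTQ files, in the order given, into one file.

    The compressed bytes are appended unchanged: a gzip file made of several
    gzip members back to back is still a valid gzip file.
    """
    with open(merged_path, "wb") as out:
        for lane_file in lane_files:
            with open(lane_file, "rb") as src:
                shutil.copyfileobj(src, out)
```

For paired-end data, call it once for the R1 files and once for the R2 files, passing the lanes in the same order both times so the read pairs stay in sync.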
Originally Posted by imsharmanitin
2) I have to remove barcodes and adapter sequences.
-> How can I know if I have barcode and adapter sequences?
-> Should I use cutadapt before fastqc, as fastqc gives results on the first 200,000 sequences?
-> Are there any adapters specific to RNA-seq?
I would suggest scanning your reads with a trimming program that is paired-end aware, such as trimmomatic or the trimming tool from the BBMap suite. Search for threads on these programs here. You should not have barcodes in your reads, since in Illumina technology they are never part of the actual read. Both programs include all standard Illumina adapter sequences (generally in the resources directory). If your sequences have no adapter contamination, they will come through unchanged when passed through the trimming program. Either way, this ensures that you carry no extraneous sequences into the rest of your analysis.
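For a quick look before trimming, a rough check like the following Python sketch counts how many reads contain the start of the common TruSeq adapter. The 13-base prefix, the function name and the 4-lines-per-record assumption are mine, and other library kits (Nextera, for example) use different adapter sequences, so treat the result as a hint and not as a replacement for a proper trimming program.

```python
import gzip

# Assumption: the 13-base prefix shared by the common Illumina TruSeq adapters.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(fastq_gz_path, adapter=ADAPTER_PREFIX, max_reads=200_000):
    """Return the fraction of the first max_reads reads that contain adapter.

    Assumes plain 4-line FASTQ records (no wrapped sequence lines).
    """
    total = 0
    hits = 0
    with gzip.open(fastq_gz_path, "rt") as fh:
        for line_number, line in enumerate(fh):
            if line_number % 4 != 1:  # the sequence is line 2 of each record
                continue
            total += 1
            if adapter in line:
                hits += 1
            if total >= max_reads:
                break
    return hits / total if total else 0.0
```

A fraction near zero suggests there is little read-through into the adapter; a noticeable fraction means trimming will actually change your reads.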

Originally Posted by imsharmanitin
3) check the quality with fastqc and discard the data based on quality

* What is FASTQ grooming and why do we need to do it?
As far as I know, since February 2011 the newest version (1.8) of Illumina's CASAVA pipeline directly produces fastq in Sanger format (Phred+33). Hence, I don't need to use FASTQ Groomer.
If your data is of recent vintage (from the last ~2 years), you may not need any grooming/Q-score conversion. Older data may be in "illumina" fastq format, which used a different offset for the scores (phred+64). More here:
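If you are unsure which offset a file uses, the range of quality characters usually gives it away. The Python sketch below is a heuristic only, and the function name and thresholds are my own assumptions: characters below ';' (ASCII 59) cannot occur with a +64 offset, and Illumina phred+33 data normally tops out at 'J' (ASCII 74).

```python
import gzip


def guess_phred_offset(fastq_gz_path, max_reads=10_000):
    """Guess the quality encoding from the range of quality characters seen.

    Returns "phred+33", "phred+64" or "ambiguous".
    Assumes plain 4-line FASTQ records.
    """
    lowest, highest = 255, 0
    with gzip.open(fastq_gz_path, "rt") as fh:
        for line_number, line in enumerate(fh):
            if line_number >= 4 * max_reads:
                break
            if line_number % 4 != 3:  # the quality string is line 4 of each record
                continue
            codes = [ord(ch) for ch in line.rstrip("\n")]
            if codes:
                lowest = min(lowest, min(codes))
                highest = max(highest, max(codes))
    if lowest < 59:
        return "phred+33"  # characters this low cannot occur with a +64 offset
    if highest > 74:
        return "phred+64"  # above 'J', the usual top of Illumina phred+33 data
    return "ambiguous"
```

An "ambiguous" answer means every quality character seen fits both encodings, which can happen with a small or uniformly high-quality sample of reads.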

Originally Posted by imsharmanitin
4) align with reference genome
-> Should I use the human assembly GRCh37 or GRCh38? I am inclined to use GRCh38, as it should be the most up-to-date version.
That is up to you. hg19 (GRCh37) generally has fuller annotations available.

Originally Posted by imsharmanitin
Some more basic questions:

in the HiSeq2000 fastq format

@HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG

What is the use of the Index Sequence?
Index/barcode sequences are used to tag samples so that multiple samples can be pooled together in a single run. After a run is demultiplexed using Illumina software (CASAVA/bcl2fastq), the tag read sequence is moved to the header of each fastq read as the reads get binned. In the example you posted above, the tag sequence is the last field of the header (NTTTCG). All reads from a single sample will have the identical tag sequence in that position in a file.
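To make the header layout concrete, here is a small Python sketch that splits a CASAVA 1.8+ style header into its fields. The class and function names are made up for illustration; the field order (instrument, run, flowcell, lane, tile, x, y, then read, filter flag, control bits, index) follows the Illumina header convention.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: bool
    control: int
    index: str


def parse_header(header_line):
    """Split a CASAVA 1.8+ style FASTQ header into its named fields."""
    location, description = header_line.strip().lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":")
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


header = parse_header("@HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG")
print(header.index)  # NTTTCG
print(header.lane)   # 1
```

Collecting `header.index` over all reads in a file is a quick way to confirm that the file really contains a single sample.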

Last edited by GenoMax; 06-24-2015 at 09:25 AM.