
**GenoMax** · 06-24-2015, 09:19 AM

Originally posted by imsharmanitin

Hello all,

I apologise in advance for asking some basic questions.

I have recently started working on RNA-Seq. I have fastq files in zipped format.

I have read a lot of threads, and the one that came closest to my queries is

Initial QC and grooming for Illumina HiSeq2000 paired end RNAseq on Galaxy - SEQanswers

http://seqanswers.com/forums/showthread.php?t=21331


As far as I understand, I have to do the following steps:

1) Merge the fastq files for each sample (3 files per sample in my case)
-> Do I just need to concatenate the files, or is there specific software for this?

No special software is needed. You could do something like this (list all the lane files you have for the sample):

Code:

$ zcat L001_R1.fastq.gz L002_R1.fastq.gz L003_R1.fastq.gz L004_R1.fastq.gz | gzip -c - > R1.fastq.gz
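A quick sanity check after merging is to confirm that the merged file holds as many reads as the lane files combined. This is only a minimal sketch, assuming every record is exactly four lines; `count_reads` is a hypothetical helper name, and the file names in the comments simply follow the command above.

```shell
# count_reads FILE.fastq.gz
# Prints the number of reads, assuming four lines per record
# (no wrapped sequence or quality lines).
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Example (file names are assumptions taken from the command above):
#   count_reads L001_R1.fastq.gz
#   count_reads R1.fastq.gz    # should equal the sum of the lane counts
```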

Originally posted by imsharmanitin

2) I have to remove barcodes and adapter sequences.
-> How can I know if I have barcode and adapter sequences?
-> Should I use cutadapt before fastqc, as fastqc gives results on the first 200,000 sequences?
-> Are there any adapters specific to RNA-seq?

I would suggest scanning your reads with a paired-end-aware trimming program such as Trimmomatic or bbduk.sh (from BBMap). Search this forum for threads about these programs. You should not have barcodes in your reads, since in Illumina technology they are never part of the actual read. Both Trimmomatic and bbduk.sh include all the standard Illumina adapter sequences (generally in the resources directory). If your sequences have no adapter contamination, they will come through the trimming program unchanged. Running it either way ensures that you carry no extraneous sequences forward into your analysis.
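For a rough idea of whether adapters are present before you trim, you can count the reads that contain the start of an adapter. This is a sketch only: it assumes the common TruSeq adapter prefix AGATCGGAAGAGC as the default (check the adapter files shipped with your trimming program for your kit), it looks for exact matches only, and `count_adapter_reads` is a hypothetical helper name.

```shell
# count_adapter_reads FILE.fastq.gz [ADAPTER]
# Prints how many reads contain ADAPTER as an exact substring.
# The default adapter prefix is an assumption; pass your own if it differs.
count_adapter_reads() {
    # Keep only the sequence line of each four-line record, then count matches.
    # "|| true" keeps the exit status clean when grep finds no matches.
    gzip -dc "$1" | awk 'NR % 4 == 2' | grep -cF -- "${2:-AGATCGGAAGAGC}" || true
}

# Example (file name is an assumption):
#   count_adapter_reads R1.fastq.gz
```

A count close to zero suggests little adapter read-through. A trimming program remains the more thorough check, since it also handles partial and mismatched adapter matches.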

Originally posted by imsharmanitin

3) Check the quality with fastqc and discard data based on quality

* What is FASTQ grooming and why do we need to do it?
As far as I know, since February 2011 Illumina's newest version (1.8) of their CASAVA pipeline directly produces fastq in Sanger (Phred+33) format. Hence, I don't need to use FASTQ Groomer.

If your data is of recent vintage (from the last ~2 years), you may not need any grooming/Q-score conversion. Older data may be in the "Illumina" fastq format, which used a different offset for the scores (Phred+64). More here: http://en.wikipedia.org/wiki/FASTQ_format
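If you are unsure which offset a file uses, the range of characters in the quality lines usually gives it away. The sketch below is a heuristic, not a definitive test: it assumes four-line records, treats any quality character below ';' as evidence for Phred+33 and any character above 'J' as evidence for Phred+64, and reports "unknown" otherwise. `guess_phred_offset` is a hypothetical helper name.

```shell
# guess_phred_offset FILE.fastq.gz [NREADS]
# Prints 33, 64 or "unknown" based on the first NREADS reads (default 10000).
guess_phred_offset() {
    gzip -dc "$1" | LC_ALL=C awk -v max_lines="$(( ${2:-10000} * 4 ))" '
        NR > max_lines { exit }          # stop after NREADS records
        NR % 4 == 0 {                    # quality line of each record
            n = length($0)
            for (i = 1; i <= n; i++) {
                c = substr($0, i, 1)
                if (lo == "" || c < lo) lo = c   # lowest character seen
                if (hi == "" || c > hi) hi = c   # highest character seen
            }
        }
        END {
            if (lo == "")      print "unknown"   # no quality lines read
            else if (lo < ";") print 33          # below ASCII 59: Phred+33
            else if (hi > "J") print 64          # above ASCII 74: likely Phred+64
            else               print "unknown"   # range fits both encodings
        }'
}

# Example (file name is an assumption):
#   guess_phred_offset R1.fastq.gz
```

The read posted further down in this thread contains '#' in its quality line, which is why it can only be Phred+33.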

Originally posted by imsharmanitin

4) Align with the reference genome
-> Should I use the human assembly GRCh37 or GRCh38? I am inclined to use GRCh38, as it should be the most up-to-date version.

That is up to you. hg19 (the UCSC name for GRCh37) generally has fuller annotations available.

Originally posted by imsharmanitin

Some more basic questions:

In the HiSeq2000 fastq format

@HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG
GGGAGGCTGTTCTGCTTTACGCATCTGAGAACTACATAGGAGAGNAANNN
+
CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###

What is the use of the Index Sequence?

Index/barcode sequences are used to tag samples so that multiple samples can be pooled together in a single run. When a run is demultiplexed with the Illumina software (CASAVA/bcl2fastq), the reads are binned by sample and the tag read sequence is moved to the header of each fastq read. In the example you posted above, the tag sequence is the NTTTCG at the end of the header line (highlighted in red in the original post). All reads from a single sample will have the identical tag sequence in that position in a file.
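To see which tag sequences are actually present in a file, you can tally the last colon-separated field of every header line. This is a minimal sketch, assuming CASAVA 1.8+ style headers like the one quoted above and four-line records; `list_index_sequences` is a hypothetical helper name.

```shell
# list_index_sequences FILE.fastq.gz
# Prints "count index" lines, most frequent index first.
list_index_sequences() {
    gzip -dc "$1" |
        awk 'NR % 4 == 1 { n = split($0, f, ":"); print f[n] }' |  # last field of each header
        sort | uniq -c | sort -rn
}

# Example (file name is an assumption):
#   list_index_sequences R1.fastq.gz
```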

processing Fastq files from HiSeq2000 single end and RNAseq analysis
