  • imsharmanitin
    Postdoc Cancer Bioinformatics
    • Dec 2014
    • 17

    Processing FASTQ files from HiSeq 2000 single-end runs and RNA-seq analysis

    Hello all,

    I apologise in advance for asking some basic questions.

    I have recently started working on RNA-Seq. I have FASTQ files in zipped format.

    I read a lot of threads, and the one that came closest to my queries is
    Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


    As far as I understand, I have to do the following steps:

    1) Merge the FASTQ files for each sample (3 files per sample in my case).
    -> Do I just need to concatenate the files, or is there specific software for this?

    2) Remove barcodes and adapter sequences.
    -> How can I know whether I have barcode and adapter sequences?
    -> Should I use cutadapt before FastQC, given that FastQC gives results on the first 200,000 sequences?
    -> Are there any adapters specific to RNA-seq?

    3) Check the quality with FastQC and discard data based on quality.

    * What is FASTQ grooming, and why do we need to do it?
    As far as I know, since February 2011 the newest version (1.8) of Illumina's CASAVA pipeline directly produces FASTQ files in Sanger format (Phred+33). Hence, I don't need to use FASTQ Groomer.

    4) Align to the reference genome.
    -> Should I use the human assembly GRCh37 or GRCh38? I am inclined to use GRCh38, as it should be the most up-to-date version.


    Some more basic questions:

    In the HiSeq 2000 FASTQ format:

    @HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG
    GGGAGGCTGTTCTGCTTTACGCATCTGAGAACTACATAGGAGAGNAANNN
    +
    CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###

    What is the use of the index sequence?
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    Originally posted by imsharmanitin
    1) Merge the FASTQ files for each sample (3 files per sample in my case).
    -> Do I just need to concatenate the files, or is there specific software for this?

    No special software is needed. You could do something like this (adjust the file names to match your own):

    Code:
    $ zcat L001_R1.fastq.gz L002_R1.fastq.gz L003_R1.fastq.gz L004_R1.fastq.gz | gzip -c - > R1.fastq.gz
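
    Concatenated gzip streams are themselves a valid gzip file, so plain cat on the compressed files also works without decompressing and recompressing. A minimal sketch, assuming one gzipped FASTQ file per lane; merge_fastq_gz and count_reads are hypothetical helper names, not part of any tool:

```shell
# Merge gzipped FASTQ files by concatenating the compressed streams as-is.
# Usage: merge_fastq_gz OUT.fastq.gz IN1.fastq.gz [IN2.fastq.gz ...]
merge_fastq_gz() {
    merged_out=$1
    shift
    cat "$@" > "$merged_out"
}

# Count the reads in a gzipped FASTQ file (each record spans 4 lines).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}
```

    Comparing count_reads on the merged file with the sum over the input files is a quick sanity check that nothing was lost.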
    Originally posted by imsharmanitin
    2) Remove barcodes and adapter sequences.
    -> How can I know whether I have barcode and adapter sequences?
    -> Should I use cutadapt before FastQC, given that FastQC gives results on the first 200,000 sequences?
    -> Are there any adapters specific to RNA-seq?

    I would suggest scanning your reads with a trimming program such as trimmomatic or bbduk.sh (from BBMap); both are paired-end aware. Search this forum for threads on these programs. You should not have barcodes in your reads, since in Illumina technology they are never part of the actual read. Both trimmomatic and bbduk.sh include all standard Illumina adapter sequences (generally in the resources directory). If your sequences have no adapter contamination, they will come through unchanged when passed through the trimming program. This ensures that you carry no extraneous sequences forward into your analysis.
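
    One rough way to see whether adapters are present at all, before running a trimmer, is to count the reads that contain the start of the standard Illumina adapter. This is only a sketch: count_adapter_reads is a hypothetical helper, and the default sequence AGATCGGAAGAGC (the shared prefix of the TruSeq adapters, which FastQC screens for as "Illumina Universal Adapter") is an assumption about the library prep.

```shell
# Count reads whose sequence line contains an exact copy of the adapter.
# Usage: count_adapter_reads FILE.fastq.gz [ADAPTER_SEQUENCE]
count_adapter_reads() {
    adapter_seq=${2:-AGATCGGAAGAGC}
    gzip -dc "$1" | awk -v a="$adapter_seq" '
        # Line 2 of every 4-line record holds the bases.
        NR % 4 == 2 && index($0, a) > 0 { n++ }
        END { print n + 0 }'
}
```

    This only finds exact, complete matches. A trimming program also handles partial adapters at the read end and mismatches, so a count of zero here does not replace the trimming step.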

    Originally posted by imsharmanitin
    3) Check the quality with FastQC and discard data based on quality.

    * What is FASTQ grooming, and why do we need to do it?
    As far as I know, since February 2011 the newest version (1.8) of Illumina's CASAVA pipeline directly produces FASTQ files in Sanger format (Phred+33). Hence, I don't need to use FASTQ Groomer.

    If your data is of recent vintage (from the last ~2 years), then you may not need any grooming/Q-score conversion. Older data may be in the "illumina" FASTQ format, which used a different offset for the scores (Phred+64). More here: http://en.wikipedia.org/wiki/FASTQ_format
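
    If the age of the data is unclear, the offset can be guessed from the quality strings themselves. The sketch below relies on a common heuristic rather than any official tool: characters below ';' (ASCII 59) cannot occur in the older +64 encodings, so seeing one implies Phred+33. The helper name guess_phred_offset and the choice to sample only the first 10,000 reads are assumptions.

```shell
# Guess the quality offset (33 or 64) from the lowest quality character
# seen in the first 10,000 reads of a gzipped FASTQ file.
guess_phred_offset() {
    gzip -dc "$1" | head -n 40000 | LC_ALL=C awk '
        BEGIN {
            # Lookup table from printable ASCII character to its code.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            min = 127
        }
        NR % 4 == 0 {                      # line 4 of each record: qualities
            n = length($0)
            for (j = 1; j <= n; j++) {
                ch = substr($0, j, 1)
                if ((ch in ord) && ord[ch] < min) min = ord[ch]
            }
        }
        END { if (min < 59) print 33; else print 64 }'
}
```

    A result of 33 is reliable. A result of 64 is only a guess, because a small sample of uniformly high-quality Phred+33 reads could also lack low characters.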

    Originally posted by imsharmanitin
    4) Align to the reference genome.
    -> Should I use the human assembly GRCh37 or GRCh38? I am inclined to use GRCh38, as it should be the most up-to-date version.

    That is up to you. hg19 (the UCSC name for GRCh37) generally has fuller annotations available.

    Originally posted by imsharmanitin
    Some more basic questions:

    In the HiSeq 2000 FASTQ format:

    @HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG
    GGGAGGCTGTTCTGCTTTACGCATCTGAGAACTACATAGGAGAGNAANNN
    +
    CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###

    What is the use of the index sequence?

    Index/barcode sequences are used to tag samples so that multiple samples can be pooled together in a single run. After a run is demultiplexed with Illumina software (CASAVA/bcl2fastq), the tag read sequence is moved to the header of each FASTQ read as the reads get binned by sample. In the example you posted above, the tag sequence is the last field of the header line (NTTTCG). All reads from a single sample will have the same tag sequence in that position in a file.
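
    To see which tag sequences a file actually carries, the last colon-separated field of each header line can be tallied. A minimal sketch, assuming CASAVA 1.8-style headers like the one quoted above; tally_index_sequences is a hypothetical helper name:

```shell
# Tally the index (tag) sequences found in the headers of a gzipped FASTQ
# file, most frequent first. Output lines look like "COUNT SEQUENCE".
tally_index_sequences() {
    gzip -dc "$1" | awk '
        # Line 1 of each record is the header; the tag is its last ":" field.
        NR % 4 == 1 { n = split($0, f, ":"); count[f[n]]++ }
        END { for (k in count) print count[k], k }' | sort -k1,1nr
}
```

    A demultiplexed file is expected to show one dominant sequence. A few near-identical variants can show up if the demultiplexing step tolerated mismatches in the index read.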
    Last edited by GenoMax; 06-24-2015, 09:25 AM.
