
  • Processing FASTQ files from HiSeq2000 single end and RNA-Seq analysis

    Hello all,

    I apologise in advance for asking some basic questions.

    I have recently started working on RNA-Seq. I have FASTQ files in zipped format.

    I have read a lot of threads, and the one that came closest to my queries is
    Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


    As far as I understand, I have to do the following steps:

    1) Merge the FASTQ files for each sample (3 files per sample in my case).
    -> Do I just need to concatenate the files, or is there specific software for this?

    2) Remove barcodes and adapter sequences.
    -> How can I know whether I have barcode and adapter sequences?
    -> Should I use cutadapt before FastQC, as FastQC gives results on the first 200,000 sequences?
    -> Are there any adapters specific to RNA-Seq?

    3) Check the quality with FastQC and discard data based on quality.

    * What is FASTQ grooming and why do we need to do it?
    As far as I know, since February 2011 the newest version (1.8) of Illumina's CASAVA pipeline directly produces FASTQ in Sanger format (Phred+33). Hence, I don't need to use FASTQ Groomer.

    4) Align to the reference genome.
    -> Should I use the human assembly GRCh37 or GRCh38? I am inclined to use GRCh38 as it should be the most up-to-date version.


    Some more basic questions:

    In the HiSeq2000 FASTQ format:

    @HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG
    GGGAGGCTGTTCTGCTTTACGCATCTGAGAACTACATAGGAGAGNAANNN
    +
    CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###

    What is the use of the index sequence?

  • #2
    Originally posted by imsharmanitin
    Hello all,

    I apologise in advance for asking some basic questions.

    I have recently started working on RNA-Seq. I have FASTQ files in zipped format.

    I have read a lot of threads, and the one that came closest to my queries is
    Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


    As far as I understand, I have to do the following steps:

    1) Merge the FASTQ files for each sample (3 files per sample in my case).
    -> Do I just need to concatenate the files, or is there specific software for this?

    No special software is needed; you could do something like this:

    Code:
    $ zcat L001_R1.fastq.gz L002_R1.fastq.gz L003_R1.fastq.gz L004_R1.fastq.gz | gzip -c - > R1.fastq.gz
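
The gzip format allows several compressed members to sit back to back in one file, so the lane files can also be joined without decompressing and recompressing them. The following is a minimal Python sketch of that idea, not the command above; the function name `concat_gzip` and the file names in the usage note are hypothetical.

```python
import gzip
import shutil


def concat_gzip(inputs, output):
    """Append the raw bytes of each gzip file to `output`, in the order given.

    The result is a multi-member gzip file, which standard tools
    (zcat, Python's gzip module) read as one continuous stream.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(path):
    """Count FASTQ records (4 lines each) in a gzipped file, as a sanity check."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

A call such as `concat_gzip(["L001_R1.fastq.gz", "L002_R1.fastq.gz", "L003_R1.fastq.gz"], "R1.fastq.gz")` followed by `count_reads("R1.fastq.gz")` should report the sum of the read counts of the individual lane files.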
    Originally posted by imsharmanitin
    2) Remove barcodes and adapter sequences.
    -> How can I know whether I have barcode and adapter sequences?
    -> Should I use cutadapt before FastQC, as FastQC gives results on the first 200,000 sequences?
    -> Are there any adapters specific to RNA-Seq?

    I would suggest scanning your reads with a trimming program such as Trimmomatic or bbduk.sh (from BBMap); both are paired-end aware. Search for threads on these programs here. You should not have barcodes in your reads, since in Illumina technology they are never part of the actual read. Both Trimmomatic and bbduk.sh include all standard Illumina adapter sequences (generally in a resources directory). If your sequences have no adapter contamination, they will come through unchanged when passed through the trimming program. This ensures that you carry no extraneous sequences forward into your analysis.
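
As a rough way to see whether adapter read-through is present before trimming, one can count how many reads contain the start of a known adapter. The sketch below is only an illustration in Python, not what Trimmomatic or bbduk.sh do internally: it looks for exact matches of `AGATCGGAAGAGC`, which is assumed here to be the relevant adapter prefix (it is the common start of the Illumina TruSeq adapters). Real trimmers also catch partial matches at the read end and tolerate mismatches. The function names are hypothetical.

```python
import gzip
from itertools import islice

# Assumption: the library was made with TruSeq-style adapters, which start with this prefix.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def read_sequences(path):
    """Yield the sequence line of every 4-line FASTQ record in a gzipped file."""
    with gzip.open(path, "rt") as handle:
        # Sequence lines are the 2nd line of each record: indices 1, 5, 9, ...
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def adapter_fraction(sequences, adapter=ADAPTER_PREFIX):
    """Return the fraction of sequences containing an exact copy of `adapter`."""
    total = hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0
```

For example, `adapter_fraction(read_sequences("R1.fastq.gz"))` gives a quick estimate for a whole file; a clearly non-zero fraction suggests that running a proper trimmer is worthwhile.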

    Originally posted by imsharmanitin
    3) Check the quality with FastQC and discard data based on quality.

    * What is FASTQ grooming and why do we need to do it?
    As far as I know, since February 2011 the newest version (1.8) of Illumina's CASAVA pipeline directly produces FASTQ in Sanger format (Phred+33). Hence, I don't need to use FASTQ Groomer.

    If your data is of recent vintage (from the last ~2 years), you may not need any grooming/Q-score conversion. Older data may be in "Illumina" FASTQ format, which used a different offset for the scores (Phred+64). More here: http://en.wikipedia.org/wiki/FASTQ_format
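
Whether a file needs Q-score conversion can usually be guessed from the quality characters themselves. The following Python sketch is a heuristic under stated assumptions, not a definitive test: any character with an ASCII code below 59 (`;`) cannot occur in the older +64 encodings, so its presence indicates Phred+33. If no such character is seen, the function returns 64 as a guess, although a Phred+33 file containing only very high qualities would look the same. The function name is hypothetical.

```python
def guess_phred_offset(quality_lines):
    """Guess the quality offset (33 or 64) from an iterable of FASTQ quality strings."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        # Characters below ';' are only valid in Sanger / Illumina 1.8+ (Phred+33).
        return 33
    # No low characters seen: probably an older +64 encoding (a guess, see lead-in).
    return 64


# Quality line from the read posted in this thread; it contains '#' (ASCII 35).
example = "CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###"
print(guess_phred_offset([example]))  # prints 33
```

Sampling a few thousand quality lines rather than a single read makes the guess more reliable.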

    Originally posted by imsharmanitin
    4) Align to the reference genome.
    -> Should I use the human assembly GRCh37 or GRCh38? I am inclined to use GRCh38 as it should be the most up-to-date version.

    That is up to you. hg19 (GRCh37) generally has fuller annotations available.

    Originally posted by imsharmanitin
    Some more basic questions:

    In the HiSeq2000 FASTQ format:

    @HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG
    GGGAGGCTGTTCTGCTTTACGCATCTGAGAACTACATAGGAGAGNAANNN
    +
    CCCFFFFFHHHHHJJJJJJJJJ1FHIJJJJJJJJJJJJJJJJJJ#0?###

    What is the use of the index sequence?

    Index/barcode sequences are used to tag samples so that multiple samples can be pooled together in a single run. After a run is demultiplexed using Illumina software (CASAVA/bcl2fastq), the tag read sequence is moved to the header of each FASTQ read as the reads get binned. In the example you posted above, the tag sequence is the last field of the header line (NTTTCG). A single sample will have an identical tag sequence in that position throughout a file.
    Last edited by GenoMax; 06-24-2015, 09:25 AM.
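
To make the header layout discussed above concrete, here is a small Python sketch that splits a CASAVA 1.8-style header into its fields. It assumes the usual layout `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`; the function name and dictionary keys are hypothetical.

```python
def parse_casava18_header(line):
    """Split a CASAVA 1.8+ FASTQ header line into a dict of named fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    name, comment = line[1:].strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),             # 1 for single-end data or the first mate
        "filtered": filtered == "Y",   # "Y" means the read failed the chastity filter
        "control": int(control),
        "index": index,                # the sample tag/barcode sequence
    }


header = "@HWI-ST1146:243:C5HH7ACXX:1:2316:16223:100755 1:N:0:NTTTCG"
print(parse_casava18_header(header)["index"])  # prints NTTTCG
```

Collecting the `index` field over a whole file is a quick way to confirm that the file holds a single sample.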
