Hi everybody,
As a beginner in RNA-seq analysis, I would really appreciate your help.
I did single-end sequencing of the Arabidopsis thaliana transcriptome on a HiSeq 2000. The read length is 51 bp. The sequencing quality looked quite good when checked with FastQC. When I ran TopHat2, the resulting accepted_hits.bam file was about 38 MB in size, while unmapped.bam was about 280 MB. Although I haven't worked out the exact mapping rate, judging from the sizes of the mapped and unmapped files it seems that the majority of the reads did not map to the genome. When I randomly picked some reads from the unmapped file and blasted them against the Arabidopsis genes (-intron, +UTR), almost all the reads I checked matched an mRNA perfectly. I used the genes.gtf and genome index from the TAIR10 build downloaded from iGenomes. The low mapping rate occurred whether I used script 1 or script 2 below. Does anyone have any idea what the reason could be? Thanks for your suggestions.
script1:
tophat2 -p 8 -i 30 -g 5 --min-coverage-intron 30 --min-segment-intron 30 --b2-sensitive -G genes.gtf -o ./ genome 4_GCCAAT_L001_R1_001.fastq.gz 4_GCCAAT_L001_R1_002.fastq.gz 4_GCCAAT_L001_R1_003.fastq.gz 4_GCCAAT_L001_R1_004.fastq.gz 4_GCCAAT_L001_R1_005.fastq.gz 4_GCCAAT_L001_R1_006.fastq.gz 4_GCCAAT_L001_R1_007.fastq.gz 4_GCCAAT_L001_R1_008.fastq.gz 4_GCCAAT_L001_R1_009.fastq.gz 4_GCCAAT_L001_R1_010.fastq.gz 4_GCCAAT_L001_R1_011.fastq.gz 4_GCCAAT_L001_R1_012.fastq.gz
script2:
tophat2 -p 8 -G genes.gtf -o ./ genome 4_GCCAAT_L001_R1_001.fastq.gz 4_GCCAAT_L001_R1_002.fastq.gz 4_GCCAAT_L001_R1_003.fastq.gz 4_GCCAAT_L001_R1_004.fastq.gz 4_GCCAAT_L001_R1_005.fastq.gz 4_GCCAAT_L001_R1_006.fastq.gz 4_GCCAAT_L001_R1_007.fastq.gz 4_GCCAAT_L001_R1_008.fastq.gz 4_GCCAAT_L001_R1_009.fastq.gz 4_GCCAAT_L001_R1_010.fastq.gz 4_GCCAAT_L001_R1_011.fastq.gz 4_GCCAAT_L001_R1_012.fastq.gz
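File size is only a rough proxy for read count, so the actual mapping rate could be checked by counting reads in the two BAM files. The following is a minimal sketch, assuming samtools is installed and the commands are run in the TopHat2 output directory; the helper name mapping_rate is made up for illustration. It also assumes this TopHat2 version flags extra hits of multi-mapped reads as secondary alignments, which is why -F 256 is used to count each read once.

```shell
#!/bin/sh
# Hypothetical helper: percentage of mapped reads given two read counts.
mapping_rate() {
    mapped=$1
    unmapped=$2
    awk -v m="$mapped" -v u="$unmapped" 'BEGIN {
        total = m + u
        if (total == 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * m / total
    }'
}

# Runs only if samtools and both TopHat2 output files are present.
if command -v samtools >/dev/null 2>&1 \
    && [ -f accepted_hits.bam ] && [ -f unmapped.bam ]; then
    # -F 256 skips secondary alignments (assumption: extra hits of
    # multi-mapped reads carry that flag), so each read is counted once.
    mapped=$(samtools view -c -F 256 accepted_hits.bam)
    unmapped=$(samtools view -c unmapped.bam)
    rate=$(mapping_rate "$mapped" "$unmapped")
    echo "mapped: $mapped  unmapped: $unmapped  rate: ${rate}%"
fi
```

Recent TopHat2 releases should also write an align_summary.txt file to the output directory with the same kind of figure, which may be the quicker check if it is there.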