Hi,
Does anyone see anything wrong with splitting FASTQ files, aligning each chunk with TopHat, and then merging the results afterwards?
The reason I want to split them is to make greater use of the cluster we have available.
I can split the FASTQ files using a Perl script I wrote. The merging seems to work, except that a few reads are missing when I compare the merged output from the split FASTQ files with the output from running the whole file through TopHat.
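For anyone wanting to reproduce the setup, here is a minimal sketch of a chunked split for paired-end FASTQ files. It is not the Perl script mentioned above; the function name `split_paired_fastq`, the chunk file naming, and the `reads_per_chunk` parameter are all made up for illustration, and it assumes plain uncompressed 4-line FASTQ records with mates in the same order in both files.

```python
import itertools
from pathlib import Path


def split_paired_fastq(r1_path, r2_path, out_dir, reads_per_chunk):
    """Split two mate files into numbered chunk pairs.

    Hypothetical helper: mates stay in the same chunk because both files
    are cut at the same record boundaries.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with open(r1_path) as r1, open(r2_path) as r2:
        for index in itertools.count():
            # Assumption: every record is exactly 4 lines (no wrapped sequences).
            block1 = list(itertools.islice(r1, 4 * reads_per_chunk))
            block2 = list(itertools.islice(r2, 4 * reads_per_chunk))
            if len(block1) != len(block2) or len(block1) % 4:
                raise ValueError("mate files are out of sync or truncated")
            if not block1:
                break
            out1 = out_dir / f"chunk{index:04d}_1.fastq"
            out2 = out_dir / f"chunk{index:04d}_2.fastq"
            out1.write_text("".join(block1))
            out2.write_text("".join(block2))
            chunks.append((out1, out2))
    return chunks
```

Each returned pair of chunk files can then be submitted as a separate alignment job. The length check only catches files that differ in record count; it does not verify that read names match between mates.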
For example, the split paired-end TopHat run produces the following samtools flagstat output:
$ samtools flagstat merged_accepted_hits.bam
37716745 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
37716745 + 0 mapped (100.00%:nan%)
37716745 + 0 paired in sequencing
19017603 + 0 read1
18699142 + 0 read2
35853292 + 0 properly paired (95.06%:nan%)
35974826 + 0 with itself and mate mapped
1741919 + 0 singletons (4.62%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
While the paired-end TopHat run on the full FASTQ file produces:
$ samtools flagstat accepted_hits.bam
37739551 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
37739551 + 0 mapped (100.00%:nan%)
37739551 + 0 paired in sequencing
19028732 + 0 read1
18710819 + 0 read2
35896074 + 0 properly paired (95.12%:nan%)
36017796 + 0 with itself and mate mapped
1721755 + 0 singletons (4.56%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
The difference in properly paired reads is only 0.06 percentage points, but some useful information may be missing. I have checked the split files, and the line counts are exactly the same.
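To put numbers on the gap, here is a small sketch that parses the two flagstat reports and subtracts them category by category. The helper names `parse_flagstat` and `diff_flagstat` are made up for illustration, and only a few of the lines from the two reports above are pasted in. Keep in mind that flagstat counts alignment records, so if TopHat reports several alignments for a multi-mapping read, a difference in these counts does not necessarily translate one-to-one into missing reads.

```python
import re

SPLIT_RUN = """\
37716745 + 0 in total (QC-passed reads + QC-failed reads)
19017603 + 0 read1
18699142 + 0 read2
35853292 + 0 properly paired (95.06%:nan%)
1741919 + 0 singletons (4.62%:nan%)
"""

WHOLE_RUN = """\
37739551 + 0 in total (QC-passed reads + QC-failed reads)
19028732 + 0 read1
18710819 + 0 read2
35896074 + 0 properly paired (95.12%:nan%)
1721755 + 0 singletons (4.56%:nan%)
"""


def parse_flagstat(text):
    """Return {category label: QC-passed count} from flagstat text."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not match:
            continue
        # Drop a trailing percentage such as " (95.06%:nan%)" from the label.
        label = re.sub(r" \([^()]*%[^()]*\)$", "", match.group(3))
        counts[label] = int(match.group(1))
    return counts


def diff_flagstat(whole_text, split_text):
    """Whole-file count minus split-and-merged count, per category."""
    whole = parse_flagstat(whole_text)
    split = parse_flagstat(split_text)
    return {label: whole[label] - split[label] for label in whole if label in split}


for label, delta in diff_flagstat(WHOLE_RUN, SPLIT_RUN).items():
    print(f"{delta:+d}\t{label}")
```

A positive value means the whole-file run has more records in that category than the merged split run; a negative value means it has fewer.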
This thread suggests that some "low abundance splice sites" are lost: http://seqanswers.com/forums/showthr...at+fastq+split
Would anyone have any more information about this?
Thanks for the help,
Bobbie.