SEQanswers

09-30-2012, 06:29 PM   #1
Bobbieshaban
Junior Member

Location: Australia

Join Date: Sep 2012
Posts: 1
Split fastq files for tophat analysis

Hi,

Does anyone see anything wrong with splitting FASTQ files, aligning each piece with TopHat, and then merging the results afterwards?

The reason why I want to split them is to be able to make greater use of the cluster we have available.

I can split the FASTQ files with a Perl script I wrote, and merging the results seems to work, except that a few reads are missing when I compare the merged output from the split run against the output from running the whole file through TopHat.

For example, the split paired-end TopHat run gives this samtools flagstat output:

$ samtools flagstat merged_accepted_hits.bam
37716745 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
37716745 + 0 mapped (100.00%:nan%)
37716745 + 0 paired in sequencing
19017603 + 0 read1
18699142 + 0 read2
35853292 + 0 properly paired (95.06%:nan%)
35974826 + 0 with itself and mate mapped
1741919 + 0 singletons (4.62%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

The paired-end TopHat run on the full, unsplit FASTQ files gives:

$ samtools flagstat accepted_hits.bam
37739551 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
37739551 + 0 mapped (100.00%:nan%)
37739551 + 0 paired in sequencing
19028732 + 0 read1
18710819 + 0 read2
35896074 + 0 properly paired (95.12%:nan%)
36017796 + 0 with itself and mate mapped
1721755 + 0 singletons (4.56%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
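
To put the two reports side by side, here is a minimal sketch that only does the arithmetic on the counts above. The function name flagstat_diff is made up for illustration; the numbers are copied from the two flagstat outputs.

```shell
#!/bin/sh
flagstat_diff() {
    # Counts copied from the two flagstat reports
    # (split-and-merged run vs. whole-file run).
    split_total=37716745;  whole_total=37739551
    split_proper=35853292; whole_proper=35896074
    split_single=1741919;  whole_single=1721755

    # Absolute differences: whole-file run minus split-and-merged run.
    echo "total: $((whole_total - split_total))"
    echo "properly paired: $((whole_proper - split_proper))"
    echo "singletons: $((whole_single - split_single))"

    # Properly paired rate of each run, as a percentage of its own total.
    awk -v s="$split_proper" -v st="$split_total" \
        -v w="$whole_proper" -v wt="$whole_total" \
        'BEGIN { printf "properly paired rate: split %.2f%%, whole %.2f%%\n", 100 * s / st, 100 * w / wt }'
}

flagstat_diff
```

With these counts, the split run has 22806 fewer records in total and 42782 fewer properly paired, but 20164 more singletons.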

The difference is only 0.06 percentage points in the properly paired rate (95.06% vs. 95.12%), but the split run may be missing some useful information. I have checked the split files, and the line counts are exactly the same.
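
Matching line counts do not by themselves show that each read 1 chunk lines up with its read 2 chunk, so a stricter check may be worth running. The following is a rough sketch, not the Perl script mentioned above: check_pair_chunk is a hypothetical helper, and it assumes plain 4-line FASTQ records with read names that differ only by an optional trailing /1 or /2.

```shell
#!/bin/sh
# Hypothetical helper: check that two FASTQ chunks hold the same reads
# in the same order. Assumes 4-line records (no wrapped sequence lines).
check_pair_chunk() {
    r1=$1
    r2=$2
    n1=$(awk 'END { print NR }' "$r1")
    n2=$(awk 'END { print NR }' "$r2")

    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $n1 vs $n2"
        return 1
    fi
    if [ $((n1 % 4)) -ne 0 ]; then
        echo "line count $n1 is not a multiple of 4"
        return 1
    fi

    # Walk both files in step and compare the read names on header lines
    # (lines 1, 5, 9, ...), ignoring descriptions and a trailing /1 or /2.
    awk -v r2="$r2" '
        function name(h) { sub(/[ \t].*$/, "", h); sub(/\/[12]$/, "", h); return h }
        {
            if ((getline other < r2) <= 0) {
                print "second file ended early at line " NR
                bad = 1
                exit
            }
            if (NR % 4 == 1 && name($0) != name(other)) {
                print "name mismatch at line " NR ": " $0 " vs " other
                bad = 1
                exit
            }
        }
        END { exit bad + 0 }
    ' "$r1" && echo "chunks are in sync: $r1 $r2"
}
```

It would be called once per chunk pair, for example check_pair_chunk chunk1_R1.fastq chunk1_R2.fastq (placeholder file names), and returns a non-zero status on the first problem it finds.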

This thread suggests that some "low abundance splice sites" are lost: http://seqanswers.com/forums/showthr...at+fastq+split

Would anyone have any more information about this?

Thanks for the help,

Bobbie.
03-12-2013, 06:40 AM   #2
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43
split files

I have split fastq files to run Tophat. From what I understand, this is a fairly common practice. Here is a hypothetical example:

#split read 1 into smaller files of 40,000,000 lines each (a multiple of 4, so no 4-line fastq record is cut in half)
split -l 40000000 wholefile_read1.fastq
#rename resulting files
mv xaa wholefile_read1_1.fastq
mv xab wholefile_read1_2.fastq
.
.
#split read 2 into smaller files after every 40,000,000 lines
split -l 40000000 wholefile_read2.fastq
#rename resulting files
mv xaa wholefile_read2_1.fastq
mv xab wholefile_read2_2.fastq
.
.
#align split files with tophat
tophat -o out_1 -G mm10.gtf mm10 wholefile_read1_1.fastq wholefile_read2_1.fastq
tophat -o out_2 -G mm10.gtf mm10 wholefile_read1_2.fastq wholefile_read2_2.fastq
.
.
#use samtools to put the bam files back together (tophat writes accepted_hits.bam into each output directory)
samtools merge out.bam out_1/accepted_hits.bam out_2/accepted_hits.bam
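
With more than a couple of chunks, the manual renames get tedious, so the same steps can be sketched as a loop. This is only a sketch under assumptions: the read1_chunk_ and read2_chunk_ prefixes, the run helper, and the DRY_RUN switch are made up for illustration, and it assumes TopHat accepts the extension-less names that split produces (if not, rename the chunks to end in .fastq first). On a cluster, each tophat line would be submitted as its own job instead of run in sequence.

```shell
#!/bin/sh
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# split_reads <fastq> <prefix>
# 40,000,000 lines = 10,000,000 four-line records per chunk.
# split names the pieces <prefix>aa, <prefix>ab, ...
split_reads() {
    run split -l 40000000 "$1" "$2"
}

# Align every read 1 chunk with the read 2 chunk that has the same suffix.
align_chunks() {
    for r1 in read1_chunk_*; do
        [ -e "$r1" ] || continue
        suffix=${r1#read1_chunk_}
        run tophat -o "out_$suffix" -G mm10.gtf mm10 "$r1" "read2_chunk_$suffix"
    done
}

# Merge the per-chunk alignments back into one bam file.
merge_chunks() {
    run samtools merge out.bam out_*/accepted_hits.bam
}
```

The intended order is split_reads wholefile_read1.fastq read1_chunk_, then split_reads wholefile_read2.fastq read2_chunk_, then align_chunks, then merge_chunks. Both files must be split with the same line count so that chunks with the same suffix hold the same read pairs.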
03-12-2013, 06:44 AM   #3
dGho
Member
 
Location: Rochester, NY

Join Date: Jan 2013
Posts: 43
I didn't answer your question

I guess I did not exactly answer your question though. I do not know if there is any difference in results when the files are split. I do know that my very experienced co-worker does it all the time. That does not necessarily help.