SEQanswers

02-22-2013, 07:13 AM   #1
JJenks
Junior Member
 
Location: Hampshire UK

Join Date: May 2012
Posts: 6
Merging paired end reads for BLAST

Hi All,

I've just read the various threads about dealing with paired end reads, but none seemed to address my problem.
I've got several metagenomic datasets consisting of paired end reads from Illumina MiSeq technology, which we are planning on BLASTing. Reads are 100 bp in length and are from a 300-400 bp fraction, so will not overlap. I'd like to know if there is a way in which I can combine each pair into a single file, which can be BLASTed to increase the accuracy of the BLAST.
Also, would I be correct in saying that I need to reverse complement the R2 read before combining?

Sorry if this is a little vague, I can provide more information if required.

Thanks
Joe
02-22-2013, 07:16 AM   #2
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

Yes. Reverse complement R2 using the FASTX-Toolkit.
Then you can upload your files to Galaxy: convert from FASTQ to tabular format, and use the cut/merge column functions under Text Manipulation to join reads end-to-end.
02-22-2013, 02:34 PM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912

Yes, rev-comp read 2.

Just write a script to do it. It would be pretty straightforward.

Next-gen sequencing is kind of hard to do without a little unix and scripting ability.
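
For anyone who would rather script this than go through Galaxy, here is a minimal Python sketch of the rev-comp-and-join step described in the two replies above. It is not a script anyone in this thread posted; the function names (`reverse_complement`, `read_fastq`, `join_pairs`) are made up for illustration. It assumes plain four-line FASTQ records, R1 and R2 files listed in the same order, and read IDs that are identical up to the first space.

```python
# Sketch only: join each R1 read to the reverse complement of its R2 mate.
# Assumes 4-line FASTQ records and identically ordered R1/R2 input.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(lines):
    """Yield (read_id, sequence) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line
        next(it)  # quality line (not needed for FASTA output)
        # Keep only the ID: drop the leading '@' and anything after a space.
        yield header.strip()[1:].split()[0], seq


def join_pairs(r1_lines, r2_lines):
    """Yield (read_id, R1 + revcomp(R2)) for each read pair."""
    for (id1, seq1), (id2, seq2) in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        if id1 != id2:
            raise ValueError(f"read IDs out of sync: {id1} vs {id2}")
        yield id1, seq1 + reverse_complement(seq2)


if __name__ == "__main__":
    # Tiny in-memory demo; real input would be two open FASTQ file handles.
    r1 = ["@read1 1:N:0:1\n", "AACC\n", "+\n", "IIII\n"]
    r2 = ["@read1 2:N:0:1\n", "AAGT\n", "+\n", "IIII\n"]
    for read_id, joined in join_pairs(r1, r2):
        print(f">{read_id}\n{joined}")
```

Passing two open file handles instead of the demo lists would process real files, and the output is FASTA, which is what BLAST expects. Older Illumina read names that end in `/1` and `/2` would fail the ID check as written and would need that suffix stripped first.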
02-22-2013, 04:33 PM   #4
ucpete
Member
 
Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35

I'm not sure what you're aiming for really -- I'm not aware of BLAST having any special way of using paired-end information to make alignment more accurate, especially because your reads don't overlap. If what you're trying to do is resolve any discordant read pair alignments using BLAST as your aligner, you definitely do NOT need to take the reverse complement first and you definitely do NOT want to merge the R1 and R2 data before aligning -- each read in the pair has the exact same title (the only difference in identification being the file from which they derive) so you won't be able to deconvolve the results afterwards! What I do in these situations is run two BLASTs: one for read 1, another for read 2, then I parse the results to output the concordant information. For example, R1 aligns to organism A, B, and C equally well, and R2 aligns well to organism B. That pair would be called as deriving from organism B. Often I will consider the E-value when making these calls as well, whether that means using only the top-scoring hit per query or using the score to break ties. Also, taking the reverse complement is nonsensical if this is your situation, as BLAST already searches both strands. Hope this helps!
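
The "run two BLASTs, then parse for concordant hits" idea can be sketched roughly as below. This is not ucpete's actual parser, just one possible reading of it. It assumes both result files are in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12), and it ranks shared subjects by summed bit score, which is an assumption of this sketch rather than something stated in the post. The names `load_hits` and `concordant_calls` are hypothetical.

```python
# Sketch only: call each read pair from the subjects hit by BOTH mates.
# Assumes standard 12-column BLAST tabular output for each read file.

from collections import defaultdict


def load_hits(lines):
    """Map query ID -> {subject ID: best bit score} from tabular BLAST lines."""
    hits = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) < 12:
            continue  # skip comment lines and malformed rows
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if bitscore > hits[query].get(subject, float("-inf")):
            hits[query][subject] = bitscore
    return hits


def concordant_calls(r1_hits, r2_hits):
    """Return {query ID: subject ID} for pairs whose mates share a subject.

    Among shared subjects, the one with the highest summed bit score wins;
    exact ties are broken by subject name so the result is deterministic.
    """
    calls = {}
    for query in r1_hits.keys() & r2_hits.keys():
        shared = r1_hits[query].keys() & r2_hits[query].keys()
        if shared:
            calls[query] = max(
                shared,
                key=lambda s: (r1_hits[query][s] + r2_hits[query][s], s),
            )
    return calls


if __name__ == "__main__":
    # Toy data mirroring the example in the post: R1 hits A, B and C equally,
    # R2 hits only B, so the pair is called as B.
    r1 = [
        "pair1 orgA 90.0 100 10 0 1 100 1 100 1e-20 100.0",
        "pair1 orgB 90.0 100 10 0 1 100 1 100 1e-20 100.0",
        "pair1 orgC 90.0 100 10 0 1 100 1 100 1e-20 100.0",
    ]
    r2 = ["pair1 orgB 95.0 100 5 0 1 100 300 400 1e-25 120.0"]
    print(concordant_calls(load_hits(r1), load_hits(r2)))
```

Pairs whose mates share no subject are left out of the result, so they can be inspected separately as discordant.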

Last edited by ucpete; 02-22-2013 at 04:34 PM. Reason: typo
02-25-2013, 03:35 AM   #5
JJenks
Junior Member
 
Location: Hampshire UK

Join Date: May 2012
Posts: 6

Thanks for the advice! We wanted to combine the pairs for two reasons: first, to increase the amount of sequence available for the BLAST search, and second, to reduce our dataset size. Surely a BLAST of combined datasets would decrease the likelihood of returning multiple alignments due to there being 2x the amount of sequence?
02-25-2013, 08:24 AM   #6
ucpete
Member
 
Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35

BLASTing each read file separately or BLASTing them in the same file will search the same amount of sequence. There will be no reduction in search space unless reads 1 and 2 are overlapping, and you merged them first by assembly. But you said there is no overlap. BLAST returns all valid alignments with E-values less than your threshold so you will get the same number of alignments whether you have all the reads in one file or you BLAST both reads separately. But if you merge them without modifying the FASTA/Q title, you will have the problem of not being able to distinguish which read is which as the read titles are exactly the same for each read in the pair. Why not run two BLASTs?! It takes the same amount of time, it produces the same output, but you will actually know which read is which!
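
If both reads do end up in a single query file, the title clash described above can be avoided by tagging the read IDs first. Below is a minimal sketch, assuming FASTA input; the function name `tag_fasta_headers` and the `/1` and `/2` suffixes are illustrative choices, not something BLAST requires.

```python
# Sketch only: make R1 and R2 read IDs distinguishable in one query file.
# Assumes FASTA input where the read ID is the first word of each header.


def tag_fasta_headers(lines, suffix):
    """Yield FASTA lines with `suffix` appended to the ID on each header line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header into the ID and the optional description.
            read_id, _, description = line[1:].partition(" ")
            line = ">" + read_id + suffix
            if description:
                line += " " + description
        yield line


if __name__ == "__main__":
    r1 = [">read1 1:N:0:1\n", "ACGT\n"]
    r2 = [">read1 2:N:0:1\n", "TTGA\n"]
    # Tag R1 with /1 and R2 with /2, then write them out one after the other.
    for out_line in tag_fasta_headers(r1, "/1"):
        print(out_line)
    for out_line in tag_fasta_headers(r2, "/2"):
        print(out_line)
```

With the suffix in place, the first column of the tabular output shows which mate each alignment came from. Running two separate BLASTs, as suggested above, gives the same information without touching the headers.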
02-25-2013, 08:31 AM   #7
JJenks
Junior Member
 
Location: Hampshire UK

Join Date: May 2012
Posts: 6

Ok, that makes more sense. Thanks!
05-23-2013, 08:23 AM   #8
felvis56
Member
 
Location: Glasgow

Join Date: Dec 2012
Posts: 11

I am looking to combine R1 and R2 for a BLAST search, as we sequenced amplicons and I want to BLAST the full 500 bp rather than run 2 x 250 bp searches. Can I combine the files for this?
05-23-2013, 08:43 AM   #9
JJenks
Junior Member
 
Location: Hampshire UK

Join Date: May 2012
Posts: 6

For amplicon data, try using PANDAseq. It merges read pairs that overlap, and can also be used to remove primers/barcodes.
https://github.com/neufeld/pandaseq/...Aseq-Assembler
11-05-2018, 10:40 AM   #10
indugun
Junior Member
 
Location: India

Join Date: Apr 2013
Posts: 4
Parsing the BLAST results

Hello,
I have BLASTed reads R1 and R2 separately. For one of the reads, the results are shown below. Could you please suggest how to parse R1 and R2 to select the appropriate protein?

For R1:
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2YXR6|GLPK_STAAB 65.1 43 15 0 136 8 438 480 1.4e-10 65.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6GHD5|GLPK_STAAR 65.1 43 15 0 136 8 438 480 1.4e-10 65.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K643|GLPK_THEPX 67.4 43 14 0 136 8 437 479 2.3e-10 65.1
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6U1B8|GLPK_STAA2 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P63741|GLPK_STAAM 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8NWX7|GLPK_STAAW 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7X1U3|GLPK_STAA1 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FHD9|GLPK_STAA3 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5ISI2|GLPK_STAA9 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P99113|GLPK_STAAN 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6G9R3|GLPK_STAAS 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A8Z1X0|GLPK_STAAT 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FYZ5|GLPK_STAA8 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6QGJ8|GLPK_STAAE 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HGD2|GLPK_STAAC 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K754|GLPK_THEP3 65.1 43 15 0 136 8 437 479 5.1e-10 63.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q9KDW8|GLPK_BACHD 62.8 43 16 0 136 8 439 481 6.7e-10 63.5
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8R8J4|GLPK_CALS4 60.5 43 17 0 136 8 437 479 1.5e-09 62.4
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8CSS0|GLPK_STAES 65.1 43 15 0 136 8 438 480 2.6e-09 61.6
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HPP1|GLPK_STAEQ 65.1 43 15 0 136 8 438 480 2.6e-09 61.6
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C4ZGB4|GLPK_AGARV 56.8 44 19 0 136 5 437 480 4.4e-09 60.8
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C6C1M7|GLPK_DESAD 60.5 43 17 0 136 8 437 479 9.7e-09 59.7
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B2FI02|GLPK_STRMK 58.1 43 18 0 136 8 439 481 1.3e-08 59.3
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B4SJT3|GLPK_STRM5 58.1 43 18 0 136 8 439 481 1.7e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B2I618|GLPK_XYLF2 53.7 41 19 0 136 14 439 479 2.2e-08 58.5

For R2:
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K643|GLPK_THEPX 77.1 35 8 0 13 117 428 462 5.8e-10 63.5
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K754|GLPK_THEP3 74.3 35 9 0 13 117 428 462 1.3e-09 62.4
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C4ZGB4|GLPK_AGARV 71.4 35 10 0 13 117 428 462 1.7e-09 62.0
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C6C1M7|GLPK_DESAD 71.4 35 10 0 13 117 428 462 2.9e-09 61.2
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8R8J4|GLPK_CALS4 71.4 35 10 0 13 117 428 462 2.9e-09 61.2
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2YXR6|GLPK_STAAB 71.4 35 10 0 13 117 429 463 4.9e-09 60.5
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6GHD5|GLPK_STAAR 71.4 35 10 0 13 117 429 463 4.9e-09 60.5
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6U1B8|GLPK_STAA2 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P63741|GLPK_STAAM 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8NWX7|GLPK_STAAW 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7X1U3|GLPK_STAA1 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FHD9|GLPK_STAA3 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5ISI2|GLPK_STAA9 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P99113|GLPK_STAAN 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6G9R3|GLPK_STAAS 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A8Z1X0|GLPK_STAAT 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q9KDW8|GLPK_BACHD 68.6 35 11 0 13 117 430 464 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FYZ5|GLPK_STAA8 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6QGJ8|GLPK_STAAE 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A4J8E6|GLPK_DESRM 68.6 35 11 0 13 117 431 465 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HGD2|GLPK_STAAC 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B8FXS7|GLPK_DESHD 71.9 32 9 0 10 105 428 459 1.9e-08 58.5
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7FX30|GLPK_CLOB1 63.9 36 13 0 10 117 427 462 2.4e-08 58.2
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5I5M0|GLPK_CLOBH 63.9 36 13 0 10 117 427 462 2.4e-08 58.2
NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B1IKJ7|GLPK_CLOBK 63.9 36 13 0 10 117 427 462 2.4e-08 58.2




Quote:
Originally Posted by ucpete
I'm not sure what you're aiming for really -- I'm not aware of BLAST having any special way of using paired-end information to make alignment more accurate, especially because your reads don't overlap. If what you're trying to do is resolve any discordant read pair alignments using BLAST as your aligner, you definitely do NOT need to take the reverse complement first and you definitely do NOT want to merge the R1 and R2 data before aligning -- each read in the pair has the exact same title (the only difference in identification being the file from which they derive) so you won't be able to deconvolve the results afterwards! What I do in these situations is run two BLASTs: one for read 1, another for read 2, then I parse the results to output the concordant information. For example, R1 aligns to organism A, B, and C equally well, and R2 aligns well to organism B. That pair would be called as deriving from organism B. Often I will consider the E-value when making these calls as well, whether that means using only the top-scoring hit per query or using the score to break ties. Also, taking the reverse complement is nonsensical if this is your situation, as BLAST already searches both strands. Hope this helps!

Last edited by indugun; 11-05-2018 at 10:45 AM.
Tags
blast, metagenomic, paired-end reads
