  • Merging paired end reads for BLAST

    Hi All,

    I've just read the various threads about dealing with paired end reads, but none seemed to address my problem.
    I've got several metagenomic datasets consisting of paired end reads from Illumina MiSeq technology, which we are planning on BLASTing. Reads are 100bp in length and are from a 300-400 bp fraction, so will not overlap. I'd like to know if there is a way in which I can combine each pair into a single file, which can be BLASTed to increase the accuracy of the BLAST.
    Also, would I be correct in saying that I need to reverse complement the R2 read before combining?

    Sorry if this is a little vague, I can provide more information if required.

    Thanks
    Joe

  • #2
    Yes. Reverse complement R2 using Fastx tool kit.
    Then you can upload your files to Galaxy: convert from FASTQ to Tabular format, and use the cut/merge column functions under text manipulation to join the reads end-to-end.
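For anyone who would rather script this step than use Galaxy, here is a minimal Python sketch of the same idea: reverse complement R2 and append it to R1. It assumes plain four-line FASTQ records with both files in the same order; the function names and the optional `spacer` argument are made up for illustration and do not come from FASTX or Galaxy.

```python
# Sketch only: assumes 4-line FASTQ records and identically ordered R1/R2 files.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(lines):
    """Yield (title, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def join_pairs(r1_lines, r2_lines, spacer=""):
    """Yield (read_id, R1 + spacer + revcomp(R2)) for each read pair.

    `spacer` is a hypothetical option (e.g. a run of N) to mark the
    unsequenced gap between non-overlapping mates.
    """
    for (t1, s1, _), (t2, s2, _) in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        id1, id2 = t1.split()[0], t2.split()[0]
        if id1 != id2:
            raise ValueError(f"mate titles differ: {id1} vs {id2}")
        yield id1, s1 + spacer + reverse_complement(s2)
```

To use it on real files, pass two open file handles and write each result out as a FASTA record (`>read_id` followed by the joined sequence).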


    • #3
      Yes, rev-comp read 2.

      Just write a script to do it. It would be pretty straightforward.

      Next-gen sequencing is kind of hard to do without a little unix and scripting ability.


      • #4
        I'm not really sure what you're aiming for. I'm not aware of BLAST having any special way of using paired-end information to make alignment more accurate, especially because your reads don't overlap. If what you're trying to do is resolve discordant read-pair alignments using BLAST as your aligner, you definitely do NOT need to take the reverse complement first, and you definitely do NOT want to merge the R1 and R2 data before aligning: each read in the pair has exactly the same title (the only difference in identification being the file from which they derive), so you won't be able to deconvolve the results afterwards!

        What I do in these situations is run two BLASTs, one for read 1 and another for read 2, then parse the results to output the concordant information. For example, R1 aligns to organisms A, B, and C equally well, and R2 aligns well to organism B. That pair would be called as deriving from organism B. Oftentimes I will consider the E-value when making these calls as well, whether by using only the top-scoring hit per query or by using the score to break ties.

        Also, taking the reverse complement is nonsensical if this is your situation, as BLAST already searches both strands. Hope this helps!
        Last edited by ucpete; 02-22-2013, 04:34 PM. Reason: typo
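The "two BLASTs, then keep what agrees" idea can be sketched in a few lines of Python. This is only one way to read the description above, not ucpete's actual script. It assumes 12-column tabular BLAST output with the bit score in the last column, and it treats "aligns equally well" as "ties for the highest bit score"; the function names are hypothetical.

```python
from collections import defaultdict


def best_hits(rows):
    """Map each query to the set of subjects tied for its highest bit score.

    `rows` is an iterable of tabular BLAST lines; the 12-column layout
    (query, subject, ..., evalue, bitscore) is an assumption.
    """
    top = {}
    hits = defaultdict(set)
    for line in rows:
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, bits = fields[0], fields[1], float(fields[11])
        if query not in top or bits > top[query]:
            top[query] = bits
            hits[query] = {subject}
        elif bits == top[query]:
            hits[query].add(subject)
    return dict(hits)


def concordant_calls(r1_rows, r2_rows):
    """Return {query: subjects that are a best hit for both R1 and R2}."""
    h1, h2 = best_hits(r1_rows), best_hits(r2_rows)
    calls = {}
    for query in h1.keys() & h2.keys():
        shared = h1[query] & h2[query]
        if shared:
            calls[query] = shared
    return calls
```

With R1 hitting A, B, and C at the same score and R2 hitting only B, the pair comes back as `{"B"}`. Pairs with no shared best hit are simply left out, so a real pipeline would need to decide what to do with them.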


        • #5
          Thanks for the advice! We wanted to combine the pairs for two reasons: first, to increase the amount of sequence available for the BLAST search, and second, to reduce our dataset size. Surely BLASTing the combined pairs would decrease the likelihood of returning multiple alignments, since each query would have 2x the amount of sequence?


          • #6
            BLASTing each read file separately or BLASTing them in the same file will search the same amount of sequence. There will be no reduction in search space unless reads 1 and 2 are overlapping, and you merged them first by assembly. But you said there is no overlap. BLAST returns all valid alignments with E-values less than your threshold so you will get the same number of alignments whether you have all the reads in one file or you BLAST both reads separately. But if you merge them without modifying the FASTA/Q title, you will have the problem of not being able to distinguish which read is which as the read titles are exactly the same for each read in the pair. Why not run two BLASTs?! It takes the same amount of time, it produces the same output, but you will actually know which read is which!
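If you do want both mates in one query file, the title clash mentioned above can be avoided by tagging each title first. A minimal sketch, assuming FASTA input and using a `/1` or `/2` suffix (the function name and the suffix convention are choices made here for illustration):

```python
def tag_titles(fasta_lines, mate):
    """Yield FASTA lines with '/<mate>' appended to the read ID on title lines.

    Only the first whitespace-delimited token of the title is changed,
    so any description after the ID is preserved.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            read_id, _, rest = line[1:].partition(" ")
            line = f">{read_id}/{mate}" + (f" {rest}" if rest else "")
        yield line
```

Run it once over the R1 file with `mate=1` and once over the R2 file with `mate=2`; the two outputs can then be concatenated and every BLAST hit still says which read it came from.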


            • #7
              Ok, that makes more sense. Thanks!


              • #8
                I am looking to combine R1 and R2 for a BLAST search, as we sequenced amplicons and I want to BLAST the full 500 bp rather than run 2 x 250 bp searches. Can I combine the files for this?


                • #9
                  For amplicon data, try using PANDAseq (PAired-eND Assembler for DNA sequences). It should merge reads that have areas of overlap, and it can also be used to remove primers/barcodes.
                  The source is on GitHub under neufeld/pandaseq.
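To show what "merging reads which overlap" means, here is a toy Python sketch. It is NOT how PANDAseq works internally: it only accepts exact matches and ignores base qualities, which a real merger would take into account. The function names and the `min_overlap` default are assumptions for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def merge_pair(r1, r2, min_overlap=10):
    """Merge R1 with the reverse complement of R2 on an exact overlap.

    Tries the longest candidate overlap first and returns the merged
    sequence, or None if no exact overlap of at least `min_overlap`
    bases exists.
    """
    r1 = r1.upper()
    r2_rc = reverse_complement(r2)
    for k in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if r1[-k:] == r2_rc[:k]:
            return r1 + r2_rc[k:]
    return None
```

If the two reads do not overlap at all (as in the first question in this thread), this returns `None`, which is the point made earlier: there is nothing to assemble in that case.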


                  • #10
                    Parsing the blast results

                    Hello,
                    I have BLASTed reads R1 and R2 separately. The results for one read pair are below. Could you please suggest how to parse R1 and R2 to select the appropriate protein?

                    For R1
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2YXR6|GLPK_STAAB 65.1 43 15 0 136 8 438 480 1.4e-10 65.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6GHD5|GLPK_STAAR 65.1 43 15 0 136 8 438 480 1.4e-10 65.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K643|GLPK_THEPX 67.4 43 14 0 136 8 437 479 2.3e-10 65.1
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6U1B8|GLPK_STAA2 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P63741|GLPK_STAAM 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8NWX7|GLPK_STAAW 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7X1U3|GLPK_STAA1 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FHD9|GLPK_STAA3 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5ISI2|GLPK_STAA9 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P99113|GLPK_STAAN 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6G9R3|GLPK_STAAS 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A8Z1X0|GLPK_STAAT 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FYZ5|GLPK_STAA8 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6QGJ8|GLPK_STAAE 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HGD2|GLPK_STAAC 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K754|GLPK_THEP3 65.1 43 15 0 136 8 437 479 5.1e-10 63.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q9KDW8|GLPK_BACHD 62.8 43 16 0 136 8 439 481 6.7e-10 63.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8R8J4|GLPK_CALS4 60.5 43 17 0 136 8 437 479 1.5e-09 62.4
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8CSS0|GLPK_STAES 65.1 43 15 0 136 8 438 480 2.6e-09 61.6
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HPP1|GLPK_STAEQ 65.1 43 15 0 136 8 438 480 2.6e-09 61.6
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C4ZGB4|GLPK_AGARV 56.8 44 19 0 136 5 437 480 4.4e-09 60.8
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C6C1M7|GLPK_DESAD 60.5 43 17 0 136 8 437 479 9.7e-09 59.7
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B2FI02|GLPK_STRMK 58.1 43 18 0 136 8 439 481 1.3e-08 59.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B4SJT3|GLPK_STRM5 58.1 43 18 0 136 8 439 481 1.7e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B2I618|GLPK_XYLF2 53.7 41 19 0 136 14 439 479 2.2e-08 58.5

                    For R2:
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K643|GLPK_THEPX 77.1 35 8 0 13 117 428 462 5.8e-10 63.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K754|GLPK_THEP3 74.3 35 9 0 13 117 428 462 1.3e-09 62.4
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C4ZGB4|GLPK_AGARV 71.4 35 10 0 13 117 428 462 1.7e-09 62.0
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C6C1M7|GLPK_DESAD 71.4 35 10 0 13 117 428 462 2.9e-09 61.2
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8R8J4|GLPK_CALS4 71.4 35 10 0 13 117 428 462 2.9e-09 61.2
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2YXR6|GLPK_STAAB 71.4 35 10 0 13 117 429 463 4.9e-09 60.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6GHD5|GLPK_STAAR 71.4 35 10 0 13 117 429 463 4.9e-09 60.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6U1B8|GLPK_STAA2 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P63741|GLPK_STAAM 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8NWX7|GLPK_STAAW 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7X1U3|GLPK_STAA1 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FHD9|GLPK_STAA3 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5ISI2|GLPK_STAA9 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P99113|GLPK_STAAN 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6G9R3|GLPK_STAAS 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A8Z1X0|GLPK_STAAT 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q9KDW8|GLPK_BACHD 68.6 35 11 0 13 117 430 464 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FYZ5|GLPK_STAA8 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6QGJ8|GLPK_STAAE 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A4J8E6|GLPK_DESRM 68.6 35 11 0 13 117 431 465 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HGD2|GLPK_STAAC 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B8FXS7|GLPK_DESHD 71.9 32 9 0 10 105 428 459 1.9e-08 58.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7FX30|GLPK_CLOB1 63.9 36 13 0 10 117 427 462 2.4e-08 58.2
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5I5M0|GLPK_CLOBH 63.9 36 13 0 10 117 427 462 2.4e-08 58.2
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B1IKJ7|GLPK_CLOBK 63.9 36 13 0 10 117 427 462 2.4e-08 58.2




                    Last edited by indugun; 11-05-2018, 10:45 AM.
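One possible way to pick a protein from results like these is to keep only subjects hit by both reads and rank them by the sum of their bit scores. This is a sketch under stated assumptions, not an established rule: it expects the 12-column layout shown above with the bit score last, it handles one read pair at a time, and the choice of summed bit score as the criterion is itself an assumption (E-value or percent identity could be used instead).

```python
def bits_by_subject(rows):
    """Return {subject: best bit score} for one read's tabular BLAST hits."""
    best = {}
    for line in rows:
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        subject, bits = fields[1], float(fields[11])
        best[subject] = max(bits, best.get(subject, 0.0))
    return best


def rank_shared_subjects(r1_rows, r2_rows):
    """Rank subjects hit by both reads by summed bit score, highest first.

    Both inputs must contain hits for the same single read pair.
    """
    b1, b2 = bits_by_subject(r1_rows), bits_by_subject(r2_rows)
    shared = b1.keys() & b2.keys()
    ranked = [(subject, b1[subject] + b2[subject]) for subject in shared]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

On the rows posted above, GLPK_THEPX would come out on top (65.1 + 63.5 = 128.6), ahead of GLPK_STAAB and GLPK_STAAR (65.9 + 60.5 = 126.4), while subjects hit by only one read, such as GLPK_XYLF2 or GLPK_DESRM, are dropped.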
