  • Merging non-overlapping paired end reads

    I'm looking to assemble some paired-end reads, but I'm having some problems getting the ends back together before assembly. I have 105 bp reads, from either end of 300 bp fragments - so there's 90 bp of 'space' in between, with no overlap. I've seen lots of programs dealing with overlapping reads, but is there anything out there that will account for that 'gap'?

    Thanks!

  • #2
    I don't understand what use those reads will be to you once you merge them. The positional information, i.e. the 90 bp in between, is also critical to the assembly. Why do you want to merge them?



    • #3
      That's the idea - I'd like to preserve the positional information, but from what I understand, most assembly programs require a single input file, hence the merging. So I'd like to put the ends back together, with that 90 bp unsequenced portion in the middle.



      • #4
        This is a Perl script from velvet's contrib folder. It will do what you need, as long as the reads in the two separate files are paired (same order, same number of reads).

        shuffleSequences_fastq.pl

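        #!/usr/bin/perl
        # Interleave two paired FASTQ files (four lines per record) into one file.
        use strict;
        use warnings;

        my ($filenameA, $filenameB, $filenameOut) = @ARGV;
        die "Usage: $0 reads_1.fq reads_2.fq interleaved.fq\n" unless defined $filenameOut;

        open(my $FILEA, '<', $filenameA) or die "Cannot open $filenameA: $!\n";
        open(my $FILEB, '<', $filenameB) or die "Cannot open $filenameB: $!\n";
        open(my $OUTFILE, '>', $filenameOut) or die "Cannot open $filenameOut: $!\n";

        while (my $line = <$FILEA>) {
            # Copy one four-line record from file A ...
            print $OUTFILE $line;
            for (1 .. 3) {
                $line = <$FILEA>;
                print $OUTFILE $line;
            }
            # ... followed by its mate from file B.
            for (1 .. 4) {
                $line = <$FILEB>;
                print $OUTFILE $line;
            }
        }

        close($OUTFILE) or die "Cannot close $filenameOut: $!\n";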
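
The script above only works if both files list the mates in the same order. As a rough illustration of that requirement, here is a minimal Python sketch (not part of velvet) that interleaves two FASTQ inputs and also checks that the mate names agree. The helper names `read_fastq`, `base_name` and `interleave` are hypothetical, and the sketch assumes read names that either match exactly or differ only by a trailing `/1` or `/2`.

```python
from itertools import islice, zip_longest


def read_fastq(lines):
    """Yield four-line FASTQ records as tuples of stripped lines."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def base_name(header):
    """Read name without a /1 or /2 mate suffix or a trailing comment."""
    name = header[1:].split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def interleave(lines_a, lines_b):
    """Yield records alternately from A and B, checking that mates match."""
    for rec_a, rec_b in zip_longest(read_fastq(lines_a), read_fastq(lines_b)):
        if rec_a is None or rec_b is None:
            raise ValueError("inputs contain different numbers of reads")
        if base_name(rec_a[0]) != base_name(rec_b[0]):
            raise ValueError(f"mate names differ: {rec_a[0]} vs {rec_b[0]}")
        yield rec_a
        yield rec_b


# Toy input: one read pair, given as lists of lines instead of open files.
r1 = ["@read1/1\n", "ACGT\n", "+\n", "IIII\n"]
r2 = ["@read1/2\n", "TTGA\n", "+\n", "IIII\n"]
out = list(interleave(r1, r2))
for record in out:
    print("\n".join(record))
```

With real data you would pass two open file handles instead of the lists. Note that interleaving only changes the file layout; it does not insert any gap between the mates.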



        • #5
          That puts the two files together in that it orders the paired ends together - say, end 1, then its pair, end 2, then its pair, end 3, then its pair, etc. But wouldn't that still lose the positional information?



          • #6
            Originally posted by karenr
            I'm looking to assemble some paired-end reads, but I'm having some problems getting the ends back together before assembly. I have 105 bp reads, from either end of 300 bp fragments - so there's 90 bp of 'space' in between, with no overlap. I've seen lots of programs dealing with overlapping reads, but is there anything out there that will account for that 'gap'?

            Thanks!

            karenr,

            One would never do something like this with paired data. The first and most important reason is that your library IS NOT made up of 300 bp fragments. It is made up of a population of fragments with an AVERAGE size of 300 bp. The distribution of sizes may be narrow or wide depending on the particulars of the library preparation protocol. You cannot know a priori the size of any individual fragment that produced a pair of reads, hence you would have no way to determine how large a gap to insert between them.

            All short read mappers or de novo assemblers understand that the distance between read pairs will fall within a distribution. Some programs expect you to provide an insert size average and insert size standard deviation as command line parameters when you launch the program, and some will determine the distribution empirically from a sample of your data. You will need to read the documentation of the software you plan to use.

            The other matter is whether the software expects paired reads to be supplied as two separate files or as a single file. Again, this is program-specific and you need to read the documentation. [The script posted by ugolino above is actually the shuffleSequences_fastq.pl script from the velvet package, which is intended to create a single, interleaved read file for input to velvet from two separate input files.]
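
To make the point about the size distribution concrete, here is a small Python sketch. The fragment lengths are made-up illustrative values, not real data; only the 105 bp read length comes from the question. It shows that a library averaging 300 bp gives a different gap for every pair, and it computes the mean and standard deviation that some programs ask for on the command line.

```python
from statistics import mean, stdev

READ_LEN = 105  # read length from the original question

# Hypothetical fragment lengths for a library with a nominal 300 bp insert.
fragment_lengths = [262, 281, 295, 300, 304, 318, 340]


def gap_between_mates(fragment_len, read_len=READ_LEN):
    """Unsequenced bases between the mates; negative means the mates overlap."""
    return fragment_len - 2 * read_len


gaps = [gap_between_mates(n) for n in fragment_lengths]
insert_mean = mean(fragment_lengths)
insert_sd = stdev(fragment_lengths)

print("gaps per pair:", gaps)
print(f"insert size mean = {insert_mean}, sd = {insert_sd:.1f}")
```

Only the fragment that happens to be exactly 300 bp long has a 90 bp gap; a fixed 90 bp spacer would be wrong for every other pair.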



            • #7
              Thanks for the info.



              • #8
                I know this thread has been inactive for a long time, but I am in a situation where I need to "merge" non-overlapping paired-end reads and preserve the positional information. Basically, I have a genus-specific bacterial PCR of a region where I should be able to sort out the species present from the sequence information. Unfortunately, the amplicon is 630 bp, and I'm using Illumina MiSeq 250 bp PE for sequencing. I'd like to take the forward read, insert 180 Ns, and then append the reverse read prior to aligning to my database... Unfortunately, I haven't found any good ways to do this, and I am a bit limited in my programming skills. Does anyone have any suggestions on how I could do this?

                Thanks



                • #9
                  You can do that with the BBMap package like this:

                  Code:
                  fuse.sh in1=r1.fq in2=r2.fq pad=130 out=fused.fq fusepairs
                  It will automatically reverse-complement read 2. Given that you stated 2x250 and 630bp insert, I'm assuming that the pad amount should be (630-2*250)=130bp, even though you mentioned "180 Ns", so adjust that as necessary.

                  Even though I wrote a tool for this specific purpose, it seems like kind of an odd use-case. What will you do with the merged reads?
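
For readers who want to see what fusing a pair amounts to, here is a conceptual Python sketch. It is not BBMap's implementation, and `fuse_pair` is a hypothetical helper. It joins read 1, a run of Ns, and the reverse complement of read 2. The quality character used for the padding (`!`, i.e. Phred 0 in Phred+33 encoding) is an assumption made for the example.

```python
# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def fuse_pair(seq1, qual1, seq2, qual2, pad, pad_qual="!"):
    """Return (sequence, quality) for read 1 + pad Ns + reverse-complemented read 2."""
    if pad < 0:
        raise ValueError("pad must be non-negative; a negative pad means the mates overlap")
    fused_seq = seq1 + "N" * pad + reverse_complement(seq2)
    # Read 2 is reversed, so its quality string must be reversed as well.
    fused_qual = qual1 + pad_qual * pad + qual2[::-1]
    return fused_seq, fused_qual


# Toy example with 5 bp reads and a 3 bp pad.
seq, qual = fuse_pair("ACGTT", "IIIII", "GGCAT", "IIIHG", pad=3)
print(seq)
print(qual)
```

The fused read is on the same strand as read 1 throughout, which is why read 2 has to be reverse-complemented rather than simply appended.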



                  • #10
                    Thanks! There are barcodes and primers I have to trim before considering my insert size, and those take ~50 base pairs that I discard.

                    I'm trying to sort out the percentage of different species using genus-specific primers. The forward read lets me sort out a chunk of the different species, but makes a tangled mess of a different chunk. The reverse read is able to sort out the species that are a tangled mess in the forward read, but on its own can't separate everything either. I am hoping that by joining the two reads, I'll be able to sort out the full set of species I'm interested in... But I was worried that just concatenating the two reads would give very odd results when comparing to my reference database, due to the large size of the gap and the high penalty the aligner I'm using gives to long gaps.
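
The two pad values mentioned in this thread (130 and 180) can be reconciled with a little arithmetic, sketched below in Python. This rests on an assumption: the 630 bp figure is taken as the span the fused read should cover, and the ~50 bp of barcodes and primers is taken as the total removed across both reads. `pad_length` is a hypothetical helper, so check the numbers against your own trimming before using them.

```python
def pad_length(amplicon_len, read_len, trimmed_total=0):
    """Number of Ns needed between the mates so the fused read spans the amplicon.

    trimmed_total is the total number of bases removed from both reads
    (barcodes, primers) before fusing.
    """
    covered = 2 * read_len - trimmed_total
    pad = amplicon_len - covered
    if pad < 0:
        raise ValueError("reads overlap; use an overlap merger instead of padding")
    return pad


print(pad_length(630, 250))                    # untrimmed 2x250 reads
print(pad_length(630, 250, trimmed_total=50))  # after removing ~50 bp in total
```

Under these assumptions the first call gives the 130 bp suggested above and the second gives the 180 Ns from the earlier post.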



                    • #11
                      Of course, there are always other aligners that don't give large penalties to large gaps. But yes, that sounds like a potentially good solution. Simple concatenation without reverse-complementation would be a very bad idea, but as long as your aligner does not penalize Ns, your approach sounds fine. I'd be interested in hearing your results.
