  • #16
    I'm not sure whether this will work for you, but if you have left and right FASTA files and want to separate the paired-end reads from the single-end reads, you may find this script useful: https://github.com/lexnederbragt/den...leave_pairs.py
    It needs Biopython installed: http://biopython.org/wiki/Biopython

    Comment


    • #17
      That would probably have done the job, but unfortunately we don't have Biopython installed on our cluster so I can't run it (same with Bioperl).

      What I am looking for is basically a standalone script that would do the trick - like the one for Fastq files, which works well.

      Comment


      • #18
        mergeShuffledFastqSeqs.pl script issues

        I can't get the mergeShuffledFastqSeqs.pl script to work with my data. I have two shuffled paired-read files (one is 63GB in size and the other is 35GB). When I submit the job via a batch script, it gets killed and I'm left with the following message: "swap rate due to memory oversubscription is too high".
        I've allocated 512GB of memory for this run, so I don't think that is the cause. I've also tried predefining the hash table size in the merge...pl script to values between 4 and 100 billion, but this hasn't worked. Does anyone have any ideas?

        Thanks,
        bmtb



        Originally posted by azneto View Post
        Hi,
        It most probably is a memory issue.
        The script loads only the first file into memory and then matches its entries against the second file. You'll have to monitor the memory usage ('top' or 'free -m').
        I just ran a test and perl used 220GB of RAM for two 33GB fastq files.
        Soon I'll start to search for alternative ways to handle memory using perl in order to improve the script. I'll let you know.
        -Adhemar
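
The approach described in the quoted post (hold the first file in memory, then stream the second file against it) can be sketched roughly as follows. This is a Python illustration of the general idea, not the Perl script itself: the function names `read_fastq` and `pair_reads` are hypothetical, and the header regex is the one quoted later in this thread. Because every record of the first file is kept in a dictionary, memory use grows with the size of that file.

```python
import re

# Header pattern as quoted later in this thread; group 1 captures the read ID.
HEADER_RE = re.compile(r'^@(\S+)\s[1|2]\S+$')


def read_fastq(handle):
    """Yield (read_id, record_text) for each 4-line FASTQ record.

    read_id is None when the header does not match HEADER_RE.
    """
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return
        match = HEADER_RE.match(lines[0].rstrip("\r\n"))
        yield (match.group(1) if match else None), "".join(lines)


def pair_reads(f1, f2):
    """Load all of f1 into a dict, then stream f2 and look up each ID."""
    first = {rid: rec for rid, rec in read_fastq(f1) if rid is not None}
    pairs, orphans = [], []
    for rid, rec in read_fastq(f2):
        mate = first.pop(rid, None) if rid is not None else None
        if mate is None:
            orphans.append(rec)          # no partner found in f1
        else:
            pairs.append((mate, rec))
    orphans.extend(first.values())       # f1 reads never seen in f2
    return pairs, orphans
```

A real tool would write pairs and orphans to files as it goes rather than collecting them in lists; the lists are only there to keep the sketch short.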

        Comment


        • #19
          Hi bmtb,
          Sorry it took me so long to reply.
          The version of the script you have uses 40x the size of the f1 file.
          I've just attached a version that uses about 6x.
          So, if you use the 35GB file as f1 you should be able to run it this time.
          Please let me know if it worked.
          Perl hashes are really memory-consuming structures and we're studying alternatives.
          Best,
          Adhemar
          Attached Files
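
One general way to shrink the in-memory table is to store only the byte offset of each record instead of the record itself, and seek back into the file when a mate is found. The sketch below illustrates that idea in Python; it is not a description of what the attached version does, and `index_offsets` and `fetch_record` are hypothetical names.

```python
import re

# Same header pattern as used in this thread, compiled for bytes because the
# file is opened in binary mode so that tell()/seek() offsets are exact.
HEADER_RE = re.compile(rb'^@(\S+)\s[1|2]\S+$')


def index_offsets(path):
    """Map read ID -> byte offset of its 4-line record in the file at `path`."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            pos = handle.tell()
            header = handle.readline()
            if not header:
                break
            for _ in range(3):           # skip sequence, '+', and quality lines
                handle.readline()
            match = HEADER_RE.match(header.rstrip(b"\r\n"))
            if match:
                offsets[match.group(1)] = pos
    return offsets


def fetch_record(handle, offset):
    """Return the 4-line record starting at `offset` in an open binary file."""
    handle.seek(offset)
    return b"".join(handle.readline() for _ in range(4))
```

The trade-off is extra disk seeks in exchange for keeping only an ID and an integer per read in memory.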

          Comment


          • #20
            Hi azneto,

            I guess I should have posted this earlier, but I actually got your first script to work by increasing the memory allocation. Thanks for the updated script though.

            Cheers,
            bmtb

            Comment


            • #21
              That's really good news bmtb!
              Cheers
              Adhemar

              Comment


              • #22
                Been trying to use the script provided. But I cannot seem to get the regex to work.

                @HWI-ST965:305:C0MR9ACXX:6:1113:6758:31224 1:N:0:GTGAAA

                and I'm using '^@(\S+)\s[1|2]\S+$'

                I guess I should ask if this handles .fastq.gz or does it only work on uncompressed files? Thank you.

                Comment


                • #23
                  Hello Azneto,
                  I would like to use your script to fix my mate-pairs but I have problems with the default expression definition to locate the ID:
                  @SBS123:173:C2RGEACXX:7:2214:5915:84780 1:N:0:ACTTGA
                  Could you please recommend an expression that will work in this case?

                  I'm using zipped fastq and I hope it is OK to do that.

                  Thanks!

                  Comment


                  • #24
                    Hi bwubb,
                    The regex is correct.
                    You can test it by running:

                    grep -P '^@(\S+)\s[1|2]\S+$' yourSequenceFile.fastq

                    The script does not handle zipped files.
                    -Best
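
The same check can be done outside the shell. The short Python sketch below applies the regex from this post to the header line bwubb quoted and pulls out the captured read ID. Two side notes that are general regex facts rather than anything specific to the script: `[1|2]` is a character class, so it matches `1`, `2`, or a literal `|`; and because the pattern ends in `\S+$`, a trailing carriage return (Windows line endings) or trailing space on the header line would stop it from matching.

```python
import re

pattern = re.compile(r'^@(\S+)\s[1|2]\S+$')
header = "@HWI-ST965:305:C0MR9ACXX:6:1113:6758:31224 1:N:0:GTGAAA"

match = pattern.match(header)
# Group 1 is everything between '@' and the first whitespace character.
read_id = match.group(1) if match else None
print(read_id)

# A header that still carries a carriage return does not match.
crlf_match = pattern.match(header + "\r")
print(crlf_match)
```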

                    Comment


                    • #25
                      Hi shatz,
                      The default regex should work.
                      You can test it by running:

                      grep -P '@(\S+)\s[1|2]\S+$' yourSequenceFile.fastq

                      The script does not handle zipped files.
                      -Best
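
Since the script does not read zipped input, the files need to be decompressed first. `gunzip` or `zcat` on the command line is the usual route; for completeness, here is a minimal Python sketch using only the standard library. The function name `gunzip_to` and the file names in the usage line are hypothetical.

```python
import gzip
import shutil


def gunzip_to(src_gz, dest):
    """Stream-decompress `src_gz` into `dest` without loading it all into memory."""
    with gzip.open(src_gz, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Usage would look like `gunzip_to("reads_1.fastq.gz", "reads_1.fastq")`, after which the uncompressed file can be passed to the script.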

                      Comment


                      • #26
                        Originally posted by shatz View Post
                        Hello Azneto,
                        I would like to use your script to fix my mate-pairs but I have problems with the default expression definition to locate the ID:
                        @SBS123:173:C2RGEACXX:7:2214:5915:84780 1:N:0:ACTTGA
                        Could you please recommend an expression that will work in this case?

                        I'm using zipped fastq and I hope it is OK to do that.

                        Thanks!

                        Another option would be to use Pairfq for this task, because it can handle FASTA/FASTQ and compressed (bzip2/gzip) or uncompressed data. The specific command you would want is makepairs. Just a disclaimer: I wrote this for a specific problem we were having with pairing really large numbers of sequences, and for this reason there are some dependencies, specifically for the "--index" option, which uses virtually no memory. The requirements are all explained in the documentation, and it has been tested on a number of operating systems. This may not be what you need, but it doesn't hurt to mention other options.

                        Comment


                        • #27
                          How to use the -r option?

                          Originally posted by azneto View Post
                          Hi bmtb,
                          Sorry it took me so long to reply.
                          The version of the script you have uses 40x the size of the f1 file.
                          I've just attached a version that uses about 6x.
                          So, if you use the 35GB file as f1 you should be able to run it this time.
                          Please let me know if it worked.
                          Perl hashes are really memory-consuming structures and we're studying alternatives.
                          Best,
                          Adhemar

                          Hello, I'm trying to use this script on a set of paired-end reads. My read names have the following format:
                          Code:
                          @SN1054:328:HGF77BCX2:1:1104:1293:2046 1:N:0:GAGCTGAA
                          What do I need to pass to the -r option?

                          Comment


                          • #28
                            kcritap, are you trying to make sure that the paired-end reads are kept as pairs? I would use bbduk from bbtools for trimming by quality and adapter removal.

                            bbduk.sh in=R1.fq in2=R2.fq out=R1_trimmed.fq out2=R2_trimmed.fq qtrim=r removeifeitherbad=t
                            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                            Comment


                            • #29
                              Try this '^@(\S+)\s[1|2]\S+$'

                              Comment


                              • #30
                                Hello, azneto,

                                The read headers in my files look like this: @J00160:133:HVYVWBBXX:3:1101:7476:1297 1:N:0:GAACGAAG+CTCCTTAC

                                I thought the second regex '^@(\\S+)\\s[1|2]\\S+\$' would work fine for me.
                                I'm trying to end up with two separate files so I'm running the following:

                                perl mergeShuffledFastqSeqs.pl -f1 originals/SAMPLE-READ1.fastq -f2 originals/SAMPLE-READ2.fastq -r '^@(\S+)\s[1|2]\S+$' -o mergedsequences -t

                                After the run, I end up with two empty files (mergedsequences.1.fastq and mergedsequences.2.fastq) and a large file named mergedsequences.nomatch.fastq that contains all the sequences.

                                Am I doing something wrong? Any thoughts on what's happening?

                                I appreciate any help
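
When everything lands in the nomatch file, a useful first step is to check, independently of the script, whether the regex matches the headers in each file and whether the captured IDs actually overlap. The cause in this particular case is not known from the post; the Python sketch below is only a diagnostic aid, and `header_ids` and `diagnose` are hypothetical names. It keeps carriage returns visible on purpose, since Windows line endings are one common reason a pattern ending in `\S+$` fails on every header.

```python
import re


def header_ids(path, regex):
    """Return (n_headers, ids): the header count and the set of captured IDs."""
    pattern = re.compile(regex)
    ids, n_headers = set(), 0
    # newline="" leaves any '\r' on the line instead of silently translating it.
    with open(path, "r", newline="") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 != 0:         # only the first line of each record
                continue
            n_headers += 1
            match = pattern.match(line.rstrip("\n"))
            if match:
                ids.add(match.group(1))
    return n_headers, ids


def diagnose(f1, f2, regex=r'^@(\S+)\s[1|2]\S+$'):
    """Summarise how many headers match in each file and how many IDs are shared."""
    n1, ids1 = header_ids(f1, regex)
    n2, ids2 = header_ids(f2, regex)
    return {
        "f1_headers": n1,
        "f1_matched_ids": len(ids1),
        "f2_headers": n2,
        "f2_matched_ids": len(ids2),
        "shared_ids": len(ids1 & ids2),
    }
```

If the matched counts are zero, the problem is the pattern or the line endings; if both files match but `shared_ids` is zero, the IDs themselves differ between the two files.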

                                Comment
