
  • #16
    Meaning of -n and -Q

    The -n flag limits the number of hits reported for a read, not the number of alignments GSNAP finds. Suppose you specify "-n 1", and GSNAP finds no alignments to the genome. Then the result would contain 0 hits, obviously. Likewise, if GSNAP finds 1 hit in the genome, the result would contain that one hit. But if GSNAP finds multiple hits, then the "-n 1" flag constrains the result to a single one.

    To find uniquely matching hits, you would need to add the "-Q" or "--quiet-if-excessive" flag. With "-n 1" and that flag, if GSNAP found multiple hits, then it would pretend that it really found no (unique) hits to the genome, and report no hits.
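
    As a rough sketch of the two cases described above, the snippet below only prints the two command lines so they can be compared side by side. The genome name "mygenome" and the file "reads.fastq" are placeholders, not values from this thread; the -d, -n and -Q flags are the ones mentioned in the posts here.

```shell
#!/bin/sh
# Placeholders (assumptions, not from the thread): genome database name and read file.
GENOME=${GENOME:-mygenome}
READS=${READS:-reads.fastq}

# Print the command lines instead of running them, so they can be reviewed first.
at_most_one_cmd() { printf 'gsnap -d %s -n 1 %s\n' "$GENOME" "$READS"; }
unique_only_cmd() { printf 'gsnap -d %s -n 1 -Q %s\n' "$GENOME" "$READS"; }

at_most_one_cmd   # a read with several hits is still reported, limited to one hit
unique_only_cmd   # a read with several hits is reported as having no hits
```

    With the first form, multi-mapping reads still appear in the output; with the second, only reads that have a single hit are reported with an alignment.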

    I hope that makes sense. If you have other questions, you can also join the gsnap-users mailing list at EBI.

    Tom

    Comment


    • #17
      Thanks Parthav and Thomas!!!

      I have paired-end reads (~5 GB per file) from an RNA-seq experiment, and I would like to run GSNAP on an HPC farm. However, the farm allows a maximum of 12 hours per job. I tried using 1 node with 16 processes, time=12 hours and memory 64000, but the job crashed. Is there any way I can run this on the HPC farm within 12 hours?

      Comment


      • #18
        Running GSNAP on a farm

        For HPC or Linux farms, it is probably best if you run GSNAP on several nodes for a given input file. You can do this easily with the -q or --part flag, which breaks up the input into parts. For example, if you want to spread the GSNAP computation over 50 nodes, you can run GSNAP like this:

        gsnap -q 00/50 <fastq file> > output.00
        gsnap -q 01/50 <fastq file> > output.01
        gsnap -q 02/50 <fastq file> > output.02
        ...
        gsnap -q 49/50 <fastq file> > output.49

        The meaning of "-q 02/50" is that in every set of 50 input reads, GSNAP computes only on the read at position 02, where positions are numbered 00 through 49 as in the commands above. If you submit each of the above jobs to a different node on your cluster, you should theoretically see a 50x speedup. However, your output will be spread among 50 different files.
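
        Writing out 50 command lines by hand is error-prone, so here is a minimal sketch that generates them. PARTS and READS are placeholder variables introduced for illustration, and the commands mirror the ones above exactly; add whatever other GSNAP options you normally use. How each line gets submitted depends on your cluster's scheduler, which is not covered here.

```shell
#!/bin/sh
# Assumptions: PARTS and READS are illustrative variables; the output.NN
# naming follows the example commands in the post above.
PARTS=${PARTS:-50}
READS=${READS:-reads.fastq}

# Emit one gsnap command per part, zero-padded to two digits (00 .. PARTS-1).
part_commands() {
    i=0
    while [ "$i" -lt "$PARTS" ]; do
        printf 'gsnap -q %02d/%d %s > output.%02d\n' "$i" "$PARTS" "$READS" "$i"
        i=$((i + 1))
    done
}

part_commands
```

        Once all jobs have finished, something like "cat output.?? > output.all" should bring the pieces back together for plain-text output. If the output format includes a header in every file, the repeated headers would need to be handled separately.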

        Note that I am working on making GSNAP a bit faster by adding an initial alignment step that computes the easy alignments very quickly and falls back on the existing algorithm for harder alignments.

        Regards,

        Tom

        Comment


        • #19
          Tom,

          Thanks for your input on how to break up the input fastq file into parts for using multiple HPC nodes, using the "-q" flag.
          In case of a gsnap mapping run involving paired-end fastq reads, I am wondering how the '-q' works. How would one specify picking the read-pairs from the R1 and R2 files?

          Thanks
          Parthav


          Comment


          • #20
            -q flag and paired-end reads

            If you have paired-end reads (by providing two files to GSNAP), then the -q flag takes the correct pairs from each of the files. That's the only thing that makes sense.

            For example, -q 2/50 takes the read at position 2 out of each set of 50, from each of the two files, so the mates stay paired.
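
            To see why taking the same position from both files keeps mates together, here is a small illustration using awk on made-up FASTQ files. This only mimics the selection rule described above, with positions counted from 0; it is not how GSNAP is implemented, and select_part and make_fastq are hypothetical helpers.

```shell
#!/bin/sh
# Illustration only: mimics the selection rule with awk, not GSNAP's own code.
# select_part PART TOTAL FILE prints the 4-line FASTQ records whose 0-based
# index within each group of TOTAL records equals PART.
select_part() {
    awk -v part="$1" -v total="$2" 'int((NR - 1) / 4) % total == part' "$3"
}

# Build a tiny, made-up FASTQ file with six reads (r0 .. r5) for one mate.
make_fastq() {  # make_fastq MATE_SUFFIX
    i=0
    while [ "$i" -lt 6 ]; do
        printf '@r%d/%s\nACGT\n+\nIIII\n' "$i" "$1"
        i=$((i + 1))
    done
}

tmpdir=$(mktemp -d)
make_fastq 1 > "$tmpdir/R1.fastq"
make_fastq 2 > "$tmpdir/R2.fastq"

# Part 2 of 3 picks r2 and r5 from both files, so the mates stay in step.
select_part 2 3 "$tmpdir/R1.fastq" | grep '^@'
select_part 2 3 "$tmpdir/R2.fastq" | grep '^@'
```

            Because the rule depends only on a read's position in its file, it selects matching mates as long as the two files list the reads in the same order.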

            Tom

            Comment


            • #21
              Thanks Tom.

              Comment


              • #22
                I have been running GSNAP as part of the Trinity pipeline.

                To begin with, I have paired-end reads, with each read file about 7 GB in size.
                The pipeline executed the following command:

                Code:
                gsnap -d target.gmap -D . -A sam -N 1 -w 10000 -n 20 -t 45 /home/amol/trimmed_datasets/AP_treated/R33_APT_s_3_1_trimmed.fastq /home/amol/trimmed_datasets/AP_treated/R33_APT_s_3_2_trimmed.fastq
                More than 30 hours have passed and the script is still showing the message "Starting alignment". I can see (using top) that gsnap is using resources, but at the same time it shows the process status as 'sleeping'.

                Can anyone please explain?

                Thanks

                Comment


                • #23
                  After running for 6 days, gsnap finally generated a 34 GB SAM file.

                  I later noticed that you can monitor the output file size under your 'gsnap_out' directory to estimate how far the alignment has progressed.

                  Code:
                  du -h gsnap_out/gsnap.sam
                  The file size increases as the alignment proceeds, so you can confirm that the program is still running.
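
                  If you would rather not compare the numbers by eye, a small helper along these lines could do the comparison. This is only a sketch: file_size and is_growing are hypothetical names, and the path and the 60-second interval in the example are assumptions.

```shell
#!/bin/sh
# Hypothetical helpers: compare the file size at two moments to see whether
# the alignment is still writing output.
file_size() { wc -c < "$1" | tr -d ' '; }

is_growing() {  # is_growing FILE SECONDS
    before=$(file_size "$1")
    sleep "$2"
    after=$(file_size "$1")
    [ "$after" -gt "$before" ]
}

# Example: is_growing gsnap_out/gsnap.sam 60 && echo "still writing"
```

                  A file that has not grown within one interval does not prove the job has stalled, since output may be written in bursts, so a longer interval is the safer choice.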

                  Comment
