Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GSNAP gives a lot of variants!?

    Hello,
    I'm working on human exome data (Illumina, pair-end), and I have consistently notice GSNAP produces at least 2-5 times more variants than BWA. While I'm sure not all of these extra predictions are false, I am concern most of these are false positives.
    I've looked into the parameters for GSNAP and tried to tune them to restrict the # of variants I get. It would help a lot if anybody can share what parameters they use when running GSNAP on their NGS data, and what's their experience with it.

    Thanks,
    Casper

  • #2
    One thing you probably can check is whether GSNAP mapped more reads than BWA. This probably not the case for exome data, but GSNAP mapped 40% more reads than NOVOalign for our RNA-seq data (Illumina, pair-end). I guess this is because GSNAP can take known SNPs and splicing sites in consideration.

    I think you could check the alignment quality of the those reads that support your extra variants and see if these are real.
    Last edited by byb121; 03-15-2012, 06:58 AM.

    Comment


    • #3
      For highly polymorphic genes, 2-5x more variants should be about right.

      I am not sure whether highly polymorphic genes are the majority or not. But if you subscribe to the theory that majority of protein coding genes are not functionally important, then highly polymorphic genes can be the majority.

      Comment


      • #4
        I've just started with GSNAP.
        It's relatively easy to use.
        Comparing BWAxGenome with GSNAPxSpliceModel straight up is not good. Of course GSNAP is going to get better results as it knows about what splices there are via the "coords" file.

        I use BWA with my own model (align against genome, refome, estome, nonrefome: Best alignment wins).

        The results are similar to GSNAP (and MapSplice for that matter).

        The one thing that confuses me about GSNAP. In a default-ish configuration ( I did build a "coords" file from latest refFlat from UCSC), GSNAP reports multiple alignments for a read. ( Try running that through SamValidator ) Some reads map to 100 places.

        More snps might be a misleading metric in this kind of situation; it could mean "more false positives". [ Could be more true positives, too ... probably more of both ].

        If anyone has some insights on tuning GSNAP, please share the details.

        Comment


        • #5
          Here is some information from The Ohio State University College of Medicine I wanted to share with you.


          This paper says GSNAP slows down significantly when aligning pair ended reads. Is this fixed in the latest version?

          Comment


          • #6
            I am running gsnap on two 7.5GB fastqs. It's been running for more than six hours now but have yet to finish.

            gsnap -A sam -B 5 -t 6 --gunzip -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz | samtools view -bS - > 2.bam

            Anyone knows how to tweak to make it even faster here?

            I learned that being slow for pair ended reads is a feature not a bug. It does this to get even better mapping than bwa according to lh3.

            Comment


            • #7
              Oh well. Crashed after nine hours of run... What gives??

              GSNAP version 2012-07-12 called with args: src/gsnap -A sam -B 5 -t 6 --gunzip -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz
              Neither novel splicing (-N) nor known splicing (-s) turned on => assume reads are DNA-Seq (genomic)
              Allocating memory for compressed genome...done (1,160,885,244 bytes, 0.59 sec)
              Looking for index files in directory /tank/gsnap/hg19
              Gammaptrs file is hg19.ref12153gammaptrs
              Offsetscomp file is hg19.ref12153offsetscomp
              Positions file is hg19.ref12153positions
              Allocating memory (1073741825 words) for offsets, kmer 15...done
              Expanding offsetscomp into offsets...done
              Allocating memory for ref positions, kmer 15, interval 3...done (3,815,118,780 bytes, 3.65 sec)
              GMAP modes: pairsearch, indel_knownsplice, terminal, improvement
              Starting alignment
              [samopen] SAM header is present: 25 sequences.
              *** glibc detected *** src/gsnap: free(): invalid pointer: 0x00007fe3725e0400 ***
              ======= Backtrace: =========
              /lib/x86_64-linux-gnu/libc.so.6(+0x7e626)[0x7fe4c05bf626]
              src/gsnap[0x4586f8]
              src/gsnap[0x4e2d23]
              src/gsnap[0x4ecd7c]
              src/gsnap[0x4f675c]
              src/gsnap[0x4f749b]
              /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7fe4c0905e9a]
              /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe4c06334bd]
              ======= Memory map: ========
              00400000-00def000 r-xp 00000000 08:22 25145 /tank/gsnap/gmap-2012-07-12/src/gsnap
              00fef000-00ff0000 r--p 009ef000 08:22 25145 /tank/gsnap/gmap-2012-07-12/src/gsnap
              00ff0000-00ff3000 rw-p 009f0000 08:22 25145 /tank/gsnap/gmap-2012-07-12/src/gsnap
              00ff3000-011e0000 rw-p 00000000 00:00 0
              014d1000-035d1000 rw-p 00000000 00:00 0 [heap]
              7fe148000000-7fe148c51000 rw-p 00000000 00:00 0
              7fe148c51000-7fe14c000000 ---p 00000000 00:00 0
              7fe14c000000-7fe14d732000 rw-p 00000000 00:00 0
              7fe14d732000-7fe150000000 ---p 00000000 00:00 0
              7fe150000000-7fe151711000 rw-p 00000000 00:00 0
              7fe151711000-7fe154000000 ---p 00000000 00:00 0
              7fe154000000-7fe158000000 rw-p 00000000 00:00 0
              7fe158000000-7fe15c000000 rw-p 00000000 00:00 0
              7fe15c000000-7fe160000000 rw-p 00000000 00:00 0
              7fe160000000-7fe160021000 rw-p 00000000 00:00 0
              7fe160021000-7fe164000000 ---p 00000000 00:00 0
              7fe164000000-7fe168000000 rw-p 00000000 00:00 0
              7fe16aafa000-7fe16c000000 rw-p 00000000 00:00 0
              7fe16c000000-7fe170000000 rw-p 00000000 00:00 0
              7fe1700ee000-7fe174000000 rw-p 00000000 00:00 0
              7fe174000000-7fe178000000 rw-p 00000000 00:00 0
              7fe17830f000-7fe17bea0000 rw-p 00000000 00:00 0
              7fe17bea0000-7fe17bea1000 ---p 00000000 00:00 0
              7fe17bea1000-7fe17c6a1000 rw-p 00000000 00:00 0
              7fe17c6a1000-7fe17c6a2000 ---p 00000000 00:00 0
              7fe17c6a2000-7fe17cea2000 rw-p 00000000 00:00 0
              7fe17cea2000-7fe17cea3000 ---p 00000000 00:00 0
              7fe17cea3000-7fe17d6a3000 rw-p 00000000 00:00 0
              7fe17d6a3000-7fe17d6a4000 ---p 00000000 00:00 0
              7fe17d6a4000-7fe17dea4000 rw-p 00000000 00:00 0
              7fe17dea4000-7fe17dea5000 ---p 00000000 00:00 0
              7fe17dea5000-7fe17e6a5000 rw-p 00000000 00:00 0
              7fe17e6a5000-7fe17e6a6000 ---p 00000000 00:00 0
              7fe17e6a6000-7fe36250a000 rw-p 00000000 00:00 0
              7fe362779000-7fe364000000 rw-p 00000000 00:00 0
              7fe364000000-7fe368000000 rw-p 00000000 00:00 0
              7fe3680ee000-7fe36c000000 rw-p 00000000 00:00 0
              7fe36c000000-7fe370000000 rw-p 00000000 00:00 0
              7fe3700cd000-7fe374000000 rw-p 00000000 00:00 0
              7fe374000000-7fe378000000 rw-p 00000000 00:00 0
              7fe378115000-7fe37812a000 r-xp 00000000 fc:02 134628 /lib/x86_64-linux-gnu/libgcc_s.so.1
              7fe37812a000-7fe378329000 ---p 00015000 fc:02 134628 /lib/x86_64-linux-gnu/libgcc_s.so.1
              7fe378329000-7fe37832a000 r--p 00014000 fc:02 134628 /lib/x86_64-linux-gnu/libgcc_s.so.1
              7fe37832a000-7fe37832b000 rw-p 00015000 fc:02 134628 /lib/x86_64-linux-gnu/libgcc_s.so.1
              7fe378348000-7fe37aa23000 rw-p 00000000 00:00 0
              7fe37aa23000-7fe37aa24000 ---p 00000000 00:00 0
              7fe37aa24000-7fe4c0541000 rw-p 00000000 00:00 0
              7fe4c0541000-7fe4c06f4000 r-xp 00000000 fc:02 134607 /lib/x86_64-linux-gnu/libc-2.15.so
              7fe4c06f4000-7fe4c08f3000 ---p 001b3000 fc:02 134607 /lib/x86_64-linux-gnu/libc-2.15.so
              7fe4c08f3000-7fe4c08f7000 r--p 001b2000 fc:02 134607 /lib/x86_64-linux-gnu/libc-2.15.so
              7fe4c08f7000-7fe4c08f9000 rw-p 001b6000 fc:02 134607 /lib/x86_64-linux-gnu/libc-2.15.so
              7fe4c08f9000-7fe4c08fe000 rw-p 00000000 00:00 0
              7fe4c08fe000-7fe4c0916000 r-xp 00000000 fc:02 134687 /lib/x86_64-linux-gnu/libpthread-2.15.so
              7fe4c0916000-7fe4c0b15000 ---p 00018000 fc:02 134687 /lib/x86_64-linux-gnu/libpthread-2.15.so
              7fe4c0b15000-7fe4c0b16000 r--p 00017000 fc:02 134687 /lib/x86_64-linux-gnu/libpthread-2.15.so
              7fe4c0b16000-7fe4c0b17000 rw-p 00018000 fc:02 134687 /lib/x86_64-linux-gnu/libpthread-2.15.so
              7fe4c0b17000-7fe4c0b1b000 rw-p 00000000 00:00 0
              7fe4c0b1b000-7fe4c0c14000 r-xp 00000000 fc:02 134639 /lib/x86_64-linux-gnu/libm-2.15.so
              7fe4c0c14000-7fe4c0e13000 ---p 000f9000 fc:02 134639 /lib/x86_64-linux-gnu/libm-2.15.so
              7fe4c0e13000-7fe4c0e14000 r--p 000f8000 fc:02 134639 /lib/x86_64-linux-gnu/libm-2.15.so
              7fe4c0e14000-7fe4c0e15000 rw-p 000f9000 fc:02 134639 /lib/x86_64-linux-gnu/libm-2.15.so
              7fe4c0e15000-7fe4c0e2b000 r-xp 00000000 fc:02 134718 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
              7fe4c0e2b000-7fe4c102a000 ---p 00016000 fc:02 134718 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
              7fe4c102a000-7fe4c102b000 r--p 00015000 fc:02 134718 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
              7fe4c102b000-7fe4c102c000 rw-p 00016000 fc:02 134718 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
              7fe4c102c000-7fe4c104e000 r-xp 00000000 fc:02 134587 /lib/x86_64-linux-gnu/ld-2.15.so
              7fe4c1065000-7fe4c122f000 rw-p 00000000 00:00 0
              7fe4c1244000-7fe4c1245000 rw-p 00000000 00:00 0
              7fe4c1245000-7fe4c1246000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit
              7fe4c1246000-7fe4c1247000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit
              7fe4c1247000-7fe4c1248000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit
              7fe4c1248000-7fe4c1249000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit
              7fe4c1249000-7fe4c124a000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit
              7fe4c124a000-7fe4c124e000 rw-p 00000000 00:00 0
              7fe4c124e000-7fe4c124f000 r--p 00022000 fc:02 134587 /lib/x86_64-linux-gnu/ld-2.15.so
              7fe4c124f000-7fe4c1251000 rw-p 00023000 fc:02 134587 /lib/x86_64-linux-gnu/ld-2.15.so
              7fffcbc7e000-7fffcbc9f000 rw-p 00000000 00:00 0 [stack]
              7fffcbccd000-7fffcbcce000 r-xp 00000000 00:00 0 [vdso]
              ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

              real 528m55.091s
              user 2468m7.063s
              sys 194m32.477s

              Comment


              • #8
                This works for me ...
                /usr/local/gmap-gsnap/bin/gsnap -D /fdb/gmap/ -d hg19 --localsplicedist=10000 --nthreads=16 -A sam \
                --novelsplicing=1 \
                --read-group-id=98024 --read-group-name=98024 \
                --use-splicing=/data/nextgen/finneyr/GSNAP/hg19.exons.iit \
                /data/nextgen/finneyr/MapSplice_1.15.2/98024.trim3.1.fq \
                /data/nextgen/finneyr/MapSplice_1.15.2/98024.trim3.2.fq \
                > /data/nextgen/finneyr/GSNAP/GSNAP.24.out 2> /data/nextgen/finneyr/GSNAP/GSNAP.24.err

                Note: tune nthreads for your system!

                Comment


                • #9
                  I am able to finish it in 26.5 hours with -B 4. I think I will report to twu about the -B 5 crash.

                  user@HPC:/tank/gsnap/gmap-2012-07-12$ time src/gsnap -A sam -B 4 -t 6 --gunzip -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz | samtools view -bS - > 2.bam
                  GSNAP version 2012-07-12 called with args: src/gsnap -A sam -B 4 -t 6 --gunzip -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz
                  Neither novel splicing (-N) nor known splicing (-s) turned on => assume reads are DNA-Seq (genomic)
                  Allocating memory for compressed genome...done (1,160,885,244 bytes, 5.00 sec)
                  Looking for index files in directory /tank/gsnap/hg19
                  Gammaptrs file is hg19.ref12153gammaptrs
                  Offsetscomp file is hg19.ref12153offsetscomp
                  Positions file is hg19.ref12153positions
                  Allocating memory for ref gammaptrs, kmer 15, interval 3...done (67,108,868 bytes, 0.47 sec)
                  Allocating memory for ref offsets, kmer 15, interval 3...done (349,277,996 bytes, 0.15 sec)
                  Allocating memory for ref positions, kmer 15, interval 3...done (3,815,118,780 bytes, 20.67 sec)
                  GMAP modes: pairsearch, indel_knownsplice, terminal, improvement
                  Starting alignment
                  [samopen] SAM header is present: 25 sequences.
                  Processed 112049757 queries in 95349.18 seconds (1175.15 queries/sec)

                  real 1589m29.954s
                  user 6634m49.039s
                  sys 514m9.496s

                  Comment


                  • #10
                    I think it's worth noting here that GSNAP has a built-in RNASeq model that allows the program to map reads to splice junctions whereas this is totally missing in novoalign as a dedicated feature.
                    A comparison worth doing and one which I have done before is to compare ROC curves for GSNAP/novoalign/BWA mappings to genome only for RNASeq data. Add splice junctions to your genome and repeat the analysis with the three mappers. Finally turn on splice junctions and map with GSNAP and then compare the results. It's worth comparing apples to apples.
                    All these tools are freely available and it's usually a combination of mapper and database (genome + splices / transcripts + genome) that yield superior results.

                    An example workflow one of my colleagues has been using consists of the following:

                    1) Align paired-end reads to all spliced transcripts with novoalign
                    2) Extract all unmapped pairs or cases where one mate maps and the other does not map
                    3) Align the the reads extracted in (2) to the whole genome sequence - optionally mask out pseudogenes before alignment.
                    4) Combine all data back together

                    This works very well and the only technicality is that you need to convert (1) back to genome coordinates.



                    Originally posted by byb121 View Post
                    One thing you probably can check is whether GSNAP mapped more reads than BWA. This probably not the case for exome data, but GSNAP mapped 40% more reads than NOVOalign for our RNA-seq data (Illumina, pair-end). I guess this is because GSNAP can take known SNPs and splicing sites in consideration.

                    I think you could check the alignment quality of the those reads that support your extra variants and see if these are real.

                    Comment


                    • #11
                      Finished the same run but with SNP tolerant option on with dbsnp b135 data. Took 33.5 hours to run

                      gsnap -A sam -B 4 -t 7 --gunzip --show-refdiff -v b135.txt -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz | samtools view -bS - > 2.bam
                      GSNAP version 2012-07-12 called with args: src/gsnap -A sam -B 4 -t 7 --gunzip --show-refdiff -v b135.txt -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz
                      Neither novel splicing (-N) nor known splicing (-s) turned on => assume reads are DNA-Seq (genomic)
                      Allocating memory for compressed genome...done (1,160,885,244 bytes, 1.04 sec)
                      Allocating memory for compressed genome...done (1,160,885,244 bytes, 1.97 sec)
                      Looking for index files in directory /tank/gsnap/hg19
                      Gammaptrs file is hg19.ref12153gammaptrs.b135.txt
                      Offsetscomp file is hg19.ref12153offsetscomp.b135.txt
                      Positions file is hg19.ref12153positions.b135.txt
                      Allocating memory for ref (b135.txt) gammaptrs, kmer 15, interval 3...done (67,108,868 bytes, 0.03 sec)
                      Allocating memory for ref (b135.txt) offsets, kmer 15, interval 3...done (371,165,296 bytes, 0.22 sec)
                      Allocating memory for ref (b135.txt) positions, kmer 15, interval 3...done (4,799,239,364 bytes, 2.03 sec)
                      Reading SNPs file /tank/gsnap/hg19/hg19.maps/b135.txt...done
                      GMAP modes: pairsearch, indel_knownsplice, terminal, improvement
                      Starting alignment
                      [samopen] SAM header is present: 25 sequences.
                      Processed 112049757 queries in 120673.74 seconds (928.53 queries/sec)

                      real 2011m30.223s
                      user 7803m5.356s
                      sys 628m57.726s

                      Comment


                      • #12
                        Well. I've looked at Mapsplice, Gsnap, Rum and my own tool which is very similar to zee's workflow (see zee's comment).

                        Here's my comments ...
                        Scores are based on my looking at the results for many hours and chasing down why reads went where they went. Sorry, no "truth" to compare against, just pure subjective assessment.

                        GSNAP: very easy to use but not the best results. Great concept but needs work. Multiple entries per read. Score: 80.

                        Mapsplice: Paired mode doesn't work (Can't alloc memory ... tell me what you couldn't alloc and how many bytes ... phreaking C++ "try catch" silliness ... put a darn error check after the "new" and print out a real error message.). Decent results in unpaired mode. Absolutely annoying amounts of temp files. Has multiple entries for a read. Rare bogus snps in bases in intron near exon. Score 87.

                        RUM: Good results. This bottle is definitely half full. Needs to support pre-trimmed reads in paired mode. A little aggressive with some (blat assigned ?) reads that should be probably be left unmapped. Renames the read name. Has multiple entries for a read . Probably needs a cluster to to run on. Score 92.

                        My own: Results stunningly similar to RUM, though, without the overly aggressive (blat?) alignments. Does support Solid. Occasional bogus mapping to EST. Not published so ... Score 0 and disqualified!.

                        Winner : RUM.

                        Comment


                        • #13
                          I find GSNAP with SNP-tolerant on placing fewer reads for highly polymorphic region than bwa. It also soft clipped more than bwa. I think I will stick to bwa and tweak its gap penalty to see if it can give what I want.

                          Comment


                          • #14
                            in rum, you can preserve read names using ``--preserve-names`` (where all reads are the same length).

                            i wish they all had the equivalent to ``bowtie -m1``.

                            Originally posted by Richard Finney View Post
                            Well. I've looked at Mapsplice, Gsnap, Rum and my own tool which is very similar to zee's workflow (see zee's comment).

                            Here's my comments ...
                            Scores are based on my looking at the results for many hours and chasing down why reads went where they went. Sorry, no "truth" to compare against, just pure subjective assessment.

                            GSNAP: very easy to use but not the best results. Great concept but needs work. Multiple entries per read. Score: 80.

                            Mapsplice: Paired mode doesn't work (Can't alloc memory ... tell me what you couldn't alloc and how many bytes ... phreaking C++ "try catch" silliness ... put a darn error check after the "new" and print out a real error message.). Decent results in unpaired mode. Absolutely annoying amounts of temp files. Has multiple entries for a read. Rare bogus snps in bases in intron near exon. Score 87.

                            RUM: Good results. This bottle is definitely half full. Needs to support pre-trimmed reads in paired mode. A little aggressive with some (blat assigned ?) reads that should be probably be left unmapped. Renames the read name. Has multiple entries for a read . Probably needs a cluster to to run on. Score 92.

                            My own: Results stunningly similar to RUM, though, without the overly aggressive (blat?) alignments. Does support Solid. Occasional bogus mapping to EST. Not published so ... Score 0 and disqualified!.

                            Winner : RUM.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            7 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            7 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            49 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            66 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X