![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Novoalign with a lot of genomic data | lorddoskias | Bioinformatics | 1 | 03-14-2012 09:56 AM |
Newbie. May ask a lot of stupid questions. | entropy | Introductions | 1 | 03-07-2011 06:12 PM |
Issues with new XLR70 lot | MissDNA | 454 Pyrosequencing | 7 | 03-01-2011 07:36 AM |
Sequencing reagents lot 93801720 | gilly | 454 Pyrosequencing | 3 | 11-08-2010 10:40 AM |
New York State has a lot of nerve... | ECO | Personalized Genomics | 3 | 04-21-2008 10:33 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Junior Member
Location: Vancouver Join Date: Oct 2010
Posts: 7
|
![]()
Hello,
I'm working on human exome data (Illumina, pair-end), and I have consistently notice GSNAP produces at least 2-5 times more variants than BWA. While I'm sure not all of these extra predictions are false, I am concern most of these are false positives. I've looked into the parameters for GSNAP and tried to tune them to restrict the # of variants I get. It would help a lot if anybody can share what parameters they use when running GSNAP on their NGS data, and what's their experience with it. Thanks, Casper |
![]() |
![]() |
![]() |
#2 |
Member
Location: Newcastle upon Tyne Join Date: Aug 2009
Posts: 18
|
![]()
One thing you probably can check is whether GSNAP mapped more reads than BWA. This probably not the case for exome data, but GSNAP mapped 40% more reads than NOVOalign for our RNA-seq data (Illumina, pair-end). I guess this is because GSNAP can take known SNPs and splicing sites in consideration.
I think you could check the alignment quality of the those reads that support your extra variants and see if these are real. Last edited by byb121; 03-15-2012 at 07:58 AM. |
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
For highly polymorphic genes, 2-5x more variants should be about right.
I am not sure whether highly polymorphic genes are the majority or not. But if you subscribe to the theory that majority of protein coding genes are not functionally important, then highly polymorphic genes can be the majority. |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
|
![]()
I've just started with GSNAP.
It's relatively easy to use. Comparing BWAxGenome with GSNAPxSpliceModel straight up is not good. Of course GSNAP is going to get better results as it knows about what splices there are via the "coords" file. I use BWA with my own model (align against genome, refome, estome, nonrefome: Best alignment wins). The results are similar to GSNAP (and MapSplice for that matter). The one thing that confuses me about GSNAP. In a default-ish configuration ( I did build a "coords" file from latest refFlat from UCSC), GSNAP reports multiple alignments for a read. ( Try running that through SamValidator ) Some reads map to 100 places. More snps might be a misleading metric in this kind of situation; it could mean "more false positives". [ Could be more true positives, too ... probably more of both ]. If anyone has some insights on tuning GSNAP, please share the details. |
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
http://bmi.osu.edu/hpc/papers/Hatem11-Bench.pdf
This paper says GSNAP slows down significantly when aligning pair ended reads. Is this fixed in the latest version? |
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
I am running gsnap on two 7.5GB fastqs. It's been running for more than six hours now but have yet to finish.
gsnap -A sam -B 5 -t 6 --gunzip -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz | samtools view -bS - > 2.bam Anyone knows how to tweak to make it even faster here? I learned that being slow for pair ended reads is a feature not a bug. It does this to get even better mapping than bwa according to lh3. |
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
Oh well. Crashed after nine hours of run... What gives??
GSNAP version 2012-07-12 called with args: src/gsnap -A sam -B 5 -t 6 --gunzip -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz Neither novel splicing (-N) nor known splicing (-s) turned on => assume reads are DNA-Seq (genomic) Allocating memory for compressed genome...done (1,160,885,244 bytes, 0.59 sec) Looking for index files in directory /tank/gsnap/hg19 Gammaptrs file is hg19.ref12153gammaptrs Offsetscomp file is hg19.ref12153offsetscomp Positions file is hg19.ref12153positions Allocating memory (1073741825 words) for offsets, kmer 15...done Expanding offsetscomp into offsets...done Allocating memory for ref positions, kmer 15, interval 3...done (3,815,118,780 bytes, 3.65 sec) GMAP modes: pairsearch, indel_knownsplice, terminal, improvement Starting alignment [samopen] SAM header is present: 25 sequences. *** glibc detected *** src/gsnap: free(): invalid pointer: 0x00007fe3725e0400 *** ======= Backtrace: ========= /lib/x86_64-linux-gnu/libc.so.6(+0x7e626)[0x7fe4c05bf626] src/gsnap[0x4586f8] src/gsnap[0x4e2d23] src/gsnap[0x4ecd7c] src/gsnap[0x4f675c] src/gsnap[0x4f749b] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7fe4c0905e9a] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe4c06334bd] ======= Memory map: ======== 00400000-00def000 r-xp 00000000 08:22 25145 /tank/gsnap/gmap-2012-07-12/src/gsnap 00fef000-00ff0000 r--p 009ef000 08:22 25145 /tank/gsnap/gmap-2012-07-12/src/gsnap 00ff0000-00ff3000 rw-p 009f0000 08:22 25145 /tank/gsnap/gmap-2012-07-12/src/gsnap 00ff3000-011e0000 rw-p 00000000 00:00 0 014d1000-035d1000 rw-p 00000000 00:00 0 [heap] 7fe148000000-7fe148c51000 rw-p 00000000 00:00 0 7fe148c51000-7fe14c000000 ---p 00000000 00:00 0 7fe14c000000-7fe14d732000 rw-p 00000000 00:00 0 7fe14d732000-7fe150000000 ---p 00000000 00:00 0 7fe150000000-7fe151711000 rw-p 00000000 00:00 0 7fe151711000-7fe154000000 ---p 00000000 00:00 0 7fe154000000-7fe158000000 rw-p 00000000 00:00 0 7fe158000000-7fe15c000000 rw-p 00000000 00:00 0 7fe15c000000-7fe160000000 rw-p 00000000 00:00 0 7fe160000000-7fe160021000 rw-p 00000000 00:00 0 7fe160021000-7fe164000000 ---p 00000000 00:00 0 7fe164000000-7fe168000000 rw-p 00000000 00:00 0 7fe16aafa000-7fe16c000000 rw-p 00000000 00:00 0 7fe16c000000-7fe170000000 rw-p 00000000 00:00 0 7fe1700ee000-7fe174000000 rw-p 00000000 00:00 0 7fe174000000-7fe178000000 rw-p 00000000 00:00 0 7fe17830f000-7fe17bea0000 rw-p 00000000 00:00 0 7fe17bea0000-7fe17bea1000 ---p 00000000 00:00 0 7fe17bea1000-7fe17c6a1000 rw-p 00000000 00:00 0 7fe17c6a1000-7fe17c6a2000 ---p 00000000 00:00 0 7fe17c6a2000-7fe17cea2000 rw-p 00000000 00:00 0 7fe17cea2000-7fe17cea3000 ---p 00000000 00:00 0 7fe17cea3000-7fe17d6a3000 rw-p 00000000 00:00 0 7fe17d6a3000-7fe17d6a4000 ---p 00000000 00:00 0 7fe17d6a4000-7fe17dea4000 rw-p 00000000 00:00 0 7fe17dea4000-7fe17dea5000 ---p 00000000 00:00 0 7fe17dea5000-7fe17e6a5000 rw-p 00000000 00:00 0 7fe17e6a5000-7fe17e6a6000 ---p 00000000 00:00 0 7fe17e6a6000-7fe36250a000 rw-p 00000000 00:00 0 7fe362779000-7fe364000000 rw-p 00000000 00:00 0 7fe364000000-7fe368000000 rw-p 00000000 00:00 0 7fe3680ee000-7fe36c000000 rw-p 00000000 00:00 0 7fe36c000000-7fe370000000 rw-p 00000000 00:00 0 7fe3700cd000-7fe374000000 rw-p 00000000 00:00 0 7fe374000000-7fe378000000 rw-p 00000000 00:00 0 7fe378115000-7fe37812a000 r-xp 00000000 fc:02 134628 /lib/x86_64-linux-gnu/libgcc_s.so.1 7fe37812a000-7fe378329000 ---p 00015000 fc:02 134628 /lib/x86_64-linux-gnu/libgcc_s.so.1 7fe378329000-7fe37832a000 r--p 00014000 fc:02 134628 /lib/x86_64-linux-gnu/libgcc_s.so.1 7fe37832a000-7fe37832b000 rw-p 00015000 fc:02 134628 /lib/x86_64-linux-gnu/libgcc_s.so.1 7fe378348000-7fe37aa23000 rw-p 00000000 00:00 0 7fe37aa23000-7fe37aa24000 ---p 00000000 00:00 0 7fe37aa24000-7fe4c0541000 rw-p 00000000 00:00 0 7fe4c0541000-7fe4c06f4000 r-xp 00000000 fc:02 134607 /lib/x86_64-linux-gnu/libc-2.15.so 7fe4c06f4000-7fe4c08f3000 ---p 001b3000 fc:02 134607 /lib/x86_64-linux-gnu/libc-2.15.so 7fe4c08f3000-7fe4c08f7000 r--p 001b2000 fc:02 134607 /lib/x86_64-linux-gnu/libc-2.15.so 7fe4c08f7000-7fe4c08f9000 rw-p 001b6000 fc:02 134607 /lib/x86_64-linux-gnu/libc-2.15.so 7fe4c08f9000-7fe4c08fe000 rw-p 00000000 00:00 0 7fe4c08fe000-7fe4c0916000 r-xp 00000000 fc:02 134687 /lib/x86_64-linux-gnu/libpthread-2.15.so 7fe4c0916000-7fe4c0b15000 ---p 00018000 fc:02 134687 /lib/x86_64-linux-gnu/libpthread-2.15.so 7fe4c0b15000-7fe4c0b16000 r--p 00017000 fc:02 134687 /lib/x86_64-linux-gnu/libpthread-2.15.so 7fe4c0b16000-7fe4c0b17000 rw-p 00018000 fc:02 134687 /lib/x86_64-linux-gnu/libpthread-2.15.so 7fe4c0b17000-7fe4c0b1b000 rw-p 00000000 00:00 0 7fe4c0b1b000-7fe4c0c14000 r-xp 00000000 fc:02 134639 /lib/x86_64-linux-gnu/libm-2.15.so 7fe4c0c14000-7fe4c0e13000 ---p 000f9000 fc:02 134639 /lib/x86_64-linux-gnu/libm-2.15.so 7fe4c0e13000-7fe4c0e14000 r--p 000f8000 fc:02 134639 /lib/x86_64-linux-gnu/libm-2.15.so 7fe4c0e14000-7fe4c0e15000 rw-p 000f9000 fc:02 134639 /lib/x86_64-linux-gnu/libm-2.15.so 7fe4c0e15000-7fe4c0e2b000 r-xp 00000000 fc:02 134718 /lib/x86_64-linux-gnu/libz.so.1.2.3.4 7fe4c0e2b000-7fe4c102a000 ---p 00016000 fc:02 134718 /lib/x86_64-linux-gnu/libz.so.1.2.3.4 7fe4c102a000-7fe4c102b000 r--p 00015000 fc:02 134718 /lib/x86_64-linux-gnu/libz.so.1.2.3.4 7fe4c102b000-7fe4c102c000 rw-p 00016000 fc:02 134718 /lib/x86_64-linux-gnu/libz.so.1.2.3.4 7fe4c102c000-7fe4c104e000 r-xp 00000000 fc:02 134587 /lib/x86_64-linux-gnu/ld-2.15.so 7fe4c1065000-7fe4c122f000 rw-p 00000000 00:00 0 7fe4c1244000-7fe4c1245000 rw-p 00000000 00:00 0 7fe4c1245000-7fe4c1246000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit 7fe4c1246000-7fe4c1247000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit 7fe4c1247000-7fe4c1248000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit 7fe4c1248000-7fe4c1249000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit 7fe4c1249000-7fe4c124a000 r--s 00000000 08:22 25723 /tank/gsnap/hg19/hg19.chromosome.iit 7fe4c124a000-7fe4c124e000 rw-p 00000000 00:00 0 7fe4c124e000-7fe4c124f000 r--p 00022000 fc:02 134587 /lib/x86_64-linux-gnu/ld-2.15.so 7fe4c124f000-7fe4c1251000 rw-p 00023000 fc:02 134587 /lib/x86_64-linux-gnu/ld-2.15.so 7fffcbc7e000-7fffcbc9f000 rw-p 00000000 00:00 0 [stack] 7fffcbccd000-7fffcbcce000 r-xp 00000000 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall] real 528m55.091s user 2468m7.063s sys 194m32.477s |
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
|
![]()
This works for me ...
/usr/local/gmap-gsnap/bin/gsnap -D /fdb/gmap/ -d hg19 --localsplicedist=10000 --nthreads=16 -A sam \ --novelsplicing=1 \ --read-group-id=98024 --read-group-name=98024 \ --use-splicing=/data/nextgen/finneyr/GSNAP/hg19.exons.iit \ /data/nextgen/finneyr/MapSplice_1.15.2/98024.trim3.1.fq \ /data/nextgen/finneyr/MapSplice_1.15.2/98024.trim3.2.fq \ > /data/nextgen/finneyr/GSNAP/GSNAP.24.out 2> /data/nextgen/finneyr/GSNAP/GSNAP.24.err Note: tune nthreads for your system! |
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
I am able to finish it in 26.5 hours with -B 4. I think I will report to twu about the -B 5 crash.
user@HPC:/tank/gsnap/gmap-2012-07-12$ time src/gsnap -A sam -B 4 -t 6 --gunzip -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz | samtools view -bS - > 2.bam GSNAP version 2012-07-12 called with args: src/gsnap -A sam -B 4 -t 6 --gunzip -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz Neither novel splicing (-N) nor known splicing (-s) turned on => assume reads are DNA-Seq (genomic) Allocating memory for compressed genome...done (1,160,885,244 bytes, 5.00 sec) Looking for index files in directory /tank/gsnap/hg19 Gammaptrs file is hg19.ref12153gammaptrs Offsetscomp file is hg19.ref12153offsetscomp Positions file is hg19.ref12153positions Allocating memory for ref gammaptrs, kmer 15, interval 3...done (67,108,868 bytes, 0.47 sec) Allocating memory for ref offsets, kmer 15, interval 3...done (349,277,996 bytes, 0.15 sec) Allocating memory for ref positions, kmer 15, interval 3...done (3,815,118,780 bytes, 20.67 sec) GMAP modes: pairsearch, indel_knownsplice, terminal, improvement Starting alignment [samopen] SAM header is present: 25 sequences. Processed 112049757 queries in 95349.18 seconds (1175.15 queries/sec) real 1589m29.954s user 6634m49.039s sys 514m9.496s |
![]() |
![]() |
![]() |
#10 | |
NGS specialist
Location: Malaysia Join Date: Apr 2008
Posts: 249
|
![]()
I think it's worth noting here that GSNAP has a built-in RNASeq model that allows the program to map reads to splice junctions whereas this is totally missing in novoalign as a dedicated feature.
A comparison worth doing and one which I have done before is to compare ROC curves for GSNAP/novoalign/BWA mappings to genome only for RNASeq data. Add splice junctions to your genome and repeat the analysis with the three mappers. Finally turn on splice junctions and map with GSNAP and then compare the results. It's worth comparing apples to apples. All these tools are freely available and it's usually a combination of mapper and database (genome + splices / transcripts + genome) that yield superior results. An example workflow one of my colleagues has been using consists of the following: 1) Align paired-end reads to all spliced transcripts with novoalign 2) Extract all unmapped pairs or cases where one mate maps and the other does not map 3) Align the the reads extracted in (2) to the whole genome sequence - optionally mask out pseudogenes before alignment. 4) Combine all data back together This works very well and the only technicality is that you need to convert (1) back to genome coordinates. Quote:
|
|
![]() |
![]() |
![]() |
#11 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
Finished the same run but with SNP tolerant option on with dbsnp b135 data. Took 33.5 hours to run
gsnap -A sam -B 4 -t 7 --gunzip --show-refdiff -v b135.txt -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz | samtools view -bS - > 2.bam GSNAP version 2012-07-12 called with args: src/gsnap -A sam -B 4 -t 7 --gunzip --show-refdiff -v b135.txt -D /tank/gsnap/hg19 -d hg19 ../../NA12878/SRR098401_1.filt.fastq.gz ../../NA12878/SRR098401_2.filt.fastq.gz Neither novel splicing (-N) nor known splicing (-s) turned on => assume reads are DNA-Seq (genomic) Allocating memory for compressed genome...done (1,160,885,244 bytes, 1.04 sec) Allocating memory for compressed genome...done (1,160,885,244 bytes, 1.97 sec) Looking for index files in directory /tank/gsnap/hg19 Gammaptrs file is hg19.ref12153gammaptrs.b135.txt Offsetscomp file is hg19.ref12153offsetscomp.b135.txt Positions file is hg19.ref12153positions.b135.txt Allocating memory for ref (b135.txt) gammaptrs, kmer 15, interval 3...done (67,108,868 bytes, 0.03 sec) Allocating memory for ref (b135.txt) offsets, kmer 15, interval 3...done (371,165,296 bytes, 0.22 sec) Allocating memory for ref (b135.txt) positions, kmer 15, interval 3...done (4,799,239,364 bytes, 2.03 sec) Reading SNPs file /tank/gsnap/hg19/hg19.maps/b135.txt...done GMAP modes: pairsearch, indel_knownsplice, terminal, improvement Starting alignment [samopen] SAM header is present: 25 sequences. Processed 112049757 queries in 120673.74 seconds (928.53 queries/sec) real 2011m30.223s user 7803m5.356s sys 628m57.726s |
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: bethesda Join Date: Feb 2009
Posts: 700
|
![]()
Well. I've looked at Mapsplice, Gsnap, Rum and my own tool which is very similar to zee's workflow (see zee's comment).
Here's my comments ... Scores are based on my looking at the results for many hours and chasing down why reads went where they went. Sorry, no "truth" to compare against, just pure subjective assessment. GSNAP: very easy to use but not the best results. Great concept but needs work. Multiple entries per read. Score: 80. Mapsplice: Paired mode doesn't work (Can't alloc memory ... tell me what you couldn't alloc and how many bytes ... phreaking C++ "try catch" silliness ... put a darn error check after the "new" and print out a real error message.). Decent results in unpaired mode. Absolutely annoying amounts of temp files. Has multiple entries for a read. Rare bogus snps in bases in intron near exon. Score 87. RUM: Good results. This bottle is definitely half full. Needs to support pre-trimmed reads in paired mode. A little aggressive with some (blat assigned ?) reads that should be probably be left unmapped. Renames the read name. Has multiple entries for a read . Probably needs a cluster to to run on. Score 92. My own: Results stunningly similar to RUM, though, without the overly aggressive (blat?) alignments. Does support Solid. Occasional bogus mapping to EST. Not published so ... Score 0 and disqualified!. Winner : RUM. |
![]() |
![]() |
![]() |
#13 |
Senior Member
Location: Hong Kong Join Date: Mar 2010
Posts: 498
|
![]()
I find GSNAP with SNP-tolerant on placing fewer reads for highly polymorphic region than bwa. It also soft clipped more than bwa. I think I will stick to bwa and tweak its gap penalty to see if it can give what I want.
|
![]() |
![]() |
![]() |
#14 | |
Member
Location: Denver, CO Join Date: Mar 2011
Posts: 37
|
![]()
in rum, you can preserve read names using ``--preserve-names`` (where all reads are the same length).
i wish they all had the equivalent to ``bowtie -m1``. Quote:
|
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|