SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
GMAP vs GSNAP Artur Jaroszewicz Bioinformatics 1 01-27-2012 04:45 AM
GSNAP Output splice junctions ChrisAU RNA Sequencing 0 01-23-2012 03:59 AM
GSNAP and GFF markestine Bioinformatics 0 04-11-2011 05:24 AM
Understanding GSNAP output burt Bioinformatics 0 01-16-2011 06:06 PM
problem with gsnap sam output lindseyjane Bioinformatics 2 11-04-2010 04:27 AM

Reply
 
Thread Tools
Old 09-26-2011, 08:57 PM   #1
FuzzyCoder
Member
 
Location: Minnesota

Join Date: Aug 2011
Posts: 13
Default GSNAP Output

I have been able to run GSNAP on a 8GB, 8 cpu machine using the following command (executed from the GSNAP bin directory):

Code:
./gsnap --format=sam --read-group-id=30D --batch=4 --nthreads=8 --read-group-name=Condition1 --input-buffer-size=5000 --max-mismatches=4 --kmer=15 --dir=/home/myPath/gsnap/gmapdb/mm9 --db=mm9 --novelsplicing=1 -s mm9.splicesites.iit /home/myPath/gsnap/30D_217_12WKS_R1.fastq /home/myPath/gsnap/30D_217_12WKS_R2.fastq
As it ran, it streamed what appeared to be output of aligned/mapped and annotated reads.

However, I am unable to find any output file (SAM or otherwise) generated by the process. The process appeared to end without error, reporting the reads/sec performance of GSNAP.

What do I need to do to either find the output file (already searched for *.sam, and *.* within past day, everywhere), or specify where the output should be written?
__________________
Best Regards,

Paul Bergmann
FuzzyCoder is offline   Reply With Quote
Old 09-26-2011, 11:15 PM   #2
schelhorn
Member
 
Location: Germany

Join Date: Sep 2010
Posts: 10
Default

Umm, I might be wrong since I have't run your configuration, but shouldn't redirecting the stdout output into a file be all you need? SAM is a line-based ASCII format, probably that's exactly the running output you're seeing. If that is the case just append
Code:
> mysamfile.sam
at the end of the command and you should be good. Btw, having a bioinformatician in the department helps, too
schelhorn is offline   Reply With Quote
Old 10-03-2011, 01:04 PM   #3
twu
Developer of GMAP and GSNAP
 
Location: South San Francisco, CA

Join Date: Oct 2011
Posts: 17
Default

Hi,

I am the developer of GSNAP. The program as you have run it will send the output to stdout (which is the stream of alignments you saw). You should therefore send the output into a file, as suggested above, using the "> outputfile" convention for Unix.

Alternatively, there is a flag called --split-output that will send output to individual files, according to how they aligned (i.e., concordant unique, halfmapping multiple, and so on).

Regards,

Thomas Wu
twu is offline   Reply With Quote
Old 12-15-2011, 05:02 PM   #4
gpcr
Member
 
Location: usa

Join Date: May 2010
Posts: 18
Default gsnap runn error?

Can any one solve this error:

Code:
gsnap --split-output bcsf -A sam -B 4 -t 8 -k 15 --read-group-id=SN388 --read-group-name=bcsf --input-buffer-size=3000 -D /data/gmap/hg19 -d hg19 -N 1 -s /data/gmap/hg19/hg19.maps/hg19.refGene.splicesitesfile.iit /seq/L005.ft.R1_1.fastq /seq/L005.ft.R2_2.fastq
Error message:

Novel splicing (-N) and known splicing (-s) both turned on => assume reads are RNA-Seq
Allocating memory for compressed genome...done (1,160,885,244 bytes, 1.79 sec)
Looking for index files in directory /data/gmap/hg19
Gammaptrs file is hg19.ref12153gammaptrs
Offsetscomp file is hg19.ref12153offsetscomp
Positions file is hg19.ref12153positions
Allocating memory for ref gammaptrs, kmer 15, interval 3...done (67,108,868 bytes, 0.11 sec)
Allocating memory for ref offsets, kmer 15, interval 3...done (349,277,996 bytes, 0.53 sec)
Allocating memory for ref positions, kmer 15, interval 3...done (3,815,118,780 bytes, 5.79 sec)
Reading splicing file /data/gmap/hg19/hg19.maps/hg19.refGene.splicesitesfile.iit locally...found donor and acceptor tags, so treating as splicesites file
splice distances present...390347 unique splicesites...390347 valid splicesites...splicetrie_obs has 644549 entries...splicetrie_max has 48518627 entries...done
GMAP modes: pairsearch, terminal, improvement
Starting alignment
Illegal instruction
gpcr is offline   Reply With Quote
Old 12-15-2011, 07:25 PM   #5
twu
Developer of GMAP and GSNAP
 
Location: South San Francisco, CA

Join Date: Oct 2011
Posts: 17
Default Illegal instruction in GSNAP

The illegal instruction error is occurring because the environment where you built GSNAP is different from the environment where you are running it. When you build GSNAP, the configure command sees if it can compile and run the program with the flag "-mpopcnt", which uses a machine-level instruction to count the number of bits in a word and results in a minor increase in speed. However, if you build on a machine where -mpopcnt works and then you move the program to a machine where the built-in popcount instruction does not exist, the machine will complain about an illegal instruction.

The solution is to rebuild GSNAP, this time running configure with the flag --disable-popcnt, which will not compile the programs with the "-mpopcnt" flag, even if it works in your build environment.

If you have any further questions or problems, please feel free to send email directly to me at twu@gene.com.

Regards,

Thomas Wu
twu is offline   Reply With Quote
Old 12-16-2011, 07:09 AM   #6
gpcr
Member
 
Location: usa

Join Date: May 2010
Posts: 18
Thumbs up

@Twu; I recompiled gmap in the current computer its working fine so far...Thanks!!!
gpcr is offline   Reply With Quote
Old 03-19-2012, 09:13 AM   #7
ParthavJailwala
Member
 
Location: Maryland, USA

Join Date: Oct 2009
Posts: 27
Default

Hi Thomas (@twu),

We have been trying to use GSNAP for alignment, followed by cufflinks for downstream analysis on RNA-seq data. When we look at the SAM file produced by GSNAP (alignment done with splicing on), the 'XS:A: ' tag should have either 'XS:A:+' or 'XS:A:-' to represent strand information for splicing reads; however, we also notice numerous alignment entries to have an 'XS:A:?'. These entries are not recognized by cufflinks as it expects either a '+' or a '-' in the XS:A tag. Any clues on what we should do to make the SAM file compatible with cufflinks? Thanks
ParthavJailwala is offline   Reply With Quote
Old 03-19-2012, 10:47 AM   #8
gpcr
Member
 
Location: usa

Join Date: May 2010
Posts: 18
Default

I think new version might have rectified 'XS:A:?'.
check here for the updated version
http://research-pub.gene.com/gmap/
"Eliminated reporting of XS:A:? in cases of long deletions being considered incorrectly as introns "
gpcr is offline   Reply With Quote
Old 03-19-2012, 12:39 PM   #9
ParthavJailwala
Member
 
Location: Maryland, USA

Join Date: Oct 2009
Posts: 27
Default

@gpcr,
Thanks for that pointer....it turns out we had the path to the older version in the .bash_profile.
We will run GSNAP using the latest version to see if the XS tag errors are resolved.

Thanks
ParthavJailwala is offline   Reply With Quote
Old 03-20-2012, 02:38 PM   #10
twu
Developer of GMAP and GSNAP
 
Location: South San Francisco, CA

Join Date: Oct 2011
Posts: 17
Default

Also, I am about to release a new version later today (version 2012-03-20). I believe this new version should completely eliminate all occurrences of "XS:A:?".
twu is offline   Reply With Quote
Old 03-20-2012, 06:52 PM   #11
ParthavJailwala
Member
 
Location: Maryland, USA

Join Date: Oct 2009
Posts: 27
Default

Quote:
Originally Posted by twu View Post
Also, I am about to release a new version later today (version 2012-03-20). I believe this new version should completely eliminate all occurrences of "XS:A:?".
Hi Thomas,

Thanks for your response. I checked the GSNAP webpage, and yes, there is a new version indeed (2012-03-20). We will use this version and see if the 'XS:A:?' related errors are resolved.

Thanks
ParthavJailwala is offline   Reply With Quote
Old 03-22-2012, 11:36 AM   #12
ParthavJailwala
Member
 
Location: Maryland, USA

Join Date: Oct 2009
Posts: 27
Default Parse error in SAM file from GSNAP

I used the latest version of GSNAP with default flags to produce a SAM file. When trying to convert SAM to BAM for input to cufflinks, I get this error:

'Parse error in line 19446415: missing colon in auxiliary data'
Aborted

Then, when I look at the read at that line number, I get this:
>$sed -n 19446415p MC01.sam

HTML Code:
3PO5HQ1:154:D0FV2ACXX:2:1104:9525:98311	387	15	82664506	1	100M1S	=	82664506	0	CTTAGATCTTATTCATCAGCCTGCTGAACAGTTCCTTTTTCAGAGACATAGATACCATCCAAAAATTTCCTGATATCCTTGTTTTTAACTGTTGTGGCTTT	CCCFFFFFHHHHHJJJJJJJJJJJJJJIJJJIIIJJJJJJIJIIJJJJJJJJJJJJJIJIJJIGIJJJHIJJJJJJEHIHHHHHHFFFFFEEEEEDDDDD3MD:Z:100	NH:i:4	HI:i:3	NM:i:0	SM:i:1	XQ:i:40	X2:i:40
What is wrong with this read entry in the SAM file that is throwing the error above?

Thanks

Last edited by ParthavJailwala; 03-22-2012 at 11:38 AM.
ParthavJailwala is offline   Reply With Quote
Old 01-09-2013, 12:43 AM   #13
ppoudel
Junior Member
 
Location: UK

Join Date: Feb 2012
Posts: 5
Default

hi,

what does -n and -Q option mean in GSNAP, does -n mean the number of places that a read maps to a genome? I would like to write only the uniquely mapped reads to the output file.
ppoudel is offline   Reply With Quote
Old 01-09-2013, 05:30 AM   #14
ParthavJailwala
Member
 
Location: Maryland, USA

Join Date: Oct 2009
Posts: 27
Default

Quote:
Originally Posted by ppoudel View Post
hi,

what does -n and -Q option mean in GSNAP, does -n mean the number of places that a read maps to a genome? I would like to write only the uniquely mapped reads to the output file.
Hi ppoudel,

As per the GSNAP README, there is a -n flag that the user can specify for the number of hits that GSNAP should show. I did not find the -Q option, but I guess you imply the "--quiet-if-excessive" which is linked to the -n option, as per this paragraph from the README:
Hope that helps.

Thanks
Parthav

------------------------------------------------------------------------------------
After the query line, each of the genomic hits is shown, up to the
'-n' parameter. If too many hits were found (more than the '-n'
parameter), the behavior depends on whether the "--quiet-if-excessive"
flag is given to GSNAP. If not, then the first n hits will be printed
and the rest will not be printed. If the "--quiet-if-excessive" flag
is given to GSNAP, then no hits will be printed if the number exceeds
n.

Each of the genomic hits contains one or more alignment segments,
which is a region of continuous one-to-one correspondence (matches or
mismatches) between the query and the genome. Multiple segments occur
when the alignment contains an insertion, deletion, or splice. The
first segment is marked by a space (" ") at the beginning of the line,
while the second and following segments are marked by a comma (",") at
the beginning of the line. (In the current implementation of GSNAP
that allows only a single indel or splice, the number of segments is
at most two.)
---------------------------------------------------------------------
ParthavJailwala is offline   Reply With Quote
Old 01-09-2013, 05:31 AM   #15
ParthavJailwala
Member
 
Location: Maryland, USA

Join Date: Oct 2009
Posts: 27
Default Re: GSNAP output

Quote:
Originally Posted by ppoudel View Post
hi,

what does -n and -Q option mean in GSNAP, does -n mean the number of places that a read maps to a genome? I would like to write only the uniquely mapped reads to the output file.
Hi ppoudel,

As per the GSNAP README, there is a -n flag that the user can specify for the number of hits that GSNAP should show. I did not find the -Q option, but I guess you imply the "--quiet-if-excessive" which is linked to the -n option, as per this paragraph from the README:
Hope that helps.

Thanks
Parthav

------------------------------------------------------------------------------------
After the query line, each of the genomic hits is shown, up to the
'-n' parameter. If too many hits were found (more than the '-n'
parameter), the behavior depends on whether the "--quiet-if-excessive"
flag is given to GSNAP. If not, then the first n hits will be printed
and the rest will not be printed. If the "--quiet-if-excessive" flag
is given to GSNAP, then no hits will be printed if the number exceeds
n.

Each of the genomic hits contains one or more alignment segments,
which is a region of continuous one-to-one correspondence (matches or
mismatches) between the query and the genome. Multiple segments occur
when the alignment contains an insertion, deletion, or splice. The
first segment is marked by a space (" ") at the beginning of the line,
while the second and following segments are marked by a comma (",") at
the beginning of the line. (In the current implementation of GSNAP
that allows only a single indel or splice, the number of segments is
at most two.)
---------------------------------------------------------------------
ParthavJailwala is offline   Reply With Quote
Old 01-10-2013, 08:32 AM   #16
twu
Developer of GMAP and GSNAP
 
Location: South San Francisco, CA

Join Date: Oct 2011
Posts: 17
Default Meaning of -n and -Q

The -n flag limits the number of results reported, not the number of alignment results. Suppose you specify "-n 1", and GSNAP finds no alignments to the genome. Then the result would contain 0 hits, obviously. Likewise, if GSNAP finds 1 hit in the genome, the result would contain that one hit. But if GSNAP found multiple hits, then the "-n 1" flag would constrain the results to a single one.

To find uniquely matching hits, you would need to add the "-Q" or "--quiet-if-excessive" flag. With "-n 1" and that flag, if GSNAP found multiple hits, then it would pretend that it really found no (unique) hits to the genome, and report no hits.

I hope that makes sense. If you have other questions, you can also join the gsnap-users mailing list at EBI.

Tom
twu is offline   Reply With Quote
Old 01-11-2013, 01:30 AM   #17
ppoudel
Junior Member
 
Location: UK

Join Date: Feb 2012
Posts: 5
Default

Thanks Parthav and Thomas!!!

I have ~5 GB (each) of paired end reads from RNA seq experiment and I would like use GSNAP in HPC farm, however, the farm allows 12 hours maximum for a job. I tried to use 1 nodes and 16 process , time=12 hours and memory 64000. But the job crashed. Is there any way I can run this in HPC within 12 hours.
ppoudel is offline   Reply With Quote
Old 01-11-2013, 02:36 PM   #18
twu
Developer of GMAP and GSNAP
 
Location: South San Francisco, CA

Join Date: Oct 2011
Posts: 17
Default Running GSNAP on a farm

For HPC or Linux farms, it is probably best if you run GSNAP on several nodes for a given input file. You can do this easily with the -q or --part flag, which breaks up the input into parts. For example, if you want to spread the GSNAP computation over 50 nodes, you can run GSNAP like this:

gsnap -q 00/50 <fastq file> > output.00
gsnap -q 01/50 <fastq file> > output.01
gsnap -q 02/50 <fastq file> > output.02
...
gsnap -q 49/50 <fastq file> > output.49

The meaning of "-q 02/50" is that in every set of 50 input reads, compute on the second one. If you submit each of the above jobs to a different node on your cluster, you should theoretically see a 50x speedup. However, your output will be spread among 50 different files.

Note that I am working on making GSNAP a bit faster, but adding an initial alignment step that computes the easy alignments very quickly, but falls back upon the existing algorithm to harder alignments.

Regards,

Tom
twu is offline   Reply With Quote
Old 01-12-2013, 01:00 PM   #19
ParthavJailwala
Member
 
Location: Maryland, USA

Join Date: Oct 2009
Posts: 27
Default

Tom,

Thanks for your input on how to break up the input fastq file into parts for using multiple HPC nodes, using the "-q" flag.
In case of a gsnap mapping run involving paired-end fastq reads, I am wondering how the '-q' works. How would one specify picking the read-pairs from the R1 and R2 files?

Thanks
Parthav


Quote:
Originally Posted by twu View Post
For HPC or Linux farms, it is probably best if you run GSNAP on several nodes for a given input file. You can do this easily with the -q or --part flag, which breaks up the input into parts. For example, if you want to spread the GSNAP computation over 50 nodes, you can run GSNAP like this:

gsnap -q 00/50 <fastq file> > output.00
gsnap -q 01/50 <fastq file> > output.01
gsnap -q 02/50 <fastq file> > output.02
...
gsnap -q 49/50 <fastq file> > output.49

The meaning of "-q 02/50" is that in every set of 50 input reads, compute on the second one. If you submit each of the above jobs to a different node on your cluster, you should theoretically see a 50x speedup. However, your output will be spread among 50 different files.

Note that I am working on making GSNAP a bit faster, but adding an initial alignment step that computes the easy alignments very quickly, but falls back upon the existing algorithm to harder alignments.

Regards,

Tom
ParthavJailwala is offline   Reply With Quote
Old 01-16-2013, 02:46 PM   #20
twu
Developer of GMAP and GSNAP
 
Location: South San Francisco, CA

Join Date: Oct 2011
Posts: 17
Default -q flag and paired-end reads

If you have paired-end reads (by providing two files to GSNAP), then the -q flag takes the correct pairs from each of the files. That's the only thing that makes sense.

For example, -q 2/50 takes the second read out of each set of 50, from each of the two files.

Tom
twu is offline   Reply With Quote
Reply

Tags
gsnap output

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:43 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO