SEQanswers

Old 04-21-2010, 04:49 PM   #1
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Short Read Micro re-Aligner (beta release)

We are pleased to announce the beta release of a new tool called SRMA: the Short Read Micro re-Aligner. We have tested this method on human cancer resequencing datasets and validated it with simulations. We are looking for beta testers to provide feedback and suggest new features for the tool.

Link:
http://srma.sf.net

Short description:
Sequence alignment algorithms examine each read independently. When indels occur towards the ends of reads, the alignment can lead to false SNPs as well as improperly placed indels. This tool aims to perform a re-alignment of each read to a graphical representation of all alignments within a local region to provide a better overall base-resolution consensus.
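As a toy illustration of why pooling alignments in a local region helps, the sketch below counts the bases observed at each reference position and keeps only those seen often enough to serve as candidates. It is not SRMA's algorithm: it ignores indels and base qualities, and the names `build_allele_counts`, `prune`, `min_count` and `min_fraction` are hypothetical.

```python
from collections import Counter, defaultdict


def build_allele_counts(alignments):
    """Count observed bases per reference position.

    `alignments` is a list of (start, sequence) pairs for gapless
    alignments; indels are left out to keep the toy small.
    """
    counts = defaultdict(Counter)
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return counts


def prune(counts, min_count=2, min_fraction=0.2):
    """Keep alleles seen at least `min_count` times and in at least
    `min_fraction` of the reads covering the position (both thresholds
    are illustrative assumptions)."""
    kept = {}
    for pos, alleles in sorted(counts.items()):
        depth = sum(alleles.values())
        kept[pos] = sorted(
            base
            for base, n in alleles.items()
            if n >= min_count and n / depth >= min_fraction
        )
    return kept


# Made-up reads; the fourth carries a lone mismatch at position 2.
alignments = [
    (0, "ACGT"),
    (0, "ACGT"),
    (1, "CGTA"),
    (1, "CTTA"),
    (2, "GTAC"),
]
candidates = prune(build_allele_counts(alignments))
print(candidates)
```

The lone `T` at position 2 is dropped because only one read supports it, which is the kind of spurious variant a per-read aligner cannot recognise on its own.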

Features:
- The input is a BAM, the output is BAM.
- Specify a co-ordinate range for large-scale parallelism or local regions of interest.
- SOLiD data is re-aligned using the original color space reads and qualities to maximally use all information available (SAM CS/CQ tags must be present).
- A base correction mode for Illumina/454 data automatically recalls bases in the reads based on all alignments, removing spurious variants and adjusting their respective base qualities.
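The co-ordinate range feature lends itself to splitting a genome into chunks that are processed as independent jobs. A minimal sketch of generating such chunks follows; the contig lengths are made up, the helper names are hypothetical, and the `contig:start-end` string format is an assumption rather than something confirmed in this thread.

```python
def split_into_ranges(contig_lengths, chunk_size):
    """Yield (contig, start, end) tuples with 1-based inclusive coordinates."""
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            yield contig, start, end
            start = end + 1


def format_range(contig, start, end):
    # Assumed "contig:start-end" layout; check the tool's documentation
    # for the format it actually expects.
    return f"{contig}:{start}-{end}"


contigs = {"chr1": 2500, "chr2": 1000}  # made-up lengths for illustration
ranges = [format_range(*r) for r in split_into_ranges(contigs, 1000)]
for r in ranges:
    print(r)
```

Each printed range could then be handed to a separate job, with the per-range outputs merged afterwards.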

Acknowledgments:
Thanks to the Picard team for their fast responses to questions about the SAM/BAM Picard API. We would also like to thank the members of the Nelson Lab at UCLA.

Sincerely,
Nils Homer
Old 04-22-2010, 01:35 AM   #2
tcezard
Member
 
Location: Edinburgh

Join Date: Dec 2008
Posts: 13
Default

Thanks, that looks like a nice bit of software.
Just a question: how does it handle CIGAR stretches of N in split reads from RNA-seq?

Thanks
Tim
Old 04-22-2010, 07:49 AM   #3
eyalbd
Member
 
Location: Hebrew University in Jerusalem

Join Date: Apr 2010
Posts: 11
Default

Thanks. This looks like a very useful piece of software!
I tried this on BFAST output for a SOLiD run and got the following error:
Quote:
ctr:200 AL:1:38:8_15_1136 50b aligned read. java.lang.Exception: Error: could not understand the base
at srma.SRMAUtil.colorSpaceNextBase(SRMAUtil.java:77)
at srma.SRMAUtil.normalizeColorSpaceRead(SRMAUtil.java:150)
at srma.Align.align(Align.java:84)
at srma.SRMA.processList(SRMA.java:378)
at srma.SRMA.doWork(SRMA.java:254)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:150)
at srma.SRMA.main(SRMA.java:81)
If it would help, I can probably look for the "ctr:200 AL:1:38:8_15_1136 50b" read in the SAM file and output it.
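For context, the trace shows the failure inside a color-space step (`colorSpaceNextBase`). The sketch below is not SRMA's code and does not claim to reproduce the bug; it only shows standard SOLiD two-base decoding, where each color combines with the previous base to give the next one, and how an unexpected character in the read can stop decoding. The function names are hypothetical.

```python
_BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_CODE_TO_BASE = "ACGT"
_COLORS = ("0", "1", "2", "3")


def next_base(prev_base, color):
    """Return the base that follows `prev_base` given one color call."""
    if prev_base not in _BASE_TO_CODE:
        raise ValueError(f"could not understand the base: {prev_base!r}")
    if color not in _COLORS:
        raise ValueError(f"could not understand the color: {color!r}")
    # With A=0, C=1, G=2, T=3 the color is the XOR of adjacent base codes.
    return _CODE_TO_BASE[_BASE_TO_CODE[prev_base] ^ int(color)]


def decode_color_read(read):
    """Decode a read such as "T3210" (primer base followed by colors)."""
    if not read:
        raise ValueError("empty read")
    prev = read[0]
    bases = []
    for color in read[1:]:
        prev = next_base(prev, color)
        bases.append(prev)
    return "".join(bases)


print(decode_color_read("T3210"))

try:
    decode_color_read("T3.10")  # "." marks a missing color call
except ValueError as err:
    print("decoding stopped:", err)
```

Because every base depends on the one before it, a single uninterpretable character prevents decoding of everything after it unless the tool handles that case explicitly.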

Meanwhile, I'm trying this on BWA output for the same run, which interestingly gave me very bad pileup output (maybe because the samse output is in color space, as was reported here). There is no error yet, but it will probably take a while.

Thanks
Eyal
Old 04-22-2010, 08:44 AM   #4
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by tcezard View Post
Thanks, that looks like a nice bit of software.
Just a question: how does it handle CIGAR stretches of N in split reads from RNA-seq?

Thanks
Tim
It will treat them as though they are any other base, although I did not consider RNA seq as part of its application, which makes me think...

Quote:
Originally Posted by eyalbd View Post
Thanks. This looks like a very useful piece of software!
I tried this on BFAST output for a SOLiD run and got the following error:

If it would help, I can probably look for the "ctr:200 AL:1:38:8_15_1136 50b" read in the SAM file and output it.

Meanwhile, I'm trying this on BWA output for the same run, which interestingly gave me very bad pileup output (maybe because the samse output is in color space, as was reported here). There is no error yet, but it will probably take a while.

Thanks
Eyal
I am sorry that you encountered an error, but this is exactly what I am looking for (bugs). Could you send me the first 500 reads in the BFAST BAM and the reference? I tested it on both BWA and BFAST with Illumina and SOLiD data and for good coverage (30x: ~1 billion mapped reads on human) it performed quite well. Hopefully we can get you to the same point with good results (i.e. pileup).


Also, you can post questions to the SRMA help mailing list (srma-help@lists.sourceforge.net) or even become a developer!
Nils

Last edited by nilshomer; 04-22-2010 at 08:47 AM. Reason: more information
Old 04-22-2010, 11:44 AM   #5
eyalbd
Member
 
Location: Hebrew University in Jerusalem

Join Date: Apr 2010
Posts: 11
Default

Thanks Nils, I'll send you the first 500 lines (could you PM me an e-mail address to send them to?). Are you sure you want data from the binary file and not the SAM? Again, I have no problem taking the time to find the troublesome read itself and append it as well. It would be good practice for my dwindling Python skills, and might be more help to you.
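A minimal Python sketch of both tasks, taking the first few alignment records and finding a read by name, is shown below. It works on SAM text (a BAM would first need converting, for example with `samtools view -h`), and the sample records, read names and helper functions are all made up for illustration.

```python
import io
from itertools import islice

# Made-up SAM text: two header lines followed by three alignment records.
SAMPLE_SAM = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read_a\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "read_b\t0\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "read_c\t16\tchr1\t30\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)


def alignment_records(lines):
    """Yield the tab-separated fields of each non-header SAM line."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        yield line.rstrip("\n").split("\t")


def first_records(lines, limit):
    """Return the first `limit` alignment records."""
    return list(islice(alignment_records(lines), limit))


def find_by_name(lines, fragment):
    """Return records whose read name (first field) contains `fragment`."""
    return [f for f in alignment_records(lines) if fragment in f[0]]


head = first_records(io.StringIO(SAMPLE_SAM), 2)
hits = find_by_name(io.StringIO(SAMPLE_SAM), "read_b")
print([rec[0] for rec in head])
print(hits)
```

With a real file, `io.StringIO(SAMPLE_SAM)` would be replaced by an open file handle, and the header lines would need to be kept alongside the selected records to produce a valid SAM.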

I'll sign up for the mailing list as well, thanks.
Old 04-22-2010, 11:22 PM   #6
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by eyalbd View Post
Thanks Nils, I'll send you the first 500 lines (could you PM me an e-mail address to send them to?). Are you sure you want data from the binary file and not the SAM? Again, I have no problem taking the time to find the troublesome read itself and append it as well. It would be good practice for my dwindling Python skills, and might be more help to you.

I'll sign up for the mailing list as well, thanks.
Thank you for sending me a great example from which to debug. I have released a new version (0.1.3) that fixes this bug and provides an overall speed improvement. For more information, see http://srma.sf.net.
Old 05-14-2010, 12:11 AM   #7
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Thanks to those who have tested SRMA so far. It has now been run successfully on three human genome re-sequencing experiments. A manuscript is in final preparation, so users who want to read what's under the hood can send me a PM.

Version 0.1.5 has been released, which includes a number of cosmetic changes as well as requested features. As always, please post questions here or via email to srma-help@lists.sourceforge.net.

Last edited by nilshomer; 05-14-2010 at 12:47 PM.
Old 05-14-2010, 12:38 PM   #8
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Realignment is important. I am looking forward to the publication. I am particularly interested in how realignment may improve the variant calls (I guess a lot) and how it compares to GATK. Kees from Sanger also has a sort of realigner.
Old 05-14-2010, 12:52 PM   #9
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by lh3 View Post
Realignment is important. I am looking forward to the publication. I am particularly interested in how realignment may improve the variant calls (I guess a lot) and how it compares to GATK. Kees from Sanger also has a sort of realigner.
My understanding of GATK is that it samples from possible consensuses and that the user must first identify the regions of interest, as whole-genome re-alignment is not possible (yet?) with GATK. SRMA can be applied to specific regions as desired, but it is currently fast enough to apply to the whole genome, as it treats the alignments as priors within a variant graph.

I have tested the results using both BFAST and BWA on separate experiments (so I don't get into an aligner shoot-out), and the re-aligner helps reduce the false-positive rate significantly for both, especially for indels and color space data.
Old 05-14-2010, 06:29 PM   #10
quinlana
Senior Member
 
Location: Charlottesville

Join Date: Sep 2008
Posts: 119
Default

well, i know what i'll be using at 8am on Monday. this is very apropos for me at the moment, looking forward to testing this Nils.
Old 05-15-2010, 04:20 AM   #11
ohofmann
Member
 
Location: Melbourne, Australia

Join Date: Jan 2009
Posts: 37
Default

Nils, how much flexibility do I have with MINIMUM_ALLELE_PROBABILITY? Itching to give this a try with some high-coverage pathogen re-sequencing data, but we are particularly interested in rare variation. Just how much of a problem am I causing by lowering this to 0.5 or 1%?

Also, can it handle more than two alleles at a given position?
Old 05-15-2010, 10:22 AM   #12
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by ohofmann View Post
Nils, how much flexibility do I have with MINIMUM_ALLELE_PROBABILITY? Itching to give this a try with some high-coverage pathogen re-sequencing data, but we are particularly interested in rare variation. Just how much of a problem am I causing by lowering this to 0.5 or 1%?
For high coverage data, you probably want to raise both the "MINIMUM_ALLELE_COVERAGE" and "MINIMUM_ALLELE_PROBABILITY", since spurious coverage on variant alleles (or the reference allele) is more likely. If you have 1000x coverage, then with a 1% error rate you will see each possible allele many times. I don't know about the ploidy of your pathogen or whether you are sequencing a mixture. The basic idea is to set minimum thresholds on what to include as a prior variant in your new re-alignment. I have not explored high coverage data on non-diploid genomes (cancer works well too), but would be happy to help tune the parameters with you.
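The arithmetic behind the 1000x example can be sketched as follows. The assumptions are loud ones: errors are independent, spread uniformly over the three non-reference bases, and the threshold of 3 supporting reads is purely illustrative. The function names are hypothetical.

```python
from math import comb


def expected_error_reads(coverage, error_rate):
    """Expected number of erroneous base calls at one position."""
    return coverage * error_rate


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


coverage = 1000
error_rate = 0.01
# Assumption: an error is equally likely to produce any of the 3 other bases.
per_allele_rate = error_rate / 3

expected_total = expected_error_reads(coverage, error_rate)
expected_per_allele = expected_error_reads(coverage, per_allele_rate)
# Chance that one specific wrong allele shows up in 3 or more reads.
p_three_or_more = prob_at_least(3, coverage, per_allele_rate)

print(f"expected erroneous reads per position: {expected_total:.1f}")
print(f"expected reads per specific wrong allele: {expected_per_allele:.2f}")
print(f"P(>= 3 reads for a specific wrong allele): {p_three_or_more:.2f}")
```

Under these assumptions a small fixed read-count threshold is cleared by sequencing error alone at a large share of positions, which is why the thresholds need to rise with coverage.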

Quote:
Originally Posted by ohofmann View Post
Also, can it handle more than two alleles at a given position?
It can handle at most 4 plus a missing base.
Old 05-15-2010, 10:42 AM   #13
ohofmann
Member
 
Location: Melbourne, Australia

Join Date: Jan 2009
Posts: 37
Default

Thanks for the swift reply ;-) Going to give it a try, we've got a nice test set to measure improvements right away. Will report back once I've had a chance to tinker.
Old 06-21-2010, 11:08 PM   #14
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

SRMA version 0.1.6 is now available!

This version has the following additions:

- the option MAXIMUM_TOTAL_COVERAGE will cause SRMA to ignore regions of high coverage.
- utilizes latest Picard release (v1-23) to traverse the reference FASTA while executing. This can dramatically improve initial start time of SRMA when processing later chromosomes since SRMA can jump straight to the reference sequence in question.
- adds support for soft-clipping within the SAM file (now fully compatible with the SAM spec and BWA, I hope).
- RANGE/RANGES options are supported in the submission script.
- the NUM_THREADS option will now allow for multi-threading. This option is experimental, and may decrease performance depending on application and system architecture. On my Mac it scales as expected (linearly), but on our CentOS 5 cluster it does not help at all.

Please let me know of any horrible failures or wonderful successes. I am always looking for bugs to fix and for good examples for the website. Thank you to all those who have helped by sending examples and test cases.

Sincerely,

Nils Homer
Old 11-17-2010, 03:02 AM   #15
ohofmann
Member
 
Location: Melbourne, Australia

Join Date: Jan 2009
Posts: 37
Default

Quote:
Originally Posted by nilshomer View Post
For high coverage data, you probably want to raise both the "MINIMUM_ALLELE_COVERAGE" and "MINIMUM_ALLELE_PROBABILITY", since spurious coverage on variant alleles (or the reference allele) is more likely. If you have 1000x coverage, then with a 1% error rate you will see each possible allele many times. I don't know about the ploidy of your pathogen or whether you are sequencing a mixture. The basic idea is to set minimum thresholds on what to include as a prior variant in your new re-alignment. I have not explored high coverage data on non-diploid genomes (cancer works well too), but would be happy to help tune the parameters with you.
Nils, congratulations on getting the publication out!

I'm about to give this a try on an odd data set -- 2 kb of genomic sequence at an average (but far from uniform) coverage of around 100,000x. It's a sequencing mixture, and the lower cutoff of variation we'd like to be able to detect is around 0.5% (after error correction), or 500 observations.

Other than the biological samples we also have a mix of known genomic frequencies and defined indel regions to optimize parameters. Can you think of a realistic set of starting parameters?
Old 11-17-2010, 06:59 PM   #16
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by ohofmann View Post
Nils, congratulations on getting the publication out!

I'm about to give this a try on an odd data set -- 2 kb of genomic sequence at an average (but far from uniform) coverage of around 100,000x. It's a sequencing mixture, and the lower cutoff of variation we'd like to be able to detect is around 0.5% (after error correction), or 500 observations.

Other than the biological samples we also have a mix of known genomic frequencies and defined indel regions to optimize parameters. Can you think of a realistic set of starting parameters?
It wasn't designed for such high coverage so all bets are off.
Old 06-28-2011, 09:44 AM   #17
apratap
Member
 
Location: Bay Area

Join Date: Jan 2009
Posts: 58
Default

Hi Nils

Just wondering whether SRMA can be used for rescuing orphaned reads. We have a dataset from a variable-insert library, as we are sequencing the 5' and 3' ends of transcripts. As a result, the distance between the mates (<--- --->) depends on the length of the transcript. To map the reads initially, I am using Mosaik, which I believe does a better job with variable-insert mate-pair data.

After mapping we still see 40% orphaned reads where one read maps and the other doesn't. I am wondering if SRMA can rescue these reads.

Thanks!
-Abhi
Old 06-29-2011, 08:28 AM   #18
nilshomer
Nils Homer
 
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

No, SRMA is not for read rescue. It is for re-aligning the reads to create a better consensus.
Old 06-29-2011, 09:47 AM   #19
apratap
Member
 
Location: Bay Area

Join Date: Jan 2009
Posts: 58
Default

Ok good to know. I will start a new thread for my question then.

Best,
-Abhi
Old 08-03-2012, 01:51 AM   #20
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

Dead project now? Are there other alternatives that work on the whole genome?
Tags
alignment, illumina, solid, srma
