SEQanswers > General
Old 12-04-2011, 10:40 AM   #1
hanshart
Member
 
Location: Germany

Join Date: Nov 2011
Posts: 27
subread: "a superfast read aligner"

Hello

Has anyone here already tried subread ("subread: a superfast read aligner": http://sourceforge.net/projects/subread/)?
I'm very new to RNA-Seq. I'm examining Illumina HiSeq data from different human tissues. At first I used bowtie2 (beta3, without a splice-junction library) with the default end-to-end alignment parameters, and it mapped about 85% of the reads to the hg19 genome (which I think is quite good?).
After that I tried subread with default settings. Subread is supposed to be able to detect splicing events, and it actually mapped 95 to 98% of my reads to hg19 in the same time bowtie2 took!
Is anybody experienced with subread who could comment on the quality of the program/mapping process and any traps or drawbacks? I'm also wondering why it isn't mentioned anywhere, although it seems to be very fast and powerful.
Thank you.

Old 12-05-2011, 01:42 AM   #2
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235

Dear Hanshart,

Yes, subread can map junction reads. However, it does this not by looking for the junction locations, but by using part of the read sequence to find a mapping location for the read. Therefore, the mapping location subread returns for a junction read lies on one of the exons spanned by the read (usually the exon containing the largest segment of the read).

However, the subjunc program included in the subread package can find the precise junction location in the read and returns a CIGAR string which gives the lengths of the introns.
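
In standard SAM, those introns appear as 'N' (skipped-region) operations in the CIGAR string, so the intron lengths can be read directly from the CIGAR reported by subjunc. A minimal sketch (the example CIGAR below is made up for illustration, not actual subjunc output):
Code:
import re

def intron_lengths(cigar):
    # Lengths of the N (skipped-region) operations in a CIGAR string;
    # for spliced alignments these correspond to intron lengths.
    return [int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar) if op == "N"]

# Hypothetical junction read: 60 bases on one exon, a 1500 bp intron, 40 bases on the next exon.
print(intron_lengths("60M1500N40M"))  # [1500]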

If you are using the latest version of Subread (1.1.0), you should be able to find a User Manual in the package which will give you a brief introduction to the algorithm used by subread.

Hope this helps.


Cheers,
Wei
Old 06-01-2013, 11:20 PM   #3
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

For subjunc analysis I'm a bit confused by the following parameters:

-d --mindist <int> the minimum distance between two reads in a pair, 50 by default
-D --maxdist <int> the maximum distance between two reads in a pair, 600 by default

If you have fragments of 150 bp sequenced with 100x100 HiSeq runs, then read1 covers bases 1-100 and read2 covers bases 50-150. So would the --mindist be -50?
Old 06-02-2013, 03:44 PM   #4
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235

Dear Jon Keats,

The distance is measured as the distance between the mapping location of the leftmost base of the left read and the mapping location of the leftmost base of the right read. So for your example, the distance between the two reads is 49 (50-1).
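
In other words (a quick illustrative calculation using the positions from your example, not subjunc's actual code):
Code:
# Leftmost-to-leftmost distance, as described above.
read1_leftmost = 1    # read 1 covers bases 1-100 of the fragment
read2_leftmost = 50   # read 2 covers bases 50-150 in your example
print(read2_leftmost - read1_leftmost)  # 49, the value compared against --mindist and --maxdist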

We will make the usage info for these two arguments clearer.

Best wishes,

Wei
Old 06-03-2013, 10:33 PM   #5
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

Thanks for the prompt response.

I don't understand how you get 49:
Code:
    Read 1------------------------->
Fragment --------------------------------------------------
                          <--------------------------------Read2
I would call it -49?

On a separate note, I just finished our first test of subjunc. I'd say it performed well but had a couple of issues. First, I have to say I really like the actual MAPQ values compared to Tophat or STAR. However, it did fail some Picard validation steps due to the encoding of the SAM/BAM file:
Code:
Ignoring SAM validation error: ERROR: Record 6, Read name HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492, Mate Alignment start should be 0 because reference name = *.
Ignoring SAM validation error: ERROR: Record 6, Read name HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492, Mapped mate should have mate reference name

HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492    129     1       11683   199     100M    *       114359203       0       GAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCG    <=?DDADDHDFHHABFHIGHE>ECG<GC@G1??FHICGF<D9BAFEIJJE=F@CEHGIBCHHHEHHHFFFFFEDEEECADCBBB?28<:CDDD@>A:C<<    PG:Z:MarkDuplicate
We typically run MarkDuplicates on our RNA-seq runs as a quality-control metric, but you get similar validation errors until you turn on VALIDATION_STRINGENCY=LENIENT.

Also, it looks like the "correctly-paired" flagstats are driven by a hard-coded variable. These mRNA-seq libraries have fragments of 150 bp with an SD of 36 bp, yet a pair with a 146 bp insert was called incorrectly paired. I'm not sure if this is the mindist option, but it is a basic issue.

On a good note, the alignments look decent, and the mapping rate was higher than Tophat 2.0.8 and just behind STAR 2.3.0.

Old 06-04-2013, 10:44 PM   #6
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235

Dear Jon Keats,

The paired distance used by subjunc is always a positive value, i.e. the absolute value of the calculated distance is used. Subjunc compares this absolute value with the mindist and maxdist values.

The 'correctly-paired' flag is not hard-coded. The reason a 146 bp insert was called incorrectly paired is the use of --mindist=50: the paired distance for a 146 bp insert is calculated as 45 by subjunc, which is less than 50, so it was called incorrectly paired.

You may set --mindist=10, which should allow inserts within one SD of your average fragment length to be called correctly paired.
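
To check the arithmetic (a rough sketch only, worked back from the numbers above assuming 100 bp reads; this is not subjunc's actual code):
Code:
def paired_distance(insert_length, read2_length):
    # Leftmost-to-leftmost distance for a pair spanning the insert,
    # worked back from the "146 bp insert -> distance 45" example above.
    return insert_length - read2_length - 1

d = paired_distance(146, 100)
print(d)        # 45
print(d >= 50)  # False -> called incorrectly paired with --mindist=50
print(d >= 10)  # True  -> called correctly paired with --mindist=10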

Thanks for reporting the problem you encountered when running picard on the subjunc mapping results. We have identified the problem and are working on it. An updated version should be available in a day or two. I will let you know when it is released.

Best wishes,

Wei
Old 06-04-2013, 11:04 PM   #7
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

So does --mindist really refer to the length of sequence that is covered by both read1 and read2?

I only pick on this because it was a MAJOR issue in Tophat as read lengths increased and the reads started to cross over, resulting in negative values for their -inner-mate-distance. It might be worth creating a figure to illustrate the point; I'm happy to provide ours, which outlines:
Library fragment
Insert Size
Inner-mate-distance

I would just add the min-dist value myself, but I'm not confident I have it right, or that we are even talking about the same thing, as I don't understand how you arrive at a paired distance of 45 for a 146 bp insert sequenced in both directions with 100 bp reads; that gives bases 1-45 sequenced only by read1, bases 46-100 sequenced twice (once by read1 and once by read2), and bases 101-146 sequenced only by read2. I would interpret -mindist as the distance separating the two reads; historically this was always a positive value because the reads never crossed each other. But now, particularly in RNA-seq experiments and Agilent exomes, it is not at all uncommon for the two reads to cross over.
Old 06-04-2013, 11:27 PM   #8
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438

Jon, it seems there are at least a couple of "standard" things that subread redefines. Their method of identifying multi-mapping reads also differs slightly from how other short-read aligners define them, which threw me off a little today as I was testing subread.

You can see how they arrive at the value of 49 in the example you mentioned - I think that's clear enough - but the terminology is a little confusing. They measure from the left-most alignment coordinate of mate 1 to the left-most alignment coordinate of mate 2, whereas the "standard" is to measure from the right-most alignment coordinate of mate 1 to the left-most coordinate of mate 2. I can't guess why they chose a somewhat non-standard way of specifying this distance, but it is a little confusing at first.
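
To put numbers on the two conventions (just an illustration, using the coordinates from Jon's 150 bp example):
Code:
# Two ways to describe the gap between mates, using read1 at 1-100 and read2 starting at 50.
mate1_left, mate1_right = 1, 100   # read 1 covers bases 1-100
mate2_left = 50                    # read 2 starts at base 50
print(mate2_left - mate1_left)     # 49: leftmost-to-leftmost, the distance subread/subjunc use
print(mate2_left - mate1_right)    # -50: measured from the right-most base of mate 1 instead,
                                   #      negative because the reads cross over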
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 06-04-2013, 11:46 PM   #9
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

Assuming Wei can confirm that is correct, I actually understand your explanation, so thanks @sdriscoll.

So if we were to set it dynamically to include read pairs within 1 standard deviation, it would be something like:
(mean insert size) - (1 SD) - (length of read2) = mindist

Ultimately, it would be nice if these values were simply learned by the program so they didn't need to be hard-coded by the user. Most aligners have moved away from this practice, as people often don't know the correct values and they can easily be learned from the actual data.
Old 06-05-2013, 02:19 AM   #10
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235

Dear Jon and sdriscoll,

sdriscoll's explanation is correct.

Both subread and subjunc need users to specify the minimum and maximum insert lengths; the parameters --mindist and --maxdist are provided for this purpose. The minimum and maximum insert lengths can be calculated as mindist + length of read2 + 1 and maxdist + length of read2 + 1, respectively. However, I do agree that these two parameters seem confusing. It is probably better to specify the minimum and maximum insert lengths directly, and we will change this in the next release.
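
For example, a desired insert-size range could be converted into --mindist/--maxdist values like this (just a sketch based on the relation above; the helper is made up and not part of the Subread package):
Code:
def dist_options(min_insert, max_insert, read2_length):
    # insert length = dist + length of read2 + 1, as given above
    return min_insert - read2_length - 1, max_insert - read2_length - 1

# Jon's libraries: 150 bp mean fragment, 36 bp SD, 100 bp reads; allow +/- 1 SD.
mindist, maxdist = dist_options(150 - 36, 150 + 36, 100)
print(mindist, maxdist)  # 13 85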

We do not like redefining things unless absolutely necessary. We do not use the -inner-mate-distance because we do not consider it in our read mapping.

Many thanks for your helpful comments.

Best wishes,

Wei
Old 06-05-2013, 09:46 PM   #11
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235

Dear Jon and sdriscoll,

We have fixed a couple of bugs and made a few changes to Subread. The latest version can be downloaded from sourceforge - http://subread.sourceforge.net . You can see the latest changes in the CHANGELOG as well.

Let me know if you run into other problems.

Best wishes,

Wei
Old 06-05-2013, 09:58 PM   #12
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

Quote:
Originally Posted by shi View Post
We have fixed a couple of bugs and made a few changes to Subread. The latest version can be downloaded from sourceforge - http://subread.sourceforge.net .
Thanks, running a test now. I'll let you know how it goes.
Old 06-05-2013, 11:02 PM   #13
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438

The updates sound good - I'll test it out tomorrow.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 06-06-2013, 10:10 PM   #14
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

Hi Wei,

Got a new bug during SAM->BAM conversion using samtools:
Code:
[samopen] SAM header is present: 178 sequences.
Parse error at line 580: sequence and quality are inconsistent
I pulled out lines 579-581, and you can see that it didn't print the quality values for some reason:
Code:
HWI-ST388-W7D:30:D25DKACXX:2:1101:11621:1970	217	MT	7312	199	100M	*	0	0	TCATGATTTGAGAAGCCTTCGCTTCGAAGCGAAAAGTCCTAATAGTAGAAGAACCCTCCATAAACCTGGAGTGACTATATGGATGCCCCCCACCCTACCA	CDDDDCDDEDDDDDBDDDDDDDDDDDDDDDCDDDDDDDEEEEDEDDDCCCDB<DDBCDDDCDC@DDDDDDDEEEDEDDDDDDDDDDJHHHD=FFEDBCB@
HWI-ST388-W7D:30:D25DKACXX:2:1101:11726:1975	165	*	0	0	*	MT	7312	0	TAAAAATACAAAAATTAGCCAGGCTTGGTGGCGGGAGCCTGTAATCCCAGCTACTTGGGAGGCTGAGGCATGAGAATCACTTGAACCGGGGAGGTGGAGG	
HWI-ST388-W7D:30:D25DKACXX:2:1101:11726:1975	69	*	0	0	*	22	23243225	0	CTTGATCTTTGCTCACTGCAACCTCCACCTCCCCGGTTCAAGTGATTCTCATGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCTCCCGCCACCAAGCC	JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
I checked through the file and found that 2.9% of the lines written to the SAM file are missing the quality values.

Some other examples:
Code:
HWI-ST388-W7D:30:D25DKACXX:2:1101:11989:2774	133	*	0	0	*	22	23029478	0	GCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACCAAAAAGTGTATTGTCTTGCCAA	
HWI-ST388-W7D:30:D25DKACXX:2:1101:12573:2770	165	*	0	0	*	1	179051348	0	CGTCTTCTGCCTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTT	
HWI-ST388-W7D:30:D25DKACXX:2:1101:12678:3000	141	*	0	0	*	*	0	0	CCAGCATTTTGGGAGGCCAAGGTGGGCAGATCACCTGAGGTCAGGAGTTTGAGACCAGCCTGGCCAACATGGAGAAATCCCGTCTCTACTAAAAATACAA	
HWI-ST388-W7D:30:D25DKACXX:2:1101:13149:2807	165	*	0	0	*	22	23243223	0	ACGAGGCTGACTATTATTGTCAATCAGGAGACAGCAGTACCTGGGTCTTCGGCGGAGGGACCAGGCTGACCGTCCTGAGTCAGCCCAAGGCTGCCCCCCC	
HWI-ST388-W7D:30:D25DKACXX:2:1101:14167:2974	141	*	0	0	*	*	0	0	CAAGACAATACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATT	
HWI-ST388-W7D:30:D25DKACXX:2:1101:16334:2859	133	*	0	0	*	8	126448195	0	CGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACCAA	
HWI-ST388-W7D:30:D25DKACXX:2:1101:19005:2869	165	*	0	0	*	22	23090325	0	AACGTCTTCTGCCTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCT	
HWI-ST388-W7D:30:D25DKACXX:2:1101:19233:2985	133	*	0	0	*	12	25403108	0	TCCAGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCAGAGGTTGCAGTGAGCTGAGATGGCACCATAGCACTCCAGCCTAGGCGAC	
HWI-ST388-W7D:30:D25DKACXX:2:1101:19566:2846	165	*	0	0	*	15	45007689	0	GCAATGGCGTGATCTTGGCTCACCACAACCTCCGCCTCCCGGGTTCAAGCAATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGGCATGTGCCA	
HWI-ST388-W7D:30:D25DKACXX:2:1101:19762:2931	133	*	0	0	*	22	23248677	0	TCTGGAGATGCATTGGCAAGACAATACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAG	
HWI-ST388-W7D:30:D25DKACXX:2:1101:20141:2998	141	*	0	0	*	*	0	0	AGGGGGCAGCCTTGGGCTGACTCAGGACGGTCAGCCTGGTCCCTCCGCCGAAGACCCAGGTACTGCTGTCTCCTGATTGACAATAATAGTCAGCCTCGTC	
HWI-ST388-W7D:30:D25DKACXX:2:1101:1551:3119	165	*	0	0	*	22	23243487	0	CTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAAC	
HWI-ST388-W7D:30:D25DKACXX:2:1101:2045:3173	165	*	0	0	*	11	118309802	0	CCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACC	
HWI-ST388-W7D:30:D25DKACXX:2:1101:2296:3207	133	*	0	0	*	6	7891857	0	TACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCT	
HWI-ST388-W7D:30:D25DKACXX:2:1101:2861:3094	165	*	0	0	*	MT	8025	0	AGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCTCCACTTCAGGGACAACAGTCACTTTGACCATCAGTGGAGTCCAGGCAGAAGACGAG	
HWI-ST388-W7D:30:D25DKACXX:2:1101:12366:3168	165	*	0	0	*	1	565136	0	CCTTGGGCTGACTATTATTGTCAATCAGGAGACAGCAGTACCTGGGTCTTCGGCGGAGGGACCAGGCTGACCGTCCTGAGTCAGCCCAAGGAGATCGGAA	
HWI-ST388-W7D:30:D25DKACXX:2:1101:12388:3239	165	*	0	0	*	19	41813306	0	CTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACC	
HWI-ST388-W7D:30:D25DKACXX:2:1101:13339:3182	165	*	0	0	*	7	5568181	0	CAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCTCCACTTCAGGGACAACAG	
HWI-ST388-W7D:30:D25DKACXX:2:1101:13684:3121	165	*	0	0	*	4	3534046	0	GCAGAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAAAGGGTATCAGCAATGGAAGATGAAATGAATGAAATGAAGCGAGAAGGGAAGTTTAGAGAAAA	
HWI-ST388-W7D:30:D25DKACXX:2:1101:16067:3233	165	*	0	0	*	19	39125643	0	CTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAAC
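For reference, a quick way to count such records (just a sketch; the file name is a placeholder, and QUAL is column 11 per the SAM spec):
Code:
# Count SAM records whose QUAL field (column 11) is missing or empty.
total = bad = 0
with open("subjunc_output.sam") as sam:   # placeholder file name
    for line in sam:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        total += 1
        if len(fields) < 11 or fields[10] == "":
            bad += 1
print("%d of %d records lack quality values (%.1f%%)" % (bad, total, 100.0 * bad / total))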
Old 06-06-2013, 10:45 PM   #15
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438

Hey I had something very similar happen to me! I sent them some data and they're working on it.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 06-06-2013, 10:49 PM   #16
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

It didn't happen in the previous version, so I'm guessing it's a result of the fix they put in for my last issue. I'm impressed with how fast they are turning around fixes, which makes it worth the effort.
Old 06-06-2013, 10:58 PM   #17
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438

Customer service makes all the difference! I don't know if you have ever tried to get help from the Tophat/cufflinks people but man...those guys are impossible. So far Wei has been on top of things. The same goes for the BWA, STAR and RSEM devs. I've had pretty quick responses from all of them.

It's sort of convenient for me that these subread people are just getting to work when I'm about to go to sleep. I get back to the lab the next day and they've fixed stuff during the night.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 06-07-2013, 04:21 AM   #18
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235

Dear sdriscoll and Jon,

Many thanks for your nice comments. We really appreciate you putting up with the bugs and helping us to improve our programs.

As you said, the SAM output bug was introduced in v1.3.5; we have fixed it in v1.3.5-p1. We also enhanced subread-buildindex so that it checks the integrity of the provided reference sequences and reports any unexpected characters in a more informative way.

The latest version, v1.3.5-p1, can be downloaded from http://subread.sourceforge.net . We have done more thorough testing using much bigger test datasets. Hope it works for you, but please let me know if you find any other bugs.

Best wishes,

Wei
Old 06-09-2013, 11:02 PM   #19
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

Running a test tonight...
Old 06-12-2013, 09:44 PM   #20
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279

Hi Shi,

Just wanted to thank you; this last patch seems much improved. Now moving on to test featureCounts.