SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   General (http://seqanswers.com/forums/forumdisplay.php?f=16)
-   -   subread (http://seqanswers.com/forums/showthread.php?t=15999)

hanshart 12-04-2011 10:40 AM

subread: "a superfast read aligner"
 
Hello

Has anyone of you allready tried subread ("subread: a superfast read aligner": http://sourceforge.net/projects/subread/)?
I'm very new to RNA-Seq. I'm examining Illumina HiSeq data for different tissues of the human body. At first I used bowtie2 (beta3, without splicing library) and it mapped about 85% of the reads to the hg19 genome (which is quite good I think?!) using the standard parameters for bowtie2 end-to-end alignment.
After that I tried out subread with standard settings. Subread should be able to detect splicing events. It actually mapped 95 to 98% of my reads to hg19 in the same time as bowtie2 did!
I'm asking if anybody is experienced with subread and could say something about the quality of the program/mapping process and traps or drawbacks of it? I'm wondering why it is not mentioned anywhere although it seems to be very fast and powerful?
Thank you.

shi 12-05-2011 01:42 AM

Dear Hanshart,

Yes, subread can map junction reads. However it does this not by looking for the junction locations, but by using part of the read sequence to find the mapping location for the read. Therefore, the mapping location returned by subread for a junction is on one of the exons which are spliced with other exons (usually the one which has the largest segment in the read).

However, the subjunc program included in the subread package can find the precise junction location in the read and return a CIGAR string which gives the length of introns.

If you are using the latest version of Subread (1.1.0), you should be able to find a User Manual in the package which will give you a brief introduction to the algorithm used by subread.

Hope this helps.


Cheers,
Wei

Jon_Keats 06-01-2013 11:20 PM

For subjunc analysis I'm a bit confused by the following parameters:

-d --mindist <int> the minimum distance between two reads in a pair, 50 by default
-D --maxdist <int> the maximum distance between two reads in a pair, 600 by default

If you have fragments of 150 bp which are sequenced with 100x100 HiSeq runs then you have read1 covering from 1-100 and read2 covering 50-150. So would the --mindist be -50?

shi 06-02-2013 03:44 PM

Dear Jon Keats,

The distance is measured as the distance between the mapping location of the leftmost base of the left read and the mapping location of the leftmost base of the right read. So for your example, the distance between the two reads is 49 (50-1).

We will make the usage info more clearer for these two arguments.

Best wishes,

Wei

Jon_Keats 06-03-2013 10:33 PM

Thanks for the prompt response.

I don't understand how you get 49
Code:

    Read 1------------------------->
Fragment --------------------------------------------------
                          <--------------------------------Read2

I would call it -49?

On a separate note I just finished our first test of subjunc. I'd say it performed well but had a couple of issues. First I have to say I really like the actual MAPQ value compared to Tophat or STAR. But it did fail on some picard validation steps due to the encoding of the sam/bam file
Code:

Ignoring SAM validation error: ERROR: Record 6, Read name HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492, Mate Alignment start should be 0 because reference name = *.
Ignoring SAM validation error: ERROR: Record 6, Read name HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492, Mapped mate should have mate reference nameq

HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492    129    1      11683  199    100M    *      114359203      0      GAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCG    <=?DDADDHDFHHABFHIGHE>ECG<GC@G1??FHICGF<D9BAFEIJJE=F@CEHGIBCHHHEHHHFFFFFEDEEECADCBBB?28<:CDDD@>A:C<<    PG:Z:MarkDuplicate

We typically do markduplicates for our RNAseq runs as a quality control metric but you get similar validation errors until you turn on VALIDATION_STRINGENCY=LENIENT.

Also it looks like the "correctly-paired" flagstats are encoded in a hard-coded variable. In this case these mRNAseq libraries have fragments of 150bp with a SD of 36bp but a read of 146bp insert was called incorrectly paired. I'm not sure if this is the mindist option but this is a basic issue.

On a good note the alignments look decent and for mapping rate it was higher than Tophat2.0.8 and just behind STAR2.3.0

shi 06-04-2013 10:44 PM

Dear Jon Keats,

The paired-distance used by subjunc is always a positive value, ie. the absolute value of calculated distance is used. Sbujunc uses this absolute value compare with mindist and maxdist values.

The 'correctly-paired' flag was not hard-coded. The reason why a 146bp insert was called incorrectly paired is because of the use of --mindist=50. The paired-distance for a 146bp insert is calculated as 45 by subjunc, which is less than 50. Therefore it was called incorrectly paired.

You may set --mindist=10 and this should allow inserts with one SD away from your average fragment length to be called correctly paired.

Thanks for reporting the problem you encountered when running picard on the subjunc mapping results. We have identified the problem and are working on it. An updated version should be available in a day or two. I will let you know when it is released.

Best wishes,

Wei

Jon_Keats 06-04-2013 11:04 PM

So by --mindist does this really mean the sequence length that is sequenced by both read1 and read2?

I only pick on this as it was a MAJOR issue in Tophat as read lengths increased and we started to cross-over resulting in negative values for their -inner-mate-distance. It might be worth creating a figure to illustrate the point. I'm happy to provide ours that outlines:
Library fragment
Insert Size
Inner-mate-distance

I would just add min-dist myself but I'm not confident I have it right or that we are completely talking about the same thing as I don't understand how you arrive at a paired-distance of 45 for a 146bp insert sequenced in both directions by 100bp, which results in bases 1-45 being sequenced by read1, bases 46-100 being sequenced twice (once by read1 and once by read2), and bases 101-146 being sequenced by read2. I would interpret -mindist as the distance separating the two reads, historically this was always a positive value as the reads never crossed each other. But now, particularly in RNAseq experiments and Agilent Exomes it is not at all uncommon that the two reads cross-over.

sdriscoll 06-04-2013 11:27 PM

Jon it seems there are at least a couple "standard" type things that subread redefines. Their method of identifying reads as multi-mapping reads is slightly off from how other short read aligners define them. This threw me off a little today as I was testing out subread.

You can see how they arrive at the value of 49 in the example you've mentioned - I think that's clear enough - but the terminology is a little confusing. They are measuring from left-most alignment coordinate of mate 1 to left-most alignment coordinate of mate 2 where the "standard" is to measure from right-most alignment coordinate of mate 1 to left-most coordinate of mate 2. I can't guess why they chose to use a kind of non-standard way of specifying this distance but it is a little confusing at first.

Jon_Keats 06-04-2013 11:46 PM

Assuming Wei can confirm that is correct. I actually understand your explanation, so thanks @sdriscoll

So if we were to set it dynamically to include read pairs within 1 standard deviation it would be something like:
(mean insert size) - (1SD) - (Length of Read2) = mindist

Ultimately it would be nice if these values were just learned by the program so they didn't need to be hard coded by the user. Most aligners have moved away from this practice as people don't often know the correct values and it can easily be learned from the actual data.

shi 06-05-2013 02:19 AM

Dear Jon and sdriscoll,

sdriscoll' explanation is correct.

Both subread and subjunc need users to specify the minimum and maximum insert lengths. The parameters --mindist and --maxdist are provided for this purpose. The minimum and maximum insert length can be calculated as mindist + length of read2 + 1 and maxdist + length of read2 + 1, respectively. However I think I do agree that these two parameters seem confusing. Probably it is better to specify the minimum and maximum insert lengths directly. We will change them in the next release.

We do not like redefining things unless absolutely necessary. We do not use the the -inner-mate-distance because we do not consider this in our read mapping.

Many thanks for your helpful comments.

Best wishes,

Wei

shi 06-05-2013 09:46 PM

Dear Jon and sdriscoll,

We have fixed a couple of bugs and made a few changes to Subread. The latest version can be downloaded from sourceforge - http://subread.sourceforge.net . You can see the latest changes in the CHANGLOG as well.

Let me know if you run into other problems.

Best wishes,

Wei

Jon_Keats 06-05-2013 09:58 PM

Quote:

Originally Posted by shi (Post 106916)
Dear Jon and sdriscoll,

We have fixed a couple of bugs and made a few changes to Subread. The latest version can be downloaded from sourceforge - http://subread.sourceforge.net . You can see the latest changes in the CHANGLOG as well.

Let me know if you run into other problems.

Best wishes,

Wei

Thanks, running at test now. I'll let you know how it goes

sdriscoll 06-05-2013 11:02 PM

The updates sound good - I'll test it out tomorrow.

Jon_Keats 06-06-2013 10:10 PM

Hi Wei,

Got a new bug during SAM->BAM conversion using samtools:
Code:

[samopen] SAM header is present: 178 sequences.
Parse error at line 580: sequence and quality are inconsistent

I pulled out the lines 579-581, and what you see is that it didn't print the quality values for some reason
Code:

HWI-ST388-W7D:30:D25DKACXX:2:1101:11621:1970        217        MT        7312        199        100M        *        0        0        TCATGATTTGAGAAGCCTTCGCTTCGAAGCGAAAAGTCCTAATAGTAGAAGAACCCTCCATAAACCTGGAGTGACTATATGGATGCCCCCCACCCTACCA        CDDDDCDDEDDDDDBDDDDDDDDDDDDDDDCDDDDDDDEEEEDEDDDCCCDB<DDBCDDDCDC@DDDDDDDEEEDEDDDDDDDDDDJHHHD=FFEDBCB@
HWI-ST388-W7D:30:D25DKACXX:2:1101:11726:1975        165        *        0        0        *        MT        7312        0        TAAAAATACAAAAATTAGCCAGGCTTGGTGGCGGGAGCCTGTAATCCCAGCTACTTGGGAGGCTGAGGCATGAGAATCACTTGAACCGGGGAGGTGGAGG       
HWI-ST388-W7D:30:D25DKACXX:2:1101:11726:1975        69        *        0        0        *        22        23243225        0        CTTGATCTTTGCTCACTGCAACCTCCACCTCCCCGGTTCAAGTGATTCTCATGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCTCCCGCCACCAAGCC        JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

I checked through the file and found that 2.9% of the lines written to the SAM file are missing the quality values

Some other examples
Code:

HWI-ST388-W7D:30:D25DKACXX:2:1101:11989:2774        133        *        0        0        *        22        23029478        0        GCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACCAAAAAGTGTATTGTCTTGCCAA       
HWI-ST388-W7D:30:D25DKACXX:2:1101:12573:2770        165        *        0        0        *        1        179051348        0        CGTCTTCTGCCTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTT       
HWI-ST388-W7D:30:D25DKACXX:2:1101:12678:3000        141        *        0        0        *        *        0        0        CCAGCATTTTGGGAGGCCAAGGTGGGCAGATCACCTGAGGTCAGGAGTTTGAGACCAGCCTGGCCAACATGGAGAAATCCCGTCTCTACTAAAAATACAA       
HWI-ST388-W7D:30:D25DKACXX:2:1101:13149:2807        165        *        0        0        *        22        23243223        0        ACGAGGCTGACTATTATTGTCAATCAGGAGACAGCAGTACCTGGGTCTTCGGCGGAGGGACCAGGCTGACCGTCCTGAGTCAGCCCAAGGCTGCCCCCCC       
HWI-ST388-W7D:30:D25DKACXX:2:1101:14167:2974        141        *        0        0        *        *        0        0        CAAGACAATACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATT       
HWI-ST388-W7D:30:D25DKACXX:2:1101:16334:2859        133        *        0        0        *        8        126448195        0        CGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACCAA       
HWI-ST388-W7D:30:D25DKACXX:2:1101:19005:2869        165        *        0        0        *        22        23090325        0        AACGTCTTCTGCCTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCT       
HWI-ST388-W7D:30:D25DKACXX:2:1101:19233:2985        133        *        0        0        *        12        25403108        0        TCCAGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCAGAGGTTGCAGTGAGCTGAGATGGCACCATAGCACTCCAGCCTAGGCGAC       
HWI-ST388-W7D:30:D25DKACXX:2:1101:19566:2846        165        *        0        0        *        15        45007689        0        GCAATGGCGTGATCTTGGCTCACCACAACCTCCGCCTCCCGGGTTCAAGCAATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGGCATGTGCCA       
HWI-ST388-W7D:30:D25DKACXX:2:1101:19762:2931        133        *        0        0        *        22        23248677        0        TCTGGAGATGCATTGGCAAGACAATACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAG       
HWI-ST388-W7D:30:D25DKACXX:2:1101:20141:2998        141        *        0        0        *        *        0        0        AGGGGGCAGCCTTGGGCTGACTCAGGACGGTCAGCCTGGTCCCTCCGCCGAAGACCCAGGTACTGCTGTCTCCTGATTGACAATAATAGTCAGCCTCGTC       
HWI-ST388-W7D:30:D25DKACXX:2:1101:1551:3119        165        *        0        0        *        22        23243487        0        CTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAAC       
HWI-ST388-W7D:30:D25DKACXX:2:1101:2045:3173        165        *        0        0        *        11        118309802        0        CCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACC       
HWI-ST388-W7D:30:D25DKACXX:2:1101:2296:3207        133        *        0        0        *        6        7891857        0        TACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCT       
HWI-ST388-W7D:30:D25DKACXX:2:1101:2861:3094        165        *        0        0        *        MT        8025        0        AGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCTCCACTTCAGGGACAACAGTCACTTTGACCATCAGTGGAGTCCAGGCAGAAGACGAG       
HWI-ST388-W7D:30:D25DKACXX:2:1101:12366:3168        165        *        0        0        *        1        565136        0        CCTTGGGCTGACTATTATTGTCAATCAGGAGACAGCAGTACCTGGGTCTTCGGCGGAGGGACCAGGCTGACCGTCCTGAGTCAGCCCAAGGAGATCGGAA       
HWI-ST388-W7D:30:D25DKACXX:2:1101:12388:3239        165        *        0        0        *        19        41813306        0        CTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACC       
HWI-ST388-W7D:30:D25DKACXX:2:1101:13339:3182        165        *        0        0        *        7        5568181        0        CAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCTCCACTTCAGGGACAACAG       
HWI-ST388-W7D:30:D25DKACXX:2:1101:13684:3121        165        *        0        0        *        4        3534046        0        GCAGAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAAAGGGTATCAGCAATGGAAGATGAAATGAATGAAATGAAGCGAGAAGGGAAGTTTAGAGAAAA       
HWI-ST388-W7D:30:D25DKACXX:2:1101:16067:3233        165        *        0        0        *        19        39125643        0        CTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAAC


sdriscoll 06-06-2013 10:45 PM

Hey I had something very similar happen to me! I sent them some data and they're working on it.

Jon_Keats 06-06-2013 10:49 PM

Didn't happen in the previous version so I'm guessing its a result of the fix they put in for my last issue. I'm impressed with how fast they are turning around fixes so it makes it worth the effort.

sdriscoll 06-06-2013 10:58 PM

Customer service makes all the difference! I don't know if you have ever tried to get help from the Tophat/cufflinks people but man...those guys are impossible. So far Wei has been on top of things. The same goes for the BWA, STAR and RSEM devs. I've had pretty quick responses from all of them.

It's sort of convenient for me that these subread people are just getting to work when I'm about to go to sleep. I get back to the lab the next day and they've fixed stuff during the night.

shi 06-07-2013 04:21 AM

Dear sdriscoll and Jon,

Many thanks for your nice comments. We really appreciate you putting up with the bugs and helping us to improve our programs.

As you said, the sam output bug was introduced in v1.3.5. We have fixed it in v1.3.5-p1. We also enhanced the subread-buildindex to let it check the integrity of the provided reference sequences and report any unexpected characters in a more informative way.

The latest version v1.3.5-p1 can be downloaded from http://subread.sourceforge.net . We have done a more thorough test by using much bigger test datasets. Hope it works for you. But please let me know if found any other bugs.

Best wishes,

Wei

Jon_Keats 06-09-2013 11:02 PM

Running a test tonight...

Jon_Keats 06-12-2013 09:44 PM

Hi Shi,

Just wanted to thank you this last patch seems to be much improved. Now moving on to test featureCounts.


All times are GMT -8. The time now is 02:10 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.