Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • subread

    Hello

    Has anyone of you allready tried subread ("subread: a superfast read aligner": http://sourceforge.net/projects/subread/)?
    I'm very new to RNA-Seq. I'm examining Illumina HiSeq data for different tissues of the human body. At first I used bowtie2 (beta3, without splicing library) and it mapped about 85% of the reads to the hg19 genome (which is quite good I think?!) using the standard parameters for bowtie2 end-to-end alignment.
    After that I tried out subread with standard settings. Subread should be able to detect splicing events. It actually mapped 95 to 98% of my reads to hg19 in the same time as bowtie2 did!
    I'm asking if anybody is experienced with subread and could say something about the quality of the program/mapping process and traps or drawbacks of it? I'm wondering why it is not mentioned anywhere although it seems to be very fast and powerful?
    Thank you.
    Last edited by hanshart; 12-04-2011, 03:18 PM.

  • #2
    Dear Hanshart,

    Yes, subread can map junction reads. However it does this not by looking for the junction locations, but by using part of the read sequence to find the mapping location for the read. Therefore, the mapping location returned by subread for a junction is on one of the exons which are spliced with other exons (usually the one which has the largest segment in the read).

    However, the subjunc program included in the subread package can find the precise junction location in the read and return a CIGAR string which gives the length of introns.

    If you are using the latest version of Subread (1.1.0), you should be able to find a User Manual in the package which will give you a brief introduction to the algorithm used by subread.

    Hope this helps.


    Cheers,
    Wei

    Comment


    • #3
      For subjunc analysis I'm a bit confused by the following parameters:

      -d --mindist <int> the minimum distance between two reads in a pair, 50 by default
      -D --maxdist <int> the maximum distance between two reads in a pair, 600 by default

      If you have fragments of 150 bp which are sequenced with 100x100 HiSeq runs then you have read1 covering from 1-100 and read2 covering 50-150. So would the --mindist be -50?

      Comment


      • #4
        Dear Jon Keats,

        The distance is measured as the distance between the mapping location of the leftmost base of the left read and the mapping location of the leftmost base of the right read. So for your example, the distance between the two reads is 49 (50-1).

        We will make the usage info more clearer for these two arguments.

        Best wishes,

        Wei

        Comment


        • #5
          Thanks for the prompt response.

          I don't understand how you get 49
          Code:
              Read 1------------------------->
          Fragment --------------------------------------------------
                                    <--------------------------------Read2
          I would call it -49?

          On a separate note I just finished our first test of subjunc. I'd say it performed well but had a couple of issues. First I have to say I really like the actual MAPQ value compared to Tophat or STAR. But it did fail on some picard validation steps due to the encoding of the sam/bam file
          Code:
          Ignoring SAM validation error: ERROR: Record 6, Read name HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492, Mate Alignment start should be 0 because reference name = *.
          Ignoring SAM validation error: ERROR: Record 6, Read name HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492, Mapped mate should have mate reference nameq
          
          HWI-ST388-W7D:30:D25DKACXX:6:2112:4784:17492    129     1       11683   199     100M    *       114359203       0       GAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCG    <=?DDADDHDFHHABFHIGHE>ECG<GC@G1??FHICGF<D9BAFEIJJE=F@CEHGIBCHHHEHHHFFFFFEDEEECADCBBB?28<:CDDD@>A:C<<    PG:Z:MarkDuplicate
          We typically do markduplicates for our RNAseq runs as a quality control metric but you get similar validation errors until you turn on VALIDATION_STRINGENCY=LENIENT.

          Also it looks like the "correctly-paired" flagstats are encoded in a hard-coded variable. In this case these mRNAseq libraries have fragments of 150bp with a SD of 36bp but a read of 146bp insert was called incorrectly paired. I'm not sure if this is the mindist option but this is a basic issue.

          On a good note the alignments look decent and for mapping rate it was higher than Tophat2.0.8 and just behind STAR2.3.0
          Last edited by Jon_Keats; 06-03-2013, 10:35 PM.

          Comment


          • #6
            Dear Jon Keats,

            The paired-distance used by subjunc is always a positive value, ie. the absolute value of calculated distance is used. Sbujunc uses this absolute value compare with mindist and maxdist values.

            The 'correctly-paired' flag was not hard-coded. The reason why a 146bp insert was called incorrectly paired is because of the use of --mindist=50. The paired-distance for a 146bp insert is calculated as 45 by subjunc, which is less than 50. Therefore it was called incorrectly paired.

            You may set --mindist=10 and this should allow inserts with one SD away from your average fragment length to be called correctly paired.

            Thanks for reporting the problem you encountered when running picard on the subjunc mapping results. We have identified the problem and are working on it. An updated version should be available in a day or two. I will let you know when it is released.

            Best wishes,

            Wei

            Comment


            • #7
              So by --mindist does this really mean the sequence length that is sequenced by both read1 and read2?

              I only pick on this as it was a MAJOR issue in Tophat as read lengths increased and we started to cross-over resulting in negative values for their -inner-mate-distance. It might be worth creating a figure to illustrate the point. I'm happy to provide ours that outlines:
              Library fragment
              Insert Size
              Inner-mate-distance

              I would just add min-dist myself but I'm not confident I have it right or that we are completely talking about the same thing as I don't understand how you arrive at a paired-distance of 45 for a 146bp insert sequenced in both directions by 100bp, which results in bases 1-45 being sequenced by read1, bases 46-100 being sequenced twice (once by read1 and once by read2), and bases 101-146 being sequenced by read2. I would interpret -mindist as the distance separating the two reads, historically this was always a positive value as the reads never crossed each other. But now, particularly in RNAseq experiments and Agilent Exomes it is not at all uncommon that the two reads cross-over.

              Comment


              • #8
                Jon it seems there are at least a couple "standard" type things that subread redefines. Their method of identifying reads as multi-mapping reads is slightly off from how other short read aligners define them. This threw me off a little today as I was testing out subread.

                You can see how they arrive at the value of 49 in the example you've mentioned - I think that's clear enough - but the terminology is a little confusing. They are measuring from left-most alignment coordinate of mate 1 to left-most alignment coordinate of mate 2 where the "standard" is to measure from right-most alignment coordinate of mate 1 to left-most coordinate of mate 2. I can't guess why they chose to use a kind of non-standard way of specifying this distance but it is a little confusing at first.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #9
                  Assuming Wei can confirm that is correct. I actually understand your explanation, so thanks @sdriscoll

                  So if we were to set it dynamically to include read pairs within 1 standard deviation it would be something like:
                  (mean insert size) - (1SD) - (Length of Read2) = mindist

                  Ultimately it would be nice if these values were just learned by the program so they didn't need to be hard coded by the user. Most aligners have moved away from this practice as people don't often know the correct values and it can easily be learned from the actual data.

                  Comment


                  • #10
                    Dear Jon and sdriscoll,

                    sdriscoll' explanation is correct.

                    Both subread and subjunc need users to specify the minimum and maximum insert lengths. The parameters --mindist and --maxdist are provided for this purpose. The minimum and maximum insert length can be calculated as mindist + length of read2 + 1 and maxdist + length of read2 + 1, respectively. However I think I do agree that these two parameters seem confusing. Probably it is better to specify the minimum and maximum insert lengths directly. We will change them in the next release.

                    We do not like redefining things unless absolutely necessary. We do not use the the -inner-mate-distance because we do not consider this in our read mapping.

                    Many thanks for your helpful comments.

                    Best wishes,

                    Wei

                    Comment


                    • #11
                      Dear Jon and sdriscoll,

                      We have fixed a couple of bugs and made a few changes to Subread. The latest version can be downloaded from sourceforge - http://subread.sourceforge.net . You can see the latest changes in the CHANGLOG as well.

                      Let me know if you run into other problems.

                      Best wishes,

                      Wei

                      Comment


                      • #12
                        Originally posted by shi View Post
                        Dear Jon and sdriscoll,

                        We have fixed a couple of bugs and made a few changes to Subread. The latest version can be downloaded from sourceforge - http://subread.sourceforge.net . You can see the latest changes in the CHANGLOG as well.

                        Let me know if you run into other problems.

                        Best wishes,

                        Wei
                        Thanks, running at test now. I'll let you know how it goes

                        Comment


                        • #13
                          The updates sound good - I'll test it out tomorrow.
                          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                          Salk Institute for Biological Studies, La Jolla, CA, USA */

                          Comment


                          • #14
                            Hi Wei,

                            Got a new bug during SAM->BAM conversion using samtools:
                            Code:
                            [samopen] SAM header is present: 178 sequences.
                            Parse error at line 580: sequence and quality are inconsistent
                            I pulled out the lines 579-581, and what you see is that it didn't print the quality values for some reason
                            Code:
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:11621:1970	217	MT	7312	199	100M	*	0	0	TCATGATTTGAGAAGCCTTCGCTTCGAAGCGAAAAGTCCTAATAGTAGAAGAACCCTCCATAAACCTGGAGTGACTATATGGATGCCCCCCACCCTACCA	CDDDDCDDEDDDDDBDDDDDDDDDDDDDDDCDDDDDDDEEEEDEDDDCCCDB<DDBCDDDCDC@DDDDDDDEEEDEDDDDDDDDDDJHHHD=FFEDBCB@
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:11726:1975	165	*	0	0	*	MT	7312	0	TAAAAATACAAAAATTAGCCAGGCTTGGTGGCGGGAGCCTGTAATCCCAGCTACTTGGGAGGCTGAGGCATGAGAATCACTTGAACCGGGGAGGTGGAGG	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:11726:1975	69	*	0	0	*	22	23243225	0	CTTGATCTTTGCTCACTGCAACCTCCACCTCCCCGGTTCAAGTGATTCTCATGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCTCCCGCCACCAAGCC	JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
                            I checked through the file and found that 2.9% of the lines written to the SAM file are missing the quality values

                            Some other examples
                            Code:
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:11989:2774	133	*	0	0	*	22	23029478	0	GCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACCAAAAAGTGTATTGTCTTGCCAA	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:12573:2770	165	*	0	0	*	1	179051348	0	CGTCTTCTGCCTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTT	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:12678:3000	141	*	0	0	*	*	0	0	CCAGCATTTTGGGAGGCCAAGGTGGGCAGATCACCTGAGGTCAGGAGTTTGAGACCAGCCTGGCCAACATGGAGAAATCCCGTCTCTACTAAAAATACAA	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:13149:2807	165	*	0	0	*	22	23243223	0	ACGAGGCTGACTATTATTGTCAATCAGGAGACAGCAGTACCTGGGTCTTCGGCGGAGGGACCAGGCTGACCGTCCTGAGTCAGCCCAAGGCTGCCCCCCC	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:14167:2974	141	*	0	0	*	*	0	0	CAAGACAATACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATT	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:16334:2859	133	*	0	0	*	8	126448195	0	CGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACCAA	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:19005:2869	165	*	0	0	*	22	23090325	0	AACGTCTTCTGCCTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCT	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:19233:2985	133	*	0	0	*	12	25403108	0	TCCAGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCAGAGGTTGCAGTGAGCTGAGATGGCACCATAGCACTCCAGCCTAGGCGAC	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:19566:2846	165	*	0	0	*	15	45007689	0	GCAATGGCGTGATCTTGGCTCACCACAACCTCCGCCTCCCGGGTTCAAGCAATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGGCATGTGCCA	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:19762:2931	133	*	0	0	*	22	23248677	0	TCTGGAGATGCATTGGCAAGACAATACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAG	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:20141:2998	141	*	0	0	*	*	0	0	AGGGGGCAGCCTTGGGCTGACTCAGGACGGTCAGCCTGGTCCCTCCGCCGAAGACCCAGGTACTGCTGTCTCCTGATTGACAATAATAGTCAGCCTCGTC	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:1551:3119	165	*	0	0	*	22	23243487	0	CTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAAC	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:2045:3173	165	*	0	0	*	11	118309802	0	CCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAACAGGGGCCTGGCCTGGCTTCTGTTGATACC	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:2296:3207	133	*	0	0	*	6	7891857	0	TACACTTTTTGGTATCAACAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCT	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:2861:3094	165	*	0	0	*	MT	8025	0	AGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCTCCACTTCAGGGACAACAGTCACTTTGACCATCAGTGGAGTCCAGGCAGAAGACGAG	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:12366:3168	165	*	0	0	*	1	565136	0	CCTTGGGCTGACTATTATTGTCAATCAGGAGACAGCAGTACCTGGGTCTTCGGCGGAGGGACCAGGCTGACCGTCCTGAGTCAGCCCAAGGAGATCGGAA	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:12388:3239	165	*	0	0	*	19	41813306	0	CTGGACTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACC	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:13339:3182	165	*	0	0	*	7	5568181	0	CAGAAGCCAGGCCAGGCCCCTGTTTTGGTGATACATAAAGACAGTGAGAGATCGTCAGGGATCCCCGCGCGATTCTCAGGCTCCACTTCAGGGACAACAG	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:13684:3121	165	*	0	0	*	4	3534046	0	GCAGAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAAAGGGTATCAGCAATGGAAGATGAAATGAATGAAATGAAGCGAGAAGGGAAGTTTAGAGAAAA	
                            HWI-ST388-W7D:30:D25DKACXX:2:1101:16067:3233	165	*	0	0	*	19	39125643	0	CTCCACTGATGGTCAAAGTGACTGTTGTCCCTGAAGTGGAGCCTGAGAATCGCGCGGGGATCCCTGACGATCTCTCACTGTCTTTATGTATCACCAAAAC

                            Comment


                            • #15
                              Hey I had something very similar happen to me! I sent them some data and they're working on it.
                              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                              Salk Institute for Biological Studies, La Jolla, CA, USA */

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              7 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              7 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              66 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X