
  • #46
    Originally posted by gringer View Post
    One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

    Otherwise, great paper. One question that doesn't seem to be addressed: what is the size (on disk) of the indexes that subread generates?

    Along a similar vein, could it be used for indexing a massive database with similar sequences (e.g. NCBI-nr) to replace something like BLAST?

    Edit: just noticed that you discussed BLAST-like databases at the end of the paper, and you leave it open for investigation.
    Accuracy and recall aren't interchangeable with sensitivity and specificity. Sensitivity is for binary classifiers and recall is for a database. Suppose you framed your search as a binary classifier, where every object in the database was classified as returned or not returned. Since there are so few returnable objects compared to the ones that are not returnable, sensitivity might as well be irrelevant. I.e., you could mark everything as not returnable and be 100.00% accurate, since there is only one correct answer in three billion. This is why it makes more sense to frame the evaluation in a relevance framework, e.g. accuracy and recall.

    Comment


    • #47
      Originally posted by rskr View Post
      Accuracy and recall aren't interchangeable with sensitivity and specificity. Sensitivity is for binary classifiers and recall is for a database. Suppose you framed your search as a binary classifier, where every object in the database was classified as returned or not returned. Since there are so few returnable objects compared to the ones that are not returnable, sensitivity might as well be irrelevant. I.e., you could mark everything as not returnable and be 100.00% accurate, since there is only one correct answer in three billion. This is why it makes more sense to frame the evaluation in a relevance framework, e.g. accuracy and recall.
      Understood, thanks for the clarification. I don't deny that accuracy and recall work well for what has been done in the paper, it's just that they're not biology-friendly.

      FWIW, medical science uses positive and negative predictive value to account for extreme chances of correct/incorrect classifications. Wikipedia tells me that PPV is equivalent to precision, while sensitivity is equivalent to recall.
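
To make the terminology concrete, here is a minimal sketch of how these quantities fall out of the same confusion-matrix counts. The function name and the counts are made up for illustration (one true location among three billion candidates, as in the quoted post); they are not taken from the paper.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the usual metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # sensitivity and recall are the same quantity: TP / (TP + FN)
        "sensitivity_recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # PPV and precision are the same quantity: TP / (TP + FP)
        "ppv_precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Hypothetical search space: one correct hit among three billion candidates.
CANDIDATES = 3_000_000_000

# A degenerate "classifier" that returns nothing at all.
return_nothing = confusion_metrics(tp=0, fp=0, fn=1, tn=CANDIDATES - 1)

# A mapper that returns the correct hit plus one wrong hit.
returns_two = confusion_metrics(tp=1, fp=1, fn=0, tn=CANDIDATES - 2)

for name, metrics in (("return nothing", return_nothing), ("returns two", returns_two)):
    print(name, metrics)
```

With counts this skewed, accuracy and specificity stay essentially at 1 in both cases, while recall (sensitivity) and precision (PPV) are the numbers that actually separate the two behaviours.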

      Comment


      • #48
        Originally posted by gringer View Post
        Understood, thanks for the clarification. I don't deny that accuracy and recall work well for what has been done in the paper, it's just that they're not biology-friendly.

        FWIW, medical science uses positive and negative predictive value to account for extreme chances of correct/incorrect classifications. Wikipedia tells me that PPV is equivalent to precision, while sensitivity is equivalent to recall.
        I'm just saying, I wouldn't dumb down the content just because you think doctors aren't smart enough to understand. Many people would consider that arrogant. Besides, many patients find it annoying when doctors treat them as objects, which is just one of the pitfalls of using statistics for medical trials outside of the proper domain, where random variables don't represent people.

        Comment


        • #49
          Originally posted by Bernt.Popp View Post
          Hey Wei,

          I am trying to align SOLiD colorspace reads with subread (1.4.0).
          The commands used are:
          1)
          subread-buildindex -c -o human_g1k_v37_decoy human_g1k_v37_decoy.fasta
          2)
          subread-align -T 16 -I 16 -b -i $ref -r $myfilename".csfasta" -o $mydnaID.$myslide.subread.sam
          3) adding readgroup information, sorting and converting to BAM with picard.

          Unfortunately either there is some bug in the conversion from colorspace to basespace (option -b) or I am doing something wrong as the alignments are totally messy when viewed in IGV (although the reads seem to be at the right position).
          Here is an example with a comparison to CUSHAW2 and novoalignCS alignments:
          https://www.dropbox.com/s/4vgi0c7ev1...%20subread.jpg
          Do you have any idea what could be wrong?

          Also the new Indel feature does not emit any variants for the colorspace exomes analyzed...

          Cheers,

          Bernt
          Hi Bernt,

          We found a problem with the color-to-base conversion for reads mapped to the negative strand. We are now investigating it and will fix it with a patch.

          Thanks for reporting this.

          Wei

          Comment


          • #50
            Originally posted by shi View Post
            Hi Bernt,

            We found a problem with the color-to-base conversion for reads mapped to the negative strand. We are now investigating it and will fix it with a patch.

            Thanks for reporting this.

            Wei
            We have fixed the bug. Please update your Subread with the latest version (1.4.0-p1) and rerun your alignments.

            Best,
            Wei

            Comment


            • #51
              Originally posted by shi View Post
              We have fixed the bug. Please update your Subread with the latest version (1.4.0-p1) and rerun your alignments.

              Best,
              Wei
              Error persists for me, alignment with version 1.4.0-p1:
              https://www.dropbox.com/s/zr0zhrtsqx...or_subread.jpg

              I did not rebuild the index though, should I?

              Maybe the dynamic programming approach described in Li H, Durbin R Bioinformatics (2009) could help in solving the conversion problem?

              Cheers,

              Bernt

              Comment


              • #52
                Dear Bernt,

                I think the alignment result on SOLiD data has been largely improved in subread-1.4.0-p1. In your screenshot, most reads have the full length or a substantially long part mapped to the reference genome correctly. When I looked closely, I found that the reads with a part mismatched are very likely to have one color in the middle wrong, ruining the remaining part in color->base conversion.

                There were also a few reads entirely mismatched because Subread on SOLiD data does not compare base by base, but color by color, and it trims off the first two characters from the read before mapping (as bowtie does). If the first base in the SOLiD read is wrong, the entire read has all its bases distorted.

                If you convert those highly mismatched reads into colors, you may find that all these reads matched the genome very well in the color space.
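
A rough illustration of that effect, as a sketch only: it assumes the standard SOLiD two-base encoding (colour = XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3), and the reference sequence and helper names are made up. This is not Subread's actual conversion code.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3

def encode(seq):
    """Base space -> colours: each colour is the XOR of two adjacent base codes."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]

def decode(first_base, colours):
    """Naive colour -> base conversion, walking forward from a known first base."""
    code = BASES.index(first_base)
    out = [first_base]
    for colour in colours:
        code ^= colour
        out.append(BASES[code])
    return "".join(out)

reference = "ACGGTCATGCAA"      # made-up reference fragment
ref_colours = encode(reference)

# Simulate a single wrong colour call in the middle of the read.
read_colours = list(ref_colours)
read_colours[5] ^= 2            # any non-zero XOR turns it into a different colour

decoded = decode(reference[0], read_colours)

colour_mismatches = sum(a != b for a, b in zip(ref_colours, read_colours))
base_mismatches = sum(a != b for a, b in zip(reference, decoded))

print(reference)
print(decoded)
print(colour_mismatches, "colour mismatch,", base_mismatches, "base mismatches")
```

The read differs from the reference at one colour only, yet every base after that colour is wrong once it is naively converted.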

                By the way, if the data is from RNA-seq, it may contain junctions that our subjunc program can discover. Subjunc also works on SOLiD reads, so maybe it's worth a try.

                Cheers,

                Yang

                Originally posted by Bernt.Popp View Post
                Error persists for me, alignment with version 1.4.0-p1:
                https://www.dropbox.com/s/zr0zhrtsqx...or_subread.jpg

                I did not rebuild the index though, should I?

                Maybe the dynamic programming approach described in Li H, Durbin R Bioinformatics (2009) could help in solving the conversion problem?

                Cheers,

                Bernt
                Last edited by yangliao; 10-25-2013, 01:57 PM.

                Comment


                • #53
                  Originally posted by yangliao View Post
                  ... Subread on SOLiD data does not compare base by base, but color by color, and it trims off the first two characters from the read before mapping (as bowtie does). If the first base in the SOLiD read is wrong, the entire read has all its bases distorted.
                  Looks like my guess about not correcting colour-space to base-space conversions was correct (but there was an additional reverse-complement bug).

                  If you convert those highly mismatched reads into colors, you may find that all these reads matched the genome very well in the color space.
                  The problem with this "it's almost identical in colour-space" point of view is that people don't live in colour-space when they're looking at genome alignments -- it's just not intuitive when the sequence changes completely half-way through the alignment. Can you really tell me that the following sequences look the same to you?

                  Code:
                  .31230
                  ATGATT
                  CGTCGG
                  GCAGCC
                  TACTAA
                  Colour-space should only be used as an intermediate data format, and should not be treated as the most correct representation when showing sequences as base space.
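
For anyone who wants to check the four translations above, here is a small sketch that decodes the same colour string from each possible first base. It assumes the standard two-base encoding (colour = XOR of adjacent 2-bit base codes, A=0, C=1, G=2, T=3); the function name is just for illustration.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3

def decode(first_base, colours):
    """Translate a colour string to bases, given the base that precedes it."""
    code = BASES.index(first_base)
    out = [first_base]
    for colour in colours:
        code ^= int(colour)
        out.append(BASES[code])
    return "".join(out)

colours = "31230"
translations = [decode(first, colours) for first in BASES]

for seq in translations:
    print(seq)
# prints:
# ATGATT
# CGTCGG
# GCAGCC
# TACTAA
```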
                  Last edited by gringer; 10-25-2013, 02:03 PM.

                  Comment


                  • #54
                    Yes, I agree the color to base conversion caused a lot of trouble for SNP calling although the reads seem to be mapped to the correct locations. I also agree that the color representations of the alignments are not intuitive and it is hard to see if they match with the reference or not.

                    One way to get around this issue is possibly to convert the color-space reads to base-space reads before carrying out alignments. This may reduce the number of mapped reads, but it should considerably reduce the number of mismatched bases due to the issue with color to base conversion.

                    Wei

                    Comment


                    • #55
                      Alternatively, you may perform a more stringent alignment by using a larger -m value (e.g. -m=6). This will reduce the number of mismatched color bases present in mapped reads, which should help alleviate the color to base conversion issue.

                      Comment


                      • #56
                        Originally posted by shi View Post
                        One way to get around this issue is possibly to convert the color-space reads to base-space reads before carrying out alignments. This may reduce the number of mapped reads, but it should considerably reduce the number of mismatched bases due to the issue with color to base conversion
                        You need to align in colour-space for the reasons I've already mentioned. Basically the base space sequence changes too much. Any base-space alignments would have far too many misses due to small errors in the colour-space sequences.

                        However, when representing an alignment in base-space, you need to consider the base-space representation of the reference sequence, and modify the aligned colour-space sequence to fix any colour-shift errors.

                        edit: Note that it is always the case that a single colour-space difference between read and reference sequence is an instrument read error, and will cause a base-shift error in any base-space representation. A single SNP will modify two consecutive colours, and an INDEL will shift all subsequent colours (in the same fashion as in base-space) as well as (possibly) changing the colour at the site of the INDEL.
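
A minimal sketch of that idea, under stated assumptions: standard two-base encoding (colour = XOR of adjacent 2-bit base codes), a made-up reference, and a deliberately simple rule that treats any colour mismatch with no mismatching neighbour as a read error and resets it to the reference colour. The helper names are hypothetical, and real aligners that do this (e.g. via dynamic programming) are more careful, for instance about INDELs and about checking that an adjacent pair of changes is consistent with a SNP.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3

def encode(seq):
    """Base space -> colours (XOR of adjacent base codes)."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]

def decode(first_base, colours):
    """Naive colour -> base conversion from a known first base."""
    code = BASES.index(first_base)
    out = [first_base]
    for colour in colours:
        code ^= colour
        out.append(BASES[code])
    return "".join(out)

def diff_positions(ref_colours, read_colours):
    """Indices at which the read colour differs from the reference colour."""
    return [i for i, (a, b) in enumerate(zip(ref_colours, read_colours)) if a != b]

def fix_isolated_errors(ref_colours, read_colours):
    """Reset colour mismatches that have no mismatching neighbour (assumed read errors)."""
    diffs = set(diff_positions(ref_colours, read_colours))
    fixed = list(read_colours)
    for i in diffs:
        if i - 1 not in diffs and i + 1 not in diffs:
            fixed[i] = ref_colours[i]
    return fixed

reference = "ACGGTCATGCAA"                          # made-up reference fragment
ref_colours = encode(reference)

# A real SNP (G -> A at position 3) changes two consecutive colours.
sample = reference[:3] + "A" + reference[4:]
snp_diffs = diff_positions(ref_colours, encode(sample))

# A read of that sample with one extra, isolated colour error further along.
read_colours = encode(sample)
read_colours[8] ^= 1

naive = decode(sample[0], read_colours)             # base-shifted after the error
corrected = decode(sample[0], fix_isolated_errors(ref_colours, read_colours))

print("SNP changes colours at", snp_diffs)
print("sample   ", sample)
print("naive    ", naive)
print("corrected", corrected)
```

The corrected conversion keeps the SNP (two adjacent colour changes) but drops the isolated colour error, so the base-space read no longer changes completely after the error.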
                        Last edited by gringer; 10-25-2013, 10:01 PM.

                        Comment
