Seqanswers Leaderboard Ad

**sdriscoll** · 06-04-2013, 12:41 PM

Is there a user group for these tools or is it best to post questions in this forum?

**sdriscoll** · 06-04-2013, 02:48 PM

I'll just start putting questions here and see what happens.

First off, here's what I'm doing:

I've just been testing this aligner out to see how it performs and to compare it's alignments with control simulated reads. I simulate illumina-like reads (in this case 100 bp single end reads) from transcript sequences which were generated from a copy of the mm10 genome with random single point mutations added in. I've not included any INDELS. By translating the sampled read coordinates back into genomic coordinates (similar to how Tophat does this post-transcriptome alignment) I obtain "control" alignments in genomic coordinates that include spliced alignments marking junctions between exons.

I'm quantifying alignment "correctness" in a pretty simple way. I make 4 categories: 1) is the read aligned?, 2) is the read the correct orientation, 3) is the read on the correct chromosome? and 4) is the read in the correct posiiton? "correct position" is overly simplified where I only check to see if the aligned coordinates overlap with the control alignment coordinates so my leniency in correct position is essentially the entire length of the read. I assume true positives would go down a little if I refined that metric.

So on to the question:

I ran subread-align with '-u' to see how precise its alignments are when not reporting any multi-mapping reads at all. Surprisingly there were still some reads that aligned incorrectly despite the fact that the aligner reported these as unique alignments. For starters I extracted 94 reads that were mapped "uniquely" in the wrong orientation and remapped those with bowtie, to the mm10 genome, with all defaults (implies -n 2 -e 70) and with the '-a' option enabled to allow all possible alignments to be reported. In terms of my simulated read quals this would allow up to 2 mismatches in the seed and up to 2 mismatches in the remaining bases of the read. 67 of those reads aligned producing 8,665 alignments.

so the questions are - why are these read alignments considered to be uniquely mapped when 1) I know they are mapped in the wrong location and 2) those same reads report an average of more than 100 alignments each to the same genome using bowtie?

**gringer** · 06-04-2013, 03:20 PM

One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

Otherwise, great paper. One question that doesn't seem to be addressed: what is the size (on disk) of the indexes that subread generates?

Along a similar vein, could it be used for indexing a massive database with similar sequences (e.g. NCBI-nr) to replace something like BLAST?

Edit: just noticed that you discussed BLAST-like databases at the end of the paper, and you leave it open for investigation.

**sdriscoll** · 06-04-2013, 03:22 PM

Question/Bug

I'm running version 1.3.3 on Mac OS 10.7.5. I've used both subread-align and subjunc and both of them crash if I specify the '-T' option with anything other than 1 thread.

Here's what subjunc returns:

Code:

subread-align -J --allow-repeating  -T 2 -i '/Users/pfafflab/alind/ensemble/mm10/subread/mm10e' -r 'reads.1.fq' -R 'reads.2.fq' -o './subjunc-temp-sam-004023-uVJ2Zl' -P 3 -d 50 -D 600   -I 5 -b 0 

Number of selected subreads = 10
Consensus threshold = 3
Number of threads=2
Number of positions reported per multi-mapping read=1
Number of indels allowed=5


Performing paired-end alignment:
Maximum distance between reads=600
Minimum distance between reads=50
Threshold on number of subreads for a successful mapping (the minor end in the pair)=1
Number of anchors=10
The directions of the two input files are: forward, reversed

Input file './subjunc-temp-sam-004023-uVJ2Zl' is not found or is in an incorrect format.

The file '/subjunc-temp-sam-004023-uVJ2Zl' does exist and it has zero content. Passing it through the unix 'file' utility reports:

Code:

subjunc-temp-sam-004023-uVJ2Zl: empty

subread-align is slightly less friendly throwing a seg fault

Code:

subread-align -T 2 -i ~/alind/ensemble/mm10/subread/mm10e -r reads.1.fq -R reads.2.fq -o temp.sam

Number of selected subreads = 10
Consensus threshold = 3
Number of threads=2
Number of positions reported per multi-mapping read=1
Number of indels allowed=5


Performing paired-end alignment:
Maximum distance between reads=600
Minimum distance between reads=50
Threshold on number of subreads for a successful mapping (the minor end in the pair)=1
Number of anchors=10
The directions of the two input files are: forward, reversed

completed=1.25%; time used=7.1s; rate=29.5k reads/s; time left=103.4s; total=1593k pairs            
completed=1.88%; time used=7.5s; rate=29.2k reads/s; time left=103.5s; total=1593k pairs            
completed=2.51%; time used=7.8s; rate=28.9k reads/s; time left=103.7s; total=1592k pairs            
Segmentation fault: 11

it's not always the same though. sometimes it throws the seg fault without actually making any progress at all (during the 'Loading the 01-th index file ...') stage. In fact it even throws the fault at different points if I run the program several times in a row with the exact same options.

**sdriscoll** · 06-04-2013, 03:35 PM

Originally posted by gringer View Post

One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

That's funny I was thinking the same thing when I was looking at their plots. When I run my tests I like to use "precision" and "recall" where I define precision as the ratio of "correct" results to total results and recall as the ratio of "correct" results to total possible results (as in correctly aligned simulated reads to total simulated reads). Simple terms that don't include the confusion of trying to define both true and false positives and negatives. Since in these simulations we don't include any reads that intentionally should NOT align it's hard to define all four of those.

I was wondering if they really defined accuracy by strict terms: (TP+TN)/(TP+TN+FP+FN) which, depending on how you define things, can really only have three values in that equation instead of four. So it might become (TP)/(TP+FP+FN) if you define false negatives as unaligned reads. now the equation has become equal to recall or sensitivity comparing correctly aligned reads to the total number of reads attempted to be aligned. So...accuracy isn't the right word anymore.

I suppose you can be extra strict and say that since there are no "true" negatives then there can't be false negatives either in which case the equation for accuracy becomes the same as that for precision. This also makes it impossible to define specificity (aka the true negative rate) since it would always be 0.

In my opinion it would be best to avoid any terms that include true or false negatives for aligner benchmarking. Then we can avoid bastardizing these basic terms.

**sdriscoll** · 06-04-2013, 03:38 PM

Another question:

Is it really necessary to have a limit to the number of references allowed when building an aligner index? I know people who work with poorly defined genomes who have to deal with 10's or sometimes over 100,000 scaffolds. This limit seems to completely exclude those possible users and puts a limitation on this aligner that doesn't exist in any other aligners that I've used.

**shi** · 06-04-2013, 08:10 PM

Originally posted by sdriscoll View Post

I'll just start putting questions here and see what happens.

First off, here's what I'm doing:

I've just been testing this aligner out to see how it performs and to compare it's alignments with control simulated reads. I simulate illumina-like reads (in this case 100 bp single end reads) from transcript sequences which were generated from a copy of the mm10 genome with random single point mutations added in. I've not included any INDELS. By translating the sampled read coordinates back into genomic coordinates (similar to how Tophat does this post-transcriptome alignment) I obtain "control" alignments in genomic coordinates that include spliced alignments marking junctions between exons.

I'm quantifying alignment "correctness" in a pretty simple way. I make 4 categories: 1) is the read aligned?, 2) is the read the correct orientation, 3) is the read on the correct chromosome? and 4) is the read in the correct posiiton? "correct position" is overly simplified where I only check to see if the aligned coordinates overlap with the control alignment coordinates so my leniency in correct position is essentially the entire length of the read. I assume true positives would go down a little if I refined that metric.

So on to the question:

I ran subread-align with '-u' to see how precise its alignments are when not reporting any multi-mapping reads at all. Surprisingly there were still some reads that aligned incorrectly despite the fact that the aligner reported these as unique alignments. For starters I extracted 94 reads that were mapped "uniquely" in the wrong orientation and remapped those with bowtie, to the mm10 genome, with all defaults (implies -n 2 -e 70) and with the '-a' option enabled to allow all possible alignments to be reported. In terms of my simulated read quals this would allow up to 2 mismatches in the seed and up to 2 mismatches in the remaining bases of the read. 67 of those reads aligned producing 8,665 alignments.

so the questions are - why are these read alignments considered to be uniquely mapped when 1) I know they are mapped in the wrong location and 2) those same reads report an average of more than 100 alignments each to the same genome using bowtie?

Dear sdriscoll,

The definition of 'multi-mapping reads' in Subread is that if a read is called a multi-mapping read, it must have the same number of votes (number of maximum consensus subheads) and also the same length of genomic regions spanned by these consensus subheads in more than one location. This definition is apparently different from the definitions used by other aligners. I guess this to some extend explains the discrepancy you observed between Subread and Bowtie, although this does not explain why these reads were wrongly mapped by Subread.

I noticed that there were exon-spanning reads in your simulation data. Subread determines the read mapping locations by using the largest mapped regions in the reads. For an exon-spanning read, the continuous mappable region is shorter than those of reads falling within an exon and thus the exon-spanning reads are more likely to be wrongly mapped. Given that you introduced SNP to the reference genome when you generated simulated reads, it is possible that Subread assigned such reads to locations where it got higher mapping quality according to its mapping criteria (and it found only one such location).

You may try the Subjunc program (an exon-exon junction detecting program included in the Subread package) to align your reads. Subjunc is more accurate but less sensitive than Subread for the mapping of exon-spanning reads. Subjunc uses two best mappable regions in each read (instead of one as used by Subread) to determine the read mapping location. It also checks the donor/receptor sites. Your simulated reads must have the correct donor/receptor sites if you want to run Subjunc on them (Subjunc only recognizes GT/AG).

I hope this explanation makes sense to you. We will be happy to have a close look at your simulation data if you'd like make it available to us.

Best wishes,

Wei

**shi** · 06-04-2013, 08:50 PM

Originally posted by gringer View Post

One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

Otherwise, great paper. One question that doesn't seem to be addressed: what is the size (on disk) of the indexes that subread generates?

Along a similar vein, could it be used for indexing a massive database with similar sequences (e.g. NCBI-nr) to replace something like BLAST?

Edit: just noticed that you discussed BLAST-like databases at the end of the paper, and you leave it open for investigation.

Dear David Eccles,

Thanks for your comments.

I hope we made it clear in the paper what we mean by accuracy and recall. But I think I agree with you that these computer science terms may confuse people from other fields. So here I just try to describe a bit more these two terms to make them more clearer.

The accuracy used in the paper was calculated as the percentage of correctly mapped reads out of all the mapped reads. This is equivalent to the false discovery rate. The recall was calculated as the percentage of correctly mapped reads out of all the reads included in the test dataset. This is equivalent to the sensitivity (the sensitivity in my dictionary). Please note that the denominators for the calculation of accuracy and recall are different, but the nominators are the same. So as you can see, our paper focused on the correctly mapped reads.

The size of the index built for human genome hg19 is 6.8GB. But please note that the amount of memory needed for mapping using this index ranges from 1GB to 7.6GB, depending on the memory size selection when you built the index.

As sdriscoll pointed out, there is currently a limit (50,000) on the number of chromosomes/contigs allowed when building the index. A massive database such as NCBI nucleotide database certainly has a lot more sequences than this number. So we are now working on removing this limitation (this should not be too hard). Once this is fixed, we may try to do a search similar to the blast search to see how Subread behaves. I guess there are many issues we need to address for this, such as not only reporting the best location/s but also reporting the second, third or fourth locations. This will need quite a bit changes to our program. But I think the exciting part for us is that we can test if the seed-and-vote paradigm is going to be useful in this search (like many short read aligners, blast also uses the seed-and-vote paradigm). I will be very happy to hear any suggestions for this.

Please let me know if you have any further questions.

Best wishes,

Wei

**sdriscoll** · 06-04-2013, 09:03 PM

Wei that's interesting. I haven't read into exactly how the subread mapper works but I follow your explanation of how it defines a multi-mapping read. It may be useful to users if it were possible to somehow make that clear because I think we are used to using tools that define a multi-mapped read simply as one that can align in more than one location within some edit distance cutoff. Or at least that's what I'm used to from using STAR, bowtie1/2/tophat, and bwa. It seems to be a dramatic difference based on this one test. Do you suppose there is a way to tweak the input options of subread to mimic the behavior of an edit distance cutoff for multi-mapping?

Also the reason the read is not aligned "correctly" it's likely easily explainable and what you mentioned is no doubt the case. These reads probably wouldn't be aligned correctly by most aligners except maybe by random chance because there exists a continuous region in the genome that is a pretty good mapping for the reads as well as a spliced position that's a perfect mapping (and the actual source of the read). For example one such read should aligned spliced with this alignment position X:28585510 and cigar=91M2449N9M with 0 mismatches. subread-align reported a 6-mismatch alignment at X:30352028 with MAPQ=46 and CIGAR=100M. So it did as you say it should - favored the full length alignment over a truncated one. In this case that wasn't the correct option but it seems the aligner is using the logic you described.

I'll run this "misaligned" reads back against the genome with subjunc and see what it does with them.

**sdriscoll** · 06-04-2013, 09:08 PM

Originally posted by shi View Post

The accuracy used in the paper was calculated as the percentage of correctly mapped reads out of all the mapped reads. This is equivalent to the false discovery rate.

isn't this also the same as 'precision'/'positive predicative value'? I believe the FDR is the percentage of incorrectly mapped reads out of all mapped reads. anytime I get mixed up about these terms I go back to this wikipeida article...

Receiver operating characteristic - Wikipedia

http://en.wikipedia.org/wiki/Receiver_operating_characteristic

**shi** · 06-04-2013, 09:13 PM

Originally posted by sdriscoll View Post

isn't this also the same as 'precision'/'positive predicative value'? I believe the FDR is the percentage of incorrectly mapped reads out of all mapped reads. anytime I get mixed up about these terms I go back to this wikipeida article...

http://en.wikipedia.org/wiki/Receive...characteristic

Sorry for my typo. It meant to be 1 - FDR. And you are correct that this is exactly the same as the 'precision' mentioned in one of your posts. We did not use precision in the paper because we use it to refer to the variation between replicates, which is irrelevant in this context to us.

Wei

**sdriscoll** · 06-04-2013, 09:20 PM

Interesting. subjunc didn't manage to pick up on the correct alignment for the read I used as an example. To review:

Code:

Correct:        X:28585510, CIGAR=91M2449N9M (NM=0) flag=0
subread-align:  X:30352028, MAPQ=46, CIGAR=100M (NM=6) flag=16
subjunc:        X:30352036, MAPQ=187, CIGAR=8S92M (NM=6) flag=16

subjunc seems to have put the read in the exact same spot but this time soft-clipped it. It also appears to be even more confident that it aligned the read correctly this time.

Here's the read if you'd like to take a look (again this is in mm10):

Code:

ACTTCTCATGCATGTCTTCAATTGCTTCCATGGTCGGACTCTGACTACATTCGGACAGTTTTAATGCTTGTTGTTCTTTTTGATAATTGTTCACTGATTT

**sdriscoll** · 06-04-2013, 09:22 PM

Also I don't want to distract you too much with this alignment snag - I'm really more interested in having the aligners NOT crash when using multiple threads. Any ideas on that one?

**shi** · 06-04-2013, 09:31 PM

Originally posted by sdriscoll View Post

Wei that's interesting. I haven't read into exactly how the subread mapper works but I follow your explanation of how it defines a multi-mapping read. It may be useful to users if it were possible to somehow make that clear because I think we are used to using tools that define a multi-mapped read simply as one that can align in more than one location within some edit distance cutoff. Or at least that's what I'm used to from using STAR, bowtie1/2/tophat, and bwa. It seems to be a dramatic difference based on this one test. Do you suppose there is a way to tweak the input options of subread to mimic the behavior of an edit distance cutoff for multi-mapping?

Also the reason the read is not aligned "correctly" it's likely easily explainable and what you mentioned is no doubt the case. These reads probably wouldn't be aligned correctly by most aligners except maybe by random chance because there exists a continuous region in the genome that is a pretty good mapping for the reads as well as a spliced position that's a perfect mapping (and the actual source of the read). For example one such read should aligned spliced with this alignment position X:28585510 and cigar=91M2449N9M with 0 mismatches. subread-align reported a 6-mismatch alignment at X:30352028 with MAPQ=46 and CIGAR=100M. So it did as you say it should - favored the full length alignment over a truncated one. In this case that wasn't the correct option but it seems the aligner is using the logic you described.

I'll run this "misaligned" reads back against the genome with subjunc and see what it does with them.

Hi sdriscoll,

Yes, Subread is quite different from other aligners in determining if a read is a multi-mapping reads. This stems from the way in which it finds the best mapping locations for the reads. It does not perform a full alignment for a read to determine its mapping location. It takes a number of equally-spaced 16bp subreads from the read, mapping them to the reference and then using the mapping location of these subreads to vote for the mapping location of the read (ie. the largest set of consensus subreads determines the mapping location of the read). So for example, if the largest set of consensus subreads include 5 subreads and the read length is 100bp, then the total number of known matched bases at the voting time is <= 80. So the read may have more than 20bp mismatched bases but still get mapped. On the other hand, if the largest consensus set includes 10 subreads, then all 100 bases are perfectly matched with the reference. However, Subread does perform a full read alignment in the end with the determined position.

You could possibly use the number of votes (-m option) to try to mimic the behavior of the edit distance cutoff. But I'm not really sure if you can find a good correlation between them.

Best wishes,

Wei

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 47 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News