SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
PubMed: rNA: a fast and accurate short reads numerical aligner. Newsbot! Literature Watch 0 08-28-2012 02:00 AM
PubMed: DECOD: fast and accurate discriminative DNA motif finding. Newsbot! Literature Watch 0 01-11-2012 10:40 AM
Fast and accurate long read alignment with Burrows-Wheeler transform. nilshomer Literature Watch 1 01-28-2010 09:38 PM

Reply
 
Thread Tools
Old 04-04-2013, 08:46 PM   #1
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Dear All,

Although our 'Subread' read aligner and 'Subjunc' junction detector have been around for a while, we have just got them published on Nucleic Acids Research today.

Subread and Subjunc employ a novel read mapping paradigm called "seed-and-vote" to achieve a fast mapping speed and a high mapping accuracy. It takes Subread less than 20 minutes to map 10 million 100bp reads using one thread. Its running time remains nearly the same when mapping longer reads thanks to the high scalability of the seed-and-vote paradigm.

Subread and Subjunc can be used to map reads generated from all major sequencing platforms including Illumina GA/HiSeq, Roche 454, ABI SOLiD and Ion Torrent. They can run on both Linux/unix and Mac computers.

Subread and Subjunc are included in a software package called 'Subread', which can freely downloaded from http://subread.sourceforge.net/ .

Our paper for Subread and Subjunc can be freely accessed at: http://nar.oxfordjournals.org/conten...kt214.abstract

A short tutorial for Subread can be found here - http://bioinf.wehi.edu.au/subread/

A short tutorial for Subjunc can be found here - http://bioinf.wehi.edu.au/subjunc/

The User's Guides provides comprehensive descriptions to these two programs.

Please do not hesitate to contact me if you have any questions ([email protected]).


Best regards,
------------------------
Wei Shi, Ph.D
Bioinformatics Division,
The Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville 3052, Victoria
Australia

Last edited by shi; 05-15-2013 at 08:07 PM.
shi is offline   Reply With Quote
Old 06-04-2013, 12:41 PM   #2
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Is there a user group for these tools or is it best to post questions in this forum?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 02:48 PM   #3
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

I'll just start putting questions here and see what happens.

First off, here's what I'm doing:

I've just been testing this aligner out to see how it performs and to compare it's alignments with control simulated reads. I simulate illumina-like reads (in this case 100 bp single end reads) from transcript sequences which were generated from a copy of the mm10 genome with random single point mutations added in. I've not included any INDELS. By translating the sampled read coordinates back into genomic coordinates (similar to how Tophat does this post-transcriptome alignment) I obtain "control" alignments in genomic coordinates that include spliced alignments marking junctions between exons.

I'm quantifying alignment "correctness" in a pretty simple way. I make 4 categories: 1) is the read aligned?, 2) is the read the correct orientation, 3) is the read on the correct chromosome? and 4) is the read in the correct posiiton? "correct position" is overly simplified where I only check to see if the aligned coordinates overlap with the control alignment coordinates so my leniency in correct position is essentially the entire length of the read. I assume true positives would go down a little if I refined that metric.

So on to the question:

I ran subread-align with '-u' to see how precise its alignments are when not reporting any multi-mapping reads at all. Surprisingly there were still some reads that aligned incorrectly despite the fact that the aligner reported these as unique alignments. For starters I extracted 94 reads that were mapped "uniquely" in the wrong orientation and remapped those with bowtie, to the mm10 genome, with all defaults (implies -n 2 -e 70) and with the '-a' option enabled to allow all possible alignments to be reported. In terms of my simulated read quals this would allow up to 2 mismatches in the seed and up to 2 mismatches in the remaining bases of the read. 67 of those reads aligned producing 8,665 alignments.

so the questions are - why are these read alignments considered to be uniquely mapped when 1) I know they are mapped in the wrong location and 2) those same reads report an average of more than 100 alignments each to the same genome using bowtie?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */

Last edited by sdriscoll; 06-04-2013 at 02:49 PM. Reason: forgot to include my simulated read length and type
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 03:20 PM   #4
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 821
Default

One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

Otherwise, great paper. One question that doesn't seem to be addressed: what is the size (on disk) of the indexes that subread generates?

Along a similar vein, could it be used for indexing a massive database with similar sequences (e.g. NCBI-nr) to replace something like BLAST?

Edit: just noticed that you discussed BLAST-like databases at the end of the paper, and you leave it open for investigation.

Last edited by gringer; 06-04-2013 at 03:34 PM.
gringer is offline   Reply With Quote
Old 06-04-2013, 03:22 PM   #5
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Question/Bug

I'm running version 1.3.3 on Mac OS 10.7.5. I've used both subread-align and subjunc and both of them crash if I specify the '-T' option with anything other than 1 thread.

Here's what subjunc returns:
Code:
subread-align -J --allow-repeating  -T 2 -i '/Users/pfafflab/alind/ensemble/mm10/subread/mm10e' -r 'reads.1.fq' -R 'reads.2.fq' -o './subjunc-temp-sam-004023-uVJ2Zl' -P 3 -d 50 -D 600   -I 5 -b 0 

Number of selected subreads = 10
Consensus threshold = 3
Number of threads=2
Number of positions reported per multi-mapping read=1
Number of indels allowed=5


Performing paired-end alignment:
Maximum distance between reads=600
Minimum distance between reads=50
Threshold on number of subreads for a successful mapping (the minor end in the pair)=1
Number of anchors=10
The directions of the two input files are: forward, reversed

Input file './subjunc-temp-sam-004023-uVJ2Zl' is not found or is in an incorrect format.
The file '/subjunc-temp-sam-004023-uVJ2Zl' does exist and it has zero content. Passing it through the unix 'file' utility reports:

Code:
subjunc-temp-sam-004023-uVJ2Zl: empty
subread-align is slightly less friendly throwing a seg fault

Code:
subread-align -T 2 -i ~/alind/ensemble/mm10/subread/mm10e -r reads.1.fq -R reads.2.fq -o temp.sam

Number of selected subreads = 10
Consensus threshold = 3
Number of threads=2
Number of positions reported per multi-mapping read=1
Number of indels allowed=5


Performing paired-end alignment:
Maximum distance between reads=600
Minimum distance between reads=50
Threshold on number of subreads for a successful mapping (the minor end in the pair)=1
Number of anchors=10
The directions of the two input files are: forward, reversed

completed=1.25%; time used=7.1s; rate=29.5k reads/s; time left=103.4s; total=1593k pairs            
completed=1.88%; time used=7.5s; rate=29.2k reads/s; time left=103.5s; total=1593k pairs            
completed=2.51%; time used=7.8s; rate=28.9k reads/s; time left=103.7s; total=1592k pairs            
Segmentation fault: 11
it's not always the same though. sometimes it throws the seg fault without actually making any progress at all (during the 'Loading the 01-th index file ...') stage. In fact it even throws the fault at different points if I run the program several times in a row with the exact same options.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 03:35 PM   #6
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Quote:
Originally Posted by gringer View Post
One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.
That's funny I was thinking the same thing when I was looking at their plots. When I run my tests I like to use "precision" and "recall" where I define precision as the ratio of "correct" results to total results and recall as the ratio of "correct" results to total possible results (as in correctly aligned simulated reads to total simulated reads). Simple terms that don't include the confusion of trying to define both true and false positives and negatives. Since in these simulations we don't include any reads that intentionally should NOT align it's hard to define all four of those.

I was wondering if they really defined accuracy by strict terms: (TP+TN)/(TP+TN+FP+FN) which, depending on how you define things, can really only have three values in that equation instead of four. So it might become (TP)/(TP+FP+FN) if you define false negatives as unaligned reads. now the equation has become equal to recall or sensitivity comparing correctly aligned reads to the total number of reads attempted to be aligned. So...accuracy isn't the right word anymore.

I suppose you can be extra strict and say that since there are no "true" negatives then there can't be false negatives either in which case the equation for accuracy becomes the same as that for precision. This also makes it impossible to define specificity (aka the true negative rate) since it would always be 0.

In my opinion it would be best to avoid any terms that include true or false negatives for aligner benchmarking. Then we can avoid bastardizing these basic terms.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 03:38 PM   #7
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Another question:

Is it really necessary to have a limit to the number of references allowed when building an aligner index? I know people who work with poorly defined genomes who have to deal with 10's or sometimes over 100,000 scaffolds. This limit seems to completely exclude those possible users and puts a limitation on this aligner that doesn't exist in any other aligners that I've used.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 08:10 PM   #8
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by sdriscoll View Post
I'll just start putting questions here and see what happens.

First off, here's what I'm doing:

I've just been testing this aligner out to see how it performs and to compare it's alignments with control simulated reads. I simulate illumina-like reads (in this case 100 bp single end reads) from transcript sequences which were generated from a copy of the mm10 genome with random single point mutations added in. I've not included any INDELS. By translating the sampled read coordinates back into genomic coordinates (similar to how Tophat does this post-transcriptome alignment) I obtain "control" alignments in genomic coordinates that include spliced alignments marking junctions between exons.

I'm quantifying alignment "correctness" in a pretty simple way. I make 4 categories: 1) is the read aligned?, 2) is the read the correct orientation, 3) is the read on the correct chromosome? and 4) is the read in the correct posiiton? "correct position" is overly simplified where I only check to see if the aligned coordinates overlap with the control alignment coordinates so my leniency in correct position is essentially the entire length of the read. I assume true positives would go down a little if I refined that metric.

So on to the question:

I ran subread-align with '-u' to see how precise its alignments are when not reporting any multi-mapping reads at all. Surprisingly there were still some reads that aligned incorrectly despite the fact that the aligner reported these as unique alignments. For starters I extracted 94 reads that were mapped "uniquely" in the wrong orientation and remapped those with bowtie, to the mm10 genome, with all defaults (implies -n 2 -e 70) and with the '-a' option enabled to allow all possible alignments to be reported. In terms of my simulated read quals this would allow up to 2 mismatches in the seed and up to 2 mismatches in the remaining bases of the read. 67 of those reads aligned producing 8,665 alignments.

so the questions are - why are these read alignments considered to be uniquely mapped when 1) I know they are mapped in the wrong location and 2) those same reads report an average of more than 100 alignments each to the same genome using bowtie?
Dear sdriscoll,

The definition of 'multi-mapping reads' in Subread is that if a read is called a multi-mapping read, it must have the same number of votes (number of maximum consensus subheads) and also the same length of genomic regions spanned by these consensus subheads in more than one location. This definition is apparently different from the definitions used by other aligners. I guess this to some extend explains the discrepancy you observed between Subread and Bowtie, although this does not explain why these reads were wrongly mapped by Subread.

I noticed that there were exon-spanning reads in your simulation data. Subread determines the read mapping locations by using the largest mapped regions in the reads. For an exon-spanning read, the continuous mappable region is shorter than those of reads falling within an exon and thus the exon-spanning reads are more likely to be wrongly mapped. Given that you introduced SNP to the reference genome when you generated simulated reads, it is possible that Subread assigned such reads to locations where it got higher mapping quality according to its mapping criteria (and it found only one such location).

You may try the Subjunc program (an exon-exon junction detecting program included in the Subread package) to align your reads. Subjunc is more accurate but less sensitive than Subread for the mapping of exon-spanning reads. Subjunc uses two best mappable regions in each read (instead of one as used by Subread) to determine the read mapping location. It also checks the donor/receptor sites. Your simulated reads must have the correct donor/receptor sites if you want to run Subjunc on them (Subjunc only recognizes GT/AG).

I hope this explanation makes sense to you. We will be happy to have a close look at your simulation data if you'd like make it available to us.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-04-2013, 08:50 PM   #9
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by gringer View Post
One little niggle I have with the paper is that you use the computer science terms 'accuracy' and 'recall', rather than the biomedical terms 'sensitivity' and 'specificity' (or alternatively the more explicit terms 'false positive' and 'false negative'). All these terms are easily interchangeable, so it's a good idea to use the terms most appropriate for your audience.

Otherwise, great paper. One question that doesn't seem to be addressed: what is the size (on disk) of the indexes that subread generates?

Along a similar vein, could it be used for indexing a massive database with similar sequences (e.g. NCBI-nr) to replace something like BLAST?

Edit: just noticed that you discussed BLAST-like databases at the end of the paper, and you leave it open for investigation.

Dear David Eccles,

Thanks for your comments.

I hope we made it clear in the paper what we mean by accuracy and recall. But I think I agree with you that these computer science terms may confuse people from other fields. So here I just try to describe a bit more these two terms to make them more clearer.

The accuracy used in the paper was calculated as the percentage of correctly mapped reads out of all the mapped reads. This is equivalent to the false discovery rate. The recall was calculated as the percentage of correctly mapped reads out of all the reads included in the test dataset. This is equivalent to the sensitivity (the sensitivity in my dictionary). Please note that the denominators for the calculation of accuracy and recall are different, but the nominators are the same. So as you can see, our paper focused on the correctly mapped reads.

The size of the index built for human genome hg19 is 6.8GB. But please note that the amount of memory needed for mapping using this index ranges from 1GB to 7.6GB, depending on the memory size selection when you built the index.

As sdriscoll pointed out, there is currently a limit (50,000) on the number of chromosomes/contigs allowed when building the index. A massive database such as NCBI nucleotide database certainly has a lot more sequences than this number. So we are now working on removing this limitation (this should not be too hard). Once this is fixed, we may try to do a search similar to the blast search to see how Subread behaves. I guess there are many issues we need to address for this, such as not only reporting the best location/s but also reporting the second, third or fourth locations. This will need quite a bit changes to our program. But I think the exciting part for us is that we can test if the seed-and-vote paradigm is going to be useful in this search (like many short read aligners, blast also uses the seed-and-vote paradigm). I will be very happy to hear any suggestions for this.

Please let me know if you have any further questions.

Best wishes,

Wei

Last edited by shi; 06-04-2013 at 08:53 PM.
shi is offline   Reply With Quote
Old 06-04-2013, 09:03 PM   #10
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Wei that's interesting. I haven't read into exactly how the subread mapper works but I follow your explanation of how it defines a multi-mapping read. It may be useful to users if it were possible to somehow make that clear because I think we are used to using tools that define a multi-mapped read simply as one that can align in more than one location within some edit distance cutoff. Or at least that's what I'm used to from using STAR, bowtie1/2/tophat, and bwa. It seems to be a dramatic difference based on this one test. Do you suppose there is a way to tweak the input options of subread to mimic the behavior of an edit distance cutoff for multi-mapping?

Also the reason the read is not aligned "correctly" it's likely easily explainable and what you mentioned is no doubt the case. These reads probably wouldn't be aligned correctly by most aligners except maybe by random chance because there exists a continuous region in the genome that is a pretty good mapping for the reads as well as a spliced position that's a perfect mapping (and the actual source of the read). For example one such read should aligned spliced with this alignment position X:28585510 and cigar=91M2449N9M with 0 mismatches. subread-align reported a 6-mismatch alignment at X:30352028 with MAPQ=46 and CIGAR=100M. So it did as you say it should - favored the full length alignment over a truncated one. In this case that wasn't the correct option but it seems the aligner is using the logic you described.

I'll run this "misaligned" reads back against the genome with subjunc and see what it does with them.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 09:08 PM   #11
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Quote:
Originally Posted by shi View Post
The accuracy used in the paper was calculated as the percentage of correctly mapped reads out of all the mapped reads. This is equivalent to the false discovery rate.
isn't this also the same as 'precision'/'positive predicative value'? I believe the FDR is the percentage of incorrectly mapped reads out of all mapped reads. anytime I get mixed up about these terms I go back to this wikipeida article...

http://en.wikipedia.org/wiki/Receive...characteristic
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 09:13 PM   #12
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by sdriscoll View Post
isn't this also the same as 'precision'/'positive predicative value'? I believe the FDR is the percentage of incorrectly mapped reads out of all mapped reads. anytime I get mixed up about these terms I go back to this wikipeida article...

http://en.wikipedia.org/wiki/Receive...characteristic
Sorry for my typo. It meant to be 1 - FDR. And you are correct that this is exactly the same as the 'precision' mentioned in one of your posts. We did not use precision in the paper because we use it to refer to the variation between replicates, which is irrelevant in this context to us.

Wei
shi is offline   Reply With Quote
Old 06-04-2013, 09:20 PM   #13
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Interesting. subjunc didn't manage to pick up on the correct alignment for the read I used as an example. To review:

Code:
Correct:        X:28585510, CIGAR=91M2449N9M (NM=0) flag=0
subread-align:  X:30352028, MAPQ=46, CIGAR=100M (NM=6) flag=16
subjunc:        X:30352036, MAPQ=187, CIGAR=8S92M (NM=6) flag=16
subjunc seems to have put the read in the exact same spot but this time soft-clipped it. It also appears to be even more confident that it aligned the read correctly this time.

Here's the read if you'd like to take a look (again this is in mm10):

Code:
ACTTCTCATGCATGTCTTCAATTGCTTCCATGGTCGGACTCTGACTACATTCGGACAGTTTTAATGCTTGTTGTTCTTTTTGATAATTGTTCACTGATTT
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 09:22 PM   #14
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Also I don't want to distract you too much with this alignment snag - I'm really more interested in having the aligners NOT crash when using multiple threads. Any ideas on that one?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-04-2013, 09:31 PM   #15
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by sdriscoll View Post
Wei that's interesting. I haven't read into exactly how the subread mapper works but I follow your explanation of how it defines a multi-mapping read. It may be useful to users if it were possible to somehow make that clear because I think we are used to using tools that define a multi-mapped read simply as one that can align in more than one location within some edit distance cutoff. Or at least that's what I'm used to from using STAR, bowtie1/2/tophat, and bwa. It seems to be a dramatic difference based on this one test. Do you suppose there is a way to tweak the input options of subread to mimic the behavior of an edit distance cutoff for multi-mapping?

Also the reason the read is not aligned "correctly" it's likely easily explainable and what you mentioned is no doubt the case. These reads probably wouldn't be aligned correctly by most aligners except maybe by random chance because there exists a continuous region in the genome that is a pretty good mapping for the reads as well as a spliced position that's a perfect mapping (and the actual source of the read). For example one such read should aligned spliced with this alignment position X:28585510 and cigar=91M2449N9M with 0 mismatches. subread-align reported a 6-mismatch alignment at X:30352028 with MAPQ=46 and CIGAR=100M. So it did as you say it should - favored the full length alignment over a truncated one. In this case that wasn't the correct option but it seems the aligner is using the logic you described.

I'll run this "misaligned" reads back against the genome with subjunc and see what it does with them.
Hi sdriscoll,

Yes, Subread is quite different from other aligners in determining if a read is a multi-mapping reads. This stems from the way in which it finds the best mapping locations for the reads. It does not perform a full alignment for a read to determine its mapping location. It takes a number of equally-spaced 16bp subreads from the read, mapping them to the reference and then using the mapping location of these subreads to vote for the mapping location of the read (ie. the largest set of consensus subreads determines the mapping location of the read). So for example, if the largest set of consensus subreads include 5 subreads and the read length is 100bp, then the total number of known matched bases at the voting time is <= 80. So the read may have more than 20bp mismatched bases but still get mapped. On the other hand, if the largest consensus set includes 10 subreads, then all 100 bases are perfectly matched with the reference. However, Subread does perform a full read alignment in the end with the determined position.

You could possibly use the number of votes (-m option) to try to mimic the behavior of the edit distance cutoff. But I'm not really sure if you can find a good correlation between them.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-04-2013, 09:36 PM   #16
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by sdriscoll View Post
Interesting. subjunc didn't manage to pick up on the correct alignment for the read I used as an example. To review:

Code:
Correct:        X:28585510, CIGAR=91M2449N9M (NM=0) flag=0
subread-align:  X:30352028, MAPQ=46, CIGAR=100M (NM=6) flag=16
subjunc:        X:30352036, MAPQ=187, CIGAR=8S92M (NM=6) flag=16
subjunc seems to have put the read in the exact same spot but this time soft-clipped it. It also appears to be even more confident that it aligned the read correctly this time.

Here's the read if you'd like to take a look (again this is in mm10):

Code:
ACTTCTCATGCATGTCTTCAATTGCTTCCATGGTCGGACTCTGACTACATTCGGACAGTTTTAATGCTTGTTGTTCTTTTTGATAATTGTTCACTGATTT
Hi sdriscoll,

You have to provide Subjunc your entire read dataset. Subjunc firstly uses all the junction reads to discover exon-exon junctions and then it uses the discovered junctions to re-align the reads.

Subjunc needs at least 16bp to identify exons. Your read has only 9 bases in the second exon, so providing this read only to Subjunc will not let it find that junction.

Cheers,
Wei
shi is offline   Reply With Quote
Old 06-04-2013, 09:41 PM   #17
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by sdriscoll View Post
Also I don't want to distract you too much with this alignment snag - I'm really more interested in having the aligners NOT crash when using multiple threads. Any ideas on that one?
Hi sdriscoll,

No worries. It is good to get to the bottom of this alignment issue.

We have identified the problem which caused subread and subjunc to crash when using multiple threads. A bug was introduced into v1.3.3 when we added some new functions. We have reproduced this problem on our laptop. It should be fixed in a couple of days. Will let you know when it is fixed.

Thanks,
Wei
shi is offline   Reply With Quote
Old 06-04-2013, 09:46 PM   #18
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Ah, I only gave it the same 94 reads I was using before. I already tied my computer up in some analysis that'll run for a few hours so I'll investigate this further tomorrow (california time).

Thanks for all of the info so far. I did see at least one other little glitch earlier today but I'll have to repeat it to find it. It was a paired alignment where the two mates were aligned to different chromosomes, which was fine, but the SAM field #7 (RNEXT) for one of the two mates was set to *. Everything else was set correctly for the two mates and 'samtools fixmate' was able to correct the issue. Not a big deal but it made my benchmarking code flip out a little because it checks for those fields when evaluating these type of paired alignments.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 06-06-2013, 12:42 PM   #19
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Question:

What triggers this warning message during the index building process?

Quote:
Unknown character in the chromosome data: 62, ignored!
*nevermind I read the source code*.

I saw this warning message building indexes for transcriptomes - specifically mouse and human. I wonder what characters they have in those files that don't fall into the categories you've covered in the code...
Code:
int is_gene_char(char c)
{
	if((c>='A' && c<'Z') || (c>='a' && c<='z'))
		return GENE_SPACE_BASE;
	if(c>='0' && c<'9')
		return GENE_SPACE_COLOR;
	if(c=='N' || c == '.')
		return GENE_SPACE_BASE;
	return 0;
}
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */

Last edited by sdriscoll; 06-06-2013 at 12:46 PM.
sdriscoll is offline   Reply With Quote
Old 06-06-2013, 05:04 PM   #20
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 821
Default

Quote:
Originally Posted by sdriscoll View Post
I wonder what characters they have in those files that don't fall into the categories you've covered in the code...
ASCII character 62d is a '>' (i.e. the sequence ID prefix). Perhaps the line endings aren't being parsed correctly, or there's a blank line instead of sequence, or something like that.

FWIW, a '-' can also occur in FASTA files as an allowed character (more common when aligning different sequences together).
gringer is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:33 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO