Seqanswers Leaderboard Ad

**shi** · 06-04-2013, 09:36 PM

Originally posted by sdriscoll View Post

Interesting. subjunc didn't manage to pick up on the correct alignment for the read I used as an example. To review:

Code:

Correct:        X:28585510, CIGAR=91M2449N9M (NM=0) flag=0
subread-align:  X:30352028, MAPQ=46, CIGAR=100M (NM=6) flag=16
subjunc:        X:30352036, MAPQ=187, CIGAR=8S92M (NM=6) flag=16

subjunc seems to have put the read in the exact same spot but this time soft-clipped it. It also appears to be even more confident that it aligned the read correctly this time.

Here's the read if you'd like to take a look (again this is in mm10):

Code:

ACTTCTCATGCATGTCTTCAATTGCTTCCATGGTCGGACTCTGACTACATTCGGACAGTTTTAATGCTTGTTGTTCTTTTTGATAATTGTTCACTGATTT

Hi sdriscoll,

You have to provide Subjunc your entire read dataset. Subjunc firstly uses all the junction reads to discover exon-exon junctions and then it uses the discovered junctions to re-align the reads.

Subjunc needs at least 16bp to identify exons. Your read has only 9 bases in the second exon, so providing this read only to Subjunc will not let it find that junction.

Cheers,
Wei

**shi** · 06-04-2013, 09:41 PM

Originally posted by sdriscoll View Post

Also I don't want to distract you too much with this alignment snag - I'm really more interested in having the aligners NOT crash when using multiple threads. Any ideas on that one?

Hi sdriscoll,

No worries. It is good to get to the bottom of this alignment issue.

We have identified the problem which caused subread and subjunc to crash when using multiple threads. A bug was introduced into v1.3.3 when we added some new functions. We have reproduced this problem on our laptop. It should be fixed in a couple of days. Will let you know when it is fixed.

Thanks,
Wei

**sdriscoll** · 06-04-2013, 09:46 PM

Ah, I only gave it the same 94 reads I was using before. I already tied my computer up in some analysis that'll run for a few hours so I'll investigate this further tomorrow (california time).

Thanks for all of the info so far. I did see at least one other little glitch earlier today but I'll have to repeat it to find it. It was a paired alignment where the two mates were aligned to different chromosomes, which was fine, but the SAM field #7 (RNEXT) for one of the two mates was set to *. Everything else was set correctly for the two mates and 'samtools fixmate' was able to correct the issue. Not a big deal but it made my benchmarking code flip out a little because it checks for those fields when evaluating these type of paired alignments.

**sdriscoll** · 06-06-2013, 12:42 PM

Question:

What triggers this warning message during the index building process?

Unknown character in the chromosome data: 62, ignored!

*nevermind I read the source code*.

I saw this warning message building indexes for transcriptomes - specifically mouse and human. I wonder what characters they have in those files that don't fall into the categories you've covered in the code...

Code:

int is_gene_char(char c)
{
	if((c>='A' && c<'Z') || (c>='a' && c<='z'))
		return GENE_SPACE_BASE;
	if(c>='0' && c<'9')
		return GENE_SPACE_COLOR;
	if(c=='N' || c == '.')
		return GENE_SPACE_BASE;
	return 0;
}

**gringer** · 06-06-2013, 05:04 PM

Originally posted by sdriscoll View Post

I wonder what characters they have in those files that don't fall into the categories you've covered in the code...

ASCII character 62d is a '>' (i.e. the sequence ID prefix). Perhaps the line endings aren't being parsed correctly, or there's a blank line instead of sequence, or something like that.

FWIW, a '-' can also occur in FASTA files as an allowed character (more common when aligning different sequences together).

**shi** · 06-06-2013, 05:20 PM

Dear sdriscoll and David,

Yes, it seems to us that there is an empty reference sequence there. Sorry the subread-buildindex output was not very informative.

We will release a patch shortly, which will try to fix this. It will also output more information such as the line numbers where the unexpected characters were found. This will help figure out what went wrong if it was some other problems.

Thanks for reporting this.

Best wishes,

Wei

**gringer** · 06-06-2013, 05:31 PM

Originally posted by shi View Post

We will release a patch shortly, which will try to fix this. It will also output more information such as the line numbers where the unexpected characters were found. This will help figure out what went wrong if it was some other problems.

If there are not too many file errors, the best thing to do would be line numbers as well the entirety of surrounding lines, so that the user doesn't need to hunt through the file for simple errors:

Code:

Warning: Unknown character in the chromosome data: 62, at postition 1, line 79. This will be ignored!
77 TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG
78 >LarivSequence1
79 >RivalSequence1
   ^
80 ACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGC
81 TCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTT

edit: after seeing the warning like this and the previous code, I presume that this example line would have been declared to be a colour-space sequence because it contains a number....

**shi** · 06-06-2013, 06:07 PM

Thanks for the suggestion, David. Yes, that's what we are going to do!

Best wishes,

Wei

**sdriscoll** · 06-06-2013, 10:20 PM

Wei-

I caught another bug today. I was aligning paired end reads with subjunc (all default options except with -T 8). Everything finished normally and I saw no errors. I called on samtools to convert the alignments from SAM to BAM and it said...

Parse error at line 538: sequence and quality are inconsistent
Abort trap: 6

Here's the SAM output starting with line 538.

Code:

ENST00000549441_2353_2605_1_0:2_1	165	*	0	0	*	12	49740949	CCAGGCTGGAGTGCAGTGGCGTGATCTCGGCTCACTGCAACCTCCGCCTCTCAGGTTCAGGTGATTCTCCTGCCTCAGCCTCCCAAGTAGCTGGGACTAC
ENST00000549441_2353_2605_1_0:2_1	101	*	0	0	*	12	49745282	TGGTGGCTCATGCCTGTAATCCCAGCACTTTGGCAGGCCGAGGCGGGCAGATCACGAGGTCAGGAGTTCGAGACCAGCAGCCTTGCCAATATAGTGAAAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
ENST00000549441_2029_2327_1_0:3_2	217	12	49745282	199	100M	*	0	GTGAAAAGAAACTTCCCCCTGAGGACTGACTCTTCCTAGCAGAGCTGGGCAACTTGTCCCAAATCTAGCTTTGCCCACGAATGGCATCCCAACAGAGTTJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
ENST00000549441_2029_2327_1_0:3_2	195	12	49745098	171	14S86M	=	49745183	185	TTAGCTGGCTTACCAGGTTTTGGGCCTCTCAGAAGGGGCAACAAATGAAGAAGTACATCGGAGTTGCCAGGAGCTAGTGAAGGTCTGGCACCCAGACCAC	????????????????????????????????????????????????????????????????????????????????????????????????????

Aside from the obvious formatting errors with the first through third lines there's total madness going on with the flags and the mate alignment fields.

I scanned through the SAM file from the top and found that the trouble starts with this section..

Code:

ENST00000380900_28_234_1_1:1_3  99      21      40555182        195     100M    =       40555288      206     TCCGCGCCAGCTGCAGACGCACCTCCCTGTCCTCGGGCGTCTCCGTCCGCTCCTCCTCCTCCTCCTCTTCGTCCTCAGTCCCAGCTCGGCACGGCGCCTT    ????????????????????????????????????????????????????????????????????????????????????????????????????
ENST00000380900_28_234_1_1:1_3  147     21      40555288        197     100M    =       40555182      -206    CTCTCCGAAGAACGTGGCCGCCATAGCCGCCCCGTGACCGGCTGGACACAACTGCAGCGCCGCGGGACCGCACGCCGGCTTGCGCGGGACCACGCTCCCT    ????????????????????????????????????????????????????????????????????????????????????????????????????
ENST00000380900_216_549_0_1:2_4 81      21      40553727        197     78M1373N22M     =       40549468        4359    AACTTGGAGCACGGATATTTTTCTAGCAAAGAAACTTCCAAAGATGTTTTTGTTTGTCTTCGAAGGAGCCGCAGTTCCCTCTTCCGCGCCAGCTGCAGAC    ????????????????????????????????????????????????????????????????????????????????????????????????????
ENST00000380900_503_726_1_0:0_5 161     21      40549468        199     30M877N70M      =       40553727        -4359   GGATTTTCCATACTTGACAGTAGCTTAGAACTGCTGCAGGAAGGTCGTGTACTATATTCGGTTGTTCTAGCAATGGACAACACGCCGAGTCTTTGAAATT    ????????????????????????????????????????????????????????????????????????????????????????????????????
ENST00000380900_503_726_1_0:0_5 209     21      40550468        199     100M    =       40549445      1123    AAGGAGAAGGAAGGCTGCCGGTGGATTCTGAGGTTTTATAATCGGTAACATGTCGACATGTGAGAATAGTTATCTGCATGTTCTTCCTTGGACAAGAGCC    ????????????????????????????????????????????????????????????????????????????????????????????????????
ENST00000380900_488_749_1_0:1_6 161     21      40549445        199     53M877N47M      =       40550468        -1123   ATAACACAAGTACAGAATTGCTGGGATTTTCCATACTTGACAGTAGCTTAGAACTGCTGCAGGAAGGTCGTGTACTATATTCGGTTGTTCTAGCAATGGA    ????????????????????????????????????????????????????????????????????????????????????????????????????
ENST00000380900_488_749_1_0:1_6 81      21      40550483        197     91M1637N9M      =       40552329        1946    TGCCGGTGGATTCTGAGGTTTTATAATCGGTAACATGTCGACATGTGAGAATAGTTATCTGAATGTTCTTCCTTGGACAAGAGCCAAAAACCGAGGGATT    ????????????????????????????????????????????????????????????????????????????????????????????????????

You can see how things are fine and then a solo read 'ENST00000380900_216_549_0_1:2_4' shows up without a mate and starts the downfall. I suspect the entire rest of the file is messed up from this point on.

You can download the fastq files if you need them:

Dropbox - 404

https://dl.dropboxusercontent.com/u/10508639/sample_reads.tar.gz

I'm aligning to the Ensemble hg19 "primary" genome.

**shi** · 06-06-2013, 10:33 PM

Dear sdriscoll,

Thanks for reporting the bug and providing the data. We are now looking into this and will get back to you soon. Sorry about this.

Best wishes,

Wei

**sdriscoll** · 06-06-2013, 10:46 PM

Thanks Wei. No worries. The single-end alignments didn't throw any errors at the samtools conversion stage so I assume they are all good.

**Bernt.Popp** · 06-07-2013, 03:57 AM

Hey Wei,

Very nice paper on Subread!

I have one question before trying Subread:
What SOLiD color-space format does it support?
i.e. does it support CSFASTA/QUAL files or only FASTQ (what type of FASTQ, BFAST type?).

bernt

**shi** · 06-07-2013, 04:30 AM

Dear Bernt,

Thanks. For SOLiD color-space data, Subread only supports the CSFASTA/QUAL format files. Actually Subread only uses the CSFASTA files and it does not need the QUAL files.

You will need to build a color-space index before carrying out the alignment.

Best wishes,

Wei

**Bernt.Popp** · 06-07-2013, 04:46 AM

Hey Wei,

thanks for the fast answer!

From the UsersGude I understand that the mapping quality is calculated using the base qualities. How is this done without the QUAL files for color-space? Or does Subread not emit a MQS for SOLID reads?

**shi** · 06-07-2013, 05:07 AM

Hi Bernt,

From our experience of mapping SOLiD reads, we found their base calling qualities were not much useful. This is partly because the SOLiD reads had pretty high base calling qualities and also partly because the Subread aligner can tolerate quite a few errors in each read in the mapping. Therefore we decided to not use their base calling score information. When calculating MQS, we use base calling p value of zero for every base (meaning that every base is correctly called). So the calculated MQS scores are slightly higher than the scores calculating from using the QUAL files. But this shouldn't be of much concern because the differences are quite minor.

However, I have to say that the evaluations we performed on Subread was mainly based on Illumina reads. So it is possible that the current form of Subread aligner may not be at its best in mapping SOLiD reads.

I hopeful you find these information useful.

Best wishes,

Wei

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 9 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News