Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by sdriscoll View Post
    Interesting. subjunc didn't manage to pick up on the correct alignment for the read I used as an example. To review:

    Code:
    Correct:        X:28585510, CIGAR=91M2449N9M (NM=0) flag=0
    subread-align:  X:30352028, MAPQ=46, CIGAR=100M (NM=6) flag=16
    subjunc:        X:30352036, MAPQ=187, CIGAR=8S92M (NM=6) flag=16
    subjunc seems to have put the read in the exact same spot but this time soft-clipped it. It also appears to be even more confident that it aligned the read correctly this time.

    Here's the read if you'd like to take a look (again this is in mm10):

    Code:
    ACTTCTCATGCATGTCTTCAATTGCTTCCATGGTCGGACTCTGACTACATTCGGACAGTTTTAATGCTTGTTGTTCTTTTTGATAATTGTTCACTGATTT
    Hi sdriscoll,

    You have to provide Subjunc your entire read dataset. Subjunc firstly uses all the junction reads to discover exon-exon junctions and then it uses the discovered junctions to re-align the reads.

    Subjunc needs at least 16bp to identify exons. Your read has only 9 bases in the second exon, so providing this read only to Subjunc will not let it find that junction.

    Cheers,
    Wei

    Comment


    • #17
      Originally posted by sdriscoll View Post
      Also I don't want to distract you too much with this alignment snag - I'm really more interested in having the aligners NOT crash when using multiple threads. Any ideas on that one?
      Hi sdriscoll,

      No worries. It is good to get to the bottom of this alignment issue.

      We have identified the problem which caused subread and subjunc to crash when using multiple threads. A bug was introduced into v1.3.3 when we added some new functions. We have reproduced this problem on our laptop. It should be fixed in a couple of days. Will let you know when it is fixed.

      Thanks,
      Wei

      Comment


      • #18
        Ah, I only gave it the same 94 reads I was using before. I already tied my computer up in some analysis that'll run for a few hours so I'll investigate this further tomorrow (california time).

        Thanks for all of the info so far. I did see at least one other little glitch earlier today but I'll have to repeat it to find it. It was a paired alignment where the two mates were aligned to different chromosomes, which was fine, but the SAM field #7 (RNEXT) for one of the two mates was set to *. Everything else was set correctly for the two mates and 'samtools fixmate' was able to correct the issue. Not a big deal but it made my benchmarking code flip out a little because it checks for those fields when evaluating these type of paired alignments.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #19
          Question:

          What triggers this warning message during the index building process?

          Unknown character in the chromosome data: 62, ignored!
          *nevermind I read the source code*.

          I saw this warning message building indexes for transcriptomes - specifically mouse and human. I wonder what characters they have in those files that don't fall into the categories you've covered in the code...
          Code:
          int is_gene_char(char c)
          {
          	if((c>='A' && c<'Z') || (c>='a' && c<='z'))
          		return GENE_SPACE_BASE;
          	if(c>='0' && c<'9')
          		return GENE_SPACE_COLOR;
          	if(c=='N' || c == '.')
          		return GENE_SPACE_BASE;
          	return 0;
          }
          Last edited by sdriscoll; 06-06-2013, 12:46 PM.
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          Comment


          • #20
            Originally posted by sdriscoll View Post
            I wonder what characters they have in those files that don't fall into the categories you've covered in the code...
            ASCII character 62d is a '>' (i.e. the sequence ID prefix). Perhaps the line endings aren't being parsed correctly, or there's a blank line instead of sequence, or something like that.

            FWIW, a '-' can also occur in FASTA files as an allowed character (more common when aligning different sequences together).

            Comment


            • #21
              Dear sdriscoll and David,

              Yes, it seems to us that there is an empty reference sequence there. Sorry the subread-buildindex output was not very informative.

              We will release a patch shortly, which will try to fix this. It will also output more information such as the line numbers where the unexpected characters were found. This will help figure out what went wrong if it was some other problems.

              Thanks for reporting this.

              Best wishes,

              Wei

              Comment


              • #22
                Originally posted by shi View Post
                We will release a patch shortly, which will try to fix this. It will also output more information such as the line numbers where the unexpected characters were found. This will help figure out what went wrong if it was some other problems.
                If there are not too many file errors, the best thing to do would be line numbers as well the entirety of surrounding lines, so that the user doesn't need to hunt through the file for simple errors:

                Code:
                Warning: Unknown character in the chromosome data: 62, at postition 1, line 79. This will be ignored!
                77 TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG
                78 >LarivSequence1
                79 >RivalSequence1
                   ^
                80 ACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGC
                81 TCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTT
                edit: after seeing the warning like this and the previous code, I presume that this example line would have been declared to be a colour-space sequence because it contains a number....
                Last edited by gringer; 06-06-2013, 05:35 PM.

                Comment


                • #23
                  Thanks for the suggestion, David. Yes, that's what we are going to do!

                  Best wishes,

                  Wei

                  Comment


                  • #24
                    Wei-

                    I caught another bug today. I was aligning paired end reads with subjunc (all default options except with -T 8). Everything finished normally and I saw no errors. I called on samtools to convert the alignments from SAM to BAM and it said...

                    Parse error at line 538: sequence and quality are inconsistent
                    Abort trap: 6
                    Here's the SAM output starting with line 538.

                    Code:
                    ENST00000549441_2353_2605_1_0:2_1	165	*	0	0	*	12	49740949	CCAGGCTGGAGTGCAGTGGCGTGATCTCGGCTCACTGCAACCTCCGCCTCTCAGGTTCAGGTGATTCTCCTGCCTCAGCCTCCCAAGTAGCTGGGACTAC
                    ENST00000549441_2353_2605_1_0:2_1	101	*	0	0	*	12	49745282	TGGTGGCTCATGCCTGTAATCCCAGCACTTTGGCAGGCCGAGGCGGGCAGATCACGAGGTCAGGAGTTCGAGACCAGCAGCCTTGCCAATATAGTGAAAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
                    ENST00000549441_2029_2327_1_0:3_2	217	12	49745282	199	100M	*	0	GTGAAAAGAAACTTCCCCCTGAGGACTGACTCTTCCTAGCAGAGCTGGGCAACTTGTCCCAAATCTAGCTTTGCCCACGAATGGCATCCCAACAGAGTTJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
                    ENST00000549441_2029_2327_1_0:3_2	195	12	49745098	171	14S86M	=	49745183	185	TTAGCTGGCTTACCAGGTTTTGGGCCTCTCAGAAGGGGCAACAAATGAAGAAGTACATCGGAGTTGCCAGGAGCTAGTGAAGGTCTGGCACCCAGACCAC	????????????????????????????????????????????????????????????????????????????????????????????????????
                    Aside from the obvious formatting errors with the first through third lines there's total madness going on with the flags and the mate alignment fields.

                    I scanned through the SAM file from the top and found that the trouble starts with this section..

                    Code:
                    ENST00000380900_28_234_1_1:1_3  99      21      40555182        195     100M    =       40555288      206     TCCGCGCCAGCTGCAGACGCACCTCCCTGTCCTCGGGCGTCTCCGTCCGCTCCTCCTCCTCCTCCTCTTCGTCCTCAGTCCCAGCTCGGCACGGCGCCTT    ????????????????????????????????????????????????????????????????????????????????????????????????????
                    ENST00000380900_28_234_1_1:1_3  147     21      40555288        197     100M    =       40555182      -206    CTCTCCGAAGAACGTGGCCGCCATAGCCGCCCCGTGACCGGCTGGACACAACTGCAGCGCCGCGGGACCGCACGCCGGCTTGCGCGGGACCACGCTCCCT    ????????????????????????????????????????????????????????????????????????????????????????????????????
                    ENST00000380900_216_549_0_1:2_4 81      21      40553727        197     78M1373N22M     =       40549468        4359    AACTTGGAGCACGGATATTTTTCTAGCAAAGAAACTTCCAAAGATGTTTTTGTTTGTCTTCGAAGGAGCCGCAGTTCCCTCTTCCGCGCCAGCTGCAGAC    ????????????????????????????????????????????????????????????????????????????????????????????????????
                    ENST00000380900_503_726_1_0:0_5 161     21      40549468        199     30M877N70M      =       40553727        -4359   GGATTTTCCATACTTGACAGTAGCTTAGAACTGCTGCAGGAAGGTCGTGTACTATATTCGGTTGTTCTAGCAATGGACAACACGCCGAGTCTTTGAAATT    ????????????????????????????????????????????????????????????????????????????????????????????????????
                    ENST00000380900_503_726_1_0:0_5 209     21      40550468        199     100M    =       40549445      1123    AAGGAGAAGGAAGGCTGCCGGTGGATTCTGAGGTTTTATAATCGGTAACATGTCGACATGTGAGAATAGTTATCTGCATGTTCTTCCTTGGACAAGAGCC    ????????????????????????????????????????????????????????????????????????????????????????????????????
                    ENST00000380900_488_749_1_0:1_6 161     21      40549445        199     53M877N47M      =       40550468        -1123   ATAACACAAGTACAGAATTGCTGGGATTTTCCATACTTGACAGTAGCTTAGAACTGCTGCAGGAAGGTCGTGTACTATATTCGGTTGTTCTAGCAATGGA    ????????????????????????????????????????????????????????????????????????????????????????????????????
                    ENST00000380900_488_749_1_0:1_6 81      21      40550483        197     91M1637N9M      =       40552329        1946    TGCCGGTGGATTCTGAGGTTTTATAATCGGTAACATGTCGACATGTGAGAATAGTTATCTGAATGTTCTTCCTTGGACAAGAGCCAAAAACCGAGGGATT    ????????????????????????????????????????????????????????????????????????????????????????????????????
                    You can see how things are fine and then a solo read 'ENST00000380900_216_549_0_1:2_4' shows up without a mate and starts the downfall. I suspect the entire rest of the file is messed up from this point on.

                    You can download the fastq files if you need them:



                    I'm aligning to the Ensemble hg19 "primary" genome.
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment


                    • #25
                      Dear sdriscoll,

                      Thanks for reporting the bug and providing the data. We are now looking into this and will get back to you soon. Sorry about this.

                      Best wishes,

                      Wei

                      Comment


                      • #26
                        Thanks Wei. No worries. The single-end alignments didn't throw any errors at the samtools conversion stage so I assume they are all good.
                        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                        Salk Institute for Biological Studies, La Jolla, CA, USA */

                        Comment


                        • #27
                          Hey Wei,

                          Very nice paper on Subread!

                          I have one question before trying Subread:
                          What SOLiD color-space format does it support?
                          i.e. does it support CSFASTA/QUAL files or only FASTQ (what type of FASTQ, BFAST type?).

                          bernt

                          Comment


                          • #28
                            Dear Bernt,

                            Thanks. For SOLiD color-space data, Subread only supports the CSFASTA/QUAL format files. Actually Subread only uses the CSFASTA files and it does not need the QUAL files.

                            You will need to build a color-space index before carrying out the alignment.

                            Best wishes,

                            Wei

                            Comment


                            • #29
                              Hey Wei,

                              thanks for the fast answer!

                              From the UsersGude I understand that the mapping quality is calculated using the base qualities. How is this done without the QUAL files for color-space? Or does Subread not emit a MQS for SOLID reads?

                              Comment


                              • #30
                                Hi Bernt,

                                From our experience of mapping SOLiD reads, we found their base calling qualities were not much useful. This is partly because the SOLiD reads had pretty high base calling qualities and also partly because the Subread aligner can tolerate quite a few errors in each read in the mapping. Therefore we decided to not use their base calling score information. When calculating MQS, we use base calling p value of zero for every base (meaning that every base is correctly called). So the calculated MQS scores are slightly higher than the scores calculating from using the QUAL files. But this shouldn't be of much concern because the differences are quite minor.

                                However, I have to say that the evaluations we performed on Subread was mainly based on Illumina reads. So it is possible that the current form of Subread aligner may not be at its best in mapping SOLiD reads.

                                I hopeful you find these information useful.


                                Best wishes,

                                Wei

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Today, 08:47 AM
                                0 responses
                                9 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                60 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                57 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X