Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    @dpryan: thanks for the reply!
    Yes, I used tophat-fusion, so this means that the counts I get from these alignments are unreliable?
    Also, I am using dexseq_count.py on my 105 RNA-Seq samples, and only one of the files triggered an error:
    ValueError: ("unsupported format character ''' (0x27) at index 34", 'line 433369370
    And this is the line:
    SRR391511.sra.ncbi_enc.52549224 163 chr3 49396810 50 50M = 49396966 205 GGGAAACCAATTCCTATTTACTTAGCCCAGCTCCATGGGGTACTGAGATA B*CBABA><7ABB6BACAA>;AB@B@?ABB@@A?=AB<<6-?>BB>A>6@ XA:i:1 MD:Z:1T48 N@BAA=?9:@4@ XA:i:0 MD:Z:50 NM:i:0 XS:A:- XP:Z:chr3 49397428 50M NH:i:1

    Is there something I should fix in that line to make it work? Also, it would be good to know why it is only failing with that one file when all of them have been processed the same way. Thanks for any input in the matter.

    Comment


    • #17
      I wouldn't say that the counts are unreliable, so much as not necessarily representing exactly what you want...though that's just splitting semantic hairs.

      In the example you posted above (in #14), if all the parts of that fusion gene mapped to, say, KCNJ12, then that fusion read would end up getting counted twice, which is not what you'll want. If the second part of the fusion read mapped to, say, KCNJ18 (random example), then the whole paired-read would end up getting counted once for each of the genes, which is also probably not ideal.

      The conservative solution (short of just realigning) would be to simply remove fusion reads, which could be done with
      Code:
      samtools view file.bam | grep -v "XF:Z" | htseq-count [options] - GTF_file > counts_file
      or something like that. This will still produce a bunch of error messages due to having missing mates, but you remove the chimeric-alignment-double-counting issue. In an ideal world, the counting program would treat chimeric alignments that map to the same feature as a single entity and those aligning to different features in a user-defined manner. If you have some budget to spare, perhaps Simon Anders could be "monetarily encouraged" to modify htseq-count to do that...

      Regarding the dexseq_count.py error, it's not clear to me from that line why that would be occuring. It's complaining about there being an apostrophe somewhere, though I don't see one. I suspect there's something with the format of the chimeric read that it doesn't like, but I'm not familiar enough with the inner-workings of the script to say what (perhaps Simon will chime in).

      Comment


      • #18
        Thanks for the quick reply @dpryan
        I will try to get the counts again following your suggestion for removing the chimeric/fusion alignments.
        It would be great if Simon Anders can implement a better solution for this, but unfortunately "monetary encouragement" is beyond my possibilities at this early stage of my PhD studies ....
        So I will give it a try now and perhaps by Monday I will have the new counts.
        thanks agan!

        Comment


        • #19
          Hi, I was wondering if you ever figured out the error message you were describing above? I am getting a similar thing with HTSeq, only it is with the first read of the SAM file.

          Error occured when reading first line of sam file.
          Error: ("unsupported format character ''' (0x27) at index 34", 'line 30 of file
          CAL120_sr.sam')
          [Exception type: ValueError, raised in _HTSeq.pyx:1168]

          And this is the line:
          Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZPBf,0.0155953,0.00769459,0.0049578 ZMBs,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZFi28 RGZY71VY.IonXpress_014 PGZtmap MDZ12 NMi0 ASi12 XAZmap4-1 XSi12

          Any help would be GREATLY appreciated!

          Comment


          • #20
            That appears to be a pretty screwed up SAM line. Did your file get corrupted?

            Comment


            • #21
              Oh sorry, there should be a bunch of colons in there. That's because I was fooling around with it when I was getting errors earlier about an illegal colon in the optional fields. Here is the real line:

              Y71VY:00004:00106 16 chr10 62135255 0 12M * 0 0 GGGGGGGGAGGG 6666666,,'** ZP:B:f,0.0155953,0.00769459,0.0049578 ZM:B:s,266,0,242,0,0,252,0,278,506,0,0,550,262,278,0,0,226,0,260,0,26,38,42,30,104,0,0,0,764,0,0,0,70,0,5224,0,5866,0,0,0,0,0,0,0,120,124,14,122,0,46 ZF:i:28 RG:Z:Y71VY.IonXpress_014 PG:Z:tmap MD:Z:12 NM:i:0 AS:i:12 XA:Z:map4-1 XS:i:12

              Does this still look screwed up?

              Comment


              • #22
                How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?

                Comment


                • #23
                  Transcriptom annotation

                  Hello to everyone,
                  I would like to ask for advice wich tools to use to annotate assembled RNA seq. We have sequenced several species at Illumina HiSeq and want to annotate genes now. Then see the differences in expression, because we have different species and hybrids as well.
                  Does anyone can advice me the pipeline? I have heard about RAST and then to manually approved it in GMOD.
                  I see majority is working with blast2go, but it is not free

                  Comment


                  • #24
                    Originally posted by bbl View Post
                    How does HTseq-count count paired-end reads which are mapped on different chromosomes or genes? Or should they be filtered out prior htseq-count run?
                    They are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).

                    Comment


                    • #25
                      Thanks Simon. Does it mean I can understand that I only need to filter multiple mapped reads in the accepted_hits.bam from tophat2? No other filtering step necessary prior to HTseq-count?

                      Originally posted by Simon Anders View Post
                      They are counted either as "ambiguous" (default mode "union") or as "no feature" ("intersection" modes).

                      Comment


                      • #26
                        If TopHat marks the multimapping reads with the "NH" tag (and, IIRC, it does), htseq-count filters these out for you, too.

                        Comment


                        • #27
                          Sorting SAM file properly can fix the warning

                          Hi,
                          I came across the same issue while using htseq-count.
                          I ve used
                          Code:
                          sort -n 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam
                          command to sort my accepted_hits.sam, but still getting the warning message. When I little analysed the warning message and my input sorted SAM file, found that my sam file was not sorted properly. One read which is showing warning is as follows:

                          ERR032234.10311751 147 chr12 100208754 50 76M = 100208593 -237 GGCAGCCCAGGGCTTGGTGTTGGCAGTTTGAGTTCGTGAGATAGAAAGAGTGGGGTGTCAAGGCAGTACCCCTGAG >>=@=<9;9?>A<=;A;7>5;;;;=8@=<7==@:;5AA;A@A19A5AAAA@<<<;<;=8=5;@;8;<?.55>7?7> XA:i:0 MD:Z:76 NM:i:0 NH:i:1

                          ERR032234.1031175 145 chr1 51757180 50 76M = 51755662 -1594 CTCGGACTTGATCTGCCCAGACTTTTGGTCAGCAAGGAGAAGGTTATTGTTTGTTAAGAGGAAAATCCGAGATGTA A3=>D<D??>D?@DD?9DDD>DDBBDDDDDDDDDBDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD XA:i:0 MD:Z:76 NM:i:0 NH:i:1

                          ERR032234.10311751 99 chr12 100208593 50 76M = 100208754 237 CTAAAAGCTTACCTCCAAAACAGGATTCTTGTGTAACTAGGAATCCTGCATGAGAACCAGAAACCCTAACCTCCGA CAA@?AC8>ACC>AA<C?8B=6B<98?=>BC<C9=AAA<CA>A=A:=BB=>>::<>@==@*2:;8:7><<;7<><; XA:i:0 MD:Z:76 NM:i:0 NH:i:1


                          So the reads were not sorted correctly. I used the instruction from here

                          Code:
                          sort -bn -k 1.11,1 2234_accepted_hits.sam > 2234_accepted_hits_sorted.sam
                          my reads have ids like this: ERR032234.1031175
                          this sort command would sort the file on field 1, from chars 11 to end of field, numerically.


                          now htseq-count working fine.
                          Last edited by kcm.eid; 03-02-2014, 10:07 PM. Reason: correcting instructions

                          Comment


                          • #28
                            A possible problem :

                            According to htseq documentation : http://www-huber.embl.de/users/ander.../counting.html

                            "The fact that the records describe the same fragment can be seen from the fact that they have the same read name"

                            So Tophat for instance is ouputing read paris with different names. (Adding 1 or 2 at the end of the name)

                            So simply do that:

                            (samtools view -H in.bam; in.bam | awk '{print substr($1,1,length($1)-1),$0}' | sed 's/ [^ ]*//') | samtools view -bSh - > in.samereadname.bam

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            10 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            9 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            51 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            67 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X