Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • [DEXSeq] prepare_annotation.py: exonic part starts too early!

    When I tried the command below,

    "python dexseq_prepare_annotation.py Mt.gtf Mt_dexseq.gtf"

    an error occurs, saying that "AssertionError: <GenomicFeature: exonic_part 'Medtr7g090810' at Mt:3458:Mt3.5.1Chr7: 28524762 -> 28523495 (strand '-') > starts too early"

    So, would you please give me a tip about how I will be able to fix this error?
    Thank you.

  • #2
    Could you grep for 'Medtr7g090810' in you GTF file and post the lines containing the term?

    Comment


    • #3
      Originally posted by Simon Anders View Post
      Could you grep for 'Medtr7g090810' in you GTF file and post the lines containing the term?
      The attached file includes the part of 'Medtr7g090810'.
      Thank you for sparing your precious time.
      Attached Files

      Comment


      • #4
        It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.

        Comment


        • #5
          Originally posted by Simon Anders View Post
          It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.
          1. So, annotation file has to be fixed??

          2. When I have just made small changes on the "dexseq_prepare_annotation.py", it worked.

          exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
          for f in HTSeq.GFF_Reader( gtf_file ):
          if f.type != "exon":
          continue
          f.attr['transcript_id'] = f.attr['transcript_id'].replace( ":", "_" )
          exons[f.iv] += ( f.attr['transcript_id'], f.attr['transcript_id'] )

          But, seeing the original gtf file, since there are both exon and CDS, I am not sure whether this code is okay for my gtf file or not.

          3. I have another gtf file. For this, I also made a small change. And it worked.

          exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
          for f in HTSeq.GFF_Reader( gtf_file ):
          if f.type != "CDS":
          continue
          f.attr['transcript_id'] = f.attr['transcript_id'].replace( ":", "_" )
          CDS[f.iv] += ( f.attr['transcript_id'], f.attr['transcript_id'] )

          But, I am not sure whether this code is okay or not.

          4. Do you think the codes that I have modified would be okay?
          The attached file is about original gtf and dexseq_prepare_annotation.py and output gtf.

          Thank you very much!

          Comment


          • #6
            Originally posted by Simon Anders View Post
            It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.
            This is the attachment.
            Attached Files

            Comment


            • #7
              No, yopu cannot change from gene ID to transcript ID, because there may be many genes with several overlapping transcripts, and they won't be handled correctly anymore.

              You really should fix your GTF file: Wherever the same gene ID is used for features on different strands, add something to the gene ID. If this is complicated, just add a "+" or "-" to all gene IDs.

              BTW, dexseq_prepare only looks at "exon" lines and ignored "CDS" lines

              Comment


              • #8
                Originally posted by Simon Anders View Post
                No, yopu cannot change from gene ID to transcript ID, because there may be many genes with several overlapping transcripts, and they won't be handled correctly anymore.

                You really should fix your GTF file: Wherever the same gene ID is used for features on different strands, add something to the gene ID. If this is complicated, just add a "+" or "-" to all gene IDs.

                BTW, dexseq_prepare only looks at "exon" lines and ignored "CDS" lines
                1.
                So, you mean I will have to replace each +/- into + ?
                (Alternatively, replace each +/- into -).

                2.
                In my Mhapla.gtf file, it has only exon. So do I need to fix my gtf file by replaceing CDS with exon?

                Comment


                • #9
                  1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                  Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                  2. No, why?

                  Comment


                  • #10
                    Originally posted by Simon Anders View Post
                    1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                    Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                    2. No, why?
                    1. Thank you. I'll try that. I obtained these gtf files from a member of my project group.

                    2. If Mhapla.gtf only has CDS but dexseq_prepare.py only looks at "exon" lines and ignored "CDS" lines, the output might have no line at all, I guess.

                    Comment


                    • #11
                      2. Sure, if it's this way round, you need to change the CDS lines to exon lines.

                      Comment


                      • #12
                        Originally posted by Simon Anders View Post
                        1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                        Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                        2. No, why?

                        I have the same problem. If you search UCSC Genome Browser for the gene for ex; HIST2H3C, you will see 2 genes appearing, one on + and on one - strand. dexseq cannot deal with this, but this situation actually happens in reality for those who use gtf file downloaded from UCSC track.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Advancing Precision Medicine for Rare Diseases in Children
                          by seqadmin




                          Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                          12-16-2024, 07:57 AM
                        • seqadmin
                          Recent Advances in Sequencing Technologies
                          by seqadmin



                          Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                          Long-Read Sequencing
                          Long-read sequencing has seen remarkable advancements,...
                          12-02-2024, 01:49 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 12-17-2024, 10:28 AM
                        0 responses
                        25 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 12-13-2024, 08:24 AM
                        0 responses
                        42 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 12-12-2024, 07:41 AM
                        0 responses
                        28 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 12-11-2024, 07:45 AM
                        0 responses
                        42 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X