Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • [DEXSeq] prepare_annotation.py: exonic part starts too early!

    When I tried the command below,

    "python dexseq_prepare_annotation.py Mt.gtf Mt_dexseq.gtf"

    an error occurs, saying that "AssertionError: <GenomicFeature: exonic_part 'Medtr7g090810' at Mt:3458:Mt3.5.1Chr7: 28524762 -> 28523495 (strand '-') > starts too early"

    So, would you please give me a tip about how I will be able to fix this error?
    Thank you.

  • #2
    Could you grep for 'Medtr7g090810' in you GTF file and post the lines containing the term?

    Comment


    • #3
      Originally posted by Simon Anders View Post
      Could you grep for 'Medtr7g090810' in you GTF file and post the lines containing the term?
      The attached file includes the part of 'Medtr7g090810'.
      Thank you for sparing your precious time.
      Attached Files

      Comment


      • #4
        It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.

        Comment


        • #5
          Originally posted by Simon Anders View Post
          It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.
          1. So, annotation file has to be fixed??

          2. When I have just made small changes on the "dexseq_prepare_annotation.py", it worked.

          exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
          for f in HTSeq.GFF_Reader( gtf_file ):
          if f.type != "exon":
          continue
          f.attr['transcript_id'] = f.attr['transcript_id'].replace( ":", "_" )
          exons[f.iv] += ( f.attr['transcript_id'], f.attr['transcript_id'] )

          But, seeing the original gtf file, since there are both exon and CDS, I am not sure whether this code is okay for my gtf file or not.

          3. I have another gtf file. For this, I also made a small change. And it worked.

          exons = HTSeq.GenomicArrayOfSets( "auto", stranded=True )
          for f in HTSeq.GFF_Reader( gtf_file ):
          if f.type != "CDS":
          continue
          f.attr['transcript_id'] = f.attr['transcript_id'].replace( ":", "_" )
          CDS[f.iv] += ( f.attr['transcript_id'], f.attr['transcript_id'] )

          But, I am not sure whether this code is okay or not.

          4. Do you think the codes that I have modified would be okay?
          The attached file is about original gtf and dexseq_prepare_annotation.py and output gtf.

          Thank you very much!

          Comment


          • #6
            Originally posted by Simon Anders View Post
            It seems HTSeq got confused because the same gene occurs on both the "+" and the "-_ strand.
            This is the attachment.
            Attached Files

            Comment


            • #7
              No, yopu cannot change from gene ID to transcript ID, because there may be many genes with several overlapping transcripts, and they won't be handled correctly anymore.

              You really should fix your GTF file: Wherever the same gene ID is used for features on different strands, add something to the gene ID. If this is complicated, just add a "+" or "-" to all gene IDs.

              BTW, dexseq_prepare only looks at "exon" lines and ignored "CDS" lines

              Comment


              • #8
                Originally posted by Simon Anders View Post
                No, yopu cannot change from gene ID to transcript ID, because there may be many genes with several overlapping transcripts, and they won't be handled correctly anymore.

                You really should fix your GTF file: Wherever the same gene ID is used for features on different strands, add something to the gene ID. If this is complicated, just add a "+" or "-" to all gene IDs.

                BTW, dexseq_prepare only looks at "exon" lines and ignored "CDS" lines
                1.
                So, you mean I will have to replace each +/- into + ?
                (Alternatively, replace each +/- into -).

                2.
                In my Mhapla.gtf file, it has only exon. So do I need to fix my gtf file by replaceing CDS with exon?

                Comment


                • #9
                  1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                  Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                  2. No, why?

                  Comment


                  • #10
                    Originally posted by Simon Anders View Post
                    1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                    Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                    2. No, why?
                    1. Thank you. I'll try that. I obtained these gtf files from a member of my project group.

                    2. If Mhapla.gtf only has CDS but dexseq_prepare.py only looks at "exon" lines and ignored "CDS" lines, the output might have no line at all, I guess.

                    Comment


                    • #11
                      2. Sure, if it's this way round, you need to change the CDS lines to exon lines.

                      Comment


                      • #12
                        Originally posted by Simon Anders View Post
                        1. No, change the gene ID from, say, "Medtr7g090810" to "Medtr7g090810+" and "Medtr7g090810-", depending on strand. This is assuming that you know a scripting language. I wouldn't want to do that manually.

                        Where did you get this strange GTF file from, anyway? Having the same gene name on both strands is a bug.

                        2. No, why?

                        I have the same problem. If you search UCSC Genome Browser for the gene for ex; HIST2H3C, you will see 2 genes appearing, one on + and on one - strand. dexseq cannot deal with this, but this situation actually happens in reality for those who use gtf file downloaded from UCSC track.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        18 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        22 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        16 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        47 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X