  • dexseq_prepare script error with Ensembl human gtf

    Hi all

    I was following the instructions to analyze my RNA-Seq data with the DEXSeq package, but I ran into the following error while preparing the gff file:

    File "/home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 127, in <module>
    assert l[i].iv.end <= l[i+1].iv.start, str(l[i+1]) + " starts too early"
    AssertionError: <GenomicFeature: exonic_part 'ENSG00000166260+ENSG00000141198' at 17: 54951904 -> 54951900 (strand '-')> starts too early


    I've seen a few posts with similar errors but never with the files downloaded from Ensembl itself, thus my post here.

    I got the files from: ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/

    The command I ran is:

    python /home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.89.gtf.gz Homo_sapiens.GRCh38.89.DEXSeq.gff

    I do not know what "ENSG00000166260+ENSG00000141198" is. Is there something I'm doing wrong?

    BTW, it happens with all the gtf files, and with version 88 as well. My apologies if this has been answered and I missed it. I'm struggling to understand what I'm doing here!

  • #2
    Hi aleadam,

    By the look of it, that assert statement is making sure that the 'exonic parts' of an aggregated gene set do not overlap. That is, the end of one exonic part, "l[i].iv.end", should not be located at a higher bp position than the start of the next exonic part, "l[i+1].iv.start". That's all the error is saying.
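To make the check concrete, here is a minimal sketch of the same idea using plain (start, end) tuples. The function name and the tuple representation are assumptions for illustration only; the real script works on HTSeq feature objects, and this is not its actual code.

```python
from typing import List, Tuple

# Hypothetical stand-in for an exonic part: (start, end), sorted by start.
Interval = Tuple[int, int]


def check_no_overlap(parts: List[Interval]) -> None:
    """Raise AssertionError if any part starts before the previous part ends."""
    for i in range(len(parts) - 1):
        prev_end = parts[i][1]
        next_start = parts[i + 1][0]
        assert prev_end <= next_start, f"{parts[i + 1]} starts too early"


# Adjacent, non-overlapping parts pass silently.
check_no_overlap([(100, 200), (200, 300)])

# The second part starts at 200, before the first one ends at 250.
try:
    check_no_overlap([(100, 250), (200, 300)])
except AssertionError as err:
    print("AssertionError:", err)
```

In the error above, the reported part runs 54951904 -> 54951900, i.e. its end lies before its start, which suggests the input intervals were already inconsistent before this check ran.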

    I noticed that those two genes are in a different orientation so I'm not sure why the script is complaining about this. You could open up that python file "dexseq_prepare_annotation.py" in a text editor and have a read to try and figure out what exactly it's doing. I had a quick look on my computer and it does contain comments about this.

    Also, you can run the script without doing the 'aggregate gene' part and it should work. Something like:

    python /home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py -r 'no' Homo_sapiens.GRCh38.89.gtf.gz Homo_sapiens.GRCh38.89.DEXSeq.gff

    Although you may not want to turn this off depending on your requirements.

    Good luck!

    Matt.

    • #3
      Thanks Matt for your reply.

      I don't know any Python, so it will take me a while to understand the script. My first attempt was to simply comment out that assert line, but who knows what other problems that would bring later on!

      I'm looking for advice on what would be the best approach to fix the issue. All I want is to get a list of differential exon usage in my data as part of an exploration of changes in endothelial behavior. I think I can deal with some lost entries, so I'll try your suggestion to use the -r 'no' option.
      Looking at posts similar to mine, it seems that it is usually a problem with the gtf annotation. My other option would be to delete the entries for that particular gene and try again, or to check whether there is an error in the annotation (maybe a misplaced strand sign confusing the orientation of a particular exon?).
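One way to look for a strand problem of that kind, sketched under assumptions: scan the exon records and report any gene_id whose exons appear on both strands. The function name is made up for this example, the toy records are not real Ensembl entries, and whether the release 89 problem actually shows up this way is a guess.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def genes_on_both_strands(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Return gene_id -> strands for genes whose exons sit on more than one strand."""
    strands: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if line.startswith("#"):  # skip GTF header comments
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if match:
            strands[match.group(1)].add(fields[6])  # column 7 is the strand
    return {gene: s for gene, s in strands.items() if len(s) > 1}


# Made-up toy records (not real Ensembl entries), laid out like GTF lines.
toy = [
    '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_1";',
    '1\ttoy\texon\t300\t400\t.\t-\t.\tgene_id "GENE_A"; transcript_id "TX_2";',
    '1\ttoy\texon\t500\t600\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_3";',
]
print(genes_on_both_strands(toy))
```

To run it on a real file, pass an open handle instead of the toy list, for example the result of gzip.open("Homo_sapiens.GRCh38.89.gtf.gz", "rt").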
      I'm very new to this and I'm learning by doing, so I apologize in advance if anything I am saying does not make any sense.

      Thanks again,

      Alex.

      • #4
        As a quick update, both commands:

        python /home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py -r 'no' Homo_sapiens.GRCh38.89.gtf.gz Homo_sapiens.GRCh38.89.DEXSeq.gff

        and

        python /home/aleadam/R/x86_64-pc-linux-gnu-library/3.3/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh38.87.gtf.gz Homo_sapiens.GRCh38.87.DEXSeq.gff

        seem to work just fine. Thus it appears to be a problem specifically with the aggregate gene option in releases 88 and 89.

        Thanks again for your help

        Alex
        Last edited by aleadam; 07-24-2017, 06:25 AM.

        • #5
          Hi Alex,

          Yes, it seems that something changed in the annotation of the latest version. Perhaps you could compare the start and end positions for that particular gene in each of the gtf versions? This might give you a clue to what is going on.

          I suppose it might be that the script is not quite configured to deal with these edge cases of genes that are close together and sort of intertwined. Maybe if you email the authors of DEXSeq they could give you an explanation, or else you might have to dig through and learn some python!

          Good luck,

          Matt.

          • #6
            Hi Matt

            I did indeed try:
            zcat Homo_sapiens.GRCh38.87.gtf.gz | grep ENSG00000166260 > 87.txt
            zcat Homo_sapiens.GRCh38.89.gtf.gz | grep ENSG00000166260 > 89.txt
            diff 87.txt 89.txt > 87vs89.diff

            But the diff was not very helpful: there are lots of small changes in the annotations. I'm not a bioinformatician, but a cell biologist using bioinformatic tools, so my ability to understand small differences in code or annotations is limited.
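A more targeted comparison than a raw diff might be to list which transcript IDs are attached to the gene in each release and compare the two sets. This is only a sketch: the function name is an assumption, and the toy lines stand in for the real files.

```python
import re
from typing import Iterable, Set

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcripts_for_gene(lines: Iterable[str], gene_id: str) -> Set[str]:
    """Collect the transcript_id values found on records of the given gene."""
    needle = f'gene_id "{gene_id}"'
    found: Set[str] = set()
    for line in lines:
        if line.startswith("#") or needle not in line:
            continue
        match = TRANSCRIPT_ID.search(line)
        if match:  # gene-level records carry no transcript_id and are skipped
            found.add(match.group(1))
    return found


# Made-up toy records standing in for an older and a newer release.
old_release = [
    '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_1";',
]
new_release = [
    '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_1";',
    '1\ttoy\texon\t150\t260\t.\t-\t.\tgene_id "GENE_A"; transcript_id "TX_9";',
]

before = transcripts_for_gene(old_release, "GENE_A")
after = transcripts_for_gene(new_release, "GENE_A")
print("only in new release:", sorted(after - before))
print("only in old release:", sorted(before - after))
```

With the real data, the two lists would be replaced by handles from gzip.open(..., "rt") on the release 87 and release 89 files, and "GENE_A" by ENSG00000166260 or ENSG00000141198.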

            I saw the authors of DEXSeq answering a few questions here, so they might see it eventually. If not I will write them directly so they can update the script if needed.

            Anyhow, using release 87 I was able to follow through with the analysis, so now it's time to dig into PubMed to figure out what the role (if any) of each hit might be.

            Cheers,

            Alex.

            • #7
              Looks to me like an error in the Ensembl annotation: transcript ENST00000639671 is misannotated as being a transcript produced by the TOM1L1 gene (TOM1L1-224), when it is in fact a COX11-derived transcript. The error thrown by the DEXSeq script could be because the two genes are located on opposite strands.
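For anyone who needs to stay on the affected release, a possible workaround is to drop the records of that one transcript before running the preparation script. The sketch below is untested against the real release 89 file, and the function name is made up; it simply filters out every line carrying a given transcript_id.

```python
from typing import Iterable, Iterator


def drop_transcript(lines: Iterable[str], transcript_id: str) -> Iterator[str]:
    """Yield every line except those annotated with the given transcript_id."""
    needle = f'transcript_id "{transcript_id}"'
    for line in lines:
        if needle not in line:
            yield line


# Made-up toy records (not real Ensembl entries), laid out like GTF lines.
toy = [
    "#!toy header\n",
    '1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_1";\n',
    '1\ttoy\texon\t150\t260\t.\t-\t.\tgene_id "GENE_A"; transcript_id "TX_9";\n',
]
kept = list(drop_transcript(toy, "TX_9"))
print(len(kept), "of", len(toy), "lines kept")
```

On the real file, the input would be a handle from gzip.open("Homo_sapiens.GRCh38.89.gtf.gz", "rt"), the ID would be ENST00000639671, and the filtered lines could be written to a new gtf with writelines(). Header comments and gene-level records are kept as they are.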

              • #8
                We're aware of the problem with TOM1L1 and COX11 in Ensembl, and it is fixed in the next release, coming out next week.
