Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • alittleboy
    Member
    • Apr 2011
    • 60

    creation of ExonCountSet in DEXSeq

    Hi,

    I am using DEXSeq for differential exon usage tests. In the vignette of producing ExonCountSet object, the author used dexseq_prepare_annotation.py to convert the GTF file to GFF file. In the GFF output, I see that (since the GTF is downloaded from Ensembl) the gene_id's start with "ENSG".

    I know that the next step is to use dexseq_count.py on the GFF and SAM files to generate counts. However, because currently we have the count data file (which we prefer to use), we are hoping to use our own counts (i.e. the treated2fb.txt as in the vignette example) for the analysis. The issue is that, our count files contain EntrezGene ID's, NOT Ensembl IDs, and the conversion between the two is not bijective (i.e. 1-1). Therefore, we I run the read.HTSeqCounts() function in R, the error message "Count files do not correspond to the flattened annotation file" appears.

    Question:

    (1) is Ensembl GTF the only input for dexseq_prepare_annotation.py? It seems the resultant GFF file contains only Ensembl gene IDs, accordingly...

    (2) in my case of non-Ensembl gene IDs, how can I instruct or manipulate the codes to generate an ExonCountSet object?

    Thank you!
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    Use the function "newExonCountDataSet" instead of "read.HTSeqCounts". See "?newExonCountDataSet" for details.

    Comment

    • alittleboy
      Member
      • Apr 2011
      • 60

      #3
      Originally posted by Simon Anders View Post
      Use the function "newExonCountDataSet" instead of "read.HTSeqCounts". See "?newExonCountDataSet" for details.
      Thank you so much for the information, Simon! I think that's the function that suits my current situation ;-)

      Comment

      • alittleboy
        Member
        • Apr 2011
        • 60

        #4
        So I got one more question: if I don't use read.HTSeqCounts() which takes an annotation file, then it takes extra efforts to generate a plot using plotDEXSeq(ExonCountSet, ...), right? The function read.HTSeqCounts() automatically takes care of that, but in newExonCountSet(), I need to specify the "transcripts" argument in order to plot? Thanks!

        Comment

        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          Exactly. Or more specifically, you need the "transcripts" argument (and the "exonIntervals" argument) if you want to get gene models at the bottom of your plot.

          Comment

          • alittleboy
            Member
            • Apr 2011
            • 60

            #6
            Originally posted by Simon Anders View Post
            Exactly. Or more specifically, you need the "transcripts" argument (and the "exonIntervals" argument) if you want to get gene models at the bottom of your plot.
            Hi Simon:

            Thanks for the information! I have two question while using DEXSeq, and hopefully you can help me clarify:

            1. in the newExonCountSet() function, you mentioned that in order to get the gene model at the bottom of plot, two more arguments need to be passed: "exonIntervals" and "transcripts" -- do you have an example of how these two inputs are formatted? Are they derived from the GFF files? Any functions in DEXSeq will help me to get the two inputs? (Sorry I didn't find any examples in the help document...)

            2. in the vignette written by Alejandro, about "data preprocessing and creation of the data objects pasillaGenes and pasillaExons", up to section 5 all the steps aim to generate the per-exon read counts for each sample (finally in the form of .txt files to be used in R). My question is, if I have those per-exon read counts files (from other sources) and would like to use them directly, can I just start from section 5: "creation of CountDataSet" using the per-exon read counts I have at hand? Actually, I have another file of junction reads, and I don't know if DEXSeq can ever take it as inputs in whatever functions? Or DEXSeq just work with per-exon read counts?

            Thank you so much!!

            Comment

            • areyes
              Senior Member
              • Aug 2010
              • 165

              #7
              Hi @alittleboy,

              The function newExonCountSet will allow you to generate your ExonCountSet from basic R data structures. For an example of an ExonCountSet object you could have a look at the pasillaExons object in the pasilla package. You will find how this is supposed to be formatted. Regarding junction reads, you could also input them in DEXSeq, and it will be consider as an additional exon bin!

              Alejandro

              Comment

              • alittleboy
                Member
                • Apr 2011
                • 60

                #8
                Originally posted by areyes View Post
                Hi @alittleboy,

                The function newExonCountSet will allow you to generate your ExonCountSet from basic R data structures. For an example of an ExonCountSet object you could have a look at the pasillaExons object in the pasilla package. You will find how this is supposed to be formatted. Regarding junction reads, you could also input them in DEXSeq, and it will be consider as an additional exon bin!

                Alejandro
                Hi Alejandro:

                Thank you for your suggestions! Would you please be more specific on how can I input my junction_reads file to DEXSeq? I read the related vignette and online documents, but didn't find a solution... Thanks ;-)

                Comment

                • areyes
                  Senior Member
                  • Aug 2010
                  • 165

                  #9
                  Hi @alittleboy,

                  You probably realized that exons in DEXSeq are not "real" exons, but rather exon bins (defined as non overlapping exonic parts of transcripts, see the publication for more detail).

                  Now if you want to test your junction reads, you would have to add as a counting bin, e.g. as a row in your count data, a counting bin that reflects the junction between your exons of interest:

                  If you have this gene model with exons A, B and C:


                  ---[ A ]----[ B ]----[ C ]---

                  Your counting bins would be A, B and C, you will count reads that fall into this exons and your matrix would look like this ( I am making the numbers up):

                  A 2
                  B 4
                  C 3


                  If you want to test your exons bins you would need a matrix like this:

                  A 2
                  A-B 1
                  B 4
                  B-C 3
                  C 3
                  A-C 2

                  And input this into DEXSeq

                  Best regards,
                  Alejandro

                  Comment

                  • alittleboy
                    Member
                    • Apr 2011
                    • 60

                    #10
                    Originally posted by areyes View Post
                    Hi @alittleboy,

                    You probably realized that exons in DEXSeq are not "real" exons, but rather exon bins (defined as non overlapping exonic parts of transcripts, see the publication for more detail).

                    Now if you want to test your junction reads, you would have to add as a counting bin, e.g. as a row in your count data, a counting bin that reflects the junction between your exons of interest:

                    If you have this gene model with exons A, B and C:


                    ---[ A ]----[ B ]----[ C ]---

                    Your counting bins would be A, B and C, you will count reads that fall into this exons and your matrix would look like this ( I am making the numbers up):

                    A 2
                    B 4
                    C 3


                    If you want to test your exons bins you would need a matrix like this:

                    A 2
                    A-B 1
                    B 4
                    B-C 3
                    C 3
                    A-C 2

                    And input this into DEXSeq

                    Best regards,
                    Alejandro
                    Hi @areyes:

                    That's very clear! Thanks for the examples -- I see how important to understand it is the exonic region (exon bin) instead of real exon that constitutes the building block for DEXSeq counting.

                    I remember someone in the past asked why testing these exonic regions instead of real exons is more important biologically and in terms of interpretations. That is also my concern and needs to be clarified. Would you share your thoughts on this? Sorry maybe you've already discussed before, and I appreciate if you can redirect me to the posts ;-)

                    I think maybe it's more concrete to give an example: the top gene in my test comparing two groups is ENSG00000113845. In the HTML output of DEXSeq, there are 22 exonID's from E001 to E022, and the last exon E022 is deferentially expressed. However, from the Ensembl website on this gene here, there are 9 transcripts (splice variants) for this gene. How can I know E022, the exon counting bin, corresponds to which transcript? Or the inference is only limited to the gene-level (maybe wrong), i.e. this gene has at least 1 exon that is DEU, but we don't know which exon is DEU?

                    Thank you so much for your clarifications!
                    Last edited by alittleboy; 07-01-2013, 05:21 PM.

                    Comment

                    • nbahlis
                      Member
                      • May 2013
                      • 25

                      #11
                      DEXSeq exon bins

                      Thank you for this useful discussion.
                      If I understand correctly the "exon bins" to not correspond to actual exons?
                      I did run DEXSeq on 22 samples with condition pre- and post-treament. Inspecting one of the genes of interest ENSG00000113851, in plotDEXSeq this gene is plotted as having 37 exons (or bins) while in reality this gene has only 11 exons. Is the difference due to bin (instead of exon) counting or something went wrong in my analysis?

                      Comment

                      • areyes
                        Senior Member
                        • Aug 2010
                        • 165

                        #12
                        Hi @nbahlis,

                        The preprocessing scripts from DEXSeq define new exon bins based on non overlapping regions of the transcripts (http://www.ncbi.nlm.nih.gov/pmc/arti...195/figure/F1/). If you look at this gene ID in ENSEMBL, (http://www.ensembl.org/Homo_sapiens/...190676-3221394), Two isoforms contain 11 exons, but there are many other isoforms as well. The 37 exon bins defined in DEXSeq come from the preprocessing considering all these isoforms.

                        Comment

                        Latest Articles

                        Collapse

                        • SEQadmin2
                          Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                          by SEQadmin2


                          I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                          Here are nine questions we think about, in roughly the order they matter, before...
                          06-18-2026, 07:11 AM
                        • SEQadmin2
                          From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                          by SEQadmin2


                          Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                          The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                          ...
                          06-02-2026, 10:05 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by SEQadmin2, Today, 05:37 AM
                        0 responses
                        5 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-26-2026, 11:10 AM
                        0 responses
                        16 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-17-2026, 06:09 AM
                        0 responses
                        49 views
                        0 reactions
                        Last Post SEQadmin2  
                        Started by SEQadmin2, 06-09-2026, 11:58 AM
                        0 responses
                        109 views
                        0 reactions
                        Last Post SEQadmin2  
                        Working...