Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Hi Simon,

    Can I ask you a question that if I have paired end reads who are overlapping with each other, for a overlapped region, does HTSeq count it as single coverages if reads are from the same fragment (but different ends) or count it as covered twice?

    Thanks!

    Originally posted by Simon Anders View Post
    Here is the script that I made for this purpose: http://www-huber.embl.de/users/ander...doc/count.html

    Simon

    Comment


    • #17
      Originally posted by wenhuang View Post
      Can I ask you a question that if I have paired end reads who are overlapping with each other, for a overlapped region, does HTSeq count it as single coverages if reads are from the same fragment (but different ends) or count it as covered twice?
      If you use the full Python framework (as opposed to the the simple htseq-count script) you can adjust such stuff as you need them.

      At the moment, however, HTSeq does not support paired-end reads at all. I'll add this functionality very soon, hopefully next week.

      Your data would be very useful as test data. Could you maybe send me your SAM file (or parts of it)?

      By the way, last time I tried, Bowtie was unable to deal with overlapping ends from the same fragment and did not align these at all. Have you used another software, or did they improve that?

      Cheers
      Simon

      Comment


      • #18
        Hi Simon,

        Thanks for your reply. I ended up using tophat and it has no problem aligning overlapping reads by specifying a negative -r as suggested by Cole Trapnell.

        I attached an example SAM file below. It is from a highly expressed gene on bovine chromosome one (ATPO). The Ensembl annotation is as below.

        I am very eager to see what you get.

        Thanks a lot.


        chr1 protein_coding exon 720699 720764 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
        chr1 protein_coding CDS 720729 720764 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
        chr1 protein_coding start_codon 720729 720731 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "1"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
        chr1 protein_coding exon 722101 722151 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "2"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
        chr1 protein_coding CDS 722101 722151 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "2"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
        chr1 protein_coding exon 723098 723208 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "3"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
        chr1 protein_coding CDS 723098 723208 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "3"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
        chr1 protein_coding exon 725362 725491 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "4"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
        chr1 protein_coding CDS 725362 725491 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "4"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
        chr1 protein_coding exon 726174 726286 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "5"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
        chr1 protein_coding CDS 726174 726286 . + 2 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "5"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
        chr1 protein_coding exon 727512 727598 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "6"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
        chr1 protein_coding CDS 727512 727598 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "6"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
        chr1 protein_coding exon 727876 728057 . + . gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";
        chr1 protein_coding CDS 727876 727986 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN"; protein_id "ENSBTAP00000024326";
        chr1 protein_coding stop_codon 727987 727989 . + 0 gene_id "ENSBTAG00000018278"; transcript_id "ENSBTAT00000024326"; exon_number "7"; gene_name "ATPO_BOVIN"; transcript_name "ATPO_BOVIN";



        Originally posted by Simon Anders View Post
        If you use the full Python framework (as opposed to the the simple htseq-count script) you can adjust such stuff as you need them.

        At the moment, however, HTSeq does not support paired-end reads at all. I'll add this functionality very soon, hopefully next week.

        Your data would be very useful as test data. Could you maybe send me your SAM file (or parts of it)?

        By the way, last time I tried, Bowtie was unable to deal with overlapping ends from the same fragment and did not align these at all. Have you used another software, or did they improve that?

        Cheers
        Simon
        Attached Files

        Comment


        • #19
          Hi wenhuang

          I've now added paired-end support for the SAM fomat to HTSeq and htseq-count. It would be great if you could download the new version 0.4.1 of HTSeq, try to do your counting job with it and tell me whether it worked correctly.

          Cheers
          Simon

          Comment


          • #20
            Hi Simon,

            Thank you for your work. Unfortunately, although I am able to get the yeast data provided on your website to work with htseq-count, I am not able to get my own data to work. The SAM file I used is two posts ago and I attach the GTF file to this post.

            The error message I got was as below.

            Again, thank you for your help.

            Traceback (most recent call last):
            File "/usr/local/bin/htseq-count", line 5, in <module>
            pkg_resources.run_script('HTSeq==0.4.1', 'htseq-count')
            File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/pkg_resources.py", line 489, in run_script

            File "/System/Library/Frameworks/Python.framework/Versions/2.5/Extras/lib/python/pkg_resources.py", line 1207, in run_script
            path = self.module_path
            File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/EGG-INFO/scripts/htseq-count", line 5, in <module>
            HTSeq.scripts.count.main()
            File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/HTSeq/scripts/count.py", line 156, in main
            opts.mode, opts.featuretype, opts.idattr, opts.quiet )
            File "/Library/Python/2.5/site-packages/HTSeq-0.4.1-py2.5-macosx-10.5-i386.egg/HTSeq/scripts/count.py", line 68, in count_reads_in_features
            if iv.chrom not in features.step_vectors:
            AttributeError: 'NoneType' object has no attribute 'chrom'


            Originally posted by Simon Anders View Post
            Hi wenhuang

            I've now added paired-end support for the SAM fomat to HTSeq and htseq-count. It would be great if you could download the new version 0.4.1 of HTSeq, try to do your counting job with it and tell me whether it worked correctly.

            Cheers
            Simon
            Attached Files

            Comment


            • #21
              read counting

              Hi Simon,

              I just want to clarify something about the gtf/gff formating and the counting by HTseq

              My gff looks like this:

              Code:
              chr1	taeGut1_ensGene	gene	8373168	8392590	.	+	.	gene_id=ENSTGUT00000007148;Name=
              chr1	taeGut1_ensGene	mRNA	8373168	8392590	.	+	.	gene_id=ENSTGUT00000007148;Name=;Parent=ENSTGUT00000007148
              chr1	taeGut1_ensGene	exon	8373168	8373328	.	+	.	gene_id=ENSTGUT00000007148.;Name=;Parent=ENSTGUT00000007148
              chr1	taeGut1_ensGene	exon	8374845	8375006	.	+	.	gene_id=ENSTGUT00000007148.;Name=;Parent=ENSTGUT00000007148
              chr1	taeGut1_ensGene	exon	8383527	8383695	.	+	.	gene_id=ENSTGUT00000007148.;Name=;Parent=ENSTGUT00000007148
              chr1	taeGut1_ensGene	exon	8392390	8392590	.	+	.	gene_id=ENSTGUT00000007148.;Name=;Parent=ENSTGUT00000007148
              THe HTseq documentation specifies that the default is counting by exon, yet the output from for HTseq corresponds to the ensembl #s. (One number is output per ensembl id). Is this because I do not have unique ids for each exon? Or is it actually counting across the transcript? IN order to ACTUALLY count by exon, would I just have to include unique ids for each? (and remove the mRNA and and gene lines?) Does anyone have an easy way to to assign unique ids for each exon? thanks!

              Comment


              • #22
                Hi Chris,

                the default that you mention -- gene IDs but exon features -- is meant to count how many reads fall onto each gene. I put this as default as it seemed to me to be the most common use case.

                Many users might prefer to count by transcript but this is actually a rather involved problem. If a read maps to an exon shared by many transcripts how do you decide which one to use? If you have a good algorithm to decide this, HTSeq gives you the tools to implement it conveniently, but I don't have one handy (am I'm eager to see what the cufflink people came up with once their paper is out).

                I also have a script to count by exons, which I can give you if you need it. Giving each exon in the GTF file a unique ID is not a problem: just concatenate the transcript ID with the exon number. The problem is rather that each exon appears multiple times in an Ensembl GTF file, namely once for each transcript in which it appears. If you attempt to collapse all these multiple copies of an exon to a single entity, you'll notice that it is quite common for an exon to have have variants: Outer exons may have varying outer borders due to alternative transcription starts and ends (i.e., UTRs of varying length), and sometimes introns are not spliced out, i.e., an exon-intron-exon sequence in one transcript is just one long exon in the other one.

                Hence, your job is first to decide what is, for your specific application, the right way to count these cases. Once you know this, coding it in Python with HTSeq is easy.

                Simon

                Comment


                • #23
                  Originally posted by wenhuang View Post
                  Thank you for your work. Unfortunately, although I am able to get the yeast data provided on your website to work with htseq-count, I am not able to get my own data to work. The SAM file I used is two posts ago and I attach the GTF file to this post.
                  I've fixed the bug.

                  Comment


                  • #24
                    counting reads

                    Hi Simon,

                    You might be able to tell, but I'm working with a "new" and fairly poorly annotated genome. So we actually don't have much info at all about alternative transcripts in ensembl. So I'm willing to be a little sloppy about how reads are assigned to exons shared among transcripts (assign to both). In fact it is because I suspect that assigning reads is going to be really tricky that I just want to test for differential expression of exons to identify some candidates for alternative splicing. I'm trying out cufflinks predictions as well...

                    But if you think this makes sense, I'd be thankful to use your script. If there any key format specs just let me know. would you post it here? Or somewhere where I can download?

                    Thanks,

                    Chris

                    Comment


                    • #25
                      Hi Simon,

                      this sounds like a very simple question, but I have to ask anyway...

                      The gene counts file in the DESeq vignette gives one gene count per gene, per lane, or seems to at least. Can I assume that this is a summed count over all exons of that gene? if so, then is there a need to worry about alternative splicings, where the same gene contains different exons? Do they all get summed to give a final, single overall value for the gene count, and is that ok...? Or am I missing something very basic?

                      Comment


                      • #26
                        Originally posted by blackgore View Post
                        The gene counts file in the DESeq vignette gives one gene count per gene, per lane, or seems to at least. Can I assume that this is a summed count over all exons of that gene? if so, then is there a need to worry about alternative splicings, where the same gene contains different exons? Do they all get summed to give a final, single overall value for the gene count, and is that ok...? Or am I missing something very basic?
                        The short answer:

                        No, you described it accurately.

                        When a gene's splicing ratios change, its expression strength surely will not stay constant either, so I am not overly worried about mistaking changes in splicing for changes in expression.

                        ---

                        The long answer:

                        I wonder whether your being "worried" about alternative splicing might be due to a confusion about who has 'burden of proof".

                        You see, DESeq performs hypothesis testing against the null hypothesis of no change. Our aim is to reject the null hypothesis, i.e., to get convinced that, for a given gene, expression is different in the two experimental condition. Due to noise, we worry that an apparent difference is just a fluke of this noise, and hence, the expression difference has to be either strong enough to stand out against the background noise, or we need many replicates too keep the noise down.

                        Changes in isoform proportion are much harder to detect than changes of overall expression levels. So, if the concentration of transcripts of a gene goes down by a quarter, this might easily stick out from noise, even if we have only two or three replicates. However, if the overall expression levels stays as it is, but the ratio of isoform A to isoform B changes from 3:2 to 2:3, this causes more subtle changes in the counts and, in my opinion, two or so lanes per condition will usually be insufficient to be sure that such an observation is a biological effect and not just noise.

                        This is why we recommend to first go for counts by gene, summing over all exons. You will miss out on alternative splicing events but you will find something.

                        Strictly speaking, you might mistake alternative splicing for differential expression: if a gene contains a cassette exon that is present in nearly all transcripts in condition A and absent in nearly all transcripts in condition B, you will notice a difference in overall counts. Hence, what DESeq gives you is genes with changes in expression, which will typically be changes in overall expression strength but may also include changes in splicing patterns.

                        To distinguish this, you can also count by exon rather than by gene, and check for each exon whether its expression changes with experimental condition. If all exons of a gene change the same way, overall expression has changed but splicing ratios have maybe not, but if one exon falls out, it may be a cassette exon whose presence is influenced by your experimental condition.

                        Using DESeq with such a "counts by exon" table works fine as well. However, you have less counts per exon than per gene, and hence, you have less power to detect anything.

                        Comment


                        • #27
                          sorry Simon, I had meant to reply much sooner than this - thanks for your response, it was altogether very useful!

                          Comment


                          • #28
                            coverageBed question

                            I try to run coverageBed like this:
                            /home/ay55/BEDTools/bin/genomeCoverageBed -ibam 1382_1_sorted.bam -i hg18.bed.sorted -g hg18.genome > hg18_seq.cov.txt

                            no error at all, but when I open hg18_seq.cov.txt, I got this
                            read block failed - CheckBlockHeader() returned false
                            Could not read header type

                            Any idea?




                            Originally posted by Auction View Post
                            I tried to run coverageBed for my BAM file from pair-end RNA-seq. When I compared the result in a certain gene to my calculation using Bio:B::Sam. It seems coverageBed don't consider the mate properly paired flag in the BAM file. Are there any options control for it? Thanks

                            Comment


                            • #29
                              Originally posted by Siva View Post
                              Simon:
                              You would be surprised to know that one of the clusters I use needs Python 2.3!! Hope fully they update it soon. Fortunately we have another cluster but some one else has to help me load HTseq on it as I do not have access to it.

                              Siva
                              could you create the ENV on your home directory and add the path to the $PATH?

                              Comment


                              • #30
                                how does HTSeq proccess pair-end RNA-seq data

                                Dear Simon,
                                Very haoppy to use HTSeq and DEseq developped by you. I have several questions to ask you about.
                                1. How does HTSeq count pair-end reads by genes when both singleton and paired reads are presented? If the count is based on fragment, then paired two reads are counted by one, and singleton also counted by one, right?
                                2. How does HTSeq treats multiple mapped reads? Are they just counted several times by different genes?

                                sincerely tangx

                                Originally posted by Simon Anders View Post
                                Hi wenhuang

                                I've now added paired-end support for the SAM fomat to HTSeq and htseq-count. It would be great if you could download the new version 0.4.1 of HTSeq, try to do your counting job with it and tell me whether it worked correctly.

                                Cheers
                                Simon

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Advancing Precision Medicine for Rare Diseases in Children
                                  by seqadmin




                                  Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                  12-16-2024, 07:57 AM
                                • seqadmin
                                  Recent Advances in Sequencing Technologies
                                  by seqadmin



                                  Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                  Long-Read Sequencing
                                  Long-read sequencing has seen remarkable advancements,...
                                  12-02-2024, 01:49 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 12-17-2024, 10:28 AM
                                0 responses
                                26 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 12-13-2024, 08:24 AM
                                0 responses
                                42 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 12-12-2024, 07:41 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 12-11-2024, 07:45 AM
                                0 responses
                                42 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X