Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • read counts per transcript, edgeR, tophat

    Hi everyone,

    I'm using Illumina RNAseq to test for differential expression between two conditions. I have successfully mapped reads with tophat/bowtie, and used cufflinks/cuffcompare to test for differential expression.

    A number of methods (edgeR, DEGseq, DESseq), however, ask for raw read counts rather FPKM.

    Is there an easy way to go from tophat output (say the bedgraph or .sam) to raw read counts for each transcript (say using a gtf that I would supply)?

    Thanks!

  • #2
    If I understand your question correctly, BEDTools has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

    For example:
    Code:
    $ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt
    There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.

    Comment


    • #3
      Here is the script that I made for this purpose: http://www-huber.embl.de/users/ander...doc/count.html

      Simon

      Comment


      • #4
        Yeah, I have used bedtools for similar analysis and it is quite intuitive

        Originally posted by quinlana View Post
        If I understand your question correctly, BEDTools has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

        For example:
        Code:
        $ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt
        There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.
        --
        bioinfosm

        Comment


        • #5
          Originally posted by Simon Anders View Post
          Here is the script that I made for this purpose: http://www-huber.embl.de/users/ander...doc/count.html

          Simon
          Simon, are there any specific reason to use Python2.5, because the cluster I can use only have Python2.4. And I trid to build the source codes using Python 2.4. When I ran "python setup.py build" I got the warning message which doesn't matter.

          /usr/lib64/python2.4/distutils/dist.py:236: UserWarning: Unknown distribution option: 'requires'
          warnings.warn(msg)

          However, when I ran "python setup.py install --prefix=...", I got the error message

          byte-compiling $HOME/librarys/lib64/python2.4/site-packages/HTSeq/StepVector.py to StepVector.pyc
          File $HOME/librarys/lib64/python2.4/site-packages/HTSeq/StepVector.py", line 390
          start = index.start if index.start is not None else self.start_index()
          ^
          SyntaxError: invalid syntax

          The reason seems to that "Conditional expressions" is not supported in Python2.4, how can I change the codes without upgrading to Python 2.5

          Thanks

          Ying
          Last edited by Auction; 04-09-2010, 12:46 PM.

          Comment


          • #6
            When I started developing HTSeq, I had to make a decision on which Python version to require. I refrained from using any Python 2.6 features as many people still work with 2.5, but I didn't expect many 2.4 installations to be still around.

            Python 2.5 brought a few nifty feature but I think I didn't use too many of them, and most are only syntactic sugar anyway. I guess that not too many changes would be needed to make it run on 2.4

            But wouldn't it be much easier if you convinced your cluster admin to update the system? (Or install it yourself in your home directory: This takes only a couple of minute and Python typically builds from source out of the box.)

            Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.

            Comment


            • #7
              Originally posted by Simon Anders View Post

              Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.
              Simon:
              You would be surprised to know that one of the clusters I use needs Python 2.3!! Hope fully they update it soon. Fortunately we have another cluster but some one else has to help me load HTseq on it as I do not have access to it.

              Siva

              Comment


              • #8
                Originally posted by Simon Anders View Post
                When I started developing HTSeq, I had to make a decision on which Python version to require. I refrained from using any Python 2.6 features as many people still work with 2.5, but I didn't expect many 2.4 installations to be still around.

                Python 2.5 brought a few nifty feature but I think I didn't use too many of them, and most are only syntactic sugar anyway. I guess that not too many changes would be needed to make it run on 2.4

                But wouldn't it be much easier if you convinced your cluster admin to update the system? (Or install it yourself in your home directory: This takes only a couple of minute and Python typically builds from source out of the box.)

                Python 2.5 was released 4 years ago, and by now, the release version is 2.6.5, and 2.7 is in alpha. You shouldn't need to waste your time to get software to run in such an out-of-date system.
                The reason I stick on Python 2.4 is that several clusters I can use all use Rock 5.2+Python2.4. By the way, I upgraded my workstation to Python2.6 and installed HTSeq, but it doesn't indicate support for paired-ends. Are there specific options for paired-ends? Thanks

                Comment


                • #9
                  Originally posted by quinlana View Post
                  If I understand your question correctly, BEDTools has a utility called coverageBed that will compute the raw counts of reads in a BAM file among features in a BED or GTF file.

                  For example:
                  Code:
                  $ coverageBed -abam alignedReads.bam -b transcripts.gtf > transcript.cov.txt
                  There is also a histogram (-hist) mode that will give you a histogram of coverage for each feature in the -b file. I have not yet updated the manual , but the help (-h) for this tools describes the output format.
                  I tried to run coverageBed for my BAM file from pair-end RNA-seq. When I compared the result in a certain gene to my calculation using Bio:B::Sam. It seems coverageBed don't consider the mate properly paired flag in the BAM file. Are there any options control for it? Thanks

                  Comment


                  • #10
                    Originally posted by Auction View Post
                    I tried to run coverageBed for my BAM file from pair-end RNA-seq. When I compared the result in a certain gene to my calculation using Bio:B::Sam. It seems coverageBed don't consider the mate properly paired flag in the BAM file. Are there any options control for it? Thanks
                    Hi,
                    You are correct, coverageBed processes every BAM entry in the BAM file provided. The rationale here is that one can use samtools to filter your data before passing to coverageBed. For example, to only count proper pairs, do the following (passing the output of samtools to coverageBed via a pipe, hence the "stdin" for the -abam parameter:

                    Code:
                    $ samtools -f 0x2 <BAM> | coverageBed -abam stdin -b transcripts.bed > transcripts.proper.cov
                    Does this help?

                    Comment


                    • #11
                      Originally posted by quinlana View Post
                      Hi,
                      You are correct, coverageBed processes every BAM entry in the BAM file provided. The rationale here is that one can use samtools to filter your data before passing to coverageBed. For example, to only count proper pairs, do the following (passing the output of samtools to coverageBed via a pipe, hence the "stdin" for the -abam parameter:

                      Code:
                      $ samtools -f 0x2 <BAM> | coverageBed -abam stdin -b transcripts.bed > transcripts.proper.cov
                      Does this help?
                      I see, thanks.

                      Comment


                      • #12
                        HT-Seq

                        Hi Simon,

                        Sorry to trouble you with with what is surely a silly question. I've just installed HT-Seq from source on my Mac. I've just tried to run a couple of basic commands to see if things were working:

                        python -m HTSeq.scripts.qa s_3_sequence.txt

                        but I'm getting an error message:

                        Code:
                        Traceback (most recent call last):
                          File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/runpy.py", line 85, in run_module
                            loader = get_loader(mod_name)
                          File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 456, in get_loader
                            return find_loader(fullname)
                          File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 466, in find_loader
                            for importer in iter_importers(fullname):
                          File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pkgutil.py", line 422, in iter_importers
                            __import__(pkg)
                        ImportError: No module named HTSeq.scripts
                        I'm running the script from within the HT-Seq directory. I placed the HT-Seq folder within my Applications directory.

                        Any idea what I've done wrong?

                        Comment


                        • #13
                          Originally posted by chrisbala View Post
                          I've just installed HT-Seq from source on my Mac. I've just tried to run a couple of basic commands to see if things were working:

                          python -m HTSeq.scripts.qa s_3_sequence.txt

                          but I'm getting an error message:
                          It seems that Python does not know were you put the HTSeq modules.

                          Once you have unpacked the HTSeq tarball, you should run the commands

                          Code:
                             python setup.py build
                             python setup.py install
                          The second command puts it somewhere where Python should find it.

                          If this does not work, use only the first command. It makes a subdirectory 'build', and in there something like 'lib.macos' (or called similarly), and in there, 'HTSeq'. You can just tell Python to search for HTSeq there by writing

                          Code:
                          export PYTHONPATH=$PYTHONPATH:/<the_path_to_your_directory/build/lib.macos/
                          and Python will find it there.

                          Do not run the script in the directory that you have just unpacked, go somewhere else, as Python otherwise tends to get confused.

                          Comment


                          • #14
                            almost there

                            Thanks Simon,

                            THIngs seem to be working...

                            Will be try DESeq as soon as I get these counts finalized.

                            One question: for the counting script, you ask for a gff, and elsewhere you use gtf. I've noticed in tophat as well that is some cases gffs are requested (or output) and sometimes gtf.

                            Is there are particular aspect of gff that makes it more suitable for this counting algorithm?

                            just curious...

                            Comment


                            • #15
                              Hi Chris,

                              GTF is just a tightening of the GFF specification, i.e., every GTF file is also a GFF file. The tightening is mainly with respect to the "feature type" column: in GTF, it has to be one of "exon", "CDS", "stop_codon", etc., while in GFF, it is not clearly said which words should be used.

                              In the htseq-count script, I give the user the option to tell the script how the features are named, and hence, it works with GFF as well.

                              See here:
                              GFF: http://www.sanger.ac.uk/resources/so.../gff/spec.html
                              GTF: http://mblab.wustl.edu/GTF22.html

                              Simon
                              Last edited by Simon Anders; 04-15-2010, 09:11 AM. Reason: add links

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              50 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X