Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTseq count to DESeq (?)

    Hi all,

    I have a little question concerning HTSeq and DESeq.

    I am dealing with some bacterial RNAseq data (control + test, 2 biological replicates each). Data were obtained from Illumina, single-end 50bp.

    I was able to map these reads to my reference genome and generate SAM and BAM files for the alignment. Now, I would like to get some statistics done over it in order to get a list of genes that are statistically up or down regulated in between my control and test condition. As I saw here, I should be able to do that with DESeq package. So, in order to use DESeq I need a well-organized table with genes count. I thought about using HTSeq for getting such a input table. So, I installed HTSeq and I run my first SAM file with following command line:

    htseq-count -m intersection-nonempty -s no -t gene -i locus_tag alignWT1.sam NC_013316.gff -o WT1_out

    After the run was complete, I had printed in the command line genes locus_tag with genes count and also

    no feature: 9034352
    ambiguous: 958299
    not aligned: 2097930

    That's okay (at least I think) since I had more then 37 millions reads for this replicate.

    Now, the problem is what to do next! I looked at the header of my WT1_out file, and there is way more information then it should be if I want to use this table for DESeq. So, here are my questions:
    • how to sort the "out" file in order to obtain a nice table required by DESeq? I read in another thread that I could use:

      sort -g -r -k 2 <outfile>

      Is that a good way to procede?

    • Even if I get that nicely organized table with HTSeq, that would be for only one replicate. How can I manage to get a table containing all of my conditions?


    Any help would be more then welcome. Thanks in advance guys!

  • #2
    You don't need to do any sorting of the output files prior to using them in DESeq (the output files should actually be in the same order). I usually merge them into a single file and alter the header line to make things easier. I wrote a small program in C to do that for me (PM me if you want a copy, though it'd be very simple to write yourself in any language), but you could even just use Excel or Openoffice if you prefer (in reality, you could also just do it in R by merging the various data frames).

    Comment


    • #3
      Thank you for your replay!

      As I said, I'm a mere beginer in all aspects of bioinformatics, so writing scripts in C, R, Python and so on is out of reach for me (hopefully one day...). Also, the generated <out> files are text files of 15-30 Gb, so trying to manipulate them with Excell or any other conventional software is a suicide.

      I could use your C program (you run that under windows right?) if it's not too complicated or any other script for which I could get instructions how to use it.

      By the way, here is the header of my <out> file; does it sounds okay?

      ognjen@ThePresident-Studio-XPS-7100:~/Documents/RNAseq$ head WT2_out
      HWI-ST766:125:C0PRMACXX:6:1101:1955:2481 0 gi|260211391|emb|FN545816.1| 155731 255 50M * 0 0 ACCANCAGAAATAGTTGAAAAAGCTAAAAAAATTATTGCCGGAGAATTGG @@@D#2ABFDHHBHABHGGIIIGGFHEGHIHH@DGIIHIHIFAFHCC@GG XA:i:1 MD:Z:4G45 NM:i:1 XF:Z:CDR20291_0115
      HWI-ST766:125:C0PRMACXX:6:1101:1915:2481 0 gi|260211391|emb|FN545816.1| 439745 255 50M * 0 0 GTAANGGTGAGAGAGTAACAGTAAGTCTTTTCCATACAGCTATATATGGA @@@F#2A=CFHHHJJCGHIJJCHIHEHGIIJIJIIIICHIGIIIIGIJJJ XA:i:1 MD:Z:4A45 NM:i:1 XF:Z:CDR20291_0366
      HWI-ST766:125:C0PRMACXX:6:1101:1939:2483 16 gi|260211391|emb|FN545816.1| 251919 255 50M * 0 0 ACCTAGTTAAGTAAATGTGTAGGCTGAGTTATACTCCCTATGTGTTGGCT 0GD?HHCGBDGFGIIGGFIHEEE@HC<FHF<A2GBGHDDDHHDDDDD@@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:no_feature
      HWI-ST766:125:C0PRMACXX:6:1101:2708:2478 16 gi|260211391|emb|FN545816.1| 233067 255 50M * 0 0 TCAGAAGGTTCTTATGCTATATACAATGAAAATGGTTTAGATGGTNAAGC IIJJJJJIJJIJJIJJJJJJJIIJJJJJJJJJIJJJJHHHHHDA2#FCCC XA:i:1 MD:Z:45C4 NM:i:1 XF:Z:CDR20291_0180
      HWI-ST766:125:C0PRMACXX:6:1101:2238:2479 0 gi|260211391|emb|FN545816.1| 94704 255 50M * 0 0 CTAGNAAAAGTTGGTAGATATAAATTCAATAAGAAACTTGCTTTATGTTA CCCF#4ADHHHHHJCGHIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHII XA:i:1 MD:Z:4C45 NM:i:1 XF:Z:CDR20291_0060
      HWI-ST766:125:C0PRMACXX:6:1101:1837:2494 16 gi|260211391|emb|FN545816.1| 720993 255 50M * 0 0 ATATCTACTTCACTAGATTTTTCCTATGAACCTCTCTATGGAATAGATAG ##F<?:1?1*4D:*?3CGHFHBGH>FEBDA?<EAFBHFFHFD4ADDD@@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:CDR20291_0584
      HWI-ST766:125:C0PRMACXX:6:1101:1993:2495 16 gi|260211391|emb|FN545816.1| 444348 255 50M * 0 0 GGTGTTGCTGCCTTAGCTCTAGGAATAGCTCAAGGTGCTTTAGATGAAGC JIJIJJJIGIJJJJJIJIJJJJIIJJJJJJJJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:50 NM:i:0 XF:Z:CDR20291_0370
      HWI-ST766:125:C0PRMACXX:6:1101:2798:2478 1:N:0:CGATGT 4 * 0 0 * * 0 0 CGGCNTGCTTTGAACACTCTAATTTTTTCAAAGTAAAAGTCCTGGTTTGC CCCF#2ADHHHHHJJJJJJJJJJJJJJJJJJJJGHIJJJGIJJJJGHIII XM:i:0 XF:Z:not_aligned
      HWI-ST766:125:C0PRMACXX:6:1101:2596:2496 16 gi|260211391|emb|FN545816.1| 1138777 255 50M * 0 0 AGTCACCCACCTGGGAGTGCCTAGGGTGTGAATTTACTTAGCCACCGGTG JIGGFJHGBJIGJJIGIEDJJJIJHGIJJJJGIIHFJHHHHFDDFDFC@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:no_feature
      HWI-ST766:125:C0PRMACXX:6:1101:3697:2475 16 gi|260211391|emb|FN545816.1| 623244 255 50M * 0 0 ATGTTATAAAAAGAAAATATGAACAATTATGTGAAAATTTAGATANTAAT IIIJHIGJJJJJGJJJIHIIIIIJJJIIIIIGJJJJJHHHHHDA2#FCC@ XA:i:1 MD:Z:45G4 NM:i:1 XF:Z:CDR20291_0513

      what do you think?

      Comment


      • #4
        Ah, I misunderstood where your confusion was. That file that you were looking at is of no use to you in this case. Those are simply the reads that were kept by htseq-count. You need to redirect the output of htseq-count to a file, which will be much smaller than the one you're looking at and will be easier for you to deal with. So:
        Code:
        htseq-count -m intersection-nonempty -s no -t gene -i locus_tag alignWT1.sam NC_013316.gff > WT1_out

        Comment


        • #5
          Great!!! I'm going to try it right away!

          Comment


          • #6
            Oh my God, it does work! I love you Ryan!

            I'm going to try DESeq soon, hopefully, there won't be any problems!

            Thanks again!!!!

            Comment


            • #7
              Originally posted by dpryan View Post
              Code:
              htseq-count -m intersection-nonempty -s no -t gene -i locus_tag alignWT1.sam NC_013316.gff > WT1_out
              Could you explain your code, would be also quite useful for me?
              thanks in advance

              Comment


              • #8
                Originally posted by ekimmike View Post
                Could you explain your code, would be also quite useful for me?
                thanks in advance
                Which code in particular would you like to know about? Merging outputs from htseq-count?

                Comment


                • #9
                  I am not sure how to keep all different featuretypes using -t

                  second question is about ID, as I'd to keep more then one ID, eg. gene_id and/or transcript_id
                  should I run HTseq twice first time having sam output and so on...

                  Comment


                  • #10
                    Originally posted by ekimmike View Post
                    I am not sure how to keep all different featuretypes using -t

                    second question is about ID, as I'd to keep more then one ID, eg. gene_id and/or transcript_id
                    should I run HTseq twice first time having sam output and so on...
                    Ah, I assumed you were asking about code, rather than how to use htseq-count, which I have nothing to do with aside from being one of very many users.

                    Normally you wouldn't want to do that. I would recommend that you describe exactly what you're trying to do rather than us trying to figure it out from your so-far limited description.

                    Comment


                    • #11
                      Thanks,

                      so this i what I get:
                      DDX11L1 17


                      and what I'd to have:
                      gene_id my_id feature counts
                      DDX11L1 145793 exon 17
                      13248 ncRNA 29
                      LOC100132062 557120 lincRNA 15

                      this is because I have merged various annotation types so sometimes there is no gene_id, can this be specified in HTSeq, as now I have to run it twice and merge tables

                      Comment


                      • #12
                        So, in other words, your annotation file is kind of a mess and you're hoping to finagle a way around that with htseq-count. Perhaps you'll have luck using using the python interface directly, I've never needed to do that. Alternatively, you might do yourself a favour for the future and simply write a script to clean up your annotation file (if an entry lacks a gene_id, then add one, if it lacks an exon feature, then add such an entry, etc.).

                        Comment


                        • #13
                          Hi guys,

                          I was using the Cufflinks->CuffDiff protocol, but I'm now going to try and use DESeq. I'm having the same problem as in the original post.

                          Are you telling me that the way you have to run it to use DESeq is by using "> WTout" instead of "-o WTout"? If so, why didn't Simon put that in the HTseq website?


                          The vignette for DESeq is excellent and I just went through it, but now I want to run this with my data, and I haven't found a clear way to do so.

                          I converted all my Tophat files to SAM and then sorted all of my outputted Tophat files by name. I then ran HT-Seq count on them using this code:
                          python -m HTSeq.scripts.count -s no -o WT1 accepted_hits.sam annotation.gtf

                          So the outputted files were huge, even after I filtered by removing "no feature, ambiguous, etc." by using grep.

                          Now I want to use DESeq to analyze it, but I don't know the next steps. Do I need to rerun it using the ">" instead of "-o" option? After that, I guess I can then use Ethanol's R script to merge the files together?
                          Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
                          Last edited by billstevens; 09-04-2012, 07:32 AM.

                          Comment


                          • #14
                            Hi,

                            Yeah, you have to use > FileOUT, and I have no clue why it isn't in the vignette. As you said, vignette is well written although some details are missing...

                            Anyway, if you use > FileOUT you will get a small (less then 1Mb) text file with all your reads. For merging, I simply used excel and then saved as text ou tab-delimited text, don't remember any more. And then, you can use DESeq in R. Good luck

                            Comment


                            • #15
                              Well that could have saved me a week of time! On the other hand, what's a week when I've been working on this for almost a year?

                              Thanks!
                              Last edited by billstevens; 09-04-2012, 08:21 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Advancing Precision Medicine for Rare Diseases in Children
                                by seqadmin




                                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                12-16-2024, 07:57 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin



                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has seen remarkable advancements,...
                                12-02-2024, 01:49 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 12-17-2024, 10:28 AM
                              0 responses
                              39 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-13-2024, 08:24 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-12-2024, 07:41 AM
                              0 responses
                              38 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 12-11-2024, 07:45 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X