Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2 SummarizedExperiment help

    Hi!

    I feel totally stupid, but I have 48 BAMs after alignment with Tophat2 and want to try DESeq2 for diff.exp. analysis.

    I dont understand what "SummarizedExperiment" is and how to create it? I have googled to death, but now I have to ask here.

    Thank you so much, this is a new field to me.

  • #2
    Have you used htseq-count yet to get counts?

    Comment


    • #3
      Hi!
      Do I have to? Im stuck here:



      Im trying the "2 A First Example", but I dont understand what "features <- GRanges" is (seqnames/IRanges).

      Im also not quite sure how to type in "group_id": I have 2 different groups at two timepoints (as you already know).

      Thank you very much!
      Last edited by sindrle; 10-16-2013, 11:30 AM.

      Comment


      • #4
        Well, you don't have to do anything you don't want to do :P However, it's generally a bit easier to just use a quick command line script than to try to deal with GenomicRanges. The later can do it, but has a much steeper learning curve and will just produce the same results (unless you're trying to do something unusual).

        Regarding your GRanges question, seqnames would be things like "chr1" (UCSC chromosome naming scheme) or "10" (Ensembl naming scheme). IRanges are genomic bounds, so just the 5' and 3' boundaries of an exon/CDS/whatever. In general, you can get that stuff all made for you by reading in a GTF or GFF file (I've used import.gff from rtracklayer). The "group_id" in the example is more of a place-holder for gene_id later on. In essence, it's probably meant to be the actual feature level that's counted over (though the don't do that in the example), so you don't actually need it that example. In a real case, you'd read in your annotation file into a GRanges object and then
        Code:
         split(GRangesObject, mcols(GRangesObject)$label_of_interest)
        to get a GRangesList that you would then summarizeOverlaps on. Here, label_of_interest would probably be the gene_id field of the annotation file (though it could be something else, such as sprintf("%s:E%03d", mcols(blah)$gene_id, mcols(blah)$exon), which is what you'd do for DEXSeq counts).

        Comment


        • #5
          Thank you so much!

          I installed HTseq and have started HTseq-count on all 48 BAMs.

          From one BAM, is this a problem?:

          no_feature 7984183
          ambiguous 521840
          too_low_aQual 0
          not_aligned 0
          alignment_not_unique 13730878


          And also, can you compress the HTseq-count output? Its 16gb for each file..
          Last edited by sindrle; 10-17-2013, 05:46 AM.

          Comment


          • #6
            Ah, you're saving the wrong file, don't use -o. The command to execute is something like:
            Code:
            samtools view namesorted.bam | htseq-count -m intersection-nonempty -s no -a 10 - GTF_FILE > namesorted.counts
            That file will be much much much smaller and is all that you need.

            Comment


            • #7
              I should add, "namesorted.bam" and "GTF_FILE" above are just place-holder names. You'll want to change "namesorted.counts" to something meaningful so you can keep track of which counts are associated with which sample. Finally, those options for htseq-count are just examples, you can change them to fit your needs.

              Comment


              • #8
                Great!
                I aborted the run and updated the CMD as you wrote it.

                I was using the "union" option, but changed it to "intersection-nonempty". But I dont actually know why.. :P

                Thanks again!
                Last edited by sindrle; 10-17-2013, 06:45 AM.

                Comment


                • #9
                  Originally posted by sindrle View Post
                  Great!
                  I aborted the run and updated the CMD as you wrote it.

                  I was using the "union" option, but changed it to "intersection-nonempty". But I dont actually know why.. :P

                  Thanks again!
                  take a look here to see the difference between the union and the intersection-nonempty.

                  It basically tells htseq-count how to deal with reads matched to two genes, or what to do if the overlap isn't complete.

                  Comment


                  • #10
                    Thanks I have looked at that one, but what is the implications for differential expression?

                    Comment


                    • #11
                      As far as I understand it, depends on the option you take you'll be counting more or less reads.

                      If I am not mistaken, reads marked as ambiguous are not taken into consideration in the count of read numbers. So if you take union you'll have more ambiguous reads than the intersection options.

                      BTW, take a look at featureCounts from the sub read package. It works also very good with the data. It finds more results than htseq-counts and works faster.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM
                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      22 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      24 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      19 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      52 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X