
  • Newbie question regarding mapping of RNA-seq data

    Hi all,
    I'm stuck with a total newbie problem here. I'm analyzing RNA-seq data from mouse. I mapped the (paired-end) reads with TopHat against mm9 (using Bowtie 1), but when I look at the SAM output files, the hits list chromosomes as the mapping targets, whereas I'm interested in gene IDs. I'm assuming I missed something trivial?

  • #2
    Since you mapped against the genome, you need to summarize the alignments with a program such as featureCounts or htseq-count, together with an annotation file that translates the alignments you have into counts per gene/exon (or any other feature included in the annotation file).

    You could have also provided that annotation file to TopHat (when you ran it) if you only wanted to look at the transcriptome (instead of the whole genome).
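To make the "summarize the alignments" step a bit more concrete, here is a minimal Python sketch of the idea. It is not how featureCounts or htseq-count are implemented: the gene names and coordinates are made up, only the alignment start position is checked, and real tools also consider the full aligned span, exons, strand and paired-end fragments.

```python
from collections import Counter

# Hypothetical toy annotation: gene_id -> (chromosome, start, end), 1-based inclusive.
# A real run would take these intervals from a GTF/GFF annotation file.
GENES = {
    "GeneA": ("chr1", 100, 500),
    "GeneB": ("chr1", 800, 1200),
    "GeneC": ("chr2", 50, 300),
}


def count_per_gene(alignments, genes=GENES):
    """Count alignments whose start position falls inside exactly one gene.

    alignments: iterable of (chromosome, position) pairs, i.e. the kind of
    information held in the RNAME and POS columns of a SAM file.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, pos in alignments:
        hits = [
            gene_id
            for gene_id, (gene_chrom, start, end) in genes.items()
            if gene_chrom == chrom and start <= pos <= end
        ]
        if len(hits) == 1:  # skip unassigned and ambiguous alignments
            counts[hits[0]] += 1
    return dict(counts)


# Made-up alignments in genomic coordinates.
alignments = [("chr1", 150), ("chr1", 450), ("chr1", 900), ("chr2", 10), ("chr2", 299)]
gene_counts = count_per_gene(alignments)
print(gene_counts)
```

The point is only that the BAM/SAM file stays in genomic coordinates, and the annotation is what turns positions into per-gene counts.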


    • #3
      That's what should happen. Your next step is to get counts of aligned fragments per gene, for which you can use featureCounts or htseq-count. Both of those expect exactly what you have as input.

      Edit: GenoMax beat me by a minute. I should note that mapping against the transcriptome with TopHat still produces alignments in genomic coordinates.
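If it helps to see why the SAM file lists chromosomes, here is a small Python sketch that splits one alignment line into the 11 mandatory SAM columns. The record itself is invented for illustration; the column names follow the SAM specification.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Return the mandatory fields of one SAM alignment line as a dict."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    record["FLAG"] = int(record["FLAG"])
    record["POS"] = int(record["POS"])
    return record


# Made-up alignment record (not taken from any real data set).
example = "\t".join(
    ["read1", "99", "chr1", "3000150", "50", "50M", "=", "3000300", "200",
     "A" * 50, "I" * 50]
)
record = parse_sam_line(example)

# RNAME is the reference sequence the read was aligned to. After mapping to a
# genome that is a chromosome name; gene IDs only appear after counting.
print(record["RNAME"], record["POS"])
```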


      • #4
        Thanks guys!! Appreciate it! I'll give it a try.
        Thanks again!


        • #5
          Using featureCounts gives me nice summary and counts text files. However, my SAM and BAM files still contain the original, genomic, annotations (obviously). Ideally, I would like to convert the annotations in the BAM/SAM files so that I can further process them.

          This leads me to a broader question: what reference (for mouse RNA-seq) do people use when they want gene IDs instead of genomic targets? I noticed that reference files such as mRNA.fa or refMrna.fa only contain accession numbers, but not gene IDs.
          Thanks in advance


          • #6
            Gene IDs, names, and numbers vary depending on the database in question. You can either get a translation table or try to find a FASTA file that is already named with the identifiers you want to use.
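As a rough illustration of the translation-table route, the Python sketch below assumes a two-column, tab-separated table (accession, then gene ID) and rewrites FASTA headers with it. The accessions, gene names and table layout are hypothetical; a real table from a database may look different.

```python
def load_translation_table(lines):
    """Build {accession: gene_id} from tab-separated 'accession<TAB>gene_id' lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        accession, gene_id = line.split("\t")[:2]
        table[accession] = gene_id
    return table


def rename_fasta_headers(fasta_lines, table):
    """Yield FASTA lines with each header's accession replaced by its gene ID.

    Accessions missing from the table are kept unchanged.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            yield ">" + table.get(accession, accession)
        else:
            yield line


# Hypothetical identifiers, for illustration only.
table = load_translation_table(["# accession\tgene_id", "ACC0001\tGeneA", "ACC0002\tGeneB"])
fasta = [">ACC0001 some description", "ACGT", ">ACC0003", "TTGA"]
renamed = list(rename_fasta_headers(fasta, table))
print("\n".join(renamed))
```

Note that several accessions can map to the same gene, so after renaming the headers are not guaranteed to be unique.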


            • #7
              Originally posted by analog900:
              Using featureCounts gives me nice summary and counts text files. However, my SAM and BAM files still contain the original, genomic, annotations (obviously). Ideally, I would like to convert the annotations in the BAM/SAM files so that I can further process them.
              What is "further processing" referring to here? Most downstream analysis is going to use the counts files (unless you are going to call SNPs from this data) and will always refer to the gene names contained in that file.


              • #8
                Originally posted by GenoMax:
                What is "further processing" referring to here? Most downstream analysis is going to use the counts files (unless you are going to call SNPs from this data) and will always refer to the gene names contained in that file.
                I've been loosely following the "simple fool's guide for rna seq" by the group of Stephen Palumbi (http://sfg.stanford.edu/guide.html). They parse their SAM output files with a series of Python scripts to obtain summary statistics similar to the ones I can now get with featureCounts. Then they use DESeq for differential expression analysis (which I would really like to do in order to compare my different samples).


                • #9
                  I would recommend ignoring that guide. If you want to use DESeq (use DESeq2), just directly use the counts from featureCounts. This would be the standard and accepted pipeline and there's no reason to use any kludgy scripts.
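For anyone curious what "the counts from featureCounts" look like, here is a hedged Python sketch that reads the tab-separated featureCounts output into a gene-by-sample table. It assumes the usual layout: a leading comment line starting with "#", then a header with Geneid, Chr, Start, End, Strand, Length followed by one column per BAM file. The file content below is invented. The DESeq2 analysis itself is done in R; this only illustrates the shape of the count matrix.

```python
import csv
import io

# Annotation columns that precede the per-sample counts in featureCounts output.
ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def read_featurecounts(handle):
    """Return (sample_names, {gene_id: [counts per sample]}) from featureCounts text output."""
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    header = next(reader)
    samples = header[len(ANNOTATION_COLUMNS):]
    counts = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        counts[row[0]] = [int(value) for value in row[len(ANNOTATION_COLUMNS):]]
    return samples, counts


# Made-up example in the featureCounts layout (values are not real data).
toy_output = (
    "# Program:featureCounts; Command:(omitted)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "GeneA\tchr1\t100\t500\t+\t401\t10\t12\n"
    "GeneB\tchr1\t800\t1200\t-\t401\t0\t3\n"
)
samples, counts = read_featurecounts(io.StringIO(toy_output))
print(samples)
print(counts)
```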


                  • #10
                    Originally posted by dpryan:
                    I would recommend ignoring that guide. If you want to use DESeq (use DESeq2), just directly use the counts from featureCounts. This would be the standard and accepted pipeline and there's no reason to use any kludgy scripts.
                    Thank you. Really appreciate it! Can you recommend any other standard/accepted pipelines downstream of featureCounts?


                    • #11
                      We use limma/voom and edgeR in downstream analyses to discover differentially expressed genes. The link below is a short tutorial for using our pipeline for analyzing RNA-seq data which you might find helpful:


                      • #12
                        For DESeq2, you would use the DESeqDataSetFromMatrix function to start the analysis, using the count matrix returned by featureCounts. An example of starting from a count matrix is in the DESeq2 vignette.


                        • #13
                          Thanks so much guys!
                          Working through the DESeq2 vignette now and learning new stuff... really excited!
                          Thanks again!
