SEQanswers

Old 01-15-2015, 02:03 PM   #1
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default Newbie question regarding mapping of RNA-seq data

Hi all,
I'm stuck with a total newbie problem here. I'm analyzing RNA-seq data from mouse. I mapped the (paired-end) sequences against mm9 using TopHat (with bowtie1), but when I look at the SAM output files, the hits list chromosomes as map targets, whereas I'm interested in gene IDs. I'm assuming I missed something trivial?
Old 01-15-2015, 02:20 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Since you mapped against the genome, you need to summarize the alignments using a program like featureCounts or HTSeq-count, along with an annotation file, which will translate the alignments you have into counts per gene/exon (or any other feature included in the annotation file).

You could have also provided that annotation file to TopHat (when you ran it) if you only wanted to look at the transcriptome (instead of the whole genome).
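
To make the summarization step concrete, here is a toy Python sketch of what "translating alignments into counts per gene" means. It is not how featureCounts or HTSeq-count are actually implemented: the gene names, coordinates, and the `count_reads_per_gene` helper are all made up for illustration, and it only checks whether a read's leftmost position falls inside a gene interval on the same chromosome.

```python
from collections import Counter

# Hypothetical annotation: (chromosome, start, end, gene_id), 1-based inclusive.
ANNOTATION = [
    ("chr1", 100, 500, "GeneA"),
    ("chr1", 800, 1200, "GeneB"),
    ("chr2", 50, 300, "GeneC"),
]

# Hypothetical alignments as (chromosome, leftmost position), i.e. the
# RNAME and POS columns of a SAM record.
ALIGNMENTS = [
    ("chr1", 150), ("chr1", 450), ("chr1", 900),
    ("chr2", 60), ("chr2", 1000), ("chr1", 600),
]


def count_reads_per_gene(annotation, alignments):
    """Count alignments whose position falls inside exactly one gene."""
    counts = Counter({gene_id: 0 for _, _, _, gene_id in annotation})
    unassigned = 0
    for chrom, pos in alignments:
        hits = [g for c, s, e, g in annotation if c == chrom and s <= pos <= e]
        if len(hits) == 1:
            counts[hits[0]] += 1
        else:
            # No overlapping gene, or more than one (ambiguous).
            unassigned += 1
    return dict(counts), unassigned


counts, unassigned = count_reads_per_gene(ANNOTATION, ALIGNMENTS)
print(counts, unassigned)
```

The real tools additionally handle spliced reads, read pairs, strandedness, and overlapping features, which is why you should use them rather than a script like this.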
Old 01-15-2015, 02:21 PM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

That's what should happen. Your next step is to get counts of aligned fragments per gene, for which you can use featureCounts or htseq-count. Both of those expect exactly what you have as input.

Edit: GenoMax beat me by a minute. I should note that mapping against the transcriptome with TopHat still produces alignments in genomic coordinates.
Old 01-15-2015, 02:28 PM   #4
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default

Thanks guys!! Appreciate it! I'll give it a try.
Thanks again!
Old 01-16-2015, 08:47 AM   #5
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default

Using featureCounts gives me nice summary and counts text files. However, my SAM and BAM files still contain the original, genomic, annotations (obviously). Ideally, I would like to convert the annotations in the BAM/SAM files so that I can further process them.

This leads me to a broader question: what reference (for mouse RNA-seq) do people use when they want gene IDs instead of genomic targets? I noticed that reference files such as mRNA.fa or refMrna.fa only contain accession numbers, not gene IDs.
Thanks in advance
Old 01-16-2015, 09:08 AM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Gene IDs, names, and numbers vary depending on the database in question. You can either get a translation table, or try to find a fasta file already named with the identifiers you want to use.
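
As a rough illustration of the translation-table route, the Python sketch below renames fasta headers from accessions to gene IDs. The accessions, gene names, two-column tab-separated table layout, and both helper functions are hypothetical; a real table from a given database may be laid out differently.

```python
import csv
import io

# Hypothetical two-column translation table: accession <TAB> gene ID.
TABLE_TSV = "ACC_0001\tGeneA\nACC_0002\tGeneA\nACC_0003\tGeneB\n"


def load_translation_table(handle):
    """Read accession -> gene ID pairs from a tab-separated file handle."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def rename_fasta_headers(fasta_lines, table):
    """Yield fasta lines with each header replaced by its gene ID, if known."""
    for line in fasta_lines:
        if line.startswith(">"):
            # The accession is taken to be the first word after '>'.
            accession = line[1:].split()[0]
            yield ">" + table.get(accession, accession)
        else:
            yield line


table = load_translation_table(io.StringIO(TABLE_TSV))
fasta = [">ACC_0001 some description", "ACGT", ">ACC_9999", "GGCC"]
renamed = list(rename_fasta_headers(fasta, table))
print(renamed)
```

Note that several accessions can map to the same gene ID (as with GeneA here), so renaming can produce duplicate sequence names; accessions missing from the table are left unchanged.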
Old 01-16-2015, 09:33 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,054
Default

Quote:
Originally Posted by analog900 View Post
Using featureCounts gives me nice summary and counts text files. However, my SAM and BAM files still contain the original, genomic, annotations (obviously). Ideally, I would like to convert the annotations in the BAM/SAM files so that I can further process them.
What is "further processing" referring to here? Most downstream analysis is going to use the counts files (unless you are going to call SNPs from this data) and will always refer to the gene names contained in that file.
Old 01-16-2015, 09:51 AM   #8
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default

Quote:
Originally Posted by GenoMax View Post
What is "further processing" referring to here? Most downstream analysis is going to use the counts files (unless you are going to call SNPs from this data) and will always refer to the gene names contained in that file.
I've been loosely following the "simple fool's guide for rna seq" by the group of Stephen Palumbi (http://sfg.stanford.edu/guide.html). They parse their SAM output files with a series of Python scripts to obtain summary statistics similar to the ones I can now get with featureCounts. Then they use DESeq for functional enrichment (which I would really like to do in order to compare my different samples).
Old 01-16-2015, 10:58 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I would recommend ignoring that guide. If you want to use DESeq (use DESeq2), just directly use the counts from featureCounts. This would be the standard and accepted pipeline and there's no reason to use any kludgy scripts.
Old 01-16-2015, 12:12 PM   #10
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default

Quote:
Originally Posted by dpryan View Post
I would recommend ignoring that guide. If you want to use DESeq (use DESeq2), just directly use the counts from featureCounts. This would be the standard and accepted pipeline and there's no reason to use any kludgy scripts.
Thank you. Really appreciate it! Can you recommend any other standard/accepted pipelines downstream of featureCounts?
Old 01-18-2015, 01:13 PM   #11
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

We use limma/voom and edgeR in downstream analyses to discover differentially expressed genes. The link below is a short tutorial on using our pipeline to analyze RNA-seq data, which you might find helpful:

http://bioinf.wehi.edu.au/RNAseqCaseStudy/
Old 01-19-2015, 06:23 AM   #12
Michael Love
Senior Member
 
Location: Boston

Join Date: Jul 2013
Posts: 333
Default

For DESeq2 you would use the DESeqDataSetFromMatrix function to start the analysis, using the counts matrix returned by featureCounts. An example of starting from a count matrix is in the DESeq2 vignette.
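
If the counts come from the command-line featureCounts text file rather than from R, the count columns have to be separated from the annotation columns before they can serve as a matrix. The Python sketch below shows one way to do that, assuming the usual tab-separated layout (a comment line starting with `#`, then Geneid, Chr, Start, End, Strand, Length, then one column per BAM file). The example data and the `read_count_matrix` helper are made up; check the layout of your own file before relying on it.

```python
import csv
import io

# Made-up example in the layout assumed above.
EXAMPLE = (
    "# example comment line\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "GeneA\tchr1\t100\t500\t+\t401\t12\t30\n"
    "GeneB\tchr1\t800\t1200\t-\t401\t0\t7\n"
)

# Geneid, Chr, Start, End, Strand, Length precede the per-sample counts.
ANNOTATION_COLUMNS = 6


def read_count_matrix(handle):
    """Return (gene IDs, sample names, counts as a list of integer rows)."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    samples = header[ANNOTATION_COLUMNS:]
    genes, matrix = [], []
    for row in rows:
        genes.append(row[0])
        matrix.append([int(value) for value in row[ANNOTATION_COLUMNS:]])
    return genes, samples, matrix


genes, samples, matrix = read_count_matrix(io.StringIO(EXAMPLE))
print(genes, samples, matrix)
```

The result has genes as rows and samples as columns, which is the orientation a count matrix for DESeq2 is expected to have; the analysis itself would still be done in R.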
Old 01-20-2015, 08:56 AM   #13
analog900
Member
 
Location: West Coast

Join Date: Oct 2014
Posts: 13
Default

Thanks so much guys!
Working through the DESeq2 vignette now and learning new stuff... really excited!
Thanks again!
All times are GMT -8.