SEQanswers

Old 07-02-2011, 11:02 PM   #1
mart555
Member
 
Location: Shanghai

Join Date: Jan 2011
Posts: 11
Could I filter rRNA and tRNA using TopHat or Cufflinks?

My RNA-seq data were mapped with TopHat, but rRNA and tRNA reads were not removed. Can TopHat or Cufflinks remove reads that match rRNA or tRNA?
Old 07-03-2011, 09:15 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

Why would you want to filter them out, anyway?
Old 07-03-2011, 11:14 AM   #3
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177

Let me share my experience on this. When I was analyzing a set of microbial RNA-seq data, Cufflinks got stuck at "99% complete" for days. It is a known issue; check the Cufflinks FAQ. The authors suggest removing rRNA and mtDNA. So I removed those from the GTF file and the run finished in a few hours. I believe the rRNA genes usually have excessive coverage, which may choke Cufflinks.
Old 07-03-2011, 12:50 PM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

I asked because if you just want to do simple counting in your next analysis step, you would just get a few extra count values, which you can then ignore. Isn't Cufflinks a bit too sophisticated for prokaryotic genomes, anyway? Wouldn't it spend all its time trying to assemble multi-exonic transcripts, of which there aren't any, or can you tell it not to bother with splice junctions?
Old 07-07-2011, 06:33 AM   #5
mart555
Member
 
Location: Shanghai

Join Date: Jan 2011
Posts: 11

Thank you, DZhang.
I checked the Cufflinks FAQ and, as it suggests, ran Cufflinks with -M rRNA.gtf, but the calculation still takes more than a day.
So I wonder whether there is a tool that can filter reads the way ABI's BioScope does: discard the reads that map to a filter reference, then align the remaining reads to the genome?
Old 07-07-2011, 07:41 AM   #6
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177

Mart555,

To answer your question directly: yes, you can. Map the reads to your filter reference and extract the unmapped reads for further processing. Bowtie or BWA can do the former and samtools can do the latter.
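
As a rough illustration of the extraction step (a sketch, not anyone's actual commands from this thread): in SAM output an unmapped read has bit 0x4 set in the FLAG column, which is what `samtools view -f 4` selects. The plain-Python function below applies the same test to SAM text lines and emits FASTQ records that can be passed to the next aligner. The function name and the example records are made up for illustration.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_reads_to_fastq(sam_lines):
    """Yield one FASTQ record (as a string) per unmapped read in SAM text."""
    for line in sam_lines:
        if line.startswith("@"):  # SAM header lines start with '@'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM alignment line has 11 mandatory columns
            continue
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & UNMAPPED_FLAG:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


# Hypothetical example: r1 hit the filter reference, r2 did not.
example_sam = [
    "@HD\tVN:1.0\n",
    "r1\t0\trRNA_18S\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tHHHH\n",
]
for record in unmapped_reads_to_fastq(example_sam):
    print(record, end="")
```

Only r2 is written out here, because it is the only record with the unmapped bit set; that file is what would then be aligned to the genome.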

My understanding of your challenge is that you do not know which part of the reference sequences is taking too many reads, or even whether that is the root cause in your case. I assume your job is done by now, although it took a bit longer. Can you explain your situation so everybody understands it better?

Thank you,
Douglas
www.contigexpress.com
Old 07-08-2011, 07:32 AM   #7
mart555
Member
 
Location: Shanghai

Join Date: Jan 2011
Posts: 11

Douglas,
Thank you for your answer.
As you suggest, I now want to build a Bowtie index of rRNA + tRNA + mtRNA, and I think I can assess the percentage of these junk RNAs by running Bowtie against this index.
But I still cannot work out how to extract the unmapped reads using samtools. And if Bowtie generates a SAM file against the rRNA index, how can the unmapped reads be remapped to the genomic sequence?

My situation:
I did a RIP. As the 2100 (Bioanalyzer) trace shows, the mock RNA has peaks that represent rRNA, but the RIP RNA has none.
I then sequenced my RNA on a HiSeq 2000 and used TopHat to map the reads to mm9.
Mapping the RIP reads takes about 8 h, whereas the mock reads take more than 24 h.
So I want to filter these reads out, which should make my analysis much faster.

Last edited by mart555; 07-08-2011 at 07:34 AM.
Old 07-08-2011, 08:04 AM   #8
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177

Hi Mart555,

Check this post:
http://seqanswers.com/forums/showthread.php?t=5787

Regards,
Douglas
www.contigexpress.com
Old 07-08-2011, 08:16 AM   #9
cascoamarillo
Senior Member
 
Location: MA

Join Date: Oct 2010
Posts: 160

Hi guys,

In your case, what I do is the following:
Map the reads against the junk/undesired reference (rRNA, mt, ...) with Bowtie, using the --un option to save a FASTQ/FASTA file of the unmapped reads (the desired reads).
Then you can take this file and run it through Bowtie/TopHat/Cufflinks against your reference.
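
The two steps above could be scripted roughly as follows. This is only a sketch under assumptions: the index prefixes (`junk_index`, `mm9`), file names, and function names are placeholders, and the commands are built as argument lists and printed rather than executed. It relies on Bowtie 1's `--un` (write unaligned reads to a file), `-q` (FASTQ input), and `-S` (SAM output) options and on TopHat's `-o` output directory option.

```python
def junk_filter_command(junk_index, reads_fastq, unmapped_fastq,
                        hits_sam="junk_hits.sam"):
    """Bowtie 1 call that aligns reads to the junk reference and saves
    the reads that did NOT align (the ones to keep) via --un."""
    return ["bowtie", "-q", "-S", "--un", unmapped_fastq,
            junk_index, reads_fastq, hits_sam]


def genome_mapping_command(genome_index, filtered_fastq, out_dir="tophat_out"):
    """TopHat call that maps the filtered reads to the real reference."""
    return ["tophat", "-o", out_dir, genome_index, filtered_fastq]


# Placeholder names; substitute your own index prefixes and files.
step1 = junk_filter_command("junk_index", "reads.fastq", "filtered_reads.fastq")
step2 = genome_mapping_command("mm9", "filtered_reads.fastq")
for cmd in (step1, step2):
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths are real. Comparing the number of reads in the input and in the `--un` file also gives the fraction of reads removed by the filter.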

Hope it helps.
Old 07-09-2011, 10:53 PM   #10
mart555
Member
 
Location: Shanghai

Join Date: Jan 2011
Posts: 11

Hi all,
Thank you for your help; I really appreciate it.
I have now finished filtering against rRNA/tRNA/mtDNA.
About 50% of the IgG reads and 26% of the RIP reads were filtered out, which seems reasonable.

But another question: where can I get the correct rRNA sequences?
Some people recommended getting rRNA sequences from http://www.arb-silva.de/
I searched for "Mus musculus", downloaded the high-quality sequences (about 70 records) in FASTA format, and converted "U" to "T". But these sequences don't work.
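
For reference, the U-to-T conversion can be sketched as below. It only touches sequence lines and leaves the ">" header lines alone, since a blanket search-and-replace would also rewrite any "U" inside record names. The function name and the sample record are hypothetical, and this is not offered as an explanation of why the SILVA sequences failed here.

```python
def rna_fasta_to_dna(lines):
    """Yield FASTA lines with U/u replaced by T/t in sequence lines only."""
    for line in lines:
        if line.startswith(">"):
            yield line  # header: keep exactly as downloaded
        else:
            yield line.replace("U", "T").replace("u", "t")


# Hypothetical record; real input would be the lines of the downloaded file.
rna_record = [">example_18S Mus musculus (U in header stays)\n", "GAUCCUGCCAG\n"]
for line in rna_fasta_to_dna(rna_record):
    print(line, end="")
```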

So I searched GenBank for mouse rRNA sequences and got only 4 records:
gi|262231778|ref|NR_030686.1| Mus musculus 5S RNA (Rn5s), ribosomal RNA
gi|120444901|ref|NR_003280.1| Mus musculus 5.8S ribosomal RNA (LOC790956), ribosomal RNA
gi|120444900|ref|NR_003279.1| Mus musculus 28S ribosomal RNA (28s), ribosomal RNA
gi|328447215|ref|NR_003278.2| Mus musculus 18S ribosomal RNA (Rn18s), ribosomal RNA

Combining these four sequences with tRNA and mtDNA, I successfully filtered my reads, but I still wonder whether these four sequences are enough.
Old 07-11-2011, 08:57 PM   #11
mart555
Member
 
Location: Shanghai

Join Date: Jan 2011
Posts: 11

I have now finished my filtering using rRNA sequences downloaded from GenBank and SILVA.

Thanks for the help, all of you!
Old 11-26-2011, 06:15 PM   #12
cutcopy11
Member
 
Location: Purdue University

Join Date: Nov 2009
Posts: 19

Hi DZhang and other SEQanswers regulars,

I want to filter rRNA and mtDNA genes from GTF files.

I am using RSEM to map and count reads per gene for a class project with RNA-seq data from various publications. Then I am comparing the performance of edgeR and DESeq on the RSEM output. I believe the excess coverage of rRNA and possibly mtDNA is messing up my differential expression results.

I downloaded my mouse and human GTF files from the UCSC Genome Browser and converted a GFF file from Arabidopsis to GTF.

How can I filter the rRNA and/or mtDNA out of the GTF file? Is there a list of gene IDs somewhere? I can write scripts in Perl, by the way, so I can do it myself if someone points me in the right direction. I would probably use the rRNA/mtDNA gene ID list to filter the RSEM results.

Thanks so much,
Clayton

Last edited by cutcopy11; 11-26-2011 at 06:17 PM.
Old 11-26-2011, 07:18 PM   #13
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177

Hi Clayton,

Since you have the GTF file, you can search for any gene/transcript name containing "rRNA" or "ribosomal RNA", and review each entry to confirm before removing it. For mtDNA it is even easier, as you can tell from the chromosome ID.

Cheers,
Douglas
Old 11-26-2011, 08:32 PM   #14
cutcopy11
Member
 
Location: Purdue University

Join Date: Nov 2009
Posts: 19

Thanks, Douglas, for your quick response. Where do you recommend searching for those rRNA gene IDs? Thanks again, Clay
Old 11-27-2011, 06:30 AM   #15
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177

Hi Clay, you should do your search in your GTF/GFF file. The overall idea is to remove the rRNA/mt genes from your GTF/GFF file so the program does not process the excessive reads mapped to those genes.
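
A minimal sketch of that idea, assuming the attribute column of the GTF actually carries descriptive names or biotypes (not every GTF does) and that the mitochondrial chromosome uses one of the names listed in `MITO_CHROMS`. Both the name set and the function are illustrative assumptions. The removed lines are returned separately so each entry can be reviewed before it is discarded.

```python
import re

# Text that marks an entry as rRNA when it appears in the attribute column.
RRNA_PATTERN = re.compile(r"rRNA|ribosomal RNA", re.IGNORECASE)
# Assumed mitochondrial chromosome names; adjust to your annotation.
MITO_CHROMS = {"chrM", "chrMT", "MT", "M"}


def split_gtf(lines, mito_chroms=MITO_CHROMS):
    """Return (kept, removed) GTF lines; removed = rRNA-like or mitochondrial."""
    kept, removed = [], []
    for line in lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)  # comments and blank lines pass through
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[0]
        attributes = fields[8] if len(fields) > 8 else ""
        if chrom in mito_chroms or RRNA_PATTERN.search(attributes):
            removed.append(line)
        else:
            kept.append(line)
    return kept, removed


# Hypothetical three-line annotation.
example_gtf = [
    'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1"; gene_name "Actb";\n',
    'chr1\tsrc\texon\t200\t300\t.\t+\t.\tgene_id "g2"; gene_biotype "rRNA";\n',
    'chrM\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "g3"; gene_name "mt-Co1";\n',
]
kept, removed = split_gtf(example_gtf)
print(len(kept), "kept;", len(removed), "to review")
```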
Old 11-27-2011, 07:37 AM   #16
cutcopy11
Member
 
Location: Purdue University

Join Date: Nov 2009
Posts: 19

Hi Douglas,

I appreciate your help, but I cannot figure out what unique identifier associates genes with rRNA within the GTF file. I have noticed, however, that the three letters within the gene_id or transcript_id seem to change frequently among genes, but I haven't been able to work out how to associate those three letters with rRNA yet. Below are three lines from a human GTF file from the UCSC Genome Browser. BTW, simply searching for rRNA within the GTF file does not work.

-Clay

chr1 hg19_knownGene exon 1159212 1159348 0.000000 - . gene_id "uc001adh.3"; transcript_id "uc001adh.3";

chr1 hg19_knownGene CDS 1163848 1164173 0.000000 - 0 gene_id "uc001adh.3"; transcript_id "uc001adh.3";

chr1 hg19_knownGene start_codon 1164171 1164173 0.000000 - . gene_id "uc001adh.3"; transcript_id "uc001adh.3";
Old 11-27-2011, 08:10 AM   #17
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177

Hi Clay,

rRNA genes are usually annotated as repeat regions, so your GTF normally does not include them. (I know the standard Bowtie index file does not.) Check this link for more details:

http://genome.ucsc.edu/cgi-bin/hgGen...hgg_end=156152.

Going back to your original question: you need to review the genes/transcripts with extremely high counts to see what they are. My guess is that they may not be rRNA-coding genes.
Old 01-19-2012, 01:26 AM   #18
paolo.kunder
Member
 
Location: Milano, Italy

Join Date: Aug 2011
Posts: 93

Quote:
Originally Posted by mart555
Thank you, DZhang.
I checked the Cufflinks FAQ and, as it suggests, ran Cufflinks with -M rRNA.gtf, but the calculation still takes more than a day.
So I wonder whether there is a tool that can filter reads the way ABI's BioScope does: discard the reads that map to a filter reference, then align the remaining reads to the genome?

Where did you find the rRNA.gtf file?
Old 01-19-2012, 07:47 AM   #19
polyatail
Member
 
Location: New York, NY

Join Date: Dec 2010
Posts: 25

Quote:
Originally Posted by paolo.kunder
Where did you find the rRNA.gtf file?

You can get this pretty easily from the UCSC Table Browser.
  1. Select "All Tables" from the group drop-down list
  2. Select the "rmsk" table from the table drop-down list
  3. Choose "GTF" as the output format
  4. Type a filename in "output file" so your browser downloads the result
  5. Click "create" next to filter
  6. Next to "repClass," type rRNA
  7. Next to free-form query, select "OR" and type repClass = "tRNA"
  8. Click submit on that page, then get output on the main page
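
If you already have a local tab-separated dump of the rmsk table instead, the same rRNA/tRNA selection can be sketched offline as below. This assumes the usual 17-column rmsk layout (genoName in column 6, genoStart and genoEnd in columns 7 and 8, strand in column 10, repName in column 11, repClass in column 12) and 0-based starts, which are shifted to 1-based for GTF. The function name and the generated gene_id values are invented for illustration and will not match what the Table Browser writes.

```python
def rmsk_rows_to_gtf(rows, wanted_classes=("rRNA", "tRNA")):
    """Turn rmsk table rows with a wanted repClass into GTF-style lines."""
    gtf_lines = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        # Skip short rows, header rows, and repeats of other classes.
        if len(fields) < 17 or fields[11] not in wanted_classes:
            continue
        chrom, start0, end = fields[5], int(fields[6]), fields[7]
        score, strand, rep_name = fields[1], fields[9], fields[10]
        feature_id = f"{rep_name}_{len(gtf_lines) + 1}"  # made-up unique ID
        attributes = f'gene_id "{feature_id}"; transcript_id "{feature_id}";'
        gtf_lines.append("\t".join(
            [chrom, "rmsk", "exon", str(start0 + 1), end,
             score, strand, ".", attributes]))
    return gtf_lines


# Hypothetical rows: one rRNA repeat, one SINE, one tRNA repeat.
example_rows = [
    "585\t1000\t10\t0\t0\tchr1\t100\t250\t-1000\t+\tLSU-rRNA_Hsa\trRNA\trRNA\t1\t150\t0\t1\n",
    "585\t300\t10\t0\t0\tchr1\t400\t500\t-900\t-\tAluY\tSINE\tAlu\t1\t100\t0\t2\n",
    "585\t200\t10\t0\t0\tchr2\t10\t82\t-500\t-\ttRNA-Ala-GCA\ttRNA\ttRNA\t1\t72\t0\t3\n",
]
for gtf_line in rmsk_rows_to_gtf(example_rows):
    print(gtf_line)
```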

Check out the attached screenshots.
Attached Images
File Type: png 20120119_UCSC_mask1.png (138.2 KB, 147 views)
File Type: png 20120119_UCSC_mask2.png (83.1 KB, 124 views)
Old 01-20-2012, 01:06 AM   #20
paolo.kunder
Member
 
Location: Milano, Italy

Join Date: Aug 2011
Posts: 93

many many thanks!!!

Last edited by paolo.kunder; 01-20-2012 at 01:13 AM.