SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
filter sequences from rRNA, tRNA cascoamarillo Bioinformatics 8 01-29-2013 11:13 AM
tRNA/5S rRNA depletion mateo Sample Prep / Library Generation 4 01-26-2012 02:48 PM
coding exons/ repeats/ rRNA, tRNA, snRNA, snoRNA EADIE SOLiD 3 09-17-2010 01:07 AM
How to filter rRNA reads in SAM file. townway Bioinformatics 3 07-15-2010 07:54 AM
Tophat: Is it necessary to pre-filter reads swarbre Bioinformatics 1 09-10-2009 02:48 PM

Reply
 
Thread Tools
Old 10-08-2012, 06:56 AM   #21
SEQond
Member
 
Location: Italy

Join Date: Jul 2010
Posts: 27
Default

Quote:
Originally Posted by mart555 View Post
Hi all,
Thank you for your help, I'm very appreciate.
Now I finished my filter with rRNA\tRNA\mtDNA.
About 50% IgG Reads and 26% RIP Reads were filterd, that's reasonable.

But another question is where can I get the correct rRNA sequence?
Some people recommended get rRNA sequence from http://www.arb-silva.de/
I searched “mus musculus”, and download the high quality sequence(about 70 record)with fasta format, and transfer "U" to "T". But these sequence doesn't work.

So I searched mouse rRNA sequence from Genebank, and I got only 4 record:
gi|262231778|ref|NR_030686.1| Mus musculus 5S RNA (Rn5s), ribosomal RNA
gi|120444901|ref|NR_003280.1| Mus musculus 5.8S ribosomal RNA (LOC790956), ribosomal RNA
gi|120444900|ref|NR_003279.1| Mus musculus 28S ribosomal RNA (28s), ribosomal RNA
gi|328447215|ref|NR_003278.2| Mus musculus 18S ribosomal RNA (Rn18s), ribosomal RNA

Integrade these four sequence with tRNA and mtDNA, I successfully filtered my reads, but I still wonder are these four sequence enough?
Interesting topic
mart555 How did you do the integration? Just cat or anything more?

Thanks

Last edited by SEQond; 10-08-2012 at 07:06 AM. Reason: update info
SEQond is offline   Reply With Quote
Old 10-08-2012, 04:48 PM   #22
mart555
Member
 
Location: Shanghai

Join Date: Jan 2011
Posts: 11
Default

Quote:
Originally Posted by SEQond View Post
Interesting topic
mart555 How did you do the integration? Just cat or anything more?

Thanks
I just cat them together.
mart555 is offline   Reply With Quote
Old 10-09-2012, 03:28 AM   #23
SEQond
Member
 
Location: Italy

Join Date: Jul 2010
Posts: 27
Default

Quote:
Originally Posted by DZhang View Post
Hi Clay, you should do your search in your gtf/gff file. The overall idea is to remove the rRNA/mtGenes from your gtf/gff file so the program does not process the excessive reads mapped to those genes.
There should be a way to minimize time allocated for RefGenome mapping by throwing out the fragments that align to rRNA, tRNA or mtGenes by filtering out prior to the alignment to the refGenome, and not filtering out using the UCSC gtf s after the main mapping has been done.

What I have done so far is


A. to align to the rRNA of the following Genebank ids and get back what does not align

Code:
Genebank,  4 record: (MOUSE)
gi|262231778|ref|NR_030686.1| Mus musculus 5S RNA (Rn5s), ribosomal RNA
gi|374093199|ref|NR_003280.2| Mus musculus 5.8S ribosomal RNA (Rs5-8s1)
gi|120444900|ref|NR_003279.1| Mus musculus 28S ribosomal RNA (28s), ribosomal RNA
gi|374088232|ref|NR_003278.3| Mus musculus 18S ribosomal RNA (Rn18s), ribosomal RNA
B. use the mm9 or whichever RefGenome to get the chrM (mitochondrial), and build a new index for bowtie then align what came out of (A) and as before get back what does not align to (B).

Code:
bowtie-build /RefGenomes/mouse/mm9/chrM.fa /RefGenomes/mtDNA/mouse/mm9/chrM &

Please share your thoughts
SEQond is offline   Reply With Quote
Old 10-12-2012, 04:54 AM   #24
SEQond
Member
 
Location: Italy

Join Date: Jul 2010
Posts: 27
Default

Quote:
Originally Posted by polyatail View Post
You can get this pretty easily from the UCSC table browser.
  1. Select "All Tables" from the group drop-down list
  2. Select the "rmsk" table from the table drop-down list
  3. Choose "GTF" as the output format
  4. Type a filename in "output file" so your browser downloads the result
  5. Click "create" next to filter
  6. Next to "repClass," type "rRNA"
  7. Next to free-form query, select "OR" and type repClass = "tRNA"
  8. Click submit on that page, then get output on the main page

Check out the attached screenshots.

With this same method one can get the sequence in fasta format also so you can build a Bowtie index.

Instead of "GTF" choose "sequence" to get the FASTA.

As a side not remember that with the above method you get also the mito_tRNAs and the mito_rRNAs. If you dont want the mito , then on the "Free-form query: " you should specify NOT LIKE "chr??" . Further help here

Last edited by SEQond; 10-12-2012 at 07:14 AM.
SEQond is offline   Reply With Quote
Old 06-21-2013, 09:03 AM   #25
pmgr
Junior Member
 
Location: Sweden

Join Date: Jun 2012
Posts: 5
Default

Simon Anders,

You say that in case of simple counting then removing the rRNA,mtDNA etc. is not necessary. Could you elaborate on it? In which case is it imprortant?
pmgr is offline   Reply With Quote
Old 06-24-2013, 01:54 AM   #26
hanshart
Member
 
Location: Germany

Join Date: Nov 2011
Posts: 27
Default

Quote:
Originally Posted by Simon Anders View Post
... if you just want to do simple counting in your next analysis step, you would just get a few extra count values, which you can then ignore.
He just said that for read counting issues (like differential expression analyses) the filtering based on annotation descriptors would be easier than to filter prior to the alignment (map against filter-feature-index -> map only remaining unmapped reads against interesting genome regions). For counting you just can ignore the read counts associated with filtering criteria.
But there are indeed some scenarios were it might be more reasonable to filter prior to the alignment, e.g. if you are not interested in some overrepresented gene groups or the rRNA content is very high for some reason. The you can save mapping time with the smaller, prefiltered reference. Another aspect is the mapping of "mulit-mapped" reads. One could discard all reads (perfectly) mapping to rRNA etc. Then mapping of multi-mapped reads would at least not include reads that were mapped to genes by chance but could also be mapped to rRNA => the reliability of the mapping increases.
hanshart is offline   Reply With Quote
Old 06-24-2013, 02:05 AM   #27
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Note that hanshart's suggestions are all ways to speed up the alignment process a bit. Investing much time into figuring out how to pre-filter your reads before alignment is probably worth your while only if you have really many reads to align.

After all, after the alignment, it's trivial to remove the reads that map to rRNA, if this is important to you for some reason. Just look at the alignments and flag all reads that were aligned to rRNA loci.
Simon Anders is offline   Reply With Quote
Old 09-11-2013, 01:40 PM   #28
apredeus
Senior Member
 
Location: Bioinformatics Institute, SPb

Join Date: Jul 2012
Posts: 151
Default

Quote:
Originally Posted by polyatail View Post
You can get this pretty easily from the UCSC table browser.
Thank you, this is very useful. Great post!
apredeus is offline   Reply With Quote
Old 03-05-2014, 05:00 PM   #29
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

Can you get a version for Ensembl as well?
sindrle is offline   Reply With Quote
Old 06-01-2014, 07:48 AM   #30
slowkow
Junior Member
 
Location: Boston

Join Date: May 2010
Posts: 2
Default The UCSC table does not provide ribosomal genes

Below, I show one example of a ribosomal gene that is present in the iGenomes GTF file and absent from the rmsk UCSC table.

genes.gtf provided by iGenomes hg19 contains RPS25:

Code:
grep 118886422 genes.gtf
chr11   unknown exon    118886422       118886468       .       -       .       gene_id "RPS25"; gene_name "RPS25"; p_id "P8979"; transcript_id "NM_001028"; tss_id "TSS14859";
I got the rmsk.gtf from the UCSC table browser without using any filters. I downloaded the entire 520MB table.

It does NOT have this gene:

Code:
grep chr11 rmsk.gtf | grep -P '118886\d{3}'
chr11   hg19_rmsk       exon    118886716       118886989       1723.000000     +       .       gene_id "AluSz"; transcript_id "AluSz_dup3616";
Nor does it have any RPS genes:

Code:
grep RPS rmsk.gtf | wc -l
0
If I "grep RPS genes.gtf" to find ribosomal genes, then I will also find non-ribosomal genes such as the transcription factor TRPS1, because TRPS1 contains contains the string "RPS".

Also, it is possible that some ribosomal genes are not found by "grep RPS". Perhaps I'm wrong about this.

In general, this is why suggesting "grep" is poor advice.

If you have proper annotations for ribosomal genes and tRNAs, I would appreciate it if you could please share your method for obtaining them. You can also share your file if you want, but I'm only interested in the method and not the file.
slowkow is offline   Reply With Quote
Old 06-01-2014, 09:28 AM   #31
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

You can avoid matching things like 'TRPS1' when using grep by doing something like

Code:
grep -P '\bRPS'
mastal is offline   Reply With Quote
Old 06-01-2014, 09:40 AM   #32
Zapages
Member
 
Location: NJ

Join Date: Oct 2012
Posts: 94
Default

Very interesting and informative. Thank you for sharing this information.
Zapages is offline   Reply With Quote
Old 06-01-2014, 06:43 PM   #33
slowkow
Junior Member
 
Location: Boston

Join Date: May 2010
Posts: 2
Default

Quote:
Originally Posted by mastal View Post
You can avoid matching things like 'TRPS1' when using grep by doing something like

Code:
grep -P '\bRPS'
"\bRPS" does not match MRPS10 and other ribosomal genes that do not begin with "RPS".

grep matching on gene names can easily fail when your aim is to capture all genes in a particular class:
  1. You may miss some genes of interest that do not match your pattern.
  2. You may match genes that are not of interest, and it's hard to notice if your total list contains thousands of genes.

If you aim to capture a just few genes, you can probably grep and manually confirm that you have everything you need.

Visual inspection is not appropriate when you have thousands of genes: that is too many to count, and gene names have synonyms you will not recognize by visual inspection and not match with your patterns.
slowkow is offline   Reply With Quote
Old 07-18-2014, 12:28 PM   #34
shangzhong0619
Member
 
Location: La Jolla

Join Date: Nov 2013
Posts: 17
Default Tophat remove rRNA reads automatically?

Hi,
I just have the opposite problem as mart555. I mapped my RNAseq to mouse gff3 file downloaded from ncbi. I didn't make any change, rRNA sequence are in the file. After mapping using tophat, I blasted the unmappd reads and many mapped to rRNA. The command I run is:
tophat -p 8 -G $annotation -o out $database L1_1.fq.gz L1_2.fq.gz

So my question is:
(1) tophat can automatically filter rRNA reads even if the gff file has rRNA annotation?
(2) I tried using only bowtie2 instead of tophat, the result is better, but in the unmapped reads they still map to rRNA. So bowtie2 can also filter some reads when indexing?

Thank you.
shangzhong0619 is offline   Reply With Quote
Old 12-17-2014, 10:54 AM   #35
Gonza
Member
 
Location: Ithaca, NY

Join Date: Mar 2013
Posts: 78
Default

Hi all,

I manually removed rRNA using the Linux command:

$ grep -v 'rRNA' genes.gtf > new_genes.gtf

One question though, if you are looking for gene differential expression between conditions (treatment vs control for example) shouldn't you remove tRNA as well?

Cheers
-G
Gonza is offline   Reply With Quote
Old 12-17-2014, 12:09 PM   #36
apredeus
Senior Member
 
Location: Bioinformatics Institute, SPb

Join Date: Jul 2012
Posts: 151
Default

Technically you don't need to remove rRNA from annotation - you need to remove them from your library. And if it's poly-A selected, it will remove tRNA as well. If you see some differential expression of tRNAs, well, then you do it should not influence any other genes.
apredeus is offline   Reply With Quote
Old 12-17-2014, 06:35 PM   #37
yueluo
Member
 
Location: Guangzhou China

Join Date: Aug 2013
Posts: 82
Default

If you have a gtf/gff of the rRNA genes, you can filter alignment results with Cufflinks using the '-M/--mask-file' option.

http://cole-trapnell-lab.github.io/c...nks/index.html

Quote:
Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates.
yueluo is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:43 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO