**cutcopy11** · 11-27-2011, 08:37 AM

Hi Douglas,

I appreciate your help, but I cannot figure out what unique identifier associates genes with rRNA within the GTF file. I have noticed that the three letters within the gene_id or transcript_id change frequently among genes, but I haven't been able to work out how to associate those three letters with rRNA. Below are three lines from a human GTF file from the UCSC Genome Browser. BTW, simply searching for rRNA within the GTF file does not work.

-Clay

chr1 hg19_knownGene exon 1159212 1159348 0.000000 - . gene_id "uc001adh.3"; transcript_id "uc001adh.3";

chr1 hg19_knownGene CDS 1163848 1164173 0.000000 - 0 gene_id "uc001adh.3"; transcript_id "uc001adh.3";

chr1 hg19_knownGene start_codon 1164171 1164173 0.000000 - . gene_id "uc001adh.3"; transcript_id "uc001adh.3";

**DZhang** · 11-27-2011, 09:10 AM

Hi Clay,

rRNA genes are usually annotated as repeat regions, so your GTF normally does not include them. (I know the standard Bowtie index file does not.) Check this link for more details:

http://genome.ucsc.edu/cgi-bin/hgGen...hgg_end=156152.

Going back to your original question: you need to review the genes/transcripts with extremely high counts to see what they are. My guess is that they may not be rRNA-coding genes.

**paolo.kunder** · 01-19-2012, 02:26 AM

Originally posted by mart555

Thank you, DZhang.
I checked the Cufflinks FAQ and, as it suggests, I ran cufflinks with -M rRNA.gtf, but the calculation still takes more than a day.
So I wonder whether there is a tool that can filter the reads the way ABI's BioScope does: discard the reads that map to a filter reference, then align the remaining reads to the genome?

where did you find rRNA.gtf file?

**polyatail** · 01-19-2012, 08:47 AM

Originally posted by paolo.kunder

where did you find rRNA.gtf file?

You can get this pretty easily from the UCSC table browser.

Select "All Tables" from the group drop-down list
Select the "rmsk" table from the table drop-down list
Choose "GTF" as the output format
Type a filename in "output file" so your browser downloads the result
Click "create" next to filter
Next to "repClass," type rRNA
Next to free-form query, select "OR" and type repClass = "tRNA"
Click submit on that page, then get output on the main page

Check out the attached screenshots.

**paolo.kunder** · 01-20-2012, 02:06 AM

many many thanks!!!

**SEQond** · 10-08-2012, 06:56 AM

Originally posted by mart555

Hi all,
Thank you for your help; I really appreciate it.
I have now finished filtering against rRNA, tRNA and mtDNA.
About 50% of the IgG reads and 26% of the RIP reads were filtered out, which seems reasonable.

But another question: where can I get the correct rRNA sequences?
Some people recommended getting rRNA sequences from http://www.arb-silva.de/
I searched for "mus musculus", downloaded the high-quality sequences (about 70 records) in FASTA format, and converted "U" to "T". But these sequences did not work.

So I searched GenBank for mouse rRNA sequences and found only 4 records:
gi|262231778|ref|NR_030686.1| Mus musculus 5S RNA (Rn5s), ribosomal RNA
gi|120444901|ref|NR_003280.1| Mus musculus 5.8S ribosomal RNA (LOC790956), ribosomal RNA
gi|120444900|ref|NR_003279.1| Mus musculus 28S ribosomal RNA (28s), ribosomal RNA
gi|328447215|ref|NR_003278.2| Mus musculus 18S ribosomal RNA (Rn18s), ribosomal RNA

I combined these four sequences with the tRNA and mtDNA sequences and successfully filtered my reads, but I still wonder: are these four sequences enough?

Interesting topic.
mart555, how did you do the integration? Just cat, or anything more?

Thanks

**mart555** · 10-08-2012, 04:48 PM

Originally posted by SEQond

Interesting topic.
mart555, how did you do the integration? Just cat, or anything more?

Thanks

I just cat them together.
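
For anyone following along, here is a minimal sketch of that concatenation step, including the U-to-T conversion mentioned in the quoted post. The file names and the toy sequences are hypothetical placeholders, not real rRNA or chrM records. The conversion is restricted to sequence lines so that FASTA headers are not altered.

```shell
#!/bin/sh
# Sketch only: file names and sequences are invented for illustration.
set -eu
workdir=$(mktemp -d)

# Toy stand-ins for an rRNA FASTA in RNA alphabet and a chrM FASTA.
printf '>toy_rRNA Mus musculus\nACGUUGCA\n' > "$workdir/rRNA.fa"
printf '>toy_chrM\nGATTACA\n' > "$workdir/chrM.fa"

# Convert U/u to T/t on sequence lines only; lines starting with ">" are skipped.
sed '/^>/!y/Uu/Tt/' "$workdir/rRNA.fa" > "$workdir/rRNA.dna.fa"

# Concatenate everything into one filter reference.
cat "$workdir/rRNA.dna.fa" "$workdir/chrM.fa" > "$workdir/filter.fa"

# Sanity check: count the FASTA records in the combined file.
n_records=$(grep -c '^>' "$workdir/filter.fa")
echo "records in filter.fa: $n_records"
```

The combined file can then be passed to bowtie-build like any other FASTA.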

**SEQond** · 10-09-2012, 03:28 AM

Originally posted by DZhang

Hi Clay, you should do your search in your gtf/gff file. The overall idea is to remove the rRNA/mtGenes from your gtf/gff file so the program does not process the excessive reads mapped to those genes.

There should be a way to reduce the time spent mapping to the reference genome by discarding the fragments that align to rRNA, tRNA or mitochondrial genes before the alignment to the reference genome, rather than filtering with the UCSC GTFs after the main mapping has been done.

What I have done so far:

A. Align the reads to the rRNA sequences with the following GenBank IDs and keep the reads that do not align.

Code:

GenBank, 4 records (mouse):
gi|262231778|ref|NR_030686.1| Mus musculus 5S RNA (Rn5s), ribosomal RNA
gi|374093199|ref|NR_003280.2| Mus musculus 5.8S ribosomal RNA (Rs5-8s1)
gi|120444900|ref|NR_003279.1| Mus musculus 28S ribosomal RNA (28s), ribosomal RNA
gi|374088232|ref|NR_003278.3| Mus musculus 18S ribosomal RNA (Rn18s), ribosomal RNA

B. Extract chrM (mitochondrial) from mm9, or whichever reference genome you use, build a new Bowtie index from it, align the output of (A) against it, and again keep the reads that do not align.

Code:

bowtie-build /RefGenomes/mouse/mm9/chrM.fa /RefGenomes/mtDNA/mouse/mm9/chrM &

Please share your thoughts
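
A minimal sketch of how steps (A) and (B) could be chained, assuming Bowtie 1 and its `--un` option, which writes the reads that fail to align to a file. The index names (`rRNA_index`, `chrM_index`, `mm9_index`) and all file names are hypothetical. `RUN=echo` makes the script print the commands rather than execute them, so it can be inspected without Bowtie installed. The index is given as a positional argument here; newer Bowtie releases may expect it via `-x`, so check `bowtie --help` for your version.

```shell
#!/bin/sh
# Dry-run sketch: prints the bowtie commands for a chain of filter indexes.
# Change RUN=echo to RUN= to actually execute (needs bowtie and built indexes).
set -eu
RUN=echo

filter_chain() {
    reads=$1; shift
    for idx in "$@"; do
        unaligned="unaligned_to_${idx}.fq"
        # Reads that do not hit the filter index go to $unaligned;
        # the hits themselves are discarded via /dev/null.
        $RUN bowtie --un "$unaligned" "$idx" "$reads" /dev/null
        reads=$unaligned            # the survivors feed the next stage
    done
    # Final alignment of the surviving reads to the genome, SAM output.
    $RUN bowtie -S mm9_index "$reads" aligned.sam
}

plan=$(filter_chain reads.fq rRNA_index chrM_index)
printf '%s\n' "$plan"
```

Each stage only sees the reads that survived the previous one, so the order of the filter indexes does not change the final read set, only how many reads each stage has to process.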

**SEQond** · 10-12-2012, 04:54 AM

Originally posted by polyatail

You can get this pretty easily from the UCSC table browser.

Select "All Tables" from the group drop-down list
Select the "rmsk" table from the table drop-down list
Choose "GTF" as the output format
Type a filename in "output file" so your browser downloads the result
Click "create" next to filter
Next to "repClass," type "rRNA"
Next to free-form query, select "OR" and type repClass = "tRNA"
Click submit on that page, then get output on the main page

Check out the attached screenshots.

With the same method you can also get the sequences in FASTA format, so you can build a Bowtie index.

Instead of "GTF" choose "sequence" to get the FASTA.

As a side note, remember that the above method also returns the mito_tRNAs and the mito_rRNAs. If you don't want the mitochondrial entries, specify NOT LIKE "chr??" in the "Free-form query:" field. Further help here
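
If you have downloaded the whole rmsk table as tab-separated text instead of using the browser filter, the same selection could be sketched with awk. This assumes the usual 17-column rmsk layout (genoName in column 6, genoStart and genoEnd in columns 7 and 8, strand in 10, repName in 11, repClass in 12); please verify that against the table schema for your assembly. The rows below are invented toy data, and dropping mitochondrial entries by the chromosome name "chrM" is my assumption, not the free-form query from the post.

```shell
#!/bin/sh
# Sketch only: toy rows in the assumed rmsk column layout.
set -eu
workdir=$(mktemp -d)
{
  printf '585\t1000\t10\t0\t0\tchr1\t100\t200\t-500\t+\tLSU-rRNA_Hsa\trRNA\trRNA\t1\t100\t0\t1\n'
  printf '585\t900\t10\t0\t0\tchr1\t300\t380\t-400\t-\ttRNA-Ala-GCA\ttRNA\ttRNA\t1\t80\t0\t2\n'
  printf '585\t1723\t10\t0\t0\tchr11\t500\t800\t-100\t+\tAluSz\tSINE\tAlu\t1\t300\t0\t3\n'
  printf '585\t800\t10\t0\t0\tchrM\t50\t120\t-10\t+\ttRNA-Phe\ttRNA\ttRNA\t1\t70\t0\t4\n'
} > "$workdir/rmsk.txt"

# Keep repClass rRNA or tRNA, skip chrM, and write BED6
# (UCSC table coordinates are already 0-based, half-open; score set to 0).
awk -F'\t' -v OFS='\t' '
    ($12 == "rRNA" || $12 == "tRNA") && $6 != "chrM" {
        print $6, $7, $8, $11, 0, $10
    }' "$workdir/rmsk.txt" > "$workdir/rRNA_tRNA.bed"

cat "$workdir/rRNA_tRNA.bed"
```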

**pmgr** · 06-21-2013, 09:03 AM

Simon Anders,

You say that for simple counting, removing the rRNA, mtDNA, etc. is not necessary. Could you elaborate on that? In which cases is it important?

**hanshart** · 06-24-2013, 01:54 AM

Originally posted by Simon Anders

... if you just want to do simple counting in your next analysis step, you would just get a few extra count values, which you can then ignore.

He just said that for read counting (as in differential expression analyses), filtering based on annotation descriptors is easier than filtering prior to the alignment (map against a filter-feature index, then map only the remaining unmapped reads against the genome regions of interest). For counting, you can simply ignore the read counts associated with the filtering criteria.
But there are indeed some scenarios where it might be more reasonable to filter prior to the alignment, e.g. if you are not interested in some overrepresented gene groups or the rRNA content is very high for some reason. Then you can save mapping time with the smaller, prefiltered reference. Another aspect is the mapping of "multi-mapped" reads. One could discard all reads (perfectly) mapping to rRNA etc. The multi-mapped reads would then at least not include reads that were mapped to genes by chance but could also be mapped to rRNA, so the reliability of the mapping increases.

**Simon Anders** · 06-24-2013, 02:05 AM

Note that hanshart's suggestions are all ways to speed up the alignment process a bit. Investing much time in figuring out how to pre-filter your reads before alignment is probably worth your while only if you have a very large number of reads to align.

After all, after the alignment, it's trivial to remove the reads that map to rRNA, if this is important to you for some reason. Just look at the alignments and flag all reads that were aligned to rRNA loci.
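
A rough sketch of that post-alignment flagging, using only awk on a SAM file and a BED file of rRNA loci. Both inputs below are toy data with hypothetical names. As a simplification, a read is flagged when its leftmost aligned position falls inside an rRNA interval; a proper overlap test would also need the alignment length from the CIGAR string, and a dedicated interval tool is likely the better choice for real data.

```shell
#!/bin/sh
# Sketch only: toy BED and SAM inputs, start-position test instead of full overlap.
set -eu
workdir=$(mktemp -d)
printf 'chr1\t100\t200\ttoy_rRNA_locus\n' > "$workdir/rRNA.bed"
{
  printf '@HD\tVN:1.0\n'
  printf 'read1\t0\tchr1\t150\t255\t36M\t*\t0\t0\t*\t*\n'
  printf 'read2\t0\tchr1\t500\t255\t36M\t*\t0\t0\t*\t*\n'
  printf 'read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n'
} > "$workdir/aln.sam"

awk -F'\t' '
    NR == FNR { n++; chrom[n] = $1; start[n] = $2; stop[n] = $3; next }  # load BED
    /^@/      { next }          # skip SAM header lines
    $3 == "*" { next }          # skip unaligned reads
    {
        pos0 = $4 - 1           # SAM POS is 1-based; BED is 0-based, half-open
        for (i = 1; i <= n; i++)
            if ($3 == chrom[i] && pos0 >= start[i] && pos0 < stop[i]) {
                print $1
                break
            }
    }' "$workdir/rRNA.bed" "$workdir/aln.sam" | sort -u > "$workdir/rRNA_read_names.txt"

cat "$workdir/rRNA_read_names.txt"
```

The resulting list of read names can then be used to drop or ignore those reads in the counting step.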

**apredeus** · 09-11-2013, 01:40 PM

Originally posted by polyatail

You can get this pretty easily from the UCSC table browser.

Thank you, this is very useful. Great post!

**sindrle** · 03-05-2014, 06:00 PM

Can you get a version for Ensembl as well?

**slowkow** · 06-01-2014, 07:48 AM

The UCSC table does not provide ribosomal genes

Below, I show one example of a ribosomal gene that is present in the iGenomes GTF file and absent from the rmsk UCSC table.

genes.gtf provided by iGenomes hg19 contains RPS25:

Code:

grep 118886422 genes.gtf
chr11   unknown exon    118886422       118886468       .       -       .       gene_id "RPS25"; gene_name "RPS25"; p_id "P8979"; transcript_id "NM_001028"; tss_id "TSS14859";

I got the rmsk.gtf from the UCSC table browser without using any filters. I downloaded the entire 520MB table.

It does NOT have this gene:

Code:

grep chr11 rmsk.gtf | grep -P '118886\d{3}'
chr11   hg19_rmsk       exon    118886716       118886989       1723.000000     +       .       gene_id "AluSz"; transcript_id "AluSz_dup3616";

Nor does it have any RPS genes:

Code:

grep RPS rmsk.gtf | wc -l
0

If I "grep RPS genes.gtf" to find ribosomal genes, then I will also find non-ribosomal genes such as the transcription factor TRPS1, because TRPS1 contains the string "RPS".

Also, it is possible that some ribosomal genes are not found by "grep RPS". Perhaps I'm wrong about this.

In general, this is why suggesting "grep" is poor advice.
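
To illustrate the substring problem, here is a small sketch that pulls out the gene_name attribute and matches it with an anchored pattern instead of grepping the whole line. The RPS25 line is a trimmed copy of the one shown above; the TRPS1 and RPL3 lines use made-up coordinates and transcript IDs. Note that this is still only a name-based heuristic: the anchored pattern excludes TRPS1, but it would also pick up names such as RPS6KA1, which is a kinase, so it is not a substitute for a proper annotation.

```shell
#!/bin/sh
# Sketch only: toy GTF; anchored match on the extracted gene_name.
set -eu
workdir=$(mktemp -d)
{
  printf 'chr11\tunknown\texon\t118886422\t118886468\t.\t-\t.\tgene_id "RPS25"; gene_name "RPS25"; transcript_id "NM_001028";\n'
  printf 'chr8\tunknown\texon\t1\t100\t.\t-\t.\tgene_id "TRPS1"; gene_name "TRPS1"; transcript_id "tx_toy_1";\n'
  printf 'chr22\tunknown\texon\t1\t100\t.\t+\t.\tgene_id "RPL3"; gene_name "RPL3"; transcript_id "tx_toy_2";\n'
} > "$workdir/genes.gtf"

awk -F'\t' '{
    # Column 9 holds the attributes; extract the quoted gene_name value.
    if (match($9, /gene_name "[^"]+"/)) {
        name = substr($9, RSTART + 11, RLENGTH - 12)
        # Anchored at the start, so TRPS1 does not match.
        if (name ~ /^RP[SL][0-9]/) print name
    }
}' "$workdir/genes.gtf" | sort -u > "$workdir/rp_gene_names.txt"

cat "$workdir/rp_gene_names.txt"
```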

If you have proper annotations for ribosomal genes and tRNAs, I would appreciate it if you could please share your method for obtaining them. You can also share your file if you want, but I'm only interested in the method and not the file.
