
  • #16
    Hi Douglas,

    I appreciate your help, but I cannot figure out which identifier in the GTF file associates genes with rRNA. I have noticed that the three letters within the gene_id or transcript_id change frequently among genes, but I haven't been able to work out how to associate those three letters with rRNA. Below are three lines from a human GTF file from the UCSC Genome Browser. BTW, simply searching for rRNA within the GTF file does not work.

    -Clay

    chr1 hg19_knownGene exon 1159212 1159348 0.000000 - . gene_id "uc001adh.3"; transcript_id "uc001adh.3";

    chr1 hg19_knownGene CDS 1163848 1164173 0.000000 - 0 gene_id "uc001adh.3"; transcript_id "uc001adh.3";

    chr1 hg19_knownGene start_codon 1164171 1164173 0.000000 - . gene_id "uc001adh.3"; transcript_id "uc001adh.3";

    Comment


    • #17
      Hi Clay,

      rRNA genes are usually annotated as repeat regions, so your GTF normally does not include them. (I know the standard Bowtie index file does not.) Check this link for more details:

      http://genome.ucsc.edu/cgi-bin/hgGen...hgg_end=156152.

      Going back to your original question: you need to review the genes/transcripts with extremely high counts to see what they are. My guess is that they may not be rRNA-coding genes.

      Comment


      • #18
        Originally posted by mart555 View Post
        Thank you, DZhang.
        I checked the Cufflinks FAQ and, as it suggests, I ran Cufflinks with -M rRNA.gtf, but the calculation still takes more than 1 day.
        So I wonder whether there are tools that can filter reads the way ABI's BioScope does: discard the reads that map to a filter reference, then align the remaining reads to the genome?
        Where did you find the rRNA.gtf file?

        Comment


        • #19
          Originally posted by paolo.kunder View Post
          Where did you find the rRNA.gtf file?
          You can get this pretty easily from the UCSC table browser.
          1. Select "All Tables" from the group drop-down list
          2. Select the "rmsk" table from the table drop-down list
          3. Choose "GTF" as the output format
          4. Type a filename in "output file" so your browser downloads the result
          5. Click "create" next to filter
          6. Next to "repClass," type rRNA
          7. Next to free-form query, select "OR" and type repClass = "tRNA"
          8. Click submit on that page, then get output on the main page


          Check out the attached screenshots.

          Comment


          • #20
            many many thanks!!!
            Last edited by paolo.kunder; 01-20-2012, 02:13 AM.

            Comment


            • #21
              Originally posted by mart555 View Post
              Hi all,
              Thank you for your help, I really appreciate it.
              I have now finished filtering with rRNA/tRNA/mtDNA.
              About 50% of the IgG reads and 26% of the RIP reads were filtered, which is reasonable.

              But another question is: where can I get the correct rRNA sequences?
              Some people recommended getting rRNA sequences from http://www.arb-silva.de/
              I searched for “mus musculus”, downloaded the high-quality sequences (about 70 records) in FASTA format, and converted "U" to "T". But these sequences didn't work.

              So I searched for mouse rRNA sequences in GenBank, and I found only 4 records:
              gi|262231778|ref|NR_030686.1| Mus musculus 5S RNA (Rn5s), ribosomal RNA
              gi|120444901|ref|NR_003280.1| Mus musculus 5.8S ribosomal RNA (LOC790956), ribosomal RNA
              gi|120444900|ref|NR_003279.1| Mus musculus 28S ribosomal RNA (28s), ribosomal RNA
              gi|328447215|ref|NR_003278.2| Mus musculus 18S ribosomal RNA (Rn18s), ribosomal RNA

              After combining these four sequences with tRNA and mtDNA, I successfully filtered my reads, but I still wonder whether these four sequences are enough.
              Interesting topic.
              mart555, how did you do the integration? Just cat, or anything more?

              Thanks
              Last edited by SEQond; 10-08-2012, 07:06 AM. Reason: update info

              Comment


              • #22
                Originally posted by SEQond View Post
                Interesting topic
                mart555, how did you do the integration? Just cat, or anything more?

                Thanks
                I just cat them together.
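
For reference, here is a minimal sketch of that concatenation step, with the "U" to "T" conversion mentioned earlier in the thread folded in. The function name and the file names are made up for illustration; the only assumption is plain FASTA input whose header lines start with ">".

```shell
# Hypothetical helper: merge several FASTA files (rRNA, tRNA, mtDNA, ...)
# into one filter reference that can then be indexed.
combine_filter_fasta() {
    # Usage: combine_filter_fasta out.fa in1.fa [in2.fa ...]
    out=$1
    shift
    # Header lines (">...") are copied unchanged. On sequence lines,
    # RNA-style U/u is converted to DNA-style T/t.
    awk '/^>/ { print; next }
         { gsub(/U/, "T"); gsub(/u/, "t"); print }' "$@" > "$out"
}
```

Something like `combine_filter_fasta filter.fa rRNA.fa tRNA.fa chrM.fa` would then give one file to build the filter index from. For files that are already DNA, this behaves like a plain cat.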

                Comment


                • #23
                  Originally posted by DZhang View Post
                  Hi Clay, you should do your search in your gtf/gff file. The overall idea is to remove the rRNA/mtGenes from your gtf/gff file so the program does not process the excessive reads mapped to those genes.
                  There should be a way to reduce the time spent mapping to the reference genome by discarding the reads that align to rRNA, tRNA, or mitochondrial genes before the main alignment, rather than filtering with the UCSC GTFs after the main mapping has been done.

                  What I have done so far:


                  A. Align to the rRNA sequences with the following GenBank IDs and keep the reads that do not align.

                  Code:
                  GenBank, 4 records (mouse):
                  gi|262231778|ref|NR_030686.1| Mus musculus 5S RNA (Rn5s), ribosomal RNA
                  gi|374093199|ref|NR_003280.2| Mus musculus 5.8S ribosomal RNA (Rs5-8s1)
                  gi|120444900|ref|NR_003279.1| Mus musculus 28S ribosomal RNA (28s), ribosomal RNA
                  gi|374088232|ref|NR_003278.3| Mus musculus 18S ribosomal RNA (Rn18s), ribosomal RNA
                  B. Take chrM (mitochondrial) from mm9 or whichever reference genome you use, build a new Bowtie index from it, then align the output of (A) and, as before, keep the reads that do not align.

                  Code:
                  bowtie-build /RefGenomes/mouse/mm9/chrM.fa /RefGenomes/mtDNA/mouse/mm9/chrM &
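
Steps (A) and (B) could be chained roughly as below. This is only a sketch under assumptions: it uses Bowtie 1 with single-end FASTQ reads, relies on the --un option to write the reads that fail to align, and the function name, index names, and file names are all hypothetical.

```shell
# Hypothetical wrapper around two Bowtie 1 runs.
prefilter_reads() {
    # Usage: prefilter_reads reads.fq rRNA_index chrM_index clean.fq
    reads=$1      # input FASTQ
    rrna_idx=$2   # index built from the four rRNA sequences (step A)
    chrm_idx=$3   # index built from chrM.fa (step B)
    out=$4        # reads that aligned to neither reference

    # --un writes the reads that did NOT align; the alignments themselves
    # are thrown away by sending the hit output to /dev/null.
    bowtie -q --un "${out}.no_rrna.fq" "$rrna_idx" "$reads" /dev/null &&
    bowtie -q --un "$out" "$chrm_idx" "${out}.no_rrna.fq" /dev/null
}
```

A call such as `prefilter_reads sample.fq rRNA_mouse /RefGenomes/mtDNA/mouse/mm9/chrM sample.clean.fq` would leave the reads for the main alignment in sample.clean.fq. The rRNA_mouse index name is an assumption; it would come from a separate bowtie-build run on the four GenBank records.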

                  Please share your thoughts

                  Comment


                  • #24
                    Originally posted by polyatail View Post
                    You can get this pretty easily from the UCSC table browser.
                    1. Select "All Tables" from the group drop-down list
                    2. Select the "rmsk" table from the table drop-down list
                    3. Choose "GTF" as the output format
                    4. Type a filename in "output file" so your browser downloads the result
                    5. Click "create" next to filter
                    6. Next to "repClass," type "rRNA"
                    7. Next to free-form query, select "OR" and type repClass = "tRNA"
                    8. Click submit on that page, then get output on the main page


                    Check out the attached screenshots.

                    With this same method you can also get the sequences in FASTA format, so you can build a Bowtie index.

                    Instead of "GTF" choose "sequence" to get the FASTA.

                    As a side note, remember that with the above method you also get the mitochondrial tRNAs and rRNAs. If you don't want the mitochondrial entries, specify NOT LIKE "chr??" in the "Free-form query:" field. Further help here
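
If you would rather do the same selection on the command line, the sketch below filters a local copy of the rmsk table. It assumes a tab-separated dump in the UCSC column order (genoName in column 6, genoStart in 7, genoEnd in 8, strand in 10, repName in 11, repClass in 12), so check those positions against your own file. The function name is made up.

```shell
# Hypothetical helper: keep rRNA and tRNA repeats, skip chrM, print BED6.
rmsk_rna_to_bed() {
    # Usage: rmsk_rna_to_bed rmsk.txt > rRNA_tRNA.bed
    awk -F'\t' -v OFS='\t' '
        ($12 == "rRNA" || $12 == "tRNA") && $6 != "chrM" {
            # chrom, start, end, repeat name, placeholder score, strand
            print $6, $7, $8, $11, 0, $10
        }' "$1"
}
```

Drop the `$6 != "chrM"` condition if you want to keep the mitochondrial entries.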
                    Last edited by SEQond; 10-12-2012, 07:14 AM.

                    Comment


                    • #25
                      Simon Anders,

                      You say that for simple counting, removing the rRNA, mtDNA, etc. is not necessary. Could you elaborate? In which cases is it important?

                      Comment


                      • #26
                        Originally posted by Simon Anders View Post
                        ... if you just want to do simple counting in your next analysis step, you would just get a few extra count values, which you can then ignore.
                        He just said that for read counting (as in differential expression analyses), filtering on annotation descriptors is easier than filtering before the alignment (map against a filter-feature index, then map only the remaining unmapped reads against the genome regions of interest). For counting, you can simply ignore the read counts associated with the filtering criteria.
                        But there are indeed scenarios where it might be more reasonable to filter before the alignment, e.g. if you are not interested in some overrepresented gene groups, or if the rRNA content is very high for some reason. Then you can save mapping time with the smaller, prefiltered reference. Another aspect is the handling of multi-mapped reads. One could discard all reads that map (perfectly) to rRNA etc. The multi-mapped reads would then at least exclude reads that mapped to genes by chance but could also map to rRNA, which increases the reliability of the mapping.

                        Comment


                        • #27
                          Note that hanshart's suggestions are all ways to speed up the alignment process a bit. Investing much time in figuring out how to pre-filter your reads before alignment is probably worthwhile only if you have a very large number of reads to align.

                          After the alignment, it's trivial to remove the reads that map to rRNA, if this is important to you for some reason. Just look at the alignments and flag all reads that were aligned to rRNA loci.
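
A rough sketch of that post-alignment step is below. It assumes a SAM file plus a BED file of rRNA loci. It is deliberately simplified: only the leftmost mapped position (SAM column 4) is tested against each interval, the CIGAR string is ignored, and the BED file must not be empty. The function name is made up, and the linear scan will be slow on real data, where a dedicated interval tool such as bedtools or samtools scales better.

```shell
# Hypothetical helper: drop alignments that start inside an rRNA interval.
drop_rrna_alignments() {
    # Usage: drop_rrna_alignments rRNA.bed aln.sam > filtered.sam
    awk -F'\t' '
        # First file (BED): remember every interval.
        NR == FNR { n++; chrom[n] = $1; start[n] = $2; end[n] = $3; next }
        # Second file (SAM): header lines pass through unchanged.
        /^@/ { print; next }
        {
            # BED starts are 0-based and ends are exclusive, while SAM POS
            # is 1-based, so POS lies inside when start < POS <= end.
            for (i = 1; i <= n; i++)
                if ($3 == chrom[i] && $4 > start[i] && $4 <= end[i]) next
            print
        }' "$1" "$2"
}
```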

                          Comment


                          • #28
                            Originally posted by polyatail View Post
                            You can get this pretty easily from the UCSC table browser.
                            Thank you, this is very useful. Great post!

                            Comment


                            • #29
                              Can you get a version for Ensembl as well?

                              Comment


                              • #30
                                The UCSC table does not provide ribosomal genes

                                Below, I show one example of a ribosomal gene that is present in the iGenomes GTF file and absent from the rmsk UCSC table.

                                genes.gtf provided by iGenomes hg19 contains RPS25:

                                Code:
                                grep 118886422 genes.gtf
                                chr11   unknown exon    118886422       118886468       .       -       .       gene_id "RPS25"; gene_name "RPS25"; p_id "P8979"; transcript_id "NM_001028"; tss_id "TSS14859";
                                I got the rmsk.gtf from the UCSC table browser without using any filters. I downloaded the entire 520MB table.

                                It does NOT have this gene:

                                Code:
                                grep chr11 rmsk.gtf | grep -P '118886\d{3}'
                                chr11   hg19_rmsk       exon    118886716       118886989       1723.000000     +       .       gene_id "AluSz"; transcript_id "AluSz_dup3616";
                                Nor does it have any RPS genes:

                                Code:
                                grep RPS rmsk.gtf | wc -l
                                0
                                If I "grep RPS genes.gtf" to find ribosomal genes, I will also find non-ribosomal genes such as the transcription factor TRPS1, because TRPS1 contains the string "RPS".

                                Also, it is possible that some ribosomal genes are not found by "grep RPS". Perhaps I'm wrong about this.

                                In general, this is why suggesting "grep" is poor advice.
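
To make the problem concrete, here is a sketch of a more tightly anchored search. It matches only gene_id values of the form RPS or RPL followed by digits and at most one letter, so TRPS1 is skipped. The pattern is an illustrative guess and not a proper annotation: it would still miss ribosomal protein symbols such as RPLP0 or RPSA, and it says nothing about rRNA or tRNA. The function name is made up.

```shell
# Hypothetical helper: list gene_id values that look like RPS<n>/RPL<n>.
ribosomal_protein_ids() {
    # Usage: ribosomal_protein_ids genes.gtf
    # Anchoring on the opening and closing quotes avoids substring hits
    # such as TRPS1.
    grep -oE 'gene_id "RP[SL][0-9]+[A-Z]?"' "$1" | cut -d'"' -f2 | sort -u
}
```

A curated list of gene symbols matched exactly against gene_id would be more reliable than any pattern like this.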

                                If you have proper annotations for ribosomal genes and tRNAs, I would appreciate it if you could please share your method for obtaining them. You can also share your file if you want, but I'm only interested in the method and not the file.

                                Comment
