  • apredeus
    Senior Member
    • Jul 2012
    • 151

    Find all annotated rRNA (rDNA) sequences

    Hello all,

    I think it would be good to post it here for future reference. I could not find a respective topic here, and only found one discussion on Biostars.

    Picard tools has a tool named CollectRnaSeqMetrics.

    It's a very useful program that requires, among other more obvious things, a file of ribosomal intervals in a SAM-like format (Picard's interval_list): a SAM-type header, followed by intervals in 5 fields: chr, begin, end, strand (+ or -), and the actual gene name.
    Since I mostly deal with mouse (in mm9 assembly) and human (in hg19 assembly) genomes, I wanted to find these files or make them myself.
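For future readers, here is a minimal sketch of what such a file looks like when written from Python. The function name, the `@HD` line, and the example interval are illustrative assumptions, not something taken from Picard itself. In practice the header would normally be copied from the header of the BAM file being analysed, so that the sequence names and lengths match.

```python
def format_interval_list(sequences, intervals):
    """Return interval_list text: SAM-style header plus 5-column records.

    sequences: list of (name, length) pairs used for the @SQ header lines.
    intervals: list of (chrom, start, end, strand, name), 1-based inclusive.
    """
    # Assumed minimal header; real files usually reuse the BAM header.
    lines = ["@HD\tVN:1.0\tSO:unsorted"]
    for name, length in sequences:
        lines.append(f"@SQ\tSN:{name}\tLN:{length}")
    for chrom, start, end, strand, name in intervals:
        if strand not in ("+", "-"):
            raise ValueError(f"strand must be + or -, got {strand!r}")
        lines.append(f"{chrom}\t{start}\t{end}\t{strand}\t{name}")
    return "\n".join(lines) + "\n"


# Hypothetical interval, for illustration only (not a real annotation).
example = format_interval_list(
    sequences=[("chrM", 16571)],
    intervals=[("chrM", 100, 200, "+", "example_rRNA")],
)
print(example, end="")
```

The five record columns are tab-separated, and the coordinates are 1-based and inclusive.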

    I've tried to make sense of the files at http://www.arb-silva.de/ and just flat-out failed. If someone can tell me how to convert the files they have available for download into genomic intervals that correspond to rRNA, I'd be very grateful.

    At any rate, I've proceeded to the latest version of GENCODE. There are 1587 intervals annotated as "rRNA" transcript type in v17 of GENCODE. However, I've found that many intervals I had in a previous rRNA interval file (origins of which are mysterious), such as LSU-rRNAs, are absent.

    So, here are two main questions:

    - what should be the ultimate source of the information for rDNA annotated intervals?
    - what would that source be for the mouse genome, considering that there's no GENCODE data for mouse?

    Thank you for your inputs.
  • apredeus
    Senior Member
    • Jul 2012
    • 151

    #2
    Ok I've figured it out - I guess I did not search thoroughly enough.

    You can find the intervals using the UCSC Table browser. For this, you go to



    and then set group:all tables, table:rmsk, and filter to "repClass (does match) rRNA"

    then output it as a GTF file. Voila! Works for both mouse and human.
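To get from that GTF to the 5 fields Picard wants, something like the following sketch should work. It assumes the GTF has the standard 9 tab-separated columns with a `gene_id "..."` attribute; the function name and the sample line are made up for illustration. GTF and interval_list coordinates are both 1-based and inclusive, so start and end can be copied over unchanged.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gtf_to_interval_fields(gtf_lines):
    """Yield (chrom, start, end, strand, name) tuples from GTF lines."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        match = GENE_ID.search(cols[8])
        name = match.group(1) if match else "unknown"
        yield cols[0], int(cols[3]), int(cols[4]), cols[6], name


# Made-up record in the general shape of a Table Browser GTF line.
sample = ['chr1\thg19_rmsk\texon\t1000\t1120\t0.000000\t+\t.\tgene_id "5S"; transcript_id "5S";\n']
records = list(gtf_to_interval_fields(sample))
for rec in records:
    print("\t".join(str(field) for field in rec))
```

The printed lines can be appended below a SAM-style header to form the interval file.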

    Comment

    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      As another option: this information is also in the GTF files available from Ensembl: http://www.ensembl.org/info/data/ftp/index.html You will have to "grep" out the information.
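If you would rather do the "grep" step in Python, here is a rough sketch. It rests on an assumption about the file layout: some older Ensembl GTFs carry the biotype in the second (source) column, while newer ones use `gene_biotype`/`transcript_biotype` attributes, so the check looks in both places. The function name, the default biotype list, and the sample records are illustrative only.

```python
def is_rrna_record(gtf_line, biotypes=("rRNA", "Mt_rRNA")):
    """Return True if a GTF line looks like an rRNA annotation."""
    cols = gtf_line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return False
    if cols[1] in biotypes:
        return True  # biotype stored in the source column
    # Matches both gene_biotype "..." and transcript_biotype "..."
    return any(f'_biotype "{b}"' in cols[8] for b in biotypes)


# Made-up records covering both layouts plus one non-rRNA line.
sample = [
    '1\trRNA\texon\t1000\t1100\t.\t+\t.\tgene_id "FAKE1"; gene_name "5S_rRNA";',
    '1\tensembl\tgene\t2000\t2100\t.\t-\t.\tgene_id "FAKE2"; gene_biotype "rRNA";',
    '1\tprotein_coding\texon\t3000\t3100\t.\t+\t.\tgene_id "FAKE3";',
]
kept = [line for line in sample if is_rrna_record(line)]
print(len(kept), "of", len(sample), "records kept")
```

Check a few lines of your own file first to see which layout it uses.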

      Comment

      • apredeus
        Senior Member
        • Jul 2012
        • 151

        #4
        Cool. It would be interesting to compare the two.

        Comment

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          There are differences in GTF files for UCSC and Ensembl downloaded from the iGenomes. See for reference: http://seqanswers.com/forums/showthread.php?t=41701

          Can you check to see if the data you got from UCSC matches the Ensembl GTF?

          Comment

          • apredeus
            Senior Member
            • Jul 2012
            • 151

            #6
            Originally posted by GenoMax View Post
            There are differences in GTF files for UCSC and Ensembl downloaded from the iGenomes. See for reference: http://seqanswers.com/forums/showthread.php?t=41701

            Can you check to see if the data you got from UCSC matches the Ensembl GTF?
            I definitely will, and I'll post the results here in a day or two.

            Comment

            • jbchang
              Junior Member
              • Feb 2014
              • 2

              #7
              Hey folks,

              I've been looking into this for a day or two as well, and I also downloaded the .gtf corresponding to repClass=rRNA (also tRNA) from the UCSC Table Browser.

              When I scroll through the .gtf file, though, it seems to only have entries corresponding to the 5S ribosome. Is this true for you guys, too?


              Thanks,
              Jeremy

              Comment

              • apredeus
                Senior Member
                • Jul 2012
                • 151

                #8
                Originally posted by jbchang View Post
                Hey folks,
                I've been looking into this for a day or two as well, and I also downloaded the .gtf corresponding to repClass=rRNA (also tRNA) from the UCSC Table Browser.
                When I scroll through the .gtf file, though, it seems to only have entries corresponding to the 5S ribosome. Is this true for you guys, too?
                Thanks,
                Jeremy
                So I spent some time learning about the situation, and this is actually pretty cool.
                So, there are 6 kinds of ribosomal RNA in mammals (well, for sure in humans): 3 belonging to the large subunit (LSU rRNA: 5S, 5.8S, and 28S), 1 belonging to the small subunit (SSU rRNA: 18S), and 2 mitochondrial rRNAs (12S, 16S).

                The mitochondrial ones are the easiest: they reside, well, on chromosome M. Two rRNAs, two genes, very neat.

                However, others are a real mess. Here's what Wiki says about it:

                The 28S, 5.8S, and 18S rRNAs are encoded by a single transcription unit (45S) separated by 2 internally transcribed spacers. The 45S rDNA is organized into 5 clusters (each has 30-40 repeats) on chromosomes 13, 14, 15, 21, and 22. These are transcribed by RNA polymerase I. 5S occurs in tandem arrays (~200-300 true 5S genes and many dispersed pseudogenes), the largest one on the chromosome 1q41-42. 5S rRNA is transcribed by RNA polymerase III.
                So, yes, we do see a ton of 5S genes and pseudogenes all over the annotated GTFs. There are also some 5.8S, but not many (about ten). Where are the above-mentioned clusters of 45S rDNA, though? Well, that's where it gets interesting. They are still not annotated! GeneCards says the following:

                The sequences coding for ribosomal RNAs are present as rDNA repeating units, designated RNR1 through RNR5, in the p12 region of chromosomes 13, 14, 15, 21 and 22. A 45S rRNA which serves as the precursor for the 18S, 5.8S and 28S rRNA, is transcribed from each rDNA unit by RNA polymerase I. The number of rDNA repeating units varied between individuals and from chromosome to chromosome, although usually 30 to 40 repeats are found on each chromosome. These ribosomal repeating units are not currently annotated on the reference genome. This gene represents the portion of one rDNA repeat which encodes an 18S rRNA.(provided by RefSeq, Mar 2009) .
                Indeed, in the Ensembl file I've found the following entries:

                Code:
                GL000220.1      109078  110946  5S_rRNA rRNA
                GL000220.1      109078  110946  RNA18S5 rRNA
                GL000220.1      109078  110946  RNA18S5 rRNA
                GL000220.1      112025  112177  RNA18S5 rRNA
                GL000220.1      112025  112177  RNA5-8S5        rRNA
                GL000220.1      112025  112177  RNA5-8S5        rRNA
                GL000220.1      113348  118417  RNA5-8S5        rRNA
                GL000220.1      113348  118417  RNA28S5 rRNA
                GL000220.1      113348  118417  RNA28S5 rRNA
                GL000220.1      114151  114242  RNA28S5 rRNA
                GL000220.1      114151  114242  RNA28S5 rRNA
                GL000220.1      118197  118253  RNA28S5 rRNA
                GL000220.1      118197  118253  RNA28S5 rRNA
                GL000220.1      155997  156149  RNA28S5 rRNA
                GL000220.1      155997  156149  RNA5-8S5        rRNA
                GL000220.1      155997  156149  RNA5-8S5        rRNA
                GL000228.1      20113   20230   RNA5-8S5        rRNA
                GL000228.1      20113   20230   5S_rRNA rRNA
                GL000228.1      20113   20230   5S_rRNA rRNA
                GL000228.1      22673   22791   5S_rRNA rRNA
                GL000228.1      22673   22791   5S_rRNA rRNA
                GL000228.1      22673   22791   5S_rRNA rRNA
                As of GRCh38, contig GL000220.1 is still unplaced. However, GL000228.1 is obsolete and probably is present in main GRCh38 assembly.

                Well, I think we all learned something today!

                Comment

                • apredeus
                  Senior Member
                  • Jul 2012
                  • 151

                  #9
                  So, to continue, I've looked at the "gene_id" identifiers provided in UCSC "rmsk" table (which, as I understand now, refers to RepeatMasker database, go me! ).

                  For hg19, the output is

                  Code:
                     1275 5S
                      414 LSU-rRNA_Hsa
                       80 SSU-rRNA_Hsa
                  For mm9, it is

                  Code:
                     1035 5S
                      491 LSU-rRNA_Hsa
                       46 SSU-rRNA_Hsa
                  So basically the conclusion is that 1) UCSC DOES include all of the rRNA in their "rmsk" table annotation, and 2) to get a realistic picture of rRNA presence, you should use a genome with all of the "random" chromosomes, etc.

                  Now, I'm seeing quite a few differences between the intervals provided in the Ensembl GTF file and in the UCSC tables. I'll have to look at this in more detail to try and understand where they come from.
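The per-name counts in this post can be reproduced with a short tally over the `gene_id` attribute. This is only a sketch of one way to do it, not necessarily how the numbers above were produced. The function name and the sample lines are made up.

```python
import re
from collections import Counter

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def count_gene_ids(gtf_lines):
    """Count how many GTF records carry each gene_id value."""
    counts = Counter()
    for line in gtf_lines:
        match = GENE_ID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up records, for illustration only.
sample = [
    'chr1\thg19_rmsk\texon\t10\t120\t0\t+\t.\tgene_id "5S"; transcript_id "5S";',
    'chr2\thg19_rmsk\texon\t50\t170\t0\t-\t.\tgene_id "5S"; transcript_id "5S_dup1";',
    'chr3\thg19_rmsk\texon\t90\t400\t0\t+\t.\tgene_id "LSU-rRNA_Hsa"; transcript_id "LSU-rRNA_Hsa";',
]
counts = count_gene_ids(sample)
for name, n in counts.most_common():
    print(f"{n:7d} {name}")
```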

                  Comment

                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    Originally posted by apredeus View Post
                    Now, I'm seeing quite a few differences between the intervals provided in the Ensembl GTF file and in the UCSC tables. I'll have to look at this in more detail to try and understand where they come from.
                    First thing to check would be the genome builds and make sure they are the same.

                    Comment

                    • apredeus
                      Senior Member
                      • Jul 2012
                      • 151

                      #11
                      Right, yes, I've considered this. From what I've learnt, GRCh37 should be precisely identical in terms of genomic coordinates to what's known as hg19 in UCSC notation.

                      Does the same hold true for GRCm38/mm10?

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        Originally posted by apredeus View Post
                        Right, yes, I've considered this. From what I've learnt, GRCh37 should be precisely identical in terms of genomic coordinates to what's known as hg19 in UCSC notation.

                        Does the same hold true for GRCm38/mm10?
                        That is correct.
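Matching coordinates do not mean matching sequence names, though, and the naming of random chromosomes and unplaced scaffolds comes up later in this thread. Below is a hedged sketch of a name converter for hg19. It assumes the usual hg19 pattern in which `chrUn_gl000220` and `chr17_gl000205_random` correspond to `GL000220.1` and `GL000205.1`. The function name is made up, and alternate haplotype sequences are deliberately left unmapped.

```python
import re

# Unplaced (chrUn_glNNNNNN) and unlocalized (chrN_glNNNNNN_random) hg19 names.
_GL = re.compile(r"^chr(?:Un|[0-9XY]+)_gl(\d{6})(?:_random)?$")


def ucsc_hg19_to_ensembl(name):
    """Map a UCSC hg19 sequence name to an Ensembl-style name, or None."""
    if name == "chrM":
        # Name mapping only: hg19 chrM and GRCh37 MT are different
        # mitochondrial sequences, so coordinates are not interchangeable.
        return "MT"
    match = _GL.match(name)
    if match:
        # Assumes accession version .1 for these scaffolds.
        return f"GL{match.group(1)}.1"
    if re.fullmatch(r"chr([0-9]+|X|Y)", name):
        return name[3:]
    return None  # e.g. haplotype sequences such as chr6_apd_hap1


for ucsc in ["chr1", "chrX", "chrUn_gl000220", "chr17_gl000205_random", "chr6_apd_hap1"]:
    print(ucsc, "->", ucsc_hg19_to_ensembl(ucsc))
```

Returning `None` for anything unrecognised makes it easy to spot names that need manual handling.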

                        Comment

                        • jbchang
                          Junior Member
                          • Feb 2014
                          • 2

                          #13
                          Thanks, guys, for this valuable discussion. After looking through the rmsk rRNA entries more closely, they seem to make enough sense that I'm willing to trust RepeatMasker.

                          In terms of the Ensembl vs Hg19 (rmsk) coordinates/annotations for rRNA, although I am new to this, it doesn't seem like they should necessarily correspond. Don't they have different annotation pipelines/procedures, anyway?


                          Best,
                          Jeremy

                          Comment

                          • apredeus
                            Senior Member
                            • Jul 2012
                            • 151

                            #14
                            Originally posted by jbchang View Post
                            In terms of the Ensembl vs Hg19 (rmsk) coordinates/annotations for rRNA, although I am new to this, it doesn't seem like they should necessarily correspond. Don't they have different annotation pipelines/procedures, anyway?
                            Yes, they are different, but the question is which one is more comprehensive (and thus will give a more complete picture in, e.g., evaluating rRNA presence in an RNA-seq experiment).

                            Actually, as I'm finding out, they are super different. I'm not quite sure what the reason for that is. I'll post more details in the next post.

                            Comment

                            • apredeus
                              Senior Member
                              • Jul 2012
                              • 151

                              #15
                              So, from the comparison I've done, it seems like the rmsk intervals have a lot more coverage. They include virtually all of the Gencode intervals, plus many more unique intervals.

                              So, for hg19:

                              hg19_rRNA_rmsk.gtf has 1769 intervals covering 193760 bp
                              hg19_rRNA_gencode.gtf has 571 intervals covering 70960 bp (v19 of human Gencode)
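For anyone wanting to compute figures like these, here is a minimal sketch of a covered-bp calculation that counts overlapping bases only once. Whether the numbers above were computed with or without merging overlaps is not stated, so treat this as one possible definition. The function name and the test intervals are made up.

```python
def covered_bp(intervals):
    """Total bases covered by 1-based inclusive (chrom, start, end)
    intervals, counting overlapping bases once."""
    total = 0
    current = None  # the merged interval being extended
    for chrom, start, end in sorted(intervals):
        if current and chrom == current[0] and start <= current[2] + 1:
            # Overlapping or directly adjacent: extend the merged interval.
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1] + 1
            current = (chrom, start, end)
    if current:
        total += current[2] - current[1] + 1
    return total


# Made-up intervals: two overlapping on chr1, one on chr2.
demo = [("chr1", 1, 10), ("chr1", 5, 20), ("chr2", 1, 5)]
print(covered_bp(demo))
```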

                              When you intersect the two, you see that they have about 50 kb in common. However, all of the Gencode intervals except the few listed below are included in the rmsk version (I can't evaluate differences in random chromosomes and unplaced scaffolds, since they are named differently in the two files):

                              Code:
                              chr2	133010727	133010878	RNA5-8SP5
                              chr2	162266065	162266181	5S_rRNA
                              chr9	32293556	32293690	RNA5SP281
                              chr9	110681147	110681259	RNA5SP293
                              chr9	111754689	111754849	RNA5-8SP3
                              chr10	49248476	49248591	RNA5SP315
                              chr11	8866810	8866905	RNA5SP330
                              chr11	96207736	96207856	RNA5SP346
                              chr12	66460001	66460118	RNA5SP362
                              chr15	98015356	98015482	RNA5SP401
                              chr16	33965426	33965577	RNA5-8SP2
                              chr19	24187160	24187309	RNA5-8SP4
                              chr20	5326652	5326806	RNA5-8SP7
                              chrY	10037764	10037915	RNA5-8SP6
                              So basically, I'm going to use the rmsk GTF file with the addition of these 14 lines. This should be more than enough for my purposes. I'm also definitely including all of the random chromosomes etc., since a lot of rRNA elements are there.
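A list like the one above can be produced by checking each interval in one file against the other. The sketch below treats "included" as "overlaps by at least one base", which is an assumption; full containment would be a stricter test. The function name and the reference interval are made up, while the query interval is the first line of the list above.

```python
from collections import defaultdict


def not_overlapping(query, reference):
    """Return query intervals (chrom, start, end, name) that overlap no
    reference interval (chrom, start, end). Coordinates are inclusive."""
    by_chrom = defaultdict(list)
    for chrom, start, end in reference:
        by_chrom[chrom].append((start, end))
    missing = []
    for chrom, start, end, name in query:
        # Two inclusive intervals overlap when each starts before the other ends.
        hit = any(s <= end and start <= e for s, e in by_chrom.get(chrom, ()))
        if not hit:
            missing.append((chrom, start, end, name))
    return missing


gencode_like = [
    ("chr2", 133010727, 133010878, "RNA5-8SP5"),  # from the list above
    ("chr1", 1000, 1100, "made_up_gene"),         # hypothetical
]
rmsk_like = [("chr1", 1050, 1200)]                # hypothetical
missing = not_overlapping(gencode_like, rmsk_like)
for rec in missing:
    print("\t".join(str(field) for field in rec))
```

The linear scan per chromosome is fine for a few thousand intervals; for larger files, a dedicated interval tool would be the better choice.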
                              Last edited by apredeus; 03-28-2014, 09:08 PM.

                              Comment
