  • Is it a good choice to filter out reads using repeatmask?

    Dear all,
    For my RNA-seq data, I want to filter out reads from housekeeping RNAs. Is it reasonable to use repeatmask as the filter?

  • #2
    If you mean RepeatMasker, then I doubt it would be quick enough to process the volume of data from a high-throughput sequencing run in an acceptable time.

    If you're mapping your data to a known genome then you shouldn't need to pre-filter. If you do need to remove known contaminants I'd suggest using a mapping program to specifically filter them out rather than doing a more general repeat masking.


    • #3
      Thanks for your reply.

      I use the UCSC RepeatMasker annotation rather than running RepeatMasker myself.

      Also, I only use the RNA portion of the RepeatMasker annotation (rRNA, tRNA, etc.).

      I chose the RepeatMasker annotation because there seems to be no other specific annotation for the various types of housekeeping RNAs, such as rRNA, tRNA, snRNA and snoRNA.

      Do you have a better suggestion?
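
As a rough illustration of using only the RNA portion of the annotation, here is a minimal Python sketch that pulls the RNA classes out of a UCSC rmsk table dump and emits BED-style rows. The column layout, the 0-based half-open coordinates and the set of class names are assumptions about the usual rmsk dump and should be checked against your own download; `rmsk_to_bed` is a made-up helper name.

```python
import csv
import io

# Assumed 0-based column positions in a UCSC rmsk table dump:
# bin, swScore, milliDiv, milliDel, milliIns, genoName, genoStart, genoEnd,
# genoLeft, strand, repName, repClass, repFamily, ...
CHROM, START, END, STRAND, REP_NAME, REP_CLASS = 5, 6, 7, 9, 10, 11

# Repeat classes treated as housekeeping RNA here (an assumption; adjust as needed).
RNA_CLASSES = {"rRNA", "tRNA", "snRNA", "scRNA", "srpRNA", "RNA"}


def rmsk_to_bed(lines, classes=RNA_CLASSES):
    """Yield BED6 rows (chrom, start, end, name, score, strand) for the wanted classes."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and a commented header, if present
        if row[REP_CLASS] in classes:
            yield (row[CHROM], int(row[START]), int(row[END]),
                   row[REP_NAME], 0, row[STRAND])


# Two made-up rows: one rRNA element and one SINE.
example = io.StringIO(
    "585\t1000\t10\t0\t0\tchr1\t100\t200\t-1000\t+\tLSU-rRNA_Hsa\trRNA\trRNA\t1\t100\t0\t1\n"
    "585\t1000\t10\t0\t0\tchr1\t300\t400\t-800\t-\tAluY\tSINE\tAlu\t1\t100\t0\t2\n"
)
bed_rows = list(rmsk_to_bed(example))
print(bed_rows)  # only the rRNA row is kept
```

The resulting rows can be written out tab-separated and used as the annotation file for filtering.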


      • #4
        You could download a collection of such sequences from RepBase (www.girinst.org) and create a reference out of it for your short read aligner.


        • #5
          Is that a better choice than using the RepeatMasker annotation from UCSC?


          • #6
            I can't judge which is better, but the RepeatMasker annotation seems simpler to me. If you have just the genomic coordinates of the repeats you want to exclude, that is a small file that can easily be used to filter the BAM file of your aligned reads, e.g. with BEDTools. If you map reads to a repeat reference instead, you have to find out which reads align there and throw them out. This is how the ABI BioScope whole transcriptome pipeline does repeat filtering. I don't know of a stand-alone tool that works this way.
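
To make the coordinate-based filtering concrete, here is a small Python sketch of the underlying idea: drop every alignment that overlaps an annotated interval. It is not BEDTools and does not read BAM; alignments and annotation are plain tuples with assumed 0-based half-open coordinates, and `build_index` and `overlaps` are hypothetical helper names.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(bed_rows):
    """Group half-open (chrom, start, end) intervals by chromosome and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in bed_rows:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping or touching: extend the previous interval.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def overlaps(index, chrom, start, end):
    """True if the half-open interval [start, end) overlaps any indexed interval."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    i = bisect_right(starts, end - 1) - 1  # last interval that starts before `end`
    return i >= 0 and merged[i][1] > start


# Made-up annotation and alignments, purely for illustration.
annotation = [("chr1", 100, 200), ("chr1", 150, 250), ("chr2", 0, 50)]
alignments = [
    ("read1", "chr1", 90, 140),   # overlaps the chr1 repeat
    ("read2", "chr1", 250, 300),  # starts exactly where the repeat ends
    ("read3", "chr2", 49, 99),    # overlaps by one base
    ("read4", "chr3", 10, 60),    # chromosome without any annotation
]
index = build_index(annotation)
kept = [name for name, chrom, s, e in alignments if not overlaps(index, chrom, s, e)]
print(kept)  # ['read2', 'read4']
```

In this sketch a single overlapping base is enough to discard a read; a minimum-overlap threshold would be a reasonable variation.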


            • #7
              No. Never mask.


              • #8
                Why? Will you please tell me how to filter out reads from house-keeping RNA? Thanks.


                • #9
                  It seems that lh3, like simonandrews, assumed that you want to run RepeatMasker on the reads. Of course you won't do that. Just try using your annotation and BEDTools to filter the BAM file; that should work.


                  • #10
                    Originally posted by liuxq:
                    "Why? Will you please tell me how to filter out reads from house-keeping RNA? Thanks."
                    I believe masking will cause the reads that should have aligned to those housekeeping RNAs to align sub-optimally elsewhere and give a biased signal!
                    --
                    bioinfosm


                    • #11
                      I agree with others here that you would not want to map your reads against a repeatmasked genome. You should use the complete genome unmodified. You would also not want to use repeatmasker to screen your reads for repeat similarity as this would likely be too slow. However, if you simply want to remove reads corresponding to rRNAs, tRNAs, etc. from your data you could do as others have suggested and download the RepBase annotations for your species (plus ancestral perhaps). Then use a short read aligner such as BWA to identify all the reads that are likely to correspond to a repeat element. You can then remove these reads from the analysis and save time when mapping the remaining reads to the whole genome. Since the RepBase database is way smaller than the genome, you should be able to align millions of reads to it very quickly.
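
A minimal sketch of the remove-then-remap step, assuming the reads have already been aligned to the repeat reference and the aligner wrote SAM text. It treats any read whose FLAG lacks the 0x4 (unmapped) bit as a repeat hit and drops it from the FASTQ. The assumptions are four-line FASTQ records and SAM read names that match the first token of the FASTQ header; both function names are made up for illustration.

```python
def mapped_read_names(sam_lines):
    """Return the names of reads with at least one alignment (FLAG bit 0x4 unset)."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        if not int(fields[1]) & 0x4:
            names.add(fields[0])
    return names


def filter_fastq(fastq_lines, drop_names):
    """Yield 4-line FASTQ records whose read name is not in drop_names."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            name = record[0][1:].split()[0]  # strip '@' and any description
            if name not in drop_names:
                yield record
            record = []


# Made-up data: r1 aligns to a repeat sequence, r2 is unmapped.
sam = [
    "@SQ\tSN:rRNA_45S\tLN:13000",
    "r1\t0\trRNA_45S\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "TTTT", "+", "IIII"]

repeat_hits = mapped_read_names(sam)
kept = [rec[0] for rec in filter_fastq(fastq, repeat_hits)]
print(kept)  # ['@r2']
```

For paired-end data, name suffixes such as /1 and /2 would need to be handled so that both mates are dropped together; that is left out here.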

                      One word of caution, if you download the RepBase annotations, by default, simple repeats are only presented as 70mers. The length is arbitrary for these elements as they occur at many different lengths throughout the genome. If your reads are longer than this length, you should extend the length of these repeat elements in your repeat database.
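
One way to extend a simple-repeat entry, sketched under the assumption that the entry is an exact tandem repetition of a short unit: find the smallest period of the sequence and tile that unit out to the desired length. `extend_simple_repeat` is a hypothetical helper and should only be applied to entries known to be simple repeats.

```python
def smallest_period(seq):
    """Return the smallest p such that seq[i] == seq[i + p] for every valid i."""
    for p in range(1, len(seq) + 1):
        if all(seq[i] == seq[i + p] for i in range(len(seq) - p)):
            return p
    return len(seq)  # only reached for an empty sequence


def extend_simple_repeat(seq, target_len):
    """Tile the repeat unit of seq until the result is target_len bases long."""
    if not seq or len(seq) >= target_len:
        return seq  # nothing to extend
    unit = seq[:smallest_period(seq)]
    copies = -(-target_len // len(unit))  # ceiling division
    return (unit * copies)[:target_len]


print(extend_simple_repeat("CACACACACA", 16))  # CACACACACACACACA
```

A target somewhat longer than the read length is a sensible choice, so that a read lying entirely within the repeat can still align end to end.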

                      One situation where I could imagine using this approach is where you have sequenced a library made from total RNA (~95-98% rRNA sequences), or a RiboMinus-processed library that still contains a lot of rRNA. We have sequenced transcriptome libraries like this where the majority of all reads map to a handful of rRNA genes. Hidden among these were the reads corresponding to the rest of the transcriptome. Filtering the rRNA reads out seems reasonable. If your library is polyA+ I wouldn't worry about it.


                      • #12
                        Originally posted by bioinfosm:
                        "I believe masking will cause the reads that should have aligned to those housekeeping RNAs to align sub-optimally elsewhere and give a biased signal!"
                        I did not align reads to a repeat-masked genome. After aligning the reads to the reference genome, I filtered out all mapped reads that overlap the RepeatMasker annotation from UCSC. Do you think this method still has bias?


                        • #13
                          Originally posted by liuxq:
                          "I did not align reads to a repeat-masked genome. After aligning the reads to the reference genome, I filtered out all mapped reads that overlap the RepeatMasker annotation from UCSC. Do you think this method still has bias?"
                          That sounds OK to me! You simply do not care about those regions and are interested in other features, just like someone doing whole-genome sequencing and then looking only at the coding-region data...
                          --
                          bioinfosm


                          • #14
                            Originally posted by liuxq:
                            "I did not align reads to a repeat-masked genome. After aligning the reads to the reference genome, I filtered out all mapped reads that overlap the RepeatMasker annotation from UCSC. Do you think this method still has bias?"
                            It depends on what you define as "bias". RepeatMasker is imprecise. It masks many regions that are not repetitive and leaves unmasked many regions that are not single copy.


                            • #15
                              Rid rRNA in processing?

                              Has anyone been able to simply disregard the rRNA reads during computational processing?

                              Has anyone dealt with rRNA in RNA-IP-Seq samples without rRNA reduction? Did the rRNA reads take away so much coverage that other genes did not have enough?

                              Has anyone used Epicentre or RiboMinus in C. elegans? Bad or good?
