
  • miRNA analysis..

    Hello,

    I am relatively new to NGS analysis, and have recently been put on a project analyzing samples that have been sequenced by Illumina-Solexa's small RNA protocol.

    I have reviewed this forum and found much advice on various tools/pipelines including miRtools, miRanalyzer, miRexpress, miRdeep, etc. I just wanted to get some feedback/advice on the type of results I am getting using these various public domain pipelines.

    1. Using the s_*_sequence.txt output from the sequencer pipeline, I have about 22 million reads per sample. After removing redundancy, trimming the adapter sequences, and removing reads where the adapter sequence shows up at the beginning or middle, I pared this down to ~300,000 unique reads (+/- 25,000, depending on the sample).

    Is this typical?

    2. When analyzing the length distributions of my unique reads, I am seeing a clear peak around 20-21 nucleotides, but I am also seeing some minor peaks around 29 and 32 nucleotides. When analyzing the length distributions of the total read counts, I am seeing a peak at 29 nucleotides. Please see attached plots.

    What could this possibly mean? Shouldn't I be seeing peaks around 20-21 nucleotide only? Has anyone seen this type of data before? The idea of these sequences being fragments of mRNA, piRNA, snoRNA, tRNA, etc. has been suggested.

    3. I aligned my unique reads to the human genome using Bowtie. Approximately 45-60% of the unique reads align to the genome, while the rest do not. I used the command:

    ./bowtie -n 1 -l 17 -k 200 --best --chunkmbs 128

    So what does this mean for the remaining sequences that are not aligning? Why would there be so many unaligned sequences in my data?
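
Question 2 compares two different length distributions: one over unique sequences and one over total read counts. A minimal sketch of how the two could be computed from a list of trimmed reads is given below. The function name and the toy reads are assumptions made for illustration and are not taken from any of the pipelines mentioned above.

```python
from collections import Counter


def length_distributions(reads):
    """Return two length histograms: one over unique sequences, one over all reads."""
    copies = Counter(reads)  # sequence -> number of times it was read
    unique_hist = Counter()
    total_hist = Counter()
    for seq, n in copies.items():
        unique_hist[len(seq)] += 1  # every distinct sequence counts once
        total_hist[len(seq)] += n   # every distinct sequence counts n times
    return unique_hist, total_hist


# Toy input (made up): two distinct 21-mers and one abundant 29-mer.
toy_reads = ["ACGTACGTACGTACGTACGTA"] * 2 + ["TTGCATTGCATTGCATTGCAT"] + ["G" * 29] * 5
unique_hist, total_hist = length_distributions(toy_reads)
for length in sorted(total_hist):
    print(length, unique_hist[length], total_hist[length])
```

In the toy data the 29 nt peak dominates the total counts but not the unique counts, which is the pattern you would expect when a few sequences of that length are highly abundant.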

    -----------------------

    Thank you all for any insight you can provide me with. SEQAnswers has been very helpful for me thus far in getting introduced to NGS data analysis!
    Attached Files
    Last edited by quicksand21; 06-15-2010, 06:22 PM. Reason: Added another figure showing nucleotide distribution of sequence length 29 reads

  • #2
    miRNAs range from 16 to 29 nt. Your 29 nt peak seems too high to ignore. I would advise checking for contaminating sequences. Making a nucleotide distribution plot of these 29 nt reads might give you some ideas.

    Try mapping again with less stringent parameters. There will always be some sequences that just won't align and remain a question.
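
The nucleotide distribution plot suggested in this reply can be approximated by tabulating base frequencies at each position for reads of one length. The following is only a sketch; the function name and the toy reads are assumptions, and the output would still need to be plotted with a tool of your choice.

```python
from collections import Counter


def position_composition(reads, length):
    """Per-position base frequencies for all reads of exactly the given length."""
    selected = [read for read in reads if len(read) == length]
    if not selected:
        return []
    columns = [Counter() for _ in range(length)]
    for read in selected:
        for i, base in enumerate(read):
            columns[i][base] += 1
    n = len(selected)
    # Convert raw counts to fractions so positions are comparable.
    return [{base: count / n for base, count in col.items()} for col in columns]


# Toy input (made up): three reads of length 3, one read of another length.
toy_reads = ["ACG", "ACT", "AGG", "TTTT"]
for i, freqs in enumerate(position_composition(toy_reads, 3), start=1):
    print(i, sorted(freqs.items()))
```

A position dominated by a single base across most reads of that length hints at one abundant sequence (for example a contaminant or a tRNA fragment) rather than a diverse population.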


    • #3
      When the technicians conducted the small RNA protocol, I believe they ran a gel and cut out two bands corresponding to the smallest sequence lengths, thus pre-selecting for the small RNAs, correct? Could the second band they cut out be related to this big peak at 29? Also, I did what Melissa suggested, and have what seems to be well-distributed nucleotides. I'll add an attachment to my original post.

      Also, what do you think about removing any sequences that only have a copy number of 1 or 2? These obviously are not well-represented in the sample, and could be a result of some sort of machine error?
      Last edited by quicksand21; 06-15-2010, 06:27 PM. Reason: typo


      • #4
        Just the other day we discussed the following paper about miRNA NGS in our group:
        Deep sequencing reveals differential expression of microRNAs in favorable versus unfavorable neuroblastoma.
        Schulte JH, Marschall T, Martin M, Rosenstiel P, Mestdagh P, Schlierf S, Thor T, Vandesompele J, Eggert A, Schreiber S, Rahmann S, Schramm A.
        Nucleic Acids Res. 2010 May 13.
        You might find it helpful since they find a large fraction of 35 bp reads, probably constituting tRNAs and pre-miRNAs. Less than 50% of all reads align uniquely to the human genome or known miRNAs, but repeats, mRNAs or contamination are rare. They didn't elaborate on the issue, but we think the low mapping rate results from the short length of miRNAs and RNA editing, leading to a high mismatch rate. I'm not familiar with Bowtie so I can't judge how you defined the mismatch rate.
        As to discarding sequences with low copy number, the authors required the miRNAs to be present at least 5 times to be included in the following analyses.
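
The copy-number cutoff described here (at least 5 occurrences) could be applied along these lines. This is a sketch under the assumption that reads are already adapter-trimmed; the function name is made up, and the threshold of 5 is simply the value quoted from the paper, not a general recommendation.

```python
from collections import Counter


def filter_by_copy_number(reads, min_copies=5):
    """Keep only sequences observed at least min_copies times, with their counts."""
    copies = Counter(reads)
    return {seq: n for seq, n in copies.items() if n >= min_copies}


# Toy input (made up): one sequence seen 6 times, one seen twice, one singleton.
toy_reads = ["TGAGGTAGTAGGTTGTATAGTT"] * 6 + ["ACGTACGTACGTACGTACG"] * 2 + ["TTTTACGTACGTACGTAAAA"]
print(filter_by_copy_number(toy_reads))
```

Note that the counts are kept rather than discarded, so the abundance information survives the filter.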


        • #5
          new miRNA software

          Hi,

          I don't know if this would be helpful to you, but CLC just released a new version of our Genomics Workbench and the most significant added functionality is a suite of tools to support miRNA analysis. You would be welcome to use the free trial to see if it clarifies some of the questions that you have with your results.

          Here is a link to download the new release.


          Also, let me know if you want to just take a look at the tutorial, and I will send it to you.

          Best of luck with your analysis!
          Naomi


          • #6
            @ epigen: Thank you very much for this paper recommendation. I will read through it. It seems from your post that this may answer some of my issues.

            @ Naomi: Thanks for this suggestion. I think I may take you up on it and try the trial version.


            • #7
              Originally posted by quicksand21:
              3. I aligned my unique reads to the human genome using BowTie. I am getting approximately 45-60% of the unique reads aligning to the genome, while the rest of the reads do not align. I have used the command:

              ./bowtie -n 1 -l 17 -k 200 --best --chunkmbs 128
              If your fragments are shorter than your reads, the reads will contain parts of the adapter on the 3' side of the fragments, which will confuse the aligner. You need to trim the reads by matching their ends against the (reverse-complemented) sequence of the ends of your 5' adapter and trimming off any matches.

              Our HTSeq framework contains functionality for this. I can give you a more detailed explanation if needed.

              Simon
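
To illustrate the kind of trimming described in this reply, here is a minimal sketch in plain Python. It is not HTSeq's implementation and does not use the HTSeq API. It only handles exact matches, and the adapter string, the function name and the `min_overlap` parameter are placeholders chosen for illustration. Substitute the adapter sequence that your library prep kit actually places after the insert.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove an exact adapter match (full or partial at the read end) from the 3' side."""
    # Case 1: the complete adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter, so only an
    # adapter prefix is visible at the end of the read. Try the longest
    # possible overlap first.
    longest = min(len(read), len(adapter) - 1)
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter found: return the read unchanged.
    return read


PLACEHOLDER_ADAPTER = "TCGTATGCCG"  # hypothetical, not a real kit sequence
insert = "TAGCTTATCAGACTGATGTTGA"   # a 22 nt toy insert

print(trim_3prime_adapter(insert + PLACEHOLDER_ADAPTER + "TCTT", PLACEHOLDER_ADAPTER))
print(trim_3prime_adapter(insert + PLACEHOLDER_ADAPTER[:5], PLACEHOLDER_ADAPTER))
```

Real reads contain sequencing errors, so exact matching will miss some adapters. A dedicated trimmer that tolerates mismatches is the safer choice for production use.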


              • #8
                Hi, I'm one of the authors of the paper mentioned by epigen. I just made our software for adapter removal (that we used in the paper) available. See the project homepage at https://code.google.com/p/cutadapt/ .


                • #9
                  I know this is an old thread and the original poster is probably long gone, but here is one relevant piece of information for people still reading. Quicksand21 wondered why he was left with only 300,000 of initially 22 million reads after preprocessing. His preprocessing involved "removing redundancy", which, I suppose, means removing all reads with the same sequence. Now, as he sequenced microRNA, every miRNA that got properly sequenced will then appear only once, because each miRNA species can give rise to only one read sequence (namely, if all works correctly: the miRNA sequence, followed by the 3' adapter). It is debatable whether duplicate reads should be removed in mRNA-Seq (I'd say: no), but in miRNA-Seq it removes all signal.

                  Simon
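
The difference between discarding duplicates and collapsing them while keeping the counts can be sketched as follows. The function name and the toy reads are assumptions made for illustration.

```python
from collections import Counter


def collapse_reads(trimmed_reads):
    """Collapse identical reads into (sequence, count) pairs, most abundant first."""
    return Counter(trimmed_reads).most_common()


# Toy input (made up): three distinct sequences with different abundances.
toy_reads = (
    ["TAGCTTATCAGACTGATGTTGA"] * 4
    + ["TGAGGTAGTAGGTTGTATAGTT"] * 2
    + ["ACGTACGTACGTACGTAC"]
)

deduplicated = set(toy_reads)          # abundance is lost: every sequence appears once
collapsed = collapse_reads(toy_reads)  # abundance is kept as the count

print(len(toy_reads), len(deduplicated))
for seq, count in collapsed:
    print(seq, count)
```

Both approaches shrink 7 reads to 3 sequences, but only the collapsed form still records how often each sequence was seen, which is the expression signal in miRNA-Seq.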


                  • #10
                    On a side question, will exon sequencing capture any miRNA?


                    • #11
                      Dear Simon,

                      I just saw your comment. I am having the same issue as the OP. From a file with 7M reads I ended up with 370K reads after filtering (adapter trimming, singleton elimination and read quality). I don't understand what you mean when you say that each miRNA species should give only one sequenced read. Actually, among these 370K sequences representing miRNAs, there are 90K unique sequences, so I am assuming that this is my family of miRNAs, and not the 370K list.
                      Regardless, I am also puzzled by such a reduction in data, since my QA filter and the singleton filter only removed 20% of the reads.

                      Dave


                      • #12
                        Hi Dave

                        A miRNA transcript is typically 22 bp long. You add adapters to both ends and sequence from the P5 adapter onwards to the end, into the P7 adapter. After trimming that off, you are left with a read of 22 bp, and if you sequence many transcript molecules of the same miRNA, you will see many reads with this 22 bp sequence.

                        In ChIP-Seq, most people delete redundant reads, i.e., if they see several reads with exactly the same sequence, they remove all but one. The original post sounds suspiciously like this was done there, too. All I wanted to point out is that doing so for miRNA-Seq data would be a very bad idea, because you fully expect to see the same read sequence many times, and this is no artifact but your signal.

                        I guess you meant the opposite with "singleton filter", namely to remove those reads that occur only once.

                        Maybe you have been too aggressive with filtering. How much does each step of your filtering pipeline remove?

                        Simon
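
One way to answer the question of how much each filtering step removes is to record the read count before and after every step. The sketch below assumes the filters can be written as simple keep/discard functions. The step names and criteria are hypothetical examples, not the filters actually used in this thread.

```python
def filter_with_report(reads, steps):
    """Apply (name, keep_function) steps in order and record reads before/after each."""
    report = []
    current = list(reads)
    for name, keep in steps:
        before = len(current)
        current = [read for read in current if keep(read)]
        report.append((name, before, len(current)))
    return current, report


# Hypothetical steps for illustration; substitute your own filters.
example_steps = [
    ("min length 17", lambda read: len(read) >= 17),
    ("no ambiguous bases", lambda read: "N" not in read),
]

toy_reads = ["ACGTACGTACGTACGTACGT", "ACGT", "ACGTNCGTACGTACGTACGT", "TTTT"]
kept, report = filter_with_report(toy_reads, example_steps)
for name, before, after in report:
    fraction = (before - after) / before if before else 0.0
    print(f"{name}: {before} -> {after} ({fraction:.0%} removed)")
```

Comparing such a report between samples makes it easier to see which step is responsible when one sample loses far more reads than another.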


                        • #13
                          Thanks for your comments Simon, yes, this time singleton removal is the other way round as you say, removing reads that occur only once.

                          For this sample, after adapter removal nearly 90% of the original reads were filtered out due to them being too short after adapter trimming (38 cycle run, minimum length cutoff 17nt), ending in about 370K reads. On another sample, starting with similar number of reads, only about 20% of the original reads were discarded.
                          I do quality and singleton filtering after adapter trimming, and the percentage of reads filtered out is roughly similar for both samples (15%).
                          So I see two possibilities, a technical problem in preparation of that sample or something biologically meaningful. I rather think it is something interesting, but what can it be?


                          • #14
                            Why don't you try to align the reads with less than 17nt as well? Most of them will be too short to give a unique alignment, but if you tell your aligner to discard these (instead of discarding them yourself before alignment), you can check whether the remaining ones actually map to miRNA loci, and this could help you figure out what happened in the library prep.

                            Also, look at a histogram of the sizes after adapter trimming. Are they just below 17 nt, or so short that they could be primer dimers?

                            S
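
The size histogram suggested here can be summarized by keeping the reads below the length cutoff instead of discarding them, and splitting them into very short and merely short inserts. The 17 nt cutoff comes from the thread. The `dimer_max` threshold of 5 nt and the function name are assumptions for illustration only.

```python
from collections import Counter


def summarize_insert_lengths(trimmed_reads, min_length=17, dimer_max=5):
    """Histogram of insert lengths plus counts of dimer-like, short and kept reads."""
    hist = Counter(len(read) for read in trimmed_reads)
    dimer_like = sum(n for length, n in hist.items() if length <= dimer_max)
    short = sum(n for length, n in hist.items() if dimer_max < length < min_length)
    kept = sum(n for length, n in hist.items() if length >= min_length)
    return hist, {"dimer_like": dimer_like, "short": short, "kept": kept}


# Toy input (made up): inserts of length 0, 2, 10 and 22 after trimming.
toy_trimmed = ["", "AC", "ACGTACGTAC", "ACGTACGTACGTACGTACGTAC"]
hist, summary = summarize_insert_lengths(toy_trimmed)
print(sorted(hist.items()))
print(summary)
```

If most discarded reads fall in the dimer-like class, the library likely contains many adapter or primer dimers. If they sit just below the cutoff, degraded RNA or an overly strict cutoff is a more plausible explanation.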


                            • #15
                              Hi again Simon,

                              They shouldn't be primer dimers, because those are not selected for during sample prep. Looking at the Bioanalyzer electropherogram, primer dimers can be seen before the main peak. Nevertheless, I will have a look.
