
  • miRNA seq analysis - large numbers of non-aligning reads

    Hello,

    as part of my work I have been given the task of analyzing data from miRNA-seq. Such data was already preprocessed by a facility which did the first QC, adaptor and barcode trimming (it's made up by two pools) and given to me as a set of FASTA files.

    As I'm quite inexperienced with sequencing (I come from the world of microarrays, and I started studying NGS just recently) I looked around (including these forums and wiki) to find a way to align the data properly.

    Based on what I read, I settled on bowtie. As I'm not doing any discovery, I picked the human hairpin sequences from miRBase as the reference. Before alignment, I collapsed identical sequences using the fastx toolkit.

    Now, when aligning, I get a lot of non-aligned reads. As an example, allowing one mismatch on one of the samples:

    Code:
    bowtie Hs_miRBase_hairpin -f -n 1 -l 15 --best VB09121_Pool2/BarcodeCTTA.collapsed.fa -S Pool2_CTTA.sam
    # reads processed: 56581
    # reads with at least one reported alignment: 646 (1.14%)
    # reads that failed to align: 55935 (98.86%)
    Reported 646 alignments to 1 output stream(s)
    Allowing no mismatches gives even lower yields. As mentioned before (also due to inexperience), I'm not sure whether this is what I should expect.
    Any pointers on what I should try or read would be appreciated. Thanks!

  • #2
    I forgot to add that I generated the reference with bowtie-build directly from the hairpin FASTA file downloaded from miRBase.



    • #3
      You might have to replace U with T in the miRBase sequences, rebuild the index, and try again.
      How was the sequencing library prepared? Might it contain sRNAs, snoRNAs or degradation products from rRNA/tRNA? Did you use the mature or the stem-loop sequences as the mapping target?
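A minimal sketch of the U-to-T conversion suggested in this post, assuming a plain FASTA file in which header lines start with `>`. The function name and the toy record are made up for illustration; they are not taken from miRBase or from any tool mentioned in the thread.

```python
def rna_to_dna_fasta(lines):
    """Yield FASTA lines with U/u replaced by T/t in sequence lines only."""
    for line in lines:
        if line.startswith(">"):
            # Header lines are passed through untouched.
            yield line
        else:
            yield line.replace("U", "T").replace("u", "t")


# Toy, made-up record (not a real miRBase entry).
example = [">toy-hairpin-1 made-up entry\n", "UGAGGUAGUAGGUUGUAUAGUU\n"]
converted = list(rna_to_dna_fasta(example))
print("".join(converted), end="")
```

With a real file, the same generator can be fed an open file handle and its output written to a new FASTA file, which is then passed to bowtie-build.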



      • #4
        I had noticed the U/T issue. I will convert the sequences and try again. I also realized (yes, I really feel stupid about it) that I need to remove the non-human sequences from the reference, as the miRBase file covers all species.

        I used the hairpin (i.e. immature) sequences as I get even fewer hits on the mature ones (and the mature ones are also shorter).

        EDIT: Converting U to T and removing the non-human sequences raised the alignment rate to 17%, although that is still low. About the library: the transcripts were selected using a miRNA purification kit, followed by size fractionation and selection of RNAs between 19 and 29bp.
        Last edited by lbeltrame; 02-01-2012, 07:06 AM.
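A rough sketch of the species filtering described in this post. It assumes that human entries in the miRBase FASTA file have IDs starting with `hsa-`; check the headers of your own download before relying on that. The function name and the toy records are hypothetical.

```python
def keep_species(lines, prefix="hsa-"):
    """Yield only the FASTA records whose ID starts with the given prefix."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on the ID right after '>'.
            keep = line[1:].startswith(prefix)
        if keep:
            yield line


# Toy, made-up records (not real miRBase entries).
example = [
    ">hsa-toy-1 made-up human entry\n", "TGAGGTAGTAGG\n", "TTGTATAGTT\n",
    ">mmu-toy-1 made-up mouse entry\n", "TGAGGTAGTAGGTTGTATGGTT\n",
]
human_only = list(keep_species(example))
print("".join(human_only), end="")
```

Multi-line sequences are handled because the keep/skip decision stays in force until the next header line.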



        • #5
          The reference might contain ambiguous bases, like N, Y, etc.



          • #6
            What's your read length?
            Remember that bowtie cannot align a read if the reference sequence is shorter than the read.



            • #7
              From experience with small RNA datasets, a great many reads will have exactly the same sequence.
              While your results may indicate that most of your reads did not align, remember that you collapsed your sequences to begin with.
              Have a look at the read ID of each aligned sequence, which should tell you how many copies of that particular sequence there are.
              It might be that your 600-odd aligned sequences actually represent millions of reads (or at least a lot more than your unaligned sequences).
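A minimal sketch of the check suggested here: weight each SAM record by the copy number stored in its read ID. It assumes the collapsed reads are named `rank-count` (for example `1-900`), which is how fastx_collapser is understood to label them, and that the SAM file holds one record per read, with unaligned reads carrying flag bit 4. The function names and the toy SAM lines are made up.

```python
def collapsed_count(read_id):
    """Return the copy number encoded in a 'rank-count' read ID, or 1 if absent."""
    _, sep, count = read_id.rpartition("-")
    return int(count) if sep and count.isdigit() else 1


def weighted_alignment_rate(sam_lines):
    """Sum collapsed copy numbers for aligned and unaligned SAM records."""
    aligned = unaligned = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        count = collapsed_count(fields[0])
        if int(fields[1]) & 4:  # flag bit 4 means the read is unmapped
            unaligned += count
        else:
            aligned += count
    total = aligned + unaligned
    return aligned, unaligned, (aligned / total if total else 0.0)


# Toy, made-up SAM records: one aligned sequence seen 900 times, two unaligned.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "1-900\t0\ttoy-ref\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "2-50\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "3-50\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
result = weighted_alignment_rate(toy_sam)
print(result)
```

In this toy case only one of three collapsed sequences aligns, yet it accounts for 900 of the 1000 underlying reads.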



              • #8
                Thanks for the suggestions. I'll give them a go this week and see what comes out.



                • #9
                  You probably have a lot of non-miRNA sequences in your data. Have you tried mapping just to the genome? We see a ton of non-miRNA small RNA in our small RNA sequencing.



                  • #10
                    Indeed, mapping to the genome gives a larger yield (47% for perfect matches, and 64% allowing one mismatch, with the seed length set to 15 via -l):

                    Code:
                    bowtie hg19 -f -n 1 -l 15 -p 4 VB09121_Pool2/BarcodeCTTA.collapsed.fa -S Pool2_CTTA_human_aligned.sam
                    # reads processed: 56581
                    # reads with at least one reported alignment: 36637 (64.75%)
                    # reads that failed to align: 19944 (35.25%)
                    Reported 36637 alignments to 1 output stream(s)
                    Also, to answer other questions in the thread, I've checked the length distribution of my reads, and the vast majority are between 22 and 24bp long.
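For anyone wanting to repeat the read-length check, here is a small sketch that tallies sequence lengths in a FASTA file. It counts each FASTA record once, so on a collapsed file it describes distinct sequences rather than total reads. The function name and the toy reads are made up.

```python
from collections import Counter


def length_distribution(lines):
    """Return a Counter mapping sequence length to number of FASTA records."""
    counts = Counter()
    length = None  # None until the first header has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                counts[length] += 1  # close the previous record
            length = 0
        elif length is not None:
            length += len(line)  # sequences may span several lines
    if length is not None:
        counts[length] += 1  # close the last record
    return counts


# Toy, made-up reads.
toy_reads = [
    ">1-900\n", "ACGTACGTACGTACGTACGTAC\n",
    ">2-50\n", "TTGTACGTACGTACGTACGTAC\n",
    ">3-50\n", "ACGTACGTACGTACGTACGTACGT\n",
]
dist = length_distribution(toy_reads)
for read_length in sorted(dist):
    print(read_length, dist[read_length])
```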

                    EDIT: Using the uncollapsed sequences gives me a ~96% alignment rate with the same parameters (against the genome). I'd like to thank everyone who gave suggestions.
                    Last edited by lbeltrame; 02-08-2012, 08:09 AM.



                    • #11
                      Hello epi and all,

                      What would be an alternative alignment strategy or tool when the reference sequences are shorter than the query?
