  • Grouping small RNAs for analysis

    Hello everyone,

    I'm trying to analyze small ncRNAs from Solexa sequencing data.

    I've clipped, collapsed, and then aligned the sequences up to 15 times each to the genome.

    The problem arises in the analysis. Many of the sequences share essentially the same alignment. I would like to group these sequences whenever their coordinates can be collapsed, and sum the counts of the collapsed sequences. This would make comparisons between samples much easier.

    Here is an example of the data I have:
    Code:
    Chr2	4628	4644	TGAAAGACGAACAACT
    Chr2	4628	4645	TGAAAGACGAACAACTG
    Chr2	4628	4646	TGAAAGACGAACAACTGC
    Chr2	4628	4647	TGAAAGACGAACAACTGCG
    Chr2	4628	4648	TGAAAGACGAACAACTGCGA
    Chr2	4628	4649	TGAAAGACGAACAACTGCGAA
    Chr2	4628	4650	TGAAAGACGAACAACTGCGAAA
    Chr2	4628	4651	TGAAAGACGAACAACTGCGAAAG
    Chr2	4628	4652	TGAAAGACGAACAACTGCGAAAGC
    Chr2	4628	4653	TGAAAGACGAACAACTGCGAAAGCA
    Chr2	4628	4654	TGAAAGACGAACAACTGCGAAAGCAT
    Chr2	4628	4655	TGAAAGACGAACAACTGCGAAAGCATT
    Chr2	4628	4656	TGAAAGACGAACAACTGCGAAAGCATTT
    Chr2	4628	4659	TGAAAGACGAACAACTGCGAAAGCATTTGCC
    Chr2	4629	4645	GAAAGACGAACAACTG
    Chr2	4629	4646	GAAAGACGAACAACTGC
    Chr2	4629	4647	GAAAGACGAACAACTGCG
    Chr2	4629	4648	GAAAGACGAACAACTGCGA
    Chr2	4629	4649	GAAAGACGAACAACTGCGAA
    Chr2	4629	4650	GAAAGACGAACAACTGCGAAA
    Chr2	4629	4651	GAAAGACGAACAACTGCGAAAG
    Chr2	4629	4652	GAAAGACGAACAACTGCGAAAGC
    Chr2	4629	4653	GAAAGACGAACAACTGCGAAAGCA
    Chr2	4629	4654	GAAAGACGAACAACTGCGAAAGCAT
    Chr2	4629	4655	GAAAGACGAACAACTGCGAAAGCATT
    Chr2	4629	4657	GAAAGACGAACAACTGCGAAAGCATTTG
    Chr2	4629	4659	GAAAGACGAACAACTGCGAAAGCATTTGCC
    Chr2	4629	4660	GAAAGACGAACAACTGCGAAAGCATTTGCCA
    Chr2	4629	4661	GAAAGACGAACAACTGCGAAAGCATTTGCCAA
    Chr2	4630	4645	AAAGACGAACAACTG
    Chr2	4630	4646	AAAGACGAACAACTGC
    Chr2	4630	4647	AAAGACGAACAACTGCG
    Chr2	4630	4648	AAAGACGAACAACTGCGA
    Chr2	4630	4649	AAAGACGAACAACTGCGAA
    Chr2	4630	4650	AAAGACGAACAACTGCGAAA
    Chr2	4630	4651	AAAGACGAACAACTGCGAAAG
    Chr2	4630	4652	AAAGACGAACAACTGCGAAAGC
    Chr2	4630	4653	AAAGACGAACAACTGCGAAAGCA
    Chr2	4630	4654	AAAGACGAACAACTGCGAAAGCAT
    Chr2	4630	4655	AAAGACGAACAACTGCGAAAGCATT
    Chr2	4630	4656	AAAGACGAACAACTGCGAAAGCATTT
    Chr2	4630	4657	AAAGACGAACAACTGCGAAAGCATTTG
    Chr2	4630	4660	AAAGACGAACAACTGCGAAAGCATTTGCCA
    Chr2	4630	4662	AAAGACGAACAACTGCGAAAGCATTTGCCAAG
    Chr2	4631	4646	AAGACGAACAACTGC
    Chr2	4631	4647	AAGACGAACAACTGCG
    Chr2	4631	4648	AAGACGAACAACTGCGA
    Chr2	4631	4649	AAGACGAACAACTGCGAA
    Chr2	4631	4650	AAGACGAACAACTGCGAAA
    Chr2	4631	4651	AAGACGAACAACTGCGAAAG
    Chr2	4631	4652	AAGACGAACAACTGCGAAAGC
    Chr2	4631	4653	AAGACGAACAACTGCGAAAGCA
    Chr2	4631	4654	AAGACGAACAACTGCGAAAGCAT
    Chr2	4631	4655	AAGACGAACAACTGCGAAAGCATT
    Chr2	4631	4656	AAGACGAACAACTGCGAAAGCATTT
    Chr2	4631	4659	AAGACGAACAACTGCGAAAGCATTTGCC
    Chr2	4632	4647	AGACGAACAACTGCG
    Any help with this problem would definitely be appreciated. I'm a little stumped.

  • #2
    Perhaps using Biopieces (www.biopieces.org)

    Code:
    read_tab -i test.tab -k S_ID,S_BEG,S_END,SEQ |
    compute -e 'STRAND="+"'|
    assemble_contigs
    
    CONTIG: 14;29;44;56;57;57;57;57;57;57;57;57;57;57;57;57;57;56;53;49;44;40;36;32;28;24;20;16;12;9;7;7;4;2;1
    S_ID: Chr2
    S_BEG: 4628
    CONTIG_LEN: 36
    CONTIG_MEAN: 37.83
    CONTIG_MIN: 1
    CONTIG_ID: 0
    CONTIG_MAX: 57
    S_END: 4663
    ---
    You can then output selected columns to a table using write_tab.



    Martin



    • #3
      Martin,

      Would that allow me to go from this:
      Code:
      Chr2	4628	4644	TGAAAGACGAACAACT
      Chr2	4628	4645	TGAAAGACGAACAACTG
      Chr2	4628	4646	TGAAAGACGAACAACTGC
      Chr2	4628	4647	TGAAAGACGAACAACTGCG
      Chr2	4628	4648	TGAAAGACGAACAACTGCGA
      Chr2	4628	4649	TGAAAGACGAACAACTGCGAA
      Chr2	4628	4650	TGAAAGACGAACAACTGCGAAA
      Chr2	4628	4651	TGAAAGACGAACAACTGCGAAAG
      Chr2	4628	4652	TGAAAGACGAACAACTGCGAAAGC
      Chr2	4628	4653	TGAAAGACGAACAACTGCGAAAGCA
      Chr2	4628	4654	TGAAAGACGAACAACTGCGAAAGCAT
      Chr2	4628	4655	TGAAAGACGAACAACTGCGAAAGCATT
      Chr2	4628	4656	TGAAAGACGAACAACTGCGAAAGCATTT
      Chr2	4628	4659	TGAAAGACGAACAACTGCGAAAGCATTTGCC
      Chr2	4629	4645	GAAAGACGAACAACTG
      Chr2	4629	4646	GAAAGACGAACAACTGC
      Chr2	4629	4647	GAAAGACGAACAACTGCG
      Chr2	4629	4648	GAAAGACGAACAACTGCGA
      Chr2	4629	4649	GAAAGACGAACAACTGCGAA
      Chr2	4629	4650	GAAAGACGAACAACTGCGAAA
      Chr2	4629	4651	GAAAGACGAACAACTGCGAAAG
      Chr2	4629	4652	GAAAGACGAACAACTGCGAAAGC
      Chr2	4629	4653	GAAAGACGAACAACTGCGAAAGCA
      Chr2	4629	4654	GAAAGACGAACAACTGCGAAAGCAT
      Chr2	4629	4655	GAAAGACGAACAACTGCGAAAGCATT
      Chr2	4629	4657	GAAAGACGAACAACTGCGAAAGCATTTG
      Chr2	4629	4659	GAAAGACGAACAACTGCGAAAGCATTTGCC
      Chr2	4629	4660	GAAAGACGAACAACTGCGAAAGCATTTGCCA
      Chr2	4629	4661	GAAAGACGAACAACTGCGAAAGCATTTGCCAA
      To this:
      Code:
      Chr2	4629	4661	GAAAGACGAACAACTGCGAAAGCATTTGCCAA
      Chr2	4628	4659	TGAAAGACGAACAACTGCGAAAGCATTTGCC
      I would also need the counts of all the grouped alignments to be summed, so if there are counts such as:
      Code:
      3   ATC
      5   ATCTC
      8   ATCTCG
      It would show 16 counts for ATCTCG at position XX, XX, XX.
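
      A minimal Python sketch of the collapse described above, not a Biopieces feature: reads sharing a chromosome and start position are merged, the longest sequence is kept, and the counts are summed. The function name `collapse_by_start`, the record layout, and the coordinates 100 to 106 are assumptions made for illustration, since the post gives the positions only as XX.

```python
def collapse_by_start(records):
    """Collapse reads that share (chrom, start).

    records: iterable of (chrom, start, end, seq, count) tuples.
    Returns one tuple per (chrom, start) group, holding the longest
    read of the group and the summed count of all its reads.
    """
    groups = {}
    for chrom, start, end, seq, count in records:
        key = (chrom, start)
        if key not in groups:
            groups[key] = [chrom, start, end, seq, count]
            continue
        group = groups[key]
        group[4] += count            # sum the counts of the group
        if end > group[2]:           # keep the longest read as representative
            group[2], group[3] = end, seq
    return [tuple(group) for group in groups.values()]


# Hypothetical coordinates; only the sequences and counts come from the post.
reads = [
    ("Chr2", 100, 103, "ATC", 3),
    ("Chr2", 100, 105, "ATCTC", 5),
    ("Chr2", 100, 106, "ATCTCG", 8),
]
collapsed = collapse_by_start(reads)
print(collapsed)
```

      With the three example reads this yields a single record for ATCTCG with a count of 16. Grouping on the start position alone is a design choice; reads that start one base later, as in the table above, would stay in separate groups.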



      • #4
        You have a number of overlapping feature intervals. Using the intervals, we can collapse these into a single feature (or contig) with assemble_contigs:

        Code:
        read_tab -i test.tab -k S_ID,S_BEG,S_END,SEQ |
        compute -e 'STRAND="+"'|
        assemble_contigs
        
        CONTIG: 14;29;29;29;29;29;29;29;29;29;29;29;29;29;29;29;29;28;26;24;22;20;18;16;14;12;10;8;6;5;4;4;2;1
        S_ID: Chr2
        S_BEG: 4628
        CONTIG_LEN: 35
        CONTIG_MEAN: 20.53
        CONTIG_MIN: 1
        CONTIG_ID: 0
        CONTIG_MAX: 29
        S_END: 4662
        ---
        The resulting record describes this feature. The length (CONTIG_LEN) is 35 and the maximum (CONTIG_MAX) is 29, indicating that at most 29 intervals overlap at any one position. The CONTIG key holds the number of overlapping intervals at each position. We can plot these to view this sum of features:


        Code:
        read_tab -i test.tab -k S_ID,S_BEG,S_END,SEQ |
         compute -e 'STRAND="+"'| 
        assemble_contigs | 
        plot_lines -l CONTIG -x
        
                                               Lines
        
            30 ++------------------------------------------------------------------++
               | *********************************                    CONTIG ****** |
               | *                                *                                 |
            25 ++*                                **                               ++
               |*                                   **                              |
               |*                                     **                            |
            20 +*                                       *                          ++
               |*                                        *                          |
               *                                          **                        |
            15 *+                                           **                     ++
               *                                              **                    |
               |                                                *                   |
               |                                                 *                  |
            10 ++                                                 **               ++
               |                                                    **              |
               |                                                      **            |
             5 ++                                                       *****      ++
               |                                                             *      |
               |                                                              ***   |
             0 ++--------+---------+---------+--------+---------+---------+--------++
               0         5         10        15       20        25        30        35
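
        For readers without Biopieces, the per-position counts held in the CONTIG key can be approximated with a short Python sketch. This is not the assemble_contigs implementation; the function name `interval_coverage` is hypothetical, and the sketch assumes inclusive start and end coordinates and that all intervals lie on one chromosome and strand.

```python
def interval_coverage(intervals):
    """Count how many intervals cover each position.

    intervals: list of (start, end) pairs with inclusive coordinates.
    Returns (first_position, depths), where depths[i] is the number of
    intervals covering position first_position + i.
    """
    if not intervals:
        return None, []
    first = min(start for start, _ in intervals)
    last = max(end for _, end in intervals)
    depths = [0] * (last - first + 1)
    for start, end in intervals:
        for pos in range(start, end + 1):   # inclusive end
            depths[pos - first] += 1
    return first, depths


# Toy example: three overlapping intervals.
first, depths = interval_coverage([(0, 2), (1, 3), (1, 1)])
print(first, ";".join(str(d) for d in depths))
```

        The maximum of `depths` corresponds to what the record above calls CONTIG_MAX, and its length to the span covered by the contig. Positions with zero depth would separate distinct contigs, which this sketch does not split.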
        Now your suggested outcome appears to be grouped per feature start position, but I am not sure that makes sense when the sequences basically are the same:


        Code:
        read_tab -i test.tab -k SEQ_NAME,S_BEG,S_END,SEQ |
        align_seq |
        invert_align |
        write_align -C -x
                         .         .         .   
        Chr2    _GAAAGACGAACAACTGCGAAAGCATTTGCCAA
        Chr2    --------------------------------_
        Chr2    -------------------------------__
        Chr2    -----------------------------____
        Chr2    ---------------------------______
        Chr2    --------------------------_______
        Chr2    -------------------------________
        Chr2    ------------------------_________
        Chr2    -----------------------__________
        Chr2    ----------------------___________
        Chr2    ---------------------____________
        Chr2    --------------------_____________
        Chr2    -------------------______________
        Chr2    ------------------_______________
        Chr2    -----------------________________
        Chr2    T---------------_________________
        Chr2    T----------------________________
        Chr2    T-----------------_______________
        Chr2    T------------------______________
        Chr2    T-------------------_____________
        Chr2    T--------------------____________
        Chr2    T---------------------___________
        Chr2    T----------------------__________
        Chr2    T-----------------------_________
        Chr2    T------------------------________
        Chr2    T-------------------------_______
        Chr2    T--------------------------______
        Chr2    T---------------------------_____
        Chr2    T------------------------------__

        You write that you want to compare samples. What samples are those and what do you want from the comparison?


        Cheers,


        Martin



        • #5
          Isn't the SAM pileup format a suitable representation for this?
          See http://samtools.sourceforge.net/pileup.shtml

          You might also want to look at the condensation tools in SoftGenetics' NextGENe for inspiration, but I have a feeling the above is sufficient.
          http://kevin-gattaca.blogspot.com/



          • #6
            You can try to group and analyze your small RNA dataset using the methods described in our deepBase ( http://deepbase.sysu.edu.cn/ ).

            In the deepBase platform, which was developed to discover small and long ncRNAs from deep-sequencing data, we identified 1.2 million RNA clusters by grouping all the mapped small RNA reads according to their distance.

            These 1.2 million RNA clusters include multiple classes of infrastructural ncRNAs, miRNA precursors, piRNA precursors, repeat-associated siRNA precursors and evolutionarily conserved phastCons elements.

            From these RNA clusters we identified ~2000 microRNA candidates and ~1890 snoRNA candidates using an improved miRDeep and our snoSeeker program.



            • #7
              Originally posted by maasha (post #4 above)

              Thanks for that, Martin. The samples I'm using are small RNAs from an experiment, and these are the results from Solexa sequencing. As far as the comparison goes, I think what I want to do is pretty simple, probably just a matter of coding, but I don't know how to go about it. What I have done so far is take these unique small RNAs, which have been aligned to the genome up to 15 times, and use a program to check for overlap between two files: one containing the small RNAs in BED format and the other containing annotation in BED format. The problem for the comparison is that multiple small RNAs are assigned to the same annotation, for example:

              (Sequences with the same chromosome and start position)

              Sequence Annotation Counts
              ATCGG smallRNA1 5
              ATCGGT smallRNA1 4
              ATCGGTC smallRNA1 7


              What I would like to do is group the sequences that match the same annotation and also sum the counts for that annotation. So, in the above case you would end up with:

              Annotation Counts
              smallRNA1 16

              I'm just not sure how to go about it. I think it's a relatively simple problem to solve, but with my non-programming background I have been unable to figure it out.
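
              The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sum_counts_by_annotation` is hypothetical, and the rows are assumed to be already parsed into (sequence, annotation, count) tuples rather than read from the BED overlap output.

```python
from collections import Counter


def sum_counts_by_annotation(rows):
    """Sum read counts per annotation.

    rows: iterable of (sequence, annotation, count) tuples.
    Returns a dict mapping each annotation to its total count.
    """
    totals = Counter()
    for _sequence, annotation, count in rows:
        totals[annotation] += count
    return dict(totals)


# The example rows from the post.
rows = [
    ("ATCGG", "smallRNA1", 5),
    ("ATCGGT", "smallRNA1", 4),
    ("ATCGGTC", "smallRNA1", 7),
]
totals = sum_counts_by_annotation(rows)
for annotation, count in totals.items():
    print(annotation, count)
```

              For the three example rows this gives a total of 16 for smallRNA1. Running it once per sample and comparing the resulting dictionaries is one way to compare samples per annotation.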



              • #8
                That problem sounds familiar.

                Have a look here:

                The contents of this site briefly explains my research interests and a few things about myself


                There is a link to the paper at the bottom of that post.

                There is a Biopiece for assembling tag contigs:



                Perhaps that is useful.



                Martin



                • #9
                  Martin,

                  I think that may be exactly what I'm looking for. I'll go over it all and get back with you.


                  Thanks again,
                  Brandon
