  • RNA-seq Mapping to Many Contigs

    What advice do researchers who have previously done RNA-seq on a non-model organism have? I have RNA-seq data on sea urchin. The current version of the genome has 174772 contigs. So far I have tried generating a genome index with STAR. It used up all of the RAM, and the author said the mapping performance wasn't good on any genome with more than 50000 contigs. I have also tried de novo assembly with Trinity, and the number of genes and isoforms found was unrealistically large. Does anyone have a success story to share?

  • #2
    Try filtering out contigs that have an FPKM of less than 1, or 0.5. This should get rid of a large number of contigs that are likely junk. There are tools in Trinity (RSEM or eXpress) to do this.

    Also, you could try clustering with cd-hit-est to get rid of redundancy.
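
The FPKM filter suggested above could be sketched roughly as follows. This is only an illustration, not the Trinity/RSEM workflow itself: it assumes the abundance estimates have already been loaded into a dictionary of contig ID to FPKM, and the names `read_fasta` and `filter_by_fpkm` and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (contig_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_fpkm(
    records: Iterable[Tuple[str, str]],
    fpkm: Dict[str, float],
    min_fpkm: float = 1.0,
) -> List[Tuple[str, str]]:
    """Keep contigs with FPKM >= min_fpkm; contigs without an estimate are dropped."""
    return [(n, s) for n, s in records if fpkm.get(n, 0.0) >= min_fpkm]


# Toy data, made up for illustration only.
fasta_lines = """>c1 len=8
ACGTACGT
>c2
GGGG
CCCC
>c3
TTTT
""".splitlines()
fpkm_table = {"c1": 12.3, "c2": 0.4, "c3": 1.0}

kept = filter_by_fpkm(read_fasta(fasta_lines), fpkm_table, min_fpkm=1.0)
kept_ids = [name for name, _ in kept]
print(kept_ids)
```

Lowering `min_fpkm` to 0.5 would keep more contigs; which cutoff is sensible depends on the data.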


    • #3
      Dear Dario1984,

      You may try the Subread aligner, which can deal with a large number of contigs.



      Best wishes,

      Wei


      • #4
        Thanks for alerting me to the CD-HIT program. I wasn't aware of it. Have you already published a journal article using those two steps?


        • #5
          Originally posted by Dario1984
          What advice do researchers who have previously done RNA-seq on a non-model organism have? I have RNA-seq data on sea urchin. The current version of the genome has 174772 contigs. So far I have tried generating a genome index with STAR. It used up all of the RAM, and the author said the mapping performance wasn't good on any genome with more than 50000 contigs. I have also tried de novo assembly with Trinity, and the number of genes and isoforms found was unrealistically large. Does anyone have a success story to share?
          To avoid RAM problems with STAR caused by the large number of contigs, try reducing --genomeChrBinNbits (18 by default) to a smaller number, about 14 or less. The mapping speed will be slow by STAR's standards, but it may still be adequate.
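
As a rough guide for picking the value, the STAR manual suggests scaling this parameter as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). The sketch below just evaluates that formula. Rounding down to an integer is a choice made here, the function name is hypothetical, and the 800 Mb genome length is a placeholder assumption, not a figure from this thread; only the contig count and the 50-base read length come from the posts.

```python
import math


def suggest_chr_bin_nbits(genome_length: int, n_references: int,
                          read_length: int, cap: int = 18) -> int:
    """Evaluate min(cap, log2(max(genome_length / n_references, read_length))).

    The result is rounded down, which is an assumption of this sketch.
    """
    per_reference = max(genome_length / n_references, read_length)
    return min(cap, int(math.log2(per_reference)))


# Placeholder genome length (assumption); contig count taken from the first post.
assumed_genome_length = 800_000_000
n_contigs = 174_772
read_length = 50

nbits = suggest_chr_bin_nbits(assumed_genome_length, n_contigs, read_length)
print(f"--genomeChrBinNbits {nbits}")
```

With these placeholder numbers the suggestion is 12, in line with the "14 or less" advice above; plug in the real assembly length before using it.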


          • #6
            Originally posted by Dario1984
            Thanks for alerting me to the CD-HIT program. I wasn't aware of it. Have you already published a journal article using those two steps?
            This paper should be a good reference:
            https://www.biomedcentral.com/1471-2164/13/392


            • #7
              I used Subread on the data. Because the seed has to be matched exactly, it isn't suitable for mapping to a related organism's genome. Only 11% of my reads mapped. I can see it would be great for mapping to a high-quality reference genome, such as the human genome sequence.


              • #8
                Most of the reads in the Trinity assembly will be background RNA (something like 80% of the genome is transcribed, remember) and assembly junk. As mentioned already, mapping the reads to the Trinity assembly and excluding low-count sequences will remove this junk. I prefer to use raw read counts, because then you can easily see what proportion of reads map to the 20-40K Trinity sequences you are left with. I have done something like that: from 370,000 Trinity sequences, 96% of the reads mapped to about 38,000 Trinity sequences, and the rest were discarded.
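
The raw-count approach described above might look something like this minimal sketch. It assumes per-sequence read counts are already available as a dictionary (for example, parsed from an aligner's output); the function name, the threshold, and the toy counts are all hypothetical.

```python
from typing import Dict, Tuple


def keep_by_read_count(counts: Dict[str, int],
                       min_count: int) -> Tuple[Dict[str, int], float]:
    """Return the sequences with at least min_count reads and the
    fraction of all counted reads that those sequences account for."""
    kept = {name: n for name, n in counts.items() if n >= min_count}
    total = sum(counts.values())
    fraction = sum(kept.values()) / total if total else 0.0
    return kept, fraction


# Toy counts, made up for illustration only.
toy_counts = {"t1": 900, "t2": 60, "t3": 3, "t4": 2, "t5": 35}

kept_seqs, frac = keep_by_read_count(toy_counts, min_count=10)
print(f"kept {len(kept_seqs)} of {len(toy_counts)} sequences, "
      f"{frac:.1%} of reads retained")
```

Reporting the retained fraction alongside the number of sequences kept makes it easy to see how much of the data a given cutoff throws away.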


                • #9
                  Originally posted by Dario1984
                  I used Subread on the data. Because the seed has to be matched exactly, it isn't suitable for mapping to a related organism's genome. Only 11% of my reads mapped. I can see it would be great for mapping to a high-quality reference genome, such as the human genome sequence.
                  Hi Dario,

                  Could you please provide a bit more info about your data, such as read length, single-end or paired-end, etc.? There could be many reasons for low mappability. Although Subread does not allow mismatches in the seeds, these seeds are quite short (16 bp), so I do not really think this is why you got a low mapping percentage when mapping your reads to a related species.

                  One thing that may be worth trying is to set -m=1 to test how many reads have a 16 bp substring that perfectly matches the reference. If you still get a low percentage, this may simply tell you that your reads are very different from the reference.

                  Best regards,

                  Wei


                  • #10
                    What happens if you only take the 50000 biggest contigs from your reference? Draft assemblies often have many small contigs that aren't going to contain useful information for gene expression analysis anyway. That is, they will mostly not contain coding regions, or if they do, it's only one, maybe two exons, and you can't assign orthology anyway.
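
Selecting the largest contigs could be sketched as below. This is a minimal illustration that assumes the contigs are already in memory as (name, sequence) pairs; `largest_contigs` and the toy contigs are hypothetical, and a real run would use n=50000 on the parsed reference FASTA.

```python
import heapq
from typing import Iterable, List, Tuple


def largest_contigs(records: Iterable[Tuple[str, str]],
                    n: int) -> List[Tuple[str, str]]:
    """Return the n longest (name, sequence) records, longest first."""
    return heapq.nlargest(n, records, key=lambda rec: len(rec[1]))


# Toy contigs, made up for illustration only.
toy_contigs = [
    ("a", "ACG"),
    ("b", "ACGTACGTACGT"),
    ("c", "AC"),
    ("d", "ACGTACG"),
]

top = largest_contigs(toy_contigs, n=2)
top_names = [name for name, _ in top]
kept_bases = sum(len(seq) for _, seq in top)
total_bases = sum(len(seq) for _, seq in toy_contigs)
print(top_names, f"{kept_bases}/{total_bases} bases kept")
```

Checking how many bases survive the cut gives a quick sense of whether the discarded contigs are a small or large part of the assembly.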


                    • #11
                      I think the related genome is too distant. I took 100 random reads and used BLAST to get an impression of what the mapping would be like. Two representative examples, each for one read of a 50-base read pair, are:

                      Code:
                      >Scaffold915 
                                Length = 323013
                      
                       Score = 42.1 bits (21), Expect = 0.006
                       Identities = 39/45 (86%)
                       Strand = Plus / Plus
                      
                                                                                 
                      Query: 6      ttccagacaaaacagacaacaaatcataatcataaatatcatttg 50
                                    |||| ||||||| ||||||||  || |||| ||||||||||||||
                      Sbjct: 261960 ttcctgacaaaatagacaacatttcttaattataaatatcatttg 262004
                      and

                      Code:
                      >Scaffold476 
                                Length = 632255
                      
                       Score = 40.1 bits (20), Expect = 0.025
                       Identities = 20/20 (100%)
                       Strand = Plus / Minus
                      
                                                        
                      Query: 8      caagaatttttttgatgaaa 27
                                    ||||||||||||||||||||
                      Sbjct: 568677 caagaatttttttgatgaaa 568658
                      I will proceed by implementing the filtering strategies for de-novo assembly.
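
To relate the two alignments above to the 16 bp exact-seed requirement discussed earlier in the thread, one can measure the longest run of identical aligned bases in each. This is only a rough check on the aligned regions as displayed by BLAST, not a reimplementation of Subread's seeding, and `longest_exact_run` is a hypothetical helper.

```python
def longest_exact_run(query: str, subject: str) -> int:
    """Length of the longest stretch of identical bases at the same
    positions in two equal-length, gap-free aligned strings."""
    best = run = 0
    for q, s in zip(query, subject):
        run = run + 1 if q == s else 0
        best = max(best, run)
    return best


# Aligned strings copied from the two BLAST hits above.
q1 = "ttccagacaaaacagacaacaaatcataatcataaatatcatttg"
s1 = "ttcctgacaaaatagacaacatttcttaattataaatatcatttg"
q2 = "caagaatttttttgatgaaa"
s2 = "caagaatttttttgatgaaa"

SEED_LENGTH = 16  # seed length mentioned earlier in the thread

for label, q, s in (("Scaffold915", q1, s1), ("Scaffold476", q2, s2)):
    run = longest_exact_run(q, s)
    print(label, run, "seed-length exact run" if run >= SEED_LENGTH else "no seed-length exact run")
```

The first alignment has no exact stretch of 16 bases (its longest is 14), while the second is an exact 20-base match covering less than half of the read.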
