Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Map 454 RNA-seq single-end reads on a genome: UPDATE

    Hi everybody !

    I want to annotate a non-model organism genome.

    I am using (among other) 454 RNA-seq single-end reads.

    I read several topics about the mapping of 454 RNA seq but they are quite old...
    It seemed that using TopHat was not OK because TopHat can deal only with short reads (like Illumina)

    However, now there is TopHat2 using Bowtie2.
    Apparently, they can deal with longer reads : In the manual (http://tophat.cbcb.umd.edu/manual.shtml), it is written : TopHat can align reads that are up to 1024 bp long.

    What do you think ?
    Is it now OK to align 454 reads with TopHat ?
    If yes, do you have any recommendation concerning the parameters to use ?

    Thanks a lot !

    Muriel

  • #2
    How long are your reads?

    Comment


    • #3
      As 454 data is written in binary .sff format, how does one tell how long the reads are? Standard tools for ascii text such as grep, awk, wc, etc can't be utilized...

      Comment


      • #4
        My data are in fast format.
        It seems that length are from 60 to 900 bp...

        I haven't trim on quality and length, maybe I should ? Which tool to use ? I used Sickle for another project with Illumina data but I heard it's not the best for 454, is it true ?

        Comment


        • #5
          Originally posted by MurielGB View Post
          My data are in fast format.
          It seems that length are from 60 to 900 bp...
          You could run this with BBMap, which is more error-tolerant than Tophat. For reads over 500bp, though, the command line would be a little different, as it needs to run in PacBio mode:

          (index)
          mapPacBio8k.sh ref=reference.fasta -Xmx23g

          (map)
          mapPacBio8k.sh in=reads.fq out=mapped.sam xstag=unstranded maxindel=200000 qin=33 intronlen=10 -Xmx23g

          The "-Xmx23g" sets the amount of memory to use; it should be set to around 85% of physical memory.

          I haven't trim on quality and length, maybe I should ? Which tool to use ? I used Sickle for another project with Illumina data but I heard it's not the best for 454, is it true ?
          Don't trim unless you have problems with the mapping rates. But I'm not sure how a trimming tool designed for Illumina data would perform on 454 data, anyway.
          Last edited by Brian Bushnell; 04-30-2014, 09:20 AM.

          Comment


          • #6
            Give BBMap a try: http://seqanswers.com/forums/showthread.php?t=41057

            What state is your reference sequence in (multiple contigs)?

            Edit: Brian beat me to it

            Brian: The reads here are in fasta format. I assume BBMap will accept them.

            Comment


            • #7
              OK thank you, I didn't know this software, I'll have a look on it !

              I wanted to use Tophat since I also have Illumina RNA-seq and for these ones I used Tophat and I thought, it would be cool to use the same...

              Comment


              • #8
                Originally posted by GenoMax View Post
                Edit: Brian beat me to it


                Brian: The reads here are in fasta format. I assume BBMap will accept them.
                Yes, it will.

                Comment


                • #9
                  Sorry, I meant FastQ format...

                  Comment


                  • #10
                    Originally posted by MurielGB View Post
                    Sorry, I meant FastQ format...
                    No problem, it supports input of fasta, fastq, scarf, and even sam.

                    Comment


                    • #11
                      Originally posted by MurielGB View Post
                      Sorry, I meant FastQ format...
                      In that case Brian's command line is valid for use.

                      Comment


                      • #12
                        What is the output of BBMap ?
                        I want then to use Cufflink, is it ok ?

                        Comment


                        • #13
                          Originally posted by MurielGB View Post
                          What is the output of BBMap ?
                          I want then to use Cufflink, is it ok ?
                          In the example command line above a SAM file.

                          Comment


                          • #14
                            If I understand well, I need to split my fastq files because for the short reads I will use bbmap.sh while for the longest ones I need to use mapPacBio8k.sh ?
                            I checked and I have reads from 45 to 1201 bp.

                            If this is the case, what length should be the max for the short read file used for bbmap.sh ?
                            And which length is the minimum for mapPacBio8k.sh ?

                            Do you know a tool to do that ?

                            Since this is RNA seq data, I guess I should index the reference that way :
                            bbmap.sh ref=ref.fasta k=14

                            But how to choose midpad ? My genome is 460 Mb long and there are 8,000 scaffolds.
                            I guess I could calculate the length of the longest gap in my reference and use a higher value ?

                            My longest expected read is 1201 bp so should I leave the default value (8,000) for startpad and stoppad ? Or maybe I can put 1,300 ???
                            Can you explain how this is gonna affect the rest of the analysis ?

                            Finally, if I use bbmap.sh for short reads and mapPacBio8k.sh for long reads, should I build two different index ? One using bbmap.sh and the other using mapPacBio8k.sh ?

                            Thank you,

                            Muriel
                            Last edited by MurielGB; 05-01-2014, 06:08 AM.

                            Comment


                            • #15
                              Originally posted by MurielGB View Post
                              If I understand well, I need to split my fastq files because for the short reads I will use bbmap.sh while for the longest ones I need to use mapPacBio8k.sh ?
                              I checked and I have reads from 45 to 1201 bp.

                              If this is the case, what length should be the max for the short read file used for bbmap.sh ?
                              And which length is the minimum for mapPacBio8k.sh ?

                              Do you know a tool to do that ?

                              Since this is RNA seq data, I guess I should index the reference that way :
                              bbmap.sh ref=ref.fasta k=14

                              But how to choose midpad ? My genome is 460 Mb long and there are 8,000 scaffolds.
                              I guess I could calculate the length of the longest gap in my reference and use a higher value ?

                              My longest expected read is 1201 bp so should I leave the default value (8,000) for startpad and stoppad ? Or maybe I can put 1,300 ???
                              Can you explain how this is gonna affect the rest of the analysis ?

                              Finally, if I use bbmap.sh for short reads and mapPacBio8k.sh for long reads, should I build two different index ? One using bbmap.sh and the other using mapPacBio8k.sh ?

                              Thank you,

                              Muriel
                              Muriel,

                              Just use mapPacBio8k.sh for all reads, it will work fine. It has a default midpad of 6000 which will also be fine for those reads. The value of midpad is not very important, as long as it is at least as long as your longest read; it's just there to prevent reads from mapping spanning two adjacent scaffolds.

                              Since you brought it up, it is worth noting that bbmap has default k=13 while the pacbio version has default k=12 because of pacbio's high error rate. And yes, a longer k can help with rnaseq data that has long introns, so it's probably worth using k=13 or k=14. To set k=14, you need to include that flag both while indexing and while mapping.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 08:47 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              54 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X