  • Questions about overlapping paired-end reads...

    Dear RNA-Seq analysis experts,
    This is my first time analysing RNA-Seq data, so I would appreciate your help with a couple of issues.
    My data consist of 8 samples sequenced using Illumina’s standard paired-end RNA-seq protocol (which unfortunately at the time was NOT carried out in a strand-specific manner). The fragment size was 220 bp, which left around 100 bp of insert after subtracting the adapters (2 x 58-60 bp). The two 60 bp reads of each pair therefore overlap by approximately 20 bp.
    I would like to use the bowtie/tophat/cufflinks pipeline and have a couple of questions regarding the analysis:
    1. Although this post was very informative, I cannot decide which of these is a better strategy for analysing reads in my case:
    a. Assembling the paired-end reads into 100bp single reads before…
    b. Directly using the paired-end reads in tophat [this gives a negative (-20) inner distance between pairs, but version 1.0.13 onwards seems to be able to handle this scenario through the -r option; am I right?].
    2. Since the Illumina protocol was not strand-specific, is it a good idea to convert (correct word?) all the resulting mapping data or the sequence reads (after an initial round of mapping) so that it matches a single strand of the genome? I wonder if this strategy will help cufflinks better assemble and quantify the transcripts…
    Thank you very much in advance for your help/suggestions/feedback…
    Fred
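
For reference, the inner-distance arithmetic in the question above can be sketched in a few lines of Python. This is only an illustration using the numbers quoted in the post (about 100 bp of insert, 2 x 60 bp reads); the function name is made up for the example, and the result is the kind of value one would pass to tophat's -r option.

```python
def mate_inner_distance(insert_len: int, read_len: int) -> int:
    """Inner distance between mates: insert length minus both read lengths.

    A negative result means the two reads overlap by that many bases.
    """
    return insert_len - 2 * read_len


# Numbers from the post: ~100 bp of insert, sequenced as 2 x 60 bp.
inner = mate_inner_distance(insert_len=100, read_len=60)
print(inner)  # -20, i.e. a 20 bp overlap between the mates
```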

  • #2
    If you decide to assemble your reads take a look at this recent paper presenting an open source pipeline:

    Rodrigue S, Materna AC, Timberlake SC, Blackburn MC, Malmstrom RR, et al. (2010) Unlocking Short Read Sequencing for Metagenomics. PLoS ONE 5: e11840. Available: http://dx.plos.org/10.1371/journal.pone.0011840.

    Abstract
    Background: Different high-throughput nucleic acid sequencing platforms are currently available but a trade-off currently exists between the cost and number of reads that can be generated versus the read length that can be achieved.
    Methodology/Principal Findings: We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read.
    Conclusions/Significance: This strategy is broadly applicable to sequencing applications that benefit from low-cost high-throughput sequencing, but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the Illumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing.



    • #3
      Originally posted by greigite
      If you decide to assemble your reads take a look at this recent paper presenting an open source pipeline:

      Rodrigue S, Materna AC, Timberlake SC, Blackburn MC, Malmstrom RR, et al. (2010) Unlocking Short Read Sequencing for Metagenomics. PLoS ONE 5: e11840. Available: http://dx.plos.org/10.1371/journal.pone.0011840.
      Thank you very much for your answer...
      Today, I tried the program stitch to assemble my paired-end reads... and it worked fine. Nevertheless, I will test SHERA tomorrow on my data and compare the performance.
      Fred



      • #4
        FredOnSeq, would you mind pointing me to where you found the "stitch" software? I can't seem to find it by Google'ing. Thanks in advance.



        • #5
          Stitch is available here:


          On the other hand, I have a working draft C program that does adapter stripping and paired-end merging, similar to Stitch and SHERA, for large, potentially gzipped datasets. In my testing it processes somewhere around 20M 2 x 100 bp pairs per hour. It's available here if anyone is interested:

          Tool for stripping adaptors and/or merging paired reads with overlap into single reads. - jstjohn/SeqPrep


          I don't have correctness statistics available, but the program can copy a subset of the merged reads into a human-readable aligned format so you can sanity check the settings. The defaults seem to work well with my data.
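
For readers unfamiliar with overlap merging, here is a rough Python sketch of the general idea. It is not the algorithm used by SeqPrep, Stitch or SHERA; the function names, the `min_overlap` and `max_mismatch_frac` parameters and their default values are assumptions made for illustration only. It also ignores base qualities and the adapter read-through case (insert shorter than the read length), which real tools have to handle.

```python
# Hypothetical helper names; a toy illustration, not any tool's implementation.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with its mate into one sequence.

    Returns the merged sequence, or None if no overlap of at least
    min_overlap bases passes the mismatch threshold.
    """
    mate = reverse_complement(read2)  # bring the mate onto read1's strand
    # Try the longest candidate overlap first and accept the first that passes.
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            # Keep read1 in full and append the non-overlapping part of the mate.
            return read1 + mate[overlap:]
    return None
```

With a 100 bp insert sequenced as 2 x 60 bp, `merge_pair(read1, read2)` should find the 20 bp overlap and return the 100 bp fragment, which is the "single long read" option discussed at the top of the thread.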



          • #6
            I have had some trouble installing Stitch. If anyone has successfully run it, could you point me in the right direction? The error is as follows:
            > python setup.py install
            Traceback (most recent call last):
              File "setup.py", line 9, in <module>
                setup(
            NameError: name 'setup' is not defined

            I'm also having some trouble with SeqPrep. Is there any reason why the merged file produced with option 's' should look like a binary file (it dumps a bunch of garbage onto the screen when I open it on the command line)?
            Last edited by greigite; 03-23-2011, 01:42 PM. Reason: addition



            • #7
              For now SeqPrep outputs gzipped files regardless of the name you give the output. If it's just the Phred scores that look weird, that could be because you have ASCII Phred+64 qualities and didn't supply -6 as a command-line argument.
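
Both points can be checked with a few lines of Python. This is a minimal sketch, not part of SeqPrep: the helper names are hypothetical, and it only assumes that a gzip file starts with the standard two magic bytes and that quality characters encode Phred scores with an offset of 33 or 64.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file for reading, handling gzip whatever its extension."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to integer Phred scores.

    Use offset=64 for older Illumina (Phred+64) data, offset=33 otherwise.
    """
    return [ord(ch) - offset for ch in quality]
```

If a merged file that "looks binary" opens cleanly through `open_maybe_gzip`, it was simply gzipped. If the scores decoded with `offset=33` come out implausibly high (for example mostly above 40), the data are probably Phred+64.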
