  • Additional question regarding BBtools

    Hello,
    This thread helped me with an issue I'm dealing with, but I've become stuck and am seeking some advice.

    I am trying to use BBMap or BBTools to fix what might be either an incorrectly interleaved file or a file that has reads formatted as single-end but really contains a collection of paired and single end reads.

    I have interleaved some paired-end RNAseq Illumina reads in order to run them through a program called 'sortmerna' to remove rRNA from my reads. Here is the testformat.sh output for the interleaved input file:


    Code:
    /mnt/home/steepale/Apps/bbmap/testformat.sh ./data/017798-1_1_AGTGAG_L001_interleaved_001_100K.fastq
    sanger	fastq	raw	interleaved	125bp

    If BBTools can filter out rRNA itself, I would appreciate any advice on that. The sortmerna output file seems to be either incorrectly interleaved or simply in single-end format. Here is the testformat.sh output for the output file I am interested in:


    Code:
    /mnt/home/steepale/Apps/bbmap/testformat.sh ./data/017798-1_1_AGTGAG_L001_norRNA_001_100K.fastq
    -1	fastq	raw	single-ended

    I would ultimately like to take the reads from this output file and separate them into forward and reverse reads in two files; essentially, I want to map them with tophat.

    Is anyone familiar with BBTools and how to fix such an issue? I've hit a roadblock.
    Last edited by steepale; 10-06-2016, 09:05 AM.

  • #2
    I have moved your question to a new thread to make it visible.

    A solution to your problem can come from BBTools themselves. There is a program called bbsplit.sh that you can use with your original data (R1/R2 files). Provide it with the rDNA repeat sequence (if one is available for your genome). BBSplit will then write any reads that align to that sequence to one file, whereas the rest go to a different file. BBMap is also splice-aware, so you could use it to align your RNAseq data (it should perform better than tophat).

    You can use reformat.sh from BBMap to see if the reads are correctly interleaved:

    Code:
    $ reformat.sh in=reads.fastq verifypairing
    Then optionally de-interleave them:

    Code:
    $ reformat.sh in=reads.fastq out1=r1.fastq out2=r2.fastq
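
To make the two steps above concrete, here is a minimal awk sketch of what "verify pairing" and "de-interleave" mean. It is not how reformat.sh is implemented. It assumes four-line FASTQ records and read names in which mates share the same name up to the first space or a trailing /1 or /2. The function names check_interleaved and deinterleave are made up for illustration.

```shell
# Illustrative only: assumes 4-line FASTQ records (no wrapped sequence lines).

# Print "ok" if records 1+2, 3+4, ... carry matching read names;
# otherwise report the first problem and return a non-zero status.
check_interleaved() {
    awk '
    function base(h) {   # strip leading @, the comment after a space, and a /1 or /2 suffix
        sub(/^@/, "", h); sub(/[ \t].*$/, "", h); sub(/\/[12]$/, "", h)
        return h
    }
    NR % 4 == 1 {
        n++
        if (n % 2 == 1) {
            first = base($0)
        } else if (base($0) != first) {
            print "mismatch at record " n ": " first " vs " base($0)
            bad = 1
            exit
        }
    }
    END {
        if (bad) exit 1
        if (n % 2 == 1) { print "odd number of records: " n; exit 1 }
        print "ok"
    }' "$1"
}

# usage: deinterleave in.fq r1.fq r2.fq
# Even-numbered records (0, 2, ...) go to r1.fq, odd-numbered ones to r2.fq.
deinterleave() {
    awk -v o1="$2" -v o2="$3" '
    {
        rec = int((NR - 1) / 4)          # zero-based record index
        if (rec % 2 == 0) { print > o1 } else { print > o2 }
    }' "$1"
}
```

De-interleaving by position like this is only safe once the pairing check passes. If it fails, the mates have to be re-paired by name first.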
    Last edited by GenoMax; 10-06-2016, 09:22 AM.


    • #3
      Here's some additional advice which might add to the convo.

      Also, thanks GenoMax, I've located chicken-specific rDNA clusters and am lifting them over to the correct genome build.

      "Assuming the reads stayed in the same order but sortmerna just removed some of them, you can use repair.sh like this:

      repair.sh in=017798-1_1_AGTGAG_L001_norRNA_001_100K.fastq out1=r1.fq out2=r2.fq outs=single.fq fint

      If the reads were reordered you'd need the "repair" flag instead of "fint", but they probably were not. The "repair" flag will always work; it just uses more memory than "fint".
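
For intuition about why the order-preserving case is cheap, here is a rough awk sketch of re-pairing a file whose reads are still in order but have lost some mates. It is only a guess at the idea, not repair.sh's actual code. It assumes four-line FASTQ records and mate names that match after dropping a trailing /1 or /2 or anything after the first space. The function name fix_interleaving is hypothetical.

```shell
# usage: fix_interleaving in.fq r1.fq r2.fq single.fq
# Only one record is held in memory at a time: each read is compared with the
# read directly before it, which is enough when the original order was kept.
fix_interleaving() {
    awk -v o1="$2" -v o2="$3" -v os="$4" '
    function base(h) {   # strip leading @, the comment after a space, and a /1 or /2 suffix
        sub(/^@/, "", h); sub(/[ \t].*$/, "", h); sub(/\/[12]$/, "", h)
        return h
    }
    {
        rec = rec (NR % 4 == 1 ? "" : "\n") $0     # collect the 4 lines of one record
        if (NR % 4 == 1) name = base($0)
        if (NR % 4 != 0) next

        if (held != "" && name == heldname) {      # mate of the held read: write the pair
            print held > o1
            print rec > o2
            held = ""
        } else {
            if (held != "") print held > os        # the held read lost its mate
            held = rec
            heldname = name
        }
        rec = ""
    }
    END { if (held != "") print held > os }' "$1"
}
```

If the reads had been reordered, adjacent comparison would not be enough: every unpaired read would have to be kept in a name-indexed table until its mate turned up, which is why a general repair needs more memory.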

      However, you can avoid this problem in the first place if you use BBDuk for kmer-matching to remove rRNAs, if you have the ribosomal sequence, since BBDuk will keep pairs together:

      bbduk.sh in=interleaved.fq out1=filtered1.fq out2=filtered2.fq outm1=rrna1.fq outm2=rrna2.fq ref=ribosomes.fa k=31

      You can also use a bulk set of ribosomal sequences like Silva, but using the species-specific ribosomal sequences is much more precise."
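
As a toy illustration of kmer-matching that keeps pairs together, the awk sketch below indexes every k-mer of a reference FASTA (plus reverse complements) and sends a whole pair to the "matched" file if either mate contains one of them. This is an assumption-laden simplification, not BBDuk's implementation: it expects a non-empty reference, a correctly interleaved four-line FASTQ, exact k-mer matches only, and it will be far slower than BBDuk on real data. The function name kmer_filter_pairs is hypothetical.

```shell
# usage: kmer_filter_pairs ref.fa interleaved.fq clean.fq matched.fq [k]
kmer_filter_pairs() {
    awk -v clean="$3" -v matched="$4" -v k="${5:-31}" '
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    function revcomp(s,   i, c, rc) {
        rc = ""
        for (i = length(s); i >= 1; i--) {
            c = substr(s, i, 1)
            rc = rc ((c in comp) ? comp[c] : "N")
        }
        return rc
    }
    function add_kmers(s,   i, km) {     # index each k-mer of s and its reverse complement
        for (i = 1; i + k - 1 <= length(s); i++) {
            km = substr(s, i, k)
            ref[km] = 1
            ref[revcomp(km)] = 1
        }
    }
    function hits(s,   i) {              # 1 if any k-mer of s is in the reference index
        s = toupper(s)
        for (i = 1; i + k - 1 <= length(s); i++)
            if (substr(s, i, k) in ref) return 1
        return 0
    }
    FNR == NR {                          # first file: the reference FASTA
        if ($0 ~ /^>/) { add_kmers(seq); seq = "" } else { seq = seq toupper($0) }
        next
    }
    FNR == 1 { add_kmers(seq); seq = "" }   # flush the last reference sequence
    {
        line[(FNR - 1) % 8] = $0         # buffer one pair (8 lines)
        if (FNR % 8 != 0) next
        dest = (hits(line[1]) || hits(line[5])) ? matched : clean
        for (i = 0; i < 8; i++) print line[i] > dest
    }' "$1" "$2"
}
```

Because the decision is made once per pair, both output files stay correctly interleaved and no repair step is needed afterwards.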


      • #4
        I would not bother doing any liftover since the sequence of rDNA is unlikely to change between builds.

        Use one copy of the full rDNA repeat (don't bother with multiple copies, since those are just tandem repeats in most organisms) with whichever tool (bbduk or bbsplit) you choose.
