  • Demultiplexing HiSeq 2000 reads containing an N at the 5' end

    Hello all:

    I have an issue that I don't think has been covered yet in the SEQanswers community.

    We recently performed CAGE (http://en.wikipedia.org/wiki/Cap_ana...ene_expression), a method for large-scale profiling of 5' mRNA ends, and obtained a large number of high-quality 50 bp single-end (SE) Illumina HiSeq reads.

    The run pools data from 8 separate experiments, so the 5' ends of the reads were barcoded with 8 trinucleotides, as follows:

    sample_1 ACC
    sample_2 CAC
    sample_3 AGT
    sample_4 GCG
    sample_5 ATG
    sample_6 TAC
    sample_7 ACG
    sample_8 GCT

    Of course, these samples need to be demultiplexed so they can be analyzed separately. I did so using the FASTX-Toolkit's FASTX Barcode Splitter (http://seqanswers.com/forums/newthre...ostthread&f=18) as follows:

    cat myCompleteCAGEfile.fastq | fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --exact --suffix ".txt" --prefix /my_directory/demulti-

    I chose the --exact flag because the barcodes are only three bases in length, so I reasoned it was best to demand a precise match and then rescue the unmatched reads after the fact.

    The demultiplexing job above worked well, and I was left with a small (<5%) but not insignificant number of unmatched reads. The largest class of these unmatched reads has an N at the first base.

    For example, one of the first reads begins with:
    NCTGAGAGCGG...
    Here the barcode (N)CT would correspond to sample_8 (GCT).

    I ran the fastx_barcode_splitter.pl command again with a tolerance of a single mismatch, but this causes conflicts between possible barcodes. As far as I know, the command does not allow a mismatch tolerance to be specified at a particular base, which would be ideal in this case. The program also does not accept a degenerate barcode file that includes the N.
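To see why a one-mismatch tolerance may produce conflicts with this barcode set, it can help to list the barcode pairs that differ at only one position. The sketch below is plain Python and is not part of the FASTX-Toolkit; the function names `hamming` and `close_pairs` are illustrative, and the barcodes are copied from the list above.

```python
from itertools import combinations

# Barcodes as listed in the post above.
BARCODES = {
    "sample_1": "ACC",
    "sample_2": "CAC",
    "sample_3": "AGT",
    "sample_4": "GCG",
    "sample_5": "ATG",
    "sample_6": "TAC",
    "sample_7": "ACG",
    "sample_8": "GCT",
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def close_pairs(barcodes, max_dist=1):
    """Return (name1, name2, distance) for barcode pairs within max_dist."""
    pairs = []
    for (n1, b1), (n2, b2) in combinations(barcodes.items(), 2):
        dist = hamming(b1, b2)
        if dist <= max_dist:
            pairs.append((n1, n2, dist))
    return pairs


if __name__ == "__main__":
    for n1, n2, dist in close_pairs(BARCODES):
        print(f"{n1} ({BARCODES[n1]}) vs {n2} ({BARCODES[n2]}): distance {dist}")
```

For the eight barcodes above, this reports five pairs at distance 1 (for example ACC/ACG and GCG/GCT), so a read carrying one mismatch can be equally close to two samples.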

    I've considered using a set of piped Linux commands, including cut and sed, but this would be trickier than it needs to be, and I expect there is another way to rescue these 'single leading N' unmatched reads. Can anyone point me in another direction? It may be possible to do this using CASAVA, but I have very limited experience with that software package.

    Thanks in advance,

    Taylor

  • #2
    If you recovered 95% of the reads that you are interested in, do you really need the remaining 5%? Generally an N indicates that the basecaller could not decide which base to call. In your example the last two bases (CT) are unique to one barcode, so your hypothesis as stated above may hold true, i.e. (N)CT must really be a GCT. You could recover the remaining reads by following that logic with some code, but if you are happy with the 95% then I would say ignore the rest.
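A minimal sketch of that rescue logic in Python, assuming a standard four-line-per-record FASTQ file of unmatched reads. A leading-N read is assigned only when the two bases after the N belong to exactly one barcode. With the barcode list from the first post, that holds for CC, GT, TG and CT, while AC (CAC/TAC) and CG (GCG/ACG) are shared and are left unassigned. The function names are illustrative, and the toy record reuses the read prefix quoted in the first post with an invented quality string.

```python
import io
from collections import defaultdict

# Barcodes as listed in the first post.
BARCODES = {
    "sample_1": "ACC", "sample_2": "CAC", "sample_3": "AGT", "sample_4": "GCG",
    "sample_5": "ATG", "sample_6": "TAC", "sample_7": "ACG", "sample_8": "GCT",
}


def unique_suffix_map(barcodes, skip=1):
    """Map barcode[skip:] to its sample, keeping only unambiguous suffixes."""
    owners = defaultdict(list)
    for sample, code in barcodes.items():
        owners[code[skip:]].append(sample)
    return {suffix: names[0] for suffix, names in owners.items() if len(names) == 1}


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:
            return
        yield tuple(record)


def rescue_leading_n(records, barcodes):
    """Yield (sample, record) for reads starting with N whose following
    bases match the suffix of exactly one barcode."""
    suffixes = unique_suffix_map(barcodes)
    length = len(next(iter(barcodes.values())))
    for record in records:
        seq = record[1]
        if seq.startswith("N"):
            sample = suffixes.get(seq[1:length])
            if sample is not None:
                yield sample, record


if __name__ == "__main__":
    # Toy input: sequence prefix from the first post, invented quality string.
    toy = io.StringIO("@read_1\nNCTGAGAGCGG\n+\nIIIIIIIIIII\n")
    for sample, record in rescue_leading_n(read_fastq(toy), BARCODES):
        print(sample, record[1])
```

On a real file, the same generator could be fed with `read_fastq(open(path))`, where `path` points to the unmatched-reads output, and the rescued records written to per-sample files.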



    • #3
      That's a very good point: this is an edge case, and I don't necessarily need to hold up the rest of the analysis on account of <5% of the reads.

      That said, I'm still interested in finding a solution to this problem so I can incorporate it in a pipeline that I'm building. If I find one I'll post it to this thread.



      • #4
        I agree with GenoMax; just throw those away. 3bp tags are really short; with an N, you have 2bp, plus an indication that the remaining 2 bases are probably low quality, or else why would the first be an N? Remember that there are miscalled bases in barcodes, too. If you accept barcodes with an N, a single miscalled base will cause cross-contamination.

        Of course, you already have some (like ACC and ACG) that are only a single base apart, so I hope the study is not sensitive to cross-contamination. But keeping the ones with N calls will just make the noise greater, because a 2bp code can be 1 substitution away from 3 or 4 other codes, thus increasing the chances of generating a valid code from a random substitution.
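As a rough check of that argument against the barcode list from the first post, the sketch below counts, for each code, how many other valid codes lie one substitution away, first for the full 3bp barcodes and then for the 2bp suffixes that remain when the first base is an N. The function names are illustrative and the counting assumes substitutions only, with no insertions or deletions.

```python
# Barcodes as listed in the first post.
BARCODES = ["ACC", "CAC", "AGT", "GCG", "ATG", "TAC", "ACG", "GCT"]


def one_sub_neighbors(code, alphabet="ACGT"):
    """All strings that differ from `code` by exactly one substitution."""
    return {
        code[:i] + base + code[i + 1:]
        for i in range(len(code))
        for base in alphabet
        if base != code[i]
    }


def neighbor_counts(codes):
    """For each distinct code, count the other valid codes one substitution away."""
    valid = set(codes)
    return {code: len(one_sub_neighbors(code) & valid) for code in sorted(valid)}


if __name__ == "__main__":
    print("3bp barcodes:", neighbor_counts(BARCODES))
    print("2bp suffixes:", neighbor_counts([code[1:] for code in BARCODES]))
```

For this set, the 2bp suffixes CC, CG and CT each have three valid neighbors, while only one full 3bp barcode (ACG) has that many, which is in line with the concern that accepting leading-N reads raises the chance that a single miscall yields another valid code.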



        • #5
          Hi Brian:

          Good points. I'll likely just keep these reads separate and go ahead with the analysis without them; not having them will not change the results, and we certainly have a tremendous number of reads. We are setting up to do similar 5' end profiling experiments in our lab, and when we do so we'll use much longer barcodes so we don't run into these ambiguity problems.

          Best regards,

          Taylor
