Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Pulling out paired reads containing a specific sequence in one pair

    I am wondering if anyone knows of an easy way (or tool) to take paired Illumina fastq data and isolate only those read pairs that contain a specific sequence (say a 20 or 30mer) as the first 20 (or 30) nucleotides of one of the reads.

    I will have a 2x150 MiSeq library where some percentage of the total reads (the percentage that I want) will have one read in the pair that starts with a specific nucleotide sequence. Ideally I would like to collect all read pairs where one read in the pair starts with the specific sequence as well as have that sequence trimmed from the read containing it before I map the pairs to the genome.

    Thanks.

  • #2
    BBDuk might work for that. It can bin (or trim) reads by the presence and absence of specific kmers, like this:

    bbduk.sh in=reads.fq outm=matching.fq out=unmatching.fq literal=ATGTTACGTCT k=11

    However, it looks at all of the kmers in a read, not just the first one. Does this sequence ever occur in the middle of the reads, and if so, what would you want to do with those reads?

    Comment


    • #3
      The sequence most likely always occurs at the beginning of the read, but I suppose if something were to be slightly off with the prep, it could occur later in the read. In that case I would also like to pull out those reads. So if I understand correctly, this might work for that purpose?

      Comment


      • #4
        However, does BBDuk allow for paired data? Ie, if the kmer is in read 1, will it allow for isolation of read 1 reads containing the kmer but also of read 2 pairs for those reads it identifies?)

        Comment


        • #5
          1) Yes, it will work perfectly, in this case.
          2) BBDuk always keeps pairs together, as long as it knows the input is paired. For twin files, the command would be:

          bbduk.sh in1=reads1.fq in2=reads2.fq outm1=matching1.fq outm2=matching2.fq out1=unmatching1.fq out2=unmatching2.fq literal=ATGTTACGTCT k=11

          You can later trim the reads with the "ktrim=l" flag.

          Comment


          • #6
            So I tried this on a very small test data set where I artificially inserted a specific 12mer (GACCAGCTAGTG) and it found all of the ones that I artificially inserted (as well as one that I didn't realize was there to begin with), but it also output a few read pairs as matches that look like they shouldn't belong. For example the read pair below:

            @IRIS:7:32:32:1772#0/1
            AAGGCTTTAGTCATGTGTTCAAGATCGAAAAAGGAA
            +
            aaaaaaaaaa`abab`a^aabaaa`ab`a`aaa`]a

            @IRIS:7:32:32:1772#0/2
            GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
            +
            abbbaab^aaa``_aaa]`^_Z\X`W]^_a_TQ[]Z


            Any ideas why it would be making some improper calls?

            Comment


            • #7
              Basically it found all 11 sequences that I know match the 12mer, but it also pulled out an additional 9 sequences that I have no idea why they are being called matching.

              Comment


              • #8
                Oh - by default, it looks for both a kmer AND its reverse compliment, and ignores the middle letter of the kmer to increase sensitivity. To disable these, add these flags:

                rcomp=f mm=f

                (where rcomp means 'look for reverse-compliments of kmers' and mm means 'mask middle').

                In this case, reverse-compliment of GACCAGCTAGTG = CACTAGCTGGTC, and:
                Code:
                                     [B]CACTAG[COLOR="Red"]C[/COLOR]TGGTC[/B]
                GAAGAAACCTCACAAGACTTT[B]CACTAGATGGTC[/B]AGA
                ...the middle base is masked. So it matches read 2.

                Comment


                • #9
                  Ah, brilliant. Now it works like a charm. Thanks for the help. I'll give it a try on my actual dataset whenever I get it back. Now if only you could make that happen faster

                  Comment


                  • #10
                    Now I may be getting greedy, but is there an option that would allow me to set a threshold for mismatches between my sequence and kmer. For example I know my kmer is exact, but it's possible that I would want to allow 1 or 2 mismatches from my kmer in sequences and still have them be called "matched". Is there an option for this? It wasn't immediately obvious to me looking at the usage.

                    Thanks

                    Comment


                    • #11
                      Yes! It is possible. You can set "hdist=1" for one mismatch or "hdist=2" for 2 (that stands for Hamming distance). You can also allow indels but that shouldn't be necessary with Illumina data.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin




                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                        04-22-2024, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      59 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      57 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      51 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      55 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X