  • Sorting fastq by primers, then searching by sequence (with mismatches)

    Hi all, after my near-immediate success solving my last problem on here I thought I'd ask about my other major struggles!

    I'm currently analysing a number of paired-end fastq files, each pair of which contains Illumina sequencing of the products of a duplex PCR for two variable regions.

    As there were two different primer-pair reactions in the mix (and the amplicon length is greater than the read length), I know that each read should start with one of the four primers (and that its mate should start with the corresponding partner primer). What comes after is variable, but dependent on the primer; the majority of that sequence should match one of the sequences in a separate fasta file I have.

    The two main problems I have at the moment are, first, gathering all the sequences from each file that start with each primer, and second, searching the reads that do start with a given primer to see which of the sequences in the fasta files their variable sequence corresponds to.

    I've been using the ShortRead package in Bioconductor so far for the first part, adapting a little line of script that I found on Darren Wilkinson's wonderful blog (http://darrenjw.wordpress.com/2010/1...s-of-ngs-data/), where I search the first X bases for an exact match to my primer, a little something like:

    primerreads = inputfastq[substr(sequences,1,X)=="XXXXXXXX"]

    I tried to use the filters that allow mismatches as described in that blog post, but they seem to trick me: I'd get out a decent number of sequences, but when I check them they all seem to be exactly the same. So, back to the substr method.

    Unfortunately, when I do this for all four primers in the mix, it only accounts for around 40-60% of all the reads that were in the original file. Now, while I'm sure that some of the reads will be erroneous, I know I'm missing some data (for instance, I can manually find reads where the primer seems to have shifted forward or backward a position, thus throwing off my match).

    Is there a way to do this better? Or am I worrying too much, and that's a reasonable number of reads to expect?
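
    To make the first problem concrete, here is a minimal sketch of primer binning that tolerates a few mismatches and a one-base shift at the start of the read. It is plain Python rather than ShortRead, and everything in it is an assumption for illustration: the function names, the `max_mismatches` and `max_shift` parameters, and the placeholder primer sequences are all hypothetical, and it only handles substitutions (no indels inside the primer).

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_primer(read, primers, max_mismatches=1, max_shift=1):
    """Return (primer_name, shift) for the best-matching primer, or None.

    shift > 0: the read has extra bases before the primer.
    shift < 0: the read starts partway into the primer.
    """
    best = None  # (mismatches, abs(shift), name, shift); smaller is better
    for name, primer in primers.items():
        for shift in range(-max_shift, max_shift + 1):
            if shift >= 0:
                target = primer
                window = read[shift:shift + len(primer)]
            else:
                target = primer[-shift:]       # drop leading primer bases
                window = read[:len(target)]
            if len(window) < len(target):
                continue                       # read too short to compare
            mm = hamming(window, target)
            if mm <= max_mismatches:
                cand = (mm, abs(shift), name, shift)
                if best is None or cand < best:
                    best = cand
    return None if best is None else (best[2], best[3])


def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it, "")
        next(it, "")                           # the '+' separator line
        qual = next(it, "")
        yield header.rstrip(), seq.rstrip(), qual.rstrip()


def bin_reads(records, primers, **kwargs):
    """Group records by assigned primer; unmatched reads go to 'unassigned'."""
    bins = defaultdict(list)
    for header, seq, qual in records:
        hit = assign_primer(seq, primers, **kwargs)
        bins[hit[0] if hit else "unassigned"].append((header, seq, qual))
    return bins


# Placeholder primers, far shorter than real ones, purely for illustration.
example_primers = {"A_fwd": "ACGTACGT", "B_fwd": "TTGGCCAA"}
```

    Passing an open fastq file to `read_fastq` and the result to `bin_reads` would give one list of reads per primer plus an "unassigned" bin, whose size shows how much the extra tolerance recovers compared with the exact `substr` match.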

    Any opinion at all on the bigger, second problem (matching the rest of the read, which is variable, to one of a list of sequences) would be very warmly appreciated - I'm really not a programmer, and there doesn't seem to be any bespoke program to do such a task on such large files, and I really don't have a handle on even where to begin.
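
    For the second problem, one naive starting point is sketched below, again in plain Python and under loudly stated assumptions: the variable part of the read (after the primer) is trimmed to a fixed-length query, it differs from the right reference by substitutions only (no indels), and the fasta is small enough to hold in memory. The names `best_window_distance`, `classify` and `max_mismatches` are hypothetical, and a proper aligner would likely be a better fit for large files.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def best_window_distance(query, reference):
    """Smallest Hamming distance between query and any same-length
    window of reference, or None if the query does not fit."""
    n = len(query)
    if n == 0 or n > len(reference):
        return None
    return min(
        sum(a != b for a, b in zip(query, reference[i:i + n]))
        for i in range(len(reference) - n + 1)
    )


def classify(query, references, max_mismatches=2):
    """Return (reference_name, distance) for a unique best hit.

    Returns None when nothing is within max_mismatches, or when two
    references tie for best (the read is treated as ambiguous).
    """
    scored = []
    for name, ref in references.items():
        d = best_window_distance(query, ref)
        if d is not None and d <= max_mismatches:
            scored.append((d, name))
    if not scored:
        return None
    scored.sort()
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1], scored[0][0]


# Made-up reference sequences, purely for illustration.
example_refs = {"V1": "AAACCCGGGTTTACGT", "V2": "TGCATGCATGCATGCA"}
```

    In practice `references` would be built with `dict(read_fasta(open_file))`, and because many reads share an identical variable sequence, remembering the result for each distinct query in a dictionary avoids repeating the scan.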

    Thanks for even reading this far, and double thanks for any help you're able to share!
