  • Sorting fastq by primers, then searching by sequence (with mismatches)

    Hi all, after my near-immediate success solving my last problem on here I thought I'd ask about my other major struggles!

    I'm currently analysing a number of paired end fastq files, each pair of which contains Illumina sequencing of the results of a duplex PCR for two variable regions.

    As there were two different primer pairs in the reaction mix (and as the amplicon length is greater than the read length), I know that each read should start with one of the four primers (and that the mate of a read starting with one primer should start with the corresponding partner primer). What comes after is variable, but dependent on the primer; the majority of that sequence should match one of the sequences in a separate fasta file I have.

    The two main problems I have at the moment are first gathering all the reads from each file that start with each primer, and then searching the reads that do start with a given primer to see which of the sequences in the fasta file their variable sequence corresponds to.

    I've been using the ShortRead package in Bioconductor so far for the first part, adapting a little line of script that I found on Darren Wilkinson's wonderful blog (http://darrenjw.wordpress.com/2010/1...s-of-ngs-data/), where I search the first X bases for an exact match to my primer, a little something like:

    primerreads = inputfastq[substr(sequences,1,X)=="XXXXXXXX"]

    I tried to use the filters that allow mismatches as described in that blog post, but they seem to trip me up - I'd get out a decent number of sequences, but when I checked them they all seemed to be exactly the same. So, back to the substr method.

    Unfortunately, when I do this for all four primers in the mix, it only accounts for around 40-60% of the reads that were in the original file. Now, I'm sure that some of the reads will be erroneous, but I also know I'm missing some data (for instance, I can manually find reads where the primer seems to have shifted forward or backward a position, thus throwing off my match).
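
To make the "shifted primer" problem concrete, here is a minimal sketch in plain Python (not the ShortRead approach used above) of a primer check that tolerates a few mismatches and a start shifted by one base in either direction. The primer names and sequences, the function names, and the default thresholds are all made up for illustration.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_primer(read, primers, max_mismatches=1, max_offset=1):
    """Return (primer_name, offset) for the best primer hit near the read start.

    offset > 0: extra bases precede the primer in the read.
    offset < 0: the read is missing the first bases of the primer.
    Returns None if no primer matches within max_mismatches.
    """
    best = None  # (mismatches, abs(offset), name, offset)
    for name, primer in primers.items():
        for offset in range(-max_offset, max_offset + 1):
            if offset >= 0:
                expected = primer
                window = read[offset:offset + len(primer)]
            else:
                expected = primer[-offset:]  # primer minus its first bases
                window = read[:len(expected)]
            if len(window) < len(expected):
                continue  # read too short to contain this primer
            d = hamming(window, expected)
            if d > max_mismatches:
                continue
            candidate = (d, abs(offset), name, offset)
            # Prefer fewer mismatches, then the smaller shift.
            if best is None or candidate[:2] < best[:2]:
                best = candidate
    return None if best is None else (best[2], best[3])


# Hypothetical primers, for illustration only.
example_primers = {"A_fwd": "ACGTACGT", "B_fwd": "TTGGCCAA"}

exact = find_primer("ACGTACGTGGGG", example_primers)
shifted_forward = find_primer("NACGTACGTGG", example_primers)
shifted_backward = find_primer("CGTACGTGGG", example_primers)
one_mismatch = find_primer("ACGTTCGTGGGG", example_primers)
no_primer = find_primer("GGGGGGGGGGGG", example_primers)
```

Binning each read by the returned primer name, and then checking that its mate carries the partner primer, would show how much of the missing 40-60% is due to shifts and single-base errors rather than genuinely off-target reads.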

    Is there a way to do this better? Or am I worrying too much, and that's a reasonable number of reads to expect?

    Any opinion at all on the bigger, second problem (matching the rest of the read, which is variable, to one of a list of sequences) would be very warmly appreciated - I'm really not a programmer, and there doesn't seem to be any bespoke program to do such a task on such large files, and I really don't have a handle on even where to begin.
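
One possible starting point for the second problem, again as a plain Python sketch with made-up names and data: after trimming the primer, compare what is left of the read against every sequence in the fasta file and keep the closest one. This assumes each reference sequence starts at the same position as the trimmed read and that errors are substitutions only (no insertions or deletions); if that does not hold, a proper aligner would be the better tool.

```python
def parse_fasta(text):
    """Parse fasta-formatted text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def best_reference(variable_seq, references, max_mismatches=2, min_overlap=10):
    """Return the name of the single closest reference, or None.

    Only the overlapping prefix of read and reference is compared.
    None is returned when nothing is close enough or the best hit is tied.
    """
    scored = []
    for name, ref in references.items():
        if min(len(variable_seq), len(ref)) < min_overlap:
            continue  # too little sequence to compare
        # zip stops at the shorter sequence, so only the overlap is scored.
        d = sum(x != y for x, y in zip(variable_seq, ref))
        scored.append((d, name))
    scored.sort()
    if not scored or scored[0][0] > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous: two references are equally close
    return scored[0][1]


# Hypothetical reference sequences, for illustration only.
example_fasta = """>alleleA
ACGTACGTACGTAAAA
>alleleB
ACGTTTTTACGTCCCC
"""
references = parse_fasta(example_fasta)

perfect = best_reference("ACGTACGTACGTAA", references)
one_error = best_reference("ACGTACGAACGTAA", references)
unmatched = best_reference("GGGGGGGGGGGGGG", references)
```

Comparing every read against every reference like this is slow on large files, so it is meant to illustrate the logic rather than to serve as a finished tool.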

    Thanks for even reading this far, and double thanks for any help you're able to share!
