SEQanswers

Old 12-05-2014, 07:51 AM   #1
JenBarb
Member
 
Location: Bethesda, MD

Join Date: Oct 2010
Posts: 47
Pull out unknown primers from FASTQ file?

Hello,
I have FASTQ files from 16S sequencing. The reads in these files contain 6 different primers, and I am wondering whether anyone knows of a method to pull out the beginning of each read, maybe the first 20-30 bp, and then look for consensus sequences among those. That would give me the sequences of the 6 primers, which are proprietary information belonging to the company that makes the kits.

Can anyone think of a way to do this?
Thanks!
Old 12-05-2014, 10:51 AM   #2
cmbetts
Senior Member
 
Location: Bay Area

Join Date: Jun 2012
Posts: 112

Something like that should be pretty easy to do in any scripting language that has a FASTQ parsing library (or, heck, by manually inspecting the FASTQ, since there are only 6 primers, if you're not keen on programming).

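Example R pseudocode (only because that's what I'm comfortable with):
library("ShortRead")  # load the library for FASTQ manipulation
fq_data <- readFastq("reads.fastq.gz")  # read in the FASTQ data
base_info <- sread(fq_data)  # get just the base calls (a DNAStringSet)
base_info <- base_info[width(base_info) >= 20]  # drop reads shorter than 20 bp, which subseq() would reject
first20 <- subseq(base_info, start = 1, end = 20)  # get the first 20 bp of each read

Then you could do something like:
head(sort(table(as.character(first20)), decreasing = TRUE)) to see the most frequent 20 bp sequences
consensusMatrix(first20) to get per-position base counts, from which a consensus can be read (alphabetFrequency() only reports overall base composition, not a consensus)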
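For anyone without R handy, the same tally of read prefixes can be sketched with standard command-line tools. This is only an illustration, assuming an uncompressed FASTQ with exactly four lines per record; the function name top_prefixes and its arguments are made up here and are not part of any package.

```shell
# Tally the most common read prefixes in a 4-line-per-record FASTQ file.
# top_prefixes is a hypothetical helper: top_prefixes FILE [LENGTH] [TOP_N]
top_prefixes() (
    fastq=$1; len=${2:-20}; top=${3:-10}
    # Sequence lines are lines 2, 6, 10, ...; reads shorter than LENGTH are skipped.
    awk -v n="$len" 'NR % 4 == 2 && length($0) >= n { print substr($0, 1, n) }' "$fastq" |
        LC_ALL=C sort | uniq -c |
        LC_ALL=C sort -k1,1nr -k2,2 |   # most frequent first, ties alphabetical
        head -n "$top" |
        awk '{ print $2 "\t" $1 }'      # output: prefix <TAB> count
)
```

After decompressing the reads, something like `top_prefixes reads.fastq 20 10` would list the ten most frequent 20 bp prefixes. If the primers carry degenerate bases, expect each primer to show up as several closely related rows rather than one.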
Old 12-11-2014, 05:00 AM   #3
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

IIRC, some FASTQ quality control pipelines will spot and report possible primer sequences.
Old 12-11-2014, 05:20 AM   #4
JenBarb
Member
 
Location: Bethesda, MD

Join Date: Oct 2010
Posts: 47

maubp,
I would love to find a tool that will spot and report possible primers. Can you be more specific?

I tried sorting and tallying up the sequences, but I am not finding the primers that way. Which tool are you referring to?

Thanks a bunch.
Jen
Old 12-11-2014, 05:25 AM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543

e.g. FastQC reports overrepresented sequences, which ought to include your primers:
http://www.bioinformatics.babraham.a...Sequences.html
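If you would rather read that module from FastQC's text report than from the HTML page, a small sketch follows. It assumes FastQC has already been run and its output unpacked so that a fastqc_data.txt file is available; extract_overrepresented is a hypothetical helper name, not something FastQC ships.

```shell
# Print the "Overrepresented sequences" module from a FastQC fastqc_data.txt report.
# extract_overrepresented is a hypothetical helper: extract_overrepresented FILE
extract_overrepresented() (
    awk '
        /^>>Overrepresented sequences/ { inside = 1; next }  # module header line
        /^>>END_MODULE/                { inside = 0 }        # end of any module
        inside                         { print }             # column header plus rows
    ' "$1"
)
```

The output is the module's column header followed by one row per flagged sequence. These rows are whole read starts rather than bare primers, so look for a shared leading stretch across them.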
Old 04-09-2015, 07:42 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

If you have not solved this problem yet, there's another option, using BBTools:

reformat.sh in=reads.fq out=trimmed.fq ftr=19

This will trim all but the first 20 bases (all bases after position 19, zero-based).

kmercountexact.sh in=trimmed.fq out=counts.txt fastadump=f mincount=10 k=20 rcomp=f
This will generate a file containing the counts of all 20-mers that occurred at least 10 times, in a 2-column format that is easy to sort in Excel. For example:

Code:
ACCGTTACCGTTACCGTTAC	100
AAATTTTTTTCCCCCCCCCC	85
...etc. If the primers are 20bp long, they should be pretty obvious.
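If Excel is not convenient, the two-column counts file can also be ranked on the command line. A minimal sketch, assuming counts.txt is tab-separated with the k-mer in column 1 and its count in column 2, as in the example above; top_kmers is a hypothetical helper name.

```shell
# Print the N most frequent k-mers from a two-column (k-mer <TAB> count) file.
# top_kmers is a hypothetical helper: top_kmers FILE [TOP_N]
top_kmers() (
    # Sort numerically on column 2, descending; break ties alphabetically on column 1.
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 "$1" | head -n "${2:-10}"
)
```

For example, `top_kmers counts.txt 20` would show the 20 most abundant 20-mers, which is where six primers should stand out if they really are 20 bp long.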
Old 05-27-2015, 06:54 AM   #7
JenBarb
Member
 
Location: Bethesda, MD

Join Date: Oct 2010
Posts: 47

Hi Brian,
How should I cite your tool in a manuscript I am preparing? Do you have a reference, or should I cite your website?
Thanks,
Jen
Old 05-27-2015, 09:47 AM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Hi Jen,

My tools are all still unpublished, so please just cite my name and website. Thanks!

-Brian

All times are GMT -8.