SEQanswers (
-   Metagenomics (
-   -   How to extract subregions from Miseq 16srRNA data? (

LijunZ 03-15-2017 09:18 AM

How to extract subregions from Miseq 16srRNA data?
We did 2X300 pair end sequence with Miseq to cover V1-V3 regions of 16srRNA. We use primer 27F (AGAGTTTGATCATGGCTCAG) for V1, and position 534-518 R (ATTACCGCGGCTGCTGG) for V3 regions.

Now, if I want to check V1 region only, I extract the reads have 27F primer:


Then keep the first 92bp for both read1/read2 as V1 region.

My question, do I need to consider reverse complimentary seq for 27F primer for both pair end reads? Am I right to extract from both paired read with same primer? Why my left reads (with V1 primer) from above have lots of V3 primer (over 30%)?

Is there recommend way to extract subregion for this pair end sequence, i.e V1 or V3 only with known design primer?


Brian Bushnell 03-15-2017 02:56 PM

I suggest that you first merge the reads. Then, you can find and cut the primer-flanked regions using the BBMap package, like this:

Code: in=merged.fq out=locations1.sam query=AGAGTTTGATCATGGCTCAG in=merged.fq out=locations2.sam query=??????????? in=merged.fq out=V1 sam1=locations1.sam sam2=locations2.sam

For the second query, rather than ????????, you need to use the sequence flanking the opposite side of the V1 region, whatever that is.

thermophile 03-16-2017 06:56 AM

The end of v3 is ~515. V1-V3 should be using pairs like 27f & 519R. 27F, 173F are both forward primers and won't give you any product? I'm guessing you were given these sequences from someone else? I'd go back and clarify what primers they actually used.

LijunZ 03-16-2017 06:58 AM

Thanks Brian:
I have tried to merge pair reads, actually, we did two experiences to cover V1-3 and V3-5 regions for the same samples with over 1million reads pairs. However we didn't get enough joined reads, especially for V3-5 regions. Bioanalysor show the peak for V35 around 600-800bp, which have lots of off targets. we can not merge these pair end reads.

I want to extract subregions from these pair reads to compare with the merged result. also want to extract V3 region from both experiences V1-3 and V3-5 to check whether the results consistent for the same sample.

Do you think it is doable?

Thanks again.

LijunZ 03-16-2017 07:18 AM

Thanks thermophile:

You are right, we are using 27F for V1, and position 534-518 (ATTACCGCGGCTGCTGG) for V3. I put typo as 173F.

if I check V1 primer, I can get around 47% reads:

grep "AGAGTTTGATCATGGCTCAG" Read1 > V1_R1_read --> 47% reads
grep "AGAGTTTGATCATGGCTCAG" Read2 > V1_R2_read --> 43% reads

if I check V3: I got 25% reads:
grep "ATTACCGCGGCTGCTGG" Read1 >V3_R1_Read --> 26% reads
grep "ATTACCGCGGCTGCTGG" Read2 >V3_R2_Read --> 28% reads

However, in V1 output reads (V1_R1/2_read), ~25% reads have V3 primer (or V3 primer reverse complementary)

and in V3 output reads (V3_R1/2_read), ~40% reads have V1 primer (or V1 primer reverse complementary)

For over 1million pair reads originally designed, we can only have 700k reads for V1 if we cat V1_R1_read and V1_R2_read and remove V3 primer reads; and 300k reads for v3 if we cat V3_R1_Read and V3_R2_Read and remove V1 primer reads.

Is it reasonable?


thermophile 03-16-2017 07:58 AM

Something bad may have happened to the run if only 25% have the reverse primer (that should be near the beginning of R2). It's the middle of the region that you are likely going to miss when sequencing such long amplicons. So if you can't find the primer, something is wrong. If you are losing lots of sequences in the merging-that's expected because your amplicons are too long for miseq.

If you want to just look at V3, don't merge the reads because you will have essentially no overlap. Instead pull R2 from the v1-3 set and R1 from v3-5, do in silico pcr to trim out just v3.

ETA: AAAHHHH actually, your issues may be that you are using grep for degenerate primers. There are many positions in those primers that have variable bases. Look for tools for "in silico pcr" because they are build to deal with degeneracies

Brian Bushnell 03-16-2017 10:03 AM

BBDuk, by the way, deals with degenerate bases and automatically reverse-complements as well. You can do something like this with a degenerate sequence to filter / see how many match (they will be called "contaminants"), optionally including some number of substitutions with the "hdist" flag: in=reads.fq outm=matched.fq k=16 mm=f literal=ACGTNNAKCCWTTACA copyundefined

LijunZ 03-16-2017 10:47 AM

Thanks Brian and thermophile:

Actually I use cutadapt to grep designed primer, and allowed 2 mismatch over primer for degenerated bases. then remove reverse complementary primer in the same reads.

For the low outcome of V3 read, could it come from PCR design? We used M13 (CAGGGTTTTCCCAGTCACGAC) + designed primer(ATTACCGCGGCTGCTGG in V13; CCTACGGGAGGCAGCAG in V35) in PCR to guide V3 region. Both experience in V13 and V35 have lower reads outcome of V3 only region compared with V1 and V5 only.

BTW, is it normal if we extract only V3 from V13 and V35 from same sample, but the otu table quite different from the same sample? Here, we use different primer for V3 region in both experience. (in V35 we use 341F to guide V3, and in V13, we use 534R to guide V3 region)

Thanks for your reply. Much appreciate it.

thermophile 03-16-2017 11:30 AM

Why were you using M13? you were cloning these first?

are you giving cutadapt the degenerate primers and allowing for 2 mismatches? or giving it one of the possible non-degenerate primers and allowing for 2 mismatches? The first should work, the second should result in losing many sequences because there are degeneracies and mismatch potential which equals more than 2 differences

All times are GMT -8. The time now is 01:38 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.