**pparg** · 02-23-2010, 09:42 AM

Hello,
Has anyone seen this before? Are there any thoughts? Thanks!

**dcjamison** · 02-23-2010, 12:16 PM

I've done PE vs RefSeq also, and my impression (not backed by numbers) is that the majority of anomalies are due to positional issues. I did not pursue it, since I was looking for something completely different in the data set, and because it was more or less what I was expecting.

There are two parts to your question: why are the anomalous pairs restricted to >36 bp inserts, and what is causing the small inserts.

The first is caused by the algorithm. The pairing step estimates the average size of the insert, then places cut-offs a couple of standard deviations out. In your case, the low-end cut-off seems to be 36, which happens to coincide with the read length.
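As a rough illustration of the cut-off idea described above, here is a minimal sketch. It assumes a plain mean and sample standard deviation with a symmetric window; the function names and the `n_sd` parameter are hypothetical, and a real aligner may well use a more robust estimate.

```python
from statistics import mean, stdev

def insert_size_cutoffs(insert_sizes, n_sd=2.0):
    """Return (low, high) cut-offs at mean +/- n_sd sample standard deviations.

    Assumption: a simple symmetric window; this is not any particular aligner's rule.
    """
    mu = mean(insert_sizes)
    sd = stdev(insert_sizes)
    return mu - n_sd * sd, mu + n_sd * sd

def is_anomalous(insert_size, low, high):
    """A pair is flagged when its insert size falls outside the window."""
    return insert_size < low or insert_size > high

# Hypothetical insert sizes from confidently paired reads.
sizes = [180, 190, 200, 210, 220]
low, high = insert_size_cutoffs(sizes)
flags = [is_anomalous(s, low, high) for s in (36, 200, 400)]
```

With made-up numbers like these, a 36 bp insert falls below the low cut-off and a 400 bp insert above the high one, while 200 bp is inside the window.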

The "short" inserts are probably caused by a difference between RefSeq and the biological realities. For example, RefSeq does not contain all the possible isoforms generated by alternative splicing. I have also noticed that the amount of UTR reported for various isoforms varies, which can lead to odd insert sizes.

I am pretty sure the sequencing hypothesis you're describing isn't very likely, since it would A) require the fragments to hang around through the cluster regen steps, which involve several washes; B) also be present in genomic PE sequencing; and C) result in read 2 being the read 1 adaptor.

My suggestion is to put the reads into two bed files (translating the coordinates from RefSeq to genomic) and load them as separate tracks into the UCSC browser: one colored blue with only consistent pairs, and one colored red with only the anomalous pairs. I predict the anomalous pairs will cluster to a limited number of places, and that by examining them in relation to the gene model you will be able to figure out what is going on.
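The two-track idea above could be sketched roughly as follows. This assumes the RefSeq-to-genomic coordinate translation has already been done and that each pair is reduced to one 0-based, half-open interval; the helper name `bed_track` and the example records are hypothetical. The `track` line with `name`, `description` and `color=R,G,B` follows the UCSC custom track format.

```python
def bed_track(name, rgb, pairs):
    """Build UCSC custom-track text: one track line followed by BED6 rows.

    pairs: iterable of (chrom, start, end, pair_id, strand) in genomic,
    0-based half-open coordinates (assumed already translated from RefSeq).
    """
    lines = [f'track name="{name}" description="{name}" color={rgb}']
    for chrom, start, end, pair_id, strand in pairs:
        # BED6 columns: chrom, start, end, name, score, strand
        lines.append(f"{chrom}\t{start}\t{end}\t{pair_id}\t0\t{strand}")
    return "\n".join(lines) + "\n"

# Hypothetical example records, for illustration only.
consistent = [("chr1", 1000, 1200, "pair1", "+")]
anomalous = [("chr1", 5000, 5036, "pair2", "+"), ("chr2", 700, 736, "pair3", "-")]

blue_text = bed_track("consistent_pairs", "0,0,255", consistent)
red_text = bed_track("anomalous_pairs", "255,0,0", anomalous)

# To load into the browser, write each string to its own file, e.g.:
# with open("consistent.bed", "w") as fh:
#     fh.write(blue_text)
```

Keeping the two classes in separate tracks makes it easy to toggle one off and see whether the red pairs pile up at a few loci.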

**pparg** · 02-24-2010, 05:34 PM

Thanks a lot, dcjamison, for all the thoughts and suggestions.
I think it is unlikely to be due to the cut-off setting. Even though over 99% of the abnormal read pairs have insert length <=36, there is still a very small number of read pairs with insert length >36.
It is also unlikely to be due to incomplete RefSeq data, because the insert length here covers the whole segment between the outer boundaries of the read pair. When the read length is 36, an insert length <=36 suggests that only the first mate has been re-sequenced from the other direction in the second round. I know my hypothesis does not seem likely to be the true cause, but I can't think of any other reason that explains the data.
Your last suggestion seems promising. I will try it out later.

**Xi Wang** · 03-02-2010, 06:32 AM

Is the insert length you mean measured from the rightmost position of mate 1 to the leftmost position of mate 2?

**pparg** · 03-27-2010, 09:06 AM

Yes, any thoughts?
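Since two definitions of insert length come up in this thread (the span between the outer boundaries of the pair, and the gap from the rightmost site of one mate to the leftmost site of the other), a small sketch may help compare them. The function names are hypothetical, and coordinates are assumed to be 0-based, half-open positions on the same reference sequence.

```python
def outer_span(m1_start, m1_end, m2_start, m2_end):
    """Length of the segment between the outer boundaries of the pair."""
    return max(m1_end, m2_end) - min(m1_start, m2_start)

def inner_gap(m1_start, m1_end, m2_start, m2_end):
    """Distance from the end of the left mate to the start of the right mate.

    Negative values mean the two mates overlap.
    """
    left, right = sorted([(m1_start, m1_end), (m2_start, m2_end)])
    return right[0] - left[1]

# Two 36 bp mates aligned to exactly the same position.
same_spot = (100, 136, 100, 136)
# Two 36 bp mates from opposite ends of a 200 bp fragment.
normal = (100, 136, 264, 300)

spans = (outer_span(*same_spot), outer_span(*normal))
gaps = (inner_gap(*same_spot), inner_gap(*normal))
```

Under the outer-boundary definition, two fully aligned 36 bp mates can never give a value below 36, and a value of exactly 36 means both mates sit on the same 36 bases. Under the inner-gap definition the same pair gives -36.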

**The_Roads** · 03-27-2010, 04:04 PM

Hi,

If i understand your description, we have seen the same thing in paired-end dna sequencing.

We see a small percentage of pairs where the two reads are either on top of each other or have very short inserts/paired-end distances. it depends on the assembler you use as to whether you can see or extract these pairs. in our case we suspect they are simply an artifact of the gel extraction steps, where a small percentage of short fragments end up in the extraction. i don't know if the same step is done in rna-seq so i may be totally wrong. for snp/indel detection we now remove all pairs like this, as the overlapping pairs double the counts for any variants they carry.
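A minimal sketch of the kind of filter described above, assuming each pair is given as two half-open `(start, end)` intervals on the same reference. The names `mates_overlap` and `drop_overlapping_pairs` are hypothetical and not taken from any assembler.

```python
def mates_overlap(m1, m2):
    """True when two half-open (start, end) intervals share at least one base."""
    return m1[0] < m2[1] and m2[0] < m1[1]

def drop_overlapping_pairs(pairs):
    """Keep only pairs whose mates do not overlap, so that a variant in the
    overlap is not counted twice."""
    return [(m1, m2) for m1, m2 in pairs if not mates_overlap(m1, m2)]

# Hypothetical pairs: stacked mates, partially overlapping mates, and a normal pair.
pairs = [
    ((100, 136), (100, 136)),
    ((100, 136), (120, 156)),
    ((100, 136), (264, 300)),
]
kept = drop_overlapping_pairs(pairs)
```

Dropping the whole pair is the simplest choice; an alternative would be to keep one mate, at the cost of more bookkeeping.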

we also see low frequency reads where segments of the read align inverted to each other. as with the above these show random and even/coverage dependent distribution across our ref seqs and we put these down to seq errors (internal priming?). i have seen these described in an rna-seq paper as evidence for unique transcripts but as they are inverted and seen all over the place i am personally doubtful about this, again i may be way off.

i guess while we are on the subject of pe errors, when we map re-arrangements by looking for extended inserts/read distances or backward-forward aligned pairs, we have found some libraries where we see more than usual, but again they are distributed evenly. Because in these libraries we see equal proportions of B-F and F-B pairs for each rearrangement locus, we put these down to ligation errors during library prep, ie some fragments are ligating to each other, creating false pairs. anyone have any idea whether this is a possibility? i'm pretty certain they are non-biological whatever the cause.

EDIT: we did some blast alignments of single reads and found the same B-F, F-B "mirrored" breakpoints so i think we can rule out ligation errors. i have no idea what could be causing the problem, whether biological or experimental. Any ideas welcome.

cheers,
The_Roads

**pparg** · 03-29-2010, 10:31 AM

Hello, The_Roads,
Your idea of internal priming is very interesting. It seems to explain read pairs with insert length <36 very well. If this is the case, such internal priming should occur at the PCR step in the library preparation stage, rather than at the sequencing stage. One thing I am not sure about is whether such internal priming is feasible or common in reality.
Ligation errors, if they occur, should lead to false 're-arrangement' calls. But if I understand correctly, the B-F and F-B pairs you mentioned were not likely due to ligation errors. I don't think a ligation error can generate a combined sequence where the two ends are B-F or F-B complementary to each other.
Anyway, thanks a lot for your informative input!!

**thinkRNA** · 03-29-2010, 10:39 AM

EST sequences which are created by sequencing ends of cDNA that are in turn generated using oligo-dT primers have evidence of as much as 15-20% internal priming (http://bioinformatics.oxfordjournals...ull/21/18/3691). If your sequencing was done using cDNA generated by oligo-dT primers, I don't see why it couldn't be as prevalent.

One way to check for it is to look for stretches of A's in the genome where the primer attached itself (instead of the polyA tail). I wonder if you can use this to check for errors in your paired-end data.
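The A-stretch check suggested above could look something like this sketch. It assumes you have already extracted the genomic sequence immediately downstream of the alignment end, on the transcript's strand; the function name and the `window` and `min_a` thresholds are arbitrary assumptions for illustration.

```python
def has_a_rich_stretch(downstream_seq, window=10, min_a=8):
    """True if any window of the given size contains at least min_a A's.

    downstream_seq: genomic bases just past the alignment end, read on the
    transcript strand (assumption: the caller has handled strand already).
    """
    seq = downstream_seq.upper()
    # When the sequence is shorter than the window, check it as a single window.
    n_windows = max(1, len(seq) - window + 1)
    return any(seq[i:i + window].count("A") >= min_a for i in range(n_windows))

# Hypothetical downstream sequences.
candidate = has_a_rich_stretch("GCTAAAAAAAAAAGC")   # genomic A-run: possible internal priming site
control = has_a_rich_stretch("GCTAGCTAGCTAGCTA")    # no A-rich window
```

For transcripts on the minus strand, the same test would apply to the reverse complement, or equivalently to T-rich stretches on the plus strand.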

**The_Roads** · 03-29-2010, 11:15 AM

Hi PPARG,

Thanks for the reply. The internal priming i mentioned was meant as internal priming during sequencing; it was something that was suggested by an illumina tech when we were discussing the problem reads. These errors turn up in our data at very low frequency (<1000 per 1.5M aligned reads) and are evenly distributed in all samples and across all ref seqs we've looked at while doing dna sequencing. hence i think they are seq errors.

The rearranged reads occur at similar frequencies but they are not overlapping pairs. what i meant was, for the sequence ABCDEFG we see similar frequencies of pairs that align, for instance, as C-E and as E-C. in our real biological rearrangements we only ever see one form, ie only C-E. these suspected ligation errors seem to have happened in only a few dna sequencing library preps and are always present at low frequency, ~3/10,000 reads. i hope this makes things clearer.

see edit above
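To make the C-E versus E-C comparison concrete, here is a minimal sketch that tallies both orientations per junction. It assumes each discordant pair has already been reduced to an ordered tuple of segment labels; the function name and the example counts are hypothetical, not data from this thread.

```python
from collections import Counter

def orientation_balance(junction_pairs):
    """Map each unordered junction to (count in sorted order, count reversed).

    junction_pairs: iterable of (segment_a, segment_b) in the order observed.
    """
    counts = Counter(junction_pairs)
    result = {}
    for (a, b), n in counts.items():
        key = tuple(sorted((a, b)))
        fwd, rev = result.get(key, (0, 0))
        if (a, b) == key:
            fwd += n
        else:
            rev += n
        result[key] = (fwd, rev)
    return result

# Hypothetical observations for the sequence ABCDEFG.
observed = [("C", "E")] * 5 + [("E", "C")] * 4 + [("B", "F")] * 6
balance = orientation_balance(observed)
```

Following the reasoning in the post, a junction seen in roughly equal numbers in both orders (like C-E here) would be a suspected artifact, while one seen in a single order (like B-F) looks more like a real rearrangement.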

pair-end sequencing produces single-end read artifact
