
**GenoMax** · 09-24-2014, 12:00 PM

If you recovered 95% of the reads that you are interested in, do you really need the remaining 5%? Generally an N indicates that the basecaller could not decide which base to call. In your case the last two bases are unique, so your hypothesis as stated above may hold true, i.e. (N)CT must really be a GCT. You could recover the remaining reads by following that logic with some code, but if you are happy with the 95% then I would say ignore the rest.
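As a rough illustration of that recovery logic, here is a minimal Python sketch that treats N as a wildcard and accepts a tag only when exactly one known barcode is compatible with it. The function name `resolve_barcode` and the barcode list are made up for the example; only GCT, ACC and ACG come from this thread, and TTA is a placeholder.

```python
def resolve_barcode(observed, known_barcodes):
    """Return the single known barcode compatible with `observed`, or None.

    An N in `observed` matches any base. If zero or several known
    barcodes are compatible, the tag is ambiguous and None is returned.
    """
    observed = observed.upper()
    matches = [
        bc for bc in known_barcodes
        if len(bc) == len(observed)
        and all(o == "N" or o == b for o, b in zip(observed, bc))
    ]
    return matches[0] if len(matches) == 1 else None


# Hypothetical barcode set: GCT, ACC and ACG appear in the thread,
# TTA is an invented placeholder.
BARCODES = ["GCT", "ACC", "ACG", "TTA"]

print(resolve_barcode("NCT", BARCODES))  # GCT (only one barcode ends in CT)
print(resolve_barcode("ACN", BARCODES))  # None (ACC and ACG both fit)
```

Note that this only resolves the N; it does nothing about miscalled bases elsewhere in the tag.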

**rtraborn** · 09-24-2014, 02:29 PM

That's a very good point. This is an edge case, and I don't necessarily need to hold up the rest of the analysis on account of <5% of the reads.

That said, I'm still interested in finding a solution to this problem so I can incorporate it in a pipeline that I'm building. If I find one I'll post it to this thread.

**Brian Bushnell** · 09-24-2014, 02:49 PM

I agree with GenoMax; just throw those away. 3bp tags are really short; with an N, you have only 2bp, plus an indication that those 2 remaining bases are probably low quality too, or else why would their neighbor be an N? Remember that there are miscalled bases in barcodes, too. If you accept barcodes with an N, a single miscalled base will cause cross-contamination.

Of course, you already have some (like ACC and ACG) that are only a single base apart, so I hope the study is not sensitive to cross-contamination. But keeping the ones with N calls will just make the noise greater, because a 2bp code can be 1 substitution away from 3 or 4 other codes, thus increasing the chances of generating a valid code from a random sub.
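To make that concrete, here is a small Python sketch that lists which barcode pairs are within one substitution of each other, first using all three bases and then ignoring the first base (as if it had been called N). The helper names and the barcode list are assumptions for illustration; only GCT, ACC and ACG are taken from this thread.

```python
from itertools import combinations


def hamming(a, b, skip=None):
    """Count mismatching positions between a and b, ignoring index `skip`."""
    return sum(
        1 for i, (x, y) in enumerate(zip(a, b))
        if i != skip and x != y
    )


def risky_pairs(barcodes, max_dist=1, skip=None):
    """Return barcode pairs that are at most `max_dist` substitutions apart."""
    return [
        (a, b) for a, b in combinations(barcodes, 2)
        if hamming(a, b, skip) <= max_dist
    ]


# Hypothetical barcode set; TTA is an invented placeholder.
BARCODES = ["GCT", "ACC", "ACG", "TTA"]

full = risky_pairs(BARCODES)            # all 3 bases compared
masked = risky_pairs(BARCODES, skip=0)  # first base treated as unknown

print(full)    # [('ACC', 'ACG')]
print(masked)  # [('GCT', 'ACC'), ('GCT', 'ACG'), ('ACC', 'ACG')]
```

With this toy set, dropping the first base takes the number of pairs that a single substitution can confuse from one to three, which is the extra noise described above.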

**rtraborn** · 09-25-2014, 09:47 AM

Hi Brian:

Good points. I'll likely just keep these reads separate and go ahead with the analysis without them; not having them will not change the results, and we certainly have a tremendous number of reads. We are setting up to do similar 5' end profiling experiments in our lab, and when we do so we'll use much longer barcodes so we don't run into these ambiguity problems.

Best regards,

Taylor


Demultiplexing HiSeq 2000 reads containing an N at the 5' end
