
**SNPsaurus** · 05-19-2013, 05:28 PM

Have you tried running it against the reference without extracting potential RAD sites? I've done large populations using novoalign against a reference (RAD or nextRAD) and the time is trivial. I also routinely take a population, identify the tags in the population (you can decide to include only predominant tags or all tags), and then align those tags against each other to determine alleles, and then count those alleles in each sample.
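As a rough illustration of the tag-counting part of that workflow, here is a minimal Python sketch. It is not the poster's actual pipeline: the function names, the `min_samples` threshold, and the assumption that reads are already trimmed to comparable tag sequences are all hypothetical, and the step that aligns tags against each other to call alleles is left out.

```python
from collections import Counter

def predominant_tags(samples, min_samples=2):
    """Return tags seen in at least `min_samples` samples.

    `samples` maps a sample name to the list of tag sequences read from it.
    The threshold is an illustrative choice, not a recommended value.
    """
    presence = Counter()
    for reads in samples.values():
        presence.update(set(reads))  # count each tag once per sample
    return {tag for tag, n in presence.items() if n >= min_samples}

def count_tags(samples, keep):
    """Per-sample read counts, restricted to the tags in `keep`."""
    return {
        name: Counter(r for r in reads if r in keep)
        for name, reads in samples.items()
    }
```

Setting `min_samples=1` corresponds to keeping all tags rather than only the predominant ones.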

Sorry to not address your question, but these are paths that work for me for RAD and nextRAD. I assume 2b-RAD would behave in similar ways.

**rcapper** · 05-19-2013, 06:38 PM

Interesting. I had previously resisted mapping to the whole genome (12.5k contigs) because I felt like there may be increased mapping error around the restriction site. However, I don't have any data about this yet, just a gut feeling that feeding the mapper known sites selected for their restriction sites would improve this. But... on the other hand, I filter and trim my raw reads and insist that each one contains the RAD site, so maybe I'm just being cautious for no reason.

Anyway -- I did what you suggested and mapped an individual to the whole genome, then ran the UnifiedGenotyper on that guy. We're down to 51 minutes! Looks like the reference database size does indeed make a huge time difference (12.5k contigs/1 hour vs 95k contigs/4.9 weeks...)

Another idea that was suggested to me was to extract the RAD sites from the genome, then concatenate them into artificial chromosomes, potentially using strings of 1000 N's to separate each tag. I think this is a good idea, but more complicated than just mapping to the genome.
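The concatenation idea could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name is made up, the tags are assumed to be already extracted as plain sequence strings, and the recorded offsets are there so that positions on the artificial chromosome can be traced back to the original tag.

```python
def build_artificial_chromosome(tags, spacer_len=1000):
    """Join tag sequences with runs of N and record where each tag lands.

    Returns (sequence, offsets), where offsets[i] is the 0-based start
    of tags[i] within the concatenated sequence.
    """
    spacer = "N" * spacer_len
    offsets = []
    pos = 0
    for tag in tags:
        offsets.append(pos)
        pos += len(tag) + spacer_len  # next tag starts after this tag plus one spacer
    return spacer.join(tags), offsets
```

Collapsing many short records into a few long ones reduces the number of contigs the downstream tools have to handle, at the cost of having to translate coordinates back to tags afterwards.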

Can you think of any reason why mapping to the genome instead of to the sites themselves would be better or worse? I can think of pros and cons for each side, but I am not convinced either way yet. Obviously, though, the computational time alone will make up my mind.

**SNPsaurus** · 05-19-2013, 08:08 PM

To me, the biggest source of error when mapping to a whole genome is from alignment to duplicate genes/genomic regions. Sometimes a tag will align to multiple locations. One is true, and the others spurious. But a sequencing error may be enough to shift the tag from one location to the other. Or, because few genomes are "complete", your tag may map to the duplicate region rather than its real locus.

But these are not serious issues. A small loss of tags from tossing out ones that map to multiple locations is just not an issue when you have 10k (or 100k) markers to play with. And in the second scenario, the usual information desired is comparing your samples to each other. So if all the samples map to a spurious locus because of missing sequence in the reference, they will be true in comparison to each other. Mapping issues might be more of a problem with 2b-RAD with its short sequence length, but I bet it will be OK.
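Tossing out tags that map to multiple locations might look something like the Python sketch below. It assumes the alignments have already been parsed into `(tag_id, contig, position)` tuples, one per reported hit; that input format and the function name are hypothetical, not something any particular aligner produces directly.

```python
from collections import defaultdict

def uniquely_mapped(alignments):
    """Keep only tags that align to exactly one locus.

    `alignments` is an iterable of (tag_id, contig, position) tuples,
    one per reported alignment. Returns {tag_id: (contig, position)}.
    """
    loci = defaultdict(set)
    for tag_id, contig, position in alignments:
        loci[tag_id].add((contig, position))
    # A tag with more than one distinct locus is ambiguous and is dropped.
    return {tag: next(iter(hits)) for tag, hits in loci.items() if len(hits) == 1}
```

Note that this only catches tags with more than one reported hit; it cannot detect the second scenario above, where the true locus is missing from the reference.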


2b-RAD and GATK
