Seqanswers Leaderboard Ad

**SNPsaurus** · 11-06-2013, 11:54 AM

This is a pretty bad hack, but will give you a ballpark:

grep -A6 CCTGCAGG neurospora_crassa.fasta | grep -c -E "CCGG|GGCC"

This finds SbfI in the forward direction, then looks in the next 360 nt (if the fasta file is 60 nt per line) for MspI in either direction. I'd double that number that results to get SbfI on the reverse strand, and then reduce it by how tight a fragment size range you are selecting. Are you using a Pippin Prep to make the size selection?

**odoyle81** · 11-06-2013, 01:12 PM

Cool, this seems to work!
How do I adapt this for a single enzyme?

I think they do a gel for fragment selection but I'm not sure.

for example, it doesn't seem to work for ApeKI.. I get the same number whether I search for ApeKI once or twice within 8 lines. Also, it doesn't agree with fuzznuc results:

Code:

grep -A8 -c -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta                                                                     
662343

grep -A8 -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta | grep -c -E 'GCAGC|GCTGC'                                            
662343

fuzznuc results (not searching complement):

Code:

ApeKI
GCWGC

#---------------------------------------
# Total_sequences: 12
# Total_length: 373245519
# Reported_sequences: 12
# Reported_hitcount: 857873
#---------------------------------------

**SNPsaurus** · 11-06-2013, 09:31 PM

Does fuzznuc just report on the number of sites total in the genome? If so, it would always be much higher than looking for a combination of sites that occur within a specific distance.

The single enzyme count is even more hack-y. In the original command, grep prints the line with the first site as well as 6 or 8 following lines. Those lines get sent to grep to look for the second site. If the second site is the same as the first, it will always find that site in the sequence sent to it, since it would be in the first line!

This is a better solution anyway:
cat neurospora_crassa.fasta | tr -d "\n" | grep -o -E "CCGG.{200,400}CCGG" | wc -l

It removes all the newlines from the file, then searches for the pattern, allows 200 to 400 any letters, then the pattern again, and counts those. I had made a mistake first time around searching for CCGG and GGCC... the reverse complement of CCGG is CCGG of course! To do the two-enzyme search just replace the first site with the SbfI site or whatever:
cat neurospora_crassa.fasta | tr -d "\n" | grep -o -E "CCTGCAGG.{200,400}CCGG" | wc -l
and then swap the sites to find fragments cutting with the frequent-cutter on the left.

I'm not sure how memory will hold up if you have a large genome but it worked with zebrafish so should be ok.

I hope they aren't doing multiple libraries and a tight size range using gel size selection. The ddRAD fixed lengths of markers means that if you select 10,000 loci from 250-350 bp, and the next library is 260-360, you will have ~20% missing data between the two libraries. If all the samples are in one library it won't be a problem.

**odoyle81** · 11-09-2013, 02:46 PM

Thanks for your replies!

Does fuzznuc just report on the number of sites total in the genome? If so, it would always be much higher than looking for a combination of sites that occur within a specific distance.

It reports the number of sites.. but my point is that this grep line should simply find all ApeKI sites.. and should match what fuzznuc is giving me.. correct?

Code:

grep -c -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta                                                                   
662343

Code:

>fuzznuc MSU-7-nb-all.chrs.fasta
Search for patterns in nucleotide sequences
Search pattern: GCWGC
Output report [chr1.fuzznuc]: output
>tail output 

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 12
# Total_length: 373245519
# Reported_sequences: 12
# Reported_hitcount: 857873
#---------------------------------------

But fuzznuc is reporting 857873 while the grep code is reporting 662343

Trying the new code you proposed gives a different number:

Code:

cat MSU-7-nb-all.chrs.fasta | tr -d "\n" | grep -o -E "GC[AT]GC" | wc -l                                           
788020

I think I need to understand why these are giving different results before I move forward with this. Any ideas?

Thanks!

**SNPsaurus** · 11-09-2013, 05:09 PM

There is probably a trivial explanation. grep works on a line-by-line basis, so it doesn't find sites that are partly on one line and part on the next.

So with test.txt:
ATATAATATATGC
GCATATA

$ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
1
$ grep -c GCGC test.txt
0

So that is why the simple grep underestimates compared to the one that puts all the lines into a single line. The other aspect of the complex grep is that counts multiple sites per line:
ATATAATATAT
GCGCATATAGCGC

$ grep -c GCGC test.txt
1
$ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
2

But, the complex grep fails in specific situation, where sites overlap:
ATATAATATAT
GCGCGCATATA

$ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
1

I assume fuzznuc would report '2' for that last test. But in practice, overlapping sites would not produce 2 separate GBS fragments (probably one is preferred as a site and would get cut first, destroying the second). So fuzznuc may overestimate in this case. Hopefully your project can tolerate a 10% difference in site # anyway.

GBS is particularly sensitive to polymorphism in the cut site, since it uses a frequent cutter. A 300 bp fragment will have 10-30 sites that are one polymorphism from creating the frequent cutter, which if made will make a shorter fragment that has a truncated read or is lost from the library prep. Are you working in a system with a low or defined amount of polymorphism (like a RIL, which is what GBS was designed for)?

**luc** · 11-09-2013, 05:38 PM

Hi SNPsaurus,

I found your explanations very helpful. Just wanted to say thanks!

**odoyle81** · 11-10-2013, 09:40 PM

Good points about overlapping sites, and sites on different lines in the fasta files..
Fuzznuc does in fact report 2 cut sites for your second example.
I suppose I could write a little python script to deal with all this.. but also good point about 10% variation not such a big deal... so maybe these estimates are close enough!
Yes, it is a rice RIL population.. and the parents are both Indica (a clade within rice), so probably pretty low polymorphism. In regards to your earlier question about variation between libraries.. I suppose I am a little ignorant about library construction since I'm not the one doing it, but they are barcoding each line (~300 RIL lines) and plan to run them on Illumina* but I'm not sure if that is called one library or one library for each barcode?
Thanks for your help and detailed explanation and examples! I think I can work with this

* Actually, the full story is that we already ran one lane, but I don't think we have enough coverage because we didn't recover alot of SNPs.. so I'm thinking we need more coverage... hence the reason for trying to estimate number of fragments etc..

**SNPsaurus** · 11-11-2013, 10:23 AM

The convention, for the sequencers around me at least, is to call whatever gets amplified at the final step a library. So if they multiplexed the barcoded samples into a pool, and then amplified the pool, I'd call that one library. But it does become a gray area when maybe 90% of the steps are acting upon the individual sample.

Typically with GBS the libraries are way undersequenced, to pick up single reads for a fraction of the total tags possible. If the starting genomes and polymorphisms are known, then the missing data are inferred. This is why it is efficient for RIL genotyping. To get to a higher read depth takes lots of sequencing since there are so many tags, in fact many more than needed for RILs, since a RIL block is large (defined by meiotic recombination). And if you were to sequence to higher read depths then the problems with reproducibility between libraries and missing data from polymorphisms in the almost-cut sites become a problem.

If you didn't recover enough SNPs, is it because the SNP callers are expecting high read depth? With GBS you have to tell the SNP callers that a single read is enough, and to expect sparse data across the population. The results confound the typical pipeline.

**odoyle81** · 11-11-2013, 11:16 AM

We tried with TASSEL and stacks and they both default to a min read depth of 1... but that was a good thought

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 49 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

in silico RAD seq double digest fragment calculation

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News