SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
in silico restriction digest of Illumina reads (fastq file) berthold Bioinformatics 2 05-13-2018 11:54 PM
RNA-seq: rRNA Depletion, Double riboZero or Double RiboMinus cvhove Sample Prep / Library Generation 5 08-22-2014 04:18 AM
How to calculate RAD-Seq digestion sites? fanwei Bioinformatics 12 08-27-2013 01:12 AM
How to calculate RAD-Seq digestion sites? fanwei Illumina/Solexa 0 08-19-2013 06:30 PM

Reply
 
Thread Tools
Old 11-05-2013, 04:01 PM   #1
odoyle81
Member
 
Location: United States

Join Date: Aug 2011
Posts: 31
Default in silico RAD seq double digest fragment calculation

Hi,
How do I determine # of fragments created from a single (or more importantly) double digest of a reference genome? Is there a tool for this? The fuzznuc tool from EMBOSS is close, but only tells me the cut sites, not fragments created.

Thanks!

Last edited by odoyle81; 11-05-2013 at 04:05 PM. Reason: typo
odoyle81 is offline   Reply With Quote
Old 11-06-2013, 10:54 AM   #2
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 523
Default

This is a pretty bad hack, but will give you a ballpark:

grep -A6 CCTGCAGG neurospora_crassa.fasta | grep -c -E "CCGG|GGCC"

This finds SbfI in the forward direction, then looks in the next 360 nt (if the fasta file is 60 nt per line) for MspI in either direction. I'd double that number that results to get SbfI on the reverse strand, and then reduce it by how tight a fragment size range you are selecting. Are you using a Pippin Prep to make the size selection?
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
SNPsaurus is offline   Reply With Quote
Old 11-06-2013, 12:12 PM   #3
odoyle81
Member
 
Location: United States

Join Date: Aug 2011
Posts: 31
Default

Cool, this seems to work!
How do I adapt this for a single enzyme?

I think they do a gel for fragment selection but I'm not sure.

for example, it doesn't seem to work for ApeKI.. I get the same number whether I search for ApeKI once or twice within 8 lines. Also, it doesn't agree with fuzznuc results:

Code:
grep -A8 -c -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta                                                                     
662343

grep -A8 -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta | grep -c -E 'GCAGC|GCTGC'                                            
662343
fuzznuc results (not searching complement):
Code:
ApeKI
GCWGC

#---------------------------------------
# Total_sequences: 12
# Total_length: 373245519
# Reported_sequences: 12
# Reported_hitcount: 857873
#---------------------------------------

Last edited by odoyle81; 11-06-2013 at 12:28 PM.
odoyle81 is offline   Reply With Quote
Old 11-06-2013, 08:31 PM   #4
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 523
Default

Does fuzznuc just report on the number of sites total in the genome? If so, it would always be much higher than looking for a combination of sites that occur within a specific distance.

The single enzyme count is even more hack-y. In the original command, grep prints the line with the first site as well as 6 or 8 following lines. Those lines get sent to grep to look for the second site. If the second site is the same as the first, it will always find that site in the sequence sent to it, since it would be in the first line!

This is a better solution anyway:
cat neurospora_crassa.fasta | tr -d "\n" | grep -o -E "CCGG.{200,400}CCGG" | wc -l

It removes all the newlines from the file, then searches for the pattern, allows 200 to 400 any letters, then the pattern again, and counts those. I had made a mistake first time around searching for CCGG and GGCC... the reverse complement of CCGG is CCGG of course! To do the two-enzyme search just replace the first site with the SbfI site or whatever:
cat neurospora_crassa.fasta | tr -d "\n" | grep -o -E "CCTGCAGG.{200,400}CCGG" | wc -l
and then swap the sites to find fragments cutting with the frequent-cutter on the left.

I'm not sure how memory will hold up if you have a large genome but it worked with zebrafish so should be ok.

I hope they aren't doing multiple libraries and a tight size range using gel size selection. The ddRAD fixed lengths of markers means that if you select 10,000 loci from 250-350 bp, and the next library is 260-360, you will have ~20% missing data between the two libraries. If all the samples are in one library it won't be a problem.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
SNPsaurus is offline   Reply With Quote
Old 11-09-2013, 01:46 PM   #5
odoyle81
Member
 
Location: United States

Join Date: Aug 2011
Posts: 31
Default

Thanks for your replies!

Quote:
Does fuzznuc just report on the number of sites total in the genome? If so, it would always be much higher than looking for a combination of sites that occur within a specific distance.
It reports the number of sites.. but my point is that this grep line should simply find all ApeKI sites.. and should match what fuzznuc is giving me.. correct?

Code:
grep -c -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta                                                                   
662343
Code:
>fuzznuc MSU-7-nb-all.chrs.fasta
Search for patterns in nucleotide sequences
Search pattern: GCWGC
Output report [chr1.fuzznuc]: output
>tail output 

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 12
# Total_length: 373245519
# Reported_sequences: 12
# Reported_hitcount: 857873
#---------------------------------------
But fuzznuc is reporting 857873 while the grep code is reporting 662343

Trying the new code you proposed gives a different number:

Code:
cat MSU-7-nb-all.chrs.fasta | tr -d "\n" | grep -o -E "GC[AT]GC" | wc -l                                           
788020
I think I need to understand why these are giving different results before I move forward with this. Any ideas?

Thanks!
odoyle81 is offline   Reply With Quote
Old 11-09-2013, 04:09 PM   #6
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 523
Default

There is probably a trivial explanation. grep works on a line-by-line basis, so it doesn't find sites that are partly on one line and part on the next.

So with test.txt:
ATATAATATATGC
GCATATA

$ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
1
$ grep -c GCGC test.txt
0

So that is why the simple grep underestimates compared to the one that puts all the lines into a single line. The other aspect of the complex grep is that counts multiple sites per line:
ATATAATATAT
GCGCATATAGCGC

$ grep -c GCGC test.txt
1
$ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
2

But, the complex grep fails in specific situation, where sites overlap:
ATATAATATAT
GCGCGCATATA

$ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
1

I assume fuzznuc would report '2' for that last test. But in practice, overlapping sites would not produce 2 separate GBS fragments (probably one is preferred as a site and would get cut first, destroying the second). So fuzznuc may overestimate in this case. Hopefully your project can tolerate a 10% difference in site # anyway.

GBS is particularly sensitive to polymorphism in the cut site, since it uses a frequent cutter. A 300 bp fragment will have 10-30 sites that are one polymorphism from creating the frequent cutter, which if made will make a shorter fragment that has a truncated read or is lost from the library prep. Are you working in a system with a low or defined amount of polymorphism (like a RIL, which is what GBS was designed for)?
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
SNPsaurus is offline   Reply With Quote
Old 11-09-2013, 04:38 PM   #7
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 466
Default

Hi SNPsaurus,

I found your explanations very helpful. Just wanted to say thanks!
luc is offline   Reply With Quote
Old 11-10-2013, 08:40 PM   #8
odoyle81
Member
 
Location: United States

Join Date: Aug 2011
Posts: 31
Default

Good points about overlapping sites, and sites on different lines in the fasta files..
Fuzznuc does in fact report 2 cut sites for your second example.
I suppose I could write a little python script to deal with all this.. but also good point about 10% variation not such a big deal... so maybe these estimates are close enough!
Yes, it is a rice RIL population.. and the parents are both Indica (a clade within rice), so probably pretty low polymorphism. In regards to your earlier question about variation between libraries.. I suppose I am a little ignorant about library construction since I'm not the one doing it, but they are barcoding each line (~300 RIL lines) and plan to run them on Illumina* but I'm not sure if that is called one library or one library for each barcode?
Thanks for your help and detailed explanation and examples! I think I can work with this

* Actually, the full story is that we already ran one lane, but I don't think we have enough coverage because we didn't recover alot of SNPs.. so I'm thinking we need more coverage... hence the reason for trying to estimate number of fragments etc..

Last edited by odoyle81; 11-10-2013 at 08:47 PM.
odoyle81 is offline   Reply With Quote
Old 11-11-2013, 09:23 AM   #9
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 523
Default

The convention, for the sequencers around me at least, is to call whatever gets amplified at the final step a library. So if they multiplexed the barcoded samples into a pool, and then amplified the pool, I'd call that one library. But it does become a gray area when maybe 90% of the steps are acting upon the individual sample.

Typically with GBS the libraries are way undersequenced, to pick up single reads for a fraction of the total tags possible. If the starting genomes and polymorphisms are known, then the missing data are inferred. This is why it is efficient for RIL genotyping. To get to a higher read depth takes lots of sequencing since there are so many tags, in fact many more than needed for RILs, since a RIL block is large (defined by meiotic recombination). And if you were to sequence to higher read depths then the problems with reproducibility between libraries and missing data from polymorphisms in the almost-cut sites become a problem.

If you didn't recover enough SNPs, is it because the SNP callers are expecting high read depth? With GBS you have to tell the SNP callers that a single read is enough, and to expect sparse data across the population. The results confound the typical pipeline.
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
SNPsaurus is offline   Reply With Quote
Old 11-11-2013, 10:16 AM   #10
odoyle81
Member
 
Location: United States

Join Date: Aug 2011
Posts: 31
Default

We tried with TASSEL and stacks and they both default to a min read depth of 1... but that was a good thought
odoyle81 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:32 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO