Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • in silico RAD seq double digest fragment calculation

    Hi,
    How do I determine # of fragments created from a single (or more importantly) double digest of a reference genome? Is there a tool for this? The fuzznuc tool from EMBOSS is close, but only tells me the cut sites, not fragments created.

    Thanks!
    Last edited by odoyle81; 11-05-2013, 05:05 PM. Reason: typo

  • #2
    This is a pretty bad hack, but will give you a ballpark:

    grep -A6 CCTGCAGG neurospora_crassa.fasta | grep -c -E "CCGG|GGCC"

    This finds SbfI in the forward direction, then looks in the next 360 nt (if the fasta file is 60 nt per line) for MspI in either direction. I'd double that number that results to get SbfI on the reverse strand, and then reduce it by how tight a fragment size range you are selecting. Are you using a Pippin Prep to make the size selection?
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

    Comment


    • #3
      Cool, this seems to work!
      How do I adapt this for a single enzyme?

      I think they do a gel for fragment selection but I'm not sure.

      for example, it doesn't seem to work for ApeKI.. I get the same number whether I search for ApeKI once or twice within 8 lines. Also, it doesn't agree with fuzznuc results:

      Code:
      grep -A8 -c -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta                                                                     
      662343
      
      grep -A8 -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta | grep -c -E 'GCAGC|GCTGC'                                            
      662343
      fuzznuc results (not searching complement):
      Code:
      ApeKI
      GCWGC
      
      #---------------------------------------
      # Total_sequences: 12
      # Total_length: 373245519
      # Reported_sequences: 12
      # Reported_hitcount: 857873
      #---------------------------------------
      Last edited by odoyle81; 11-06-2013, 01:28 PM.

      Comment


      • #4
        Does fuzznuc just report on the number of sites total in the genome? If so, it would always be much higher than looking for a combination of sites that occur within a specific distance.

        The single enzyme count is even more hack-y. In the original command, grep prints the line with the first site as well as 6 or 8 following lines. Those lines get sent to grep to look for the second site. If the second site is the same as the first, it will always find that site in the sequence sent to it, since it would be in the first line!

        This is a better solution anyway:
        cat neurospora_crassa.fasta | tr -d "\n" | grep -o -E "CCGG.{200,400}CCGG" | wc -l

        It removes all the newlines from the file, then searches for the pattern, allows 200 to 400 any letters, then the pattern again, and counts those. I had made a mistake first time around searching for CCGG and GGCC... the reverse complement of CCGG is CCGG of course! To do the two-enzyme search just replace the first site with the SbfI site or whatever:
        cat neurospora_crassa.fasta | tr -d "\n" | grep -o -E "CCTGCAGG.{200,400}CCGG" | wc -l
        and then swap the sites to find fragments cutting with the frequent-cutter on the left.

        I'm not sure how memory will hold up if you have a large genome but it worked with zebrafish so should be ok.

        I hope they aren't doing multiple libraries and a tight size range using gel size selection. The ddRAD fixed lengths of markers means that if you select 10,000 loci from 250-350 bp, and the next library is 260-360, you will have ~20% missing data between the two libraries. If all the samples are in one library it won't be a problem.
        Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

        Comment


        • #5
          Thanks for your replies!

          Does fuzznuc just report on the number of sites total in the genome? If so, it would always be much higher than looking for a combination of sites that occur within a specific distance.
          It reports the number of sites.. but my point is that this grep line should simply find all ApeKI sites.. and should match what fuzznuc is giving me.. correct?

          Code:
          grep -c -E 'GCAGC|GCTGC' MSU-7-nb-all.chrs.fasta                                                                   
          662343
          Code:
          >fuzznuc MSU-7-nb-all.chrs.fasta
          Search for patterns in nucleotide sequences
          Search pattern: GCWGC
          Output report [chr1.fuzznuc]: output
          >tail output 
          
          #---------------------------------------
          #---------------------------------------
          
          #---------------------------------------
          # Total_sequences: 12
          # Total_length: 373245519
          # Reported_sequences: 12
          # Reported_hitcount: 857873
          #---------------------------------------
          But fuzznuc is reporting 857873 while the grep code is reporting 662343

          Trying the new code you proposed gives a different number:

          Code:
          cat MSU-7-nb-all.chrs.fasta | tr -d "\n" | grep -o -E "GC[AT]GC" | wc -l                                           
          788020
          I think I need to understand why these are giving different results before I move forward with this. Any ideas?

          Thanks!

          Comment


          • #6
            There is probably a trivial explanation. grep works on a line-by-line basis, so it doesn't find sites that are partly on one line and part on the next.

            So with test.txt:
            ATATAATATATGC
            GCATATA

            $ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
            1
            $ grep -c GCGC test.txt
            0

            So that is why the simple grep underestimates compared to the one that puts all the lines into a single line. The other aspect of the complex grep is that counts multiple sites per line:
            ATATAATATAT
            GCGCATATAGCGC

            $ grep -c GCGC test.txt
            1
            $ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
            2

            But, the complex grep fails in specific situation, where sites overlap:
            ATATAATATAT
            GCGCGCATATA

            $ cat test.txt | tr -d "\n" | grep -o -E "GCGC" | wc -l
            1

            I assume fuzznuc would report '2' for that last test. But in practice, overlapping sites would not produce 2 separate GBS fragments (probably one is preferred as a site and would get cut first, destroying the second). So fuzznuc may overestimate in this case. Hopefully your project can tolerate a 10% difference in site # anyway.

            GBS is particularly sensitive to polymorphism in the cut site, since it uses a frequent cutter. A 300 bp fragment will have 10-30 sites that are one polymorphism from creating the frequent cutter, which if made will make a shorter fragment that has a truncated read or is lost from the library prep. Are you working in a system with a low or defined amount of polymorphism (like a RIL, which is what GBS was designed for)?
            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

            Comment


            • #7
              Hi SNPsaurus,

              I found your explanations very helpful. Just wanted to say thanks!

              Comment


              • #8
                Good points about overlapping sites, and sites on different lines in the fasta files..
                Fuzznuc does in fact report 2 cut sites for your second example.
                I suppose I could write a little python script to deal with all this.. but also good point about 10% variation not such a big deal... so maybe these estimates are close enough!
                Yes, it is a rice RIL population.. and the parents are both Indica (a clade within rice), so probably pretty low polymorphism. In regards to your earlier question about variation between libraries.. I suppose I am a little ignorant about library construction since I'm not the one doing it, but they are barcoding each line (~300 RIL lines) and plan to run them on Illumina* but I'm not sure if that is called one library or one library for each barcode?
                Thanks for your help and detailed explanation and examples! I think I can work with this

                * Actually, the full story is that we already ran one lane, but I don't think we have enough coverage because we didn't recover alot of SNPs.. so I'm thinking we need more coverage... hence the reason for trying to estimate number of fragments etc..
                Last edited by odoyle81; 11-10-2013, 09:47 PM.

                Comment


                • #9
                  The convention, for the sequencers around me at least, is to call whatever gets amplified at the final step a library. So if they multiplexed the barcoded samples into a pool, and then amplified the pool, I'd call that one library. But it does become a gray area when maybe 90% of the steps are acting upon the individual sample.

                  Typically with GBS the libraries are way undersequenced, to pick up single reads for a fraction of the total tags possible. If the starting genomes and polymorphisms are known, then the missing data are inferred. This is why it is efficient for RIL genotyping. To get to a higher read depth takes lots of sequencing since there are so many tags, in fact many more than needed for RILs, since a RIL block is large (defined by meiotic recombination). And if you were to sequence to higher read depths then the problems with reproducibility between libraries and missing data from polymorphisms in the almost-cut sites become a problem.

                  If you didn't recover enough SNPs, is it because the SNP callers are expecting high read depth? With GBS you have to tell the SNP callers that a single read is enough, and to expect sparse data across the population. The results confound the typical pipeline.
                  Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

                  Comment


                  • #10
                    We tried with TASSEL and stacks and they both default to a min read depth of 1... but that was a good thought

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    18 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    22 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    17 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    49 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X