Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Questions about soapaligner and soapsnp

    I am using soapaligner+soapsnp to find SNVs in our pair-end DNA data. I read the online manual, and the tools have been running smoothly.

    However, there are three things I am not sure about.

    First, how should I request soapaligner to report repetitive results for SNV discovery? Should it report none, random or all best-hit of the repetitive alignments of a certain read?

    Second, I understand the input for soapsnp should be sorted according to the chromosomes and coordinates. The results from soapaligner is organized in the way that each line corresponds to one end of a pair of reads. After the soapaligner output is sorted, the lines of the two ends of the same read would appear far away from each other, with the alignments of many other reads between them. Is this the correct method to sort the soapaligner result?

    Third, the minimal insert size of soapaligner is 400 while the insert size of our data is only 35bp, and so the two ends of a pair of reads could overlap each other. If there is a SNV within the overlapping part of the two ends
    of a pair-end read, and the two ends may or may not have the same base, quality score on the SNV, how would the program handle this situation? I read the paper of soapsnp, but I couldn't get a clue.

    Thank you for your help.

    Update,

    The read length is 35bp in our data.
    Last edited by superligang; 01-07-2011, 11:56 AM.

  • #2
    Hi,

    I can't help you regarding your first and third question, however this is how we handle question 2...

    Split the soap output file into one file for each scaffold like so:

    awk '{ print $0 >> $8".soap" }' alignment

    (in this example the soap alignment file is titled 'alignment' and you'll wind up with a file for each scaffold such as scaffold_1.soap).

    Then you could run this shell script (requires tcsh I believe)

    foreach file (`ls -1 *.soap`)
    sort -k 9 -n $file > $file.sorted
    end

    Or this perl script...
    (no promises that this is the simplest most concisely written piece of code in the world but I know it works)

    ----------------------------
    #!/usr/bin/perl
    use strict;

    # Sort SOAP alignment output files by position - 9th column
    # lines like this:
    # 4_100_10086_4044\1 TAACGGTAATTTTTTTTAAAAAACAAAATCATATTTTCCT ffddcffff`dfffffefffffffffffafdeffcdffff 1 a 40 - scaffold_9 1593868 0 40M 40

    print STDERR "start at: ".localtime()."\n";

    my $h = {};
    my $f = $ARGV[0];
    open(F,$f) or die "can't open input file: $f $!\n";
    while (<F>){
    my @l = split(/\t/,$_);
    $h->{$l[0]}->{'pos'} = $l[8];
    $h->{$l[0]}->{'line'} = $_;;
    }
    close F;
    print STDERR scalar(keys %{$h})." lines read from $f\n";

    foreach my $id (sort { $h->{$a}->{'pos'} <=> $h->{$b}->{'pos'} } keys %{$h}){
    print $h->{$id}->{'line'};
    }

    print STDERR "sort_soap.pl completed sorting $f successfully at ".localtime()."\n";

    ---------------------------

    Larry Wilhelm
    Bioinformatics
    Oregon State University

    Comment


    • #3
      Interesting, but I don't quite understand what scaffold means in this context.
      I hope that soapsnp handles the overlapping part of the two ends of a pair of read reasonably.
      Thank you.
      Last edited by superligang; 01-07-2011, 11:36 AM.

      Comment


      • #4
        'Scaffold' doesn't mean too much in this context actually. It just happens to be how the sequences to which I aligned my reads were named.

        After reading your original post more carefully I realize I'm not really addressing your issue as this was all done with single end reads. I suspect it should still work though.

        Your situation with a 35bp insert and read lengths of apparently > 35bp is not something I've ever confronted before. Is this really the case? I don't quite understand why a library would be constructed like this.

        Larry W.

        Comment


        • #5
          Originally posted by wilhelml View Post
          Your situation with a 35bp insert and read lengths of apparently > 35bp is not something I've ever confronted before. Is this really the case? I don't quite understand why a library would be constructed like this.
          Larry W.
          I don't know if I understand the insert size correctly, but I use insert size as the distance between the two farthest bases of the two ends of a pair of reads.
          For read of 35bp long, the minimal insert size could be 35 when the two reads of a pair overlap completely. The insert size is larger than 35bp if there is a gap between the two reads.
          Thanks.

          Comment


          • #6
            I don't think that is correct actually, regarding insert size.

            The DNA is fragmented and then adapters are ligated. The size of the fragment to which the adapters are ligated is your total size. So if you cut 400Kb fragments from a gel and attach adapters to that, then sequence 80mers, you'll have an actual insert size of 400-(80*2)=240. Though, Soap wants the insert size as the total size so you'd use the 400 number.

            So the question becomes: To what size of DNA fragment were the adapters ligated? And.. What is the length that was sequenced?

            Also, what technology was used for sequencing?

            This article may shed light...


            Disclaimer: I've never actually made libraries, I work entirely on the software side of things, just trying to help a little.

            Comment


            • #7
              oops... I didn't mean 400Kb fragments, I meant just 400nt.

              And I guess I can assume your read length is 35.

              Comment


              • #8
                I don't run experiments in biology either
                I downloaded this DNA data, and I guess probably RNA-seq is different from DNA-seq. When I aligned some RNA-seq data, the two ends of a pair of read frequently overlap with each other.
                soapsnp uses 400 as the default minimal insert size, and so the overlapping part is not a problem in their DNA data.
                Thank you for your clarification.
                Last edited by superligang; 01-07-2011, 01:27 PM.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                30 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                32 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X