  • Scaffolding problem

    Hi,

    I have scaffolds of different sizes (1 to 200 kb) from a previous assembly.

    Now there is some new paired-end data with 3 kb and 5 kb inserts. I want to add these paired-end reads onto the scaffolds.

    I tried SOAPdenovo, but it only takes paired-end reads or single reads. The same goes for Velvet and ABySS.

    The data is too big for CAP3.

    What other programs are able to handle this kind of data?

    Thanks

  • #2
    Velvet will take long reads and short paired reads in the same assembly. It's described in the current manual pg. 8, "Adding long reads".


    • #3
      Hi,

      do you want to scaffold the previous scaffold, or do you want to extend the previous scaffolds?

      Anyway, maybe you can try out SSPACE for this purpose, see this thread;

      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


      Kind regards,
      Boetsie


      • #4
        Thanks.

        Yes, I guess I will be extending the previous scaffolds.

        The problem with using SSPACE is that it does not allow N's in the input contig file.

        The scaffolds I have come with varying insert sizes. Should I break each of them into paired-end reads and use them as separate libraries in SSPACE?

        Is Velvet unable to handle long reads of more than 20 kb?


        • #5
          Hi,

          You say:
          "The problem with using SSPACE is that it does not allow N's in the input contig file."
          while the SSPACE manual says:
          "Contigs having a non-ACGT character like “.” or “N” are not discarded. They are used for extension, mapping and building scaffolds. However, contigs having such character at either end of the sequence, could fail for proper contig extension."
          So they can be used for extension; only when the N's are at the end of a sequence is SSPACE unable to map reads there.
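The manual passage quoted above suggests checking contig ends before running the scaffolder. A minimal Python sketch of such a check follows; the function name and the `window` parameter are hypothetical, and how many end bases actually matter to SSPACE is not stated in this thread.

```python
def has_ambiguous_end(seq: str, window: int = 1) -> bool:
    """Return True if a non-ACGT character lies within `window` bases of either end.

    `window` is an illustrative assumption, not an SSPACE setting.
    """
    valid = set("ACGT")
    seq = seq.upper()
    ends = seq[:window] + seq[-window:]
    return any(base not in valid for base in ends)


# Internal N's are fine for this check; N's at an end are flagged.
print(has_ambiguous_end("AGCTNNNNCGAT"))   # N's only in the middle
print(has_ambiguous_end("NNAGCTAGCT"))     # N at the 5' end
```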

          I don't know about Velvet. I do know that SSAKE (which has basically the same procedure as SSPACE) can also use contigs as 'seeds' and extend them with additional reads. The difference is that SSPACE first maps the reads to the pre-assembled contigs and only uses the unmapped reads for contig/scaffold extension. SSAKE does not include mapping.

          Kind regards,
          Boetsie


          • #6
            SSPACE: no improvement in N50 or contig size

            Hi Boetsie,
            I can't find any improvement before and after scaffolding. Am I doing something wrong? Thanks

            -x = 0
            -k = 5
            -a = 0.7
            -n = 15
            -p = 0

            ==================================

            Number of single reads found on contigs = 84724494
            Number of pairs found with pairing contigs / total pairs = 47882393 / 48019708
            ------------------------------------------------------------

            READ PAIRS STATS:
            ------------------------------------------------------------
            At least one sequence/pair missing from contigs: 137314
            Assembled pairs: 47882393 (95764786 sequences)
            Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 2500 +/-1750): 22
            Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 11
            Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 81
            ---
            Satisfied in distance/logic within a given contig pair (pre-scaffold): 26534237
            Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 21348042
            ---
            Total satisfied: 26534259 unsatisfied: 21348134

            ------------------------------------------------------------

            ################################################################################

            SUMMARY:
            ------------------------------------------------------------
            Inserted contig file;
            Total number of contigs = 1060008
            Sum (bp) = 2114313317
            Max contig size = 56175
            Min contig size = 200
            Average contig size = 1988
            N50 = 3918

            After scaffolding MP1:
            Total number of scaffolds = 1060008
            Sum (bp) = 2114313317
            Max scaffold size = 56175
            Min scaffold size = 200
            Average scaffold size = 1988
            N50 = 3918
            Regards
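For readers comparing the two summaries above: N50 is the contig length at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total. A minimal Python sketch (the function name is illustrative; this is not SSPACE's own code):

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy lengths, not data from this thread.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Identical contig counts, sums and N50 before and after scaffolding, as in the log above, mean that no contigs were joined at all.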


            • #7
              longer reads

              Thanks for the clarification Boetsie,

              Bowtie can only handle reads of at most 1024 bp. What does SSPACE do with reads that are longer than that?

              I am interested in merging scaffolds, that is, merging 2 sequences that look like the ones below (SSPACE does not use reads with N's in the paired-end files, am I correct?):

              AGCTAGCTAGCTNNNNNNNNNCGATCGATGCNNNNNNNCGATCGATCGATCGNNNNCAGCTAGT


              ANNNNNTAGCTACGATCGATCGNNNNNNNNNGATGCACGTACGATNNCGATNNNNNNNNNNNCAGCTAGT


              • #8
                Originally posted by Ashu
                Hi Boetsie,
                I can't find any improvement before and after scaffolding. Am I doing something wrong? Thanks
                Hi Ashu,

                I'm pretty sure the orientation in your library file is reversed. Are you using paired-end (--> <-- direction) or mate pair (<-- --> direction) reads? If you use paired-end reads, your library should look something like this:

                libname file1.fasta file2.fasta 700 0.25 0

                With the last column containing a 0. For mate pairs, the last column should contain a 1;

                libname file1.fasta file2.fasta 700 0.25 1

                I think this should do it.

                Boetsie
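The six columns of the library lines shown above can be read as name, two read files, insert size, allowed error, and orientation. A small Python sketch that parses one such line under that reading; the `Library` class and `parse_library_line` are hypothetical helpers, not part of SSPACE.

```python
from dataclasses import dataclass


@dataclass
class Library:
    name: str
    file1: str
    file2: str
    insert_size: int
    error: float
    orientation: int  # 0 = paired-end (--> <--), 1 = mate pair (<-- -->), per the post above


def parse_library_line(line: str) -> Library:
    """Split one whitespace-separated library line into its six columns."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    name, file1, file2, insert_size, error, orientation = fields
    if orientation not in ("0", "1"):
        raise ValueError("last column must be 0 (paired-end) or 1 (mate pair)")
    return Library(name, file1, file2, int(insert_size), float(error), int(orientation))


lib = parse_library_line("libname file1.fasta file2.fasta 700 0.25 1")
print("mate pair" if lib.orientation == 1 else "paired-end")
```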


                • #9
                  Originally posted by Autotroph
                  Thanks for the clarification Boetsie,

                  Bowtie can handle only reads that are a maximum of 1024 BP long. What does SSPACE do for reads that are longer than that?
                  Unfortunately, SSPACE cannot handle sequences longer than 1024 bp. They are simply not used for mapping.

                  I am interested in merging scaffolds, that is merging 2 sequences that look like below(SSPACE does not use reads with N's in the paired end files, am i correct?)
                  Indeed SSPACE does not allow reads with N's in the paired-end files.
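Taken together, the two answers above give a simple rule for which reads can take part in mapping: at most 1024 bp and no N's. A hedged Python sketch of that rule as a pre-filter; the function and constant names are illustrative, and the filter only reflects the limits stated in this thread.

```python
MAX_READ_LEN = 1024  # limit quoted in this thread for Bowtie/SSPACE


def usable_for_mapping(read: str) -> bool:
    """True if the read is non-empty, at most MAX_READ_LEN long, and contains only A, C, G, T."""
    read = read.upper()
    return 0 < len(read) <= MAX_READ_LEN and set(read) <= set("ACGT")


reads = ["ACGTACGT", "ACGTNNNNACGT", "A" * 2000]
print([usable_for_mapping(r) for r in reads])
```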

                  I think you should consider another program for this, since you mention that you want to merge scaffolds instead of extending them. You could try something like an alignment program if you want to merge 2 scaffolds. Maybe you can do something like what Ken Kraaijeveld describes (http://www.kenkraaijeveld.nl/genomics/bioinformatics/); see the "combining contigs" section.

                  Boetsie


                  • #10
                    Unfortunately, Minimus can be used to merge contigs only, not scaffolds. Bambus is able to merge scaffolds but does not allow N's in the input.

                    It might be possible for me to use Minimus and SSPACE in some combination to merge the scaffolds.

                    Could you please look at the example below and let me know why SSPACE does not merge the 2 "contigs"?

                    --------------------_________________--------------------------
                    read1 read2(rev-comped) (common anchor sequence)

                    Contigs.fa:

                    >contig1
                    AGCTACTAGCTGCTACTAGCTCAGATGCATCGATCGACGATCTGATCGGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG
                    >contig2
                    TGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCGCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG

                    read1.fa

                    >read1
                    AGCTACTAGCTGCTACTAGCTCAGATGCATCGATCGACGATCTGATCGGC

                    read2.fa (the first 50 bases of contig2, reverse complemented)

                    >read2
                    CGCTAGCTAGTAGCTAGTAGTAGCTAGCTAGCTCGTAGCTAGCTGACACA

                    lib file:

                    lib1 read1.fa read2.fa 100 0.7 0

                    command:

                    perl SSPACE_v1-1.pl -l lib -s contigs.fa -k 1 -a 0.7 -x 1 -o 1 -b merger

                    This gives me 2 scaffolds instead of the 1 scaffold that I am expecting. When the length of the anchor sequence is reduced, it gives a single scaffold with an "n" placed between the 2 scaffolds.

                    Surprisingly, if the same information is given in the form of a set of 2 mate pairs, the 2 scaffolds are merged. My guess is that SSPACE does not treat the initial set of N's in the same way as the N's it adds in the intermediate steps.
                    Last edited by Autotroph; 03-08-2011, 03:03 AM. Reason: additional information
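The test case above depends on read2 being the reverse complement of the first 50 bases of contig2. A short Python sketch that checks this, with both sequences copied from the post; the helper name is illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# First 50 bases of contig2 and the read2 sequence, as posted above.
CONTIG2_HEAD = "TGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCG"
READ2 = "CGCTAGCTAGTAGCTAGTAGTAGCTAGCTAGCTCGTAGCTAGCTGACACA"

print(reverse_complement(CONTIG2_HEAD) == READ2)
```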


                    • #11
                      Hi Boetsie,
                      Thank you for the information.
                      I have a mate-pair library, with the distance estimated by Bioanalyzer.
                      My library file looks as follows:

                      MP1 /G1/2_5kb/s_a_sequence_1.fastq /G1/2_5kb/s_a_sequence_2.fastq 2500 0.7 1
                      MP1 /G1/2_5kb/s_b_sequence_1.fastq /G1/2_5kb/s_b_sequence_2.fastq 2500 0.7 1
                      MP1 /G2/2_5kb/s_a_sequence_1.fastq /G2/2_5kb/s_a_sequence_2.fastq 2500 0.7 1
                      MP1 /G2/2_5kb/s_b_sequence_1.fastq /G2/2_5kb/s_b_sequence_2.fastq 2500 0.7 1
                      MP1 /G2/2_5kb/s_c_sequence_1.fastq /G2/2_5kb/s_c_sequence_2.fastq 2500 0.7 1
                      MP1 /G2/2_5kb/s_d_sequence_1.fastq /G2/2_5kb/s_d_sequence_2.fastq 2500 0.7 1

                      I will try it in paired-end form (0), but I can't imagine why it would turn out to be paired-end rather than mate pair. In the pairing issue file, I also see a lot of distance problems. Is there a way to plot this in a graph?
                      Thank you again for your kind reply,
                      regards,
                      Ashu
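The log line "distance on target: 2500 +/-1750" in the earlier report matches an insert size of 2500 with an error of 0.7, which suggests the accepted window is insert ± insert × error. A Python sketch under that assumption; the function names are illustrative, and the formula is inferred from the log, not taken from the SSPACE source.

```python
def distance_window(insert_size: int, error: float) -> tuple[int, int]:
    """Accepted (min, max) pair distance, assuming the window is insert +/- insert * error."""
    delta = round(insert_size * error)
    return insert_size - delta, insert_size + delta


def in_window(distance: int, insert_size: int, error: float) -> bool:
    """True if an observed pair distance falls inside the accepted window."""
    low, high = distance_window(insert_size, error)
    return low <= distance <= high


print(distance_window(2500, 0.7))
```

With the numbers in the report, 26534259 of 47882393 assembled pairs are satisfied, which is a bit over half, so a large share of pairs falls outside this window or has the wrong orientation.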


                      • #12
                        Originally posted by Autotroph
                        Could you please look at below example and let me know why SSPACE does not merge the 2 "contigs"?
                        Hi Autotroph,

                        I've had a look at it, and I think I know why it did not merge. You should increase the insert size in your library file. SSPACE includes the read lengths in the determination of the gap/overlap. With a 100 bp insert size, the pair did not satisfy the minimum allowed distance.

                        Both of your reads are 50 bp long. So increasing the insert size in your library by 100 (2*50 bp for your reads) should do it, thus:

                        lib1 read1.fa read2.fa 200 0.7 0

                        If you need a more detailed description, please let me know

                        Kind regards,
                        Boetsie


                        • #13
                          The point of giving an insert size of 100 (50+50) is to not have any gaps in the final scaffold. I understood that the two reads could even overlap if an insert size of less than 100 is given for 2*50 bp reads.

                          The actual sequence expected (without any gaps) would be:

                          "AGCTACTAGCTGCTACTAGCTCAGATGCATCGATCGACGATCTGATCGGCTGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCGCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG"

                          I even tried 200 as the insert size, but it fails to merge the contigs "correctly".

                          The output is given below:

                          >scaffold1.1|size269
                          AGCTACTAGCTGCTACTAGCTCAGATGCATCGATCGACGATCTGATCGGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCGnCGATCGACGATCTGATCGGCTGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCGCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG

                          Does this mean that the two reads of a PE pair must have a gap between them?

                          Why is "TGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCG" not replacing the N's, even though it has an overlap and there is a PE read connecting the two 'contigs'?


                          • #14
                            Hi Autotroph,

                            Sorry, but I think it's simply not possible to merge them with SSPACE the way you are trying to. SSPACE only looks at the ends of the contigs for an overlap, while you are trying to turn the "N" characters into DNA characters by merging.

                            SSPACE does this;
                            CATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATC
                            .............................
                            GCTACGATCGATCAGTAGTAGATAGATAGATGATAG

                            While you try to find a certain overlap, and determine the rest of the sequence;

                            NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG


                            TGTGTCAGCTAGCTACGAGCTAGCTAGCTACTACTAGCTACTAGCTAGCGCATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATCGTCATCG
                            .......

                            As said, I think what you want to do is not possible with SSPACE. Maybe you can first do a gap closure on the scaffolds (e.g. with SOAP's gapclosure method) so the N's will be removed from your data.
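The end-overlap case in the first diagram above (the suffix of one contig matching the prefix of the next) can be sketched in a few lines of Python. This is only an illustration of the idea using the two example sequences from this post; it is not SSPACE's implementation, and the function name and `min_len` parameter are hypothetical.

```python
def longest_end_overlap(left: str, right: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right` (0 if none >= min_len)."""
    max_len = min(len(left), len(right))
    for k in range(max_len, max(min_len, 1) - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


# The two sequences from the first diagram above.
LEFT = "CATCGTACTACGTATCTGATAGCTAGCTAGCTACGATCGATC"
RIGHT = "GCTACGATCGATCAGTAGTAGATAGATAGATGATAG"

k = longest_end_overlap(LEFT, RIGHT)
print(k, LEFT + RIGHT[k:])
```

An overlap that sits inside a contig next to a run of N's, as in the second diagram, is never seen by a check like this, because only the contig ends are compared.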

                            Boetsie


                            • #15
                              Hi boetsie,

                              Thanks a lot for the patient explanation.
