  • bwa color-space index

    I would like to build a color-space index with bwa. The input FASTA should be in nucleotide space, so I used the following command to index the whole human genome:

    >bwa index -c human.fasta

    But a segmentation fault occurs every time, like this:

    [bwa_index] Pack nucleotide FASTA... 60.48 sec
    [bwa_index] Convert nucleotide PAC to color PAC... 31.13 sec
    [bwa_index] Reverse the packed sequence... 16.62 sec
    [bwa_index] Construct BWT for the packed sequence...
    Segmentation fault

    Can anyone tell me why that happens?

    thanks

  • #2
    Hi,

    It is probably because you did not use the -a bwtsw option. According to the manual (bwa.1), it is needed for the human genome:

    bwtsw Algorithm implemented in BWT-SW. This is the only method that works with the whole human genome. However, this module does not work with database smaller than 10MB and it is much slower than the other two. Bwtsw algorithm trades speed for memory.
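The manual excerpt above suggests a simple rule of thumb for choosing the `-a` argument of `bwa index`. Below is a minimal Python sketch of that rule. It assumes the 10MB lower bound quoted above, uses the FASTA file size as a rough proxy for database size, and falls back to the `is` algorithm for small references. The helper name `bwa_index_command` and the fallback choice are illustrative assumptions, not something stated in the manual excerpt.

```python
import os

# Lower bound quoted from the manual: bwtsw does not work with databases smaller than 10MB.
BWTSW_MIN_BYTES = 10 * 1000 * 1000


def bwa_index_command(fasta_path, fasta_bytes=None, color_space=False):
    """Build an argument list for `bwa index` (hypothetical helper).

    The FASTA file size is only a rough proxy for the database size.
    """
    if fasta_bytes is None:
        fasta_bytes = os.path.getsize(fasta_path)
    # Assumption: use bwtsw whenever the reference is large enough, else `is`.
    algorithm = "bwtsw" if fasta_bytes >= BWTSW_MIN_BYTES else "is"
    cmd = ["bwa", "index", "-a", algorithm]
    if color_space:
        cmd.append("-c")  # color-space flag as used in this thread
    cmd.append(fasta_path)
    return cmd
```

For a whole human genome FASTA (roughly 3 GB), `bwa_index_command("human.fasta", fasta_bytes=3_000_000_000, color_space=True)` yields the arguments for `bwa index -a bwtsw -c human.fasta`. The list can be passed to `subprocess.run` if bwa is installed.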

    • #3
      Hello, Chipper

      You are right, I am only able to index the whole human genome with bwtsw. I guess bwa might not be competitive for SOLiD color-space data.

      Thanks

      • #4
        Sorry, I don't follow. Why would it not be competitive for SOLiD data? It takes some time to build the index, but once you have the index it is really fast.

        • #5
          Sorry, I think my reply was misleading. What I meant was that bwa couldn't index the whole human genome in color space, because bwtsw is the only way to do so. I can use -c for smaller genomes, like chr1, etc.

          Since some pipelines use SOLiD data, I was thinking of generating the human genome index in color space and, as you mentioned, it is going to be fast once I have all those index files. So if I now want to align the color-space data to human.fasta, would I have to pick another aligner?

          thanks for your reply

          • #6
            Originally posted by totalnew
            Sorry, I think my reply was misleading. What I meant was that bwa couldn't index the whole human genome in color space, because bwtsw is the only way to do so. I can use -c for smaller genomes, like chr1, etc.

            Since some pipelines use SOLiD data, I was thinking of generating the human genome index in color space and, as you mentioned, it is going to be fast once I have all those index files. So if I now want to align the color-space data to human.fasta, would I have to pick another aligner?

            thanks for your reply
            I was able to index the entire human genome with BWA (bwtsw), so it is possible. I would certainly like to hear about your experiences with longer read lengths with SOLiD data (50 and 75bp) and BWA. I have not gotten it to run as fast as other methods, especially when I try to allow higher error tolerances (>10% color errors, and long indels).

            • #7
              Nils,

              I am testing it with a 50 bp dataset (23.6 M reads). As expected, aln without indels and with few mismatches is very fast: 2 MM was done after ~15 minutes with 4 threads. 4 MM is probably ~10x slower, but I like the option to allow more mismatches at the end (with a good seed), which should make it much faster. It would be nice to compare it to BFAST if I ever manage to build that index...

              Any ideas on how to set up an ideal aligner comparison test for SOLiD data?

              • #8
                Originally posted by Chipper
                Nils,

                I am testing it with a 50 bp dataset (23.6 M reads). As expected, aln without indels and with few mismatches is very fast: 2 MM was done after ~15 minutes with 4 threads. 4 MM is probably ~10x slower, but I like the option to allow more mismatches at the end (with a good seed), which should make it much faster. It would be nice to compare it to BFAST if I ever manage to build that index...

                Any ideas on how to set up an ideal aligner comparison test for SOLiD data?
                Creating some simulations is the best bet. I would create a dataset composed of sets of 10K reads, each with X SNPs, Y color errors, and a Z-base-long indel. You can then vary X, Y, and Z to see what power you really have to detect variants and how robust the aligner is to errors (a 10% color error rate is not unheard of). This is what I did with BFAST, which has its own synthetic read generator, when I compared it to other aligners.
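The simulation idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BFAST read generator: colors are derived with the usual SOLiD di-base rule (XOR of the 2-bit codes A=0, C=1, G=2, T=3), the indel is modeled as a deletion only, and the function names `to_color_space` and `simulate_read` are hypothetical.

```python
import random

BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def to_color_space(seq):
    """Encode nucleotides as colors: each color is the XOR of adjacent 2-bit base codes."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:]))


def simulate_read(reference, start, length, n_snps, n_color_errors, deletion_len, rng):
    """Return (read_bases, read_colors) for one simulated read.

    The caller must make sure `reference` extends at least
    `length + deletion_len` bases past `start`.
    """
    # Take extra reference bases so the read keeps its length after the deletion.
    bases = list(reference[start:start + length + deletion_len])
    if deletion_len:
        cut = rng.randrange(1, length - 1)
        del bases[cut:cut + deletion_len]
    # SNPs: substitute a different base at distinct positions.
    for pos in rng.sample(range(length), n_snps):
        bases[pos] = rng.choice([b for b in BASES if b != bases[pos]])
    colors = list(to_color_space("".join(bases)))
    # Color errors: replace a color with a different one at distinct positions.
    for pos in rng.sample(range(len(colors)), n_color_errors):
        colors[pos] = rng.choice([c for c in "0123" if c != colors[pos]])
    return "".join(bases), "".join(colors)
```

Calling `simulate_read(reference, 100, 50, 2, 5, 3, random.Random(0))` on a random reference gives one 50-base read with 2 SNPs, 5 color errors, and a 3-base deletion. Looping over values of X, Y, and Z in batches of 10K reads gives the grid described above.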

                I have found it takes about 6 hours to build one BFAST index on a 32GB quad-core machine. Like BWA, this needs to be done only once per reference (save those indexes!). The BWA index I built did not take too long to build either.

                • #9
                  Mapping with 6 MM with 2 in the seed (25 bp) is ~ 3x faster than with 4 MM in the full sequence. Will try with some recent datasets tomorrow.

                  • #10
                    BWA: getting sequences like "GNNNNNNNNNNNNNNNNNNNNNNNNN" from SOLiD color space reads

                    Hello there,

                    I am new to this forum, but glad to see so many great discussions going on.

                    In the past, I have mainly used MAQ to analyze Solexa data. A few days ago, I started trying BWA for SOLiD data, partly because of its claimed speed and partly because of some problems I ran into when using MAQ for SOLiD data. I am running into a problem, described below, and I wonder if I can get some help from you experts. Thanks very much in advance! (Sorry for such a long message as my first post, but I think it is necessary for you to understand the problem.)

                    #Problem:
                    #I used the following commands to try to map paired-end SOLiD data in FASTQ format, downloaded directly from the 1000 Genomes Project site:

                    bwa index -a bwtsw -c hg18.fa &

                    [bwt_gen] Finished constructing BWT in 314 iterations.
                    [bwa_index] 2054.17 seconds elapse.
                    #This seems to work fine

                    bwa aln -c hg18.fa SRR003188_1.fastq >SRR003188_1.sai
                    bwa aln -c hg18.fa SRR003188_2.fastq >SRR003188_2.sai

                    #these are the files generated including the original read files:
                    -rw-r--r-- 1 pliang pliang 355680085 Aug 24 12:47 SRR003188_1.fastq
                    -rw-r--r-- 1 pliang pliang 11958400 Aug 26 21:19 SRR003188_1.sai
                    -rw-r--r-- 1 pliang pliang 355680085 Aug 24 12:49 SRR003188_2.fastq
                    -rw-r--r-- 1 pliang pliang 11958400 Aug 26 21:20 SRR003188_2.sai

                    bwa sampe -a 2400 hg18.fa SRR003188_1.sai SRR003188_2.sai SRR003188_1.fastq SRR003188_2.fastq >SRR003188.sam
                    #message
                    [bwa_sai2sam_pe_core] convert to sequence coordinate...
                    [infer_isize] fail to infer insert size: too few good pairs
                    [bwa_sai2sam_pe_core] time elapses: 3.11 sec
                    [bwa_sai2sam_pe_core] change of coordinates in 0 alignments.
                    [bwa_sai2sam_pe_core] align unmapped mate...
                    [bwa_sai2sam_pe_core] time elapses: 0.71 sec
                    [bwa_sai2sam_pe_core] refine gapped alignments... 1.58 sec
                    [bwa_sai2sam_pe_core] print alignments... 0.43 sec
                    [bwa_sai2sam_pe_core] 262144 sequences have been processed.
                    [bwa_sai2sam_pe_core] convert to sequence coordinate...
                    [infer_isize] fail to infer insert size: too few good pairs

                    #when open the .sam file, it looks like this:
                    VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114 77 * 0 0 * * 0 0 GNNNNNNNNNNNNNNNNNNNNNNNNN !611%%(-+%*.&*.,&2,,'%()31
                    VAB_Solid0044_20080423_1_Pilot2_YRI_1_8_3KB_MP_11137_718_114 141 * 0 0 * * 0 0 TNNNNNNNNNNNNNNNNNNNNNNNNN !1:7%6);%.1/<%&717'/'7:
                    .....

                    #The result was the same when I ran samse with a single input. It looks to me like the color space didn't get converted properly, so no matches were found. Also, the time used by aln and sampe/samse seems too short to me.

                    • #11
                      Hi,

                      bwa uses FASTQ files with colors represented as ACGT; perhaps the 1000 Genomes FASTQ files represent them as 0123? Also, bwa does not use the first color, so you may have to strip it or use the solid2fastq script.
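One quick way to test this hypothesis is to look at the sequence lines of the FASTQ file. The Python sketch below assumes a plain 4-line-per-record FASTQ, and that digit-encoded reads consist of an optional primer base followed by the characters 0-3, with `.` for a missing color call. The function name `is_digit_encoded` is hypothetical.

```python
import re

# Optional primer base followed by colors written as digits; '.' marks a missing call.
DIGIT_COLOR_READ = re.compile(r"^[ACGTN]?[0-3.]+$")


def is_digit_encoded(fastq_lines, max_records=1000):
    """Return True if every inspected read is written as color digits.

    `fastq_lines` is any iterable of lines from a 4-line-per-record FASTQ.
    """
    checked = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the second line of each record
            continue
        if not DIGIT_COLOR_READ.match(line.strip()):
            return False
        checked += 1
        if checked >= max_records:
            break
    return checked > 0
```

An open file handle can be passed directly, for example `is_digit_encoded(open("SRR003188_1.fastq"))`. A `True` result would mean the reads need converting to letters before being given to `bwa aln -c`.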

                      • #12
                        Hi Chipper, thanks for your response. Yes, you are right about the FASTQ files from the 1000 Genomes Project. I didn't know that bwa only uses nucleotide-letter sequences. So I will try what you suggested and see how it goes. Thanks again.

                        • #13
                          Originally posted by totalnew
                          I would like to build a color-space index with bwa. The input FASTA should be in nucleotide space, so I used the following command to index the whole human genome:

                          >bwa index -c human.fasta

                          But a segmentation fault occurs every time, like this:

                          [bwa_index] Pack nucleotide FASTA... 60.48 sec
                          [bwa_index] Convert nucleotide PAC to color PAC... 31.13 sec
                          [bwa_index] Reverse the packed sequence... 16.62 sec
                          [bwa_index] Construct BWT for the packed sequence...
                          Segmentation fault

                          Can anyone tell me why that happens?

                          thanks
                          Are there pre-built indexes for BWA as there are for bowtie?
                          ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/
                          Last edited by KevinLam; 12-21-2009, 06:47 PM.
                          http://kevin-gattaca.blogspot.com/

                          • #14
                            Though this is an old thread, it might be important to clarify: are you referring to another tool called 'bwtsw', separate from bwa? Chipper was referring to the bwtsw indexing option of the 'bwa index' command.

                            • #15
                              Converting NCBI colorspace fastq to BWA Colorspace Fastq.

                              I have a project that involves aligning SOLiD data to hg18. The short reads (both paired-end and single-end) are provided in a FASTQ file that looks like this:

                              Code:
                              @SRR035457.1557068 VAB_solid0148_20090522_1_AZZ_ABT_LMP_pA_0000001003227942_AZZ_ABT_LMP_pA_000000100322794288_85_1730 length=50
                              T003.......0230..0.0.....220..2.010.301...321..111.
                              +SRR035457.1557068 VAB_solid0148_20090522_1_AZZ_ABT_LMP_pA_0000001003227942_AZZ_ABT_LMP_pA_000000100322794288_85_1730 length=50
                              !%9#!!!!!!!#-$1!!2!%!!!!!%)&!!(!*,#!$2'!!!)/+!!%2,!
                              Clearly this is colorspace data, and I'd like to use BWA to align it (I already have a suite of tools compatible with BWA, and this is near the end of the project, so I don't really want to switch).

                              The solid2fastq.pl script most often referenced in the BWA literature seems to require color-space data in a different format (multiple files, separate quality and color data, perhaps different quality score scaling, etc.).

                              Can anyone provide some pointers on how to convert this colorspace FASTQ file to a colorspace FASTQ file that is compatible with BWA's colorspace aligner? (Presumably BWA's colorspace format represents colors using nucleotide letters, as opposed to converting the colorspace reads to actual nucleotides.)

                              I want to make sure that I have
                              • the correct color-to-letter mapping (0->A, 1->C, 2->G, 3->T, *->N)
                              • the correct quality score mapping and representation
                              • paired reads that BWA treats correctly


                              Many thanks,
                              --Brad
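A possible starting point for the conversion asked about above, as a minimal Python sketch and not a replacement for solid2fastq.pl. It uses the 0->A, 1->C, 2->G, 3->T mapping from the list above and maps `.` (the missing-call character in the example record) to N. It also drops the primer base and the first color, as suggested earlier in the thread, along with their two quality characters. The assumptions here are that the quality line has one character per sequence character (true of the example record, 51 each) and that the quality encoding can be left untouched. Read pairing is not handled, and the function name `convert_record` is hypothetical.

```python
# Digits to letters; '.' (missing color call) becomes N.
COLOR_TO_LETTER = str.maketrans("0123.", "ACGTN")


def convert_record(header, seq, plus, qual):
    """Convert one digit-encoded colorspace FASTQ record to letter-encoded colors."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Drop the primer base and the first color, plus their two quality characters.
    colors = seq[2:].translate(COLOR_TO_LETTER)
    # Emit a bare '+' separator; repeating the read name there is optional in FASTQ.
    return header, colors, "+", qual[2:]
```

For example, `convert_record("@r1", "T0123.", "+", "!ABCDE")` returns `("@r1", "CGTN", "+", "BCDE")`. Applied to the record above, it would turn the 51-character read into 49 letter-encoded colors with 49 quality characters.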
