  • BWA unmapped reads for color space data

    Hi everyone! This is my first post, and I would first like to thank everyone on this forum for their expertise, which I've consulted many, many times over the past few weeks. I have not, however, been able to find an answer to my latest problem, so I figured I'd see if anyone else has some ideas.

    I am trying to map SOLiD exome sequencing data to the human genome. I am new to the project and bioinformatics in general, and from what I understand, a portion of the data has already been mapped using the SOLiD software (which doesn't map indels), and my task is to use BWA to map indels for the remaining reads.

    The files were provided to me in bam format with the sequence in color space and the qualities in what I believe to be the standard (not SOLiD) format.

    bam file:

    110_857_1962 4 * 0 0 * * 0 0 * * RG:Z:20100728125836730 CS:Z:T00220012130000301222223222122120232021122012121222 CQ:Z:%)''%')%%''&%'&&(((&&%%'&%&%)&)%%%(&&(&%(%%'%''()&

    110_857_2034 4 * 0 0 * * 0 0 * * RG:Z:20100728125836730 CS:Z:T030.12130..311.0330.022330000003322010022213030022 CQ:Z:<@?!86:><!!7.2!<<0,!30>0719?10-8:>9%8')%824/+%,&<0

    110_858_72 4 * 0 0 * * 0 0 * * RG:Z:20100728125836730 CS:Z:T32022000230003012211032210202202201203100122102222 CQ:Z?>36@&<;.%<)7'(4)1(&2+81,:&%+&1.1;.9<&.+)%=%4'93;

    Because the sequences and qualities were stored in the optional tag fields (CS:Z and CQ:Z) rather than in the sequence and quality columns, the file was incompatible with BWA, so I converted it to fastq format. I tried to mimic the process used in the solid2fastq.pl script that comes with BWA: I converted each color space sequence from 0123. to ACGTN, clipped off the first two characters of each read (the primer base plus the first color), and clipped off the first quality score so that the read length would match the quality length.

    fastq file:

    @110_857_1962
    AGGAACGCTAAAATACGGGGGTGGGCGGCGAGTGAGCCGGACGCGCGGG
    +
    )''%')%%''&%'&&(((&&%%'&%&%)&)%%%(&&(&%(%%'%''()&
    @110_857_2034
    TANCGCTANNTCCNATTANAGGTTAAAAAATTGGACAAGGGCTATAAGG
    +
    @?!86:><!!7.2!<<0,!30>0719?10-8:>9%8')%824/+%,&<0
    @110_858_72
    GAGGAAAGTAAATACGGCCATGGCAGAGGAGGACGATCAACGGCAGGGG
    +
    ?>36@&<;.%<)7'(4)1(&2+81,:&%+&1.1;.9<&.+)%=%4'93;
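
The conversion described above can be sketched in a few lines of Python. This is only an illustration of the steps as stated in the post, not the solid2fastq.pl script itself; the function names are hypothetical, and the inputs are assumed to be the bare tag values (the text after `CS:Z:` and `CQ:Z:`).

```python
# Hypothetical helper names; mirrors the conversion steps described in the post.
COLOR_TO_LETTER = str.maketrans("0123.", "ACGTN")


def cs_to_double_encoded(cs_tag: str) -> str:
    """Drop the primer base and the first color, then map 0123. to ACGTN."""
    return cs_tag[2:].translate(COLOR_TO_LETTER)


def cq_to_fastq_quality(cq_tag: str) -> str:
    """Drop the first quality so the length matches the clipped read."""
    return cq_tag[1:]


def to_fastq_record(name: str, cs_tag: str, cq_tag: str) -> str:
    """Build one four-line fastq record from a read name and its CS/CQ values."""
    seq = cs_to_double_encoded(cs_tag)
    qual = cq_to_fastq_quality(cq_tag)
    if len(seq) != len(qual):
        raise ValueError(f"{name}: sequence/quality length mismatch")
    return f"@{name}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    # Toy input, not taken from the real data set.
    print(to_fastq_record("example_read", "T0123.", "IIIII"), end="")
```

The length check is there because the read and quality strings are clipped by different amounts, which is an easy place for an off-by-one slip.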

    However, when I ran BWA v0.5.9 with default values and the color space option, only 30 of 1,000,000 reads mapped. I'm stumped. Is there something I'm blatantly doing wrong, or does anyone have any other idea where the problem might be stemming from?

    My command line for BWA:

    bwa index -a bwtsw -c hg19.fasta
    bwa aln -c hg19.fasta sample.fastq > output.sai
    bwa samse hg19.fasta output.sai sample.fastq > output.sam
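
One way to double-check the mapped-read count from the resulting SAM file is to test the unmapped bit (0x4) of the FLAG column. The sketch below assumes a plain-text, tab-separated SAM file; `count_mapped` is a hypothetical helper name, and the sample lines are made up for illustration.

```python
def count_mapped(sam_lines):
    """Return (mapped, total) for an iterable of SAM text lines."""
    mapped = total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the second column
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped, total


if __name__ == "__main__":
    # Made-up records; for a real run, pass an open file handle instead.
    sample = [
        "@SQ\tSN:chr1\tLN:100",
        "r1\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
        "r2\t0\tchr1\t5\t37\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(count_mapped(sample))
```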

    Thanks,
    Jason

  • #2
    I would demand the original colour space files, i.e. csfasta and qual, and then align with SHRiMP2, which in my experience does a better job than BWA for colour space. BWA dropped colour space support as of version 0.6.0.

    • #3
      As it turns out, I think I may actually be getting the correct results, since most of the mapped sequences had been removed from my data anyway. When I try the same thing with reads that have been previously mapped, I have many more matches. Thanks for your suggestion colindaven, I will look into Shrimp2 (and probably BFAST as well) when it comes time to compare my results between several different programs.

      • #4
        SHRiMP2 slow for higher organisms

        I once tried to run SHRiMP2 on SOLiD paired-end ChIP-seq data with the mouse genome as reference, and it never finished running even with the threading option set. Now I need gapped alignment (to obtain indels) of SOLiD paired-end whole-exome data against the human genome, and I am stuck because BWA no longer handles color space. Did you try running your data through SHRiMP2?

        • #5
          One can still use BWA 0.5.9 for color space since, as pointed out, BWA only dropped color space support in 0.6.x.

          Isn't cutting off any of the starting sequence of color space data a bad idea? Each color encodes the transition between two adjacent bases, so odds are you start off lost if you cut off one base, and the odds of getting lost are even higher if you cut off two.
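
To illustrate the point about transitions: in the standard SOLiD two-base encoding, each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3), so translating colors into bases requires a known starting base. The sketch below is a minimal decoder under that assumption; `decode_colors` is a hypothetical name, and missing calls (`.`) are not handled. It only covers translation into base space and says nothing about how an aligner that compares reads in color space treats clipped reads.

```python
BASES = "ACGT"  # index doubles as the 2-bit code: A=0, C=1, G=2, T=3


def decode_colors(start_base: str, colors: str) -> str:
    """Decode SOLiD colors into bases, given the base preceding the first color.

    Raises ValueError on a missing call ('.'), which this sketch does not handle.
    """
    current = BASES.index(start_base)
    decoded = []
    for color in colors:
        current ^= int(color)  # color = XOR of the codes of two adjacent bases
        decoded.append(BASES[current])
    return "".join(decoded)


if __name__ == "__main__":
    # The same colors decode to different bases from different starting bases.
    print(decode_colors("T", "0123"))
    print(decode_colors("A", "0123"))
```

Because every decoded base depends on the one before it, an incorrect or unknown starting base changes every base that follows.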
