Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems with BWA XA output on

    I believe the XA information is wrong in many of our reads but I can't understand why. Any help will be greatly appreciated. Essentially I want to be able to rely on the XA information (I'm not only concerned about uniq reads).

    History:
    - these are single-end reads from SOLiD
    - bwa flags are as follows:
    bwa aln -c -n 0.04 -N -q 30 -t 60 -f ${OUTFILE}.sai $INDEX ${FASTQ}
    bwa samse -n 1000000 -f ${OUTFILE}.txt $INDEX ${OUTFILE}.sai ${FASTQ}
    - input file that is indexed has 1314 sequences and is a total of 2.3G
    - i'm getting lots of problematic reads (> 74,000) over 16 samples. Could be much more than 74,000 but those were easy to verify
    - one of the problem reads are below with questions/concerns I'm hoping to be able to get assistance with

    A sample problem read:
    208_44_1461 0 gi|307149945|ref|NC_014501.1|_Cyanothece_sp._PCC_7822_chromosome__complete_genome 4159110 25 20M * 0 0 AAATACCGTTCAAATAATGG A!!57A?:@CHQH%=O79:< XT:A:U CM:i:0 X0:i:1 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:20 XA:Z:gi|258513327|ref|NC_013213.1|_Acetobacter_pasteurianus_IFO_3283-01_plasmid_pAPA01-040__complete_sequence,+3199,7M1I1M2I4M1D5M,2;

    Some concerns with the sample read above:
    1. Cyanothece seems to map, but Acetobacter does not. When looking at the Acetobacter sequence, I don't see anything that even closely resembles AAATACCGTTCAAATAATGG. Not even when taking into account the 3 insertions and 1 deletion mentioned in the CIGAR score. As you will see from the X1 score, Acetobacter is a reported suboptimal hit.
    2. The Acetobacter has a start coordinate of +3199 (which makes the coordinates 3199 - 3218). However, the actual sequence of Acetobacter is only 3205 in length, so it can't go to 3218. I don't believe it's a problem with boundries because even when looking at 3199-3205 (that are in the sequence), nothing resembles AAATACCGTTCAAATAATGG. I also looked at the sequences surrounding that sequence and didn't see anything it could map to (even taking the reverse complement). My bigger fear is there are also reported reads that don't exceed the sequence boundry that don't map at all.
    3. Acetobacter has a CIGAR score of 7M1I1M2I4M1D5M which I believe should have been 'filtered out' given the bwa flags shown above. While I still don't see the alignment, the CIGAR score likely should have been filtered despite it being a suboptimal hit (unless I'm not understanding the suboptimal hit thresholds correctly).

    Any help will be greatly appreciated. I'm mainly trying to understand if we should continue to use BWA for reading 'non-uniq' reads, use some other method, or apply some work-arounds. Could the problems be only with sub-optimal hits? Or could it be more wide-spread and effect all repeat-hits or even uniq hits?

    Thank you.

  • #2
    I have a similar problem with bwa 0.6.1 and XA.

    I have been checking SRA reads against a portion of the mouse genome.
    One of hits is:

    SRR067605.10526006_0 0 R16_3260025_3260114;8547975 37 23 50M * 0 0 ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 XA:Z:U16_3259971_3260068;8547974,+91,6M2I42M,2;

    which means this read is mapped to R16_3260025_3260114;8547975 and U16_3259971_3260068;8547974. But U16_* was too short for an alignment
    that started at base 91.

    So, I isolated R16_* and U16_* (and another one) from my 1 Gb subject fasta
    into a file of three records:

    >U16_3259971_3260068;8547974
    CTTCTTCTATTTTTTGATTATGGTTTTTGGTATGTATATTATATCCAGGGAATGTGTTTA
    TTACAGAAGAAATAATCATCATTTATTAGGACTTATTA
    >R16_3260025_3260114;8547975
    TGTTTATTACAGAAGAAATAATCATCATTTATTAGGACTTATTATATAATATTATTTCAT
    TTACAGAGGAACATATAATACCTTATAAAT
    >U16_3260071_3260149;8547976
    TAATATTATTTCATTTACAGAGGAACATATAATACCTTATAAATGTATATGAAATTGATA
    AAAGTTTTTACAGTGTAAG

    Isolated the read into a fastq file:

    @SRR067605.10526006_0
    ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT
    +SRR067605.10526006_0
    GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG

    And reran the alignment just between the two, the hit from bwa now becomes:

    SRR067605.10526006_0 0 R16_3260025_3260114;8547975 37 37 50M * 0 0
    ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

    No U16 hit...very perplexing...
    I can't post the 1 Gb subject fasta file. Don't quite know why it incorrectly
    picked U16_* when aligning against the larger subject file, but not when the
    subject file is only 3 records.

    Any recommendations ?

    Thanks

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Recent Advances in Sequencing Analysis Tools
      by seqadmin


      The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
      05-06-2024, 07:48 AM
    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin




      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
      04-22-2024, 07:01 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 05-07-2024, 06:57 AM
    0 responses
    12 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 05-06-2024, 07:17 AM
    0 responses
    16 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 05-02-2024, 08:06 AM
    0 responses
    22 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-30-2024, 12:17 PM
    0 responses
    24 views
    0 likes
    Last Post seqadmin  
    Working...
    X