Problems with BWA XA output on

kbhit

Junior Member

Join Date: Sep 2011

Posts: 9
- Share
- Tweet
#1

Problems with BWA XA output on

12-15-2011, 08:34 AM

I believe the XA information is wrong in many of our reads but I can't understand why. Any help will be greatly appreciated. Essentially I want to be able to rely on the XA information (I'm not only concerned about uniq reads).

History:
- these are single-end reads from SOLiD
- bwa flags are as follows:
bwa aln -c -n 0.04 -N -q 30 -t 60 -f ${OUTFILE}.sai $INDEX ${FASTQ}
bwa samse -n 1000000 -f ${OUTFILE}.txt $INDEX ${OUTFILE}.sai ${FASTQ}
- input file that is indexed has 1314 sequences and is a total of 2.3G
- i'm getting lots of problematic reads (> 74,000) over 16 samples. Could be much more than 74,000 but those were easy to verify
- one of the problem reads are below with questions/concerns I'm hoping to be able to get assistance with

A sample problem read:
208_44_1461 0 gi|307149945|ref|NC_014501.1|_Cyanothece_sp._PCC_7822_chromosome__complete_genome 4159110 25 20M * 0 0 AAATACCGTTCAAATAATGG A!!57A?:@CHQH%=O79:< XT:A:U CM:i:0 X0:i:1 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:20 XA:Z:gi|258513327|ref|NC_013213.1|_Acetobacter_pasteurianus_IFO_3283-01_plasmid_pAPA01-040__complete_sequence,+3199,7M1I1M2I4M1D5M,2;

Some concerns with the sample read above:
1. Cyanothece seems to map, but Acetobacter does not. When looking at the Acetobacter sequence, I don't see anything that even closely resembles AAATACCGTTCAAATAATGG. Not even when taking into account the 3 insertions and 1 deletion mentioned in the CIGAR score. As you will see from the X1 score, Acetobacter is a reported suboptimal hit.
2. The Acetobacter has a start coordinate of +3199 (which makes the coordinates 3199 - 3218). However, the actual sequence of Acetobacter is only 3205 in length, so it can't go to 3218. I don't believe it's a problem with boundries because even when looking at 3199-3205 (that are in the sequence), nothing resembles AAATACCGTTCAAATAATGG. I also looked at the sequences surrounding that sequence and didn't see anything it could map to (even taking the reverse complement). My bigger fear is there are also reported reads that don't exceed the sequence boundry that don't map at all.
3. Acetobacter has a CIGAR score of 7M1I1M2I4M1D5M which I believe should have been 'filtered out' given the bwa flags shown above. While I still don't see the alignment, the CIGAR score likely should have been filtered despite it being a suboptimal hit (unless I'm not understanding the suboptimal hit thresholds correctly).

Any help will be greatly appreciated. I'm mainly trying to understand if we should continue to use BWA for reading 'non-uniq' reads, use some other method, or apply some work-arounds. Could the problems be only with sub-optimal hits? Or could it be more wide-spread and effect all repeat-hits or even uniq hits?

Thank you.
Tags: None
jenli

Junior Member

Join Date: Feb 2012

Posts: 1
- Share
- Tweet
#2

02-29-2012, 02:16 PM

I have a similar problem with bwa 0.6.1 and XA.

I have been checking SRA reads against a portion of the mouse genome.
One of hits is:

SRR067605.10526006_0 0 R16_3260025_3260114;8547975 37 23 50M * 0 0 ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 XA:Z:U16_3259971_3260068;8547974,+91,6M2I42M,2;

which means this read is mapped to R16_3260025_3260114;8547975 and U16_3259971_3260068;8547974. But U16_* was too short for an alignment
that started at base 91.

So, I isolated R16_* and U16_* (and another one) from my 1 Gb subject fasta
into a file of three records:

>U16_3259971_3260068;8547974
CTTCTTCTATTTTTTGATTATGGTTTTTGGTATGTATATTATATCCAGGGAATGTGTTTA
TTACAGAAGAAATAATCATCATTTATTAGGACTTATTA
>R16_3260025_3260114;8547975
TGTTTATTACAGAAGAAATAATCATCATTTATTAGGACTTATTATATAATATTATTTCAT
TTACAGAGGAACATATAATACCTTATAAAT
>U16_3260071_3260149;8547976
TAATATTATTTCATTTACAGAGGAACATATAATACCTTATAAATGTATATGAAATTGATA
AAAGTTTTTACAGTGTAAG

Isolated the read into a fastq file:

@SRR067605.10526006_0
ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT
+SRR067605.10526006_0
GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG

And reran the alignment just between the two, the hit from bwa now becomes:

SRR067605.10526006_0 0 R16_3260025_3260114;8547975 37 37 50M * 0 0
ACTTATTATATAATATTATTTCATTTACAGAGGAACATATAATACCTTAT GGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGEFDFFFGGGFGFGGGG XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50

No U16 hit...very perplexing...
I can't post the 1 Gb subject fasta file. Don't quite know why it incorrectly
picked U16_* when aligning against the larger subject file, but not when the
subject file is only 3 records.

Any recommendations ?

Thanks
Comment

Previous template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 12 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 16 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 22 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM

Seqanswers Leaderboard Ad

Announcement

Problems with BWA XA output on

Comment

Latest Articles

ad_right_rmr

News