@Pepe I am sure you can make a script to perform the same task, though it is conceptually easier to avoid dropping the unmapped reads while you are processing them with 'bwa bwasw' one by one rather than figuring out later what you dropped. I also wanted to mention that occasionally 'bwa bwasw' reports multiple alignments in the sam file for the same read. The only way to modify that behavior at the moment is to get into the source code.
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Unmapped reads bwasw + patch
Originally posted by freeseek View PostThis simple patch:
Code:--- bwtsw2_aux.c +++ bwtsw2_aux.c @@ -520,8 +520,10 @@ bsw2_resolve_query_overlaps(b[0], opt.mask_level); } else b[1] = 0; // generate CIGAR and print SAM - gen_cigar(&opt, l, seq, pac, b[0]); - print_hits(bns, &opt, p, b[0]); + if(b[0]->n > 0) { + gen_cigar(&opt, l, seq, pac, b[0]); + print_hits(bns, &opt, p, b[0]); + } // free free(seq[0]); bsw2_destroy(b[0]); @@ -583,6 +585,7 @@ for (i = 0; i < _seq->n; ++i) { bsw2seq1_t *p = _seq->seq + i; if (p->sam) printf("%s", p->sam); + else printf("%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s\n", p->name, p->seq, p->qual); free(p->name); free(p->seq); free(p->qual); free(p->sam); p->tid = -1; p->l = 0; p->name = p->seq = p->qual = p->sam = 0;
I'm trying to align my sample illumina bacterial genome to an already published reference bacterial genome. The aim is to find out what genes are present in my sample genome, but missing in the reference. The plan of action was to run bwasw, and then pull out the sequence in the sample genome that do not align to the reference, then look at what these sequences are.
I aligned my query genome to the ref using bwasw.
The output was in sam format. I converted that to bam format using
samtools view -bS aln.sam > aln.bam
Then I ran the command lines quoted above.
When I headed my mapped.sam file, it appears to have worked, however, my unmapped.sam file is empty,the size is zero.
This is consistent for two of my sample genomes. Reading this thread, it seems that bwasw does not give out unmapped reads, thus I was trying your patch.
This is the message I'm getting:
bgansw:~/bwa$ patch < bwtsw2_aux.diff
can't find file to patch at input line 3
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
|--- bwtsw2_aux.c
|+++ bwtsw2_aux.c
--------------------------
File to patch:
I'm probably going to ask a very silly question. I'm a newbie, so am not good with scripts. What should I enter in file to patch? Do I have to make changes or input a file name in the script? I've got paired end illumina hiseq data.
Any help would be greatly appreciated.
bgansw
Comment
-
Hi Freeseek,
I inputted my faq file in the file to patch. That gave me .fastq.orig file and .fastq.rej
Copy pasting the .orig file
bgansw:~/bwa$ head 6_combined_paired.faq.orig
@GAPC_0027_FC:7:1:4477:1032#TGACCA/1
AGTATGGGGGAAATGCAAGCAAGCATTNATGCTAGTGGTTGTCAAATAGTTACCGTAGCTGTGCGCA
+
hhhhhhhhhhdhhghhhhhhhheddeeBededaad^dfdfffhhhhhhfhhfhhfceffdfbhhhad
@GAPC_0027_FC:7:1:4477:1032#TGACCA/2
TGCTTCTAAGGTACCAATGGNNNCAGGTAACAGGTATTTAGGATCGGGTATGACCTNNNNTTTAACA
+
hhhhhhhhhhhefhfhf_]_BBB`]aa^acdcff`fffffffhghhgfafccY][ZBBBBW\\[[[[
@GAPC_0027_FC:7:1:4986:1032#TGACCA/1
GAAAATCCGAAAGAACCGGCAATGATTNTTTCTACGTTCCGTAAAGAAAGCAATGGTGATAAGCCTG
Am copy pasting the .fastq.rej file:
bgansw:~/bwa$ cat 6_combined_paired.faq.rej
***************
*** 520,527 ****
bsw2_resolve_query_overlaps(b[0], opt.mask_level);
} else b[1] = 0;
// generate CIGAR and print SAM
- gen_cigar(&opt, l, seq, pac, b[0]);
- print_hits(bns, &opt, p, b[0]);
// free
free(seq[0]);
bsw2_destroy(b[0]);
--- 520,529 ----
bsw2_resolve_query_overlaps(b[0], opt.mask_level);
} else b[1] = 0;
// generate CIGAR and print SAM
+ if(b[0]->n > 0) {
+ gen_cigar(&opt, l, seq, pac, b[0]);
+ print_hits(bns, &opt, p, b[0]);
+ }
// free
free(seq[0]);
bsw2_destroy(b[0]);
***************
*** 583,588 ****
for (i = 0; i < _seq->n; ++i) {
bsw2seq1_t *p = _seq->seq + i;
if (p->sam) printf("%s", p->sam);
free(p->name); free(p->seq); free(p->qual); free(p->sam);
p->tid = -1; p->l = 0;
p->name = p->seq = p->qual = p->sam = 0;
--- 585,591 ----
for (i = 0; i < _seq->n; ++i) {
bsw2seq1_t *p = _seq->seq + i;
if (p->sam) printf("%s", p->sam);
+ else printf("%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s\n", p->name, p->seq, p->qual);
free(p->name); free(p->seq); free(p->qual); free(p->sam);
p->tid = -1; p->l = 0;
p->name = p->seq = p->qual = p->sam = 0;
My apologies, since I dont know if I'm on the right track or not. Any help would be greatly appreciated.
Regards
bgansw
Comment
-
If you have Illumina reads, just use bwa aln. It doesn't much matter how long your reads are, you are only aligning to a bacterial genome, how long can it take?
The post you are citing is a year old, do you know that it's suppsoed to work for whatever version of bwa you have?
Comment
-
Hi swbarnes2,
I ran bwa using the aln parameter, instead of the bwasw, but am not sure that its worked. The output .sam file is only 4K, while the bwasw output file was 796K. Additionally, the aln.sam file looks nothing like the bwasw.sam file. It doesnt contain reads, looks more like a .bam file. The aln.sam file also doesnt contain @SQ headers, while the bwasw file contained them, so I'm unable to convert the sam file to .bam.
I tried
samtools faidx ref.fa (but I get an error - [fai_build_core] different line length in sequence 'NZ_A01000034'.
Segmentation fault
So am stuck...
I would really appreciate any help.
Comment
-
You realize that aln doesn't output a sam file, right? You put the outputs of aln through samse or sampe, and they make the .sam file. I align to bacteria all the time with 60-mers and use bwa aln and sampe.
And your faidx error means exactly what it looks like it means. bwa doesn't mind that particular problem, but you can't index the fasta if the line lengths vary within a sequence. I don't see why you need to index the fasta at this point anyway. Samtools view does not require an indexed fasta to make a .bam file. You can pipe samtools samse or sampe straight into samtools view, and not have to deal with a bulky .sam file. I do it all the time.
4 K may be too small, but maybe not.
For bacteria, these are the commands you want:
bwa index bacteria_ref.fasta; bwa aln bacteria_ref.fasta read1.fq > R1.sai; bwa aln bacteria_ref.fasta read2.fq > R2.sai; bwa sampe bacteria_ref.fasta R1.sai R2.sai read1.fq read2.fq | samtools view -bSho out.bam -; samtools sort out.bam sort; samtools view -bf 4 > unmapped.bam
Comment
-
extract unmapped reads from bwa - convert to fastq
Hi Swarnes2,
Thank you very very much for your detailed post. Since I'm a newbie, the post was direct, easy to understand and very helpful. Thank you very very much. I'm very grateful!
I ran bwa aln , then sampe, got an output_unmapped.bam file
I also converted the output_unmapped.bam file to fasta format.
This is what the unmapped reads look like:
>GAPC_0027_FC:7:1:4477:1032#TGACCA
TGTTAAANNNNAGGTCATACCCGATCCTAAATACCTGTTACCTGNNNCCATTGGTACCTTAGAAGCA
>GAPC_0027_FC:7:1:4986:1032#TGACCA
GAAAATCCGAAAGAACCGGCAATGATTNTTTCTACGTTCCGTAAAGAAAGCAATGGTGATAAGCCTG
>GAPC_0027_FC:7:1:4986:1032#TGACCA
GATAGGGGCTACTTACTCAGNNNCAGCAGCTTCACCTTTCTTCTCTGCATTGTGTANNNNAGCAGAA
>GAPC_0027_FC:7:1:6525:1034#TGACCA
There are over 100 of these little sequences in the unmapped file.
I tried to put the .bam file into velvet
velveth UNMAPPED_06_aln_sampe 55 -fastq -shortPaired out_06_aln_sampe.bam
[0.000000] Reading FastQ file out_06_aln_sampe.bam
velveth: out_06_aln_sampe.bam does not seem to be in FastQ format
Am I doing something wrong?
I've been trying to use picard - bam to fastq and other fasta to fastq convertors available online, but havent got them working yet. Once I generate a fastq file, would aligning these reads in velvet or soapdenovo be the next step, to finding reads present in my sample genome but not in the reference genome.
Thank you for your advice and help. I really appreciate it.
Regards
bgansw
Comment
-
Sure, but new versions of velvet will take .bams. Of course, you have to specify that in the velvet command line.
Bams have the virtue of being very nicely compressed. Of course, fastq.gz is also pretty compressed, and both bwa and velvet will take those too.
My guess is that it's not going to show you much. Looks like those reads failed to align because something happened to the instrument at due to quality issues, and not because there was no suitable reference for them to align to. The pattern of the N's suggest that the instrument had bubbles or something during certain cycles, and those letters just couldn't be called.
Comment
Latest Articles
Collapse
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
-
by seqadmin
Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...-
Channel: Articles
03-22-2024, 06:39 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
28 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
31 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
||
Started by seqadmin, 04-10-2024, 09:21 AM
|
0 responses
27 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 09:21 AM
|
||
Started by seqadmin, 04-04-2024, 09:00 AM
|
0 responses
52 views
0 likes
|
Last Post
by seqadmin
04-04-2024, 09:00 AM
|
Comment