Seqanswers Leaderboard Ad

**freeseek** · 10-08-2010, 06:46 PM

@Pepe I am sure you can make a script to perform the same task, though it is conceptually easier to avoid dropping the unmapped reads while you are processing them with 'bwa bwasw' one by one rather than figuring out later what you dropped. I also wanted to mention that occasionally 'bwa bwasw' reports multiple alignments in the sam file for the same read. The only way to modify that behavior at the moment is to get into the source code.

**Bgansw** · 02-21-2012, 04:12 PM

Unmapped reads bwasw + patch

Originally posted by freeseek View Post

This simple patch:

Code:

--- bwtsw2_aux.c
+++ bwtsw2_aux.c
@@ -520,8 +520,10 @@
 			bsw2_resolve_query_overlaps(b[0], opt.mask_level);
 		} else b[1] = 0;
 		// generate CIGAR and print SAM
-		gen_cigar(&opt, l, seq, pac, b[0]);
-		print_hits(bns, &opt, p, b[0]);
+		if(b[0]->n > 0) {
+		    gen_cigar(&opt, l, seq, pac, b[0]);
+		    print_hits(bns, &opt, p, b[0]);
+		}
 		// free
 		free(seq[0]);
 		bsw2_destroy(b[0]);
@@ -583,6 +585,7 @@
 	for (i = 0; i < _seq->n; ++i) {
 		bsw2seq1_t *p = _seq->seq + i;
 		if (p->sam) printf("%s", p->sam);
+		else printf("%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s\n", p->name, p->seq, p->qual);
 		free(p->name); free(p->seq); free(p->qual); free(p->sam);
 		p->tid = -1; p->l = 0;
 		p->name = p->seq = p->qual = p->sam = 0;

will make bwasw spit out also the unmapped reads, though it only works with .fastq files, so use it carefully. To use it, create a file contained the above code, like "bwtsw2_aux.diff", enter the bwa source directory, and issue the command "patch < bwtsw2_aux.diff". (PS You need to be careful to not make the tabs become spaces, or the patch will not work)

Hi Freeseek,

I'm trying to align my sample illumina bacterial genome to an already published reference bacterial genome. The aim is to find out what genes are present in my sample genome, but missing in the reference. The plan of action was to run bwasw, and then pull out the sequence in the sample genome that do not align to the reference, then look at what these sequences are.

I aligned my query genome to the ref using bwasw.

The output was in sam format. I converted that to bam format using

samtools view -bS aln.sam > aln.bam

Then I ran the command lines quoted above.

When I headed my mapped.sam file, it appears to have worked, however, my unmapped.sam file is empty,the size is zero.
This is consistent for two of my sample genomes. Reading this thread, it seems that bwasw does not give out unmapped reads, thus I was trying your patch.

This is the message I'm getting:
bgansw:~/bwa$ patch < bwtsw2_aux.diff
can't find file to patch at input line 3
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
|--- bwtsw2_aux.c
|+++ bwtsw2_aux.c
--------------------------
File to patch:

I'm probably going to ask a very silly question. I'm a newbie, so am not good with scripts. What should I enter in file to patch? Do I have to make changes or input a file name in the script? I've got paired end illumina hiseq data.

Any help would be greatly appreciated.

bgansw

**Bgansw** · 02-21-2012, 04:21 PM

Hi Freeseek,

I inputted my faq file in the file to patch. That gave me .fastq.orig file and .fastq.rej

Copy pasting the .orig file

bgansw:~/bwa$ head 6_combined_paired.faq.orig
@GAPC_0027_FC:7:1:4477:1032#TGACCA/1
AGTATGGGGGAAATGCAAGCAAGCATTNATGCTAGTGGTTGTCAAATAGTTACCGTAGCTGTGCGCA
+
hhhhhhhhhhdhhghhhhhhhheddeeBededaad^dfdfffhhhhhhfhhfhhfceffdfbhhhad
@GAPC_0027_FC:7:1:4477:1032#TGACCA/2
TGCTTCTAAGGTACCAATGGNNNCAGGTAACAGGTATTTAGGATCGGGTATGACCTNNNNTTTAACA
+
hhhhhhhhhhhefhfhf_]_BBB`]aa^acdcff`fffffffhghhgfafccY][ZBBBBW\\[[[[
@GAPC_0027_FC:7:1:4986:1032#TGACCA/1
GAAAATCCGAAAGAACCGGCAATGATTNTTTCTACGTTCCGTAAAGAAAGCAATGGTGATAAGCCTG

Am copy pasting the .fastq.rej file:

bgansw:~/bwa$ cat 6_combined_paired.faq.rej
***************
*** 520,527 ****
bsw2_resolve_query_overlaps(b[0], opt.mask_level);
} else b[1] = 0;
// generate CIGAR and print SAM
- gen_cigar(&opt, l, seq, pac, b[0]);
- print_hits(bns, &opt, p, b[0]);
// free
free(seq[0]);
bsw2_destroy(b[0]);
--- 520,529 ----
bsw2_resolve_query_overlaps(b[0], opt.mask_level);
} else b[1] = 0;
// generate CIGAR and print SAM
+ if(b[0]->n > 0) {
+ gen_cigar(&opt, l, seq, pac, b[0]);
+ print_hits(bns, &opt, p, b[0]);
+ }
// free
free(seq[0]);
bsw2_destroy(b[0]);
***************
*** 583,588 ****
for (i = 0; i < _seq->n; ++i) {
bsw2seq1_t *p = _seq->seq + i;
if (p->sam) printf("%s", p->sam);
free(p->name); free(p->seq); free(p->qual); free(p->sam);
p->tid = -1; p->l = 0;
p->name = p->seq = p->qual = p->sam = 0;
--- 585,591 ----
for (i = 0; i < _seq->n; ++i) {
bsw2seq1_t *p = _seq->seq + i;
if (p->sam) printf("%s", p->sam);
+ else printf("%s\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s\n", p->name, p->seq, p->qual);
free(p->name); free(p->seq); free(p->qual); free(p->sam);
p->tid = -1; p->l = 0;
p->name = p->seq = p->qual = p->sam = 0;

My apologies, since I dont know if I'm on the right track or not. Any help would be greatly appreciated.

Regards

bgansw

**swbarnes2** · 02-21-2012, 04:37 PM

If you have Illumina reads, just use bwa aln. It doesn't much matter how long your reads are, you are only aligning to a bacterial genome, how long can it take?

The post you are citing is a year old, do you know that it's suppsoed to work for whatever version of bwa you have?

**Bgansw** · 02-21-2012, 05:43 PM

Hi swbarnes2,

I ran bwa using the aln parameter, instead of the bwasw, but am not sure that its worked. The output .sam file is only 4K, while the bwasw output file was 796K. Additionally, the aln.sam file looks nothing like the bwasw.sam file. It doesnt contain reads, looks more like a .bam file. The aln.sam file also doesnt contain @SQ headers, while the bwasw file contained them, so I'm unable to convert the sam file to .bam.

I tried

samtools faidx ref.fa (but I get an error - [fai_build_core] different line length in sequence 'NZ_A01000034'.
Segmentation fault

So am stuck...

I would really appreciate any help.

**swbarnes2** · 02-22-2012, 09:52 AM

You realize that aln doesn't output a sam file, right? You put the outputs of aln through samse or sampe, and they make the .sam file. I align to bacteria all the time with 60-mers and use bwa aln and sampe.

And your faidx error means exactly what it looks like it means. bwa doesn't mind that particular problem, but you can't index the fasta if the line lengths vary within a sequence. I don't see why you need to index the fasta at this point anyway. Samtools view does not require an indexed fasta to make a .bam file. You can pipe samtools samse or sampe straight into samtools view, and not have to deal with a bulky .sam file. I do it all the time.

4 K may be too small, but maybe not.

For bacteria, these are the commands you want:

bwa index bacteria_ref.fasta; bwa aln bacteria_ref.fasta read1.fq > R1.sai; bwa aln bacteria_ref.fasta read2.fq > R2.sai; bwa sampe bacteria_ref.fasta R1.sai R2.sai read1.fq read2.fq | samtools view -bSho out.bam -; samtools sort out.bam sort; samtools view -bf 4 > unmapped.bam

You could throw in a -r `@RG\tID:foo\tSM:bar' in to the sampe step, to put read groups in. You could add a -u to the first samtools view step, as it makes that step run a little faster by not compressiong the .bam. The sort command will complain that the .bam lacks the right EOF, but it will still run correctly. Once you have the sorted file, you don't need the unsorted one. There are a couple of ways to convert a .bam to a fastq. Picard will do it. Velvet will take the .bam as is, and since de novo assembly is what you want to do anyway, that works out fine.

**Bgansw** · 02-22-2012, 07:37 PM

extract unmapped reads from bwa - convert to fastq

Hi Swarnes2,

Thank you very very much for your detailed post. Since I'm a newbie, the post was direct, easy to understand and very helpful. Thank you very very much. I'm very grateful!

I ran bwa aln , then sampe, got an output_unmapped.bam file

I also converted the output_unmapped.bam file to fasta format.

This is what the unmapped reads look like:

>GAPC_0027_FC:7:1:4477:1032#TGACCA
TGTTAAANNNNAGGTCATACCCGATCCTAAATACCTGTTACCTGNNNCCATTGGTACCTTAGAAGCA
>GAPC_0027_FC:7:1:4986:1032#TGACCA
GAAAATCCGAAAGAACCGGCAATGATTNTTTCTACGTTCCGTAAAGAAAGCAATGGTGATAAGCCTG
>GAPC_0027_FC:7:1:4986:1032#TGACCA
GATAGGGGCTACTTACTCAGNNNCAGCAGCTTCACCTTTCTTCTCTGCATTGTGTANNNNAGCAGAA
>GAPC_0027_FC:7:1:6525:1034#TGACCA

There are over 100 of these little sequences in the unmapped file.

I tried to put the .bam file into velvet

velveth UNMAPPED_06_aln_sampe 55 -fastq -shortPaired out_06_aln_sampe.bam
[0.000000] Reading FastQ file out_06_aln_sampe.bam
velveth: out_06_aln_sampe.bam does not seem to be in FastQ format

Am I doing something wrong?

I've been trying to use picard - bam to fastq and other fasta to fastq convertors available online, but havent got them working yet. Once I generate a fastq file, would aligning these reads in velvet or soapdenovo be the next step, to finding reads present in my sample genome but not in the reference genome.

Thank you for your advice and help. I really appreciate it.

Regards

bgansw

**Bgansw** · 02-22-2012, 09:21 PM

Hi Swarnes2,

I was able to convert the unmapped.bam files to fastq and then assemble them in velvet.

I'm annotating them now.

Thank you. I guess I'm sorted for now.

Again, thank you very very much for your help.

Regards

bgansw

**swbarnes2** · 02-22-2012, 10:12 PM

Sure, but new versions of velvet will take .bams. Of course, you have to specify that in the velvet command line.

Bams have the virtue of being very nicely compressed. Of course, fastq.gz is also pretty compressed, and both bwa and velvet will take those too.

My guess is that it's not going to show you much. Looks like those reads failed to align because something happened to the instrument at due to quality issues, and not because there was no suitable reference for them to align to. The pattern of the N's suggest that the instrument had bubbles or something during certain cycles, and those letters just couldn't be called.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 28 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 27 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News