**Michael.James.Clark** · 11-24-2009, 12:33 PM

It looks like your BOWTIE reference genome is 0-based and your MAQ reference genome is 1-based.

Basically, the first one is starting at 0 and the second one is starting at 1.

These may help you understand:

Genome Browser FAQ

http://genome.ucsc.edu/FAQ/FAQtracks#tracks1

http://bytebucket.org/galaxy/galaxy-central/wiki/zero_based_coordinates.pdf

Both aligners are aligning to the same position, but your two reference genomes are calling the positions by different numbers.

**dukevn** · 11-24-2009, 02:25 PM

Originally posted by Michael.James.Clark

It looks like your BOWTIE reference genome is 0-based and your MAQ reference genome is 1-based.

Basically, the first one is starting at 0 and the second one is starting at 1.

Thanks, Michael, for pointing that out. But I created both references from the same set of chrXX.fa files from UCSC. For the BOWTIE reference:

Code:

$ bowtie-build <chr1.fa,...> hg19

and for the MAQ reference:

Code:

$ cd <path to folder contain all chrs>
$ files=*
$ cat $files>hg19.fa

What am I doing wrong?

Originally posted by Michael.James.Clark

These may help you understand:

Genome Browser FAQ

http://genome.ucsc.edu/FAQ/FAQtracks#tracks1

http://bytebucket.org/galaxy/galaxy-...oordinates.pdf

Thanks for the links. They are really helpful.

Originally posted by Michael.James.Clark

Both aligners are aligning to the same position, but your two reference genomes are calling the positions by different numbers.

So, if I understand correctly, MAQ gives 0-based results and BOWTIE gives 1-based results. When creating a BED file, for example from the read result below from MAQ

Code:

HWUSI-EAS751_0001:6:120:1760:653#0/1	+	chr1	8922906	GTGAACCACAGGCCCCTTGTTCTCAGGAGCCCTCC	BBBBCBCBBBBBBBBBBBBAABBBAA??>=A>@??	0

then the position of that read should be 8922906? Please correct me if I am wrong.

Thanks,

D.

**nilshomer** · 11-24-2009, 03:25 PM

Regardless, the SAM format specifies that the position be one-based. There could be a problem with the alignment or with the converter from MAP to SAM. Have you looked at the alignments in the MAP format to see whether the problem is in the alignments or in the SAM converter?

Nils

**dukevn** · 11-24-2009, 03:50 PM

You saved my day, Nils. The two files in my first post are bowtie.map and maq.map respectively (without any conversion). I just used samtools to convert maq.map and bowtie.map to SAM, and the results are the same!

Code:

WUSI-EAS751_0001:6:120:1760:653#0/1	0	chr1	8922907	67	35M	*	0	0	GTGAACCACAGGCCCCTTGTTCTCAGGAGCCCTCC	BBBBCBCBBBBBBBBBBBBAABBBAA??>=A>@??	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
USI-EAS751_0001:6:120:1780:1329#0/1	0	chr1	9795559	67	35M	*	0	0	CGGGCGTGGGGAACTGCCGGGAGTTCAGGTACGAG	BBBBBBABBBBBBABBBABBB?B?BABAA=AAB>A	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
WUSI-EAS751_0001:6:120:1766:651#0/1	16	chr1	16255981	45	35M	*	0	0	AGACCTACAAGCAAGACTGGGAGAACCAGCAGGTG	A?2=>?=BBBABBBA=BBBBB@BB@BBBBBBBBBB	MF:i:0	NM:i:1	UQ:i:30	H0:i:0	H1:i:1
WUSI-EAS751_0001:6:120:1773:654#0/1	16	chr1	25571701	64	35M	*	0	0	AACAGTTCTGAGACTAGCTGGCAAGTCAATGTTGG	AAAA@?>@6=<=@@ABA<5BB<BBBABBBCB;BCB	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
WUSI-EAS751_0001:6:120:1781:606#0/1	0	chr1	32696571	30	35M	*	0	0	GGTCACTTTGGACCTATCAACAGTGTTGCCTTCCA	BBBBCBCCCBBBBBCCCBB<ABB?B?ABBABAAA>	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:1

I still don't understand why nobody ever saw this "issue" before?

D.

PS: There is still something odd with MAQ: as you can see, my read names were cut off a little. But I guess that is not important.

**Michael.James.Clark** · 11-24-2009, 04:39 PM

Originally posted by dukevn

Thanks Michael for pointing it out. But I created the references from the same set of chrxx from UCSC by doing like below for BOWTIE ref:

Code:

$ bowtie-build <chr1.fa,...> hg19

and for MAQ ref

Code:

$ cd <path to folder contain all chrs>
$ files=*
$ cat $files>hg19.fa

What I am doing wrong?

When you download from UCSC, you're downloading a 0-based reference genome. So when you cat the files together, you're catting together 0-based references.

On the other hand, when you have bowtie build the file for you, it appears that it recognizes that the input files are 0-based and corrects them to be 1-based.

Originally posted by dukevn

So if I understand it correctly, MAQ gives result with 0-based, and BOWTIE with 1-based reference.

MAQ gives you the correct location according to the reference genome you fed MAQ to begin with. However, your reference genome was 0-based.

If you feed MAQ a 1-based reference genome (and I recommend that you do this in the future), it will of course give you the same output as bowtie (since bowtie used a 1-based reference genome as well).

I think this is effectively what you did when you "reconstructed" them using Samtools.

Originally posted by dukevn

I still dont understand why nobody ever saw this "issue" before?

No worries, it's actually a very well recognized issue when dealing with UCSC's reference genomes.

That's why they have it in their FAQ!

**nilshomer** · 11-24-2009, 07:09 PM

Originally posted by dukevn

PS: There is still something odd with MAQ: as you can see, my read names were cut off a little. But I guess that is not important.

MAQ has a read length and read name upper limit. It's hard-coded!

**dukevn** · 11-24-2009, 07:32 PM

Originally posted by Michael.James.Clark

When you download from UCSC, you're downloading a 0-based reference genome. So when you cat the files together, you're catting together 0-based references.

On the other hand, when you have bowtie build the file for you, it appears that it recognizes that the input files are 0-based and corrects them to be 1-based.

MAQ gives you the correct location according to the reference genome you fed MAQ to begin with. However, your reference genome was 0-based.

You made it very clear, Michael. I didn't know that BOWTIE has the ability to convert 0-based to 1-based.

Originally posted by Michael.James.Clark

If you feed MAQ a 1-based reference genome (and I recommend that you do this in the future),...

Could you please let me know why?

Originally posted by Michael.James.Clark

I think this is effectively what you did when you "reconstructed" them using Samtools.

No worries, it's actually a very well recognized issue when dealing with UCSC's reference genomes.

That's why they have it in their FAQ!

I see. So because UCSC treats the start coordinate of a BED file as 0-based, when converting SAM output (which is 1-based) to BED, I should offset the start by one, am I correct? For example, for the first read, the position reported by SAM is 8922907, so its start position in BED should be 8922906?

By the way, can anybody recommend a tool to convert SAM to BED/WIG?

Thanks,

D.

**dukevn** · 11-24-2009, 07:33 PM

Originally posted by nilshomer

MAQ has a read length and read name upper limit. It's hard-coded!

I guessed it was something like that; I just wanted to be sure. Thanks for the confirmation, Nils.

D.

**HTS** · 11-24-2009, 09:24 PM

When I read through this post, I become really confused:

1. It is obvious from the original post that Bowtie reports 0-based alignments while MAQ reports 1-based ones, not the other way around.

2. When you build the genome index from fa files, all you use are the actual sequences plus the headers. There are no coordinates involved, and I can't see why it makes sense to call a reference sequence/genome itself 0-based or 1-based. It is how you later refer to it that makes the difference.

In summary, all I see here is that using the same reference genome, Bowtie reports alignments in a 0-based fashion (but using the SAM output should fix that) while MAQ reports in a 1-based fashion. Please do correct me if I am wrong. Thanks!

-- Leo
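To illustrate the point that the coordinate base is a reporting convention rather than a property of the FASTA file, here is a minimal Python sketch. It assumes the situation described in this thread (a 0-based start in bowtie's default output, a 1-based start from MAQ); the function name `to_one_based` is made up for the example and is not part of either tool.

```python
def to_one_based(pos: int, zero_based: bool) -> int:
    """Return a 1-based start position (the convention SAM uses)."""
    return pos + 1 if zero_based else pos


# Start of the first read, as discussed in the thread.
bowtie_pos = 8922906  # assumed 0-based (bowtie default output)
maq_pos = 8922907     # assumed 1-based (MAQ)

# Once both numbers are put on the same convention they name the same base.
same_base = (
    to_one_based(bowtie_pos, zero_based=True)
    == to_one_based(maq_pos, zero_based=False)
)
print(same_base)
```

Under those assumptions the two aligners agree on the location and only the numbering differs.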

**dukevn** · 11-25-2009, 06:31 AM

Well, it is certain that MAQ and BOWTIE report results differently (with my data and reference; please check the first post). From what people are discussing, there are two assumptions here:

1. References built from same set of fa files are the same; MAQ reports 1-based, BOWTIE reports 0-based.

2. References built from same set of fa files are different; MAQ and BOWTIE report the same.

Honestly, I have no idea which one is correct. Michael suggested that assumption 2 is correct (post #6). Maybe we need more information and more evidence (documentation) to settle that?

D.

**Michael.James.Clark** · 11-25-2009, 12:17 PM

Originally posted by HTS

When I read through this post, I become really confused:

1. It is obvious from the original post that Bowtie reports 0-based alignments while MAQ reports 1-based ones, not the other way around.

2. When you build the genome index from fa files, all you use are the actual sequences plus the headers. There are no coordinates involved, and I can't see why it makes sense to call a reference sequence/genome itself 0-based or 1-based. It is how you later refer to it that makes the difference.

In summary, all I see here is that using the same reference genome, Bowtie reports alignments in a 0-based fashion (but using the SAM output should fix that) while MAQ reports in a 1-based fashion. Please do correct me if I am wrong. Thanks!

-- Leo

I agree with you. I don't know the answer to the first problem, except to say that it does appear to me that his BOWTIE result is 0-based and his MAQ result is 1-based.

I do think the problem stems from one genome being 0-based and one genome being 1-based. Or, basically, from one genome being one base off from the other genome.

However, I must question the original poster about the reference genomes used.

When I BLAT the reads from the original post, they are found at different positions from what you reported.

Code:

HWUSI-EAS751_0001:6:120:1760:653#0/1	35     1    35    35 100.0%     1   +    8845494   8845528     35
HWUSI-EAS751_0001:6:120:1780:1329#0/1	35     1    35    35 100.0%     1   +    9718146   9718180     35
HWUSI-EAS751_0001:6:120:1766:651#0/1	33     1    35    35  97.2%     1   +   16128568  16128602     35

You can see that the first two are 77412 bases off from your BOWTIE result in the first post, and the third one is 127412 off from your BOWTIE result.

Why would these come out so different from each other? I know BLAT is reporting the correct position, so I'm wondering why your BOWTIE and MAQ output are not. I'm also a little confused why the first two are the same number of bases off and then the third result is a suspiciously different number (77412 versus 127412... it seems odd that both end with -7412). Did you change the numbers when you posted them or is it something else?
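The offsets quoted above can be re-derived with a few lines of Python. This is only a sketch: the 0-based BOWTIE starts are assumed to be the SAM positions posted earlier in the thread minus one, and the BLAT starts are taken from the table above.

```python
# (read, assumed 0-based BOWTIE start, BLAT start as posted above)
reads = [
    ("1760:653", 8922907 - 1, 8845494),
    ("1780:1329", 9795559 - 1, 9718146),
    ("1766:651", 16255981 - 1, 16128568),
]

# Distance between the BOWTIE start and the BLAT start for each read.
offsets = [bowtie - blat for _, bowtie, blat in reads]
print(offsets)                   # [77412, 77412, 127412]
print(offsets[2] - offsets[0])   # 50000
```

The first two offsets match, and the third differs from them by exactly 50,000, which is why both numbers end in 7412.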

**dukevn** · 11-25-2009, 12:42 PM

Originally posted by Michael.James.Clark

You can see that the first two are 77412 bases off from your BOWTIE result in the first post, and the third one is 127412 off from your BOWTIE result.

Why would these come out so different from each other? I know BLAT is reporting the correct position, so I'm wondering why your BOWTIE and MAQ output are not. I'm also a little confused why the first two are the same number of bases off and then the third result is a suspiciously different number (77412 versus 127412... it seems odd that both end with -7412). Did you change the numbers when you posted them or is it something else?

Interesting! Now even BLAT gives a different result. OK, here is what I did:

1. Downloaded the reference chromFa.tar.gz from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/, untarred it, and deleted everything except the chrxx.fa files (xx = 1,...,22,M,X,Y).

2. Created the MAQ reference by using cat on all the fa files; created the BOWTIE index by using bowtie-build with the same set of fa files.

3. Ran the attached fq with MAQ and BOWTIE.

If you follow all of what I did above, I believe you would get the same results that I posted in the first post. I did not change anything when posting those.

Apart from the new issue with BLAT, I now think that everything was fine: MAQ reports 1-based and BOWTIE 0-based positions. That seems to be confirmed by running samtools on both map outputs, which gave the same results.

D.

Attached Files

ngsDataTest.fq.txt (2.9 KB, 34 views)

**Michael.James.Clark** · 11-25-2009, 01:06 PM

By the way, you asked about BED files being 0-based: Because a BED file represents a base as a range, it is 0-based. So the first base in a sequence is the base 0-1, the second base is 1-2, etc.
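As a rough illustration of the range convention described above, the following Python sketch converts a 1-based SAM start and its CIGAR string into a 0-based BED interval whose end is exclusive. The function name `sam_to_bed` is hypothetical, and the sketch only handles the basic case of a single alignment record; it is not a replacement for a real converter.

```python
import re
from typing import Tuple

# CIGAR operations that consume reference bases (per the SAM specification).
_REF_CONSUMING = set("MDN=X")


def sam_to_bed(chrom: str, pos: int, cigar: str) -> Tuple[str, int, int]:
    """Convert a 1-based SAM start plus CIGAR into a 0-based BED interval."""
    ref_len = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in _REF_CONSUMING
    )
    start = pos - 1        # 1-based inclusive -> 0-based inclusive
    end = start + ref_len  # BED end is exclusive
    return chrom, start, end


# First read from the SAM output posted earlier in the thread.
print(sam_to_bed("chr1", 8922907, "35M"))  # ('chr1', 8922906, 8922941)
```

So the 35-base read that SAM reports at 8922907 becomes the BED range 8922906-8922941.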

**dukevn** · 11-25-2009, 01:23 PM

Thanks Michael.

MAQ and BOWTIE, reads' positions different?
