Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • MAQ and BOWTIE, reads' positions different?

    Hi all,

    I am testing MAQ and BOWTIE for the same fq file which is aligned to human. Human references were downloaded from ucsc and were converted to bowtie references by bowtie-build. The same set of references was concatenated to a big unique fa file.

    Below are the results from BOWTIE and MAQ correspondingly:

    Code:
    HWUSI-EAS751_0001:6:120:1760:653#0/1	+	chr1	8922906	GTGAACCACAGGCCCCTTGTTCTCAGGAGCCCTCC	BBBBCBCBBBBBBBBBBBBAABBBAA??>=A>@??	0	
    HWUSI-EAS751_0001:6:120:1780:1329#0/1	+	chr1	9795558	CGGGCGTGGGGAACTGCCGGGAGTTCAGGTACGAG	BBBBBBABBBBBBABBBABBB?B?BABAA=AAB>A	0	
    HWUSI-EAS751_0001:6:120:1766:651#0/1	-	chr1	16255980	AGACCTACAAGCAAGACTGGGAGAACCAGCAGGTG	A?2=>?=BBBABBBA=BBBBB@BB@BBBBBBBBBB	0	8:T>C
    HWUSI-EAS751_0001:6:120:1773:654#0/1	-	chr1	25571700	AACAGTTCTGAGACTAGCTGGCAAGTCAATGTTGG	AAAA@?>@6=<=@@ABA<5BB<BBBABBBCB;BCB	0	
    HWUSI-EAS751_0001:6:120:1781:606#0/1	+	chr1	32696570	GGTCACTTTGGACCTATCAACAGTGTTGCCTTCCA	BBBBCBCCCBBBBBCCCBB<ABB?B?ABBABAAA>	0
    Code:
    WUSI-EAS751_0001:6:120:1760:653#0/1	chr1	8922907	+	0	0	67	67	67	0	0	1	0	35	GTGAACCACAGGCCCCTTGTTCTCAGGAGCCCTCC	BBBBCBCBBBBBBBBBBBBAABBBAA??>=A>@??
    USI-EAS751_0001:6:120:1780:1329#0/1	chr1	9795559	+	0	0	67	67	67	0	0	1	0	35	CGGGCGTGGGGAACTGCCGGGAGTTCAGGTACGAG	BBBBBBABBBBBBABBBABBB?B?BABAA=AAB>A
    WUSI-EAS751_0001:6:120:1766:651#0/1	chr1	16255981	-	0	0	45	45	45	1	30	0	1	35	AGaCCTACAAGCAAGACTGGGAGAACCAGCAGGTG	A?2=>?=BBBABBBA=BBBBB@BB@BBBBBBBBBB
    WUSI-EAS751_0001:6:120:1773:654#0/1	chr1	25571701	-	0	0	64	64	64	0	0	1	0	35	AACAGTTCtGAGACTAGCtGGCAAGTCAATGtTGG	AAAA@?>@6=<=@@ABA<5BB<BBBABBBCB;BCB
    WUSI-EAS751_0001:6:120:1781:606#0/1	chr1	32696571	+	0	0	30	30	30	0	0	1	1	35	GGTCACTTTGGACCTATCAACAGTGTTGCCTTCCA	BBBBCBCCCBBBBBCCCBB<ABB?B?ABBABAAA>
    My question relates to the position of the match reported by MAQ and BOWTIE. Why they are always off one unit? How do I understand position of these matches?

    Thank you in advanced,

    D.

  • #2
    It looks like your BOWTIE reference genome is 0-based and your MAQ reference genome is 1-based.

    Basically, the first one is starting at 0 and the second one is starting at 1.

    These may help you understand:



    Both aligners are aligning to the same position, but your two reference genomes are calling the positions by different numbers.
    Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
    Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
    Projects: U87MG whole genome sequence [Website] [Paper]

    Comment


    • #3
      Originally posted by Michael.James.Clark View Post
      It looks like your BOWTIE reference genome is 0-based and your MAQ reference genome is 1-based.

      Basically, the first one is starting at 0 and the second one is starting at 1.
      Thanks Michael for pointing it out. But I created the references from the same set of chrxx from UCSC by doing like below for BOWTIE ref:
      Code:
      $ bowtie-build <chr1.fa,...> hg19
      and for MAQ ref
      Code:
      $ cd <path to folder contain all chrs>
      $ files=*
      $ cat $files>hg19.fa
      What I am doing wrong?

      Thanks for the links. They are really helpful.

      Originally posted by Michael.James.Clark View Post
      Both aligners are aligning to the same position, but your two reference genomes are calling the positions by different numbers.
      So if I understand it correctly, MAQ gives result with 0-based, and BOWTIE with 1-based reference. When creating BED file, for example with the below read result from MAQ
      Code:
      HWUSI-EAS751_0001:6:120:1760:653#0/1	+	chr1	8922906	GTGAACCACAGGCCCCTTGTTCTCAGGAGCCCTCC	BBBBCBCBBBBBBBBBBBBAABBBAA??>=A>@??	0
      then the position of that read should be 8922906? Please correct me if I am wrong.

      Thanks,

      D.

      Comment


      • #4
        Originally posted by dukevn View Post
        Thanks Michael for pointing it out. But I created the references from the same set of chrxx from UCSC by doing like below for BOWTIE ref:
        Code:
        $ bowtie-build <chr1.fa,...> hg19
        and for MAQ ref
        Code:
        $ cd <path to folder contain all chrs>
        $ files=*
        $ cat $files>hg19.fa
        What I am doing wrong?



        Thanks for the links. They are really helpful.



        So if I understand it correctly, MAQ gives result with 0-based, and BOWTIE with 1-based reference. When creating BED file, for example with the below read result from MAQ
        Code:
        HWUSI-EAS751_0001:6:120:1760:653#0/1	+	chr1	8922906	GTGAACCACAGGCCCCTTGTTCTCAGGAGCCCTCC	BBBBCBCBBBBBBBBBBBBAABBBAA??>=A>@??	0
        then the position of that read should be 8922906? Please correct me if I am wrong.

        Thanks,

        D.
        Regardless, the SAM format specifies that the position be one-based. There could be problem with the alignment or the converter from MAP to SAM. Have you looked at the alignments in the MAP format to see if it is the alignments or the SAM converter?

        Nils

        Comment


        • #5
          Originally posted by nilshomer View Post
          Regardless, the SAM format specifies that the position be one-based. There could be problem with the alignment or the converter from MAP to SAM. Have you looked at the alignments in the MAP format to see if it is the alignments or the SAM converter?

          Nils
          You saved my day Nils. The two files in my first post are bowtie.map and maq.map correspondingly (without any conversion). I just used samtools to convert maq.map and bowtie.map to sam, and the results are the same!

          Code:
          WUSI-EAS751_0001:6:120:1760:653#0/1	0	chr1	8922907	67	35M	*	0	0	GTGAACCACAGGCCCCTTGTTCTCAGGAGCCCTCC	BBBBCBCBBBBBBBBBBBBAABBBAA??>=A>@??	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
          USI-EAS751_0001:6:120:1780:1329#0/1	0	chr1	9795559	67	35M	*	0	0	CGGGCGTGGGGAACTGCCGGGAGTTCAGGTACGAG	BBBBBBABBBBBBABBBABBB?B?BABAA=AAB>A	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
          WUSI-EAS751_0001:6:120:1766:651#0/1	16	chr1	16255981	45	35M	*	0	0	AGACCTACAAGCAAGACTGGGAGAACCAGCAGGTG	A?2=>?=BBBABBBA=BBBBB@BB@BBBBBBBBBB	MF:i:0	NM:i:1	UQ:i:30	H0:i:0	H1:i:1
          WUSI-EAS751_0001:6:120:1773:654#0/1	16	chr1	25571701	64	35M	*	0	0	AACAGTTCTGAGACTAGCTGGCAAGTCAATGTTGG	AAAA@?>@6=<=@@ABA<5BB<BBBABBBCB;BCB	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
          WUSI-EAS751_0001:6:120:1781:606#0/1	0	chr1	32696571	30	35M	*	0	0	GGTCACTTTGGACCTATCAACAGTGTTGCCTTCCA	BBBBCBCCCBBBBBCCCBB<ABB?B?ABBABAAA>	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:1
          I still dont understand why nobody ever saw this "issue" before?

          D.

          PS: there is still something weird with MAQ, as you can see my reads' names were cut off a little bit. But I guess that is no important.

          Comment


          • #6
            Originally posted by dukevn View Post
            Thanks Michael for pointing it out. But I created the references from the same set of chrxx from UCSC by doing like below for BOWTIE ref:
            Code:
            $ bowtie-build <chr1.fa,...> hg19
            and for MAQ ref
            Code:
            $ cd <path to folder contain all chrs>
            $ files=*
            $ cat $files>hg19.fa
            What I am doing wrong?
            When you download from UCSC, you're downloading a 0-based reference genome. So when you cat the files together, you're catting together 0-based references.

            On the other hand, when you have bowtie build the file for you, it appears that it recognizes that the input files are 0-based and corrects them to be 1-based.

            So if I understand it correctly, MAQ gives result with 0-based, and BOWTIE with 1-based reference.
            MAQ gives you the correct location according to the reference genome you fed MAQ to begin with. However, your reference genome was 0-based.

            If you feed MAQ a 1-based reference genome (and I recommend that you do this in the future), it will of course give you the same output as bowtie (since bowtie used a 1-based reference genome as well).

            I think this is effectively what you did when you "reconstructed" them using Samtools.

            I still dont understand why nobody ever saw this "issue" before?
            No worries, it's actually a very well recognized issue when dealing with UCSC's reference genomes. That's why they have it in their FAQ!
            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
            Projects: U87MG whole genome sequence [Website] [Paper]

            Comment


            • #7
              Originally posted by dukevn View Post
              You saved my day Nils. The two files in my first post are bowtie.map and maq.map correspondingly (without any conversion). I just used samtools to convert maq.map and bowtie.map to sam, and the results are the same!

              Code:
              WUSI-EAS751_0001:6:120:1760:653#0/1	0	chr1	8922907	67	35M	*	0	0	GTGAACCACAGGCCCCTTGTTCTCAGGAGCCCTCC	BBBBCBCBBBBBBBBBBBBAABBBAA??>=A>@??	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
              USI-EAS751_0001:6:120:1780:1329#0/1	0	chr1	9795559	67	35M	*	0	0	CGGGCGTGGGGAACTGCCGGGAGTTCAGGTACGAG	BBBBBBABBBBBBABBBABBB?B?BABAA=AAB>A	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
              WUSI-EAS751_0001:6:120:1766:651#0/1	16	chr1	16255981	45	35M	*	0	0	AGACCTACAAGCAAGACTGGGAGAACCAGCAGGTG	A?2=>?=BBBABBBA=BBBBB@BB@BBBBBBBBBB	MF:i:0	NM:i:1	UQ:i:30	H0:i:0	H1:i:1
              WUSI-EAS751_0001:6:120:1773:654#0/1	16	chr1	25571701	64	35M	*	0	0	AACAGTTCTGAGACTAGCTGGCAAGTCAATGTTGG	AAAA@?>@6=<=@@ABA<5BB<BBBABBBCB;BCB	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:0
              WUSI-EAS751_0001:6:120:1781:606#0/1	0	chr1	32696571	30	35M	*	0	0	GGTCACTTTGGACCTATCAACAGTGTTGCCTTCCA	BBBBCBCCCBBBBBCCCBB<ABB?B?ABBABAAA>	MF:i:0	NM:i:0	UQ:i:0	H0:i:1	H1:i:1
              I still dont understand why nobody ever saw this "issue" before?

              D.

              PS: there is still something weird with MAQ, as you can see my reads' names were cut off a little bit. But I guess that is no important.
              MAQ has a read length and read name upper limit. It's hard-coded!

              Comment


              • #8
                Originally posted by Michael.James.Clark View Post
                When you download from UCSC, you're downloading a 0-based reference genome. So when you cat the files together, you're catting together 0-based references.

                On the other hand, when you have bowtie build the file for you, it appears that it recognizes that the input files are 0-based and corrects them to be 1-based.

                MAQ gives you the correct location according to the reference genome you fed MAQ to begin with. However, your reference genome was 0-based.
                You made it very clear Michael. I dont know that BOWTIE has ability to correct 0-based to 1-based.

                Originally posted by Michael.James.Clark View Post
                If you feed MAQ a 1-based reference genome (and I recommend that you do this in the future),...
                Could you please let me know why?

                Originally posted by Michael.James.Clark View Post
                I think this is effectively what you did when you "reconstructed" them using Samtools.

                No worries, it's actually a very well recognized issue when dealing with UCSC's reference genomes. That's why they have it in their FAQ!
                I see. So because UCSC "thinks" that the BED file starts with 0-based, when converting SAM output (which is 1-based) to BED, I should offset it by one unit, am I correct? For example, for the first read, its position reported by SAM is 8922907, then its position in BED should be 8922906?

                By the way, can anybody recommend me a tool to convert SAM to BED / WIG?

                Thanks,

                D.

                Comment


                • #9
                  Originally posted by nilshomer View Post
                  MAQ has a read length and read name upper limit. It's hard-coded!
                  I did guess something like that, I just want to be sure. Thanks for your confirmation Nils.

                  D.

                  Comment


                  • #10
                    When I read through this post, I become really confused:

                    1. It is obvious from the original post that Bowtie reports 0-based alignments while MAQ reports 1-based ones, not the other way around.

                    2. When you build the genome index from fa files, all you use are the actual sequences plus the headers. There are no coordinates involved, and I can't see why it makes sense to call a reference sequence/genome itself 0-based or 1-based. It is how you later refer to it that makes the difference.

                    In summary, all I see here is that using the same reference genome, Bowtie reports alignments in a 0-based fashion (but using the SAM output should fix that) while MAQ reports in a 1-based fashion. Please do correct me if I am wrong. Thanks!

                    -- Leo

                    Comment


                    • #11
                      Originally posted by HTS View Post
                      When I read through this post, I become really confused:

                      1. It is obvious from the original post that Bowtie reports 0-based alignments while MAQ reports 1-based ones, not the other way around.

                      2. When you build the genome index from fa files, all you use are the actual sequences plus the headers. There are no coordinates involved, and I can't see why it makes sense to call a reference sequence/genome itself 0-based or 1-based. It is how you later refer to it that makes the difference.

                      In summary, all I see here is that using the same reference genome, Bowtie reports alignments in a 0-based fashion (but using the SAM output should fix that) while MAQ reports in a 1-based fashion. Please do correct me if I am wrong. Thanks!

                      -- Leo
                      Well, it is for sure that MAQ and BOWTIE report results differently (with my data and reference, please check the first post). From what people are discussion, there are two assumptions here:

                      1. References built from same set of fa files are the same; MAQ reports 1-based, BOWTIE reports 0-based.

                      2. References built from same set of fa files are different; MAQ and BOWTIE report the same.

                      Honestly I have no idea which one is correct. Michael suggested that assumption 2 be correct (post #6). Maybe we need more infos and more evidence (documents) to prove that?

                      D.

                      Comment


                      • #12
                        Originally posted by HTS View Post
                        When I read through this post, I become really confused:

                        1. It is obvious from the original post that Bowtie reports 0-based alignments while MAQ reports 1-based ones, not the other way around.

                        2. When you build the genome index from fa files, all you use are the actual sequences plus the headers. There are no coordinates involved, and I can't see why it makes sense to call a reference sequence/genome itself 0-based or 1-based. It is how you later refer to it that makes the difference.

                        In summary, all I see here is that using the same reference genome, Bowtie reports alignments in a 0-based fashion (but using the SAM output should fix that) while MAQ reports in a 1-based fashion. Please do correct me if I am wrong. Thanks!

                        -- Leo
                        I agree with you, and I don't know the answer to the first problem but to say that it does appear to me that his BOWTIE result is 0-based and his MAQ result is 1-based.

                        I do think the problem stems from one genome being 0-based and one genome being 1-based. Or, basically, from one genome being one base off from the other genome.

                        However, I must question the original poster about the reference genomes used.

                        When I BLAT the reads from the original post, they are found at different positions from what you reported.

                        Code:
                        HWUSI-EAS751_0001:6:120:1760:653#0/1	35     1    35    35 100.0%     1   +    8845494   8845528     35
                        HWUSI-EAS751_0001:6:120:1780:1329#0/1	35     1    35    35 100.0%     1   +    9718146   9718180     35
                        HWUSI-EAS751_0001:6:120:1766:651#0/1	33     1    35    35  97.2%     1   +   16128568  16128602     35
                        You can see that the first two are 77412 bases off from your BOWTIE result in the first post, and the third one is 127412 off from your BOWTIE result.

                        Why would these come out so different from each other? I know BLAT is reporting the correct position, so I'm wondering why your BOWTIE and MAQ output are not. I'm also a little confused why the first two are the same number of bases off and then the third result is a suspiciously different number (77412 versus 127412... it seems odd that both end with -7412). Did you change the numbers when you posted them or is it something else?
                        Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                        Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                        Projects: U87MG whole genome sequence [Website] [Paper]

                        Comment


                        • #13
                          Originally posted by Michael.James.Clark View Post
                          You can see that the first two are 77412 bases off from your BOWTIE result in the first post, and the third one is 127412 off from your BOWTIE result.

                          Why would these come out so different from each other? I know BLAT is reporting the correct position, so I'm wondering why your BOWTIE and MAQ output are not. I'm also a little confused why the first two are the same number of bases off and then the third result is a suspiciously different number (77412 versus 127412... it seems odd that both end with -7412). Did you change the numbers when you posted them or is it something else?
                          Interesting! Now even BLAT gives a different result . OK, here is what I did:

                          1. Downloaded reference chromFa.tar.gz from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/, untar, delete all but those chrxx.fa files. (xx = 1,...,22,M,X,Y).

                          2. Created MAQ reference by using cat all fa files; created BOWTIE index by using bowtie-build with the same set of fa files.

                          3. Run the attached fq with MAQ and BOWTIE.

                          If you follow all of what I did above, I believe you would get the same results that I posted in the first post. I did not change anything when posting those.

                          Except the new issue with BLAT, now I do think that everything was just fine: MAQ reports 1-based and BOWTIE 0-based. That seems to be confirmed by using samtools on both map outputs, and the same results were archived.

                          D.
                          Attached Files

                          Comment


                          • #14
                            By the way, you asked about BED files being 0-based: Because a BED file represents a base as a range, it is 0-based. So the first base in a sequence is the base 0-1, the second base is 1-2, etc.
                            Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
                            Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
                            Projects: U87MG whole genome sequence [Website] [Paper]

                            Comment


                            • #15
                              Originally posted by Michael.James.Clark View Post
                              By the way, you asked about BED files being 0-based: Because a BED file represents a base as a range, it is 0-based. So the first base in a sequence is the base 0-1, the second base is 1-2, etc.
                              Thanks Michael.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 11:49 AM
                              0 responses
                              15 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-24-2024, 08:47 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              61 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X