
  • Bowtie2 paired-end reads, sometimes read 2 is output first

    When aligning paired-end reads with Bowtie2, the alignments are supposed to be output on alternating lines: read 1, read 2; read 1, read 2; and so on. For one particular type of alignment (note the similar SAM flags below), they come out in the opposite order, with read 2 first. Here are some examples, copied directly from the output .sam file:

    Code:
    HWI-ST724:196:C0LCGACXX:6:1101:12507:1939	153	chr14	109251865	42	101M	=	109251865	0	CAGACACTAACTCTGTAGTACACTATGGACAGAGATGGTCTAGCCCATCTTAAGCACCACCACCATTACACTACCACAACTACCAGGAGCAGCAACAGCAA	DDEDCDDCCCECEEDEFFFEFFFFHFGHFFJJIIICJIGIGCGIFIGIIIGHHHF:HD@HF?GHGFF<F?HE?IHFIIHHFAIIGGIJHHHHHFFFFFCCC	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UP
    HWI-ST724:196:C0LCGACXX:6:1101:12507:1939	69	chr14	109251865	0	*	=	109251865	0	NTTAGCTACTACAATAGATCCTGCTCATAGCCTTACCAGAAGTATCTCCTGCCTGCCCATTAGCTACTACAGACACTAACTCTGTAGTACACTATGGACAG	#1=DDFDFHHHGFHHIHJJIJGIJJJJJJJJJIJJGIIGGIJCFGIJIIJJJGGIJIJIJJJBGHIIJEIJJJJJDHHHHHFFDFFFEDEEEDEEDDDDDB	YT:Z:UP
    
    HWI-ST724:196:C0LCGACXX:6:1101:13494:1989	137	chr16	84333673	42	101M	=	84333673	0	TGTTACAGCAAATAAGCAAGACATAAATTAATTCAAGTGAGAAGGAGCCCAGTCTAATTTTCATAGCCTAAATGCCAGGGCCAGGAGGCAGGAGTGGGCAG	CCBFFFFFHHGHHJJJJJIJIJIFJIIIJJIIJEIGI?FFHGIHHIJJJJJHHIJIHJIJJIJGIJJJIJJJJJJJJHHGFFDFDDCDDDDDDD<ABDD?B	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UP
    HWI-ST724:196:C0LCGACXX:6:1101:13494:1989	69	chr16	84333673	0	*	=	84333673	0	NTTGTGCTGGTGCCCCCCCCCCCCAAAAACCCCCTTCCCCCCCCTTTTTTAGGGGGCTCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCTGCCCCCGGCTTT	#1=DDFFFHHCFFIIIIIIIIII##############################################################################	YT:Z:UP
    
    HWI-ST724:196:C0LCGACXX:6:1101:6762:2115	153	chr9	122691401	42	101M	=	122691401	0	TGATTTGGTTTGTCTTGGGGCCAGGGGGTGTTTTACCGAGGTTGTTGGTTGCACAGTTAGTATGGAGCCATTATTCCTAGAAATTGTTTAATGTAGTTTCA	C@>8ABDBCC??2@BB@ABA=;CDEB=FGHC@7E@HAF@B;C?CFEFBFD<BF@HFF?CC4BIGGIHEHCHGAEA?A>GIGIGGIGE?C?DD?>DDD=?<@	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:101	YT:Z:UP
    HWI-ST724:196:C0LCGACXX:6:1101:6762:2115	69	chr9	122691401	0	*	=	122691401	0	GCTCCTGCCTTAAAAAAAAAAAAAAAAAAAAAAAAGGTGTATAAGCCGCAAAGTAAAAGGGCCCCAGAATTTGTGAAATAAGATTGTGGTTTTCTTGCGGG	@@@1=?DDBD?HDBAFE:C<DGGGID6A?BBB#####################################################################	YT:Z:UP
    
    HWI-ST724:196:C0LCGACXX:6:1101:11774:2147	137	chr9	3032426	0	97M1D4M	=	3032426	0	TCCTAAAGTGTGTATTTCTCATTGGACGTGATTTTCAGGTTTCTCGCCATATTCCAGGTCCTACAGTGTGAATTTCTCATTTTTCATGTTTTCCTATATTT	@B@FDFFFHHBHDHIJJJJJEHIJIJJFGIGFHGID<FGGGHAGIIEIGIEGHGHIGJCHGHIIGJFHFGGIHJJGHHGHFHHFDFFFFFDDEEECCCDCF	AS:i:-51	XN:i:0	XM:i:8	XO:i:1	XG:i:1	NM:i:9	MD:Z:23T0T0C44C15C5T1A0G1^G4	YT:Z:UP
    HWI-ST724:196:C0LCGACXX:6:1101:11774:2147	69	chr9	3032426	0	*	=	3032426	0	AGGTAGTGAAATATGAAGAGAAATATAGGAAAACATGAAAAATGAGAAATTCACACTGTAGGACCTGGAATATGGCGAGAAACCTGAAAATCACGTCCAAT	@@@=DBDDHHHHDICDEEDHIIGHGGHIJJIIJIGIIIFIJIGHIIIIIGIJIIIIJIJIHJDHHIIIJJIJGIJEEFDEDDDECDCDDCDDCCDDDDDDD	YT:Z:UP
    
    HWI-ST724:196:C0LCGACXX:6:1101:10316:2079	153	chr1	177325404	40	101M	=	177325404	0	GTGTCTGTGTGTGTGGTGAGTGTTTTGCCTGCTTGTATGAGTGTACACCATGTGCATGTATCTGTTGCCCATGAAGGCCAGAAGAGGACATCATATCCCTG	CDDDDBBBDDDDDDEEDDEFFFFHHHHJJIIJJIJJIJIJJJIGGIHJIJIJJJJJIIJJJJJJJJIJJJJJJJJJJJIJJJJJJJJIFHGHHFFFFFCCC	AS:i:-17	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:4G42A29A23	YT:Z:UP
    HWI-ST724:196:C0LCGACXX:6:1101:10316:2079	69	chr1	177325404	0	*	=	177325404	0	TTTTCTACAAACCTTAAAGACTTCTATTTAGAAATGTTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCTGTGTGTCTGTGTGTGTGGTGAGTGTTTTGCC	CCCFFFFFHHHHHJJJJJJJJJJJJJJJJIIJIIIIJJJHIGGBFHHHIGHFFHIIHFHGHDHGHH<E)?EHHH;?);BDFCEAE(5=(;(9::@@#####	YT:Z:UP
    It only happens with SAM flags 153/137 for the first displayed read and flag 69 for the second displayed read (which is less than 1% of all reads in this particular dataset). I think this means read 2 maps but read 1 doesn't, and for some reason when only read 2 maps it's being output first. Flags 153 and 137 both indicate that a read is the "second in pair" yet it is being output first, while flag 69 is "first in pair" yet it is being output second. Flags checked here: http://picard.sourceforge.net/explain-flags.html
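    As a quick check of that reading of the flags, here is a minimal Python sketch that decodes a FLAG value into its named bits. The bit meanings follow the SAM specification; the names `SAM_FLAG_BITS` and `decode_flag` are made up for illustration and are not part of Bowtie2 or Picard.

```python
# Bit meanings of the SAM FLAG field, as listed in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# The three flags seen in the records above.
for flag in (153, 137, 69):
    print(flag, decode_flag(flag))
```

    Decoded this way, 153 and 137 are both "second in pair" with "mate unmapped" set, and 69 is "first in pair" with "read unmapped" set, which matches the interpretation that only read 2 aligned.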

    The other problem with these alignments is that for the unmapped read it's still giving a chromosome and coordinates, but it's just giving the values from the corresponding aligned read. I think it would make more sense for these values to be blank or have a * or something for the unmapped read.
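    For what it's worth, as far as I can tell the SAM specification recommends exactly this for an unmapped read whose mate is mapped (RNAME and POS copied from the mate, which keeps the pair together after coordinate sorting), so the reliable test for "unmapped" is FLAG bit 0x4 rather than the chromosome column. A minimal Python sketch of that check follows; `parse_placement` is a hypothetical helper, and the two records are abridged stand-ins with a 4-base sequence, not real reads.

```python
def parse_placement(sam_line):
    """Return (qname, rname, pos); rname and pos are None for unmapped reads."""
    fields = sam_line.rstrip("\n").split("\t")
    qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
    if flag & 0x4:
        # Read unmapped: RNAME/POS only mirror the mapped mate, so ignore them.
        return qname, None, None
    return qname, rname, pos


# Abridged, hypothetical records modelled on the first pair above.
mapped = "r1\t153\tchr14\t109251865\t42\t4M\t=\t109251865\t0\tACGT\tIIII"
unmapped = "r1\t69\tchr14\t109251865\t0\t*\t=\t109251865\t0\tACGT\tIIII"

print(parse_placement(mapped))
print(parse_placement(unmapped))
```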

    This is my first time doing much with Bowtie2 or with .sam files, so maybe I'm misunderstanding something (previously I was using Bowtie1 and its default .map output format). Otherwise, are these normal behaviours for Bowtie2?

  • #2
    The SAM format offers no guarantees about read order unless the file is sorted (which is one of the things that makes it inconvenient). So there is nothing wrong with the output. BBMap does currently output read 1 before read 2 (and I intend to keep it that way), but that's not a behavior you should design around because it is not part of the SAM spec.



    • #3
      OK, that's good to know about the SAM format in general, so I won't count on output being in any particular order. But the Bowtie2 manual does say it will output in read1-->read2 order: http://bowtie-bio.sourceforge.net/bo...red-sam-output. Since this only happens in the specific case I described (i.e. when only read 2 is aligned), I'm wondering if there's a bug in Bowtie2 when it encounters these types of alignments.



      • #4
        Oh a second question, when you say "Sam format offers no guarantees about read order" does that mean it can be read1-->read2 or read2-->read1, but that the pairs will at least always be next to each other? Or could read1 be on one particular line in the file and the corresponding read2 could be at some other random place in the file?



        • #5
          Yes, you can have read2 then read1, or read1 then read2. This can happen in both sorted and unsorted files. In most cases, unsorted files will have read1 before read2, but that's mostly just out of convenience. Unless a file is marked as being name (query) sorted, there's no absolute guarantee that the reads in a pair will be next to each other. Having said that, most aligners will print them next to each other (unless they produce coordinate-sorted output).

          Regarding the example from bowtie2, note that read1 is unmapped in each case. I could imagine that it outputs mapped reads before unmapped reads (this would make some sense). I should note that unless you allow singletons, bowtie2 will indeed always produce output where read1 comes before read2.
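          If downstream code needs read1 before read2, one option is to fix the order after the fact. Below is a minimal Python sketch that assumes the mates of a pair are already on adjacent lines (as in the bowtie2 output above); it will not bring together mates that are far apart in the file. `mate_rank` and `read1_first` are hypothetical names, and the demo records are abridged to their first few columns.

```python
from itertools import groupby


def mate_rank(line):
    """Sort key: 0 for first-in-pair (FLAG bit 0x40 set), 1 otherwise."""
    flag = int(line.split("\t")[1])
    return 0 if flag & 0x40 else 1


def read1_first(sam_lines):
    """Yield SAM lines with read1 ahead of read2 inside each run of equal QNAMEs."""
    def key(line):
        # Header lines start with '@'; a QNAME may not contain '@'.
        return None if line.startswith("@") else line.split("\t", 1)[0]

    for qname, group in groupby(sam_lines, key=key):
        if qname is None:
            yield from group  # pass header lines through untouched
        else:
            yield from sorted(group, key=mate_rank)  # stable sort keeps ties in place


# Abridged, hypothetical records: pair "a" arrives as read2 then read1.
demo = [
    "@HD\tVN:1.0",
    "a\t153\tchr14\t100",
    "a\t69\tchr14\t100",
    "b\t99\tchr1\t200",
    "b\t147\tchr1\t350",
]
for line in read1_first(demo):
    print(line)
```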



          • #6
            Also note the --reorder option for bowtie2:

            --reorder Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file, even when -p is set greater than 1. Specifying --reorder and setting -p greater than 1 causes Bowtie 2 to run somewhat slower and use somewhat more memory than if --reorder were not specified. Has no effect if -p is set to 1, since output order will naturally correspond to input order in that case.



            • #7
              FYI, the --reorder option doesn't actually ensure the order of reads in a pair if they're aligned as singletons. In other words, unless you use --no-mixed as well then read#2 can come first if it aligns by itself and read#1 can't align (assuming there are no valid concordant or discordant alignments). Yes, it would be very nice if this were documented.
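              To make the combination concrete, here is a minimal Python sketch that assembles a bowtie2 command line with both options. The flags used (`-p`, `--reorder`, `--no-mixed`, `-x`, `-1`, `-2`, `-S`) are standard bowtie2 options; the function name `bowtie2_command`, the index name and the file names are placeholders. Note that --no-mixed also changes the results: bowtie2 no longer tries to align mates individually, so pairs like the ones in the first post would be reported with both mates unaligned.

```python
def bowtie2_command(index, fastq1, fastq2, out_sam, threads=4, no_mixed=True):
    """Build an argv list for a paired-end bowtie2 run (nothing is executed here)."""
    cmd = ["bowtie2", "-p", str(threads), "--reorder"]
    if no_mixed:
        # Disable singleton alignments, so read2 cannot be reported on its own.
        cmd.append("--no-mixed")
    cmd += ["-x", index, "-1", fastq1, "-2", fastq2, "-S", out_sam]
    return cmd


# Placeholder index and file names; pass the list to subprocess.run() to execute.
print(" ".join(bowtie2_command("my_index", "reads_1.fastq", "reads_2.fastq", "out.sam")))
```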
