  • serenaliao
    Member
    • Jan 2013
    • 22

    Weird SAM format chromosome column

    Dear all,

    In my previous experience, the third column of a SAM file output by Bismark is usually the chromosome number prefixed with "chr". My new data, however, has a mixture of chr<number> and <number> in that column. Is that common?

    MWR-PRG-0014:106:C25B0ACXX:4:1101:1195:1983_1:N:0:CAAAAG/1 115 15 3399663 255 100M = 3399567 -196 CAAAAT
    ACCAAAAAATAAAACACAAAATAAAAAAACTATTCTTCCTACCTAAAAACATAATAACTTCCACATCAATAATTCTTTATTACATAAATTATAN #DDDC@</=BCEEEC>ECCD@B@?7EA=HIIIIGBIHG
    F>ACFFF??GGIIIHGCDD<<<B?19CIGHHIIIEHHEIHEIGIGEDEHHFHDHDDDBB=1# NM:i:23 XX:Z:2G3G3G1GGG1G1G5GG4G2GG12T2G1G1G1C2G1G31G4GT XM:Z:.
    .x...h...x.hhh.h.h.....xh....h..hh...............x.h.h....h.h...............................h....h. XR:Z:CT XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1195:1983_1:N:0:CAAAAG/2 179 15 3399567 255 100M = 3399663 196 AACAAA
    AACAATTATAAAACTAAACTAAAAAAATCCCAAATCAAAATTTTAATATTAATTTATTCATTCACCTCACAAATAAATAAAAATATTTATCAAA B@@DD;DDFDDHDEBFHIIGGIJJIJJIJIIE>BD<?F
    DFGBGGID@CGICCGGIJJJJIJJJI=AECHHAE>BFFCCEE@ECCDAAC@=C>CCCDC;AC NM:i:24 XX:Z:1G2GGG3G2G1GGG3GG3GG1G8GG4G5G2G2G25G4G6G3G1 XM:Z:.
    h..xhh...x..h.hhh...xh...xh.h........xh....h.....h..h..h.........................h....h......h...x. XR:Z:GA XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1234:1999_1:N:0:CAAAAG/1 115 chr9 57387259 255 100M = 57387164 --195 AAACAAGCATCTTAAAATAACTATTAAAATTCAAAAAACTATATATCCTCAAAACTAAAAATAATATCAAATCCATAATCTTAAAATCCTCTTTCTAAGN ?A>5>;-.;(.6:EEDA>?@
    ?=;DDDDACIIDECB=<??DBBEDDDDDD4DBIDIFCDD>EE?C9BDB<4<DEFEFEAEBDC+A9<B>DDBDB=A11# NM:i:21 XX:Z:1G3G8GG2GG2G2G7GG1G14G1G3G2G1G2G4G11GG15T
    XM:Z:.h...xH.......hh..hh..x..h.......xh.h..............x.h...h..h.h..h....h...........hh..............H. XR:Z:CT XG:Z:G
    A
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1234:1999_1:N:0:CAAAAG/2 179 chr9 57387164 255 100M = 57387259 -195 CAAAATATTTATACTACCTACTATATACACAACACAATACTAAATTCCACTTTACTCCCTAATCTTCCACTATTCCCTCTTCCCTAAACAAAAAAAAACA ?@@FF?D4=E?FF3:AA:
    C4C:FGG>3+AFHIJGDH@DHGBDHIGG?@<FGGGHJIDCDHCHBGGIIJIIBEE=EHE;A?E7;7?7;;AA@BDD@? NM:i:19 XX:Z:6G3G1G2G3G2G1G1G4G3GG1G3G10G7G25G4G1G1G3-XM:Z:......h...h.h..x...x..x.h.h....x...zx.h...h..........h.......h.........................h....h.h.h... XR:Z:GA XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1406:1986_1:N:0:CAAAAG/1 115 12 77985032 255 100M = 77984894 --238 AATATTAACATAAAACCAAAACAACAAAATATCTAAAACACTCCAATCCCCACTCATTCCAAACTTCAAACTACTAAATCAAAAATATTACATTTCATTN CDAEDEED@C>DDBB@EFFFD@
    HHCHHGGECCAEHD=.=8HFJHIHCHEGB9IIIHGGGIJIGIGDGIHHGEEJEJGIGIJJJJJIJHHFGHFFFDD=4# NM:i:30 XX:Z:3G2GG1G1GG1G3GG2GG3GG1G3GG8G16G6GG5GGG3GG
    GG1G2G9A XM:Z:...h..hh.z.hh.h...xh..zx...hh.h...xh........z................x......xh.....xhh...xhhh.h..h.......... XR:Z:C
    T XG:Z:GA
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    That's odd. You might run the following on your reference FASTA file to see whether this is expected:
    Code:
    grep ">" reference_genome.fa
    If ">15" pops up, then this is normal, though it'd be odd to have that and chr9 in the same FASTA file. Bismark does play around a bit with contig names, but a bug in the code dealing with that should result in different behaviour.


    • fkrueger
      Senior Member
      • Sep 2009
      • 627

      #3
      Bismark takes whatever the FASTA files had in the header up to the first whitespace. If you get '15' and 'chr9' in the output, I would assume that these entries looked like '>15' and '>chr9' in the FASTA files you used for the genome indexing process. I think it does replace '|' characters with underscores, but it would certainly not add or remove 'chr'.
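To check a reference for mixed naming in one pass, the header names can be counted by style. This is a minimal sketch that assumes, as described above, that the sequence name is the header text up to the first whitespace; the function name `count_chr_styles` is made up for illustration and is not part of Bismark.

```shell
# Hypothetical helper (not part of Bismark): reads FASTA on stdin and
# reports how many sequence names start with "chr" and how many do not.
count_chr_styles() {
    awk '/^>/ {
             name = substr($1, 2)      # header up to first whitespace, minus ">"
             if (name ~ /^chr/) n_chr++; else n_plain++
         }
         END { printf "with_chr=%d without_chr=%d\n", n_chr, n_plain }'
}

# Usage (file name taken from the grep example in this thread):
#   count_chr_styles < reference_genome.fa
```

If both counts are non-zero, the reference itself mixes the two naming styles, which would explain a mixed third column in the SAM output.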


      • serenaliao
        Member
        • Jan 2013
        • 22

        #4
        Thanks, fkrueger.

        You are right. This is what happened with my FASTA file (some entries are chr<number> and some are <number>). Is there a convenient way to add "chr" in front of the chromosome number in the third column of the SAM file when it is missing? Thanks!


        • serenaliao
          Member
          • Jan 2013
          • 22

          #5
          Just to follow up, I used the command below. Does this sound reasonable?
          Code:
          awk '{if($3!~/^chr/){$3="chr"$3} print($0)}' filename


          • fkrueger
            Senior Member
            • Sep 2009
            • 627

            #6
            I am no expert with awk, but it looks OK, and it should be easy enough to test (maybe on a few lines first). Any idea why your FASTA files have mixed chromosome names?
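When testing on a few lines, three things are worth checking. With awk's default separators, assigning to $3 rebuilds the line with single spaces, while SAM is tab-delimited. Header lines starting with "@" also have a third field, so they would be changed too. Unmapped reads have "*" in column 3. The following is a sketch of a variant that handles those cases, not a tested replacement for the command above. The function name `add_chr_prefix` is made up for illustration, and it assumes the @SQ header names and the mate reference in column 7 (RNEXT) should be renamed the same way.

```shell
# Hypothetical helper: reads SAM on stdin, writes SAM with "chr" added to
# bare reference names. Tabs are kept as the field separator.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^@/ {
             # Header lines: only @SQ lines carry a reference name (SN: tag).
             if ($1 == "@SQ")
                 for (i = 2; i <= NF; i++)
                     if ($i ~ /^SN:/ && $i !~ /^SN:chr/)
                         $i = "SN:chr" substr($i, 4)
             print
             next
         }
         {
             # Column 3 (RNAME): "*" means unmapped, so it is left alone.
             if ($3 != "*" && $3 !~ /^chr/) $3 = "chr" $3
             # Column 7 (RNEXT): "=" and "*" are placeholders, not names.
             if (NF >= 7 && $7 != "=" && $7 != "*" && $7 !~ /^chr/) $7 = "chr" $7
             print
         }'
}

# Usage: add_chr_prefix < input.sam > output.sam
```

Renaming the names in the reference FASTA and rebuilding the genome index would avoid the post-processing altogether, at the cost of rerunning the alignment.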
