
  • Weird SAM format chromosome column

    Dear all,

    In my previous experience, the third column of the SAM files output by Bismark is usually the chromosome number with "chr" in front. My new data, however, has a mixture of chr<number> and <number> in that column. Is that common?

    MWR-PRG-0014:106:C25B0ACXX:4:1101:1195:1983_1:N:0:CAAAAG/1 115 15 3399663 255 100M = 3399567 -196 CAAAATACCAAAAAATAAAACACAAAATAAAAAAACTATTCTTCCTACCTAAAAACATAATAACTTCCACATCAATAATTCTTTATTACATAAATTATAN #DDDC@</=BCEEEC>ECCD@B@?7EA=HIIIIGBIHGF>ACFFF??GGIIIHGCDD<<<B?19CIGHHIIIEHHEIHEIGIGEDEHHFHDHDDDBB=1# NM:i:23 XX:Z:2G3G3G1GGG1G1G5GG4G2GG12T2G1G1G1C2G1G31G4GT XM:Z:..x...h...x.hhh.h.h.....xh....h..hh...............x.h.h....h.h...............................h....h. XR:Z:CT XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1195:1983_1:N:0:CAAAAG/2 179 15 3399567 255 100M = 3399663 196 AACAAAAACAATTATAAAACTAAACTAAAAAAATCCCAAATCAAAATTTTAATATTAATTTATTCATTCACCTCACAAATAAATAAAAATATTTATCAAA B@@DD;DDFDDHDEBFHIIGGIJJIJJIJIIE>BD<?FDFGBGGID@CGICCGGIJJJJIJJJI=AECHHAE>BFFCCEE@ECCDAAC@=C>CCCDC;AC NM:i:24 XX:Z:1G2GGG3G2G1GGG3GG3GG1G8GG4G5G2G2G25G4G6G3G1 XM:Z:.h..xhh...x..h.hhh...xh...xh.h........xh....h.....h..h..h.........................h....h......h...x. XR:Z:GA XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1234:1999_1:N:0:CAAAAG/1 115 chr9 57387259 255 100M = 57387164 -195 AAACAAGCATCTTAAAATAACTATTAAAATTCAAAAAACTATATATCCTCAAAACTAAAAATAATATCAAATCCATAATCTTAAAATCCTCTTTCTAAGN ?A>5>;-.;(.6:EEDA>?@?=;DDDDACIIDECB=<??DBBEDDDDDD4DBIDIFCDD>EE?C9BDB<4<DEFEFEAEBDC+A9<B>DDBDB=A11# NM:i:21 XX:Z:1G3G8GG2GG2G2G7GG1G14G1G3G2G1G2G4G11GG15T XM:Z:.h...xH.......hh..hh..x..h.......xh.h..............x.h...h..h.h..h....h...........hh..............H. XR:Z:CT XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1234:1999_1:N:0:CAAAAG/2 179 chr9 57387164 255 100M = 57387259 -195 CAAAATATTTATACTACCTACTATATACACAACACAATACTAAATTCCACTTTACTCCCTAATCTTCCACTATTCCCTCTTCCCTAAACAAAAAAAAACA ?@@FF?D4=E?FF3:AA:C4C:FGG>3+AFHIJGDH@DHGBDHIGG?@<FGGGHJIDCDHCHBGGIIJIIBEE=EHE;A?E7;7?7;;AA@BDD@? NM:i:19 XX:Z:6G3G1G2G3G2G1G1G4G3GG1G3G10G7G25G4G1G1G3 XM:Z:......h...h.h..x...x..x.h.h....x...zx.h...h..........h.......h.........................h....h.h.h... XR:Z:GA XG:Z:GA
    MWR-PRG-0014:106:C25B0ACXX:4:1101:1406:1986_1:N:0:CAAAAG/1 115 12 77985032 255 100M = 77984894 -238 AATATTAACATAAAACCAAAACAACAAAATATCTAAAACACTCCAATCCCCACTCATTCCAAACTTCAAACTACTAAATCAAAAATATTACATTTCATTN CDAEDEED@C>DDBB@EFFFD@HHCHHGGECCAEHD=.=8HFJHIHCHEGB9IIIHGGGIJIGIGDGIHHGEEJEJGIGIJJJJJIJHHFGHFFFDD=4# NM:i:30 XX:Z:3G2GG1G1GG1G3GG2GG3GG1G3GG8G16G6GG5GGG3GGGG1G2G9A XM:Z:...h..hh.z.hh.h...xh..zx...hh.h...xh........z................x......xh.....xhh...xhhh.h..h.......... XR:Z:CT XG:Z:GA
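A quick way to see how widespread the mixture is would be to tally the distinct values in column 3. The sketch below is only an illustration: `demo.sam` and its three truncated records are made up for the example, and it assumes the real file is tab-separated as usual for SAM.

```shell
# Hypothetical miniature SAM file (one header line, three shortened records).
printf '@HD\tVN:1.0\n' > demo.sam
printf 'r1\t115\t15\t3399663\t255\t4M\t=\t3399567\t-196\tACGT\tIIII\n' >> demo.sam
printf 'r2\t115\tchr9\t57387259\t255\t4M\t=\t57387164\t-195\tACGT\tIIII\n' >> demo.sam
printf 'r3\t115\t12\t77985032\t255\t4M\t=\t77984894\t-238\tACGT\tIIII\n' >> demo.sam

# Skip header lines (they start with "@"), then count each distinct value of column 3.
grep -v '^@' demo.sam | cut -f3 | LC_ALL=C sort | uniq -c
```

On a real file, replacing `demo.sam` with the Bismark output should list every reference name that occurs, together with the number of reads assigned to it.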

  • #2
    That's odd. You might run the following on your reference FASTA file to see whether this is expected:
    Code:
    grep ">" reference_genome.fa
    If ">15" pops up, then this is normal, though it'd be odd to have that and chr9 in the same FASTA file. Bismark does play around a bit with contig names, but a bug in the code dealing with that should result in different behaviour.
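To go one step beyond eyeballing the grep output, the headers with and without the prefix can be counted. This is a minimal sketch; `demo_reference.fa` is a made-up three-sequence file standing in for the real reference.

```shell
# Hypothetical miniature reference with mixed naming.
printf '>chr9 some description\nACGT\n>15\nACGT\n>12\nACGT\n' > demo_reference.fa

# Count header lines that start with ">chr" and those that do not.
with_chr=$(grep -c '^>chr' demo_reference.fa)
without_chr=$(grep '^>' demo_reference.fa | grep -vc '^>chr')
echo "with chr: $with_chr, without chr: $without_chr"
```

If both counts are non-zero for the real reference, the mixture in the SAM file most likely comes from the FASTA headers themselves.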


    • #3
      In reply to dpryan:
      Bismark takes whatever the FASTA files had in the header up to the first whitespace. If you get '15' and 'chr9' in the output, I would assume that these entries looked like '>15' and '>chr9' in the FASTA files you used for the genome indexing process. I think it does replace '|' characters with underscores, but it would certainly not add or remove 'chr'.
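The "header up to the first whitespace" rule can be mimicked with awk to preview the names that would end up in column 3. This is only an approximation of the behaviour described above, not Bismark's own code, and `demo_headers.fa` is a made-up file.

```shell
# Hypothetical FASTA with descriptions after the sequence names.
printf '>chr9 dna:chromosome\nACGT\n>15 another description\nACGT\n' > demo_headers.fa

# For each header line, drop the leading ">" and print the text up to the first whitespace.
awk '/^>/ { sub(/^>/, ""); print $1 }' demo_headers.fa
```

The sketch leaves out the '|' replacement mentioned above, since that detail is stated only tentatively.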


      • #4
        In reply to fkrueger:
        Thanks fkrueger,

        You are right. This is what happened with my FASTA file (some headers are chr<number> and some are <number>). Is there a convenient way to add "chr" before the chromosome number in the third column of the SAM file when it is missing? Thanks!


        • #5
          In reply to serenaliao:
          Just to follow up, I used awk '{if($3!~/^chr/){$3="chr"$3} print($0)}' filename. Does this sound reasonable?


          • #6
            I am no expert with awk, but it looks OK and should be easy enough to test (maybe on a few lines first). Any clue why your FASTA files have mixed chromosome names?
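A few things are worth checking when testing the one-liner. With awk's default separators, assigning to $3 rebuilds the line with single spaces instead of tabs. Header lines starting with "@" are also processed, and unmapped reads carry "*" in column 3. The sketch below is one possible variant that keeps tabs and handles those cases. The file names and demo records are made up, and whether downstream tools accept the renamed file has not been verified here.

```shell
# Hypothetical tab-separated SAM: one @SQ header line, two mapped reads, one unmapped read.
printf '@SQ\tSN:15\tLN:1000\n' > demo_mixed.sam
printf 'r1\t115\t15\t100\t255\t4M\t=\t50\t-54\tACGT\tIIII\n' >> demo_mixed.sam
printf 'r2\t115\tchr9\t100\t255\t4M\t=\t50\t-54\tACGT\tIIII\n' >> demo_mixed.sam
printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' >> demo_mixed.sam

awk 'BEGIN { FS = OFS = "\t" }   # read and write tab-separated fields
     /^@/ {
         # Header lines: only rename the SN: field of @SQ records.
         if ($1 == "@SQ") {
             for (i = 2; i <= NF; i++) {
                 if ($i ~ /^SN:/ && $i !~ /^SN:chr/) sub(/^SN:/, "SN:chr", $i)
             }
         }
         print
         next
     }
     {
         # Alignment lines: prefix column 3 unless it already has "chr" or is "*".
         if ($3 != "*" && $3 !~ /^chr/) $3 = "chr" $3
         # Column 7 (mate reference) may hold a name too; leave "=" and "*" alone.
         if ($7 != "=" && $7 != "*" && $7 !~ /^chr/) $7 = "chr" $7
         print
     }' demo_mixed.sam > demo_fixed.sam

# Show the renamed reference column of the alignment lines.
grep -v '^@' demo_fixed.sam | cut -f3
```

Renaming the headers in the FASTA file and rebuilding the index may be the cleaner long-term fix, since the SAM file and the reference then stay consistent.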
