  • CRAM compression and TLEN SAM's field

    Hi there,

    I'm evaluating CRAM compression by comparing the same BAM sample before and after a CRAM compression/decompression round trip.

    I found that in my data the TLEN (template length) field is systematically modified: its value is increased by 1.

    For instance, an original read like this:
    Code:
    SRR107049.155163702     163     1       69171   44      76M     =       69404   308     CTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCACCTTCACTCTCCCATG    S?;<CC>D=?@=C<>D?A@F?@CCD3A?<C?>=>G?CA=:AE>E@HC><??CDAEABD??:9=<AAFCBCABE<>S    RG:Z:SRR107049  NM:i:0  OQ:Z:GGG@AFDE@E@BC@CBFGFFAFEDD4>B9DBD??EDBC?;AD<ADEC>?C=DBDE@AD>A;:;=>EGEBE>BE<A9
    After compression/decompression, it looks like this:
    Code:
    2       163     1       69171   44      76M     =       69404   309     CTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCACCTTCACTCTCCCATG    S?;<CC>D=?@=C<>D?A@F?@CCD3A?<C?>=>G?CA=:AE>E@HC><??CDAEABD??:9=<AAFCBCABE<>S    RG:Z:SRR107049

    Does anybody know anything about this issue? I have gone through some posts that discuss differences among aligners in how TLEN is calculated.

    Other changes, which are fine for me, are that the read names are replaced by numbers and that all tags except the read group tag are removed.
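
A comparison like this one can be automated with a small sketch. It assumes the records are available as text lines; `differing_fields` and `SAM_FIELDS` are illustrative names, not part of any CRAM or SAM tool. The two sample records are the first nine columns of the reads shown above.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def differing_fields(record_a, record_b):
    """Return {field: (value_a, value_b)} for mandatory columns that differ.

    Splits on any whitespace because forum-pasted records use spaces;
    for real SAM lines, splitting on tabs would be the stricter choice.
    """
    cols_a = record_a.split()
    cols_b = record_b.split()
    return {name: (a, b)
            for name, a, b in zip(SAM_FIELDS, cols_a, cols_b)
            if a != b}

# First nine columns of the read before and after the CRAM round trip.
before = "SRR107049.155163702 163 1 69171 44 76M = 69404 308"
after = "2 163 1 69171 44 76M = 69404 309"

print(differing_fields(before, after))
```

For this pair, only QNAME and TLEN are reported as different; optional tags are not compared by this sketch.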


    Thanks!
    Pablo.

  • #2
    An issue like that was reported on the cram mailing list last year, actually a Picard bug:
    $ samtools view NA12878.mapped.illumina.mosaik.CEU.exome.20110411.chr20.bam 20:64000-65000 | cut -f -9 SRR098401.102568768 147 20 63932 64 76M = 63912 -95 SRR098401.7377046 99 20 63987 65 76M = 641...


    What version of the cram-tools do you have?



    • #3
      The magnitude of TLEN is "the number of bases from the leftmost mapped base to the rightmost mapped base".

      So you should be able to check the other read and calculate TLEN to see which is right.

      Assuming that the other read also has 76 matches, then I would think

      TLEN = 69404 + 76 - 69171 = 309.
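
The same arithmetic can be written as a small sketch, under the assumption stated above that both reads cover 76 reference bases. The function name `tlen_magnitude` is made up for illustration; it only implements the "leftmost mapped base to rightmost mapped base" definition.

```python
def tlen_magnitude(pos1, ref_len1, pos2, ref_len2):
    """Number of bases from the leftmost to the rightmost mapped base, inclusive.

    pos1 and pos2 are 1-based leftmost mapping positions (the SAM POS field);
    ref_len1 and ref_len2 are the reference bases covered by each alignment.
    """
    leftmost = min(pos1, pos2)
    # Last reference base covered by each read, then take the furthest one.
    rightmost = max(pos1 + ref_len1 - 1, pos2 + ref_len2 - 1)
    return rightmost - leftmost + 1

# Read at 69171 and its mate at 69404, both assumed to be 76M.
print(tlen_magnitude(69171, 76, 69404, 76))  # 309
```

The sign of TLEN (positive for the leftmost read, negative for the rightmost) is not handled here; only the magnitude is computed.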

      Which aligner did you use for the original BAM file?

      Justin



      • #4
        Thanks for your answers!
        maubp, I am using version 1.00-b244 and it is linked to Picard 1.79.
        Justin, this data is an Illumina exome from the 1000 Genomes Project. According to the project information, it was aligned with Mosaik (I don't know which version).

        By the way, this is the mate of that read:
        Code:
        SRR107049.155163702     83      1       69404   39      76M     =       69171   -308    TGGTGACCCCCATAGCCATGGGCTGTGACAGATATAGAGCAATATGCAAGCCCCTACACTACACTACAATTATGTG    ##############################################EABFCC@G?>E=E;<H>F>>F@>B=>>B>S    RG:Z:SRR107049  NM:i:4  OQ:Z:##############################################?CBECCAEC=>;A<:FBDCBDEDDDDADDD
        The value of 309 also makes sense to me. Justin, I saw you assumed that the mate had a length of 76, just like the first read. That is correct, as you can see above. Do mates usually have the same length?


        Pablo.



        • #5
          Hi Pablo,
          Yeah, it looks like Picard might be giving the right answer on this. Not sure which downstream tools rely on TLEN - the only thing I can think of is possibly the genomic viewers.

          Paired-end reads won't necessarily be the same length (for example, if there was some trimming done to the reads to remove adapters). And even if they were the same length, they may not have the same number of aligned bases. To calculate the length of an alignment, I think it is #M (match/mismatch) + #D (deletions) + #N (skips). An alignment can have soft clips (bases at the ends of the read that aren't aligned), and those wouldn't be counted towards the length of the alignment. I just got lucky that it also happened to be 76M.
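
A minimal sketch of that alignment-length rule, parsing the CIGAR string directly. The name `reference_length` is illustrative. Besides M, D and N, it also counts the `=` and `X` operations, which the SAM specification lists as consuming reference bases; soft clips (S), hard clips (H), insertions (I) and padding (P) are skipped. Apart from `76M`, the CIGAR strings below are invented examples.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def reference_length(cigar):
    """Sum the lengths of the CIGAR operations that consume reference bases."""
    return sum(int(length)
               for length, op in CIGAR_OP.findall(cigar)
               if op in "MDN=X")

print(reference_length("76M"))         # 76
print(reference_length("5S71M"))       # 71, soft-clipped bases are not counted
print(reference_length("30M2D40M6S"))  # 72, the deletion adds to the span
```

Feeding these lengths, rather than the read lengths, into the TLEN calculation is what keeps it correct for trimmed or clipped mates.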

          best,
          Justin



          • #6
            Thanks again Justin,

            Yes, it seems to be an issue with the original file. I don't know whether it comes from Mosaik or somewhere else.

            Pablo.
