Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sam file smaller than fastq

    Hi Guys,

    I hope it does not turn as a stupid question but this issue is driving me crazy. I have a reference genome A that contains chromosome I am not interested in. I used a python software to generate a new reference genome B with only some of the chromosomes of A. A and B have, as a way of example, Chr1 in common.
    If I align all my reads on the complete reference genome A I get for Chr1 an average coverage of 10
    If I align all my reads on chromosome B I get for Chr1 an average coverage of 2.
    This does not make any sense! Moreover in the second case I obtain sam files that are smaller than the original reads fastq file, which I guess is also impossibile.
    Any idea of such a strange result
    I appreciate your help

  • #2
    How are you handling multimappers in both cases? Perhaps in case B reads are getting discarded if they multimap more than A. If all of your reads are not making it into the sam file (you are likely not including unmapped reads in your sam?) then it may explain the size result (though as rule of thumb don't worry about file sizes for NGS data as long as the reads are all accounted for).

    Comment


    • #3
      Hi GenoMax

      thanks for your reply. Parameters are exactly the same for the two alignments. I used bwa with the default setting except I used 0.05 as edit distance. I then used samtools to convert sam to bam, merge and sort bam file. In the past bwa produced a sam file that included all the reads of the fastq input file and I guess this has not changed. If so sam file, which contains both the sequence and the quality string, plus other fields should not be, as in my case, one third the length of the input fastq file.
      Thanks.....

      Comment


      • #4
        Have you checked your sam files using samtools idxstats?

        See this recent thread as a reference for bam/sam file size issue (specially when the files are sorted).

        Comment


        • #5
          Originally posted by GenoMax View Post
          Have you checked your sam files using samtools idxstats?

          See this recent thread as a reference for bam/sam file size issue (specially when the files are sorted).
          Thanks for the advice. I ran the command you suggested and this is my result:
          I divided the number of line of my fastq files in order to get the number of reads. I got a total of 187,595,025 reads. I calculated the sum of all mapped and unmapped reads in the bam file and obtained 65,536,000. I can not understand where all the other read went! Also in the idxstats output what is the difference between the ummapped reads for each chromosome and the unmapped reads at the end of the tab (the one with chromosome *):
          This is my idxstats output:


          chr1 23037639 2881935 72869
          chr10 18140952 2435272 74191
          chr11 19818926 2692338 74879
          chr12 22702307 2797718 99784
          chr13 24396255 2992318 96653
          chr14 30274277 3751313 116119
          chr15 20304914 2609991 112248
          chr16 22053297 2630989 99628
          chr17 17126926 2166178 70746
          chr18 29360087 3792840 114974
          chr19 24021853 3113695 120154
          chr2 18779844 2507134 89904
          chr3 19341862 2276796 82991
          chr4 23867706 2918446 100922
          chr5 25021643 3282028 116137
          chr6 21508407 2523489 76380
          chr7 21026613 2756721 79519
          chr8 22385789 2801900 69991
          chr9 23006712 3440921 151837
          * 0 0 9344052

          thanks again for your help!

          Comment


          • #6
            Did you divide the number of lines by 4 to arrive at the ~187M number?

            Here is a useful thread that explains the idxstats output: https://www.biostars.org/p/14569/

            Comment


            • #7
              Originally posted by GenoMax View Post
              Did you divide the number of lines by 4 to arrive at the ~187M number?

              Here is a useful thread that explains the idxstats output: https://www.biostars.org/p/14569/
              Yes I did. I also thought I messed up the reference genome file while removing the unwanted chromosomes with my python software. However afterwards I calculated the GC content of the chromosomes in the original and in the "stripped" reference file and they correspond exactly

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              10 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              9 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              51 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              67 views
              0 likes
              Last Post seqadmin  
              Working...
              X