Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cleaning MNase BAM file

    Hi,
    I have this dataset of MNase.
    the raw reads were aligned with bwa, the BAM file is about 60GB, and I wonder if I should filter it.

    The header of the file looks like this:

    @RG ID:T0-001 DT:2009-12-18 PG:BWA PI:150 PL:ILLUMINA
    @RG ID:T0-002 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
    @RG ID:T0-003 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
    @RG ID:T0-004 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
    @RG ID:T0-005 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
    @RG ID:T0-006 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
    @RG ID:T0-007 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
    @RG ID:T0-008 DT:2011-03-18 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
    @RG ID:T0-009 DT:2011-03-18 PG:BWA PI:150 PL:ILLUMINA PU:LANE4
    @RG ID:T0-010 DT:2011-11-22 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
    @RG ID:T0-011 DT:2011-11-22 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
    @SQ SN:chr1 LN:249250621
    @SQ SN:chr2 LN:243199373
    @SQ SN:chr3 LN:198022430
    @SQ SN:chr4 LN:191154276
    @SQ SN:chr5 LN:180915260
    @SQ SN:chr6 LN:171115067
    @SQ SN:chr7 LN:159138663
    @SQ SN:chr8 LN:146364022
    @SQ SN:chr9 LN:141213431
    @SQ SN:chr10 LN:135534747
    @SQ SN:chr11 LN:135006516
    @SQ SN:chr12 LN:133851895
    @SQ SN:chr13 LN:115169878
    @SQ SN:chr14 LN:107349540
    @SQ SN:chr15 LN:102531392
    @SQ SN:chr16 LN:90354753
    @SQ SN:chr17 LN:81195210
    @SQ SN:chr18 LN:78077248
    @SQ SN:chr19 LN:59128983
    @SQ SN:chr20 LN:63025520
    @SQ SN:chr21 LN:48129895
    @SQ SN:chr22 LN:51304566
    @SQ SN:chrX LN:155270560
    @SQ SN:chrM LN:16571
    @SQ SN:MMTV LN:10937

    Indicating that it is sorted, and it is a pileup(?) of many sequencing done over the years, (done on the same cell line)
    Last two groups are paired-end, titled @RG ID:T0-010 and @RG ID:T0-011, the remaining are single-end.

    Running samtools flagstat gives this output:

    974253427 + 425863565 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    974252673 + 233075580 mapped (100.00%:54.73%)
    603612708 + 143779894 paired in sequencing
    301806108 + 71890193 read1
    301806600 + 71889701 read2
    603406978 + 82441932 properly paired (99.97%:57.34%)
    603612645 + 91537096 with itself and mate mapped
    0 + 7424943 singletons (0.00%:5.16%)
    61464 + 8185912 with mate mapped to a different chr
    61464 + 2965666 with mate mapped to a different chr (mapQ>=5)


    First thing pops clearly is that there are quite a lot of reads who did not passed QC. However, 54.73% of them are aligned.

    I saw that duplicates were not flagged but instead are tagged.

    Considering all that, I would like to share some questions:
    1) Should I filter out QC failed reads ?
    I looked inside the file, how can a 51bp read with a running Phred score of 2 (ASCII chars consist only of "#" so I caculated 35 -33 = 2 , right?) get a CIGAR 51M ? and can I assume that it is a good alignment after all ?

    2) How should I relate to duplicates ?
    I run a "samtools view ...etc | grep -c ...etc " command to count the reads holding tags which indicate that they are duplicates. The output was 50169529 reads that are tagged as "duplicate", which id 3.6% of all the reads. I tend to relate to these reads as valid reads, considering that the cells were treated with MNase.

    3) If I decide to filter out some reads. What are the consequences of removing a read that is a part of a pair ? I suspect that read2 will be left with a wrong flag, indicating it is paired while read1 is no longer present. Can this flag thing create some problems downstream along the pipeline ?

    I would appreciate some thoughts of this. Thanks

  • #2
    Originally posted by Ohad View Post
    Considering all that, I would like to share some questions:
    1) Should I filter out QC failed reads ?
    Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

    I looked inside the file, how can a 51bp read with a running Phred score of 2 (ASCII chars consist only of "#" so I caculated 35 -33 = 2 , right?) get a CIGAR 51M ? and can I assume that it is a good alignment after all ?
    Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.

    3) If I decide to filter out some reads. What are the consequences of removing a read that is a part of a pair ? I suspect that read2 will be left with a wrong flag, indicating it is paired while read1 is no longer present. Can this flag thing create some problems downstream along the pipeline ?
    I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.

    Comment


    • #3
      Originally posted by Brian Bushnell View Post
      Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

      Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.
      I just finished doing fastqc.
      I see most reads are 50bp, but there are 40 and 36 as well.
      Quality control is GREEN and the box-plot way of distribution shows a large amount of reads with the "special value" of 2.
      I counted 58,180,456 of them using grep, not close enough to explain ~425M bad reads.

      Looking for some cases I saw these three:


      HWI-ST227_83:3:23:15749:55015 528 chr1 9995 25 50M * 0 165
      TTACGATATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
      ##################################################
      RG:Z:T0-008

      HWI-ST227:145:C06RAACXX:1:1104:11867:197280 629 chr1 9996 0 * = 9996 129
      TCTGATCTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT
      ###########################################?2+A1:1
      RG:Z:T0-010

      HWI-ST227_83:4:65:8785:4808 512 chr1 10000 25 50M * 0 165
      ATTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTATCCGT
      ##################################################
      RG:Z:T0-009

      Rolling down some few more lines on my screen I see flags 512 / 528 are quite common.

      Perhaps those "2" quality lines are fine or explainable.

      But another thought in mind is that it seems, by the read lengths and sequencing date, that this is an old data and perhaps the technology was still new and quality calling was less advanced.

      I also see RED cross on 73.3% Duplication Level which is to be bothered by it, and some Kmer popping on some unique positions in the read. Though for Kmer I got the ORANGE escalation mark.

      I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.
      I don't have the original raw files, and it would be a problem for my small server to hold those files if I extract them back from BAM.

      My current direction of thought is to break this BAM into Read Groups and start evaluating each group's sequence quality singly.
      Last edited by Ohad; 03-02-2014, 05:13 PM.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM
      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      18 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      22 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      17 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      48 views
      0 likes
      Last Post seqadmin  
      Working...
      X