Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • We just released a new version of Bismark (v0.12.1) which adds the following functionality/fixes:

    o Bismark: Added calculation of MAPQ values for SAM/BAM output generated with Bowtie 2 for both single-end and paired-end mode. The calculation is implemented like in Bowtie 2 itself. Mapping quality values are still unavailable for alignments performed with Bowtie and retain a value of 255 throughout
    o Bismark: Fixed an uninitialised value warning for PE alignments with Bowtie 2 that occurred whenever Read 2 aligned to the very start of a chromosome (this only affected the warning itself and had no impact on any results)
    o coverage2cytosine: all chromosomes or scaffolds are now processed irrespective of whether they were covered in the sequencing experiment or not. Previously, CpG/cytosine reports for genomes with lots of small scaffolds that were not covered by any reads might have had a variable number of lines between experiments

    Thanks to all here on this forum and S. Jhanwar for their contributions.

    Bismark can be downloaded here: https://www.bioinformatics.babraham....jects/bismark/

    Comment


    • Hello Bismark-Users,

      I am a new user of Bismark v 0.12.1 and currently familarizing myself with the typical output generated along the pipeline.

      Upon a closer look at the .cov and .bedGraph files generated by bismark_methylation_extractor (1) I noticed the presence of a few thousand lines beginning with an asterisk (*) instead of the usual chromosome name. What do those entries stand for ?

      Code:
      awk < SAMPLE.cov ' !x[$1]++ {print $1}' | sort
      
      *
      chr1
      chr10
      chr11
      chr12
      chr13
      chr14
      chr15
      chr16
      chr17
      chr18
      chr19
      chr2
      chr3
      chr4
      chr5
      chr6
      chr7
      chr8
      chr9
      chrM
      chrY
      I notice that chrX is missing in the list, which made the think whether those are actually the readouts of chrX? The reference genome used for mapping is the standard mm9 from Illumina iGenomes by the way.

      Thanks a lot for your help!

      --------------------------------------
      1) invoked with the options: -s --comprehensive --bedGraph --counts --cutoff 10 --report

      Comment


      • Hi Thias,

        I have previously seen a * as the chromosome in a SAM/BAM file when Samtools encounters a read alignment to a chromosome that is not present in the SAM header (the @SQ SN: lines). From the order you posted it seems that the X should have been the second last chromosome in the list, and incidentally I had a report on Friday last week that the second last chromosome name was missing from the SAM header. I have by now tracked and fixed this bug which was caused by missing an increment counter when processing the very last chromosome of a multi-FastA reference file. So the question is, did you use a single multi-FastA reference file as your genome? If so I can send you the latest version of Bismark which fixes this bug.

        If you could just do a samtools view -H on your file you could see whether this was the problem. You don't have to redo the alignments, it would suffice if you could simply add the @SQ header line to your Bismark results file (and then maybe run the methylation extraction once more). Can you keep me updated on this via email? Sorry for the inconvenience, Felix

        Comment


        • Originally posted by fkrueger View Post
          [...] The second last chromosome name was missing from the SAM header. I have by now tracked and fixed this bug which was caused by missing an increment counter when processing the very last chromosome of a multi-FastA reference file. So the question is, did you use a single multi-FastA reference file as your genome?
          Yes, you have nailed it. The header line for chrX is indeed missing in the SAM and it was indeed a multi-FastA which I used as a reference.

          Comment


          • Wohoo! Please find the latest version attached. As added bonus it also supports the new large indexes for Bowtie 2 which you might or might not need. Please let me know if there are any problems.
            Attached Files

            Comment


            • Hi Felix,

              I have just seen a tiny bug, when the option --basename is used, the file name for ambiguous reads is missing an underscore.
              Also, if this option is used the basename is not used for the file name of the alignment pie chart (...alignment_overview.png). I'd prefer to use the basename for this file as well.

              Thanks
              Frank

              Comment


              • Hi Frank, I have tried to fix them both now (attached). Cheers, Felix
                Attached Files

                Comment


                • Originally posted by fkrueger View Post
                  You don't have to redo the alignments, it would suffice if you could simply add the @SQ header line to your Bismark results file (and then maybe run the methylation extraction once more).
                  In case someone else would like to do that on a batch of files, this script should help:

                  Code:
                  #!/bin/bash
                     find  $PWD  -type f -name '*.bam' | grep -v  "out.bam" | while read SAMPLE
                      do
                       samtools view -H $SAMPLE | sed 's=LN:16299=LN:16299\n@SQ\tSN:chrX\ LN:166650296=' > mynewheader.txt
                       samtools reheader mynewheader.txt $SAMPLE > out.bam
                        if [ -s out.bam ]
                          then
                                  rm mynewheader.txt $SAMPLE
                                  mv out.bam $SAMPLE
                          fi
                      done
                  exit 0
                  Of course, you have to adapt the sed command to your specific needs. In my case I took part of the header line above the desired position (LN:16299) and replaced it with the same string followed by a \n (newline) and the desired missing header line. Ths script avoids doing a BAM -> SAM -> BAM conversion, but unfortunately still needs a bit of time because a temporary copy of the BAM is written.
                  Last edited by Thias; 05-14-2014, 02:47 AM. Reason: Added a grep -v in the script to avoid find picking up the out.bam itself

                  Comment


                  • Hi Felix,

                    A few of our libraries have M-bias plots that show pronounced bias far in from the 5' end. When I first tried to avoid the bias by setting high values for the --ignore and --ignore_r2 options, bismark_methylation_extractor failed:
                    $ perl bismark_methylation_extractor --paired-end --no_overlap --ignore 35 --ignore_r2 35 --output out --report --samtools_path /mnt/ngs/analysis/software/samtools-0.1.19 --gzip --bedGraph --buffer_size 4G 99_26_1_GATCAG_003.bam

                    *** Bismark methylation extractor version v0.12.1 ***

                    Summarising Bismark methylation extractor parameters:
                    ===============================================================
                    Bismark paired-end SAM format specified (default)
                    First 35 bp will be disregarded when processing the methylation call string of Read 1
                    First 35 bp will be disregarded when processing the methylation call string of Read 2
                    Output path specified as: out/

                    ...

                    substr outside of string at bismark_methylation_extractor line 2199, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2225, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2230, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2225, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2230, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2225, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2230, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2225, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2230, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2225, <IN> line 179.
                    Use of uninitialized value $op in string eq at bismark_methylation_extractor line 2230, <IN> line 179.
                    Use of uninitialized value $last_op_2 in concatenation (.) or string at bismark_methylation_extractor line 2287, <IN> line 179.
                    Use of uninitialized value $meth_call in split at bismark_methylation_extractor line 2502, <IN> line 179.
                    CIGAR string contained a non-matching number of lengths and operations

                    It seems that bismark_methylation_extractor has issues when --ignore or --ignore_r2 equals or exceeds the alignment length; this was the case in the example above, where a small subset of my reads were trimmed to <=35 bp.

                    I came up with a little patch that at least prevents failure when I run on my paired-end data (also see attached):

                    HTML Code:
                    --- a/bismark_methylation_extractor-0.12.1
                    +++ b/bismark_methylation_extractor-0.12.1_mod
                    @@ -2152,7 +2152,12 @@ sub process_Bismark_results_file{
                              if ($ignore) {
                                ### Clipping off the first <int> number of bases from the methylation call strings as specified with '--ignore <int>' for read 1
                                ### the methylation calls have already been reversed where necessary
                    -           $meth_call_1 = substr($meth_call_1,$ignore,length($meth_call_1)-$ignore);
                    +           my $len = length($meth_call_1)-$ignore;
                    +           if ($len <= 0) {
                    +             next;
                    +           } else {
                    +             $meth_call_1 = substr($meth_call_1,$ignore,$len);
                    +           }
                    
                                if ($strand eq '+') {
                    
                    @@ -2196,7 +2201,12 @@ sub process_Bismark_results_file{
                              if ($ignore_r2) {
                                ### Clipping off the first <int> number of bases from the methylation call string as specified with '--ignore_r2 <int>' for read 2
                                ### the methylation calls have already been reversed where necessary
                    -           $meth_call_2 = substr($meth_call_2,$ignore_r2,length($meth_call_2)-$ignore_r2);
                    +           my $len = length($meth_call_2)-$ignore_r2;
                    +           if ($len <= 0) {
                    +             next;
                    +           } else {
                    +             $meth_call_2 = substr($meth_call_2,$ignore_r2,$len);
                    +           }
                    
                                ### If we are ignoring a part of the sequence we also need to adjust the cigar string accordingly
                    If you have time, it would be great if you could take a quick look at the patch and let me know if anything looks suspect.

                    Many thanks,
                    Andrew
                    Attached Files

                    Comment


                    • We just released a new version of Bismark (v0.12.2) which fixes a couple if issues:

                      o Bismark: Added support for the new 64-bit index files for very large genomes in Bowtie 2 mode. The large genome indexes (ending in .bt2l instead of .bt2 for small genomes) are generated automatically by bismark_genome_preparation and work just as well in the Bismark alignment step
                      o Bismark: Fixed a bug that would omit the name of the second last chromomome from the SAM header if the genome had been supplied as Multi-FastA file. Everything else, including the alignments, would have been unaffected by this glitch
                      o Bismark: When the option '--basename' is specified, SE amibiguous file names now feature an underscore in their file name. Also, the pie chart file names are now derived from the the basename
                      o Methylation Extractor: Introduced a length check when the options --ignore or --ignore_r2 were set to ensure that only reads that are long enough are being processed

                      Bismark can be downloaded here: http://www.bioinformatics.babraham.a...jects/bismark/

                      @Andrew: Thanks, this seems to make perfect sense so I have included it in the current release

                      Comment


                      • Hi,

                        Would it be possible to add an option to bismark_methylation_extractor to generate a .cov file with 0-based start coords and 1-based end coords, in the same format as the .bedGraph file?

                        That way, I could do a downstream analysis with bedtools directly.
                        After extracting the methylation counts, I could also generate the bigWig files without first having to convert the start genomic coordinates.

                        There is a --zero_based parameter, but my understanding of the documentation is that it only applies to the genome-wide cytosine methylation report. I'm also not sure if the end-coords will still be 1-based. If I use this option, will the report have 0-based start coords and 1-based end-coords, or will both start and end coords be 0-based?

                        It seems to me that it would be nice to have the option to generate all the files with 0-based start coordinates and 1-based genomic coordinates, to respect the bedtools and internal UCSC format. Otherwise, having files in different formats becomes quite confusing, and requires an extra step to convert the genomic coordinates to the desired format.

                        Thank you.

                        Comment


                        • Hi blancha, that sounds like a sensible suggestion, I'll look into this when I am back from my holidays.

                          Comment


                          • Originally posted by blancha View Post
                            Hi,

                            Would it be possible to add an option to bismark_methylation_extractor to generate a .cov file with 0-based start coords and 1-based end coords, in the same format as the .bedGraph file?

                            That way, I could do a downstream analysis with bedtools directly.
                            After extracting the methylation counts, I could also generate the bigWig files without first having to convert the start genomic coordinates.

                            There is a --zero_based parameter, but my understanding of the documentation is that it only applies to the genome-wide cytosine methylation report. I'm also not sure if the end-coords will still be 1-based. If I use this option, will the report have 0-based start coords and 1-based end-coords, or will both start and end coords be 0-based?

                            It seems to me that it would be nice to have the option to generate all the files with 0-based start coordinates and 1-based genomic coordinates, to respect the bedtools and internal UCSC format. Otherwise, having files in different formats becomes quite confusing, and requires an extra step to convert the genomic coordinates to the desired format.

                            Thank you.
                            I have just changed the '--zero_based' option of the methylation extractor to write out an additional coverage file (ending in .zero.cov) which uses the UCSC zero-based, half-open standard. The cytosine report only writes out a single position (as it a single cytosine), which uses 1-based coords by default but can be changed to 0-based if desired. Please find the files attached, I hope this is what you were after.
                            Attached Files

                            Comment


                            • Hi Felix,

                              Looks great. Thanks.
                              I'll try it out.

                              Thanks.

                              Alexis

                              Comment


                              • Hi Felix,

                                On the topic of bismark_methylation_extractor, I have run into an error when I try to run it on BAM files with very few reads:
                                $ perl bismark_methylation_extractor --paired-end --no_overlap --ignore 2 --ignore_r2 33 --output $outDir --report --samtools_path /mnt/ngs/analysis/software/samtools-0.1.19 --gzip --bedGraph --buffer_size 4G 2_267_5_CTTGTA_021.bam
                                *** Bismark methylation extractor version v0.12.2 ***

                                Summarising Bismark methylation extractor parameters:
                                ===============================================================
                                Bismark paired-end SAM format specified (default)
                                First 2 bp will be disregarded when processing the methylation call string of Read 1
                                First 33 bp will be disregarded when processing the methylation call string of Read 2
                                Output path specified as: meth/ctl_results/A73L06H9O/2_267_5_CTTGTA_021/

                                ...

                                Processed 8 lines from meth/ctl_results/A73L06H9O/2_267_5_CTTGTA_021/2_267_5_CTTGTA_021.bam in total
                                Total number of methylation call strings processed: 16

                                Determining maximum read lengths for M-Bias plots
                                Maximum read length of Read 1: 73
                                Maximum read length of Read 2: 0

                                No data sets or points at /mnt/ngs/analysis/software/bismark/bismark_v0.12.2/bismark_methylation_extractor line 649, <IN> line 20.

                                The script dies when plot() fails to generate a read 2 M-bias plot. Since the absence of M-bias plots does not particularly bother me, I came up with this patch (also see attached) so that plot() failure causes a warning instead of a fatal error:

                                Code:
                                --- a/bismark_v0.12.2/bismark_methylation_extractor
                                +++ b/bismark_methylation_extractor-0.12.2-mod
                                @@ -558,11 +558,14 @@ sub produce_mbias_plots{
                                
                                     $graph1->set_legend('CpG methylation','CHG methylation','CHH methylation','CpG total calls','CHG total calls','CHH total calls');
                                
                                -    my $gd1 = $graph1->plot(\@mbias_read1) or die $graph1->error;
                                +    if (my $gd1 = $graph1->plot(\@mbias_read1)) {
                                +      open (MBIAS_G1,'>',$mbias_graph_1) or die "Failed to write to file for M-bias plot 1: $!\n\n";
                                +      binmode MBIAS_G1;
                                +      print MBIAS_G1 $gd1->png;
                                +    } else {
                                +      warn "WARNING: Cannot generate read 1 M-bias plot: " . $graph1->error;
                                +    }
                                
                                -    open (MBIAS_G1,'>',$mbias_graph_1) or die "Failed to write to file for M-bias plot 1: $!\n\n";
                                -    binmode MBIAS_G1;
                                -    print MBIAS_G1 $gd1->png;
                                   }
                                
                                   if ($paired){
                                @@ -646,11 +649,13 @@ sub produce_mbias_plots{
                                                  ) or die $graph2->error;
                                
                                       $graph2->set_legend('CpG methylation','CHG methylation','CHH methylation','CpG total calls','CHG total calls','CHH total calls');
                                -      my $gd2 = $graph2->plot(\@mbias_read2) or die $graph2->error;
                                -
                                -      open (MBIAS_G2,'>',$mbias_graph_2) or die "Failed to write to file for M-bias plot 2: $!\n\n";
                                -      binmode MBIAS_G2;
                                -      print MBIAS_G2 $gd2->png;
                                +      if (my $gd2 = $graph2->plot(\@mbias_read2)) {
                                +        open (MBIAS_G2,'>',$mbias_graph_2) or die "Failed to write to file for M-bias plot 2: $!\n\n";
                                +        binmode MBIAS_G2;
                                +        print MBIAS_G2 $gd2->png;
                                +      } else {
                                +        warn "WARNING: Cannot generate read 2 M-bias plot: " . $graph2->error;
                                +      }
                                
                                     }
                                   }
                                There may be a more elegant way to avoid the error, but this patch at least allowed the extractor to get past the plotting step and generate bedGraph output, so I thought I'd share. I'd be curious to hear your thoughts.

                                Thanks,
                                Andrew
                                Attached Files

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-27-2024, 06:37 PM
                                0 responses
                                13 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-27-2024, 06:07 PM
                                0 responses
                                11 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                69 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X