Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi Andrew,

    thanks for this, it looks alright to me so I already changed it in the current development version. I wanted to change a few things regarding the NM, MD and XX tags in the SAM output, but once this is done I'll put out a new release.

    Cheers, Felix

    Comment


    • I've downloaded the test dataset from the Bismark homepage and tried aligning the reads. The mapping efficiency is 0%. Here's my command line:
      Code:
      bismark -n 1 -l 50 /home/user/genome/hg19 test_dataset.fastq
      Any idea why none of the reads mapped?

      Comment


      • Originally posted by blakeoft View Post
        I've downloaded the test dataset from the Bismark homepage and tried aligning the reads. The mapping efficiency is 0%. Here's my command line:
        Code:
        bismark -n 1 -l 50 /home/user/genome/hg19 test_dataset.fastq
        Any idea why none of the reads mapped?
        Can you please link or send me the text you see on your screen to [email protected]?

        Cheers, Felix

        Comment


        • With the help of Felix, I determined that the problem had to do with the way I saved the test dataset. To anyone who also wants to use the test data, I recommend right clicking the link to the test dataset, and selecting "Save Link As..." instead of opening the link, selecting all text and pasting it into a word processor.

          Thank you Felix!

          Comment


          • how to disable '--no-mixed'

            Hi,
            Is it possible to disable '--no-mixed' option? I wanted to rescue some short reads filtered in previous bismark run, but not all reads are paired. With '--no-mixed' on, bowtie2 spits following errors:
            Error, fewer reads in file specified with -1 than in file specified with -2
            terminate called after throwing an instance of 'int'
            (ERR): bowtie2-align died with signal 6 (ABRT)


            If '--no-mixed' cannot be disabled, I might need to remove those problematic orphan reads.

            Thanks.

            Comment


            • Even if you wanted to use the --no-mixed option you would still have to supply paired-end files that have the same number of sequences in Read 1 and Read 2. Bowtie 2 would then report singleton alignments if a read pair fails to align in paired-end mode. This is however not supported by the current way Bismark works, but you could run Bismark in paired-end mode first, write out unmapped reads using --unmapped and then run it again in SE mode for R1 (default) and R2 (--pbat mode) individually if you really wanted to.

              First you need to get your paired-end files fixed though. The easiest way would be to use Trim Galore (with --paired) or another adapter/quality trimmer that is paired-end aware and does not leave orphan singleton reads in the FastQ file. Cheers, Felix

              Comment


              • I seem to be having a rare problem with bismark; it might not be accepting the Bowtie2 option, the multiseed alignment mismatch Number, greater than 1 (one). Following is the command line run and the reported error,

                bismark -L 10 -N 3 -p 4 --un --non_bs_mm --ambiguous --bowtie2 Sequence/Chromosomes/ SRR001725.fastq -o Cleandata/Output/OutputL15

                ERROR reported : Please set the number of multiseed mismatches for Bowtie 2 with '-N <int>' (where <int> can be 0 or 1

                Thanks.

                Comment


                • Originally posted by reco View Post
                  Please set the number of multiseed mismatches for Bowtie 2 with '-N <int>' (where <int> can be 0 or 1
                  As the error message hinted already, -N can only be 0 or 1, but not 3 (nor 5...). Cheers, Felix
                  Last edited by fkrueger; 06-13-2014, 11:27 PM. Reason: The closing bracket was in fact there, so not a bug...

                  Comment


                  • low coverage of mapping of reads from RRBS library

                    The problem is with low coverage (NOT a significant overlapping between the single-end reads) with RRBS library. Bismark was run with the options -N 1 -L 20 --bowtie2 with an outcome of 40% mapping efficiency (Uniquely mapped) and about 25% Ambiguous reads in all the 7 samples in the RRBS Experiment. Adapters were removed before the mapping but it turned out to be that mapping efficiency was the same even without removal.

                    In all locations that were checked there is often multiple copies of a read aligning at a location but not a overlap between different reads (such as formation of a contig). So the mapping seems to be erroneous somehow.

                    Thanks.
                    Last edited by reco; 06-13-2014, 04:32 PM.

                    Comment


                    • Hi Reco,
                      RRBS reads should start (and end) at MspI restriction sites in the genome. The read distribution is therefore governed by the length of the genomic MspI fragments, and whether reads overlap or not depends on the read length you used. If you would generate an annotation track of all theoretical MspI fragments in the genome (which are typically in the size range of 70-220bp) you should see that all reads line up nicely with these fragments. RRBS is very different to 'normal' sequencing, and you wouldn't necessarily expect reads to form contigs.

                      Comment


                      • Thanks a lot! fkrueger, that's a pleasant news for my RRBS data analysis.

                        Comment


                        • MAPQ and deduplicate

                          Hi,
                          I wanted to filter reads with poor MAPQ. It seems, with a same MAPQ value, reads are filtered a bit (1.5%) more if reads are de-duplicated (with deduplicate_bismark) then filtered by MAPQ than the other way (filter by MAPQ then dedup). What does this suggest? I've observed this from two bam files with MAPQ 20.

                          Also, after bismark alignment (before de-dup), ~7% of reads (out of 320M) show MAPQ<1 and ~13% for MAPQ<5. Not sure but 7% of zero MAPQ is a bit high? I used default parameters other than --maxins 564.
                          Last edited by sunggong; 06-19-2014, 06:44 AM.

                          Comment


                          • Hi Sung,

                            I don’t really have an explanation for the increase in filtered alignments after deduplication, but one might speculate that some poorer quality sequences could possibly get sorted - at least to a certain extent - to the start of FastQ files by the Illumina pipeline (reads from the outer edges of a flowcell?), and thus also end up more towards the start of the resulting BAM file? Since the bismark deduplication uses the first alignment per position this might explain the slight increase you are seeing?

                            Since you are bringing the MAPQ filtering up here and I had similar discussions with at least five people over the past couple of weeks it might be worth going into a bit more detail here. I’d like to give credit to Frank Wessely here who initiated this discussion.

                            The MAPQ quality value of an alignment depends on the alignment itself (which is in turn governed by the --score_min function and the maximum penalty of -6 for mismatches which is used throughout (--ignore-quals)), as well as the alignment score of the next best read. Since we assume (and recommend) that reads have been vigorously adapter and quality trimmed we recommend using fairly stringent alignment parameters within Bismark to minimise the rate of mis-alignments, and thus erroneous methylation calls. This is why the default --score_min function is set to L,0,-0.2 in Bismark (roughly 3 mismatches or InDels per 100bp read), and not the relatively lax L,-0.6,-0.6 Bowtie 2 default (roughly equivalent to 10 mismatches or Indels per 100bp read).

                            Now even without considering any similar next best alignments, the MAPQ scores drop relatively quickly for best alignments with a couple of errors in the read, which is mainly a consequence of the stringent --score_min function (and --ignore-quals). If –score_min is relaxed to L,0,-0.4 or even L,-0.6,-0.6 you can see a much wider range of MAPQ values. Here is an example of this, a more detailed list is attached (courtesy of F. Wessely).

                            --score_min L,0,-0.2 (Bismark default)
                            Code:
                            x    Min_sc     AS=0 AS=-6 AS=-8 AS=-11     AS=-12     AS=-14     AS=-18     AS=-24     
                            50   -10.00     42   8    0    0    0    0    0    0    
                            100  -20.00     42   40   24   8    8    3    0    0
                            --score_min L,-0.6,-0.6 (Bowtie 2 default)
                            Code:
                            x    Min_sc     AS=0 AS=-6 AS=-8 AS=-11     AS=-12     AS=-14     AS=-18     AS=-24
                            50	-30.60	42	42	40	24	24	23	8	0	
                            100	-60.60	42	42	42	42	42	40	40	24
                            As you can see in this example, a 50bp read would already get assigned MAPQ values of 0 for anything more than 1 mismatch or InDel in the read, while this would only be the case for 4 or more mismatches for the much relaxed Bowtie 2 setting. Very similar next alignments should bring the MAPQ values even further down.

                            It seems to me that filtering on high MAPQ values on already harshly trimmed as well as stringently aligned data is overly stringent, and I am not sure I would want to use much relaxed alignment settings (which may produce incorrect alignments) only to allow some additional quality filtering afterwards. I would still be interested to hear other opinions on this topic.
                            Attached Files

                            Comment


                            • We have just released a new version of Bismark (v0.12.3) which makes the following changes:

                              o Bismark: Replaced the XX-tag field (base-by-base mismatches to the reference, excluding indels) by an MD:Z: field that now properly reflects mismatches as well as indels
                              o Bismark: Fixed the hemming distance value (NM:i: field) for reads containing insertions (Bowtie 2 mode only) which was previously offset by the number of insertions in the read
                              o methylation extractor/bismark2bedGraph: Changed the '--zero_based' option of the methylation extractor and bismark2bedGraph to write out an additional coverage file (ending in .zero.cov) which uses the UCSC zero-based, half-open standard
                              o bismark2bedGraph: Changed the requirement of CpG context files to now start with CpG... (from CpG_...)

                              Bismark can be downloaded here: https://www.bioinformatics.babraham....jects/bismark/.

                              Comment


                              • Thanks Felix for bringing this up (again).

                                1. MAPQ before and after dedup
                                I don't have a proper explanation either, but again it seems true for low MAPQ value as well. Below shows % of reads filtered at different MAPQ threshold before and after dedup.

                                Code:
                                	         MAPQ<1	MAPQ<3	MAPQ<5	MAPQ<10
                                before dedup	6.57	6.59	12.83	18.33
                                after dedup	6.77	6.79	13.16	18.67
                                Anyway they are quite marginal, so could be not significant.

                                2. MAPQ & alignment score
                                I was hoping to plot MAPQ against alignment score from my data, but the score is not given from the bismark bam file (let me know if I'm wrong). Instead, I counted the number of 1) insertion (I), 2) deletion (D) and 3) mismatches (X) per CIGAR line. This is to see how the alignment performed for the reads of 0 MAPQ (~7%), which can be interpreted as near 'non-unique' as per Bowtie2 (http://bowtie-bio.sourceforge.net/bo...er-more-unique). As shown from you table above, I was sort of expecting quite large number of [IDX] for the reads of 0 MAPQ, but it was opposite; for most of 0 MAPQ reads there were no insertion, deletion or mismatches (see attached). The Pearson correlation shows -0.2331348 between MAPQ and sum(IDX). Did I miss something serious here?

                                Can I ask:
                                1) how the table above was drawn?
                                2) is it possible to extract alignment score from the Bismark bam file
                                3) how the MAPQ was calculated in the Bismark bam file? Is it from bowtie(2)?

                                Thanks,
                                Sung
                                PS. below is a perl code to extract IDX from BAM (via samtool view):
                                Code:
                                while (<STDIN>){
                                    @a=split/\s+/;
                                    @b=split(/([0-9]+[MIDNSHPX=])/,$a[5]);
                                    $bad=0;
                                    foreach (@b){
                                        if(/([0-9]+)[IDX]/){ #insertion, deletion, or mismatch
                                            $bad=$bad+1;
                                        }   
                                    }   
                                    print "$a[4]\t$bad\n"; # 'MAPQ' 'NO of I|D|X'
                                }
                                Attached Files

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Today, 08:47 AM
                                0 responses
                                12 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                60 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                59 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                54 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X