Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by bruce01 View Post
    Hi Felix,

    firstly, thanks for Bismark, it is a great tool. And secondly, I have read the entire thread here and searched elsewhere, so apologies if this is discussed elsewhere.

    My issue is understanding the XM:Z tag in Bismark output which I believe indicates methylation status (from a line in the change log). I would like to be able to interpret this tag (i.e. what do x, h indicate?), to the end that possibly the sequence could be 'reconverted' to pre-bisulfite conversion (PI is trying to 'add value' to our methylseq...). Your thoughts on how useful this would be would also be of interest, specifically for copy number analysis (using off-target sequence with CopywriteR package, for example).

    Thanks,

    Bruce.
    Hi Bruce,

    First allow me to paste a passage from the User Guide:

    The methylation call string contains a dot ‘.’ for every position in the BS-read not involving a cytosine, or contains one of the following letters for the three different cytosine methylation contexts (UPPER CASE = METHYLATED, lower case = unmethylated):

    Code:
    z	unmethylated C in CpG context
    Z	methylated C in CpG context 
    x	unmethylated C in CHG context
    X	methylated C in CHG context
    h	unmethylated C in CHH context
    H	methylated C in CHH context
    u	unmethylated C in Unknown context (CN or CHN)
    U	methylated C in Unknown context (CN or CHN)
    I am not quite sure I understand your other question fully though, what exactly do you mean with a pre-bisulfite state? In theory it is of course possible to do various things with the read, for which you need to take the reference sequence, the CIGAR string as well as the mismatch string into consideration.

    Can't you just take the BAM file and use this to do copy number analyses? Using the coordinates of the read alignmnents you can relatively quickly extract the fully unconverted sequence from the reference genome (is this the pre-bisulfite state?), or go back and extract the bisulfite sequence as it comes from the sequencer out from the FastQ file (e.g. based on the read ID). Does this answer your questions?

    Comment


    • Hi Felix,

      thanks for the reply, sorry I missed the section in the manual.

      I have tried using methylation data for CNV, and comparing this to shallow WGS and also exome for same samples. Exome, WGS are consistent, but methylseq is not (it is whole genome, from a MiSeq QC run to determine conversion efficiency). We also have captured methylseq (CpGiant) which is also inconsistent with WGS, exome.

      The 'pre-bisulfite state' I refer to is before the DNA is bisulfite-treated, so yes, essentially the reference genome, albeit with mutations etc. My boss is thinking this can be used in SNP, indel calling, as well as CNV. Essentially instead of sequencing exome. I am unconvinced due to bisulfite treatment damaging DNA, but said I would try.

      Anyway, thanks for your help and sorry if this is obscure, and my nomenclature incorrect/confusing,

      Bruce.

      Comment


      • It appears that the coverage in BS-Seq can be affected by several factors in addition it simply being a harsh chemical treatment, e.g. the conditions used for the bisulfite-treatment, polymerase bias during amplification and the sequence composition of the single-stranded DNA. So yes maybe this is not the ideal way of looking at CNVs in these samples.

        There are to my knowledge currently two tools out there that attempt to look at SNVs in bisulfite data and correct for this. One is called Bis-SNP (http://epigenome.usc.edu/publicationdata/bissnp2011/), the other one MethylExtract (http://bioinfo2.ugr.es/MethylExtract/). Maybe it would be worth looking into them?

        Comment


        • Thanks Felix, I have tried BisSNP but it runs very slowly (so slow as to be unusable, trying to figure out why currently) and have read the MethylExtract paper but not yet gotten around to running it.

          Comment


          • --multicore and --temp_dir still crashing

            Hi,

            We are still having similar issues when specifying "--multicore" and "--temp_dir" in the same run. Note this is not a problem when specifying "--multicore" but not "--temp_dir" or "--temp_dir" but not "--multicore". Eg:
            Code:
            > bismark --temp_dir Pilot/tmp --non_directional --bowtie2 --multicore 4 -f -o Pilot/tmp lambda/ -1 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta -2 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta
            
            ...
            The provided filenames for paired-end alignments are /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta and /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta
            The provided filenames for paired-end alignments are /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta and /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta
            Failed to write output /dcs01/ajaffe/Brain/DNAm/WGBS/Pilot/tmp//dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta.temp.1: No such file or directory
            Failed to write output /dcs01/ajaffe/Brain/DNAm/WGBS/Pilot/tmp//dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta.temp.4: No such file or directory
            Failed to write output /dcs01/ajaffe/Brain/DNAm/WGBS/Pilot/tmp//dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta.temp.2: No such file or directory
            Failed to write output /dcs01/ajaffe/Brain/DNAm/WGBS/Pilot/tmp//dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta.temp.3: No such file or directory
            
            > bismark --non_directional --bowtie2 --multicore 4 -f -o Pilot/tmp lambda/ -1 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta -2 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta
            
            ...
            
            ====================
            Bismark run complete
            ====================
            
            > bismark --non_directional --bowtie2 --temp_dir Pilot/tmp/ -f -o Pilot/tmp lambda/ -1 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_1.fasta -2 /dcs01/ajaffe/junctionCountsData/sim1/Reads/run1_cov1/sample_01_2.fasta
            
            ...
            
            ====================
            Bismark run complete
            ====================
            
            
            >
            I was also wondering how --multicore differed from the -p flag with bowtie2, e.g. if I specify -p 4 keeping bismark itself on 1 core, will this have the same linear speed up as specifying --multicore 4 and -p 1?

            Thanks!!

            Comment


            • Again the problem here was that the filenames were specified with /full/path/information/ that broke things. I had fixed this for FastQ files but you were using FastA files... Please find attached a version that should be working now.

              Regarding your other question about parallelisation: I haven't done extensive testing on this but the -p parameter only allows Bowtie 2 to produce the alignments somewhat faster, whereas --multicore will should almost fully parallel. The -p speed increase seems to be plateauing fairly quickly, possibly also because of the unique alignment determination and methylation calling processes need some time, too. If you've got enough system resources I would certainly recommend using --multi over -p (or potentially a combination thereof) in terms of speed increase.
              Attached Files

              Comment


              • Hi! I work in the Jaffe Lab, and I'm using the new version of bismark that was attached in the message above. I'm having problems with it. I sampled the first million reads, and I'm aligning them to the human genome. I'm using --multicore 4, and submitted the job to 20 cores, 2Gb each (max allowable memory per core -- 3Gb). The program has been running for over 19 hours, its using 61Gb of scratch space, and has been on the following stage for the past 16 hours:

                Now starting a Bowtie paired-end alignment for GAread1CTread2GAgenome (reading in sequences from /scratch/temp/4549202.1.shared.q/WGC025593L_combined_R1.fastq.gz.temp.4.gz.GA_plus_CT.fastq.gz, with the options: -q -n 1 -k 2 --best --maxins 500 --chunkmbs 512 --norc)

                This seems slow. I was wondering if you could help. Thanks in advance!
                Last edited by nivanov; 05-15-2015, 11:01 AM.

                Comment


                • Originally posted by nivanov View Post
                  Hi! I work in the Jaffe Lab, and I'm using the new version of bismark that was attached in the message above. I'm having problems with it. I sampled the first million reads, and I'm aligning them to the human genome. I'm using --multicore 4, and submitted the job to 20 cores, 2Gb each (max allowable memory per core -- 3Gb). The program has been running for over 19 hours, its using 61Gb of scratch space, and has been on the following stage for the past 16 hours:

                  Now starting a Bowtie paired-end alignment for GAread1CTread2GAgenome (reading in sequences from /scratch/temp/4549202.1.shared.q/WGC025593L_combined_R1.fastq.gz.temp.4.gz.GA_plus_CT.fastq.gz, with the options: -q -n 1 -k 2 --best --maxins 500 --chunkmbs 512 --norc)

                  This seems slow. I was wondering if you could help. Thanks in advance!
                  Yes this is unacceptably slow, this process shouldn't take more than 20 seconds or so... I am not quite sure if the memory allocation is the problem, but I wouldn't be surprised. Just as a guideline: Each instance of Bismark will need ~3GB to hold the genome in memory, and around another 3-4GB per alignment thread, so 2 times this for default directional, and 4 times this for non-directional mapping. The other processes (i.e. Samtools or gzip streams etc) shouldn't need much extra memory. Since you specified --multi 4 you will need this memory times four again. so roughly 40-50GB for directional or ~60GB for non-directional alignments. Is there are a chance that your cluster just won't execute the commands properly?
                  As a solution you could either try to ask for enough resources or go down with --multi, but each instance of Bismark will need ~10-16GB either way. Cheers, Felix

                  Comment


                  • Hi there,

                    So I have recieved some Bismark output files:

                    Bismark (SAM/BAM)

                    Mbias files & Plots

                    Methylation Extractor reports



                    The documentation is very good and I have a good understanding of all the files however in my data from the Methylation extractor there are several sites that are not marked with the chromosome header.

                    The report files according to methylationExractor should have the following headers:

                    (1) seq-ID (2) methylation state (3) chromosome (4) start position (= end position) (5) methylation call

                    In my files I have as follows:

                    HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC - chr9 23913539 z
                    HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC + chr9 23913531 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC + chr9 23913529 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC - chr9 23913485 z
                    HWI-ST571:431:C43FMACXX:8:2316:9928:85484_1:N:0:GTCCGC + chr2 28594224 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9928:85484_1:N:0:GTCCGC - chr2 28594341 z
                    HWI-ST571:431:C43FMACXX:8:2316:9928:85757_1:N:0:GTCCGC - 30126 6594 z
                    HWI-ST571:431:C43FMACXX:8:2316:9928:85757_1:N:0:GTCCGC - chr2 223724995 z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:12206_1:N:0:GTCCGC - chr12 64474195 z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:59562_1:N:0:GTCCGC + chr20 55837341 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:59562_1:N:0:GTCCGC - chr20 55837591 z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:59562_1:N:0:GTCCGC + chr20 55837557 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:59562_1:N:0:GTCCGC + chr20 55837519 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + chr5 92603043 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + 26965 45121 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + 26965 301 Z

                    HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + chr5 92603089 Z
                    HWI-ST571:431:C43FMACXX:8:2316:9929:89987_1:N:0:GTCCGC + chr5 92603122 Z


                    I am not sure about this as they seem to be a strange format. Initially I though maybe there was a formatting issue and it skipped the "chr" but none of the numbers correspond to a CG site.

                    When I run bismark2bedgraph the job completes and these files are also listed but without the chromosomal location I cannot locate the methylation site.

                    Have any of you observed this before or know what has happened? Or could it be a small bug?

                    Cheers

                    Comment


                    • Hmm not quite sure what the problem is but to me it looks like all the information is right there:

                      Code:
                      HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC - chr9 23913539 z
                      HWI-ST571:431:C43FMACXX:8:2316:9928:82689_1:N:0:GTCCGC is the Read ID
                      - is the methylation state (+ is methylated, - is unmethylated)
                      chr9 is the chromosome, here chromosome 9
                      23913539 is the mapping position of the C in question
                      z is the context, so here CpG context

                      Or am I missing something?

                      Comment


                      • Note the 7th, 15th and 16th lines, which lack a chromosome column.

                        Edit: Rather, they likely have incorrect chromosomes.

                        Comment


                        • Indeed as noted by dpryan
                          some of the hits do not have associated chr, looks like some are lost.
                          I looked through one of the sample files and found that there was ~ 6700 that had this missing Chr from ~6.2 Million.

                          sorry if my original post was not clear.
                          thanks for clearing it up dpryan

                          Comment


                          • Sorry about not getting that in the first place. I would tend to agree with Devon that someone has tampered with the chromosome names (the ones that appear as the first line in the .fa files used for the genome preparation >...). There is no place within Bismark that would affect or remove chr from a chromosome name. Maybe you could get back to the people who gave you the data and see if they could clarify this point?

                            Comment


                            • Hi fkrueger,

                              Honestly I had a feeling that this is an in-house issue.
                              When I approached them I was told "there is no annotation for chromosome etc and that it is per read based so I should summarize the read information to get results...."
                              From reading the well documented details of Bismark I did not feel this was the case

                              I wanted to double check before going back.
                              Thanks Fkrueger & dpryan for the help.
                              Much appreciated!
                              jcrow

                              Comment


                              • Hi fkrueger,

                                I am currently trying to use bismark to analyse a huge BS-seq dataset in HPC environment. I am thinking to split the big fastq into smaller pieces, run bismark with each of them in one node, then merge back the BAM results. Do you have any suggestion how can I merge bismark reports? Do you have existing script to do this?

                                Thanks.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                29 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X