Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • FYI, this is a non-directional dataset, so make sure to use the appropriate options.

    Comment


    • Thank you for your info. I always think it is directional. Seems I need to change the command again. Can you tell me how to check whether a sequence is directional or not?

      Comment


      • I'll also note that you can probably get >75% alignment rate with this dataset, at least I did with a subset of it and using local alignment. This would probably be 80-85% if I included all of the multimappers in the metrics.

        Comment


        • The simplest method is to just take a million or so reads and align them in a non-directional manner. If there's considerable alignment to the CTOB and CTOT strands, then it's non-directional.

          Comment


          • Seems I made a mistake while performing trimming_galore. Thank you so much. I will add non directional parameter for tonight and tomorrow I will se the result again.

            Comment


            • For the sake of comparison, below are some metrics that I get using local alignment on that dataset. You won't get identical metrics (we're using different tools and trimming differently), but they won't be that terribly different (except CpG methylation). So >75% alignment is definitely possible with this dataset (at least after removing very low quality reads...of which there are many).

              Code:
              Alignment:
              	76660262 total reads analysed
              	59596692 paired-end reads mapped ( 77.74%).
              
              	27577367 concordant pairs
              	1545115 discordant pairs
              	1351728 reads aligned as singletons
              
              Number of hits aligning to each of the orientations:
              	11086480	 14.46%	OT (original top strand)
              	10666480	 13.91%	OB (original bottom strand)
              	19288820	 25.16%	CTOT (complementary to the original top strand)
              	18554912	 24.20%	CTOB (complementary to the original bottom strand)
              
              Cytosine Methylation (N.B., statistics from overlapping mates are added together!):
              	Number of C's in a CpG context: 188158298
              	Percentage of methylated C's in a CpG context:  44.23%
              	Number of C's in a CHG context: 173622589
              	Percentage of methylated C's in a CHG context:   2.27%
              	Number of C's in a CHH context: 348611484
              	Percentage of methylated C's in a CHH context:   7.04%

              Comment


              • Ok, thank you for your guidance. I will try to check the first million sequence to see which argument is best suited for analyzing so that I don't need to wait for hours to check whether I make a mistake or not.

                Btw, is it really takes a lot of time for analysis using Bismark? My computer spesification is i7 (8 cores) and 28 GB of RAM. The analysis process seems take too long, for fastqdump, trim galore and bismark. It's around 7 hours or maybe more to complete
                Last edited by barbarian; 03-18-2015, 07:21 PM.

                Comment


                • Fastq-dump is painfully slow. Sometimes you can get lucky and either ENA or DDBJ (conveniently located in Japan) has the files in fastq format. That speeds things up quite a bit, but unfortunately these other sources sometimes only have the SRA file (e.g., for this is experiment). Whenever you're dealing with large files, this can start taking a while. This is why many of us push jobs onto clusters where we use multiple machine at once (e.g., I used 13 nodes on ours to perform the alignment of this dataset in 30-40 minutes, which would have been less had someone else not decided to start 1000 jobs at the same time...).

                  Comment


                  • Hello again,

                    So, I decided to try single end dataset because I think paired end takes too many times. With current dataset, I've successfully use Bismark to do aligning process with 66% efficiency. But, I still have some problem. I read the paper published together with the dataset, it stated that they can get map efficiency 72% and higher percentage of CpG methylation. So, is it possible to increase the mapping efficiency with changing some parameter in Bismark or Trim Galore? In the paper, they didn't use Bismark or Trim galore. They developed their own code to do the trimming and methlyation call and use Bowtie as aligner. Thank you.

                    Comment


                    • We've just released a new version of Bismark 0.14.1 which mainly addresses a few bugs/glitches:

                      Bismark: Fixed the cleaning up stage in a --multicore run when --gzip had been specified as well
                      Bismark: Fixed the handling of files in a --multicore run when the input files had been specified including file path information
                      Bismark: Please note that the option -B/--basename in conjunction with --multicore is currently not supported (as in: disabled), but we are aiming to address this soon

                      Methylation Extractor: Fixed a bug with the position adjustment of paired-end reads when the reads should have been trimmed from their 3' ends (option --ignore_3prime)

                      deduplicate_bismark: Now also removing newline characters from the read conversion tag in case other programs interfered with the tag ordering and put this tag into the very last column

                      Download is available from the Bismark project page: http://www.bioinformatics.babraham.a...jects/bismark/

                      Comment


                      • bismark renames alignment file

                        Hello Felix,

                        I found a glitch in bismark 0.14.2, which made me smile a bit.
                        I ran bismark with the following arguments:

                        bismark --quiet --bowtie2 -N 1 -p 32 -X 1000 -o . -B smallsample1 /custom_indexes/hg19canon_bsgenome -1 ./smallsample1_R1_val_1.fq.gz -2 ./smallsample1_R2_val_2.fq.gz

                        Given the basename "smallsample1" I would expect the two following output files:

                        smallsample1_pe.sam
                        smallsample1_PE_report.txt

                        What I get however is:

                        smallbample1_pe.sam
                        smallsample1_PE_report.txt

                        I have now idea, but this looks like a string formatting mistake somewhere in your code. Or is there something I'm missing?

                        If it helps, here is the log of my bismark call:

                        Code:
                        Path to Bowtie 2 specified as: bowtie2
                        Output format is BAM (default)
                        Alignments will be written out in BAM format. Samtools found here: 'software/SAMtools/1.1-foss-2014b/bin/samtools'
                        Reference genome folder provided is /custom_indexes/hg19canon_bsgenome/   (absolute path is '/custom_indexes/hg19canon_bsgenome/)'
                        FastQ format assumed (by default)
                        Attention: using more than 4 cores per alignment thread has been reported to have diminishing returns. If possible try to limit -p to a value of 4
                        Each Bowtie 2 instance is going to be run with 32 threads. Please monitor performance closely and tune down if needed!
                        Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)
                        Output will be written into the directory: /pipelines/trimmed/
                        Setting parallelization to single-threaded (default)
                        
                        Single-core mode: setting pid to 1
                        
                        Paired-end alignments will be performed
                        =======================================
                        
                        The provided filenames for paired-end alignments are ./smallsample1_R1_val_1.fq.gz and ./smallsample1_R2_val_2.fq.gz
                        Input files are in FastQ format
                        Writing a C -> T converted version of the input file smallsample1_R1_val_1.fq.gz to smallsample1_R1_val_1.fq.gz_C_to_T.fastq
                        
                        Created C -> T converted version of the FastQ file smallsample1_R1_val_1.fq.gz (47 sequences in total)
                        
                        Writing a G -> A converted version of the input file smallsample1_R2_val_2.fq.gz to smallsample1_R2_val_2.fq.gz_G_to_A.fastq
                        
                        Created G -> A converted version of the FastQ file smallsample1_R2_val_2.fq.gz (47 sequences in total)
                        
                        Input files are smallsample1_R1_val_1.fq.gz_C_to_T.fastq and smallsample1_R2_val_2.fq.gz_G_to_A.fastq (FastQ)
                        Now running 2 instances of Bowtie 2 against the bisulfite genome of /custom_indexes/hg19canon_bsgenome/ with the specified options: -q -N 1 --score-min L,0,-0.2 -p 32 --reorder --ignore-quals --no-mixed --no-discordant --maxins 1000 --quiet
                        
                        Now starting a Bowtie 2 paired-end alignment for CTread1GAread2CTgenome (reading in sequences from smallsample1_R1_val_1.fq.gz_C_to_T.fastq and smallsample1_R2_val_2.fq.gz_G_to_A.fastq, with the options: -q -N 1 --score-min L,0,-0.2 -p 32 --reorder --ignore-quals --no-mixed --no-discordant --maxins 1000 --quiet --norc))
                        Found first alignment:
                        M00978:20:000000000-A5TV0:1:1101:15681:1332_1:N:0:3/1   99      chr16_CT_converted      70485744        42      26M     =       70485880        162  TTATTGGTGGTTGGGTATTGTGGTTT       >AABBFFFDBBB222EEFGGGFFGHF      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:26 YS:i:0  YT:Z:CP
                        M00978:20:000000000-A5TV0:1:1101:15681:1332_2:N:0:3/2   147     chr16_CT_converted      70485880        42      26M     =       70485744        -162 AAAAATTAGTTGGGTGTGGTGGTGGG       GFG3FFGGGGGGFEA1AFADDA1A1>      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:26 YS:i:0  YT:Z:CP
                        Now starting a Bowtie 2 paired-end alignment for CTread1GAread2GAgenome (reading in sequences from smallsample1_R1_val_1.fq.gz_C_to_T.fastq and smallsample1_R2_val_2.fq.gz_G_to_A.fastq, with the options: -q -N 1 --score-min L,0,-0.2 -p 32 --reorder --ignore-quals --no-mixed --no-discordant --maxins 1000 --quiet --nofw))
                        Found first alignment:
                        M00978:20:000000000-A5TV0:1:1101:15681:1332_1:N:0:3/1   77      *       0       0       *       *       0       0       TTATTGGTGGTTGGGTATTGTGGTTT   >AABBFFFDBBB222EEFGGGFFGHF       YT:Z:UP
                        M00978:20:000000000-A5TV0:1:1101:15681:1332_2:N:0:3/2   141     *       0       0       *       *       0       0       CCCACCACCACACCCAACTAATTTTT   >1A1ADDAFA1AEFGGGGGGFF3GFG       YT:Z:UP
                        
                        [B]>>> Writing bisulfite mapping results to [COLOR="Red"]smallbample1_pe.sam[/COLOR] <<<[/B]
                        
                        Current working directory is: /pipelines/trimmed
                        
                        Now reading in and storing sequence information of the genome specified in: /custom_indexes/hg19canon_bsgenome/
                        
                        chr chr1 (249250621 bp)
                        chr chr2 (243199373 bp)
                        chr chr3 (198022430 bp)
                        chr chr4 (191154276 bp)
                        chr chr5 (180915260 bp)
                        chr chr6 (171115067 bp)
                        chr chr7 (159138663 bp)
                        chr chr8 (146364022 bp)
                        chr chr9 (141213431 bp)
                        chr chr10 (135534747 bp)
                        chr chr11 (135006516 bp)
                        chr chr12 (133851895 bp)
                        chr chr13 (115169878 bp)
                        chr chr14 (107349540 bp)
                        chr chr15 (102531392 bp)
                        chr chr16 (90354753 bp)
                        chr chr17 (81195210 bp)
                        chr chr18 (78077248 bp)
                        chr chr19 (59128983 bp)
                        chr chr20 (63025520 bp)
                        chr chr21 (48129895 bp)
                        chr chr22 (51304566 bp)
                        chr chrX (155270560 bp)
                        chr chrY (59373566 bp)
                        chr chrM (16571 bp)
                        
                        
                        Reading in the sequence files ./smallsample1_R1_val_1.fq.gz and ./smallsample1_R2_val_2.fq.gz
                        Processed 47 sequences in total
                        
                        
                        Successfully deleted the temporary files smallsample1_R1_val_1.fq.gz_C_to_T.fastq and smallsample1_R2_val_2.fq.gz_G_to_A.fastq
                        
                        Final Alignment report
                        ======================
                        Sequence pairs analysed in total:       47
                        Number of paired-end alignments with a unique best hit: 28
                        Mapping efficiency:     59.6%
                        
                        Sequence pairs with no alignments under any condition:  18
                        Sequence pairs did not map uniquely:    1
                        Sequence pairs which were discarded because genomic sequence could not be extracted:    0
                        
                        Number of sequence pairs with unique best (first) alignment came from the bowtie output:
                        CT/GA/CT:       14      ((converted) top strand)
                        GA/CT/CT:       0       (complementary to (converted) top strand)
                        GA/CT/GA:       0       (complementary to (converted) bottom strand)
                        CT/GA/GA:       14      ((converted) bottom strand)
                        
                        Number of alignments to (merely theoretical) complementary strands being rejected in total:     0
                        
                        Final Cytosine Methylation Report
                        =================================
                        Total number of C's analysed:   326
                        
                        Total methylated C's in CpG context:    24
                        Total methylated C's in CHG context:    71
                        Total methylated C's in CHH context:    231
                        Total methylated C's in Unknown context:        0
                        
                        Total unmethylated C's in CpG context:  0
                        Total unmethylated C's in CHG context:  0
                        Total unmethylated C's in CHH context:  0
                        Total unmethylated C's in Unknown context:      0
                        
                        C methylated in CpG context:    100.0%
                        C methylated in CHG context:    100.0%
                        C methylated in CHH context:    100.0%
                        Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0
                        
                        
                        
                        ====================
                        Bismark run complete
                        ====================
                        Kind Regards,
                        Jens
                        Last edited by jhooge; 04-07-2015, 05:45 AM. Reason: Added Code Environment

                        Comment


                        • Thanks for that, I am sure some renaming settings are going wild here. I'll take a look once I am back from holiday. Cheers, Felix

                          Comment


                          • Bismark v0.14.2 fails to delete and merge files in --multicore mode

                            Hi Felix,

                            I think I might have found a bug in bismark v0.14.2 when run in --multicore mode at the cleaning up and merging stages.

                            Bismark fails to:
                            • delete the temporary sequence files (*.fastq.gz.temp.N)
                            • merge the BAM files (*.fastq.gz.temp.N_bismark_bt2_pe.bam)
                            • delete temporary BAM files (*.fastq.gz.temp.N_bismark_bt2_pe.bam)
                            • write the final report text file (*.fastq.gz_bismark_bt2_PE_report.txt)


                            Please see the full log here: http://pastebin.com/wP3SphjK

                            I think this happens when the --temp_dir argument is specified. From what I can tell Bismark tries to find the above files in the wrong location and mixes the input (02trimming) and the temp/output (05alignment) folders up a bit by concatenating them.

                            For example, when running bismark in this way:
                            Code:
                            bismark --multicore 2 --bowtie2 -p 2 --temp_dir /home/enrico16/analysis/05alignment/SampleX --output_dir /home/enrico16/analysis/05alignment/SampleX /home/enrico16/analysis/00genome -1 /home/enrico16/analysis/02trimming/SampleX/SampleX-trimmed-pair1.fastq.gz -2 /home/enrico16/analysis/02trimming/SampleX/SampleX-trimmed-pair2.fastq.gz
                            I get this type of errors:
                            Code:
                            /home/enrico16/analysis/02trimming/SampleX/SampleX-trimmed-pair1.fastq.gz.temp.1	Failed to delete temporary FastQ file 
                            
                            /home/enrico16/analysis/05alignment/SampleX//home/enrico16/analysis/02trimming/SampleX/SampleX-trimmed-pair1.fastq.gz.temp.1: No such file or directory
                            Can you help me figure out what's happening here? I might well be doing something wrong. If not, can you please fix this in your code?

                            Thanks a lot!

                            Best,
                            Enrico
                            Last edited by enrico16; 04-19-2015, 11:25 AM.

                            Comment


                            • New Bismark release: v0.14.3

                              We have just released a new version of Bismark (v0.14.3) to address the following issues:

                              Bismark: Changed the renaming settings for paired-end files so that 'sam' within the filename no longer gets renamed to 'bam' (e.g. smallsample.sam > smallbample.sam)

                              Bismark: Input files specified with filepath information are now handled properly in --multicore runs

                              Bismark: The --multicore option currently requires the output files to be in BAM format, so specifying --sam at the same time has been disallowed

                              Methylation Extractor: fixed another bug for the same issue as in 0.14.1 that had crept into the 0.14.2 release (to do with --ignore_3prime)

                              coverage2cytosine: Changed the option --merge_CpG so that CGs starting at position 1 are not considered (since the 3-base sequence context of the bottom strand C at position 2 can not be determined)

                              Bismark is available for download from the Babraham Bioinformatics Project website.

                              Comment


                              • Hi Felix,

                                firstly, thanks for Bismark, it is a great tool. And secondly, I have read the entire thread here and searched elsewhere, so apologies if this is discussed elsewhere.

                                My issue is understanding the XM:Z tag in Bismark output which I believe indicates methylation status (from a line in the change log). I would like to be able to interpret this tag (i.e. what do x, h indicate?), to the end that possibly the sequence could be 'reconverted' to pre-bisulfite conversion (PI is trying to 'add value' to our methylseq...). Your thoughts on how useful this would be would also be of interest, specifically for copy number analysis (using off-target sequence with CopywriteR package, for example).

                                Thanks,

                                Bruce.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                29 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                31 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X