Seqanswers Leaderboard Ad

**dpryan** · 03-18-2015, 12:55 AM

FYI, this is a non-directional dataset, so make sure to use the appropriate options.

**barbarian** · 03-18-2015, 01:03 AM

Thank you for your info. I always think it is directional. Seems I need to change the command again. Can you tell me how to check whether a sequence is directional or not?

**dpryan** · 03-18-2015, 01:04 AM

I'll also note that you can probably get >75% alignment rate with this dataset, at least I did with a subset of it and using local alignment. This would probably be 80-85% if I included all of the multimappers in the metrics.

**dpryan** · 03-18-2015, 01:04 AM

The simplest method is to just take a million or so reads and align them in a non-directional manner. If there's considerable alignment to the CTOB and CTOT strands, then it's non-directional.

**barbarian** · 03-18-2015, 01:24 AM

Seems I made a mistake while performing trimming_galore. Thank you so much. I will add non directional parameter for tonight and tomorrow I will se the result again.

**dpryan** · 03-18-2015, 05:18 AM

For the sake of comparison, below are some metrics that I get using local alignment on that dataset. You won't get identical metrics (we're using different tools and trimming differently), but they won't be that terribly different (except CpG methylation). So >75% alignment is definitely possible with this dataset (at least after removing very low quality reads...of which there are many).

Code:

Alignment:
	76660262 total reads analysed
	59596692 paired-end reads mapped ( 77.74%).

	27577367 concordant pairs
	1545115 discordant pairs
	1351728 reads aligned as singletons

Number of hits aligning to each of the orientations:
	11086480	 14.46%	OT (original top strand)
	10666480	 13.91%	OB (original bottom strand)
	19288820	 25.16%	CTOT (complementary to the original top strand)
	18554912	 24.20%	CTOB (complementary to the original bottom strand)

Cytosine Methylation (N.B., statistics from overlapping mates are added together!):
	Number of C's in a CpG context: 188158298
	Percentage of methylated C's in a CpG context:  44.23%
	Number of C's in a CHG context: 173622589
	Percentage of methylated C's in a CHG context:   2.27%
	Number of C's in a CHH context: 348611484
	Percentage of methylated C's in a CHH context:   7.04%

**barbarian** · 03-18-2015, 05:33 PM

Ok, thank you for your guidance. I will try to check the first million sequence to see which argument is best suited for analyzing so that I don't need to wait for hours to check whether I make a mistake or not.

Btw, is it really takes a lot of time for analysis using Bismark? My computer spesification is i7 (8 cores) and 28 GB of RAM. The analysis process seems take too long, for fastqdump, trim galore and bismark. It's around 7 hours or maybe more to complete

**dpryan** · 03-19-2015, 12:20 AM

Fastq-dump is painfully slow. Sometimes you can get lucky and either ENA or DDBJ (conveniently located in Japan) has the files in fastq format. That speeds things up quite a bit, but unfortunately these other sources sometimes only have the SRA file (e.g., for this is experiment). Whenever you're dealing with large files, this can start taking a while. This is why many of us push jobs onto clusters where we use multiple machine at once (e.g., I used 13 nodes on ours to perform the alignment of this dataset in 30-40 minutes, which would have been less had someone else not decided to start 1000 jobs at the same time...).

**barbarian** · 03-23-2015, 08:33 PM

Hello again,

So, I decided to try single end dataset because I think paired end takes too many times. With current dataset, I've successfully use Bismark to do aligning process with 66% efficiency. But, I still have some problem. I read the paper published together with the dataset, it stated that they can get map efficiency 72% and higher percentage of CpG methylation. So, is it possible to increase the mapping efficiency with changing some parameter in Bismark or Trim Galore? In the paper, they didn't use Bismark or Trim galore. They developed their own code to do the trimming and methlyation call and use Bowtie as aligner. Thank you.

**fkrueger** · 03-27-2015, 05:16 AM

We've just released a new version of Bismark 0.14.1 which mainly addresses a few bugs/glitches:

Bismark: Fixed the cleaning up stage in a --multicore run when --gzip had been specified as well
Bismark: Fixed the handling of files in a --multicore run when the input files had been specified including file path information
Bismark: Please note that the option -B/--basename in conjunction with --multicore is currently not supported (as in: disabled), but we are aiming to address this soon

Methylation Extractor: Fixed a bug with the position adjustment of paired-end reads when the reads should have been trimmed from their 3' ends (option --ignore_3prime)

deduplicate_bismark: Now also removing newline characters from the read conversion tag in case other programs interfered with the tag ordering and put this tag into the very last column

Download is available from the Bismark project page: http://www.bioinformatics.babraham.a...jects/bismark/

**jhooge** · 04-07-2015, 05:38 AM

bismark renames alignment file

Hello Felix,

I found a glitch in bismark 0.14.2, which made me smile a bit.
I ran bismark with the following arguments:

bismark --quiet --bowtie2 -N 1 -p 32 -X 1000 -o . -B smallsample1 /custom_indexes/hg19canon_bsgenome -1 ./smallsample1_R1_val_1.fq.gz -2 ./smallsample1_R2_val_2.fq.gz

Given the basename "smallsample1" I would expect the two following output files:

smallsample1_pe.sam
smallsample1_PE_report.txt

What I get however is:

smallbample1_pe.sam
smallsample1_PE_report.txt

I have now idea, but this looks like a string formatting mistake somewhere in your code. Or is there something I'm missing?

If it helps, here is the log of my bismark call:

Code:

Path to Bowtie 2 specified as: bowtie2
Output format is BAM (default)
Alignments will be written out in BAM format. Samtools found here: 'software/SAMtools/1.1-foss-2014b/bin/samtools'
Reference genome folder provided is /custom_indexes/hg19canon_bsgenome/   (absolute path is '/custom_indexes/hg19canon_bsgenome/)'
FastQ format assumed (by default)
Attention: using more than 4 cores per alignment thread has been reported to have diminishing returns. If possible try to limit -p to a value of 4
Each Bowtie 2 instance is going to be run with 32 threads. Please monitor performance closely and tune down if needed!
Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)
Output will be written into the directory: /pipelines/trimmed/
Setting parallelization to single-threaded (default)

Single-core mode: setting pid to 1

Paired-end alignments will be performed
=======================================

The provided filenames for paired-end alignments are ./smallsample1_R1_val_1.fq.gz and ./smallsample1_R2_val_2.fq.gz
Input files are in FastQ format
Writing a C -> T converted version of the input file smallsample1_R1_val_1.fq.gz to smallsample1_R1_val_1.fq.gz_C_to_T.fastq

Created C -> T converted version of the FastQ file smallsample1_R1_val_1.fq.gz (47 sequences in total)

Writing a G -> A converted version of the input file smallsample1_R2_val_2.fq.gz to smallsample1_R2_val_2.fq.gz_G_to_A.fastq

Created G -> A converted version of the FastQ file smallsample1_R2_val_2.fq.gz (47 sequences in total)

Input files are smallsample1_R1_val_1.fq.gz_C_to_T.fastq and smallsample1_R2_val_2.fq.gz_G_to_A.fastq (FastQ)
Now running 2 instances of Bowtie 2 against the bisulfite genome of /custom_indexes/hg19canon_bsgenome/ with the specified options: -q -N 1 --score-min L,0,-0.2 -p 32 --reorder --ignore-quals --no-mixed --no-discordant --maxins 1000 --quiet

Now starting a Bowtie 2 paired-end alignment for CTread1GAread2CTgenome (reading in sequences from smallsample1_R1_val_1.fq.gz_C_to_T.fastq and smallsample1_R2_val_2.fq.gz_G_to_A.fastq, with the options: -q -N 1 --score-min L,0,-0.2 -p 32 --reorder --ignore-quals --no-mixed --no-discordant --maxins 1000 --quiet --norc))
Found first alignment:
M00978:20:000000000-A5TV0:1:1101:15681:1332_1:N:0:3/1   99      chr16_CT_converted      70485744        42      26M     =       70485880        162  TTATTGGTGGTTGGGTATTGTGGTTT       >AABBFFFDBBB222EEFGGGFFGHF      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:26 YS:i:0  YT:Z:CP
M00978:20:000000000-A5TV0:1:1101:15681:1332_2:N:0:3/2   147     chr16_CT_converted      70485880        42      26M     =       70485744        -162 AAAAATTAGTTGGGTGTGGTGGTGGG       GFG3FFGGGGGGFEA1AFADDA1A1>      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:26 YS:i:0  YT:Z:CP
Now starting a Bowtie 2 paired-end alignment for CTread1GAread2GAgenome (reading in sequences from smallsample1_R1_val_1.fq.gz_C_to_T.fastq and smallsample1_R2_val_2.fq.gz_G_to_A.fastq, with the options: -q -N 1 --score-min L,0,-0.2 -p 32 --reorder --ignore-quals --no-mixed --no-discordant --maxins 1000 --quiet --nofw))
Found first alignment:
M00978:20:000000000-A5TV0:1:1101:15681:1332_1:N:0:3/1   77      *       0       0       *       *       0       0       TTATTGGTGGTTGGGTATTGTGGTTT   >AABBFFFDBBB222EEFGGGFFGHF       YT:Z:UP
M00978:20:000000000-A5TV0:1:1101:15681:1332_2:N:0:3/2   141     *       0       0       *       *       0       0       CCCACCACCACACCCAACTAATTTTT   >1A1ADDAFA1AEFGGGGGGFF3GFG       YT:Z:UP

[B]>>> Writing bisulfite mapping results to [COLOR="Red"]smallbample1_pe.sam[/COLOR] <<<[/B]

Current working directory is: /pipelines/trimmed

Now reading in and storing sequence information of the genome specified in: /custom_indexes/hg19canon_bsgenome/

chr chr1 (249250621 bp)
chr chr2 (243199373 bp)
chr chr3 (198022430 bp)
chr chr4 (191154276 bp)
chr chr5 (180915260 bp)
chr chr6 (171115067 bp)
chr chr7 (159138663 bp)
chr chr8 (146364022 bp)
chr chr9 (141213431 bp)
chr chr10 (135534747 bp)
chr chr11 (135006516 bp)
chr chr12 (133851895 bp)
chr chr13 (115169878 bp)
chr chr14 (107349540 bp)
chr chr15 (102531392 bp)
chr chr16 (90354753 bp)
chr chr17 (81195210 bp)
chr chr18 (78077248 bp)
chr chr19 (59128983 bp)
chr chr20 (63025520 bp)
chr chr21 (48129895 bp)
chr chr22 (51304566 bp)
chr chrX (155270560 bp)
chr chrY (59373566 bp)
chr chrM (16571 bp)


Reading in the sequence files ./smallsample1_R1_val_1.fq.gz and ./smallsample1_R2_val_2.fq.gz
Processed 47 sequences in total


Successfully deleted the temporary files smallsample1_R1_val_1.fq.gz_C_to_T.fastq and smallsample1_R2_val_2.fq.gz_G_to_A.fastq

Final Alignment report
======================
Sequence pairs analysed in total:       47
Number of paired-end alignments with a unique best hit: 28
Mapping efficiency:     59.6%

Sequence pairs with no alignments under any condition:  18
Sequence pairs did not map uniquely:    1
Sequence pairs which were discarded because genomic sequence could not be extracted:    0

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:       14      ((converted) top strand)
GA/CT/CT:       0       (complementary to (converted) top strand)
GA/CT/GA:       0       (complementary to (converted) bottom strand)
CT/GA/GA:       14      ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total:     0

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   326

Total methylated C's in CpG context:    24
Total methylated C's in CHG context:    71
Total methylated C's in CHH context:    231
Total methylated C's in Unknown context:        0

Total unmethylated C's in CpG context:  0
Total unmethylated C's in CHG context:  0
Total unmethylated C's in CHH context:  0
Total unmethylated C's in Unknown context:      0

C methylated in CpG context:    100.0%
C methylated in CHG context:    100.0%
C methylated in CHH context:    100.0%
Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0



====================
Bismark run complete
====================

Kind Regards,
Jens

**fkrueger** · 04-07-2015, 02:46 PM

Thanks for that, I am sure some renaming settings are going wild here. I'll take a look once I am back from holiday. Cheers, Felix

**enrico16** · 04-16-2015, 04:14 AM

Bismark v0.14.2 fails to delete and merge files in --multicore mode

Hi Felix,

I think I might have found a bug in bismark v0.14.2 when run in --multicore mode at the cleaning up and merging stages.

Bismark fails to:

delete the temporary sequence files (*.fastq.gz.temp.N)
merge the BAM files (*.fastq.gz.temp.N_bismark_bt2_pe.bam)
delete temporary BAM files (*.fastq.gz.temp.N_bismark_bt2_pe.bam)
write the final report text file (*.fastq.gz_bismark_bt2_PE_report.txt)

Please see the full log here: http://pastebin.com/wP3SphjK

I think this happens when the --temp_dir argument is specified. From what I can tell Bismark tries to find the above files in the wrong location and mixes the input (02trimming) and the temp/output (05alignment) folders up a bit by concatenating them.

For example, when running bismark in this way:

Code:

bismark --multicore 2 --bowtie2 -p 2 --temp_dir /home/enrico16/analysis/05alignment/SampleX --output_dir /home/enrico16/analysis/05alignment/SampleX /home/enrico16/analysis/00genome -1 /home/enrico16/analysis/02trimming/SampleX/SampleX-trimmed-pair1.fastq.gz -2 /home/enrico16/analysis/02trimming/SampleX/SampleX-trimmed-pair2.fastq.gz

I get this type of errors:

Code:

/home/enrico16/analysis/02trimming/SampleX/SampleX-trimmed-pair1.fastq.gz.temp.1	Failed to delete temporary FastQ file 

/home/enrico16/analysis/05alignment/SampleX//home/enrico16/analysis/02trimming/SampleX/SampleX-trimmed-pair1.fastq.gz.temp.1: No such file or directory

Can you help me figure out what's happening here? I might well be doing something wrong. If not, can you please fix this in your code?

Thanks a lot!

Best,
Enrico

**fkrueger** · 05-06-2015, 07:26 AM

New Bismark release: v0.14.3

We have just released a new version of Bismark (v0.14.3) to address the following issues:

• Bismark: Changed the renaming settings for paired-end files so that 'sam' within the filename no longer gets renamed to 'bam' (e.g. smallsample.sam > smallbample.sam)

• Bismark: Input files specified with filepath information are now handled properly in --multicore runs

• Bismark: The --multicore option currently requires the output files to be in BAM format, so specifying --sam at the same time has been disallowed

• Methylation Extractor: fixed another bug for the same issue as in 0.14.1 that had crept into the 0.14.2 release (to do with --ignore_3prime)

• coverage2cytosine: Changed the option --merge_CpG so that CGs starting at position 1 are not considered (since the 3-base sequence context of the bottom strand C at position 2 can not be determined)

Bismark is available for download from the Babraham Bioinformatics Project website.

**bruce01** · 05-07-2015, 02:07 AM

Hi Felix,

firstly, thanks for Bismark, it is a great tool. And secondly, I have read the entire thread here and searched elsewhere, so apologies if this is discussed elsewhere.

My issue is understanding the XM:Z tag in Bismark output which I believe indicates methylation status (from a line in the change log). I would like to be able to interpret this tag (i.e. what do x, h indicate?), to the end that possibly the sequence could be 'reconverted' to pre-bisulfite conversion (PI is trying to 'add value' to our methylseq...). Your thoughts on how useful this would be would also be of interest, specifically for copy number analysis (using off-target sequence with CopywriteR package, for example).

Thanks,

Bruce.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 29 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News