Old 09-26-2014, 04:07 AM   #21
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624
Default

I don't think there is a single best answer to this question, because paired-end data might give you (some) information that single-end data can't provide (due to the read length, namely on the other side of the MspI fragment). On the other hand, RRBS fragments are often quite short, so you are likely to get a lot of overlapping reads yielding redundant information, which needs to be excluded later. We have tried to summarise the main facts about RRBS in this RRBS User Guide. From a processing point of view I would prefer single-end data (it's just much faster and there is less to worry about), but PE has its benefits... In a nutshell:

Pros SE:
- cheaper
- trimming/mapping more straightforward and faster

Pros PE:
- possibly slightly higher mapping efficiency (not very much these days because reads are already fairly long)
- might yield data (at the end of MspI fragments) that SE doesn't reach
Cons PE:
- mapping somewhat slower
- overlapping reads (thus 2 reads != 2x data; this is taken care of in the methylation extractor, see the sketch below)
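For paired-end BAM files the overlap removal happens during methylation extraction; a minimal sketch (the BAM file name is a placeholder):
Code:
bismark_methylation_extractor -p --no_overlap --comprehensive sample_pe.bam
The --no_overlap option makes sure that the overlapping portion of read 2 is not counted a second time.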
Old 09-26-2014, 04:32 AM   #22
tigerxu
Member
 
Location: Wuhan, China

Join Date: Oct 2013
Posts: 20
Default

Thanks a lot for your quick and useful reply, Felix!
Old 01-08-2015, 12:39 AM   #23
mbk0asis
Member
 
Location: Korea

Join Date: Dec 2011
Posts: 41
Default

A quick question!

where do I put '--score-min' option in the command?

I tried below but got an error 'unknown option'...

Code:
bismark -u 1000 --score-min L,0,-0.6 --non_directional --bowtie2 -p 8 ../mm10/ -1 MEF_R1_val_1.fq -2 MEF_R2_val_2.fq
Code:
Unknown option: score-min
Please respecify command line options
Thank you!
Old 01-08-2015, 12:51 AM   #24
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624
Default

The option is called --score_min (with an underscore _). Apart from that you can put it anywhere in the command line.
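For example, the command from the previous post with just the option name corrected:
Code:
bismark -u 1000 --score_min L,0,-0.6 --non_directional --bowtie2 -p 8 ../mm10/ -1 MEF_R1_val_1.fq -2 MEF_R2_val_2.fq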
Old 02-15-2015, 10:58 AM   #25
chxu02
Member
 
Location: GA

Join Date: Jan 2015
Posts: 18
Default

Hi Felix,

I'm mapping some 50bp PE BS-seq reads to hg38 using Bismark. The reads had some adapter contamination at the 3' end, suggesting read-through for some reads. So they were trimmed before mapping using stringent settings:

java -jar trimmomatic-0.32.jar PE ... CROP:42 HEADCROP:2 ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:0:40:1 LEADING:30 TRAILING:30 SLIDINGWINDOW:4:15 MINLEN:16

The trimming was successful according to FastQC: the adapter content was almost zero, and the C and G content was close to zero throughout the whole read for read 1 and read 2, respectively.

I tried both Bowtie1 and Bowtie2 using the -n 1 option. Both gave a mapping efficiency as low as 8%. The mapping efficiency was 75% and 37%, respectively, when I mapped the read 1 reads in SE mode or the read 2 reads using --pbat. This was not caused by the trimming, because the untrimmed reads also gave an 8% mapping efficiency when run in PE mode.

Any suggestion?
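For reference, the single-end runs described above look roughly like this (a sketch; genome path and file names are placeholders):
Code:
# read 1 on its own, default (directional) mode
bismark --bowtie2 /path/to/hg38/ R1_trimmed.fq.gz
# read 2 on its own, aligned to the complementary strands with --pbat
bismark --bowtie2 --pbat /path/to/hg38/ R2_trimmed.fq.gz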

Last edited by chxu02; 02-15-2015 at 01:21 PM.
Old 02-15-2015, 11:23 AM   #26
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624
Default

Hmm, is there a chance that the two reads got out of sync during the trimming procedure? If you align just a small fraction of the untrimmed reads (e.g. -u 100000 would only try to align the first 100,000 reads), do you see a similarly low mapping efficiency?

Right, sorry, I didn't read the last line properly... Do the read 2 alignments also show a rather high mapping percentage? Maybe there is something wrong with read 2 (technically) that would cause this? If you could zip up a few reads (say 100-500K) and send them to me by mail, I could take a look myself. Cheers, Felix

Last edited by fkrueger; 02-15-2015 at 11:28 AM.
Old 02-15-2015, 01:26 PM   #27
chxu02
Member
 
Location: GA

Join Date: Jan 2015
Posts: 18
Default

Quote:
Originally Posted by fkrueger View Post
Hmm, is there a chance that the two reads got out of sync during the trimming procedure? If you align just a small fraction of the untrimmed reads (e.g. -u 100000 would only try to align the first 100,000 reads), do you see a similarly low mapping efficiency?

Right, sorry, I didn't read the last line properly... Do the read 2 alignments also show a rather high mapping percentage? Maybe there is something wrong with read 2 (technically) that would cause this? If you could zip up a few reads (say 100-500K) and send them to me by mail, I could take a look myself. Cheers, Felix
Hi Felix, I updated the mapping results for the read 2 reads in my original post. And please see the email for some of my reads.
Old 02-15-2015, 02:43 PM   #28
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624
Default

Hi Youyou,

I have tried a number of different things with the 250K sequences you sent, and I am sorry, but I can't say exactly what is going on. The things I have seen include:

- a smallish fraction of the reads still contain tags 1:Y: and 2:Y: ... these reads did not pass the Illumina quality filter and should be removed prior to aligning
- even though both R1 and R2 look fine (quality wise) with FastQC, R1 and R2 vary a lot in their mapping efficiency
- since there is a big discrepancy between R1 and R2 I don't think the problem has to do with the paired-end nature of the files
- on top of the problem with mapping efficiency, the methylation values obtained from R1 (~51% in CpG context) and R2 (~17-26% in CpG context) vary wildly as well
- trimming 10bp off the 5' end of R2 increased its mapping efficiency to ~50% (with --bowtie2). Also, allowing more mismatches by relaxing the score_min function to L,0,-0.6 increased the mapping efficiency noticeably
- trimming 10bp off the 3' end of R2 also increased the mapping efficiency, but not to the same extent as the 5' end trimming (by the way, I used Trim Galore for the trimming; see the sketch after this list)
- I have tried to align R2 to several usual suspect contaminations (mouse, PhiX, E. coli, Arabidopsis, Lambda, M13) but couldn't find any significant results.
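For reference, the 5' clipping of R2 mentioned above can be done with Trim Galore's hard-clipping option; a sketch with placeholder file names:
Code:
# remove 10 bp from the 5' end of read 2 only, keeping read 1 and read 2 in sync
trim_galore --paired --clip_R2 10 sample_R1.fastq.gz sample_R2.fastq.gz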

So far, it looks like R2 is acting up but for a not very obvious reason. Have there been any problems reported for the run in the sequencing facility? As it stands I would probably lean towards only aligning R1 in Single-End mode because R2 clearly reduces the mapping efficiency to next-to-nothing.

Sorry if this wasn't very comprehensive but it is getting late now... Cheers, Felix
Old 02-17-2015, 09:35 AM   #29
chxu02
Member
 
Location: GA

Join Date: Jan 2015
Posts: 18
Default

Hi Felix,
Thanks for your troubleshooting. Anyway, I think the extraordinarily small insert size is the main reason for the mapping failure. I'm preparing a library with a longer insert size now. One quick question: how do I filter those Y-tagged reads out of the read pairs without disrupting their synchrony? Is this usually supposed to be done by the sequencing facility? These reads are definitely not going to be used.
Old 02-17-2015, 09:40 AM   #30
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624
Default

I only did a grep for 1:Y: and 2:Y:, and the two files seemed to have the same number of Ys, so presumably the entire read pair is tagged. And yes, these reads are often filtered out by the sequencing facility; otherwise, if I were you, I would write a quick script to remove them from both files. Good luck!
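A minimal sketch of such a script (bash; assumes standard 4-line FASTQ records and CASAVA-style " 1:Y:" / " 2:Y:" flags in the headers):
Code:
# flatten both files to one record per line, keep only pairs where neither read is flagged Y,
# then write the survivors back out as 4-line FASTQ so R1 and R2 stay in sync
paste <(zcat R1.fastq.gz | paste - - - -) <(zcat R2.fastq.gz | paste - - - -) \
  | awk -F'\t' '$1 !~ / 1:Y:/ && $5 !~ / 2:Y:/' \
  | tee >(cut -f1-4 | tr '\t' '\n' > R1.filtered.fastq) \
  | cut -f5-8 | tr '\t' '\n' > R2.filtered.fastq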
Old 02-18-2015, 05:57 AM   #31
chxu02
Member
 
Location: GA

Join Date: Jan 2015
Posts: 18
Default

Quote:
Originally Posted by fkrueger View Post
Once you finished the methylation extraction you can simply take the C-context output and simply concatenate it, e.g.

cat CpG*.txt > CpG_context_merged.txt
or
zcat CpG*.txt.gz > CpG_context_merged.txt

and use this merged output as input for bismark2bedGraph. The only thing that is somewhat inconvenient is that you don't get one nice mapping report / html file because the alignment process was split up into 3 processes...
Felix, I just wonder what the advantage of this concatenation is. I can use multiple files (CpG_OT, CpG_OB, CpG_CTOT or CpG_CTOB) as input for bismark2bedGraph. This is what I do with the txt files from my trimmed PE reads and unpaired SE reads.
Old 02-18-2015, 06:01 AM   #32
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624
Default

It shouldn't make a difference if you use bismark2bedGraph, other than looking a little bit tidier...
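In other words, both of the following should give equivalent results (file names are placeholders):
Code:
# several context files passed directly
bismark2bedGraph -o sample.bedGraph CpG_OT_sample.txt CpG_OB_sample.txt CpG_OT_sample_unpaired.txt
# or one pre-merged file
cat CpG_*_sample*.txt > CpG_context_merged.txt
bismark2bedGraph -o sample.bedGraph CpG_context_merged.txt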
Old 07-06-2020, 08:22 AM   #33
iramai
Junior Member
 
Location: Eibar

Join Date: Mar 2018
Posts: 6
Default

Hi all,
Maybe I am too late posting here the same problem that all of you had 6 years ago.

I am trying to analyze some bisulfite sequencing samples with bismark/bowtie2, and I also get really low mapping efficiencies with paired-end samples.
> bismark --genome /home-gluster/mm10/ --bowtie2 --un --ambig_bam --nucleotide_coverage -q -1 R1C_1_val_1.fq.gz -2 R1C_2_val_2.fq.gz

These are the alignment rates:
435128526 reads; of these:
435128526 (100.00%) were paired; of these:
434844627 (99.93%) aligned concordantly 0 times
62609 (0.01%) aligned concordantly exactly 1 time
221290 (0.05%) aligned concordantly >1 times
0.07% overall alignment rate
435128526 reads; of these:
435128526 (100.00%) were paired; of these:
434841421 (99.93%) aligned concordantly 0 times
58642 (0.01%) aligned concordantly exactly 1 time
228463 (0.05%) aligned concordantly >1 times
0.07% overall alignment rate
Processed 435128526 sequences in total

I have been quality trimming my sequences to get rid of the adapters using Trim Galore!, and FastQC did not report any specific problems. Then I ran Bowtie2, through Bismark, with the default parameters recommended in the protocol.

After reading this thread I have also tried to map the sequences in single-end mode with the default parameters of the Bismark protocol,
> bismark --genome /home-gluster/mm10/ --bowtie2 --un --ambig_bam --nucleotide_coverage -q R1C_1_val_1.fq.gz

but the mapping efficiency is also low there:
435128526 reads; of these:
435128526 (100.00%) were unpaired; of these:
434723969 (99.91%) aligned 0 times
110243 (0.03%) aligned exactly 1 time
294314 (0.07%) aligned >1 times
0.09% overall alignment rate
435128526 reads; of these:
435128526 (100.00%) were unpaired; of these:
434728822 (99.91%) aligned 0 times
118636 (0.03%) aligned exactly 1 time
281068 (0.06%) aligned >1 times
0.09% overall alignment rate

- Could the problem be a desynchronization of the paired sequences during the adapter trimming step? But I did use the paired-end option of Trim Galore! for the trimming.

- Could the problem be the reference genome? For the sequencing run the reference genome "gem3.mmusculus.GRCm38_BS" was used, but since I always use the mm10 reference genome from UCSC for other sequencing analyses such as RNA-seq or ChIP-seq, I tried mm10 for the bisulfite analysis with Bismark as well. Could this be the main cause of the low alignment rate? Some weeks ago I started a thread with this question but nobody answered: http://seqanswers.com/forums/showthread.php?t=94563 . I did not know whether using the GRCm38 genome for the bisulfite analysis might be inappropriate if I want to compare these results with RNA-seq and ChIP-seq results analyzed against the mm10 reference genome.

In conclusion, I am new to bisulfite analysis and I am really lost about how to proceed. So I am hoping that somebody here with more experience might be able to help me.

Thanks in advance,

Iraia
Old 07-06-2020, 08:29 AM   #34
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624
Default

Hi there,

The mm10 and GRCm38 genomes should have exactly the same sequence; there are only some minor naming differences (e.g. chromosomes are called chr1, chr2 instead of 1, 2, chrM instead of MT, and so on).
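If in doubt, a quick look at the sequence names in the genome folder you used for bismark_genome_preparation tells you which naming scheme you have (a sketch; assumes uncompressed .fa files in the folder from your command):
Code:
# list the chromosome names in the reference FASTA file(s)
grep '>' /home-gluster/mm10/*.fa | head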

I have written up a few tips you could try out here: https://github.com/FelixKrueger/Bism...ite-seq-sample

If you still struggle to find an answer, please feel free to send me a subset of your (gzip compressed) raw sequences so I can take a look.

Cheers, Felix
Old 07-13-2020, 06:28 AM   #35
iramai
Junior Member
 
Location: Eibar

Join Date: Mar 2018
Posts: 6
Default

Hi Felix,
Thanks for answering so fast.
I have been testing some of the tips from the link you shared in your previous comment, but I still can't obtain higher alignment rates.

First I changed the Trim Galore parameters for trimming the raw data to the standard ones. It seems that the program detects some Illumina adapters; however, I also identified overrepresented sequences such as poly-A/G/T, so I trimmed those as well, always with the --paired parameter. After these additional trimming steps, the read lengths in the FastQC report range between 20 and 150 bp, and I don't know whether that is correct or whether I need to fix them to a specific length (150 bp).

Next I tried some of your tips for the alignment step. With the default parameters and paired-end mode the alignment rate was too low (0.09%), as I described in the first comment. Then I tried the alignment in single-end mode: with R1 the alignment rate is 0.09% again, while with R2 it is higher, 47.04%. Now I am waiting for the paired-end alignment with the --score_min L,0,-0.6 parameter you recommend, but as the data files are so big it takes around 10 days to process the alignment.

If it is not too much trouble, I would like to share some fragments of my raw samples with you to see what you think of them. For that I also need a little help: what command-line steps do I need to follow to cut out a subset of the samples and be able to send them to you? I am pretty new to this sequencing world, so I will be very grateful for any kind of help.

Thanks in advance,

Iraia
Old 07-13-2020, 09:00 AM   #36
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 624
Default

Hi Iraia,

I don't think you should start aligning the entire file until you know exactly what the best way to do it is. You are absolutely welcome to send me a few reads.

If you are on Linux or on a Mac, you could use the following 2 commands:

Code:
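# head -800000 keeps the first 800,000 lines, i.e. the first 200,000 reads (4 lines per FASTQ record)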
gunzip -c file_R1.fastq.gz | head -800000 | gzip -c - > 200K_R1.fastq.gz
gunzip -c file_R2.fastq.gz | head -800000 | gzip -c - > 200K_R2.fastq.gz
This should then fit as an email attachment, you can use this email address.