SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
ATAC-seq primers Fasteno Sample Prep / Library Generation 14 02-06-2020 06:06 AM
ATAC-seq protocol wen yuan Sample Prep / Library Generation 69 02-22-2018 08:01 AM
ATAC-Seq large fragments cfuttner Sample Prep / Library Generation 5 01-30-2016 07:13 AM
ATAC-seq QC using gel Fasteno Sample Prep / Library Generation 0 03-12-2015 04:13 AM
ATAC seq lysis time zoe561 Sample Prep / Library Generation 3 03-05-2015 09:00 AM

Reply
 
Thread Tools
Old 06-02-2015, 01:46 AM   #1
Sciurus
Member
 
Location: Vancouver

Join Date: Dec 2013
Posts: 23
Default ATAC Seq data analysis

Hi everybody,

We have recently done our first ATAC Seq project in our lab and my task is analysing the data. But since I am quite new to bioinformatics, I was wondering whether somebody here has experience with the different analysis steps?

So far, I've mapped the reads with bwt and have the .bam files.
Now in the paper (Buenrostro et al. 2013) they say they adjusted the read start sites to represent the center of the transposon binding event for peak calling and footprinting and then used ZINBA for peak calling.

How would I go about adjusting the read start sites?
And ZINBA wants .bed tagAlign or bowtie files. Would you simply transfom bam to bed with bedtools?

Thanks!
Sciurus is offline   Reply With Quote
Old 06-02-2015, 07:06 AM   #2
blakeoft
Member
 
Location: Connecticut

Join Date: Oct 2013
Posts: 79
Default

You can use bamToBed from bedtools to convert from bam to bed format as you have mentioned. It's quite easy to adjust the start and end site of reads in bed format. I'm not an awk master, so there's probably a better way to do this, but it should work:

Code:
awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 50, $3 + 50, $4, $5, $6; else print $1, $2 - 50, $3 - 50, $4, $5, $6}' alignment.bed > alignment_shifted.bed
That will shift the start and end forward by fifty base pairs if the strand is +, backward if the strand is -, and print fields 1 through 6. If you need more fields, just tack them on the end.
blakeoft is offline   Reply With Quote
Old 06-03-2015, 04:39 AM   #3
Sciurus
Member
 
Location: Vancouver

Join Date: Dec 2013
Posts: 23
Default

Thanks a lot!
Sciurus is offline   Reply With Quote
Old 06-03-2015, 08:57 AM   #4
hkoohy
Junior Member
 
Location: UK Cambrigde

Join Date: Jun 2015
Posts: 4
Default

Quote:
Originally Posted by Sciurus View Post
Hi everybody,

We have recently done our first ATAC Seq project in our lab and my task is analysing the data. But since I am quite new to bioinformatics, I was wondering whether somebody here has experience with the different analysis steps?

So far, I've mapped the reads with bwt and have the .bam files.
Now in the paper (Buenrostro et al. 2013) they say they adjusted the read start sites to represent the center of the transposon binding event for peak calling and footprinting and then used ZINBA for peak calling.

How would I go about adjusting the read start sites?
And ZINBA wants .bed tagAlign or bowtie files. Would you simply transfom bam to bed with bedtools?

Thanks!

For trimming follow what blakeoft said, and for peak calling you do not need to use ZINBA, rather use macs2 which is more easier to use, faster, accurate and works well with ATAC-SEQ.
Good luck
hkoohy is offline   Reply With Quote
Old 06-04-2015, 01:06 AM   #5
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by hkoohy View Post
for peak calling you do not need to use ZINBA, rather use macs2 which is more easier to use, faster, accurate and works well with ATAC-SEQ.
Good luck
I was going to say this myself. I treated ATAC-Seq libraries pretty much like ChIP-Seq libraries.

Trim adapters -> Align (bwa mem) -> Remove reads in blacklist regions if applicable (https://sites.google.com/site/anshul...cts/blacklists) -> Remove reads with MAPQ < 10 or there about -> Remove duplicates (debatable, but I tend to see better results w/o dups) -> call peaks (macs2, with mitochondrial reads removed, see below)

We find 10s to 100s of thousands of peaks, quite sharp, often mapping to transcription start sites. A feature we find in our data and in the published one is the large number of reads mapping to the mitochondrial genome (up to 40-60%), but it can vary a lot. So I prefer to remove these reads before peak calling.
dariober is offline   Reply With Quote
Old 06-04-2015, 02:48 AM   #6
hkoohy
Junior Member
 
Location: UK Cambrigde

Join Date: Jun 2015
Posts: 4
Default

Quote:
Originally Posted by dariober View Post
I was going to say this myself. I treated ATAC-Seq libraries pretty much like ChIP-Seq libraries.

Trim adapters -> Align (bwa mem) -> Remove reads in blacklist regions if applicable (https://sites.google.com/site/anshul...cts/blacklists) -> Remove reads with MAPQ < 10 or there about -> Remove duplicates (debatable, but I tend to see better results w/o dups) -> call peaks (macs2, with mitochondrial reads removed, see below)

We find 10s to 100s of thousands of peaks, quite sharp, often mapping to transcription start sites. A feature we find in our data and in the published one is the large number of reads mapping to the mitochondrial genome (up to 40-60%), but it can vary a lot. So I prefer to remove these reads before peak calling.
Have you ever investigated the effect of mitcondria reads on peak calling(in particular with macs2) for ATAC_Seq for which we have a lot of MT reads?
hkoohy is offline   Reply With Quote
Old 06-04-2015, 02:57 AM   #7
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by hkoohy View Post
Have you ever investigated the effect of mitcondria reads on peak calling(in particular with macs2) for ATAC_Seq for which we have a lot of MT reads?
Not quite. From what I recall keeping chrM doesn't affect the peak calling in the rest of genome too much. After all chrM is small and the ATAC-Seq reads there are fairly uniform, so macs is not too affected. Nevertheless I feel more comfortable removing chrM to avoid such imbalance.
dariober is offline   Reply With Quote
Old 06-04-2015, 03:03 AM   #8
hkoohy
Junior Member
 
Location: UK Cambrigde

Join Date: Jun 2015
Posts: 4
Default

thanks a lot,
fair enough.
Hkoohy
hkoohy is offline   Reply With Quote
Old 06-09-2015, 05:46 AM   #9
Sciurus
Member
 
Location: Vancouver

Join Date: Dec 2013
Posts: 23
Default

Thanks for your replies!

But in order to work with macs2, I'd need to use Python or is there another way of implementing it in linux?

I don't know how to program with Python, btw. ;-)
Sciurus is offline   Reply With Quote
Old 06-09-2015, 05:53 AM   #10
hkoohy
Junior Member
 
Location: UK Cambrigde

Join Date: Jun 2015
Posts: 4
Default

Quote:
Originally Posted by Sciurus View Post
Thanks for your replies!

But in order to work with macs2, I'd need to use Python or is there another way of implementing it in linux?

I don't know how to program with Python, btw. ;-)

No, you do not have do any python scripting, rather you must first install it and then
Code:
macs2 --help
will provide you with the functions and in particular you need the callpeak function:
Code:
macs2 callpeak --help
hkoohy is offline   Reply With Quote
Old 06-09-2015, 05:55 AM   #11
blakeoft
Member
 
Location: Connecticut

Join Date: Oct 2013
Posts: 79
Default

It looks like you just need to have Python installed, so you won't need to do any kind of Python programming. Check out the INTALL.rst file in the macs2 tarball for information on what you need to have installed and steps for installing macs2.
blakeoft is offline   Reply With Quote
Old 06-09-2015, 10:01 AM   #12
resara
Junior Member
 
Location: UK

Join Date: Aug 2014
Posts: 2
Default

Quote:
Originally Posted by blakeoft View Post
You can use bamToBed from bedtools to convert from bam to bed format as you have mentioned. It's quite easy to adjust the start and end site of reads in bed format. I'm not an awk master, so there's probably a better way to do this, but it should work:

Code:
awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 50, $3 + 50, $4, $5, $6; else print $1, $2 - 50, $3 - 50, $4, $5, $6}' alignment.bed > alignment_shifted.bed
That will shift the start and end forward by fifty base pairs if the strand is +, backward if the strand is -, and print fields 1 through 6. If you need more fields, just tack them on the end.
Oh man, this is very useful. I need to school myself in AWK. I'm still making perl scripts to do this stuff.

I have a question though: I always thought that bam files were 0-indexed - does the bam to bed conversion change this, and so therefore do you need to account for it when you shift the read positions?
resara is offline   Reply With Quote
Old 06-09-2015, 10:24 AM   #13
blakeoft
Member
 
Location: Connecticut

Join Date: Oct 2013
Posts: 79
Default

resara,

I picked 50 base pairs as an example. However, I don't think that 0-indexing or otherwise will affect how much you want to shift the reads. For example, 0 -> 50 vs 1 -> 51 is is still a shift by 50 bases.

Regarding my awk code, you might also want to make sure you aren't shifting your reads to negative indices. You could check for that with a similar awk call.
blakeoft is offline   Reply With Quote
Old 06-09-2015, 11:55 AM   #14
resara
Junior Member
 
Location: UK

Join Date: Aug 2014
Posts: 2
Default

Thanks blakeoft, checking for negative indices and removing them is a good idea.

I understand what you mean, and in most cases you can disregard the 0/1-based positioning, as it won't affect your shift (it will still shift it by 50bp).

I was thinking that since bam is 0-based, if the conversion to bed changes the position to 1-based, that's a 1nt difference in position, which could be an issue if you're planning to do footprinting with the alignments.

edit: sorry, looks like bed is 0-based, so should be fine
resara is offline   Reply With Quote
Old 06-15-2015, 03:51 AM   #15
Sciurus
Member
 
Location: Vancouver

Join Date: Dec 2013
Posts: 23
Default

Thanks a lot!

Now, when using macs2 is it correct to set the offset of 75bp, that was used by Buenrostro et al. with
--nomodel --shift 37 --extsize 37?
Sciurus is offline   Reply With Quote
Old 07-20-2015, 09:40 PM   #16
baritone
Junior Member
 
Location: California

Join Date: Jun 2015
Posts: 5
Default Why 50

Quote:
Originally Posted by blakeoft View Post
You can use bamToBed from bedtools to convert from bam to bed format as you have mentioned. It's quite easy to adjust the start and end site of reads in bed format. I'm not an awk master, so there's probably a better way to do this, but it should work:

Code:
awk 'BEGIN {OFS = "\t"} ; {if ($6 == "+") print $1, $2 + 50, $3 + 50, $4, $5, $6; else print $1, $2 - 50, $3 - 50, $4, $5, $6}' alignment.bed > alignment_shifted.bed
That will shift the start and end forward by fifty base pairs if the strand is +, backward if the strand is -, and print fields 1 through 6. If you need more fields, just tack them on the end.


I don't think 50 is the number to shift. In Jason's paper, they did "all reads aligning to the + strand were offset by +4 bp, and all reads aligning to the strand were offset −5 bp". In fact, whether you shift or not, the distribution of reads will not significantly change.
baritone is offline   Reply With Quote
Old 08-27-2015, 07:03 AM   #17
cloverliang
Junior Member
 
Location: CT

Join Date: Aug 2015
Posts: 3
Default

Quote:
Originally Posted by dariober View Post
I was going to say this myself. I treated ATAC-Seq libraries pretty much like ChIP-Seq libraries.

Trim adapters -> Align (bwa mem) -> Remove reads in blacklist regions if applicable (https://sites.google.com/site/anshul...cts/blacklists) -> Remove reads with MAPQ < 10 or there about -> Remove duplicates (debatable, but I tend to see better results w/o dups) -> call peaks (macs2, with mitochondrial reads removed, see below)

We find 10s to 100s of thousands of peaks, quite sharp, often mapping to transcription start sites. A feature we find in our data and in the published one is the large number of reads mapping to the mitochondrial genome (up to 40-60%), but it can vary a lot. So I prefer to remove these reads before peak calling.
Hi Dariober,

I'm also analyzing ATAC-seq data. I did in a similar way to your procedure:

1. Map to hg19 with bowtie2 (-X 2000).
2. Remove reads with MAPQ < 10 and sort bam file using samtools.
3. Remove duplicates using picard tools.
4. Merge sorted bam of four replicates together.
5. Get cutting position for each fragment and visualize cutting frequency (per basepair) in IGV.

But the peaks I observed are quite noisy. I tried the data from Buenrostro 2013 Nature method paper. I got a good mapping rate (>90%) but a lot of duplicates (60%-70%). After removing duplicates, the per base cutting frequency becomes quite low. I was wondering if you observed a similar pattern for the public data. Would you have some suggestions on how I could improve the signal? Thank you!!
cloverliang is offline   Reply With Quote
Old 08-27-2015, 09:46 AM   #18
baritone
Junior Member
 
Location: California

Join Date: Jun 2015
Posts: 5
Default Do not remove duplicates

Quote:
Originally Posted by cloverliang View Post
Hi Dariober,

I'm also analyzing ATAC-seq data. I did in a similar way to your procedure:

1. Map to hg19 with bowtie2 (-X 2000).
2. Remove reads with MAPQ < 10 and sort bam file using samtools.
3. Remove duplicates using picard tools.
4. Merge sorted bam of four replicates together.
5. Get cutting position for each fragment and visualize cutting frequency (per basepair) in IGV.

But the peaks I observed are quite noisy. I tried the data from Buenrostro 2013 Nature method paper. I got a good mapping rate (>90%) but a lot of duplicates (60%-70%). After removing duplicates, the per base cutting frequency becomes quite low. I was wondering if you observed a similar pattern for the public data. Would you have some suggestions on how I could improve the signal? Thank you!!

According to my experience, removing duplicates and trimming the reads to 1 bp (i.e, tracking the integration site) will both decrease the resolution. Removing duplicates will force the coverage to be no higher than the read length, thus compromise good separation of peaks and non-peaks ; trimming the reads to 1bp is equivalent to giving up Kernel functions, thus the peaks will look very narrow and nasty. If you do want to trim to 1bp, you'd better separate the track into plus and minus strands so that you can easily tell which peak is real and which is not based on tag shifting explained here http://www.genomebiology.com/content/9/9/R137
baritone is offline   Reply With Quote
Old 08-27-2015, 12:23 PM   #19
fanli
Senior Member
 
Location: California

Join Date: Jul 2014
Posts: 198
Default

Quote:
Originally Posted by resara View Post
Thanks blakeoft, checking for negative indices and removing them is a good idea.

I understand what you mean, and in most cases you can disregard the 0/1-based positioning, as it won't affect your shift (it will still shift it by 50bp).

I was thinking that since bam is 0-based, if the conversion to bed changes the position to 1-based, that's a 1nt difference in position, which could be an issue if you're planning to do footprinting with the alignments.

edit: sorry, looks like bed is 0-based, so should be fine
Just fyi, SAM/BAM is 1-based
fanli is offline   Reply With Quote
Old 08-28-2015, 06:28 AM   #20
cloverliang
Junior Member
 
Location: CT

Join Date: Aug 2015
Posts: 3
Default

Quote:
Originally Posted by baritone View Post
According to my experience, removing duplicates and trimming the reads to 1 bp (i.e, tracking the integration site) will both decrease the resolution. Removing duplicates will force the coverage to be no higher than the read length, thus compromise good separation of peaks and non-peaks ; trimming the reads to 1bp is equivalent to giving up Kernel functions, thus the peaks will look very narrow and nasty. If you do want to trim to 1bp, you'd better separate the track into plus and minus strands so that you can easily tell which peak is real and which is not based on tag shifting explained here http://www.genomebiology.com/content/9/9/R137
Thank you baritone!! I'm looking for TF binding footprints, which presumably have less transposase cuts compared to its flanking region. So I wanted to trim reads to the 1bp cutting position. The peaks I got are really shadow (as in the first track of attached figure). I was wondering if that's common pattern people observed in ATAC-seq data.
Attached Images
File Type: png ATAC-seq Buenrostro 2013 data.png (47.5 KB, 242 views)
cloverliang is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:19 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO