Seqanswers Leaderboard Ad

**john789** · 01-22-2014, 07:32 AM

Thias, did you find a solution? I have a similar problem with two data sets I created. I can't see what I did different that would result in there being values for the introns in the ucsc browser for one experiment but not the other.

thanks
John

**Thias** · 01-23-2014, 04:08 AM

Some pondering about this problem

Dear John,

As mentioned in the previous post, I have no bioinformatic education (apart from self-taught stuff) and depend on forums like that to get answers. So I unfortunately can't provide you with a satisfactory, expert-bioinformatician-approved, ready-to-use solution to this problem.

However I spent some time trying and will summarise what I did in the end. I decided that there should be an possibility to filter out the undesired reads, which cause the plateaus.

Therefore I first obtained a BED file with the coordinates of all introns. If you have a known organism with a released reference genome on the UCSC Genome Browser this is fairly easy:

Navigate to the UCSC Table tool and decide for your appropriate organism and assembly. Then select as group "Gene and Gene Prediction Tracks" and the gene annotation track of your choice (e.g. " Refseq Genes" or "Ensembl Genes"). Choose the region "genome" and the output format "BED - browser extensible data". Name the file and click on the "get output" button. In the radio button form select "Introns plus 0" for the exact intron regions.

I used this introns.bed file to intersect my aligned reads with. To do so, I applied intersectBed from the bedtools suite and retained only those reads, which didn't overlap more than 30% of the read length with an intron (-v and -f 0.3 options).

Code:

bedtools bamtobed -split -i Alignment.bam | bedtools intersect -v -f 0.30 -a stdin  -b refseq_introns.bed > alignment-piped.bed 

bedtools intersect -v -f 0.30 -abam -split -a Alignment.bam  -b refseq_introns.bed > alignment-direct.bed

Attention: Both of the above commands work, but they don't produce the same output as detailed below!

The second commands execution time is much faster (because it is not piped) and it retains for my file less, but longer fragments. (BED file entries: command 1: 24797290, command 2: 19152916)

I used

Code:

awk < alignment-piped.bed  '{print $3-$2 '\n'}' >> pipedlength.txt

to get an idea about the length distribution of the BED entries and summarised that in R out of curiosity:

> summary(pipedlength)
X36
Min. : 1.00
1st Qu.:36.00
Median :36.00
Mean :31.89
3rd Qu.:36.00
Max. :36.00
> summary(directlength)
X36
Min. : 36.0
1st Qu.: 36.0
Median : 36.0
Mean : 37.9
3rd Qu.: 36.0
Max. :427053.0
>

So in case of the second command, some long reads are retained, while they are probably chopped into smaller fragments with command 1. I guess that those long reads cause the plateaus in the wiggle format.

So you have the choice: Apply the first command (despite the longer processing time) or use the second command and additionally filter those long reads.

Code:

awk '$3-$2 <= 36 { print }' alignment-direct.bed  > alignment-direct-filtered.bed

Obviously it would be interesting to resolve the nature of those long reads. I think - since our samples originate from cancer cells - this could indicate some chromosomal rearrangements. However I didn't proceed further at this point yet.

I hope I could help you a bit and if you find out about the reason of this differences, I'd be happy to read you reply here.

Best
Matthias

**Thias** · 05-09-2014, 03:02 AM

Forget what I wrote in the last post (I will not delete it though).

The solution is much simpler. I didn't align the reads myself, but recieved those files from a former collaborator and had no clue about the aligner which was used. The SAM/BAM format specification allow for two different letters to indicate a gap in the mapping: N for Introns and D for deletions.

However the -split option of bedtools only splits reads, if the gaps are recorded as N in the CIGAR string of the BAM file. (there is the special option -splitD to force bedtools to split also reads seperated by D). In my BAM file, I have had some intron-spanning reads (the longest introns) which were recorded with Ds and therefore not splitted.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Intron spanning reads result in plateaus in wiggle format ?

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News