Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Intron spanning reads result in plateaus in wiggle format ?

    Hello everybody,

    I would like to visualise the amount of mapped reads of custom RNAseq samples in the UCSC Genome Browser.

    For this purpose, I can start from already sorted and indexed BAM files, which we obtained from a former collaborator. Since we are all wetlab scientists with limited insight into RNAseq, I unfortunately can’t tell you, which platform and aligner was used.

    To generate the wiggle format tracks, I tried the following code, but despite using the –split option of genomeCoverageBed, I can’t reproduce the tracks as they have been previously created by the collaborator.

    Code:
    genomeCoverageBed –bg -split -ibam  ~/Bioinfo/BAM/RNASeq/unique_mapped/RNASeq_HYPO_L1_unique.bam  -g ./chromInfo-mm9.txt  >   ~/Bioinfo/Bioinformatics/HypoL1.bedgraph
    
    wigToBigWig ./Bioinformatics/HypoL1.bedgraph ./chromInfo-mm9.txt ./Bioinformatics/HypoL1.bigwig
    Namely, there seem to be reads which map across an intron and cause a plateau in the tracks wiggle representation - have a look at the second black track, which is the one I generated.

    Somehow, the former collaborator either excluded those reads or split them to into two pieces – the blue track at least doesn’t show those plateaus. (The track in the middle shows the collaborators bedgraph track after conversion into BigWig - so its not the different format)


    Link to public example session

    Does anybody have a hint for me ?

    Thanks a lot
    Thias

  • #2
    Thias, did you find a solution? I have a similar problem with two data sets I created. I can't see what I did different that would result in there being values for the introns in the ucsc browser for one experiment but not the other.

    thanks
    John

    Comment


    • #3
      Some pondering about this problem

      Dear John,

      As mentioned in the previous post, I have no bioinformatic education (apart from self-taught stuff) and depend on forums like that to get answers. So I unfortunately can't provide you with a satisfactory, expert-bioinformatician-approved, ready-to-use solution to this problem.

      However I spent some time trying and will summarise what I did in the end. I decided that there should be an possibility to filter out the undesired reads, which cause the plateaus.

      Therefore I first obtained a BED file with the coordinates of all introns. If you have a known organism with a released reference genome on the UCSC Genome Browser this is fairly easy:

      Navigate to the UCSC Table tool and decide for your appropriate organism and assembly. Then select as group "Gene and Gene Prediction Tracks" and the gene annotation track of your choice (e.g. " Refseq Genes" or "Ensembl Genes"). Choose the region "genome" and the output format "BED - browser extensible data". Name the file and click on the "get output" button. In the radio button form select "Introns plus 0" for the exact intron regions.

      I used this introns.bed file to intersect my aligned reads with. To do so, I applied intersectBed from the bedtools suite and retained only those reads, which didn't overlap more than 30% of the read length with an intron (-v and -f 0.3 options).

      Code:
      bedtools bamtobed -split -i Alignment.bam | bedtools intersect -v -f 0.30 -a stdin  -b refseq_introns.bed > alignment-piped.bed 
      
      bedtools intersect -v -f 0.30 -abam -split -a Alignment.bam  -b refseq_introns.bed > alignment-direct.bed
      Attention: Both of the above commands work, but they don't produce the same output as detailed below!

      The second commands execution time is much faster (because it is not piped) and it retains for my file less, but longer fragments. (BED file entries: command 1: 24797290, command 2: 19152916)

      I used
      Code:
      awk < alignment-piped.bed  '{print $3-$2 '\n'}' >> pipedlength.txt
      to get an idea about the length distribution of the BED entries and summarised that in R out of curiosity:

      > summary(pipedlength)
      X36
      Min. : 1.00
      1st Qu.:36.00
      Median :36.00
      Mean :31.89
      3rd Qu.:36.00
      Max. :36.00
      > summary(directlength)
      X36
      Min. : 36.0
      1st Qu.: 36.0
      Median : 36.0
      Mean : 37.9
      3rd Qu.: 36.0
      Max. :427053.0
      >
      So in case of the second command, some long reads are retained, while they are probably chopped into smaller fragments with command 1. I guess that those long reads cause the plateaus in the wiggle format.

      So you have the choice: Apply the first command (despite the longer processing time) or use the second command and additionally filter those long reads.

      Code:
      awk '$3-$2 <= 36 { print }' alignment-direct.bed  > alignment-direct-filtered.bed
      Obviously it would be interesting to resolve the nature of those long reads. I think - since our samples originate from cancer cells - this could indicate some chromosomal rearrangements. However I didn't proceed further at this point yet.



      I hope I could help you a bit and if you find out about the reason of this differences, I'd be happy to read you reply here.

      Best
      Matthias
      Last edited by Thias; 01-23-2014, 06:50 AM.

      Comment


      • #4
        Forget what I wrote in the last post (I will not delete it though).

        The solution is much simpler. I didn't align the reads myself, but recieved those files from a former collaborator and had no clue about the aligner which was used. The SAM/BAM format specification allow for two different letters to indicate a gap in the mapping: N for Introns and D for deletions.

        However the -split option of bedtools only splits reads, if the gaps are recorded as N in the CIGAR string of the BAM file. (there is the special option -splitD to force bedtools to split also reads seperated by D). In my BAM file, I have had some intron-spanning reads (the longest introns) which were recorded with Ds and therefore not splitted.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM
        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        25 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        27 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        24 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        52 views
        0 likes
        Last Post seqadmin  
        Working...
        X