  • how to combine junction.bed files with different rows into one table

    Hi All,

    I am working on ~100 samples to detect alternative splicing. TopHat generates a junction.bed file for each sample. However, each of these bed files has a different number of rows, and the junction coordinates are not the same across samples. I think each junction.bed file includes both known and novel junctions.

    Since I am only interested in known junctions in the Ensembl annotation database, how can I map these 100 junction.bed files to the Ensembl gtf file and obtain a table (matrix) with known junctions as rows and sample IDs as columns?

    Or do I need to create an exon-exon junction annotation bed file from Ensembl, then apply RSeQC to the mapped .bam files to obtain the read count for each junction?

    Many thanks,

    Shirley


  • #2
    If I understand your aim correctly, the following two threads should give you pointers for what you need.

    gtftobed: https://www.biostars.org/p/56280/
    bedops: https://www.biostars.org/p/119835/

    • #3
      Thanks, GenoMax, for your quick response. I tried "bedtools" as you suggested but got very strange results. The example files and output are below.

      1. Here is s1.junction.bed file generated by TopHat
      Column #5 "score" is the number of reads that contain the junction.
      1 12197 12639 JUNC00000001 1 + 12197 12639 255,0,0 2 30,45 0,397
      1 12190 12686 JUNC00000002 7 + 12190 12686 255,0,0 2 37,74 0,422
      1 12633 13292 JUNC00000003 6 + 12633 13292 255,0,0 2 64,72 0,587
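
In a TopHat junctions.bed record, the two BED blocks are the read overhangs on either side of the junction, so the intron itself lies between them. The following is a minimal Python sketch of that conversion, not part of the original post; the function name `junction_to_intron` is hypothetical, and it assumes whitespace-separated 12-column BED lines like the three above.

```python
def junction_to_intron(bed_line):
    """Return (chrom, intron_start, intron_end, strand, reads) for one junction line.

    Coordinates are 0-based, half-open, as in BED.
    """
    f = bed_line.split()
    chrom, start, strand = f[0], int(f[1]), f[5]
    reads = int(f[4])  # column 5: number of reads spanning the junction
    block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in f[11].rstrip(",").split(",")]
    intron_start = start + block_sizes[0]   # end of the left overhang block
    intron_end = start + block_starts[1]    # start of the right overhang block
    return chrom, intron_start, intron_end, strand, reads


lines = [
    "1 12197 12639 JUNC00000001 1 + 12197 12639 255,0,0 2 30,45 0,397",
    "1 12190 12686 JUNC00000002 7 + 12190 12686 255,0,0 2 37,74 0,422",
    "1 12633 13292 JUNC00000003 6 + 12633 13292 255,0,0 2 64,72 0,587",
]
for line in lines:
    print(junction_to_intron(line))
# ('1', 12227, 12594, '+', 1)
# ('1', 12227, 12612, '+', 7)
# ('1', 12697, 13220, '+', 6)
```

Keying junctions on the intron coordinates rather than on columns 2 and 3 matters because the overhang lengths differ from sample to sample even when the junction is the same.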

      2. Here is hg19_RefSeq.mod.sorted.bed
      1 11873 12047 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12210 12684 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12644 13464 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12627 13527 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12661 13465 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12663 13469 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12657 13539 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,

      3. run bedtools
      bedtools intersect -a hg19_RefSeq.sorted.bed -b s1.junctions.bed > out.txt

      The results in out.txt are very strange: column #5 "score" (read counts) is 0 on every line.

      1 11873 12047 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12210 12684 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12644 13464 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,
      1 12627 13527 NR_046018 0 + 14409 14409 0 3 354,109,1189, 0,739,1347,

      Could you let me know whether I have misused "bedtools intersect"?

      Many thanks,

      • #4
        Hi GenoMax,

        I figured it out by adding the -wa and -wb options:
        bedtools intersect -wa -wb -a A.bed -b B.bed

        Many thanks,
        Shirley
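
With -wa -wb, each output line carries the full A record followed by the full B record, so the junction read count is available in the B half. For the original goal of a known-junction by sample table, one possible alternative is sketched below in Python. It is not the poster's method: it matches junctions by exact intron coordinates rather than by overlap as `bedtools intersect` does. The names `parse_junction` and `junction_matrix` are hypothetical, and the `known` set in the usage lines is hand-written for illustration, not extracted from a real Ensembl GTF.

```python
def parse_junction(bed_line):
    """Return ((chrom, intron_start, intron_end, strand), reads) for a junctions.bed line."""
    f = bed_line.split()
    start = int(f[1])
    left_overhang = int(f[10].rstrip(",").split(",")[0])
    right_block_start = int(f[11].rstrip(",").split(",")[1])
    key = (f[0], start + left_overhang, start + right_block_start, f[5])
    return key, int(f[4])


def junction_matrix(samples, known):
    """Build a count table restricted to known junctions.

    samples: {sample_id: iterable of junctions.bed lines}
    known:   set of (chrom, intron_start, intron_end, strand) keys
    Returns (junction keys, sample ids, rows), with rows[i][j] the read count
    of junction i in sample j, or 0 if the sample did not report it.
    """
    sample_ids = sorted(samples)
    counts = {key: {} for key in known}
    for sid in sample_ids:
        for line in samples[sid]:
            # Skip blank lines and the "track" header line
            if not line.strip() or line.startswith("track"):
                continue
            key, reads = parse_junction(line)
            if key in counts:  # novel junctions are dropped here
                counts[key][sid] = counts[key].get(sid, 0) + reads
    keys = sorted(counts)
    rows = [[counts[k].get(sid, 0) for sid in sample_ids] for k in keys]
    return keys, sample_ids, rows


# Usage with the three s1 junctions from post #3 and an illustrative known set.
s1 = [
    "1 12197 12639 JUNC00000001 1 + 12197 12639 255,0,0 2 30,45 0,397",
    "1 12190 12686 JUNC00000002 7 + 12190 12686 255,0,0 2 37,74 0,422",
    "1 12633 13292 JUNC00000003 6 + 12633 13292 255,0,0 2 64,72 0,587",
]
known = {("1", 12227, 12612, "+"), ("1", 12721, 13220, "+")}  # assumption, for illustration
keys, sample_ids, rows = junction_matrix({"s1": s1}, known)
print(sample_ids, rows)
# ['s1'] [[7], [0]]
```

Junctions that a sample never reports get 0 instead of a missing row, which is what lets files with different numbers of rows line up in one table. In practice each sample's lines would be read from its own junctions.bed file.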
