Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BAM files from RNAseq -Alignment (Basespace) not working with DESEq2 in R

    Hello experts,

    With my recent run in illumina Hiseq2500 I aligned the FASTAQ files generated in basespace to human reference genome (Homosapieans (PAR masked)/hg19 (Refseq) using the RNA-seq Alignment App.I used Tophat (Bowtie2) aligner for the same. except for "NO" to the stranded information, No other parameters were selected.
    I downloaded the BAM files generated and used them in the Bioconductor RNAseq workflow (Using DESeq2)
    Here we walk through an end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.


    to create a summarized object I used "Homo_sapiens.GRCh37.75.gtf"

    below is the script:

    (txdb <- makeTxDbFromGFF("Homo_sapiens.GRCh37.75.gtf", format="gtf", circ_seqs=character()))
    (ebg <- exonsBy(txdb, by="gene"))

    se <- summarizeOverlaps(features=ebg,reads=bamfiles,mode="Union",singleEnd=FALSE,ignore.strand=TRUE,fragments=TRUE )

    I am getting the following error message and am unable to get any count

    Warning message:
    In .Seqinfo.mergexy(x, y) :
    The 2 combined objects have no sequence levels in common. (Use
    suppressWarnings() to suppress this warning.)

    Any suggestions?

    Ram

  • #2
    Hi Ram,

    could you check if the chromosomes' name in the bam files are the same as the ones in the gtf?

    Code:
    samtools view -H my_bam.bam | head
    and
    Code:
    head Homo_sapiens.GRCh37.75.gtf
    .

    Often, one annotation is 'chr1', another 'Chr1', and sometimes just '1'. If you are mixing these up, no program or tool will find an overlap.

    Cheers,

    Michael

    Comment


    • #3
      Thanks a lot Michael,

      below is the output of the codes you suggested:

      [rshukla@dev01 test]$ samtools view -H Ctrl13058.bam | head
      @HD VN:1.0 SO:coordinate
      @RG ID:0 SM:Ctrl13058-Trimmed PI:50
      @SQ SN:chrM LN:16571
      @SQ SN:chr1 LN:249250621
      @SQ SN:chr2 LN:243199373
      @SQ SN:chr3 LN:198022430
      @SQ SN:chr4 LN:191154276
      @SQ SN:chr5 LN:180915260
      @SQ SN:chr6 LN:171115067
      @SQ SN:chr7 LN:159138663

      [rshukla@dev01 test]$ head Homo_sapiens.GRCh37.75.gtf
      #!genome-build GRCh37.p13
      #!genome-version GRCh37
      #!genome-date 2009-02
      #!genome-build-accession NCBI:GCA_000001405.14
      #!genebuild-last-updated 2013-09
      1 pseudogene gene 11869 14412 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
      1 processed_transcript transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; g ene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
      1 processed_transcript exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_numb er "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon _id "ENSE00002234944";
      1 processed_transcript exon 12613 12721 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_numb er "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon _id "ENSE00003582793";
      1 processed_transcript exon 13221 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_numb er "3"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; exon _id "ENSE00002312635";


      The format in .bam file is chr1 where as in .gtf I guess it is just the number.

      Do you have any suggestion how to proceed further, I am very new to all these.

      Thanks again,

      Ram

      Comment


      • #4
        Can you download a corresponding GTF file from BaseSpace? That should sort this problem out.

        It should be possible to change the GTF file you have (from 1 to chr1 etc) but if the above works then you would know that you have a file that exactly corresponds to the genome used for alignment. Using an incorrect GTF file will lead to erroneous results.

        Edit: If changing 1 to chr 1 is the preferred option then use the solution suggested by @vivek in this thread.

        Copying it here for reference (use your file names) :

        Code:
        $ awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' no_chr.file> with_chr.file
        Last edited by GenoMax; 04-12-2016, 06:22 AM.

        Comment


        • #5
          Thanks GenoMax and Micheal

          I tried the other reference genome with same chromosome annotation but it still did not work

          Below is the error message

          > se <- summarizeOverlaps(features=ebg, reads=bamfiles,mode="Union",singleEnd=FALSE,ignore.strand=TRUE,fragments=TRUE,suppressWarnings())
          Error in validObject(.Object) :
          invalid class “SummarizedExperiment0” object: 'assays' nrow differs from 'mcols' nrow


          the heads of both the bam and the gtf files are as below:

          [rshukla@dev01 test]$ samtools view -H Ctrl13058.bam | head
          @HD VN:1.0 SO:coordinate
          @RG ID:0 SM:Ctrl13058-Trimmed PI:50
          @SQ SN:chrM LN:16571
          @SQ SN:chr1 LN:249250621
          @SQ SN:chr2 LN:243199373
          @SQ SN:chr3 LN:198022430
          @SQ SN:chr4 LN:191154276
          @SQ SN:chr5 LN:180915260
          @SQ SN:chr6 LN:171115067
          @SQ SN:chr7 LN:159138663
          [rshukla@dev01 test]$ head Human_hg19.gtf
          chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
          chr1 hg19_knownGene exon 12613 12721 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
          chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
          chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
          chr1 hg19_knownGene exon 12646 12697 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
          chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
          chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
          chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
          chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
          chr1 hg19_knownGene CDS 12595 12721 0.000000 + 1 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
          [rshukla@dev01 test]$


          As per Genomaxs suggestion I am also downloading reference genome from basespace, surprisingly it is 42.3GB !

          any further suggestions?

          Ram

          Comment


          • #6
            @Ram you should not need to get the reference genome. The error above is not related to the original issue but an error with your DESeq2 command.

            Comment


            • #7
              Thanks, what changes should I make to the command?

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              18 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              22 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              17 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              49 views
              0 likes
              Last Post seqadmin  
              Working...
              X