
  • Two questions

    1. I want to ask a question about bam files.

    I have 2 sequencing libraries from the same sample, giving 2 fastq files; the read lengths are 50 bp and 36 bp respectively.
    When I run tophat I need to specify -r, so I cannot merge the two fastq files. But after I get the accepted_hits.bam files, can I merge them (the bam files) with samtools merge?

    I need to run cufflinks and cuffdiff on the merged bam files.

    2. I see that the usage of cuffdiff is
    cuffdiff transcripts.gtf 1.bam 2.bam

    Is this transcripts.gtf the output of cufflinks, or just the reference transcript annotation?


    Thanks, everyone.

  • #2
    I guess the sequences are not paired end, so you can't align the FASTQ files in the same TopHat command. In that case, you can always merge the two BAM files with 'samtools merge' or Picard 'MergeSamFiles'.



    You can use either a reference GTF file or the output from cufflinks. If you want novel transcripts, then do cufflinks first, but if you only want expression from known genes, you can just do cuffdiff with a GTF file downloaded from ensembl, UCSC, etc.
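
For the BAM merging step mentioned above, a minimal sketch might look like the following. It assumes each library was aligned into its own TopHat output directory and that TopHat wrote its alignments to accepted_hits.bam. The directory names tophat_50bp and tophat_36bp, the output name sample1_merged.bam and the helper name merge_cmd are made up for illustration. The helper only prints the 'samtools merge' command so it can be checked first; pipe its output to sh to actually run it.

```shell
# Print (rather than run) a samtools command that merges several
# per-library BAM files from one sample into a single BAM.
# Usage: merge_cmd OUT.bam IN1.bam IN2.bam [...]
merge_cmd() {
    out="$1"
    shift
    printf 'samtools merge %s' "$out"
    printf ' %s' "$@"
    printf '\n'
}

# Hypothetical paths: one TopHat output directory per library.
merge_cmd sample1_merged.bam tophat_50bp/accepted_hits.bam tophat_36bp/accepted_hits.bam
```

'samtools merge' expects its inputs to share the same sort order. TopHat output should already be coordinate-sorted, but it is worth checking the BAM headers before merging.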

    Chris



    • #3
      Thanks very much. But the sequences are paired end. One sample has several libraries, and the read length differs between the libraries. So we first get the bam files for each library with tophat -r xxx -G hg19_ucsc.gtf ERR001_1.fastq ERR001_2.fastq

      and then merge all the bam files that do not belong to the same library but do belong to the same sample. Is that right? Thanks
      Last edited by camelbbs; 10-24-2011, 12:01 PM.



      • #4
        And if we use the output from cufflinks, there will be two gtf files when we work on two samples. How do we input these two files into cuffdiff? Thanks very much for your help.



        • #5
          Yes, you can merge BAM files from multiple sequencing runs if they come from the same sample, even if they have different read lengths.
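
Since each library is aligned separately, the -r value can be worked out per library. As a rough sketch: TopHat's -r is the expected inner distance between the two mates, that is, the fragment length minus both read lengths. The 300 bp fragment size, the hg19 index prefix, the output directory names and the FASTQ file names below are assumptions for illustration only; substitute the values for your own libraries. The loop only prints the commands.

```shell
# Inner distance between mates = fragment length - 2 * read length.
# Usage: inner_distance FRAGMENT_LEN READ_LEN
inner_distance() {
    echo $(( $1 - 2 * $2 ))
}

FRAGMENT_LEN=300   # assumed mean fragment size (without adapters)

for read_len in 50 36; do
    r=$(inner_distance "$FRAGMENT_LEN" "$read_len")
    # hg19 (Bowtie index prefix) and the FASTQ names are placeholders.
    echo "tophat -r $r -G hg19_ucsc.gtf -o tophat_${read_len}bp hg19 lib${read_len}_1.fastq lib${read_len}_2.fastq"
done
```

With these assumed numbers the 50 bp library gets -r 200 and the 36 bp library gets -r 228, which is why the two libraries cannot share one tophat run.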



          • #6
            Cufflinks ships with a utility called gffread. According to gffread -h, it has these options:

            -M/--merge : cluster the input transcripts into loci, collapsing matching
            transcripts (those with the same exact introns and fully contained)
            --cluster-only: same as --merge but without collapsing matching transcripts
            -K for -M option: also collapse shorter, fully contained transcripts
            with fewer introns than the container
            -Q for -M option, remove the containment restriction:
            (multi-exon transcripts will be collapsed if just their introns match,
            while single-exon transcripts can partially overlap (80%))

            I've never used it myself, so I'm not sure it does what you want. You could also convert to BED format and then use BEDtools, which has a tool called intersectBed that produces one BED file from two input BED files. To get a final GTF file from this BED file, I found this link on SEQanswers:

            Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


            But converting between GTF and BED is not always easy, as you can lose data.

            Chris



            • #7
              Thanks a lot, Chris.
              Actually, my goal is to find and compare alternative splicing events between two samples.

              My workflow is like this:

              First I got the two merged bam files for the two samples from tophat. Then I ran

              cuffdiff hg19_ucsc.gtf sample1.bam sample2.bam

              and got some results, but they don't contain the novel transcripts assembled by cufflinks.

              So I ran cufflinks to get the novel transcripts:

              cufflinks -g hg19_ucsc.gtf sample1.bam
              cufflinks -g hg19_ucsc.gtf sample2.bam

              This gave me two transcripts.gtf files, one per sample.

              Then I merged the two files, transcript1.gtf and transcript2.gtf, with the reference annotation:

              cuffmerge -o merged gtf_list (hg19_ucsc.gtf, transcript1.gtf, transcript2.gtf)

              Then I ran cuffdiff:

              cuffdiff merged.gtf sample1.bam sample2.bam

              Is that the right workflow for comparing the novel alternatively spliced transcripts and their expression between the two samples?

              But I see there is also a program called cuffcompare. If I run

              cuffcompare hg19_ucsc.gtf transcript1.gtf transcript2.gtf

              I can also get the different alternatively spliced transcripts. So does that mean

              cufflinks + cuffcompare == cuffdiff ?

              Thanks a lot!!!
              Last edited by camelbbs; 10-25-2011, 01:35 PM.



              • #8
                Sounds like you've got a better method than I suggested, as I have never used cuffcompare or cuffmerge before.

                cuffdiff always seems to be the last program to run, whether you want FPKMs (expression levels) for known or novel transcripts. It writes the data as tab-delimited text files that open nicely in a spreadsheet, and it does some useful statistical tests as well.
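
For reference, the workflow discussed above could be written out roughly as below. This is only a sketch: assemblies.txt, diff_out and print_workflow are made-up names, and it assumes the two cufflinks outputs were saved as transcript1.gtf and transcript2.gtf. As far as I know, cuffmerge takes a text file listing the assembled GTFs (one path per line), takes the reference annotation separately through -g, and writes merged.gtf inside the -o directory. The cuffmerge and cuffdiff commands are printed rather than executed.

```shell
# Manifest for cuffmerge: one assembled GTF path per line.
printf '%s\n' transcript1.gtf transcript2.gtf > assemblies.txt

# Print the remaining steps instead of running them.
print_workflow() {
    # Reference annotation goes in via -g, not in the manifest.
    echo "cuffmerge -g hg19_ucsc.gtf -o merged assemblies.txt"
    # cuffdiff then quantifies both samples against the merged annotation.
    echo "cuffdiff -o diff_out merged/merged.gtf sample1.bam sample2.bam"
}
print_workflow
```

Listing the reference in the manifest instead of passing it with -g would make cuffmerge treat it like just another assembly, so it is worth keeping the two apart.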

                Chris



                • #9
                  I did the same a few days ago. In my project I used only the merged.gtf for cuffdiff, and it went well (there are "u" class codes in it). My workmate, however, found no "u" class codes in her merged.gtf, so she then ran cuffcompare with merged.gtf and known.gtf (the species was not human), and finally used the resulting combined.gtf for cuffdiff.

                  So I am still a little confused about the difference between merged.gtf and combined.gtf. Any help would be appreciated.
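
One way to compare the two files is to tally the class codes in each. The sketch below assumes the GTF attribute column contains entries of the form class_code "u"; as described above. The function name count_class_codes is made up. In the cuffcompare documentation, class code u marks an unknown, intergenic transcript and = marks a complete match to a reference transcript.

```shell
# Count how many GTF lines carry each class_code value.
# Usage: count_class_codes FILE.gtf
count_class_codes() {
    grep -o 'class_code "[^"]*"' "$1" | sort | uniq -c
}

# Example (uncomment once the files exist):
# count_class_codes merged.gtf
# count_class_codes combined.gtf
```

Note that this counts GTF lines (usually exons), not transcripts, so it is only meant to show whether a given class code is present at all.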



                  • #10
                    Hi, I just want to know what you mean by combined.gtf.



                    • #11
                      I want to ask what you mean by combined.gtf.
