  • merging sequencing data from different sequencing runs

    Hi everybody,

    What are the considerations when merging sequencing data from different sequencing runs?
    If the data are paired-end, does it matter how close the insert sizes of the datasets being merged are?
    Also, samtools has a merge tool for merging BAM files. What is the difference between merging BAM files and merging FASTQ files? Are the two methods equivalent, or does one have an advantage over the other?
    Thanks.

    CSoong

  • #2
    The insert size is irrelevant for what you are trying to do.

    The BAMs contain the alignments computed by your aligner of choice. The fastq contain the raw reads and associated qualities generated by your sequencer.

    Merging one or the other is not equivalent. Besides the alignments, your BAM will contain (if saved) useful metadata about your libraries, reference genome used, alignment tool, etc ... (check the SAM spec).

    The specification also supports keeping track of groups of reads that belong to a specific library.
    -drd
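
For reference, read groups are declared in the SAM/BAM header as `@RG` lines, and each alignment points at one of them through its `RG:Z:` tag. Below is a minimal sketch of what such header lines could look like for two runs of the same library. The IDs (`runA`, `runB`) and the sample, library and platform-unit names are made up, and `make_rg_line` is a hypothetical helper, not part of samtools.

```shell
# Print one tab-separated SAM @RG header line.
# ID is the read group identifier; SM (sample), LB (library) and
# PU (platform unit) are optional tags defined in the SAM spec.
# Arguments (all illustrative): ID SM LB PU
make_rg_line() {
    printf '@RG\tID:%s\tSM:%s\tLB:%s\tPU:%s\n' "$1" "$2" "$3" "$4"
}

# Two sequencing runs of the same sample and library (assumed names).
make_rg_line runA sample1 lib1 flowcell1.lane1
make_rg_line runB sample1 lib1 flowcell2.lane3
```

Lines like these can be collected in a header file and supplied at merge time (samtools merge has a `-h` option for this), so reads from each run should stay distinguishable after merging.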


    • #3
      Hi Drio,

      Thanks for the helpful explanation.

      Besides the metadata about the libraries, would merging the FASTQ files and then aligning be equivalent to aligning the individual FASTQ files first and then merging the resulting BAM files?

      CSoong


      • #4
        Originally posted by csoong
        Besides the metadata about the libraries, would merging the FASTQ files and then aligning be equivalent to aligning the individual FASTQ files first and then merging the resulting BAM files?
        It depends on the aligner. For straightforward aligners (Bowtie, BWA, etc.) the two operations should give the same result, since each sequence is aligned independently. However, spliced aligners (for example TopHat) use the combined evidence from the whole of an aligned file to detect potential splice junctions, so in some cases you wouldn't get the same result from aligning the files independently as from aligning them together.
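
If you go the FASTQ route, merging is plain concatenation. Here is a minimal sketch, assuming uncompressed FASTQ with four lines per record and hypothetical file names. `merge_fastq` and `count_reads` are illustrative helpers, not existing tools. For paired-end data, list the runs in the same order for both mate files so the pairs stay in sync.

```shell
# Concatenate FASTQ files from several runs into one file.
# Usage: merge_fastq OUT IN1 IN2 ...
merge_fastq() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Count records in an uncompressed FASTQ file (4 lines per record).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Example with assumed file names (same run order for R1 and R2):
#   merge_fastq merged_R1.fastq runA_R1.fastq runB_R1.fastq
#   merge_fastq merged_R2.fastq runA_R2.fastq runB_R2.fastq
#   count_reads merged_R1.fastq   # should equal count_reads merged_R2.fastq
```

Checking that both merged files hold the same number of records is a cheap sanity test before aligning.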


        • #5
          good to know. thanks.


          • #6
            Hi,

            I did a little test and found that the alignment results are slightly different between (A) merging independently produced BAM files and (B) merging the FASTQ files before producing the BAM. (I used the bwa 0.5.8c aligner and samtools 0.1.12a.)

            The difference is so slight that downstream analysis may not be affected. However, given simonandrews's point that BWA aligns reads independently, the result is unexpected. Any thoughts on why there is a slight difference? Below is the output of samtools idxstats for groups (A) and (B).

            group (A): ~/Downloads/samtools-0.1.12a/samtools idxstats merge-bam-files.bam
            chr1 249250621 2163811 47135
            chr2 243199373 2269463 49964
            chr3 198022430 1765679 29872
            chr4 191154276 1652124 41892
            chr5 180915260 1491753 25751
            chr6 171115067 1537865 25355
            chr7 159138663 1492743 28693
            chr8 146364022 1341856 23936
            chr9 141213431 1172519 31277
            chr10 135534747 1494271 51130
            chr11 135006516 1290433 25822
            chr12 133851895 1244998 21283
            chr13 115169878 820855 12780
            chr14 107349540 854953 14756
            chr15 102531392 827560 16404
            chr16 90354753 946573 21926
            chr17 81195210 894572 20000
            chr18 78077248 714390 15820
            chr19 59128983 698790 15445
            chr20 63025520 650961 9911
            chr21 48129895 380978 8569
            chr22 51304566 442003 8882
            chrX 155270560 697698 17739
            chrY 59373566 232017 22257
            chrMT 16571 85978 1492
            * 0 0 1401138
            group (B): !~/Downloads/samtools-0.1.12a/samtools idxstats merge-fastq-first.bam
            chr1 249250621 2163772 47094
            chr2 243199373 2269455 49992
            chr3 198022430 1765768 29921
            chr4 191154276 1652083 41864
            chr5 180915260 1491797 25754
            chr6 171115067 1537813 25290
            chr7 159138663 1492795 28695
            chr8 146364022 1341854 23994
            chr9 141213431 1172460 31211
            chr10 135534747 1494343 51211
            chr11 135006516 1290482 25840
            chr12 133851895 1245085 21310
            chr13 115169878 820821 12782
            chr14 107349540 854905 14724
            chr15 102531392 827450 16386
            chr16 90354753 946565 21872
            chr17 81195210 894580 20008
            chr18 78077248 714386 15822
            chr19 59128983 698839 15451
            chr20 63025520 650978 9913
            chr21 48129895 380980 8598
            chr22 51304566 441970 8872
            chrX 155270560 697602 17725
            chrY 59373566 232109 22286
            chrMT 16571 85953 1474
            * 0 0 1401138
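
To see the per-chromosome differences at a glance, the two idxstats outputs can be subtracted column by column. This is a sketch using awk, assuming both outputs were saved to text files and list the same references. The file names are made up and `idxstats_diff` is a hypothetical helper.

```shell
# Print, for each reference in the second file, the change in the
# mapped (column 3) and unmapped (column 4) counts relative to the
# first file. Output columns: name, mapped(B)-mapped(A), unmapped(B)-unmapped(A)
# Usage: idxstats_diff A.idxstats.txt B.idxstats.txt
idxstats_diff() {
    awk 'NR == FNR { mapped[$1] = $3; unmapped[$1] = $4; next }
         { printf "%s\t%d\t%d\n", $1, $3 - mapped[$1], $4 - unmapped[$1] }' "$1" "$2"
}

# Example with assumed file names:
#   idxstats_diff groupA.idxstats.txt groupB.idxstats.txt
```

For chr1 in the output above this gives -39 mapped and -41 unmapped reads, while the last row (reads where neither end mapped) is identical in both groups.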


            • #7
              I'm not too familiar with BWA, but I know that in Bowtie there are some circumstances where it will select a random hit from an equally good set of potential matches, which can lead to getting slightly different results from repeating the same run. Have you tried rerunning the same file through BWA to see if you get exactly the same result?


              • #8
                BWA also picks a random alignment when there are multiple equally good matches. But I am not sure how that would change those numbers from idxstats.

                I am also not sure what the last column (unmapped reads) means. Why are they assigned to a specific chromosome?
                -drd


                • #9
                  Simon, the results are from the same files - file A and file B. I either merge A and B as fastQ or merge A and B as BAM.

                  Drio, I am not 100% sure either, but I think the last column, where the count is associated with a chromosome, shows reads whose mate maps confidently to that chromosome, as opposed to the last row, which shows reads where neither end of the pair mapped.


                  • #10
                    Originally posted by csoong
                    I am not 100% sure either, but I think the last column, where the count is associated with a chromosome, shows reads whose mate maps confidently to that chromosome, as opposed to the last row, which shows reads where neither end of the pair mapped.
                    You mean the third column shows reads where both ends map and the fourth column shows reads where only one end of the pair maps? Then, for single-end data, both columns should display the same values.

                    To confirm you can use samtools:

                    Code:
                    $ samtools view -f3 merge-bam-files.bam | awk '$3=="chr1"' | wc -l
                    # expected under this reading: 2163811
                    $ samtools view -f9 merge-bam-files.bam | awk '$3=="chr1"' | wc -l
                    # expected under this reading: 47135
                    -drd


                    • #11
                      Hi,
                      I tried -f1, -f3 and -f9. The -f3 count does not match idxstats; see the output below.

                      !~/Downloads/samtools-0.1.12a/samtools view -f3 merge.bam | awk '$3=="chr1"'| wc -l
                      2061072

                      !~/Downloads/samtools-0.1.12a/samtools view -f1 merge.bam | awk '$3=="chr1"'| wc -l
                      2210946

                      !~/Downloads/samtools-0.1.12a/samtools view -f9 merge.bam | awk '$3=="chr1"'| wc -l
                      47135


                      • #12
                        Try:

                        Code:
                        $ samtools view -F5  merge.bam | awk '$3=="chr1"'| wc -l
                        That plus 2061072 should equal 2163811
                        -drd


                        • #13
                          odd:
                          !~/Downloads/samtools-0.1.12a/samtools view -F5 merge.bam | awk '$3=="chr1"'| wc -l
                          0

                          seems like the middle column in idxstats is a little mysterious...
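
For what it's worth, the numbers in this thread fit the samtools documentation for idxstats, which describes the columns as reference name, sequence length, number of mapped reads and number of unmapped reads. An unmapped read whose mate is mapped is conventionally stored at the mate's position, which is how it ends up counted against a chromosome. The sketch below decodes the flag filters used above and checks the chr1 arithmetic. `describe_flag` is a hypothetical helper that covers only the four lowest SAM flag bits.

```shell
# Name the SAM flag bits (lowest four only) set in a decimal value.
describe_flag() {
    f=$1
    out=""
    if [ $((f & 1)) -ne 0 ]; then out="$out paired"; fi
    if [ $((f & 2)) -ne 0 ]; then out="$out proper_pair"; fi
    if [ $((f & 4)) -ne 0 ]; then out="$out unmapped"; fi
    if [ $((f & 8)) -ne 0 ]; then out="$out mate_unmapped"; fi
    echo "${out# }"
}

describe_flag 3   # bits required by -f3: paired proper_pair
describe_flag 9   # bits required by -f9: paired mate_unmapped
describe_flag 5   # bits excluded by -F5: paired unmapped

# chr1: idxstats mapped + unmapped, compared with the -f1 count above
echo $((2163811 + 47135))   # 2210946
```

Since -F5 drops every read that has the paired bit set, a count of 0 is what you would expect on an all-paired dataset. The -f1 count for chr1 (2210946) equals the sum of the two idxstats columns, which suggests the third column counts all mapped reads on the chromosome, not only the properly paired ones that -f3 selects. Under that reading, filtering with -F4 instead should reproduce the third column directly.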


                          • #14
                            Originally posted by simonandrews
                            It depends on the aligner. For straightforward aligners (Bowtie, BWA, etc.) the two operations should give the same result, since each sequence is aligned independently. However, spliced aligners (for example TopHat) use the combined evidence from the whole of an aligned file to detect potential splice junctions, so in some cases you wouldn't get the same result from aligning the files independently as from aligning them together.
                            I am facing the exact situation discussed in this thread; in fact, I started a new thread because I was unaware of this one. Simon, your reply is useful, especially the point about spliced aligners. What's your opinion on aligning and then merging when one is looking only for unique matches (as in ChIP-seq)? Wouldn't even Bowtie/BWA give different results?


                            • #15
                              Originally posted by epi
                              I am facing the exact situation discussed in this thread; in fact, I started a new thread because I was unaware of this one. Simon, your reply is useful, especially the point about spliced aligners. What's your opinion on aligning and then merging when one is looking only for unique matches (as in ChIP-seq)? Wouldn't even Bowtie/BWA give different results?
                              Yes. I wouldn't generally trust mixing different aligners in the same analysis: although they operate on similar metrics, in many cases they each have their own biases. Even if you're looking at two datasets from the same aligner, I'd still want to know that they were run with the same options, as this too can have an effect. To some extent the same problem exists when running different read lengths in the same project. I'd always prefer to work on data from the same platform, with the same run type, analysed with the same aligner. That isn't to say that you can't do useful analysis if the aligners don't match, but it is definitely going to increase the noise in the results.
