  • mpileup of BAM files through ftp

    Dear SEQanswers users,

    I am trying to work out case-control association tests based on the raw data rather than the genotype calls. The idea is to use the samtools mpileup feature to stream indexed 1000 Genomes BAM files from the web and create VCF files that are combined for cases and controls. We face two difficulties:

    1- We are limited by the number of FTP connections, so the work cannot be made fully parallel (we were thinking of one gene per job, but it simply does not work).

    2- Independently of the ftp issue, mpileup is very slow. Doing a genome-wide case control test is incredibly time consuming owing to the computations behind mpileup.

    Have others considered the same problem, and have solutions been found? Is basing the test on the published genotype calls really the only option? We are keen to go back to the BAM files if possible but it seems challenging.

    Thank you in advance for your help,

    Vincent

  • #2
    1. You should download the BAMs. You probably cannot open hundreds of BAMs via FTP at the same time.

    2. Disabling indel calling will help. However, to do an association test, you have to call SNPs in the first place. You should know that this alone takes hundreds to thousands of CPU days for every group doing 1000g SNP calling. Make sure you have a dedicated cluster.

    • #3
      Thanks.

      This isn't completely satisfying though, because it requires a local copy of gigantic files (how many TB for the whole 1,000G data? Are we really expected to have this available locally in every university?). I was hoping that the whole network-based approach would deliver, but if it does not work then there is little one can do, I suppose.

      And I appreciate that SNP calling takes time, but it seems overly long. Then again, my intuition is probably flawed, and there is probably a good reason why it takes so long. But for SNPs... it's essentially just counting reference vs non-reference alleles, so I am still surprised by the time these computations take.

      • #4
        Vincent, I think it's a little unfair to complain on the one hand about the vast amounts of data and the difficulty of storing it, and then on the other about the time taken by SNP calling algorithms to wade through said massive amounts of data!
        In my opinion these are good, well optimised algorithms which do take a lot of factors into account, and do not just simply test reference vs non ref reads.

        Perhaps it might be very much more tractable to use the genotype calls done by the 1000 genomes team after all.

        Good luck with your ambitious project!

        • #5
          You may dramatically speed things up by disabling indel calling and BAQ computation. BAQ is necessary to get good calls, especially for low-coverage samples, but for association you may not care too much. But even so, tens of CPU days is the minimum.
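A minimal sketch of what such an invocation could look like, written as a Python helper that only builds the argument list. It assumes the `-I` (skip indel calling) and `-B` (disable BAQ) options of `samtools mpileup` as they existed in the samtools 0.1.x series; newer releases moved variant calling to `bcftools mpileup`, so check the flags against your installed version. The function name, file names and region are hypothetical.

```python
from typing import List, Sequence


def build_mpileup_cmd(
    reference: str,
    bams: Sequence[str],
    region: str = "",
    skip_indels: bool = True,
    disable_baq: bool = True,
) -> List[str]:
    """Build (but do not run) a samtools mpileup command line.

    Flag names are an assumption based on samtools 0.1.x:
      -I  do not perform indel calling
      -B  disable BAQ computation
      -r  restrict the pileup to one region
      -f  indexed reference FASTA
    """
    cmd = ["samtools", "mpileup"]
    if skip_indels:
        cmd.append("-I")
    if disable_baq:
        cmd.append("-B")
    if region:
        cmd += ["-r", region]
    cmd += ["-f", reference]
    cmd += list(bams)
    return cmd


if __name__ == "__main__":
    # Hypothetical local files; print the command instead of executing it.
    print(" ".join(build_mpileup_cmd("ref.fa", ["case1.bam", "ctrl1.bam"], region="20:1-100000")))
```

The resulting list can be passed to `subprocess.run` once the BAMs are on local disk; keeping the builder separate from execution makes it easy to log or submit one command per region to a cluster scheduler.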

          As to data transfer, even if you keep a constant 5MB/sec download speed, downloading 10TB of data will take about a month, and this cannot be parallelized. Using Aspera should be several times faster. In addition, you are probably not allowed to open hundreds of BAMs directly from the NCBI FTP. Downloading the files to your local disk is the only sensible solution for now.
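The transfer-time estimate can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly constant rate; the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)


def transfer_days(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move total_bytes at a constant bytes_per_second."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    days = transfer_days(10 * TB, 5 * MB)
    print(f"{days:.1f} days")  # prints "23.1 days"
```

Roughly 23 days of uninterrupted transfer, so "about a month" is a fair figure once stalls and retries are allowed for.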

          If you want to process terabytes of data, you must have enough computing resources and skills in the first place. This is the deal. Another option is to try cloud computing. Someone told me that the pilot alignments are available at EC2.

          • #6
            Thank you all.

            Colin: I am sure the format is efficient, I did not mean otherwise. And yes I can work with the VCF calls, that's 99% fine I suppose. And if things look odd I can always go back to the proper full computation on a local basis and see what effect it may have.

            lh3: yes, computing resources are required. I was hoping, when I started looking into it, that the network-based interface would solve many of the issues, but apparently I was too optimistic. I guess that if I really want the raw data I will need to purchase a large amount of cheap storage to hold it all.

            This thread is useful for me to set my expectations at the right level, and be sure that I was not missing something obvious. So thank you again.

            • #7
              The network interface is mainly useful when

              1) you want to visualize alignments

              2) you want to download the sequences in a few small regions or related to a few genes

              This functionality is not meant to be used to process terabytes of data in this way.
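For the small-region use case (point 2), a sketch of how one might pull a handful of regions from a remote indexed BAM with `samtools view`. The URL, regions and output name below are placeholders, not real 1000 Genomes paths, and the helper only assembles the command rather than running it.

```python
from typing import List, Sequence


def build_region_fetch_cmd(bam_url: str, regions: Sequence[str], out_path: str) -> List[str]:
    """Build a samtools view command that extracts a few regions as BAM.

    Assumes samtools was built with remote-file support and that the
    matching .bai index sits next to the BAM on the server.
    """
    if not regions:
        raise ValueError("give at least one region; do not stream whole remote BAMs")
    return ["samtools", "view", "-b", "-o", out_path, bam_url, *regions]


if __name__ == "__main__":
    cmd = build_region_fetch_cmd(
        "ftp://example.org/path/to/sample.bam",  # placeholder URL
        ["20:1000000-1010000", "20:2000000-2005000"],  # placeholder regions
        "sample.subset.bam",
    )
    print(" ".join(cmd))
```

Refusing an empty region list is a deliberate choice here: without regions the command would stream the entire file, which is exactly the usage the post warns against.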
