  • How to separate the reads from each individual in

    Hi all,
    Illumina reads from 12 different genotypes were aligned to a reference genome to call SNPs. I combined all the reads before alignment. After variant calling with samtools mpileup, how can I separate the 12 individuals in the .pileup or .vcf file?
    I wonder if anyone could help me.

  • #2
    You are going to need separate .bam files for each sample.

    You can use grep to pull out subsets of reads based on the read names.



    • #3
      Thank you. A .bam is a binary file. How can I grep the read names?



      • #4
        Originally posted by jdpr_100
        Thank you. A .bam is a binary file. How can I grep the read names?
        Code:
        samtools view file.bam | grep 'name'

        Depending on how long your mapping took it may be simpler to re-map each genotype individually.
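A minimal sketch of that read-name filter, assuming the reads from each genotype were given a common name prefix before alignment. The prefix `geno01_` and the file names are hypothetical. Unlike a bare grep, the function passes SAM header lines through, so the result can be converted back to BAM.

```shell
# Keep SAM header lines plus alignments whose read name (column 1)
# starts with the given prefix. Reads SAM text on stdin, writes SAM to stdout.
filter_sam_by_prefix() {
    awk -F'\t' -v prefix="$1" '
        /^@/ { print; next }                # header lines pass through
        index($1, prefix) == 1 { print }    # read name begins with prefix
    '
}

# Intended use with samtools 1.x (not run here; names are hypothetical):
#   samtools view -h combined.bam | filter_sam_by_prefix "geno01_" \
#       | samtools view -b -o geno01.bam -
```

Matching on column 1 only avoids accidental hits where the same string appears in another field of the alignment record.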



        • #5
          Originally posted by jdpr_100
          Hi all,
          Illumina reads from 12 different genotypes were aligned to a reference genome to call SNPs. I combined all the reads before alignment.
          This was a mistake. For an analysis like this, you should align each sample individually, and each alignment command should include a Read Group ID (RG) to be added to the alignments. After all the alignments are completed, you then merge the per-sample BAMs into a single BAM for analysis. Each alignment in this merged BAM will retain its RG ID. This is the standard method for performing SNP analysis with multiple samples. Without some way to identify which reads came from which sample, how can your variant caller assign genotypes to specific samples? RG IDs are mandatory if you are using GATK.

          My suggestion would be to go back to the beginning and properly align each sample individually, using your alignment tool's (e.g. bwa, bowtie2) options to assign a unique RG ID to each alignment. Then merge these BAMs and continue with your SNP analysis.
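For illustration, here is a small sketch of what those read-group options look like for the two aligners named above. It assumes the sample name doubles as the RG ID, and the sample name `geno01` is hypothetical. It only prints the options and runs no aligner.

```shell
# Print the read-group options for one sample.
# bwa mem takes a whole @RG line via -R, with literal "\t" separators;
# bowtie2 takes the ID via --rg-id and further fields via --rg.
rg_opts() {
    local aligner="$1" sample="$2"
    case "$aligner" in
        bwa)     printf '%s\n' "-R '@RG\\tID:$sample\\tSM:$sample'" ;;
        bowtie2) printf '%s\n' "--rg-id $sample --rg SM:$sample" ;;
        *)       echo "unknown aligner: $aligner" >&2; return 1 ;;
    esac
}

rg_opts bwa geno01
rg_opts bowtie2 geno01
```

The SM field is what the variant caller uses as the sample name, so it should be set even when it is identical to the ID.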

          Originally posted by jdpr_100
          After variant calling with samtools mpileup, how can I separate the 12 individuals in the .pileup or .vcf file?
          I wonder if anyone could help me.
          If there are RG IDs then it is stupid easy to select specific read groups:

          Code:
          samtools view -b -r <STR> -o readGroup.bam input.bam
          where <STR> is the ID of the read group you want in the output BAM (-b requests BAM output)
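To repeat that for every read group, something along these lines might do. It is a sketch that reads the header text, lists the RG IDs, and prints one `samtools view -r` command per ID for review. The function names and `merged.bam` are made up for the example.

```shell
# List the read-group IDs found in SAM header text on stdin.
list_rg_ids() {
    awk -F'\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)
    }'
}

# Read RG IDs on stdin and print one extraction command per read group.
# Nothing is executed; review the output, then pipe it to sh if it looks right.
split_cmds() {
    local bam="$1" id
    while read -r id; do
        printf 'samtools view -b -r %s -o %s.bam %s\n' "$id" "$id" "$bam"
    done
}

# Intended use (requires samtools; file name is hypothetical):
#   samtools view -H merged.bam | list_rg_ids | split_cmds merged.bam
```

Recent samtools releases also ship a `samtools split` subcommand that writes one file per read group, which may be simpler if it is available.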



          • #6
            Hi all, Thanks for your suggestions.
            For me, the 12 samples are only a pilot experiment. In total, we have 300 samples. If I align each sample individually, that would be time-consuming. I am trying to figure out an efficient workflow for variant calling. Is there an easy way to deal with my case?

            I am using samtools to call SNPs. Before alignment, I gave every read from the same individual a sample-specific name, and I now have the .bam. Can I split the .bam as swbarnes2 and N311V suggested and then run the variant calling procedure on each individual separately?



            • #7
              Originally posted by jdpr_100
              Hi all, Thanks for your suggestions.
              For me, the 12 samples are only a pilot experiment. In total, we have 300 samples. If I align each sample individually, that would be time-consuming. I am trying to figure out an efficient workflow for variant calling. Is there an easy way to deal with my case?
              No offense, jdpr, but properly setting up your analysis workflow to use the capabilities of the analysis tools efficiently is far less time-consuming than trying to wrestle with output that is not optimally formatted. Further, following your path you are simply running one alignment, sorting, and then running 300 SNP analyses. That hardly seems any less time-consuming than running 300 alignments, merging, and running a single SNP analysis, with the output of that analysis retaining information about the genotypes identified in each sample. You can set up a script (bash, perl, python, whatever) to efficiently pipeline the process of running the alignments and tagging them with RG IDs.
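As a rough sketch of such a script: the function below reads sample names, one per line, and prints the per-sample alignment commands followed by a single merge. It assumes bwa mem, single-end reads in `<sample>.fastq`, and a reference called `ref.fa`, all of which are placeholders. It prints the plan and executes nothing.

```shell
# Read sample names on stdin and print an align-then-merge plan.
plan_pipeline() {
    local ref="$1" sample rg bams=""
    while read -r sample; do
        [ -n "$sample" ] || continue                # skip blank lines
        # Single-quoted pieces keep "\t" literal, as bwa mem -R expects.
        rg='@RG\tID:'"$sample"'\tSM:'"$sample"
        printf "bwa mem -R '%s' %s %s.fastq | samtools sort -o %s.bam -\n" \
            "$rg" "$ref" "$sample" "$sample"
        bams="$bams $sample.bam"
    done
    # One merged BAM for the multi-sample SNP call.
    printf 'samtools merge merged.bam%s\n' "$bams"
    printf 'samtools index merged.bam\n'
}

# Example with two hypothetical samples:
printf 'geno01\ngeno02\n' | plan_pipeline ref.fa
```

With 300 samples the same function works on a 300-line sample list. After merging, it is worth checking with `samtools view -H merged.bam` that all the @RG lines made it into the merged header.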

              Multi-sample SNP calling programs are designed to work with BAM files in which the samples are identified by Read Group IDs so they can properly assign genotypes per sample. Here is what the samtools documentation has to say about multi-sample SNP calling:

              Suppose we have reference sequences in ref.fa, indexed by samtools faidx, and position sorted alignment files aln1.bam and aln2.bam, the following command lines call SNPs and short INDELs:
              Code:
              samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf  
              bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
              where the -D option sets the maximum read depth to call a SNP. SAMtools acquires sample information from the SM tag in the @RG header lines. One alignment file can contain multiple samples; reads from one sample can also be distributed in different alignment files. SAMtools will regroup the reads anyway. In addition, if no @RG lines are present, each alignment file is taken as one sample.
              You can see that when using samtools it is not strictly necessary to use RG and SM tags to identify your samples; you can simply provide multiple BAM files on the command line, with each BAM representing a single sample. However, with 300 samples I wouldn't try it this way; I would stick with a single merged BAM file and RG IDs.
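Before calling SNPs on a merged BAM, a quick sanity check along these lines could confirm that every alignment resolves to a sample. This sketch maps each record's `RG:Z:` tag to the SM value of its @RG header line and counts alignments per sample. The function name and `merged.bam` are made up for the example.

```shell
# Count alignments per sample in SAM text (header + records) on stdin.
reads_per_sample() {
    awk -F'\t' '
        /^@RG/ {                                    # build the ID -> SM map
            id = ""; sm = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                if ($i ~ /^SM:/) sm = substr($i, 4)
            }
            sample[id] = sm
            next
        }
        /^@/ { next }                               # other header lines
        {
            rg = ""
            for (i = 12; i <= NF; i++)              # optional tags start at column 12
                if ($i ~ /^RG:Z:/) rg = substr($i, 6)
            name = (rg in sample) ? sample[rg] : "(no sample)"
            count[name]++
        }
        END { for (s in count) print s "\t" count[s] }
    ' | sort
}

# Intended use (requires samtools; file name is hypothetical):
#   samtools view -h merged.bam | reads_per_sample
```

Any alignments reported under "(no sample)" have no RG tag, or an RG ID that is missing from the header, and cannot be assigned to a sample by the variant caller.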



              • #8
                I really appreciate your suggestions.
