**swbarnes2** · 07-08-2014, 01:41 PM

You are going to need separate .bam files for each sample.

You can use grep to pull out subsets of reads based on the read names.

**jdpr_100** · 07-08-2014, 02:16 PM

Thank you. A .bam file is binary. How can I grep the read names?

**N311V** · 07-08-2014, 07:55 PM

Originally posted by jdpr_100

Thank you. A .bam file is binary. How can I grep the read names?

samtools view file.bam | grep 'name'

Depending on how long your mapping took it may be simpler to re-map each genotype individually.
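A plain grep on the whole line has two drawbacks: `samtools view` without `-h` drops the header needed to write a BAM again, and the pattern can match anywhere in the record, not just in the read name. Here is a minimal sketch that filters on the read-name column only, assuming each sample's reads share a name prefix. The `G01_` prefix, the file names, and the `filter_sam_by_prefix` helper are made up for illustration.

```shell
# Keep SAM header lines (they start with "@") plus every alignment whose
# read name (column 1, QNAME) begins with the given prefix.
# Hypothetical helper; the prefix naming scheme is an assumption.
filter_sam_by_prefix() {
    awk -v prefix="$1" -F '\t' '
        /^@/                   { print; next }   # pass header lines through
        index($1, prefix) == 1 { print }         # QNAME starts with prefix
    '
}

# Intended use (needs samtools 1.x; older versions also need -S to read SAM):
#   samtools view -h all_samples.bam \
#       | filter_sam_by_prefix "G01_" \
#       | samtools view -b -o G01.bam -
```

The filter itself only touches SAM text, so you can try it on a few lines of `samtools view -h` output before running it on the full file.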

**kmcarr** · 07-09-2014, 06:32 AM

Originally posted by jdpr_100

Hi all,
Illumina reads from 12 different genotypes were aligned to a reference genome to call SNPs. I combined all the reads before alignment.

This was a mistake. For an analysis such as this, you should align each sample individually, and each alignment command should include a Read Group (RG) ID to be added to the alignment. After all the alignments are completed, you then merge the per-sample BAMs into a single BAM for analysis. Each alignment in this merged BAM will retain its RG ID. This is the standard method for performing SNP analysis with multiple samples. Without some way to identify which reads came from which sample, how can your variant caller assign genotypes to specific samples? RG IDs are mandatory if you are using GATK.

My suggestion would be to go back to the beginning and properly align each sample individually, using your alignment tool's (e.g. bwa, bowtie2) options to assign a unique RG ID to each alignment. Then merge these BAMs and continue with your SNP analysis.
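As a rough sketch of what tagging an alignment with a read group can look like with bwa: `bwa mem` takes the read-group line through `-R`, with tabs written as the two characters `\t`. The helper names, the use of the sample name for both ID and SM, and the `PL:ILLUMINA` value are assumptions for illustration. The `samtools sort -o` form assumes samtools 1.x.

```shell
# Build the read-group line for `bwa mem -R`. bwa expects the tab separators
# written literally as backslash-t, so printf emits them unexpanded.
# Assumption: the sample name doubles as both RG ID and SM.
make_read_group() {
    printf '@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA' "$1" "$1"
}

# Align one sample and write a coordinate-sorted BAM named <sample>.bam.
# Hypothetical helper: align_sample <sample> <ref.fa> <reads.fastq>
align_sample() {
    bwa mem -R "$(make_read_group "$1")" "$2" "$3" |
        samtools sort -o "$1.bam" -
}

# Example call (file names are placeholders):
#   align_sample G01 ref.fa G01.fastq
```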

Originally posted by jdpr_100

After variant calling with samtools mpileup, how can I split the 12 individuals in the .pileup or .vcf file?
I wonder if anyone could help me.

If there are RG IDs then it is stupid easy to select specific read groups:

Code:

samtools view -b -r <STR> -o readGroup.bam input.bam
where <STR> is the specific RG ID you want in the output BAM (-b requests BAM output; without it, older samtools versions write SAM)
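To repeat that command for every read group without typing the IDs by hand, the IDs can be read from the `@RG` lines of the BAM header. A minimal sketch, with hypothetical helper names and output files named after the RG ID:

```shell
# Print the ID of every @RG line found in SAM header text on stdin.
# The ID field is looked up by tag, since it need not be the first field.
list_read_group_ids() {
    awk -F '\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)
    }'
}

# Write one BAM per read group found in the header of the given BAM.
# Hypothetical helper: split_by_read_group <merged.bam>
split_by_read_group() {
    samtools view -H "$1" | list_read_group_ids | while read -r rg; do
        samtools view -b -r "$rg" -o "$rg.bam" "$1"
    done
}
```

Each pass rereads the whole input BAM, so for many read groups this is simple rather than fast.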

**jdpr_100** · 07-09-2014, 07:34 AM

Hi all, Thanks for your suggestions.
For me, the 12 samples are only a pilot experiment. In total, we have 300 samples. If I align each sample individually, that would be time-consuming. I am trying to figure out an effective workflow for calling variants. Is there an easy way to deal with my case?

I am using samtools to call SNPs. Before alignment, I gave every read from the same individual a specific name, and then produced the .bam. Can I split the .bam based on swbarnes2's and N311V's suggestions and then run the variant calling procedure on each individual separately?

**kmcarr** · 07-09-2014, 08:11 AM

Originally posted by jdpr_100

Hi all, Thanks for your suggestions.
For me, the 12 samples are only a pilot experiment. In total, we have 300 samples. If I align each sample individually, that would be time-consuming. I am trying to figure out an effective workflow for calling variants. Is there an easy way to deal with my case?

No offense jdpr, but properly setting up your analysis workflow to use the capabilities of the analysis tools efficiently is far less time-consuming than trying to wrestle with output which is not optimally formatted. Further, following your path you are simply running one alignment, sorting, and then running 300 SNP analyses. That hardly seems any less time-consuming than running 300 alignments, merging, and running a single SNP analysis, with the output of that analysis retaining information about the genotypes identified in each sample. You can set up a script (bash, perl, python, whatever) to efficiently pipeline the process of running the alignments and tagging them with RG IDs.
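One possible shape for such a script is a generator that prints the commands instead of running them, so the list can be checked and then handed to `sh` or a cluster scheduler. This is only a sketch under assumptions: one FASTQ per sample, sample names taken from the file names (`G01.fastq` gives `G01`), no spaces in paths, bwa as the aligner, and samtools 1.x syntax for `sort -o`. The `emit_pipeline` helper is hypothetical.

```shell
# Read FASTQ paths from stdin and print one alignment command per file,
# followed by a single merge command. Nothing is executed here.
# Usage: ls reads/*.fastq | emit_pipeline ref.fa > commands.sh
emit_pipeline() {
    ref=$1
    bams=""
    while read -r fq; do
        sample=$(basename "$fq" .fastq)          # G01.fastq -> G01
        rg="@RG\\tID:$sample\\tSM:$sample"       # literal \t, as bwa -R expects
        printf "bwa mem -R '%s' %s %s | samtools sort -o %s.bam -\n" \
            "$rg" "$ref" "$fq" "$sample"
        bams="$bams $sample.bam"
    done
    printf 'samtools merge merged.bam%s\n' "$bams"
}
```

With 300 samples the alignment lines are independent of one another, so they can run in parallel. Only the final merge has to wait for all of them.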

Multi-sample SNP calling programs are designed to work with BAM files in which the samples are identified by Read Group IDs so they can properly assign genotypes per sample. Here is what Samtools documentation has to say about multi-sample SNP calling:

Suppose we have reference sequences in ref.fa, indexed by samtools faidx, and position sorted alignment files aln1.bam and aln2.bam, the following command lines call SNPs and short INDELs:

Code:

samtools mpileup -uf ref.fa aln1.bam aln2.bam | bcftools view -bvcg - > var.raw.bcf  
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

where the -D option sets the maximum read depth to call a SNP. SAMtools acquires sample information from the SM tag in the @RG header lines. One alignment file can contain multiple samples; reads from one sample can also be distributed in different alignment files. SAMtools will regroup the reads anyway. In addition, if no @RG lines are present, each alignment file is taken as one sample.

You can see that when using samtools it is not strictly necessary to use RG and SM tags to identify your samples; you simply provide multiple BAM files on the command line, with each BAM representing a single sample. However, with 300 samples I wouldn't try it this way; I would stick with a single merged BAM file and RG IDs.
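Since the sample assignment comes from the SM tags, it may be worth checking which samples a merged BAM actually declares before calling variants. A small sketch (the `list_samples` helper and the file name are made up for illustration):

```shell
# Print each distinct SM (sample) value from the @RG lines of SAM header
# text on stdin. Several read groups may share one SM, so duplicates are
# collapsed with sort -u.
list_samples() {
    awk -F '\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SM:/) print substr($i, 4)
    }' | sort -u
}

# Intended use (needs samtools):
#   samtools view -H merged.bam | list_samples | wc -l
```

If the count is not the number of samples you expect, the read groups were probably not set per sample at alignment time.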

**jdpr_100** · 07-09-2014, 10:53 AM

I really appreciate your suggestions.

How to separate the reads from each individual in
