Seqanswers Leaderboard Ad

**swbarnes2** · 03-22-2012, 08:18 AM

You have 85 million reads, and only 1600 mapped. That's obviously your problem. Maybe these reads aren't human?

**Rocketknight** · 03-22-2012, 09:25 AM

Code:

$ bwa aln -t 8 SRR292250.fastq > SRR292250.sai

You don't seem to have specified a reference genome for BWA aln here.

If that was just a typo in your post and you did specify a reference genome when you ran it, could it possibly be an error with the bwa index? BWA changed its index format with version 0.6.0, so if you're running 0.6.0+ with a pre-0.6.0 index, that might cause something weird like this. I've never tried it myself though, so I've no idea what actually happens when you do that.

**aituka** · 03-23-2012, 02:18 AM

Thanks for your hint Rocketknight.
First, as you suggested, yes it was a typo: i provided the reference genome.

But as you said, i might have indexed with a different version of BWA. so i reindexed the reference genome with BWA 0.6.1. however, I don't get the right files:

$ bwa index -a bwtsw human_g1k_v37.fasta

give me:
human_g1k_v37.fasta human_g1k_v37.fasta.ann human_g1k_v37.fasta.pac
human_g1k_v37.fasta.amb human_g1k_v37.fasta.bwt human_g1k_v37.fasta.sa

I don't know why, but there is no .rbwt file.
did i missed something?
I will check with BWA version 0.6.0
thx
tuka.

**Rocketknight** · 03-23-2012, 09:22 AM

I think the RBWT file may no longer be created in the new index format. You should be able to continue with aln or samse without it.

**aituka** · 03-24-2012, 04:21 AM

Hi,
i re-apply the pipeline
First, download of one HUMAN exome: from http://sra.dnanexus.com/studies/SRP007298 , http://dnanexus-sra.commondatastorag...92250.fastq.gz

second i installed bwa6.1 and I realigned my reference genome

bwa index -a bwtsw human_g1k_v37.fasta

then i computed the bam:

bwa aln -t 8 /gatk/ensembl_g37/bwa6.1/human_g1k_v37.fasta SRR292250.fastq > SRR292250.sai

bwa samse -r "@RG\tID:IDa\tSM:SM\tPL:Illumina" /gatk/ensembl_g37/bwa6.1/human_g1k_v37.fasta SRR292250.sai SRR292250.fastq >SRR292250.sam

java -Xmx4g -Djava.io.tmpdir=/tmp -jar /home/insilico/nextgen/picard-tools-1.62/SortSam.jar SO=coordinate INPUT=SRR292250.sam OUTPUT=SRR292250.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

Still, I obtain the same result:

$samtools flagstat SRR292250.bam
85493722 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1680 + 0 mapped (0.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

everything is done with bwa6.1

Maybe the fastq is not in the good format, below are the first reads

@SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y length=100
GGGTCATGTGGGCTCATTATTTTCCTCTTTCTTTTACCCAAGTGGACAAGGGTCTTGAACTCCTGGTTTGCACATCTAAGGGGCTGCGACAGGACCTGTG
+SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y length=100
HHHHHHHHFHHHGHH6GGGHHHHHHHHFHGEHHHHHHGFFDG@BGDEGGGHHFHHHHHHHHHHHHHHHDHHEHHGHHHHHHHHHHHHBCHCFFEFFFHFG
@SRR292250.2 80DYAABXX_000168:7:1:10000:103139:Y length=100
ACCTGCAGGAACTCTCCAGTGGACTGGCCGTAGCGGTGCACGCTATGCTCGGTCATGGTGTAAGCATTTTCTCAAGCTATTTTCCTTCTTGCCTCATCTG
+SRR292250.2 80DYAABXX_000168:7:1:10000:103139:Y length=100
GHHHHHHHHGHHHGHHHHHH?HEGHHHHHGEFBFFHFHBHEFB6F;DADAHGHHHGGGEGFFGGGHHHHHHHHBHHHHHEHHHHHHHHDHFEHHHFHHHH
@SRR292250.3 80DYAABXX_000168:7:1:10000:104494:Y length=100
GAAGAGACCATCATCTTTGCATAAACGAAGTGGCACTGATACATCATTCTATTGCTTTGTTTTGGAGTCCCCCTTTTCTTCTTTCCTCTATAGAATGATG
+SRR292250.3 80DYAABXX_000168:7:1:10000:104494:Y length=100
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHGHHHGHHFHFHFHHHHHEHHHHHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHHHHHHGGHHHHHHHHH
@SRR292250.4 80DYAABXX_000168:7:1:10000:105301:Y length=100
AGAGAAGGCACCGGGCTCACGGAGGGCTGCCTGGTGTCCGTGGACGACTTACTGCTGGCTCTGCGGCGGCCAGTACTGCAAGAATGTGTCCACGTCCACC
+SRR292250.4 80DYAABXX_000168:7:1:10000:105301:Y length=100
GHHHHHHHHGHHHHHHHHHHGHHHHFHHFHHHFHEEFBGGFFGFHHHGFHHHHHHHHHHHHHHFHFHHHAC9A5?8.:<9E<AD*;391;==A5=<>.6B
@SRR292250.5 80DYAABXX_000168:7:1:10000:107861:Y length=100
GCCTTTTTTTCCTTTTATCTGTGGTTTAAATTCCTAAGTTTGTATGTCTCAAGTGAGCATAGCAAGTATATTTTAAAACATGTTCATTTCAGTCGTCACA
+SRR292250.5 80DYAABXX_000168:7:1:10000:107861:Y length=100
HFHHHHHHHHHHHHHGHHHHHHFHHHHFHHHHHFDFEHEHHGFAHGFDCFHHHHHHHHHHHHHFHHHHHHHHHHHEGHGHHHHHGHHFGHGHGBHGGBG@

any idea?
thanks
tuka.

**aituka** · 03-24-2012, 04:22 AM

**Rocketknight** · 03-24-2012, 11:38 AM

I threw the first 250000 reads from that file at bwa on my own system and only 2 of them mapped to the human genome for me. The read format looks fine, but something is definitely wrong with the reads. Possibly untrimmed barcode sequences? Or they could be mislabelled results from another run. Either way, this issue isn't your fault.

If you're looking for proper exome fastq files to test your pipeline on, try:

ftp://ftp.1000genomes.ebi.ac.uk/vol1....filt.fastq.gz (1st reads of pair)

ftp://ftp.1000genomes.ebi.ac.uk/vol1....filt.fastq.gz (2nd reads of pair)

ftp://ftp.1000genomes.ebi.ac.uk/vol1....filt.fastq.gz (unpaired reads)

**aituka** · 03-26-2012, 12:00 PM

thanks for your answer Rocketknight.
You were right. It seems that the list of command I am executing are ok. The problem was with the fastq coming from the Hapmap SRP007298.
This leads to two I have two questions.

1/First what do you mean by "Possibly untrimmed barcode sequences". Is there a way to check if the barcode are trimmed, and if not, an algorithm to trim them?
In fact I was suspecting that problem and I tested the same pipeline to samples coming from two other datasets form SRA: SRR004076 and SRP008162 (both coming from SRA). Statistics from samtools flagstat gives me respectively '0 mapped (0.00%)' and '115 + 0 mapped (0.00%:-nan%)'. What is wrong?

2/ I checked the samples you suggested and indeed it do work. I tried with the .filt.fastq.gz file coming form the 1000 genome project and with the fastq file coming from sra (SRR231270 coming from SRP004061) which is not filtered. My question is: what is that filtering doing? results from flagstat are without filtering:

148318736 in total
0 QC failure
0 duplicates
141922418 mapped (95.69%)
148318736 paired in sequencing
74159368 read1
74159368 read2
139291198 properly paired (93.91%)
141011182 with itself and mate mapped
911236 singletons (0.61%)
1556783 with mate mapped to a different chr
1448948 with mate mapped to a different chr (mapQ>=5)

with .filt.

146907704 in total
0 QC failure
0 duplicates
140747864 mapped (95.81%)
146907704 paired in sequencing
73453852 read1
73453852 read2
138143478 properly paired (94.03%)
139853076 with itself and mate mapped
894788 singletons (0.61%)
1547655 with mate mapped to a different chr
1440436 with mate mapped to a different chr (mapQ>=5)

3/ additional question: any idea why sometimes flagstat's output has 2 values per line and somtimes just one
eg:

85493722 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1680 + 0 mapped (0.00%:-nan%)

vs

146907704 in total
0 QC failure
0 duplicates
140747864 mapped (95.81%)

thank you for any hints
tuka.

**Rocketknight** · 03-27-2012, 06:11 AM

Trimming barcodes/indexes requires you to know their length and position in the reads. For standard Illumina primers, the indexes won't be within the normal reads and will instead be read with a different index read primer, so trimming is unneccessary, unless your insert size is so small that you read through the entire insert and out into the adapter at the other side. For people making their own barcodes, it will differ depending on the exact barcode structure used.

As for what the difference between the .filt and the unfiltered files is, I'm afraid I don't actually know.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Problem sequencing Hapmap Exome - no variant found!

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News