SEQanswers

Go Back   SEQanswers > Applications Forums > Genomic Resequencing

Similar Threads
Thread Thread Starter Forum Replies Last Post
problems understanding pileup format pi101 Bioinformatics 2 11-14-2012 02:47 PM
samtools pileup BioSlayer Bioinformatics 9 01-16-2012 03:10 AM
samtools pileup output file charlie_sequencing Bioinformatics 1 02-10-2011 12:51 PM
SAMtool's pileup format - reference base question mfischer Bioinformatics 11 10-19-2010 04:12 PM
SAMtools - * character in pileup file JosieR Bioinformatics 9 09-08-2010 11:14 AM

Reply
 
Thread Tools
Old 06-14-2010, 10:05 AM   #1
skingan
Member
 
Location: Rochester, NY

Join Date: Feb 2010
Posts: 12
Default samtools problems with reference in pileup file

Hello,

I am using samtools to generate a pileup file from a Mosaik genome assembly. My problem is that the pileup file contains all "N's" in the reference sequence field (column 3). I have double checked that I am using the exact same multi-sequence fasta file as the reference for both Mosaik and samtools.

Has anyone ever run into this before? I am using MosaikSort and MosaikText to generate a sorted bam file for input into samtools. Without the reference, I cannot get SNP quality scores, which is the whole point of my analysis.

Thanks for your help,
Sarah
skingan@mail.rochester.edu
skingan is offline   Reply With Quote
Old 06-15-2010, 08:53 AM   #2
reneeg36
Junior Member
 
Location: Seattle

Join Date: Jun 2010
Posts: 2
Default

Hi,

I am having the same problem. I am mapping reads to a list of contigs from a de novo assembly using bwa. I have converted the bwa sam alignment file to a sorted bam file using samtools. When I try running pileup to identify variants, I get only N's in the reference base position. I have checked the reference fasta file and don't find anything wrong.

Thanks!

Renee
rdg@u.washington.edu
reneeg36 is offline   Reply With Quote
Old 06-15-2010, 09:06 AM   #3
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,283
Default

Quote:
Originally Posted by skingan View Post
Hello,

I am using samtools to generate a pileup file from a Mosaik genome assembly. My problem is that the pileup file contains all "N's" in the reference sequence field (column 3). I have double checked that I am using the exact same multi-sequence fasta file as the reference for both Mosaik and samtools.

Has anyone ever run into this before? I am using MosaikSort and MosaikText to generate a sorted bam file for input into samtools. Without the reference, I cannot get SNP quality scores, which is the whole point of my analysis.

Thanks for your help,
Sarah
skingan@mail.rochester.edu
Post the command you are using so we can help
Quote:
Originally Posted by reneeg36 View Post
Hi,

I am having the same problem. I am mapping reads to a list of contigs from a de novo assembly using bwa. I have converted the bwa sam alignment file to a sorted bam file using samtools. When I try running pileup to identify variants, I get only N's in the reference base position. I have checked the reference fasta file and don't find anything wrong.

Thanks!

Renee
rdg@u.washington.edu
Did you specify the FASTA file with the "-f" option?
nilshomer is offline   Reply With Quote
Old 06-15-2010, 09:16 AM   #4
reneeg36
Junior Member
 
Location: Seattle

Join Date: Jun 2010
Posts: 2
Default

Quote:
Originally Posted by nilshomer View Post
Post the command you are using so we can help

Did you specify the FASTA file with the "-f" option?
Yes, here is what I have been trying:

command:
samtools pileup -f reference.fa sorted.bam > raw.txt

output (e.g.):
chr7:150849-150991 22 N 2 GG CC


Renee
reneeg36 is offline   Reply With Quote
Old 06-16-2010, 12:59 PM   #5
skingan
Member
 
Location: Rochester, NY

Join Date: Feb 2010
Posts: 12
Default

my command:

samtools pileup -f myref.fas my_sorted_assembly.bam -c > mydata.pileup

my output (e.g.):

NT_033777 27895205 N A 24 0 6 17 aaaaaaaaaaaaaaaaa $))))))()')))")))
skingan is offline   Reply With Quote
Old 06-16-2010, 01:03 PM   #6
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,283
Default

Quote:
Originally Posted by skingan View Post
my command:

samtools pileup -f myref.fas my_sorted_assembly.bam -c > mydata.pileup

my output (e.g.):

NT_033777 27895205 N A 24 0 6 17 aaaaaaaaaaaaaaaaa $))))))()')))")))
Is there an N at that position in the reference?
nilshomer is offline   Reply With Quote
Old 06-16-2010, 01:06 PM   #7
skingan
Member
 
Location: Rochester, NY

Join Date: Feb 2010
Posts: 12
Default

No. All of the positions in the pileup file are N's and my reference sequences do not consist of all N's.
Sarah
skingan is offline   Reply With Quote
Old 07-07-2010, 01:03 PM   #8
SMHfrog
Junior Member
 
Location: Austin, TX

Join Date: Jun 2010
Posts: 2
Default

I had this same problem, and after seeing no solution here did some more digging, and have a possible solution for you.

I noticed that the ref.fa.fai file for my whole genome was 0 kb. The .fai is used by samtools when building the pileup. When I ran the command to re-build the .fai:

samtools faidx reference.fa

I got the following error message:

[fai_build_core] different line length in sequence 'scaffold_14'.
Segmentation fault

No doubt this same message occurred the first time I ran the pileup command (which also builds the .fai if it doesn't exist), but I apparently didn't pay attention. After that first time, the .fai file EXISTED so no errors were subsequently reported when I ran pileup again.

In my case, there was an extra line after scaffold_14. I removed this, and re-built the .fai using the samtools faidx command and then re-ran the pileup command. My pileup then contained the reference base as intended!

Hope this helps y'all find the solution to your problem.
Best,
Shannon
University of Texas at Austin
SMHfrog is offline   Reply With Quote
Old 07-16-2010, 07:17 AM   #9
thaley
Junior Member
 
Location: Boston

Join Date: Jul 2010
Posts: 4
Default

Ran into the same problem. It may be worth someone adding this to the faidx documentation regarding null strings in the reference or make the thrown error more descriptive.
thaley is offline   Reply With Quote
Old 07-22-2010, 05:50 AM   #10
skingan
Member
 
Location: Rochester, NY

Join Date: Feb 2010
Posts: 12
Default

It turned out to be a similar problem to the one SMHfrog had. In my reference file, each chromosome sequence was on a single line, so when samtools built the .fai file there was a segmentation fault because of the length of the sequence. I used a different reference with line breaks and it worked. I used the same reference file for the Mosaik run and the pileup build.
Sarah
skingan is offline   Reply With Quote
Old 08-06-2010, 04:48 AM   #11
hollandorange
Member
 
Location: Holland

Join Date: May 2010
Posts: 11
Default

I got the same problem.
chr17 418628 N 54
chr17 418629 N 58
chr17 418630 N 57
hollandorange is offline   Reply With Quote
Old 08-25-2010, 07:52 AM   #12
mmartin
Member
 
Location: Stockholm

Join Date: Aug 2009
Posts: 58
Default

I had the same problem. In my case, I had colons in the reference sequence names, something like "Region1:1-100". When I removed them, samtools pileup worked as expected.
mmartin is offline   Reply With Quote
Old 09-03-2010, 03:26 AM   #13
brutus
Junior Member
 
Location: Uppsala

Join Date: Aug 2010
Posts: 1
Default

I also had this experience, in my case the problem disappeared when I removed spaces in the reference sequence name.
brutus is offline   Reply With Quote
Old 11-25-2010, 09:12 AM   #14
smol
Junior Member
 
Location: Ascot

Join Date: Oct 2010
Posts: 3
Default

Hi
I'm having the same problem with Ns in my pileup file and have tried everything mentioned above (thanks for suggestions!). I am using:

./samtools pileup data.sorted.bam -f reference.fasta > data.pileup

My reference .fai file looks like this:

chr2L 49364325 7 60 61
chr2R 61545105 50187078 60 61
chr3L 41963435 112757942 60 61
chr3R 53200684 155420775 60 61
chrUNKN 42389979 209508147 60 61
chrX 24393108 252604632 60 61
chrY 237045 277404298 60 61

Any ideas?
smol is offline   Reply With Quote
Old 02-03-2011, 07:31 AM   #15
colindaven
Senior Member
 
Location: Germany

Join Date: Oct 2008
Posts: 300
Default

Here's another possible solution - the headers are not consistent between SAM/BAM and the original fasta:

Even though the reference file was the same one in both cases, sometimes aligners just write a substring out into the SAM file. Samtools seems to take the full header.

For example the first contiguous part of my genome header is
gi|110645304|ref|NC_002516.2|

However in my SAM file the aligner has only written
NC_002516.2

Samtools has written the full header to the .fa.fai index
gi|110645304|ref|NC_002516.2|

.. and this does not match.

Solution:

Try correcting the original header on the reference fasta to just the substring which the aligner uses.
eg
gi|110645304|ref|NC_002516.2|
to
NC_002516.2
colindaven is offline   Reply With Quote
Old 04-20-2011, 05:39 PM   #16
bgibb
Junior Member
 
Location: San Francisco, CA

Join Date: Jul 2010
Posts: 7
Default

I noticed the same problem when running pileup under SAMtools-0.1.15. However the problem does not seem to occur when running pileup under SAMtools-0.1.4 (using the same reference file, same BAM file and same command line options).

samtools-0.1.4/samtools pileup -s -f reference.fa sorted.bam > pileup.out
bgibb is offline   Reply With Quote
Old 05-05-2011, 11:11 AM   #17
smehr12
Junior Member
 
Location: New York

Join Date: May 2011
Posts: 4
Default

Quote:
Originally Posted by SMHfrog View Post
I had this same problem, and after seeing no solution here did some more digging, and have a possible solution for you.

I noticed that the ref.fa.fai file for my whole genome was 0 kb. The .fai is used by samtools when building the pileup. When I ran the command to re-build the .fai:

samtools faidx reference.fa

I got the following error message:

[fai_build_core] different line length in sequence 'scaffold_14'.
Segmentation fault

No doubt this same message occurred the first time I ran the pileup command (which also builds the .fai if it doesn't exist), but I apparently didn't pay attention. After that first time, the .fai file EXISTED so no errors were subsequently reported when I ran pileup again.

In my case, there was an extra line after scaffold_14. I removed this, and re-built the .fai using the samtools faidx command and then re-ran the pileup command. My pileup then contained the reference base as intended!

Hope this helps y'all find the solution to your problem.
Best,
Shannon
University of Texas at Austin
Hi all,
I have the same error.
samtools faidx bwa.ref/ref.fasta ref.fa

ERROR:
different line length in sequence 'scaffold_67'.
Segmentation fault
NOTE: I see NNNN in that scaffold . Does anyone have a suggestion?
smehr12 is offline   Reply With Quote
Old 11-25-2011, 09:46 AM   #18
ericpante
Junior Member
 
Location: La Rochelle, France

Join Date: Nov 2011
Posts: 1
Default

Hi everybody,

I am have similar problems with samtools 0.1.18. I would like to have reference characters listed in a pileup files, but I have problems with headers.

samtools faidx AGSbrut.fasta
samtools view -q 20 -buh -t AGSbrut.fasta.fai A.sam | samtools sort - A
samtools view -q 20 -buh -t AGSbrut.fasta.fai S.sam | samtools sort - S
samtools mpileup -B -f AGSbrut_index.fai A.bam S.bam > AS.mpileup

[fai_build_core] different line length in sequence 'null'.
Segmentation fault

I hypothesized that this 'null' sequence may be a blank line; so I looked for it manually and with sed, with no luck. I also looked for other potential problems based on what was previously reported (no extra spaces, characters, etc in reference sequence names in fai and sam files). I also tried to re-head the file, with no success:

samtools view -HS -t AGSbrut.fasta.fai A.sam > Aheader.sam
samtools reheader Aheader.sam A.bam > Aheaded.bam

[bam_header_read] EOF marker is absent. The input is probably truncated.

All insights are welcome!
thank you, eric
ericpante is offline   Reply With Quote
Old 12-08-2011, 11:43 AM   #19
adowney
Junior Member
 
Location: MD, USA

Join Date: Dec 2011
Posts: 1
Default

Quote:
Originally Posted by colindaven View Post
Here's another possible solution - the headers are not consistent between SAM/BAM and the original fasta:

Even though the reference file was the same one in both cases, sometimes aligners just write a substring out into the SAM file. Samtools seems to take the full header.

For example the first contiguous part of my genome header is
gi|110645304|ref|NC_002516.2|

However in my SAM file the aligner has only written
NC_002516.2

Samtools has written the full header to the .fa.fai index
gi|110645304|ref|NC_002516.2|

.. and this does not match.

Solution:

Try correcting the original header on the reference fasta to just the substring which the aligner uses.
eg
gi|110645304|ref|NC_002516.2|
to
NC_002516.2
The above suggestion fixed the problem when I got this error
adowney is offline   Reply With Quote
Old 01-22-2013, 07:41 AM   #20
jgibbons1
Senior Member
 
Location: Boston, MA

Join Date: Oct 2009
Posts: 125
Default

Hey folks,

Have been struggling to figure out why I am getting N's for my pileup reference sequence. I found hope when I discovered this string but I have followed all the suggestions to no avail. I've tried this with different versions of samtools, different data sets, different reference files and have simplified ID names, rebuilt the faidx index, etc. etc.

Still can't figure out what's going on here. Has anyone found any other solutions?

Thanks
jgibbons1 is offline   Reply With Quote
Reply

Tags
mosaik, pileup, reference, samtools

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:25 PM.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.