SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
FASTQC guessing wrong quality encoding PFS Bioinformatics 14 05-21-2014 07:41 AM
base quality encoding changed after "bwa samse" command tomjan Bioinformatics 4 02-26-2013 12:23 PM
mapping quality tophat2 - always 255 vbernard Bioinformatics 1 07-03-2012 01:51 AM
Different MAPPING QUALITY/PER-BASE QUALITY SCORE m_elena_bioinfo Bioinformatics 2 09-02-2010 09:00 AM
dual base encoding and de novo assembly pmiguel SOLiD 11 12-23-2009 05:46 AM

Reply
 
Thread Tools
Old 07-10-2014, 11:10 PM   #1
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default base quality encoding changed after tophat2 mapping

fastq
Code:
@FCC22UBACXX:2:1101:1463:2233#ACACGCGG/1
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
___acdeegee[efghgbbhgfdehhhfffffhZa^fggbaefhhghhh
after tophat2 mapping, the fastq base quality was changed into:
Code:
FCC22UBACXX:2:1115:5744:8409#ACACGCGG   256     EQ110773        344     3       49M     *       0       0       ATAAAACCCGACAAAAGCTGTTCGGAAAGCTCTACGGGCTCGACCGGCA     CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHFDDD       AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:MD:Z:49  YT:Z:UU NH:i:2  CC:Z:EQ123351   CP:i:3745       HI:i:0
also, if the input was solexa-quality, the base quality in the output bam will be 0-40, which will influence the downstream analysis.

Any idea?
Thank you.
pengchy is offline   Reply With Quote
Old 07-11-2014, 03:26 AM   #2
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Your argument that TopHat has changed the quality scores would be more convincing if you posted the same sequence before and after alignment with TopHat.
You've posted two different sequences, with obviously different quality scores.
Can you post the same sequence, before and after alignment with TopHat?

TopHat2 does have optional arguments to handle quality scores in the Solexa scale, incidentally.

--solexa-quals Use the Solexa scale for quality values in FASTQ files.
--solexa1.3-quals As of the Illumina GA pipeline version 1.3, quality scores are encoded in Phred-scaled base-64. Use this option for FASTQ files from pipeline 1.3 or later.
blancha is offline   Reply With Quote
Old 07-11-2014, 04:20 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,168
Default

Quote:
Originally Posted by pengchy View Post
fastq
Code:
@FCC22UBACXX:2:1101:1463:2233#ACACGCGG/1
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
___acdeegee[efghgbbhgfdehhhfffffhZa^fggbaefhhghhh
after tophat2 mapping, the fastq base quality was changed into:
Code:
FCC22UBACXX:2:1115:5744:8409#ACACGCGG   256     EQ110773        344     3       49M     *       0       0       ATAAAACCCGACAAAAGCTGTTCGGAAAGCTCTACGGGCTCGACCGGCA     CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHFDDD       AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:MD:Z:49  YT:Z:UU NH:i:2  CC:Z:EQ123351   CP:i:3745       HI:i:0
also, if the input was solexa-quality, the base quality in the output bam will be 0-40, which will influence the downstream analysis.

Any idea?
Thank you.
Tophat has done nothing to change the encoding. Quality scores in BAM files are stored as numeric values, not ASCII characters like they are in FASTQ files. When you view the contents of a BAM file with samtools view it will convert those numbers to ASCII characters for display and it will always use the Sanger Phred+33 encoding for display. It is only important that when you are using FASTQ files as input you correctly identify what encoding method is used in that FASTQ, then the correct q-score numbers will be stored in the BAM and everything downstream will work fine.
kmcarr is offline   Reply With Quote
Old 07-11-2014, 04:58 AM   #4
blancha
Senior Member
 
Location: Montreal

Join Date: May 2013
Posts: 367
Default

Very informative answer kmcarr.
Much better than mine.
blancha is offline   Reply With Quote
Old 07-11-2014, 06:16 AM   #5
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

Quote:
Originally Posted by kmcarr View Post
Tophat has done nothing to change the encoding. Quality scores in BAM files are stored as numeric values, not ASCII characters like they are in FASTQ files. When you view the contents of a BAM file with samtools view it will convert those numbers to ASCII characters for display and it will always use the Sanger Phred+33 encoding for display. It is only important that when you are using FASTQ files as input you correctly identify what encoding method is used in that FASTQ, then the correct q-score numbers will be stored in the BAM and everything downstream will work fine.
Thank kmcarr and blancha's reply.

More question:
When I use the fastq with the base quality like this:
Code:
@FCC22UBACXX:2:1101:1463:2233#ACACGCGG
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
@@@BDEFFHFF<FGHIHCCIHGEFIIIGGGGGI;B?GHHCBFGIIHIII
the output bam will be like this:
Code:
FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
FCC22UBACXX:2:1313:19247:88511#ACACGCGG 0       AF024514.1      15      50      49M     *       0       0       TATTGCTTCTATTTCGATTTTGTTCAAGCGTTGACCGTTGCAGGCGCTT     $$$(((((&((&(*))'%(**+&(*++*&)''')&)+++%()))'&()%       AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:18A30       YT:Z:UU NH:i:1
Is it normal? I have used this bam file to call snp, none snp has been found, while several sites can be manually detected. Is it caused by this base quality?
pengchy is offline   Reply With Quote
Old 07-11-2014, 12:02 PM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,168
Default

Quote:
Originally Posted by pengchy View Post
Thank kmcarr and blancha's reply.

More question:
When I use the fastq with the base quality like this:
Code:
@FCC22UBACXX:2:1101:1463:2233#ACACGCGG
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
@@@BDEFFHFF<FGHIHCCIHGEFIIIGGGGGI;B?GHHCBFGIIHIII
the output bam will be like this:
Code:
FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
FCC22UBACXX:2:1313:19247:88511#ACACGCGG 0       AF024514.1      15      50      49M     *       0       0       TATTGCTTCTATTTCGATTTTGTTCAAGCGTTGACCGTTGCAGGCGCTT     $$$(((((&((&(*))'%(**+&(*++*&)''')&)+++%()))'&()%       AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:18A30       YT:Z:UU NH:i:1
Is it normal? I have used this bam file to call snp, none snp has been found, while several sites can be manually detected. Is it caused by this base quality?
Pengchy,

Again, you are showing us different reads from the FastQ file and from the BAM file. To best understand what is going on you really need to show us the FastQ record and the BAM record for the SAME read(s).

Do you know what the quality encoding was in your input FastQ? You really need to have this information before you start your analysis.

What version of Tophat are you using?

When you ran Tophat did you use a command line parameter to tell Tophat what q-score encoding was used for the input FastQ or did you leave it as default?
kmcarr is offline   Reply With Quote
Old 07-12-2014, 06:13 AM   #7
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

Quote:
Originally Posted by kmcarr View Post
Pengchy,

Again, you are showing us different reads from the FastQ file and from the BAM file. To best understand what is going on you really need to show us the FastQ record and the BAM record for the SAME read(s).

Do you know what the quality encoding was in your input FastQ? You really need to have this information before you start your analysis.

What version of Tophat are you using?

When you ran Tophat did you use a command line parameter to tell Tophat what q-score encoding was used for the input FastQ or did you leave it as default?
Hi kmcarr, thank you for the reply.

Actually, I have done my work in the steps:
1. mapped fastq of the reads with quality solexa 1.3, like:
Code:
@FCC22UBACXX:2:2112:11384:88500#ACACGCGG/1
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
a_aeeeeegggggiiiiWeghiighhhiiiihiiiiiiihhhiiihihi
the read was mapped by tophat2 with parameters --solexa1.3-quals

2. Then the output file "unmapped.bam" of tophat2 were converted to fastq format by bedtools-2.17.0/bin/bamToFastq, the output file of this read is:
Code:
@FCC22UBACXX:2:2112:11384:88500#ACACGCGG
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
B@BFFFFFHHHHHJJJJ8FHIJJHIIIJJJJIJJJJJJJIIIJJJIJIJ
the base quality characters of bamToFastq output is the same to the "samtools view".

3. The base quality is phred+33, the reads were mapped by tophat2 with the parameters:
Code:
/path/to/tophat2 --solexa-quals --library-type fr-unstranded -o /path/to/output/directory/ /genome/index /path/to/fastq.gz
The "samtools view " of the output bam file give the following base quality like
Code:
FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
Here the read from the original fastq and bam file is same. I have checked many reads manually, the output bam file all have the similar quality characters.

I don't know this phenomenon is caused by Tophat2 or "samtools view".

I have correctly set the --solexa-quals for tophat2.

We can have a check for the first character of these reads:
the original: "a" == 97
the first mapping result: "B"== 66
the second mapping result: "%" == 37

We can got that the first transformation has minus the ASCII by 31 and the second by 29.

I guess the "samtools view" always transform the input bam ASCII by minus some value.
pengchy is offline   Reply With Quote
Old 07-12-2014, 06:32 AM   #8
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,168
Default

Quote:
Originally Posted by pengchy View Post
Hi kmcarr, thank you for the reply.

Actually, I have done my work in the steps:
1. mapped fastq of the reads with quality solexa 1.3, like:
Code:
@FCC22UBACXX:2:2112:11384:88500#ACACGCGG/1
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
a_aeeeeegggggiiiiWeghiighhhiiiihiiiiiiihhhiiihihi
the read was mapped by tophat2 with parameters --solexa1.3-quals

2. Then the output file "unmapped.bam" of tophat2 were converted to fastq format by bedtools-2.17.0/bin/bamToFastq, the output file of this read is:
Code:
@FCC22UBACXX:2:2112:11384:88500#ACACGCGG
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
B@BFFFFFHHHHHJJJJ8FHIJJHIIIJJJJIJJJJJJJIIIJJJIJIJ
the base quality characters of bamToFastq output is the same to the "samtools view".

3. The base quality is phred+33, the reads were mapped by tophat2 with the parameters:
Code:
/path/to/tophat2 --solexa-quals --library-type fr-unstranded -o /path/to/output/directory/ /genome/index /path/to/fastq.gz
The "samtools view " of the output bam file give the following base quality like
Code:
FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0       AF024514.1      14      50      49M     *       0       0       ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT     %$%(((((*****++++"(*+++*+++++++++++++++++++++++++       AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29     YT:Z:UU NH:i:1
Here the read from the original fastq and bam file is same. I have checked many reads manually, the output bam file all have the similar quality characters.

I don't know this phenomenon is caused by Tophat2 or "samtools view".

I have correctly set the --solexa-quals for tophat2 (No, this is incorrect).

We can have a check for the first character of these reads:
the original: "a" == 97
the first mapping result: "B"== 66
the second mapping result: "%" == 37

We can got that the first transformation has minus the ASCII by 31 and the second by 29.

I guess the "samtools view" always transform the input bam ASCII by minus some value.
So if I am following your logic you:

1. Map your original reads to some reference with Tophat2.

2. Extract the unmapped reads from unmapped.bam to a new FastQ

3. Map the unmapped reads to some reference with Tophat2. I am going to assume that this mapping is to some different reference since trying to align unmapped reads back to the same reference wouldn't make sense.

I have highlighted the problem in red above. You correctly state "The base quality is phred+33", but then you incorrectly set "--solexa-quals" in your tophat command, which is telling tophat that your FastQ file is phred+64. Drop "--solexa-quals" from your second tophat alignment (step 3). Phred+33 is the default for tophat2 so there is no command line option used to specify it.
kmcarr is offline   Reply With Quote
Old 07-12-2014, 06:55 AM   #9
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

Quote:
Originally Posted by kmcarr View Post
So if I am following your logic you:

1. Map your original reads to some reference with Tophat2.

2. Extract the unmapped reads from unmapped.bam to a new FastQ

3. Map the unmapped reads to some reference with Tophat2. I am going to assume that this mapping is to some different reference since trying to align unmapped reads back to the same reference wouldn't make sense.

I have highlighted the problem in red above. You correctly state "The base quality is phred+33", but then you incorrectly set "--solexa-quals" in your tophat command, which is telling tophat that your FastQ file is phred+64. Drop "--solexa-quals" from your second tophat alignment (step 3). Phred+33 is the default for tophat2 so there is no command line option used to specify it.
thank you kmcarr, as you found, I have wrongly set the quality parameter for tophat2.

Your suggestion is very useful and help me a lot.

Best,
Pengcheng
pengchy is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:26 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO