SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   base quality encoding changed after tophat2 mapping (http://seqanswers.com/forums/showthread.php?t=44873)

pengchy 07-10-2014 11:10 PM

base quality encoding changed after tophat2 mapping
 
fastq
Code:

@FCC22UBACXX:2:1101:1463:2233#ACACGCGG/1
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
___acdeegee[efghgbbhgfdehhhfffffhZa^fggbaefhhghhh

after tophat2 mapping, the fastq base quality was changed into:
Code:

FCC22UBACXX:2:1115:5744:8409#ACACGCGG  256    EQ110773        344    3      49M    *      0      0      ATAAAACCCGACAAAAGCTGTTCGGAAAGCTCTACGGGCTCGACCGGCA    CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHFDDD      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:MD:Z:49  YT:Z:UU NH:i:2  CC:Z:EQ123351  CP:i:3745      HI:i:0
also, if the input was solexa-quality, the base quality in the output bam will be 0-40, which will influence the downstream analysis.

Any idea?
Thank you.

blancha 07-11-2014 03:26 AM

Your argument that TopHat has changed the quality scores would be more convincing if you posted the same sequence before and after alignment with TopHat.
You've posted two different sequences, with obviously different quality scores.
Can you post the same sequence, before and after alignment with TopHat?

TopHat2 does have optional arguments to handle quality scores in the Solexa scale, incidentally.

--solexa-quals Use the Solexa scale for quality values in FASTQ files.
--solexa1.3-quals As of the Illumina GA pipeline version 1.3, quality scores are encoded in Phred-scaled base-64. Use this option for FASTQ files from pipeline 1.3 or later.

kmcarr 07-11-2014 04:20 AM

Quote:

Originally Posted by pengchy (Post 144672)
fastq
Code:

@FCC22UBACXX:2:1101:1463:2233#ACACGCGG/1
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
___acdeegee[efghgbbhgfdehhhfffffhZa^fggbaefhhghhh

after tophat2 mapping, the fastq base quality was changed into:
Code:

FCC22UBACXX:2:1115:5744:8409#ACACGCGG  256    EQ110773        344    3      49M    *      0      0      ATAAAACCCGACAAAAGCTGTTCGGAAAGCTCTACGGGCTCGACCGGCA    CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHFDDD      AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:MD:Z:49  YT:Z:UU NH:i:2  CC:Z:EQ123351  CP:i:3745      HI:i:0
also, if the input was solexa-quality, the base quality in the output bam will be 0-40, which will influence the downstream analysis.

Any idea?
Thank you.

Tophat has done nothing to change the encoding. Quality scores in BAM files are stored as numeric values, not ASCII characters like they are in FASTQ files. When you view the contents of a BAM file with samtools view it will convert those numbers to ASCII characters for display and it will always use the Sanger Phred+33 encoding for display. It is only important that when you are using FASTQ files as input you correctly identify what encoding method is used in that FASTQ, then the correct q-score numbers will be stored in the BAM and everything downstream will work fine.

blancha 07-11-2014 04:58 AM

Very informative answer kmcarr.
Much better than mine. :)

pengchy 07-11-2014 06:16 AM

Quote:

Originally Posted by kmcarr (Post 144695)
Tophat has done nothing to change the encoding. Quality scores in BAM files are stored as numeric values, not ASCII characters like they are in FASTQ files. When you view the contents of a BAM file with samtools view it will convert those numbers to ASCII characters for display and it will always use the Sanger Phred+33 encoding for display. It is only important that when you are using FASTQ files as input you correctly identify what encoding method is used in that FASTQ, then the correct q-score numbers will be stored in the BAM and everything downstream will work fine.

Thank kmcarr and blancha's reply.

More question:
When I use the fastq with the base quality like this:
Code:

@FCC22UBACXX:2:1101:1463:2233#ACACGCGG
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
@@@BDEFFHFF<FGHIHCCIHGEFIIIGGGGGI;B?GHHCBFGIIHIII

the output bam will be like this:
Code:

FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0      AF024514.1      14      50      49M    *      0      0      ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT    %$%(((((*****++++"(*+++*+++++++++++++++++++++++++      AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29    YT:Z:UU NH:i:1
FCC22UBACXX:2:1313:19247:88511#ACACGCGG 0      AF024514.1      15      50      49M    *      0      0      TATTGCTTCTATTTCGATTTTGTTCAAGCGTTGACCGTTGCAGGCGCTT    $$$(((((&((&(*))'%(**+&(*++*&)''')&)+++%()))'&()%      AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:18A30      YT:Z:UU NH:i:1

Is it normal? I have used this bam file to call snp, none snp has been found, while several sites can be manually detected. Is it caused by this base quality?

kmcarr 07-11-2014 12:02 PM

Quote:

Originally Posted by pengchy (Post 144707)
Thank kmcarr and blancha's reply.

More question:
When I use the fastq with the base quality like this:
Code:

@FCC22UBACXX:2:1101:1463:2233#ACACGCGG
TGGTCTTCTAAATATTGTCTGAGGGCTCCGTAAGCCTGTGTTTTAGCAC
+
@@@BDEFFHFF<FGHIHCCIHGEFIIIGGGGGI;B?GHHCBFGIIHIII

the output bam will be like this:
Code:

FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0      AF024514.1      14      50      49M    *      0      0      ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT    %$%(((((*****++++"(*+++*+++++++++++++++++++++++++      AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29    YT:Z:UU NH:i:1
FCC22UBACXX:2:1313:19247:88511#ACACGCGG 0      AF024514.1      15      50      49M    *      0      0      TATTGCTTCTATTTCGATTTTGTTCAAGCGTTGACCGTTGCAGGCGCTT    $$$(((((&((&(*))'%(**+&(*++*&)''')&)+++%()))'&()%      AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:MD:Z:18A30      YT:Z:UU NH:i:1

Is it normal? I have used this bam file to call snp, none snp has been found, while several sites can be manually detected. Is it caused by this base quality?

Pengchy,

Again, you are showing us different reads from the FastQ file and from the BAM file. To best understand what is going on you really need to show us the FastQ record and the BAM record for the SAME read(s).

Do you know what the quality encoding was in your input FastQ? You really need to have this information before you start your analysis.

What version of Tophat are you using?

When you ran Tophat did you use a command line parameter to tell Tophat what q-score encoding was used for the input FastQ or did you leave it as default?

pengchy 07-12-2014 06:13 AM

Quote:

Originally Posted by kmcarr (Post 144756)
Pengchy,

Again, you are showing us different reads from the FastQ file and from the BAM file. To best understand what is going on you really need to show us the FastQ record and the BAM record for the SAME read(s).

Do you know what the quality encoding was in your input FastQ? You really need to have this information before you start your analysis.

What version of Tophat are you using?

When you ran Tophat did you use a command line parameter to tell Tophat what q-score encoding was used for the input FastQ or did you leave it as default?

Hi kmcarr, thank you for the reply.

Actually, I have done my work in the steps:
1. mapped fastq of the reads with quality solexa 1.3, like:
Code:

@FCC22UBACXX:2:2112:11384:88500#ACACGCGG/1
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
a_aeeeeegggggiiiiWeghiighhhiiiihiiiiiiihhhiiihihi

the read was mapped by tophat2 with parameters --solexa1.3-quals

2. Then the output file "unmapped.bam" of tophat2 were converted to fastq format by bedtools-2.17.0/bin/bamToFastq, the output file of this read is:
Code:

@FCC22UBACXX:2:2112:11384:88500#ACACGCGG
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
B@BFFFFFHHHHHJJJJ8FHIJJHIIIJJJJIJJJJJJJIIIJJJIJIJ

the base quality characters of bamToFastq output is the same to the "samtools view".

3. The base quality is phred+33, the reads were mapped by tophat2 with the parameters:
Code:

/path/to/tophat2 --solexa-quals --library-type fr-unstranded -o /path/to/output/directory/ /genome/index /path/to/fastq.gz
The "samtools view " of the output bam file give the following base quality like
Code:

FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0      AF024514.1      14      50      49M    *      0      0      ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT    %$%(((((*****++++"(*+++*+++++++++++++++++++++++++      AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29    YT:Z:UU NH:i:1
Here the read from the original fastq and bam file is same. I have checked many reads manually, the output bam file all have the similar quality characters.

I don't know this phenomenon is caused by Tophat2 or "samtools view".

I have correctly set the --solexa-quals for tophat2.

We can have a check for the first character of these reads:
the original: "a" == 97
the first mapping result: "B"== 66
the second mapping result: "%" == 37

We can got that the first transformation has minus the ASCII by 31 and the second by 29.

I guess the "samtools view" always transform the input bam ASCII by minus some value.

kmcarr 07-12-2014 06:32 AM

Quote:

Originally Posted by pengchy (Post 144801)
Hi kmcarr, thank you for the reply.

Actually, I have done my work in the steps:
1. mapped fastq of the reads with quality solexa 1.3, like:
Code:

@FCC22UBACXX:2:2112:11384:88500#ACACGCGG/1
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
a_aeeeeegggggiiiiWeghiighhhiiiihiiiiiiihhhiiihihi

the read was mapped by tophat2 with parameters --solexa1.3-quals

2. Then the output file "unmapped.bam" of tophat2 were converted to fastq format by bedtools-2.17.0/bin/bamToFastq, the output file of this read is:
Code:

@FCC22UBACXX:2:2112:11384:88500#ACACGCGG
ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT
+
B@BFFFFFHHHHHJJJJ8FHIJJHIIIJJJJIJJJJJJJIIIJJJIJIJ

the base quality characters of bamToFastq output is the same to the "samtools view".

3. The base quality is phred+33, the reads were mapped by tophat2 with the parameters:
Code:

/path/to/tophat2 --solexa-quals --library-type fr-unstranded -o /path/to/output/directory/ /genome/index /path/to/fastq.gz
The "samtools view " of the output bam file give the following base quality like
Code:

FCC22UBACXX:2:2112:11384:88500#ACACGCGG 0      AF024514.1      14      50      49M    *      0      0      ATATTGCTTCTATTTCGGTTTTGTTCAAGCGTTGACCGTTGCAGGCGCT    %$%(((((*****++++"(*+++*+++++++++++++++++++++++++      AS:i:-4 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:MD:Z:17A1A29    YT:Z:UU NH:i:1
Here the read from the original fastq and bam file is same. I have checked many reads manually, the output bam file all have the similar quality characters.

I don't know this phenomenon is caused by Tophat2 or "samtools view".

I have correctly set the --solexa-quals for tophat2 (No, this is incorrect).

We can have a check for the first character of these reads:
the original: "a" == 97
the first mapping result: "B"== 66
the second mapping result: "%" == 37

We can got that the first transformation has minus the ASCII by 31 and the second by 29.

I guess the "samtools view" always transform the input bam ASCII by minus some value.

So if I am following your logic you:

1. Map your original reads to some reference with Tophat2.

2. Extract the unmapped reads from unmapped.bam to a new FastQ

3. Map the unmapped reads to some reference with Tophat2. I am going to assume that this mapping is to some different reference since trying to align unmapped reads back to the same reference wouldn't make sense.

I have highlighted the problem in red above. You correctly state "The base quality is phred+33", but then you incorrectly set "--solexa-quals" in your tophat command, which is telling tophat that your FastQ file is phred+64. Drop "--solexa-quals" from your second tophat alignment (step 3). Phred+33 is the default for tophat2 so there is no command line option used to specify it.

pengchy 07-12-2014 06:55 AM

Quote:

Originally Posted by kmcarr (Post 144802)
So if I am following your logic you:

1. Map your original reads to some reference with Tophat2.

2. Extract the unmapped reads from unmapped.bam to a new FastQ

3. Map the unmapped reads to some reference with Tophat2. I am going to assume that this mapping is to some different reference since trying to align unmapped reads back to the same reference wouldn't make sense.

I have highlighted the problem in red above. You correctly state "The base quality is phred+33", but then you incorrectly set "--solexa-quals" in your tophat command, which is telling tophat that your FastQ file is phred+64. Drop "--solexa-quals" from your second tophat alignment (step 3). Phred+33 is the default for tophat2 so there is no command line option used to specify it.

thank you kmcarr, as you found, I have wrongly set the quality parameter for tophat2.

Your suggestion is very useful and help me a lot.

Best,
Pengcheng


All times are GMT -8. The time now is 11:58 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.