SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
The insert-size in paired-end data louis7781x Illumina/Solexa 15 02-06-2017 08:42 AM
BWA insert size calculation in paired end sequencing dg13 Bioinformatics 3 07-30-2013 09:17 PM
Tophat insert size of paired-end reads ozs2006 Bioinformatics 15 07-30-2013 06:40 PM
Average insert size from paired end data Petrichor Bioinformatics 4 04-29-2013 01:13 AM
Regarding paired -end assembly insert-size calculation on newbler2.3 ganga.jeena Bioinformatics 2 03-01-2011 05:34 AM

Reply
 
Thread Tools
Old 02-24-2015, 04:17 AM   #1
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default Problem with the insert size of RNA-seq paired end reads!

The problem is that I cannot get a huge part of my RNA-seq data aligned!
These are some samples from the liver, intestine and colon of mice. We are analyzing the expression level of their genes. The data has been sequenced in paired end by NextSeq 500 (2*75). And you can find the BioAnalyzer plot for the reads before sequencing in the attachment.

If you look at the trace the fragment sizes are distributed between approx. 190-650 bp with an average of somewhere between 300-350 bps. 130pbs of those belong to upper and lower primers, which won't be sequenced. So with 2*75 sequencing length we might have overlap in some cases (I looked in one sample and in that sample 30% of the fragments should have overlapped reads) and inner distance in most of reads. Please see the summary of that in the following table:

Quote:
Initial read length Primer length Actual read length Sequenced length 2*75 Overlap length Inner distance
190 130 60 150 90 0
300 130 170 150 0 20
350 130 220 150 0 70
650 130 520 150 0 370

Now the problem is the standard deviation and median length options while you want to align data by TopHat2. To solve the problem I thought I could align data first by bowtie2 to get some idea about the inner distance. But unfortunately I cannot get about 50% of my reads aligned.


Here is the command I use for that:

Code:
bowtie2 -q --phred33 -D 20 -R 3 -N 1 -L 20 -i S,1,0.50 --n-ceil L,0,0.15 --end-to-end --score-min L,-0.6,-0.6 -I 45 -X 900 -t --met-file bowtie_align_metrix.txt --met-stderr bowtie_stderr.txt --no-unal --al  ~/al --un ~/un --un-conc ~/un_conc --al-conc ~/all_conc -p 8 --non-deterministic -x ~/ref-files/mm37 -1 R1.paired.fq -2 R1.unpaired.fq -S result.sam >& bowtie_log_file

And here is the statistical result gotten from Bowtie2 aligning:

Code:
16852758 reads; of these:
  16852758 (100.00%) were paired; of these:
    8254156 (48.98%) aligned concordantly 0 times
    6476798 (38.43%) aligned concordantly exactly 1 time
    2121804 (12.59%) aligned concordantly >1 times
    ----
    8254156 pairs aligned concordantly 0 times; of these:
      974394 (11.80%) aligned discordantly 1 time
    ----
    7279762 pairs aligned 0 times concordantly or discordantly; of these:
      14559524 mates make up the pairs; of these:
        10751027 (73.84%) aligned 0 times
        3072453 (21.10%) aligned exactly 1 time
        736044 (5.06%) aligned >1 times
68.10% overall alignment rate
Time searching: 05:33:24
Overall time: 05:33:24
Why this (8254156 (48.98%) aligned concordantly 0 times) happends???????
Attached Images
File Type: png RNAseq_bioanalyzer.png (60.9 KB, 14 views)

Last edited by rozitaa; 02-24-2015 at 04:21 AM.
rozitaa is offline   Reply With Quote
Old 02-24-2015, 04:30 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Why are you using bowtie2? Are you aligning to the transcriptome?

In my mouse liver RNAseq datasets (aligned with STAR to the genome), I've gotten ~80% of reads uniquely aligned and another ~19% multimapped (so ~1% unaligned). Granted, theses are single-end, but you wouldn't expect paired-end reads to differ that much.
dpryan is offline   Reply With Quote
Old 02-24-2015, 04:37 AM   #3
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default

Quote:
Originally Posted by dpryan View Post
Why are you using bowtie2? Are you aligning to the transcriptome?

In my mouse liver RNAseq datasets (aligned with STAR to the genome), I've gotten ~80% of reads uniquely aligned and another ~19% multimapped (so ~1% unaligned). Granted, theses are single-end, but you wouldn't expect paired-end reads to differ that much.
No I am aligning to the genome reference! Because later on I want to align it by TopHat. So this is only a way to get idea about inner distance. But I can try star also.
I have had several experiences with single-end or paired-end with overlap alignment by bowtie2 and always got good results! but this time I doubt it might be the inner size that I cannot get them aligned.
rozitaa is offline   Reply With Quote
Old 02-24-2015, 04:46 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Honestly, I wouldn't worry too much about the insert size settings, it doesn't make much difference as far as I've seen. In any case, you'll get much faster results with STAR (and the results are just as reliable).
dpryan is offline   Reply With Quote
Old 02-24-2015, 04:49 AM   #5
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default

Now I was also checking my sam file and interestingly some of my sam alignment lines don't have all sam fields! like this one:

Quote:
@NS500175:21:H2T5HBGXX:1:11101:24788:1114 1:N:0:CTGAAGCT+NGGATAGG%0ACTCCAGTATAAACTACTTTCCATATTCATTGTAAATCACAATGGTTTCCCACAGGCACAAAACAAAGCACAGAAAT%0A+%0AA)AAAFFFFFAAFFFFFFFFFFFAFFFFFFFFFFAFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFA%0A
Do you have any idea about this?
rozitaa is offline   Reply With Quote
Old 02-24-2015, 04:52 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Which version of bowtie2 are you using? It really should be clipping off the extraneous information from the read name (1:N:0...). I wonder if not doing so is leading to a buffer overflow.
dpryan is offline   Reply With Quote
Old 02-24-2015, 04:59 AM   #7
diego diaz
Member
 
Location: Santiago, Chile

Join Date: Oct 2013
Posts: 62
Default

Did you check FastQC plots? Maybe there was an error during sequencing, and some kmers were enriched at the ends of the reads.

I dealt with this some time ago. I had a weird pattern of kmers at the 3' end of my reads, in all my samples, and about 50% of the reads were discarded during the alignment. I had to remove these kmers to improve the mapping percentage (luckily, my reads were long enough to perform a hard trimming).

Other cause is maybe that your DNA was contaminated. You should try to align your reads to some common source of contamination, like human or mycoplasma.
diego diaz is offline   Reply With Quote
Old 02-24-2015, 05:14 AM   #8
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default

Quote:
Originally Posted by dpryan View Post
Which version of bowtie2 are you using? It really should be clipping off the extraneous information from the read name (1:N:0...). I wonder if not doing so is leading to a buffer overflow.
I am using version 2.0.6!!!
rozitaa is offline   Reply With Quote
Old 02-24-2015, 05:26 AM   #9
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default

Quote:
Originally Posted by diego diaz View Post
Did you check FastQC plots? Maybe there was an error during sequencing, and some kmers were enriched at the ends of the reads.

I dealt with this some time ago. I had a weird pattern of kmers at the 3' end of my reads, in all my samples, and about 50% of the reads were discarded during the alignment. I had to remove these kmers to improve the mapping percentage (luckily, my reads were long enough to perform a hard trimming).

Other cause is maybe that your DNA was contaminated. You should try to align your reads to some common source of contamination, like human or mycoplasma.
I attached plots and statistical tables for two different samples one from read one end and read two for the other sample! I don't have any idea what is the cutoff kmer enrichment. Can you please explain this more?
Attached Images
File Type: png sam1_R2.plot.png (82.8 KB, 10 views)
File Type: png sam1_R2.table.png (52.9 KB, 5 views)
File Type: png sam2_R1.plot.png (81.1 KB, 8 views)
File Type: png sam2_R1.table.png (40.3 KB, 0 views)
rozitaa is offline   Reply With Quote
Old 02-24-2015, 05:51 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by rozitaa View Post
I am using version 2.0.6!!!
That is a pretty old version (current as of January 2013).

Current is 2.2.4.
GenoMax is offline   Reply With Quote
Old 02-24-2015, 06:07 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Have you seen this (post #18 by Brian has some interesting data) http://seqanswers.com/forums/showpost.php?p=156399. It may be worth checking your own data.

Hopefully you had adapter trimmed your data before doing the alignments.
GenoMax is offline   Reply With Quote
Old 02-24-2015, 06:16 AM   #12
diego diaz
Member
 
Location: Santiago, Chile

Join Date: Oct 2013
Posts: 62
Default

kmer is a substring of length k present in a sequence (In this case, DNA sequence)

For example,

ATTACGAGCGATCGCGCG

If we consider kmers of length 5, then from left to the right we have:

ATTAC, TTACG, TACGA, and so on.

In Bioinformatics, is a common task get the frequency of all possible kmers of a given sequence.

During the sequencing protocol, DNA is randomly fragmented (theoretically), then all kmers should have similar frequency (although in reality it is not always the case). If you see some kmers enriched in your reads, this means possibly that you have a bias. For example, when the sequencing adapter is not removed from the reads, the kmers in the adapter will be overrepresented, because sequencing adapter is the same for all reads.

In the plots that you attached I can see a kmer enrichment at the end of reads. At the 5' end is maybe due random priming, and at the 3' end maybe some remains of sequencing adapter, I don't know.

The tables shows a observed/expected rate, if the value is greater than 1, it means that the observed frequency is greater than the expected.

Hope that helps!
diego diaz is offline   Reply With Quote
Old 02-24-2015, 06:26 AM   #13
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,317
Default

Quote:
Originally Posted by rozitaa View Post
No I am aligning to the genome reference! Because later on I want to align it by TopHat. So this is only a way to get idea about inner distance. But I can try star also.
I have had several experiences with single-end or paired-end with overlap alignment by bowtie2 and always got good results! but this time I doubt it might be the inner size that I cannot get them aligned.
Mammals have teeny little exons spread out over 10's-100's of kilobases of the the genome. Mapping RNA (which has the introns spliced out) reads to the genome isn't a good way to determine insert size. And only getting 50% of the reads to map "concordantly" doesn't seem so bad. How is bowtie2 going to handle reads spanning a splice site?

If you want to determine your insert sizes, try aligning your reads to a long (spliced) transcript instead of genomic DNA. In my experience with the MiSeq and HiSeq, your sizes will look like all the very shortest library products were sequenced preferentially.

--
Phillip
pmiguel is offline   Reply With Quote
Old 02-24-2015, 06:30 AM   #14
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default

Quote:
Originally Posted by GenoMax View Post
Have you seen this (post #18 by Brian has some interesting data) http://seqanswers.com/forums/showpost.php?p=156399. It may be worth checking your own data.

Hopefully you had adapter trimmed your data before doing the alignments.
Well it should be trimmed!!
rozitaa is offline   Reply With Quote
Old 02-24-2015, 06:32 AM   #15
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default

Quote:
Originally Posted by diego diaz View Post
kmer is a substring of length k present in a sequence (In this case, DNA sequence)

For example,

ATTACGAGCGATCGCGCG

If we consider kmers of length 5, then from left to the right we have:

ATTAC, TTACG, TACGA, and so on.

In Bioinformatics, is a common task get the frequency of all possible kmers of a given sequence.

During the sequencing protocol, DNA is randomly fragmented (theoretically), then all kmers should have similar frequency (although in reality it is not always the case). If you see some kmers enriched in your reads, this means possibly that you have a bias. For example, when the sequencing adapter is not removed from the reads, the kmers in the adapter will be overrepresented, because sequencing adapter is the same for all reads.

In the plots that you attached I can see a kmer enrichment at the end of reads. At the 5' end is maybe due random priming, and at the 3' end maybe some remains of sequencing adapter, I don't know.

The tables shows a observed/expected rate, if the value is greater than 1, it means that the observed frequency is greater than the expected.

Hope that helps!
Thanks for nice explanation!
rozitaa is offline   Reply With Quote
Old 02-24-2015, 06:34 AM   #16
sarvidsson
Senior Member
 
Location: Berlin, Germany

Join Date: Jan 2015
Posts: 137
Default

The k-mers in the ends of the reads could be short adapter remnants not identified by the adapter clipper. Typically they get soft-clipped by the aligner, but you should get rid of them if you use BBDuk in trimbyoverlap mode.
sarvidsson is offline   Reply With Quote
Old 02-24-2015, 06:35 AM   #17
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default

Quote:
Originally Posted by pmiguel View Post
Mammals have teeny little exons spread out over 10's-100's of kilobases of the the genome. Mapping RNA (which has the introns spliced out) reads to the genome isn't a good way to determine insert size. And only getting 50% of the reads to map "concordantly" doesn't seem so bad. How is bowtie2 going to handle reads spanning a splice site?

If you want to determine your insert sizes, try aligning your reads to a long (spliced) transcript instead of genomic DNA. In my experience with the MiSeq and HiSeq, your sizes will look like all the very shortest library products were sequenced preferentially.

--
Phillip
Thanks Phillip! I think it should be like that since now that I am aligning only read one I can get approximately the whole aligned!!! Sure I will try to align it against a transcript reference.
rozitaa is offline   Reply With Quote
Old 02-24-2015, 09:15 AM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I think the problem lies in the bowtie2 command:

Quote:
-1 R1.paired.fq -2 R1.unpaired.fq
Based on the filenames, it appears that you are telling it to treat read1's as pairs with each other, using the output of something like Trimmomatic that produced 4 output files. You should be doing something like this:

Quote:
-1 R1.paired.fq -2 R2.paired.fq
Although, I'm not really sure why ANY of the pairs were concordant.
Brian Bushnell is offline   Reply With Quote
Old 02-24-2015, 09:44 AM   #19
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by Brian Bushnell View Post
I think the problem lies in the bowtie2 command
Good catch! I share your wonderment that any were concordant. I'm also surprised it didn't throw an error that the files were of different lengths.
dpryan is offline   Reply With Quote
Old 02-26-2015, 12:19 AM   #20
rozitaa
Member
 
Location: Sweden

Join Date: Jun 2013
Posts: 51
Default

Quote:
Originally Posted by Brian Bushnell View Post
I think the problem lies in the bowtie2 command:



Based on the filenames, it appears that you are telling it to treat read1's as pairs with each other, using the output of something like Trimmomatic that produced 4 output files. You should be doing something like this:



Although, I'm not really sure why ANY of the pairs were concordant.
Oh no this is not the problem sorry for confusion. This is my fault in naming the files incorrectly in previous step! The R1_unpaired.fq is supposed to be named R2_paired.fq!!!
rozitaa is offline   Reply With Quote
Reply

Tags
align unmapped mate, insert size, rna bioanalyzer, rna-seq, rna-seq aligners

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:27 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO