**dpryan** · 02-24-2015, 05:30 AM

Why are you using bowtie2? Are you aligning to the transcriptome?

In my mouse liver RNAseq datasets (aligned with STAR to the genome), I've gotten ~80% of reads uniquely aligned and another ~19% multimapped (so ~1% unaligned). Granted, these are single-end, but you wouldn't expect paired-end reads to differ that much.

**rozitaa** · 02-24-2015, 05:37 AM

> Originally posted by dpryan:
>
> Why are you using bowtie2? Are you aligning to the transcriptome?
>
> In my mouse liver RNAseq datasets (aligned with STAR to the genome), I've gotten ~80% of reads uniquely aligned and another ~19% multimapped (so ~1% unaligned). Granted, these are single-end, but you wouldn't expect paired-end reads to differ that much.

No, I am aligning to the genome reference, because later on I want to align with TopHat. This is only a way to get an idea of the inner distance. But I can try STAR too.
I have aligned single-end reads and overlapping paired-end reads with bowtie2 several times and always got good results, but this time I suspect the inner size may be the reason I cannot get them aligned.

**dpryan** · 02-24-2015, 05:46 AM

Honestly, I wouldn't worry too much about the insert size settings; they don't make much difference as far as I've seen. In any case, you'll get much faster results with STAR (and the results are just as reliable).

**rozitaa** · 02-24-2015, 05:49 AM

I was also just checking my SAM file and, interestingly, some of the alignment lines don't have all the SAM fields, like this one:

@NS500175:21:H2T5HBGXX:1:11101:24788:1114 1:N:0:CTGAAGCT+NGGATAGG
ACTCCAGTATAAACTACTTTCCATATTCATTGTAAATCACAATGGTTTCCCACAGGCACAAAACAAAGCACAGAAAT
+
A)AAAFFFFFAAFFFFFFFFFFFAFFFFFFFFFFAFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFA

Do you have any idea about this?

**dpryan** · 02-24-2015, 05:52 AM

Which version of bowtie2 are you using? It really should be clipping off the extraneous information from the read name (1:N:0...). I wonder if not doing so is leading to a buffer overflow.
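For anyone who wants to check their own file: the SAM specification requires 11 mandatory tab-separated fields on every alignment record, and the read name is normally the FASTQ header up to the first whitespace, without the leading `@`. Below is a minimal Python sketch of both checks. The function names are made up for illustration, and this says nothing about what bowtie2 does internally.

```python
import re

# SAM header lines start with one of these record types followed by a tab.
HEADER_RE = re.compile(r"^@(HD|SQ|RG|PG|CO)\t")
MANDATORY_FIELDS = 11  # QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL


def find_malformed_sam_lines(lines):
    """Return (line_number, line) for alignment records with fewer than 11 fields."""
    malformed = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or HEADER_RE.match(line):
            continue  # skip blank lines and genuine header lines
        if len(line.split("\t")) < MANDATORY_FIELDS:
            malformed.append((number, line))
    return malformed


def fastq_read_name(header_line):
    """Read name as it should appear in QNAME: no '@', nothing after the first whitespace."""
    return header_line[1:].split(None, 1)[0]
```

Running `find_malformed_sam_lines` over an open file handle lists the suspicious records, so you can see whether they are rare or systematic before and after upgrading the aligner.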

**diego diaz** · 02-24-2015, 05:59 AM

Did you check FastQC plots? Maybe there was an error during sequencing, and some kmers were enriched at the ends of the reads.

I dealt with this some time ago. I had a weird pattern of kmers at the 3' end of my reads, in all my samples, and about 50% of the reads were discarded during the alignment. I had to remove these kmers to improve the mapping percentage (luckily, my reads were long enough to perform a hard trimming).

Another possible cause is that your DNA was contaminated. You should try aligning your reads to some common sources of contamination, like human or mycoplasma.

**rozitaa** · 02-24-2015, 06:14 AM

> Originally posted by dpryan:
>
> Which version of bowtie2 are you using? It really should be clipping off the extraneous information from the read name (1:N:0...). I wonder if not doing so is leading to a buffer overflow.

I am using version 2.0.6!!!

**rozitaa** · 02-24-2015, 06:26 AM

> Originally posted by diego diaz:
>
> Did you check FastQC plots? Maybe there was an error during sequencing, and some kmers were enriched at the ends of the reads.
>
> I dealt with this some time ago. I had a weird pattern of kmers at the 3' end of my reads, in all my samples, and about 50% of the reads were discarded during the alignment. I had to remove these kmers to improve the mapping percentage (luckily, my reads were long enough to perform a hard trimming).
>
> Another possible cause is that your DNA was contaminated. You should try aligning your reads to some common sources of contamination, like human or mycoplasma.

I attached plots and statistics tables for two different samples: read 1 from one sample and read 2 from the other. I don't know what the cutoff for kmer enrichment is. Can you please explain this in more detail?

**GenoMax** · 02-24-2015, 06:51 AM

> Originally posted by rozitaa:
>
> I am using version 2.0.6!!!

That is a pretty old version (current as of January 2013).

The current version is 2.2.4.

**GenoMax** · 02-24-2015, 07:07 AM

Have you seen this thread? Post #18 by Brian has some interesting data: http://seqanswers.com/forums/showpost.php?p=156399. It may be worth checking your own data.

Hopefully you adapter-trimmed your data before doing the alignments.

**diego diaz** · 02-24-2015, 07:16 AM

A kmer is a substring of length k present in a sequence (in this case, a DNA sequence).

For example,

ATTACGAGCGATCGCGCG

If we consider kmers of length 5, then from left to right we have:

ATTAC, TTACG, TACGA, and so on.

In bioinformatics, it is a common task to count the frequency of all possible kmers in a given sequence.
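As a rough illustration of that counting task, here is a minimal Python sketch. The function names are my own, and a real tool would stream reads from a FASTQ file rather than take a list.

```python
from collections import Counter


def kmers(sequence, k):
    """All substrings of length k, from left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequences, k):
    """Count every kmer of length k across a collection of sequences."""
    counts = Counter()
    for sequence in sequences:
        counts.update(kmers(sequence, k))
    return counts


# The example sequence from above: the first 5-mers are ATTAC, TTACG, TACGA.
print(kmers("ATTACGAGCGATCGCGCG", 5)[:3])
```

A sequence of length n contains n - k + 1 kmers of length k, so the 18-base example yields 14 overlapping 5-mers.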

During the sequencing protocol, DNA is (in theory) fragmented randomly, so all kmers should have similar frequencies (although in reality this is not always the case). If you see some kmers enriched in your reads, you possibly have a bias. For example, when the sequencing adapter is not removed from the reads, the kmers in the adapter will be overrepresented, because the sequencing adapter is the same for all reads.

In the plots that you attached I can see kmer enrichment at the ends of the reads. At the 5' end it may be due to random priming, and at the 3' end it may be remnants of the sequencing adapter; I don't know for sure.

The tables show an observed/expected ratio; if the value is greater than 1, the observed frequency is greater than the expected one.
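To make the observed/expected idea concrete, here is a small Python sketch. It assumes the expected count comes from treating bases as independent draws with the overall base frequencies of the reads. That is a simplification chosen for illustration, not necessarily the exact calculation FastQC performs, and the function name is hypothetical.

```python
from collections import Counter
from math import prod


def observed_expected(reads, k):
    """Map each observed kmer to its observed/expected ratio.

    Assumption: expected count = (number of kmer windows) x (product of the
    overall frequencies of the kmer's bases).
    """
    base_counts = Counter()
    kmer_counts = Counter()
    for read in reads:
        base_counts.update(read)
        kmer_counts.update(read[i:i + k] for i in range(len(read) - k + 1))

    total_bases = sum(base_counts.values())
    total_kmers = sum(kmer_counts.values())

    ratios = {}
    for kmer, observed in kmer_counts.items():
        probability = prod(base_counts[base] / total_bases for base in kmer)
        ratios[kmer] = observed / (total_kmers * probability)
    return ratios
```

Under this model, a kmer that sits in leftover adapter sequence shared by many reads would tend to score well above 1, while kmers from randomly fragmented sequence should hover around 1.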

Hope that helps!

**pmiguel** · 02-24-2015, 07:26 AM

> Originally posted by rozitaa:
>
> No, I am aligning to the genome reference, because later on I want to align with TopHat. This is only a way to get an idea of the inner distance. But I can try STAR too.
> I have aligned single-end reads and overlapping paired-end reads with bowtie2 several times and always got good results, but this time I suspect the inner size may be the reason I cannot get them aligned.

Mammals have teeny little exons spread out over tens to hundreds of kilobases of the genome. Mapping RNA reads (RNA has the introns spliced out) to the genome isn't a good way to determine insert size. And only getting 50% of the reads to map "concordantly" doesn't seem so bad. How is bowtie2 going to handle reads spanning a splice site?

If you want to determine your insert sizes, try aligning your reads to a long (spliced) transcript instead of genomic DNA. In my experience with the MiSeq and HiSeq, your sizes will look like all the very shortest library products were sequenced preferentially.

--
Phillip
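If you do align to a spliced transcript, the insert sizes can be read straight from the SAM output: column 9 (TLEN) holds the signed template length, positive for the leftmost mate and negative for the rightmost. The following Python sketch collects it once per properly paired template. The function names, the file name, and the read length in the usage comment are placeholders, and the inner-distance formula assumes both mates have the same length.

```python
from statistics import median


def template_lengths(sam_lines):
    """Collect TLEN (SAM column 9) once per properly paired template."""
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 11:
            continue  # skip truncated or malformed records
        flag, tlen = int(fields[1]), int(fields[8])
        # 0x2 = mates aligned as a proper pair; keeping only positive TLEN
        # counts each pair once (the leftmost mate carries the positive value).
        if flag & 0x2 and tlen > 0:
            lengths.append(tlen)
    return lengths


def inner_distance(template_length, read_length):
    """Gap between the two mates, assuming equal read lengths."""
    return template_length - 2 * read_length


# Usage (hypothetical file name and read length):
# with open("reads_vs_transcript.sam") as handle:
#     sizes = template_lengths(handle)
# print(median(sizes), inner_distance(median(sizes), 75))
```

The inner distance is the quantity TopHat takes via `-r/--mate-inner-dist`, so the median from a run like this is a reasonable starting value; it can come out negative when the mates overlap.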

**rozitaa** · 02-24-2015, 07:30 AM

> Originally posted by GenoMax:
>
> Have you seen this thread? Post #18 by Brian has some interesting data: http://seqanswers.com/forums/showpost.php?p=156399. It may be worth checking your own data.
>
> Hopefully you adapter-trimmed your data before doing the alignments.

Well it should be trimmed!!

**rozitaa** · 02-24-2015, 07:32 AM

> Originally posted by diego diaz:
>
> A kmer is a substring of length k present in a sequence (in this case, a DNA sequence).
>
> For example,
>
> ATTACGAGCGATCGCGCG
>
> If we consider kmers of length 5, then from left to right we have:
>
> ATTAC, TTACG, TACGA, and so on.
>
> In bioinformatics, it is a common task to count the frequency of all possible kmers in a given sequence.
>
> During the sequencing protocol, DNA is (in theory) fragmented randomly, so all kmers should have similar frequencies (although in reality this is not always the case). If you see some kmers enriched in your reads, you possibly have a bias. For example, when the sequencing adapter is not removed from the reads, the kmers in the adapter will be overrepresented, because the sequencing adapter is the same for all reads.
>
> In the plots that you attached I can see kmer enrichment at the ends of the reads. At the 5' end it may be due to random priming, and at the 3' end it may be remnants of the sequencing adapter; I don't know for sure.
>
> The tables show an observed/expected ratio; if the value is greater than 1, the observed frequency is greater than the expected one.
>
> Hope that helps!

Thanks for the nice explanation!

Problem with the insert size of RNA-seq paired end reads!
