SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
STAR: ultrafast universal RNA-seq aligner alexdobin Bioinformatics 218 04-02-2018 05:59 PM
Primer design Universal tailed Amplicon paloma84 454 Pyrosequencing 2 11-26-2012 11:33 AM
Program edit an ace file to simultaneously extract read information from all contig? cllorens Bioinformatics 3 06-27-2012 08:20 AM
PubMed: Target-Enrichment Through Amplification of Hairpin-Ligated Universal Targets Newsbot! Literature Watch 0 03-25-2011 06:10 AM
PubMed: Development and quantitative analyses of a universal rRNA-subtraction protoco Newsbot! Literature Watch 1 03-12-2010 05:41 PM

Reply
 
Thread Tools
Old 06-24-2013, 02:55 PM   #41
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Nicolas,

No worries. Thanks for your interest in featureCounts.

featureCounts calls an overlap when there is at least 1bp overlap found between a read and a feature. This also applies for paired-end reads, ie. an overlap is called as long as 1bp overlap was found between one of the two reads and the feature.

featureCounts will check every feature to see if it overlaps with a read. If more than one overlapping feature was found, it will not assign this read to any feature (by default). However if '-O' is specified, featureCounts will assign it to all its overlapping features.

For paired-end reads, featureCounts also checks if both ends from the same fragment overlapping with the features. For example, if a fragment has only one end overlapped with a feature F1 but this fragment also has both ends overlapped with a feature F2, this fragment will be assigned to F2 because this overlap has a stronger support.

Hope this makes it clearer. Don't hesitate to ask me if you have any further questions.


Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-25-2013, 12:18 AM   #42
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Dear Wei,

That is great, very helpful.
Thanks a lot.
Best wishes,
Nicolas
Nicolas Nalpas is offline   Reply With Quote
Old 06-25-2013, 03:04 AM   #43
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi Wei,

so to return to my issue above with the many OVERLAP reads resulting from using Trimmomatic:

Having contacted the authors of STAR we have determined that it is not the aligner at fault for the issue I have reported to you. I have attached 3 short SAM files; I took the first 100 OVERLAP reads from Trimmomatic trimmed reads as determined by featureCounts. I grepped these out of Trimmomatic qulity trimmed at Q30 ('trimmo'), hard-trimmed by 10bp ('trim') and not trimmed ('notrim') SAM. I then used featureCounts to again determine what features these reads map to. Reads were initially found using:
Code:
grep -m 100 OVER trimmo.featco.counts.reads | cut -f 1 | grep - ./trimmo/Aligned.out.sam > trimmo.100-over.sam
now when I rerun featureCounts on the 'trimmo' set:
Code:
~/bin/subread-1.3.5/bin/featureCounts -a Bos_taurus.UMD3.1.70.gtf -i trimmo.100-over.sam -o trimmo.100-over.counts -p -P -C -R -D 500000
then
Code:
grep -v OVER trimmo.100-over.counts.reads | wc -l
returns 10. So now, from exactly the same 100 reads that were initially OVERLAP, 10% are assigned with clarity to a single feature. The other two sets, 'notrim' and 'trim', return the same results as previously, each having only 2 reads return OVERLAP.

Interestingly, or unfortunately, depending on viewpoint:
Code:
grep -f trimmo-100.over.sam notrim-100.over.sam | wc -l
returns 71. So 71 of ~200 lines are identical between 'trimmo' and 'notrim', ie there has been no trimming done on these reads and they align to the same position. But when these reads are under the header of 'trimmo' they produce OVERLAP, and under the header 'notrim' they produce ACCEPTED.

I have now taken all reads which are called OVERLAP in the trimmo analysis and grepped them from trimmo and notrim SAM, then compared those two files for number of lines which are identical.

Code:
wc -l trimmo.featco.reads.solo
7380972 trimmo.featco.reads.solo
wc -l notrim.trimmo-overlap.matched.lines.sam
5036198 notrim.trimmo-overlap.matched.lines.sam
So that's 68% identical, of which 100% in trimmo data are called OVERLAP, but those 5,036,198 from notrim data, identical reads to those called OVERLAP, give a breakdown thus (obviously there are nearly double the number of reads as this is PE data and the featureCounts run include -p -P flags):

Code:
28587 ACCEPTED_2VOTE_GENE
2616603 ACCEPTED_GENE
6191 NOTFOUND_GENE
55188 OVERLAPPED_GENES
234 PAIR_DISTANCE
Which is 96% ACCEPTED, 2% OVERLAP versus 100% OVERLAP from the trimmo data. Which is a massive 'discrepancy'.

Your thoughts on this would be greatly appreciated.
Attached Files
File Type: zip overlap.sams.zip (59.4 KB, 3 views)

Last edited by bruce01; 06-25-2013 at 01:50 PM. Reason: Further evidence of issue.
bruce01 is offline   Reply With Quote
Old 06-25-2013, 04:27 PM   #44
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Hi Bruce,

The three SAM files you sent to me contained different numbers reads, so I'm not sure if you used the same set of reads in your evaluations.

Also, your 'trimmo' SAM file is not in the correct format. One of the read pairs had only one read included in the file, the other was missing. The ID of this read pair is:

HWI-ST600:257:C1MUWACXX:3:1101:8762:2175

featureCounts requires that for paired-end read data both ends must be included in the SAM/BAM file and the two reads from the same pair must be next to each other. It has a unpredicted behavior if this requirement is not met. The problematic read pair in your 'trimmo' SAM file is located at line 17, so it is possible all the subsequent read pairs were incorrectly summarized. This may explain the bizarre result you observed for this dataset.

I speculate that the discrepancy in your previous results was also due to the missing reads in your SAM files.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-27-2013, 05:06 AM   #45
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi Wei,

I was hopeful this was the issue resolved. Using Trimmomatic, you are returned 'paired' and 'unpaired' fastq files, so I immediately thought that was a nice way to remove unpaired reads; I only aligned the paired fastq. I have a script to remove all unpaired lines in SAM, and have confirmed the new, paired sam. When it is run in featureCounts, I still get the very large amount of OVERLAP:

Code:
::::::::::::::
trimmo.paired.diags
::::::::::::::
  87664 ACCEPTED_2VOTE_GENE
11663844 ACCEPTED_GENE
2824684 MULTI_MAPPING
1925844 NOTFOUND_GENE
7234680 OVERLAPPED_GENES
  10940 PAIR_DISTANCE
::::::::::::::
trimmo.diags
::::::::::::::
  87710 ACCEPTED_2VOTE_GENE
11713945 ACCEPTED_GENE
2826089 MULTI_MAPPING
1929703 NOTFOUND_GENE
7380972 OVERLAPPED_GENES
  10940 PAIR_DISTANCE
The difference is 201,703 counts, which is half the number of reads removed, ~70% are from the OVERLAPPED_GENES, but the rest are not 'correctly' allocated.

Any other ideas about this?

Last edited by bruce01; 06-27-2013 at 05:25 AM.
bruce01 is offline   Reply With Quote
Old 06-27-2013, 03:31 PM   #46
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Hi Bruce,

I would suggest you trying to count reads instead of fragments to see if you will still get a large number of reads overlapping with multiple genes. This will help determine if the large number of multi-overlapping fragments you observed were due to summarization.

If you still get a large number, then that will mean it is either a mapping problem, or a problem with your data generation, or a lot of genes in your annotation overlapping with each other.

You can simply remove those paired-end parameters, such as -p -P -D and -C, from your command to summarize reads instead of fragments.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 06-28-2013, 01:54 AM   #47
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi Wei,

I had looked at these before, but am keen to keep the counts to fragments as I believe there is more accuracy in this method. I have run for the initial trimmo sam, and the one with all non-pairs removed:

Code:
::::::::::::::
featco.pair.read.diags
::::::::::::::
35364121 ACCEPTED_GENE
4944952 MULTI_MAPPING
6848219 NOTFOUND_GENE
 338020 OVERLAPPED_GENES
::::::::::::::
featco.read.diags
::::::::::::::
35711285 ACCEPTED_GENE
4944952 MULTI_MAPPING
6902210 NOTFOUND_GENE
 340271 OVERLAPPED_GENES
bruce01 is offline   Reply With Quote
Old 06-28-2013, 02:42 AM   #48
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Hi Bruce,

I agree it is better to count fragments for paired-end data. Looking at read counts is just to help diagnose what the problem was and it turned out this is quite helpful.

The percentage of multi-overlapping reads is much smaller that of multi-overlapping fragments, suggesting that something went wrong when read pairs were being summarized. We have never seen this before for the summarization of mapping results from Subread and a few other aligners.

One possibility is that in your mapping results the order of the two reads from the same pair was altered, ie the second read appeared before the first read. If this is the case, the read pair might be wrongly assigned. You may check a few multi-overlapping fragments to see if this is the case.

Alternatively, you may try other aligners as well. Subread is guaranteed to work with featureCounts.

Hope this helps.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 10-02-2013, 06:27 AM   #49
choseqid
Junior Member
 
Location: Europe

Join Date: Apr 2013
Posts: 5
Default

Dear Wei,

I have been trying to run featureCounts, both from R and command line, but I keep getting a segmentation fault. I checked the different things suggested on this discussion thread and also tried to allocate more memory to the job resp. R session, but it didn't help. I use subread 1.3.6 and here is my command line:

featureCounts -p -P -d 50 -D 600 -a mm10/annotation/mm10.allmrna.gtf -t exon -g gene_id -b -f -i tophat_out/accepted_hits.sort.bam -o subread_counts.txt

Here is the error message:

/var/spool/gridengine/node-hp0211/job_scripts/1095023: line 10: 57569 Segmentation fault (core dumped)


Have I overseen anything?

Thanks in advance,
Cho

Last edited by choseqid; 10-02-2013 at 06:36 AM.
choseqid is offline   Reply With Quote
Old 10-02-2013, 06:06 PM   #50
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Cho,

Could you provide the complete output of your featureCounts run? It is hard to figure out what went wrong from the information you currently provided.

Also could you provide the first 100 lines of your annotation file and also the first 100 reads in your BAM file?

Cheers,
Wei
shi is offline   Reply With Quote
Old 10-03-2013, 02:48 AM   #51
choseqid
Junior Member
 
Location: Europe

Join Date: Apr 2013
Posts: 5
Default

Dear Wei,

Thanks for the quick reply. Attached are the files you ask for. I am including the R output, as I do not have any from command line other than the error message I already quoted.

Cheers,
CHo
Attached Files
File Type: zip Archive.zip (12.0 KB, 1 views)

Last edited by choseqid; 10-03-2013 at 03:01 AM.
choseqid is offline   Reply With Quote
Old 10-03-2013, 11:08 PM   #52
ddb
Member
 
Location: Europe

Join Date: Feb 2012
Posts: 13
Default

"featureCounts requires that for paired-end read data both ends must be included in the SAM/BAM file and the two reads from the same pair must be next to each other."

If this is not stated in the User Guide (I did not see it there) then it should be added as it is essential for correct functioning of the program.
ddb is offline   Reply With Quote
Old 10-04-2013, 01:53 AM   #53
choseqid
Junior Member
 
Location: Europe

Join Date: Apr 2013
Posts: 5
Default

Thanks, ddb, for the reminder. It looks like the problem lies in the way I aligned the reads with tophat: I allowed multiple hits (which apparently hampers the sorting by name) and didn't disable the separate alignment reporting for unpairable reads (ie. didn't use --no-mixed). Would fixing these two parameters help?
choseqid is offline   Reply With Quote
Old 10-04-2013, 04:13 PM   #54
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

I'm still not sure if it is the issue with paired-end reads that caused the problem. You can try to change those parameters to see it will work. But you may also try to count your reads as single-end reads by NOT using the '-p' option. This will tell us if the problem arose from dealing with the paired-end reads. Your command should be like this:

featureCounts -a mm10/annotation/mm10.allmrna.gtf -t exon -g gene_id -b -f -i tophat_out/accepted_hits.sort.bam -o subread_counts.txt

Wei
shi is offline   Reply With Quote
Old 10-08-2013, 08:54 AM   #55
choseqid
Junior Member
 
Location: Europe

Join Date: Apr 2013
Posts: 5
Default

Dear Wei,

I tried that command line, but it still drops a Segmentation fault. I also tried aligning my reads using Subreads (which succeeded), but when I ran featureCounts on the resulting SAM file I also got a Segmentation fault. The output is the same as I attached to a previous post.

Any more ideas?
choseqid is offline   Reply With Quote
Old 10-08-2013, 10:27 AM   #56
adaigle
Junior Member
 
Location: Boston

Join Date: Sep 2013
Posts: 6
Default

Hi, quick question. I was wondering if there was a way to get featureCounts to work on a Windows 7 OS. Going through R and Bioconductor would be perfect, but it looks like Rsubreads does not have a Windows version? Is there any other way?
adaigle is offline   Reply With Quote
Old 10-11-2013, 02:29 AM   #57
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by choseqid View Post
Dear Wei,

I tried that command line, but it still drops a Segmentation fault. I also tried aligning my reads using Subreads (which succeeded), but when I ran featureCounts on the resulting SAM file I also got a Segmentation fault. The output is the same as I attached to a previous post.

Any more ideas?
Hi,

Thank you for trying these options. We found featureCounts always works nicely with Subread. So the segment fault is likely to be due to some unexpected data in the annotation. We have also received some other bug reports similar to this recently. The 1.3.x version of featureCounts allows up to 60 features overlapping with each other in the annotation. If the number of such features exceeded this limit, we found the program crashed. Although this is rare but it may happen and we suspect this might be the reason causing the seg fault seen in your data.

We have removed this limit in the latest version 1.4.0 and hopefully this will solve the problem.

Also, if reads in your BAM file were sorted by chromosomal locations, you should include '-S' option in your command. Not doing so will not crash the program, but will result in incorrect read counts.

Let me know if the problem persists.

Wei
shi is offline   Reply With Quote
Old 10-11-2013, 02:39 AM   #58
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Quote:
Originally Posted by adaigle View Post
Hi, quick question. I was wondering if there was a way to get featureCounts to work on a Windows 7 OS. Going through R and Bioconductor would be perfect, but it looks like Rsubreads does not have a Windows version? Is there any other way?
You are correct that Rsubread does not have a Windows version. It is pretty hard to develop a Windows version for this package due to most of the code was written in C. I think we might eventually come up with a Windows version, but it will take a fair bit of time. If you have access to a unix machine, you can fairly easily use featureCounts via the Bioconductor package Rsubread.

Wei
shi is offline   Reply With Quote
Old 11-04-2013, 08:20 PM   #59
bw.
Member
 
Location: San Francisco, CA

Join Date: Mar 2012
Posts: 21
Default

Hi,
I would like to use featureCounts, but miss the stats provided by htseq-count (copied below) as these let me make sure I got the 'strand' setting right and other things.
Any chance you could add similar output to featureCounts (either as a separate 'stats.txt' file or as part of the main table)?

no_feature 20123817
ambiguous 9026940
too_low_aQual 0
not_aligned 0
alignment_not_unique 3034042

Thanks
-Ben
bw. is offline   Reply With Quote
Old 11-05-2013, 01:05 AM   #60
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Ben, I had the same issue, so made a command to get this info. It requires you to make the 'reads' output using -R flag.

cut -f 2 <featco.counts.reads> | sort | uniq -c > <featco.counts.diags>

Output looks like:

154266 ACCEPTED_2VOTE_GENE
23169444 ACCEPTED_GENE
40066 MULTI_MAPPING
4470627 NOTFOUND_GENE
100013 OVERLAPPED_GENES
2850 PAIR_DISTANCE

Hope that helps, Bruce.
bruce01 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:14 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO