SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Literature Watch (http://seqanswers.com/forums/forumdisplay.php?f=10)
-   -   CNVkit: Copy number detection and visualization for targeted sequencing using off-tar (http://seqanswers.com/forums/showthread.php?t=47910)

etal 10-30-2014 03:48 PM

CNVkit: Copy number detection and visualization for targeted sequencing using off-tar
 
CNVkit is a software toolkit to infer and visualize copy number from targeted DNA sequencing data. It is designed for use with hybrid capture, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina.

This method uses the nonspecifically captured off-target reads to supplement read depth information from on-target regions. With relatively simple normalization steps to make these read depths comparable across the genome, CNVkit can produce copy ratio estimates extremely close to those by array CGH.

Manuscript preprint:
http://dx.doi.org/10.1101/010876

Source code:
https://github.com/etal/cnvkit

Documentation:
http://cnvkit.readthedocs.org/en/latest/

I've attempted to make CNVkit compatible with other software and easy to integrate into sequencing analysis pipelines. (Currently supported or under development: bcbio-nextgen, Galaxy, THetA2, IGV, BioDiscovery Nexus Copy Number, Java TreeView, probably others.) If you would like to see CNVkit play nicely with another existing program, please let me know.

jaspersaris 02-04-2015 04:56 AM

Hi etal,

As I am not fluent with python (and certainly not my collagues), I wondered what progress you made with running in Galaxy (is there already a toolshed-version?) and/or the combination with Biodiscovery's Nexus. Would the latter be a report in a txt-format readable by Nexus?

With regards, Jasper

etal 02-04-2015 09:37 AM

Hi Jasper,

If you would like to quickly get started without doing any technical work, I recommend the DNAnexus app, which is essentially complete and free to use initially after you've set up a DNAnexus account.

The Galaxy tool is not feature-complete, but I've made it available for testing in the Test Tool Shed (not the production one) in order to learn how people would like to use it. I'll spend more time on this if there's interest, but at the moment it's not the easiest way to use CNVkit.

The BioDiscovery Nexus Copy Number support in CNVkit is simple: CNVkit can export bin-level log2 copy ratio data to a tabular file which can then be loaded in Nexus Copy Number (e.g. "cnvkit.py export nexus-basic MySample.cnr -o MySample.nexus"). The DNAnexus applet also emits these files. Then, within Nexus Copy Number, load the generated file and specify the "basic" format. The copy number data should then appear similarly to array CGH.

Also, please note that CNVkit does not require Python programming to use, only the Unix/Linux command line. Feel free to let me know if you find any of the commands difficult to use or poorly documented.

jaspersaris 02-18-2015 02:39 AM

Hi Etal,

For the moment it appears that using a Galaxy is the better solution for us. I am running one in a VMware and installed CNVkit this morning. Now I am uploading some .bam files to work with.

With regards, Jasper

jaspersaris 02-18-2015 08:09 AM

Hi Etal,

In the current galaxy wrapper all three .bam files appear in both the 'sample' list, as well the 'Normal' list, without a selection or adjustment option.

etal 02-18-2015 01:45 PM

The BAM files available on your server or under your account (however you've set it up) should be visible under either field in the input form, and you should be able to select the files you want for each field through the usual means, e.g. Ctrl+click. Choose the tumor samples for the "sample" list, and any normal or germline files that were sequenced under the same protocol for the "normal" list.

You can also run it without any "normal" BAM inputs and the results should be reasonably good.

shrutimish@gmail.com 02-23-2015 10:57 AM

Hi etal, I have a more basic question for CNV detection- If I have a custom target panel through Ampliseq multiplex PCR, then is there a way to detect CNVs in that data?

Your CNVkit software works only for hybrid capture target panels like TSCA (Illumina) and Haloplex (Agilent)? Am I getting it correct?

Thanks

etal 02-23-2015 05:37 PM

Yes, CNVkit is designed for hybrid capture target panels like those of Illumina, Agilent and Nimblegen. For best results from amplicon-based targeted resequencing, you should probably use another program like OncoCNV.

I've recently added a little bit of support for running the CNVkit pipeline without using any off-target bins. You can try it by downloading release 0.3.4 and following the instructions here. However, this new feature isn't thoroughly tested and I'm told it doesn't work for everyone, and in any case another layer of gene-level normalizations would be needed to make it perform well on targeted amplicon sequencing data. I'll let SeqAnswers know when that mode is ready for broader use.

shrutimish@gmail.com 02-24-2015 07:42 AM

Thanks etal for your reply and this information. I will look into OncoCNV.

Mulos 02-25-2015 09:34 AM

fail to open file?
 
Hi Etal,

I wanted to test CNVkit on a small set of targeted sequencing data, but it is giving me trouble. With both batch and coverage (I haven't tried the other functions yet), I get the error:

"Processing reads in test.bam
[E::hts_open] fail to open file '-Q'
Segmentation fault: 11"

The indexing right before this worked fine. The chromosome names in my bam-files, fasta and target files are all the same (no "chr" prefix). The tests of cnvkit were all ok, no errors or warnings, and my pysam and pyvcf are up to date.
Any idea what I can do to fix this?

Thanks in advance,
Marlous

etal 02-25-2015 10:25 AM

Hi Marlous,

Thanks for reporting this. Can you show me the commands you used to trigger this error, including any special options you used? Are your BAM files definitely in BAM format, and e.g. "samtools idxstats" works on your BAM files?

I think this error indicates that command line options for "samtools bedcov" on your system are different than those in the versions I tested, missing the "-Q" option to set a minimum mapping quality score when counting reads. This would break CNVkit's coverage command internally. The source code of pysam's bundled samtools (in both pysam 0.8.0 and the current trunk) shows that the bedcov command does still accept the -Q option (using getopt), so either you have a version of pysam/samtools/htslib different from what I looked at, or I'm misunderstanding the error. The bedcov command is barely documented in samtools 1.1 and I can't see why this option might be missing in some versions. Maybe a samtools expert here can help?

In the meantime, you can avoid the call to bedcov with "cnvkit.py coverage --count", which returns similar results to the default strategy but filters out unmapped reads directly using the Pysam API, rather than by calling bedcov with the -Q option.

Best,
Eric

Mulos 02-26-2015 02:12 AM

Hey Eric,

Thanks for the quick reply! My bam-files are indeed bam-files, the idxstats works fine. I tried samtools bedcov from the command line (version 1.2), without the '-Q' option it works fine but if I add this option I get the same error message, so it seems to be a samtools-related error indeed...
I was mostly interested in the batch option from CNVkit, is it also possible to run this while circumventing the -Q?

Just to be sure, the command I used for the coverage analysis:
Quote:

python cnvkit.py coverage ~/Documents/CNVkit/testT_dedup.realigned.bam ~/Documents/genomes/design_test4.bed -o ~/Documents/CNVkit/test_dedup.realigned.cnn
Output:
Quote:

Processing reads in testT_dedup.realigned.bam
[E::hts_open] fail to open file '-Q'
Segmentation fault: 11
For the batch analysis:
Quote:

python cnvkit.py batch ~/Documents/CNVkit/*T_dedup.realigned.bam --normal ~/Documents/CNVkit/*R*bam --fasta ~/Documents/genomes/GRCh37_gatk.fasta --access ~/bin/cnvkit/data/access-10kb.hg19_noChr.bed --output-reference ~/Documents/CNVkit/test.cnn --output-dir ~/Documents/test/ -t ~/Documents/genomes/design_test4.bed
Output:
Quote:

Detected file format: BED
Detected file format: BED
Detected file format: BED
Wrote /Users/.../Documents/CNVkit/test/design_test4.antitarget.bed with 1678 background intervals
Building a copy number reference from normal samples...
Processing reads in test1R_dedup.realigned.bam
[E::hts_open] fail to open file '-Q'
Segmentation fault: 11
Thanks again, Marlous

etal 02-26-2015 01:33 PM

I've added the "-c/--count" option from the "coverage" command to the "batch" options as well. If you installed CNVkit from the GitHub repo, you can get the latest by either pulling the new commits or downloading the latest source code Zip file from the master branch and installing that.

Please let me know if that works for you.

Mulos 02-27-2015 05:52 AM

Works like a charm, thank you so much.

Also, very pleased with the results!

inijman 03-02-2015 03:30 AM

HI Etal,

Very nice work on a promising tool.

I'm giving it a go as well and what strikes me is that I don't always get the PDF files with the diagram and scatter option in a batch run. Is there a way to get verbose (error) logging?

When I do generate them with the individual scatter/diagram option, they are there, but the diagram is very difficult to read as all the labels overlap each other. Is there a way to influence this, or break all chromosomes to sepperate pages?

If i'm on a X11 terminal I can generate the plots, but when we run analyses on the cluster, there is no display available and we would need the pdf files.

Additionally, it would be great if you could use both sample.bam.bai als sample.bai file as indices.

Best, Ies
ps: and Hi to all the other posters!

etal 03-12-2015 02:20 PM

Hi Ies,

Sorry I missed your post earlier. There isn't a verbose logging mode in CNVkit, but the messages on standard error are fairly verbose already and should always report when there is an error. In particular, if something crashes then you'll see a Python traceback message. However, if you parallelized the batch run (-p >1), the messages from each process will be interleaved, which makes them somewhat harder to intepret.

The scatter and diagram PDFs should always be generated in a batch run; there isn't a special code path where they would be skipped. Does the log say something like "Wrote MySample-scatter.pdf" for the missing PDFs, or not?

The --scatter option uses matplotlib to generate a PDF. On a cluster, the default matplotlib backend (e.g. Wx or Gtk) might not be available, and so I guess it's possible the plotting engine of matplotlib gets confused and silently fails to write the file. You could address that by setting a different backend on your cluster -- create a file called "matplotlibrc" in the current working directory or your home folder, with the contents:

backend : pdf

The --diagram option uses a different backend, Reportlab, which always generates a PDF from scratch and does not have an interactive mode. I can't think of a reason why this one would occasionally fail to write a file. Can you suggest anything unusual about your system's configuration? Outdated software versions, maybe?

If the diagram is showing labels for hundreds of genes, that means:
(a) you did exome sequencing, so there's lots to show;
(b) significant copy number alterations cover large regions of chromosomes in your sample; and/or
(c) the purity of your tumor samples is fairly high.

You can:

- Thin out the labeling to some extent by specifying a higher threshold (-t) log2 ratio value in the diagram command; the default is 0.6, so try 0.8 or 0.9 to only show the higher-amplitude CNAs.
- Drop the labels altogether by specifying a high value for -t or just passing the .cns segment file (with -s), without the .cnr.
- Use the "heatmap" command instead to view the unlabeled CNA regions for many samples at once.
- Use the "gainloss" command to list all genes with log2 ratio amplitudes beyond a given threshold, essentially the labels you're currently seeing on the diagram but in a more manageable plain-text, tabular format.

The diagram is based on Biopython's Graphics/BasicChromosome module. If you're handy with Python and have a specific modification in mind, you could edit cnvlib/diagram.py (202 lines) to do it. For example, you can change PAGE_SIZE to much larger dimensions like 22x17" and the chromosomes will scale proportionally, but the gene labels stay the same size and will be more readable if they were overlapping before.

Thanks for the suggestion on sample.bai, I'll look into it.

Cheers,
Eric

lethalfang 04-06-2015 10:44 AM

When I tried to run more tumor samples using the reference .cnn file from a past normal run, there is no .cnr output.

This gives me .cnr files:
cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam --normal Na.bam Nb.bam Nc.bam --targets captures.bed --fasta b37_decoy.fasta --split --access b37.decoy.accessibles.bed --scatter --diagram --output-reference N.cnn --output-dir .


This does not output .cnr pr antitargetcoverage.cnn file:
cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --targets captures.bed --fasta b37_decoy.fasta --split --access b37.decoy.accessibles.bed --scatter --diagram --output-dir .

What is missing? Thanks.

etal 04-06-2015 11:05 AM

What files were generated by the second run, if any? Can you show me the status messages or any errors that were printed?

When you run CNVkit with a reference, you don't need the "--targets", "--fasta", "--split" and "--access" arguments as that information has already been captured in the reference file. The default output directory is the current directory ("."). Try this instead:

cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --scatter --diagram

lethalfang 04-06-2015 11:15 AM

Quote:

Originally Posted by etal (Post 168490)
What files were generated by the second run, if any? Can you show me the status messages or any errors that were printed?

When you run CNVkit with a reference, you don't need the "--targets", "--fasta", "--split" and "--access" arguments as that information has already been captured in the reference file. The default output directory is the current directory ("."). Try this instead:

cnvkit.py batch T1.bam T2.bam T3.bam T4.bam T5.bam T6.bam T7.bam T8.bam T9.bam T10.bam -r N.cnn --scatter --diagram

Thanks. For the second run where I was having question, there is no error message, and a bunch of .targetcoverage.cnn files are generated.

The status messages are:
Code:

Summary: #bins=292, #reads=7633921, mean=26143.5653, min=212.35, max=224646.99
On-target percentage: 37.877 (of 20154547 mapped)
Wrote ./T4.targetcoverage.cnn
Running the CNVkit pipeline on T9.bam ...
Processing reads in T9.bam
Time: 52.775 seconds (154291 reads/sec, 6 bins/sec)
Summary: #bins=292, #reads=8142711, mean=27885.9981, min=268.94, max=279225.27
On-target percentage: 38.472 (of 21165340 mapped)
Wrote ./T3.targetcoverage.cnn
Running the CNVkit pipeline on T10.bam ...
Processing reads in T10.bam
Time: 55.564 seconds (154063 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8560291, mean=29316.0670, min=326.38, max=293055.07
On-target percentage: 38.593 (of 22181003 mapped)
Wrote ./T6.targetcoverage.cnn
Time: 57.101 seconds (149544 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8539037, mean=29243.2782, min=302.54, max=292123.56
Time: 57.092 seconds (155098 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8854878, mean=30324.9281, min=281.23, max=295576.58
Time: 57.155 seconds (147061 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8405294, mean=28785.2540, min=254.16, max=285135.1
On-target percentage: 38.485 (of 22188207 mapped)
On-target percentage: 39.087 (of 22654458 mapped)
On-target percentage: 38.633 (of 21757000 mapped)
Wrote ./T2.targetcoverage.cnn
Wrote ./T8.targetcoverage.cnn
Wrote ./T1.targetcoverage.cnn
Time: 58.704 seconds (143289 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=8411624, mean=28806.9335, min=241.63, max=282191.54
On-target percentage: 38.845 (of 21654563 mapped)
Wrote ./T7.targetcoverage.cnn
Time: 60.050 seconds (150608 reads/sec, 5 bins/sec)
Summary: #bins=292, #reads=9044023, mean=30972.6832, min=298.46, max=306083.6
On-target percentage: 38.815 (of 23300051 mapped)
Wrote ./T5.targetcoverage.cnn
Time: 50.180 seconds (165905 reads/sec, 6 bins/sec)
Summary: #bins=292, #reads=8325083, mean=28510.5614, min=309.7, max=283705.95
On-target percentage: 38.622 (of 21555485 mapped)
Wrote ./T9.targetcoverage.cnn
Time: 51.261 seconds (173281 reads/sec, 6 bins/sec)
Summary: #bins=292, #reads=8882609, mean=30419.8946, min=299.88, max=296973.17
On-target percentage: 38.693 (of 22956916 mapped)
Wrote ./T10.targetcoverage.cnn



I ran again with the command you suggested, and the results are now as expected, i.e., identical to the first run.

Thanks.

wisekh 05-11-2015 02:33 PM

calling LOH from a pair of tumor and normal exome bams
 
Dear Eric,

I wonder if you can provide an instruction/example on how to generate LOH calls from a pair of tumor and normal exom bams.
I assume this tool can generate them, but I couldn't find a tuturoal in the cnvkit website. probably, I missed it?

thank you very much in advance,

wisekh

etal 05-11-2015 04:15 PM

Hi wisekh,

The LOH functionality in CNVkit is described here:
http://cnvkit.readthedocs.org/en/lat...s.html#scatter

However, the "calls" are simply displayed visually -- the variant allele frequencies are plotted alongside the copy ratios, and a shift in VAF from 0.5 indicates LOH. I'm currently working on expanding this functionality to make it more useful.

To run the complete pipeline with a tumor-normal pair and make a plot of the copy number and LOH shifts together, follow the quick start guide here:
http://cnvkit.readthedocs.org/en/latest/quickstart.html

Separately from running the CNVkit "batch" pipeline, you'll need to call SNPs in the tumor sample in VCF format. Then use that VCF file as input to the CNVkit "scatter" command along with the .cnr and .cns files from the CNVkit pipeline to make the plot.

Hope that helps,
Eric

wisekh 05-11-2015 09:08 PM

I got it. Thank you very much for your quick response.

hoon

wisekh 05-12-2015 08:24 PM

using a mapping quality threshold
 
Dear Eric,
I have another question. Is it possible to use a mapping quality threshold to, for example, a tumor bam or multiple tumor bams that I would infer copy number from?

thank you,

Hoon

etal 05-14-2015 10:08 AM

The mapping quality threshold is hard-coded to only exclude unmapped or ambiguously mapped reads, see here:
https://github.com/etal/cnvkit/blob/...verage.py#L115

Just change the -Q value to another integer if you'd like to try it yourself. However, I recall reading (can't find the reference at the moment) that keeping low-MAPQ reads did not harm copy number estimation and may have improved it.

In any case, in CNVkit the script genome2access.py can be used to directly exclude poorly mappable genomic regions. This is done already for hg19 in the bundled file access-5k-mappable.hg19.bed.

oxydeepu 03-08-2016 04:25 AM

Sequence IDs don't match with bed Error CNVkit.
 
Hi all,

I trying to run CNVkit with tumor and normal samples on exome sequencing. But I tried all possible things mentioned in the docs. I tried different bed file input as the target region and also the annotations but I always end up getting the error.
ValueError: BED file 'results/S04380110_Covered.target.bed' sequence IDs don't match any in BAM file.

Anybody came across this issue. Please help.

Thank you in advance.

Deepak

etal 03-08-2016 07:33 AM

Hi Deepak,

The sequence IDs are the chromosome names in your reference genome and the first column of your BED file. For the human genome the first chromosome might be "1" or "chr1" depending on where you got your reference genome.

Check that the name schemes match between your BED and BAM files, either "1" or "chr1". You can use "samtools view -H" to see the BAM header. If the names don't match, then you should edit your BED file, either adding or removing the "chr" prefix, so that they do match.

Hope that helps,
Eric

DrWorm 03-09-2016 01:30 PM

errors in production of cns file
 
Hello all,

I'm having issues working through the install of CNVkit. I've pinpointed my issues to problems with the segment sub-command. When I try to run the test procedure, it dies with this error:

Traceback (most recent call last):
File "../cnvkit.py", line 11, in <module>
args.func(args)
File "/home/snmcnulty/bin/cnvkit-master/cnvlib/commands.py", line 666, in _cmd_segment
rlibpath=args.rlibpath)
File "/home/snmcnulty/bin/cnvkit-master/cnvlib/segmentation/__init__.py", line 58, in do_segmentation
segarr = cnarr.as_dataframe(seg2cns(seg_out))
File "/home/snmcnulty/bin/cnvkit-master/cnvlib/segmentation/__init__.py", line 143, in seg2cns
+ seg_text)
ValueError: Segmentation output is not valid SEG format:

[then a big table]

Makefile:57: recipe for target 'build/p2-5_5.cns' failed
make: *** [build/p2-5_5.cns] Error 1

If anyone could help me pinpoint the problem, I would really appreciate it!

etal 03-09-2016 03:25 PM

Sorry for the trouble. Could you post the first few lines of the output table and the version of CNVkit you're using (try "cnvkit.py version")?

If you're able to load the PSCBS package in R, could you post that package's version too?

If the default CBS method is failing, as an alternative you can try the "haar" method, which does not rely on R at all:

cnvkit.py segment MySample.cnr -m haar

(Use the -t parameter to specify the significance threshold, which might need to be very small for exome sequencing data to segment well.)

oxydeepu 03-10-2016 07:39 AM

Hi Eric,

I have solved the problems with the reference. But now the problem seems to be different at the step of calculating coverage. This is the error I get.

#######################################

cnvkit.py coverage /media/ae5197fd-42a8-4d8f-96d6-1c8762b4d8ae/MM02-N_MM02-020200/colossus_analysis/BAM/MM02-020200_final.bam my_targets.bed -o .targetcoverage.cnn
Processing reads in MM02-020200_final.bam
Traceback (most recent call last):
File "/usr/local/bin/cnvkit.py", line 4, in <module>
__import__('pkg_resources').run_script('CNVkit==0.7.9.dev0', 'cnvkit.py')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 726, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1491, in run_script
exec(script_code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/CNVkit-0.7.9.dev0-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 11, in <module>

File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 434, in _cmd_coverage

File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 457, in do_coverage

File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 37, in interval_coverages
File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 110, in interval_coverages_pileup
File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 143, in bedcov
RuntimeError: Bad line from bedcov:
1

############################

Can you shed some light into the issue?

Thank you.

Regards,
Deepak

DrWorm 03-10-2016 12:20 PM

segmentation, ctd.
 
M CNVKit version is 0.7.8. R is installed, and I'm using PSCBS v0.61.0. Here's the output I get from the segmentation command.

to connect to socket /tmp/dbus-Hufo1aFYF5: Connection refused
Dropped 1 outlier bins:
chromosome start end gene log2 weight
0 chr4 46638882 46738865 Background -4.1499 0.167756
Traceback (most recent call last):
File "/opt/apps/cnvkit/0.7.8/bin/cnvkit.py", line 11, in <module>
args.func(args)
File "/opt/apps/cnvkit/0.7.8/lib/python2.7/site-packages/cnvlib/commands.py", line 666, in _cmd_segment
rlibpath=args.rlibpath)
File "/opt/apps/cnvkit/0.7.8/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 58, in do_segmentation
segarr = cnarr.as_dataframe(seg2cns(seg_out))
File "/opt/apps/cnvkit/0.7.8/lib/python2.7/site-packages/cnvlib/segmentation/__init__.py", line 143, in seg2cns
+ seg_text)
ValueError: Segmentation output is not valid SEG format:
WARNING: ignoring environment value of R_HOME
"sampleName" "chromosome" "start" "end" "nbrOfLoci" "mean"
"19987_main" "chr1" 93708 2582834 32 0.0317406246246301
"19987_main" "chr1" 2684720 6239917 34 -0.392436669374812
"19987_main" "chr1" 6239917 26177167 253 0.0239173699112572
"19987_main" "chr1" 26177167 29477537 33 0.379337119869004

Also, you were right -- using a non-R workflow completed perfectly.

etal 03-10-2016 12:35 PM

Deepak: I haven't been able to replicate the error. Are you sure you used a valid BED file, i.e. my_targets.bed is a 4-column BED file that you've preprocessed with "cnvkit.py target"? The error message suggests you might have given a 1-column list of chromosome IDs instead.

etal 03-10-2016 12:41 PM

DrWorm: The SEG parser handles 5- or 6-column tables like the one you've posted. I wonder if the ValueError is due to the "WARNING" message from R getting fed to the SEG parser in CNVkit, just before the start of the usual table. If so, maybe you could disable that message from R and therefore prevent the CNVkit error by prefixing your command line to reset or unset R_HOME:

R_HOME="" cnvkit.py 19987_main.cnr

oxydeepu 03-11-2016 08:18 AM

Dear Eric,

Thank you for sparing your time. I tried the target and it gives a 4 field bed. But I gues the problem is coming when I use the BAM files.

###################

cnvkit.py batch /home/deepak/MM02-N_MM02-020200/colossus_analysis/BAM/MM02-020200_MarkDup.bam --normal /home/deepak/MM02-N_MM02-020200/colossus_analysis/BAM/MM02-N_MarkDup.bam --targets Agilent_b37_exome_covered.bed --split --annotate refFlatMod.txt --fasta /home/deepak/Ref/human_g1k_v37_decoy.fasta --access /home/deepak/src/cnvkit/data/access-5k-mappable.hg19.bed --output-reference my_reference.cnn --output-dir results/ --diagram --scatter

Processing reads in MM02-N_MarkDup.bam
Traceback (most recent call last):
File "/usr/local/bin/cnvkit.py", line 4, in <module>
__import__('pkg_resources').run_script('CNVkit==0.7.9.dev0', 'cnvkit.py')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 726, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1491, in run_script
exec(script_code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/CNVkit-0.7.9.dev0-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 11, in <module>

File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 97, in _cmd_batch

File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 172, in batch_make_reference

File "build/bdist.linux-x86_64/egg/cnvlib/parallel.py", line 14, in apply_async
File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 193, in batch_write_coverage

File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 457, in do_coverage

File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 37, in interval_coverages
File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 110, in interval_coverages_pileup
File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 143, in bedcov
RuntimeError: Bad line from bedcov:
1

#######################

But it does create two bed files.

$ head results/Agilent_b37_exome_covered.*

==> results/Agilent_b37_exome_covered.antitarget.bed <==
chr1 10500 93708 Background
chr1 93708 176917 Background
chr1 227917 267219 Background
chr1 318219 394543 Background
chr1 394543 470868 Background
chr1 521868 563949 Background
chr1 570871 671469 Background
chr1 671469 772067 Background
chr1 772067 872665 Background
chr1 872665 973263 Background

==> results/Agilent_b37_exome_covered.target.bed <==
1 65509 65625 -
1 65831 65973 -
1 69481 69600 OR4F5
1 721381 721519 -
1 721530 721806 -
1 721851 721942 -
1 752916 753035 FAM87B
1 762095 762275 LINC00115
1 762280 762414 LINC00115
1 762420 762565 LINC00115

I have no idea where it is going wrong.

Sorry for a messy and long mail.

Thank you.

Best,
Deepak

etal 03-11-2016 11:09 AM

Deepak, I notice your antitarget BED chromosomes names start with "chr" but your target BED chromosome names do not. Which names are in your BAM file? (I'm guessing "chr", like hg19, based on the generated antitarget BED file.) Either way, there will be a mismatch between chromosome names in the BAM and BED files at some point. I recommend checking (try "samtools view -H"), and if the "chr" prefix is used in the BAM, then add "chr" to the input target BED file.

oxydeepu 03-11-2016 11:28 AM

Hi Eric,

Thanks for the reply. I cross checked chr notations in bam and reference bed file. It is all without chr. But when it creates antitarget files it adds chr to the file and vice versa if I use a reference bed file with chr. I don't know why.

Best,
Deepak

etal 03-11-2016 05:13 PM

The antitarget BED file has "chr" because "access-5k-mappable.hg19.bed" has chr. You could build an access file for GRCh37 using "cnvkit.py access", but it would probably be quicker for you to just copy the hg19 one and delete the "chr" prefixes from each line.

oxydeepu 03-12-2016 10:31 AM

Dear Eric,


I tried all possible removal of chr as my BAM file is with just chromosome numbers.But still ended up getting same error as I mentioned above. Now the target and the anti target files have chromosome numbers without chr. I get this error now.

Wrote results/Agilent_b37_exome_covered.antitarget.bed with 54084 regions
Building a copy number reference from normal samples...
Processing reads in MM02-N_final.bam
Traceback (most recent call last):
File "/usr/local/bin/cnvkit.py", line 4, in <module>
__import__('pkg_resources').run_script('CNVkit==0.7.9.dev0', 'cnvkit.py')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 726, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1491, in run_script
exec(script_code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/CNVkit-0.7.9.dev0-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 11, in <module>

File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 97, in _cmd_batch

File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 172, in batch_make_reference

File "build/bdist.linux-x86_64/egg/cnvlib/parallel.py", line 14, in apply_async
File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 193, in batch_write_coverage

File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 457, in do_coverage

File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 37, in interval_coverages
File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 110, in interval_coverages_pileup
File "build/bdist.linux-x86_64/egg/cnvlib/coverage.py", line 143, in bedcov
RuntimeError: Bad line from bedcov:
1


Sorry for the trouble. I really need to make this work.. :(.

Best,
Deepak

etal 03-12-2016 06:10 PM

You can skip bedcov with the -c/--count-reads option; it uses a different method to calculate read depths but should give similar results.

Also, if you've successfully built the target and antitarget BED files, you can use those as inputs to the "batch" command and drop some of the command line options:

cnvkit.py batch ~/MM02-N_MM02-020200/colossus_analysis/BAM/MM02-020200_MarkDup.bam --normal ~/MM02-N_MM02-020200/colossus_analysis/BAM/MM02-N_MarkDup.bam --targets results/Agilent_b37_exome_covered.target.bed --antitargets results/Agilent_b37_exome_covered.antitarget.bed --fasta ~/Ref/human_g1k_v37_decoy.fasta --output-reference my_reference.cnn --output-dir results/ --diagram --scatter --count-reads

etal 03-13-2016 08:46 PM

Deepak, another user reported this issue and found that it was due to a bug in Pysam 0.9:
https://github.com/etal/cnvkit/issues/86

Your original command should work for you if you are able to remove Pysam 0.9 and install Pysam 0.8.4 instead.

oxydeepu 03-14-2016 02:01 AM

Thank you Eric. I will try and let you know.

Best,
Deepak

oxydeepu 03-14-2016 08:27 AM

Dear Eric,

CNVkit worked when I used the older version of pysam. Now I have a doubt regarding running CNVkit. I have 20 samples with matched tumor and normal. Do I make reference from all normal samples and run each tumor sample or run individually? Sorry about naive question.

Thank you.

Best,
Deepak

oxydeepu 03-14-2016 12:18 PM

Hi Eric,

Sorry to flood you with questions. one more thing, my files have same naming for normals across samples and it is giving me an error saying Dupicate ID. The problem is the files are write protected and huge files, so I don't want to rename or move them. Is there any way around it?

Thank you for your patience.

Best,
Deepak


All times are GMT -8. The time now is 02:15 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2022, vBulletin Solutions, Inc.