Advanced tutorials

Database

Clean up space

To avoid duplicating large files, remove at least the initial dbSNP and CADD files and the VEP cache:

# Run from the dgenomes root; each subshell returns to it afterwards.
(cd dbsnp && git annex drop GCF_000001405.40.gz*)
(cd cadd && git annex drop whole_genome_SNVs.tsv.gz*)
(cd vep-human && git annex drop homo_sapiens_merged_vep_110_GRCh38.tar.gz)

Download only some databases

Simply get the corresponding folder. From the dgenomes root directory, download only clinvar with:

datalad get clinvar

Update

To keep things up to date, first synchronize dgenomes:

datalad update --how merge

Then update a database by going into the corresponding folder and running the same command.
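The update sequence can be wrapped in a small helper. This is only a sketch: the function name update_dgenomes is hypothetical, and it assumes it is called from the dgenomes root with the database folders to refresh passed as arguments.

```shell
# update_dgenomes DB...: hypothetical helper, to be run from the dgenomes root.
# It synchronizes dgenomes itself, then runs the same command in each folder.
update_dgenomes() {
    datalad update --how merge || return 1
    for db in "$@"; do
        # The subshell keeps the caller's working directory unchanged.
        (cd "$db" && datalad update --how merge) || return 1
    done
}

# Example: update_dgenomes clinvar dbsnp
```

The helper stops at the first failing update, so a partially merged state is easy to spot.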

Dependencies

Avoid recompiling dependencies

By default, nix downloads the source code of a package and builds it from scratch. Nixpkgs has its own binary cache, which avoids most compilation. However, packages that depend on Python 2 are no longer in nixpkgs, because Python 2 is no longer supported. To avoid rebuilding those packages, you can set up a binary cache, for example with cachix. This is what the GitHub CI does to avoid rebuilding everything. Of course, if a package has been modified, its cached binary is invalidated and the package is built from source.

Install all dependencies globally

Instead of having a dedicated shell, you may want to have them available in the PATH:

nix profile install .#*

Use another version of a tool

Sometimes, you may want to use a newer or older version of a tool. Nix is quite flexible, but this does require a bit of configuration. Basically, create packages/MYPACKAGE/default.nix, following the pyflow configuration for example. Then update packages/default.nix to override the package (again, see how we do it for pyflow).

Running the pipeline

Run other tools

Sarek allows multiple tools to be used at once. The output of each tool goes into a different subdirectory. A common use case is to compare variant callers:

--tools haplotypecaller,strelka

Run other steps

It's possible to start (or restart) from a specific step with --step and a suitable samplesheet. For example, to annotate a VCF directly:

--step annotate --tools snpeff

Then use a samplesheet similar to:

patient,sample,vcf
patient1,test_sample,test.vcf.gz

More examples are in the official documentation.

Running from another folder

The above command uses a configuration file, nextflow.config, that is assumed to be in the current directory. It is perfectly fine to call it from another location, but -c $GITHUB_ROOT/nextflow.config must then be appended to the command line.

More information about running sarek can be found in the official documentation.

Set sarek version

Add -r VERSION to the command line.
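When the pipeline is often launched from other folders with a pinned release, the -c and -r options can be set once in a wrapper. The sketch below rests on assumptions: the function name run_sarek and the variable SAREK_VERSION are hypothetical, and the base command is assumed to be nextflow run nf-core/sarek.

```shell
# Hypothetical wrapper; adapt the base command to the one you normally use.
# GITHUB_ROOT:   path to the checkout that contains nextflow.config
# SAREK_VERSION: sarek release to pin (assumed variable name)
run_sarek() {
    nextflow run nf-core/sarek \
        -r "${SAREK_VERSION:?set SAREK_VERSION to a sarek release}" \
        -c "${GITHUB_ROOT:?set GITHUB_ROOT to the repository checkout}/nextflow.config" \
        "$@"
}

# Example: run_sarek --input samplesheet.csv --outdir results --tools haplotypecaller,strelka
```

Any remaining arguments are passed through unchanged, so the --step and --tools options described above keep working.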

Setting cluster resources

Usually, each cluster has its own configuration files. The simplest way to start is to set resources for processes tagged with the following labels:

  • process_single
  • process_low
  • process_medium
  • process_high

An example is given in conf/slurm.config for slurm. Other executors can of course be used. It's also possible to set resources for specific processes.
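As a rough sketch of what such a configuration can look like, the snippet below writes a Nextflow config that sets resources per label and overrides one specific process. It is not the content of conf/slurm.config: the file name my_cluster.config, all resource values, and the choice of GATK4_HAPLOTYPECALLER as the overridden process are assumptions to adapt to your cluster.

```shell
# Write an illustrative config file; the name and every value are assumptions.
cat > my_cluster.config <<'EOF'
process {
    executor = 'slurm'

    // Resources per label
    withLabel: process_single {
        cpus   = 1
        memory = 4.GB
        time   = 2.h
    }
    withLabel: process_low {
        cpus   = 2
        memory = 12.GB
        time   = 4.h
    }
    withLabel: process_medium {
        cpus   = 6
        memory = 36.GB
        time   = 8.h
    }
    withLabel: process_high {
        cpus   = 12
        memory = 72.GB
        time   = 16.h
    }

    // Override for one specific process (check the name against your sarek version)
    withName: 'GATK4_HAPLOTYPECALLER' {
        memory = 16.GB
    }
}
EOF
```

The file can then be passed to the run with an extra -c my_cluster.config option.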

Troubleshoot

This happens when extracting the FASTA file in genome_human. gzip -f is required for that.