Introduction

This guide aims to help you quickly set up a reproducible environment for the nf-core Sarek pipeline. More advanced topics are covered later on, along with a FAQ.

To help situate the project, the context is developed below. Technical details are also available for reference. Contributions are of course welcome!

Tutorials

This tutorial will show you how to configure everything needed to run the sarek pipeline (germline only) and how to use the pipeline for analysis. The only requirement is to install nix.

Installing all dependencies

First, enable nix flakes:

mkdir -p ~/.config/nix
echo 'experimental-features = nix-command flakes' >> ~/.config/nix/nix.conf

Then, set up with nix a shell with everything installed. From the source code directory, run:

nix develop

All the commands below should be typed in this shell. It's also possible to use a global installation.

Technical details are developed further here.

Downloading all databases

This is managed by datalad. In the following, we assume they are stored in /Work/data/dgenomes. To download all of them at once:

cd /Work/data
git clone https://github.com/apraga/dgenomes
cd dgenomes
datalad get .

Unfortunately, some data are not pipeline-ready. In particular, dbSNP and CADD use a different chromosome notation and their chromosomes must be renamed. For CADD, this takes around 7 hours! Note that the version numbers must be corrected to match the current version.

tar xzf genome-human/*.tar.gz -C genome-human
gunzip genome-human/*.gz
tar xzf vep-human/*.tar.gz -C vep-human
unzip snpeff-human/*.zip -d snpeff-human
# Rename RefSeq chromosomes to chr1,chr2...
(cd dbsnp && bash rename_chr.sh)
# Rename chromosomes 1,2,... to chr1,chr2... Takes around 7 hours!
(cd cadd && bash rename_chr.sh)

Don't forget to clean up large files if needed. See also the advanced tutorial for databases and the technical details.

Running the pipeline

A run is defined by a CSV file (samplesheet) where the samples and input files are listed. Here's one for a single, minimal sample, starting from FASTQ files:

patient,sex,status,sample,lane,fastq_1,fastq_2
test,XX,0,test,ada2-e5-e6,tests/ada2-e5-e6_R1.fastq.gz,tests/ada2-e5-e6_R2.fastq.gz

Additional samples can be added to the CSV and nextflow will run as many jobs as needed. See the official docs for more information.
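
For instance, a samplesheet with two samples could look like the following (the patient and sample names and file paths here are hypothetical):

patient,sex,status,sample,lane,fastq_1,fastq_2
patient1,XX,0,sample1,lane1,fastq/patient1_R1.fastq.gz,fastq/patient1_R2.fastq.gz
patient2,XY,0,sample2,lane1,fastq/patient2_R1.fastq.gz,fastq/patient2_R2.fastq.gz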

Running the pipeline from alignment with bwa up to variant calling with GATK is done from the root of the GitHub repo with:

 nextflow run nf-core/sarek \
 --input tests/ada2-e5-e6.csv --outdir test-datalad-full \
 --tools haplotypecaller,snpeff --skip_tools haplotypecaller_filter \
 --dgenomes /Work/data/dgenomes

Several things are happening here so let us examine each argument:

  • nextflow run nf-core/sarek downloads the latest version of sarek and runs it. You can pin the sarek version for better reproducibility (see below).
  • --input defines the samplesheet.
  • --outdir defines the output directory, where the results of each step are stored in a subfolder. If multiple samples are analyzed, each subfolder contains the results for all samples.
  • --tools defines the steps to run and which tools to use. bwa is used by default for alignment. Here, haplotypecaller is set for variant calling. Sarek filters variants with GATK FilterVariantTranches by default, but GATK CNNScoreVariants has not been packaged in nix yet, hence --skip_tools haplotypecaller_filter.
  • --dgenomes must be set to the full path of the root folder of the databases (see above). By default, the command assumes a nextflow.config file in the current directory; this file sets up some tools for nix and is mandatory. See below to run from another location.

Running on a cluster

This command will run the pipeline as a normal process. Nextflow supports many schedulers, and a simple configuration file can be used to select one. Here, a profile for slurm is available in the conf directory. To use it, simply append -c conf/slurm.config to the command line. To customize it, see the section on cluster resources below.
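
For instance, the tutorial command above could be dispatched through slurm with something like the following (same arguments as before, only the configuration file is added; adapt the paths to your setup):

nextflow run nf-core/sarek \
 --input tests/ada2-e5-e6.csv --outdir test-datalad-full \
 --tools haplotypecaller,snpeff --skip_tools haplotypecaller_filter \
 --dgenomes /Work/data/dgenomes \
 -c conf/slurm.config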

Advanced tutorials

Database

Clean-up space

To avoid duplicating large files, remove at least the initial dbSNP and CADD files and the vep cache:

(cd dbsnp && git annex drop GCF_000001405.40.gz*)
(cd cadd && git annex drop whole_genome_SNVs.tsv.gz*)
(cd vep-human && git annex drop homo_sapiens_merged_vep_110_GRCh38.tar.gz)

Download only some databases

Simply get the corresponding folder. From the dgenomes root directory, download only clinvar with:

datalad get clinvar

Update

To keep things up-to-date, synchronize dgenomes with datalad update --how merge. Then update each database by going into the corresponding folder and running the same command.
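
For example, assuming the databases are stored in /Work/data/dgenomes as in the tutorial, updating the master repository and then the clinvar dataset looks like:

cd /Work/data/dgenomes
datalad update --how merge
# repeat inside a dataset to update it as well
cd clinvar
datalad update --how merge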

Dependencies

Avoid recompiling dependencies

By default, nix downloads the source code of a package and builds it from scratch. Nixpkgs has its own binary cache to avoid most of the compilation. However, packages depending on Python 2 are no longer in nixpkgs, due to the lack of support for Python 2. To avoid rebuilding those packages, you can set up a binary cache, with cachix for example. This is what the GitHub CI does to avoid rebuilding everything. Of course, if there has been a modification in the package, the binary cache will be invalidated and it will be built from the source code.
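
As a minimal sketch, assuming you have created a cache called my-cache on cachix.org (the name is a placeholder), enabling it on a new machine could look like:

# install the cachix client and trust the binary cache before building
nix profile install nixpkgs#cachix
cachix use my-cache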

Install all dependencies globally

Instead of having a dedicated shell, you may want to have the dependencies available in the PATH:

nix profile install .#*

Use another version of a tool

Sometimes, you may want to use a newer or older version of a tool. Nix is quite flexible but it does require a bit of configuration. Basically, create packages/MYPACKAGE/default.nix, following the pyflow configuration for example. Then update packages/default.nix to override the package (again, see how we do it for pyflow).
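
As a rough sketch, assuming pyflow's definition lives in packages/pyflow/default.nix and MYPACKAGE is a placeholder name:

# start from the pyflow package definition
mkdir packages/MYPACKAGE
cp packages/pyflow/default.nix packages/MYPACKAGE/default.nix
# edit packages/MYPACKAGE/default.nix (version, source URL, checksum),
# then reference or override the package in packages/default.nix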

Running the pipeline

Run other tools

Sarek allows multiple tools to be used at once. The output of each tool will be in a different subdirectory. A common use case is to compare variant callers:

--tools haplotypecaller,strelka

Run other steps

It's possible to start (or restart) at a specific step with --step and a suitable samplesheet. For example, to directly annotate a VCF:

--step annotate --tools snpeff

And use a samplesheet similar to

patient,sample,vcf
patient1,test_sample,test.vcf.gz

More examples are in the official documentation.

Running from another folder

The above commands use a configuration file, nextflow.config, that is assumed to be in the current directory. It is perfectly fine to call the pipeline from another location, but -c $GITHUB_ROOT/nextflow.config must then be appended to the command line.
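
For instance, from a scratch directory (the path to the repository and the samplesheet name are placeholders):

GITHUB_ROOT=/path/to/this/repository
nextflow run nf-core/sarek \
 --input samplesheet.csv --outdir results \
 --tools haplotypecaller,snpeff --skip_tools haplotypecaller_filter \
 --dgenomes /Work/data/dgenomes \
 -c $GITHUB_ROOT/nextflow.config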

More information about running sarek can be found in the official documentation.

Set sarek version

Add -r <version> to the command line.
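
For example, to pin a given release (check the sarek releases page for actual version numbers; 3.4.0 is only used as an illustration):

nextflow run nf-core/sarek -r 3.4.0 \
 --input tests/ada2-e5-e6.csv --outdir test-datalad-full \
 --tools haplotypecaller,snpeff --skip_tools haplotypecaller_filter \
 --dgenomes /Work/data/dgenomes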

Setting cluster resources

Usually, each cluster has its own configuration file. The simplest way to start is to set resources for processes tagged with the following labels:

  • process_single
  • process_low
  • process_medium
  • process_high

An example is given in conf/slurm.config for slurm. Other executors can of course be used. It's also possible to set resources for specific processes; a rough sketch is given below.
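
As a sketch only (this is not the actual conf/slurm.config; the executor, queue and resource values are placeholders to adapt to your cluster), such a configuration could look like:

// select the scheduler and set resources per process label
process {
    executor = 'slurm'
    queue = 'normal'

    withLabel: 'process_low' {
        cpus = 2
        memory = 8.GB
        time = 4.h
    }
    withLabel: 'process_high' {
        cpus = 12
        memory = 64.GB
        time = 16.h
    }
}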

Troubleshoot

If gzip complains when extracting the FASTA file in genome-human, gzip -f is required to force decompression.

Context

With the development of Nextflow and pipelines built on it such as Sarek, it is now much simpler to have a portable pipeline that runs with multiple schedulers. Nf-core is an initiative based on nextflow to offer "reference" pipelines for a given application. Sarek is one of these pipelines and can analyse germline or somatic data for mutations. Nf-core pipelines package their dependencies according to multiple strategies (Docker, Singularity, Podman, Shifter, Charliecloud or conda as a "last resort"). Database management for Sarek is deferred to a central server managed by Illumina, called iGenomes.

We offer another approach where dependencies and databases are managed outside Sarek, using Nix and Datalad. Here are the main pros:

Reproducibility. For software, containerization is commonly used to offer a portable, closed and reproducible environment. Nix and Guix offer, in our view, a superior approach to package management. For a given input (software version, checksum of the source code), a package will be built in exactly the same way across multiple architectures. This is possible because Nix stores everything in a special folder (the store) and uses a graph-based approach to dependencies. A single change in the chain of dependencies will rebuild everything. On top of that, Nix can produce Docker images, allowing for containerization later on.

Databases are managed with Datalad, which is based on git-annex. This allows large files to be managed by git in a decentralized approach where multiple locations can be kept in sync. Raw files are not stored, only their "address" (URL, local folder...).

Decentralized. All package versions are fixed using a simple configuration file called flake.nix. Each database is versioned in a git repository, which stores only its location on the web. This means it's lightweight, easily hackable and can be used for other pipelines.

Up-to-date. It is easier to keep track of changes in separate repositories. Nixpkgs is a (very) active project and software is often at the latest version. Databases are downloaded directly from their "producer". We maintain the versions so you don't have to.

Portable. Dependencies are simply installed in the PATH and only require nix to be installed. Even on systems without nix, it's possible to create standalone archives for each software where all dependencies are packed in. Databases are simply symlinks to folders managed by git-annex and can be installed anywhere. Finally, this approach is not specific to sarek and can be used for any relevant pipeline.

Tested. With datalad, checksums on data files allow partial downloads to be detected. Continuous integration ensures all packages build. By defining a small but clinically relevant test case, SNV (single nucleotide variation) and structural variant calling are checked on each change.

Reference

Architecture

Packaging with Nix

Nix stores every package in a dedicated folder, /nix/store. Each package (derivation) is defined in a unique way by its source code (checksum) and all of its dependencies. If one of these parameters changes, another version is created. This allows having multiple versions in parallel, and better reproducibility. While we strive to be fully reproducible by fixing all inputs of a package, there may be some cases where reproducibility "to the byte" cannot (yet) be achieved.

A package is defined by a single configuration file, default.nix in our repository. Most of the packages are on nixpkgs, the central repository for all nix packages. We had to create or improve several packages, and contributions have been upstreamed to nixpkgs. To improve reproducibility, another configuration file, flake.nix, defines the nixpkgs version and the list of packages.

In short, a few configuration files uniquely define all the packages. Ideally, all packages should be on nixpkgs, but some of them depend on Python 2, whose support has been removed from nixpkgs.

Conflict between strelka and dragmap

Strelka and dragmap install files with the same names in /lib/python and /libexec when installed globally. This is an issue when we try to make both of them available. As a workaround, strelka's installation directory was modified so that everything except bin is placed in a strelka subfolder.

Database with Datalad

Git-annex allows managing large files with git across multiple locations. Datalad makes this approach more user-friendly. The basic principle in our configuration is that each dataset has its own git repository. A master repository lists all datasets and keeps them in sync. No data is stored in the git repository itself, only its location on the web and its checksum. When cloning the repository, no data is downloaded yet; only symlinks to a local folder are created. Data is downloaded upon user request with a single command.

This enables easier updates and data reuse. As we closely follow upstream sources, recent versions are available. Thanks to git, it's also easy to switch between minor or major versions. Finally, it can be adapted to local databases if needed. A thorough guide is available on the datalad website. Work is underway to contribute our datasets to the Datalad collection.

Tests

Each change to the main branch of the github repository ensures:

  1. All packages build. To decrease compilation times, a cache with cachix is used, but any change to a package or its dependencies will cause a rebuild.
  2. A small but clinically relevant test case is run to ensure 2 mutations (SNVs) and a deletion (CNV) are aligned and called properly.
  3. Most packages include tests, either functional or unit tests. These tests are run by nix when building the package.
  4. Annotation is much harder to test as databases often change, leading to different functional annotations. Also, the vep cache is too big for the GitHub CI. We only test snpeff and ensure no variants are lost after annotation and that an annotation field is present.

CNVKit is not tested in our minimal testing as it is less accurate in detecting CNVs smaller than 1 Mbp. Also, we did not use the sarek testing suite, for several reasons. First, its minimal testing does not include germline data. Second, it compares the checksums of output files, which can change due to different software versions. We chose a more clinically relevant invariant where only variants are checked. If metadata differs due to an upgrade, tests will still pass.

Limits

Currently, there are a few limits to our approach.

  • On GitHub CI, the nextflow package does not run at all. This has not been reproduced on other machines and architectures, and the package has been used in production on our cluster without issue. The GitHub CI uses the upstream nextflow version for the moment; this is the only package not coming from nix in our tests.
  • By default, Sarek filters haplotypecaller variants with a GATK convolutional neural network. This has not yet been packaged in nix, so this step must be skipped in Sarek. The command-line examples in the tutorial reflect that.
  • Databases only follow upstream conventions, so some manual intervention is needed to extract the archives and rename the chromosomes.

For the first two issues, contributions are welcome!

FAQ

Can I use it on my cluster?

You need nix to be set up on your cluster. Otherwise, see below: you can install dependencies through sarek itself and still use datalad and our approach to download only the databases. Internet access over https is mandatory though.

Can I install dependencies without nix?

Absolutely. You can either let sarek install them for you using different profiles: when running the pipeline, simply add -profile docker or -profile singularity, for example; see the nf-core instructions for more information. Or you can install them manually: as long as they are available on the $PATH, it should work. This is not recommended as it is quite painful to do and even more so to keep up to date.
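
For instance, a minimal sketch with containers handled by sarek itself (samplesheet and output names are placeholders; genome and database options then follow standard sarek usage):

nextflow run nf-core/sarek -profile docker \
 --input samplesheet.csv --outdir results --tools haplotypecaller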

Can I use my own databases?

Yes, you simply need to replace the paths in nextflow.config accordingly for the reference genome, the snpeff and vep caches, dbSNP and clinvar.
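
As a sketch, assuming sarek's standard parameter names are used in nextflow.config (the paths below are placeholders):

// point the pipeline to local copies of the databases
params {
    fasta        = '/my/databases/genome-human/genome.fa'
    dbsnp        = '/my/databases/dbsnp/dbsnp.vcf.gz'
    snpeff_cache = '/my/databases/snpeff-human'
    vep_cache    = '/my/databases/vep-human'
}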

What kind of scheduler do you support?

Everything supported by nextflow.

Can I apply your project to my pipeline?

Please do! We aim for modularity so flake.nix can be used in any project. It will make software available in $PATH, either in a shell or globally.

Can you add other software?

We only support dependencies for sarek (germline at the moment).

Do you plan to support the Sarek somatic pipeline?

Not yet but contributions are welcome.

Could you add support for other nf-core pipelines?

This is out of scope for this repository, but contributions to nixpkgs are welcome. Don't hesitate to fork this repository to apply this approach to other pipelines. If this gains enough traction, we could create an organisation on GitHub to have all these projects under the same umbrella.

Is your project compatible with guix?

Databases are managed by datalad, which is not yet in guix. Other packages are partially in guix. Don't hesitate to port the nix derivations to Guix!

Contributing

For software, this project is really a proxy for nixpkgs, so any update should happen there! This could be newer software versions, or new software for Sarek. Once it has been merged, please open a pull request with the changes.

Databases are managed by datalad, meaning each dataset has its own git repository. The central hub for those is dgenomes. If updates are too slow for you, or if databases are missing, please open a pull request there.

This project is only for germline analysis. We are open to improve it to add somatic analysis too.