Centre for Population Genomics

Software Tools

The CPG production infrastructure stack

  • We regularly run analyses that involve tens of thousands of genomes. Because such workloads are bursty, we use on-demand cloud resources, usually on Google Cloud or Microsoft Azure.
  • Our main workflow scheduling system is Hail Batch, for which we have set up a local deployment. It integrates directly with Hail Query, a set of scalable APIs designed specifically for genomics. For workflows like GATK-SV, we rely on Cromwell / Terra to run WDL.
  • About a dozen collaborating groups in Australia use our local deployment of seqr for rare disease analysis. Internally, we continue the development of Broad’s loss-of-function curation portal.
  • Our public data browsers typically pair a React frontend with a Django web application, backed by Elasticsearch or Hail.
  • All our sample metadata is managed centrally with an extensive set of APIs, which allows us to automate our workflows and ingest new data regularly without incurring toil.
  • We like to set up our infrastructure as code through either Terraform or Pulumi, which helps us bring up consistent dev / prod namespaces across multiple clouds.
  • All our code is available on GitHub. We control production data access on a dataset level and enforce code reviews through an analysis runner wrapper, while allowing quick prototyping and exploration on subsets for testing.
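
As a rough illustration of the access model in the last point above, here is a minimal Python sketch. It is not the analysis runner's actual code: the access-level names, the `RunRequest` type, the `DATASET_MEMBERS` table and the `authorize` function are all hypothetical and exist only to show the idea that access is checked per dataset, that production runs require reviewed code, and that test subsets stay open for quick prototyping.

```python
from dataclasses import dataclass

# Hypothetical access levels, for illustration only.
TEST = "test"              # a subset of a dataset, for prototyping
PRODUCTION = "production"  # the full dataset


@dataclass(frozen=True)
class RunRequest:
    user: str
    dataset: str
    access_level: str
    commit_reviewed: bool  # True if the commit went through code review


# Hypothetical per-dataset membership table.
DATASET_MEMBERS = {
    "example-dataset": {"alice", "bob"},
}


def authorize(request: RunRequest) -> bool:
    """Decide whether a run may proceed under the assumed rules."""
    members = DATASET_MEMBERS.get(request.dataset, set())
    # Access is controlled per dataset: non-members are always rejected.
    if request.user not in members:
        return False
    # Test subsets are open to members, reviewed or not.
    if request.access_level == TEST:
        return True
    # Production data requires code that has passed review.
    if request.access_level == PRODUCTION:
        return request.commit_reviewed
    # Unknown access levels are rejected.
    return False
```

Under these assumptions, a dataset member can run unreviewed code against a test subset, but the same request at the production level is refused until the commit has been reviewed.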

Repositories

  • AIP: automated rare disease variant prioritization
  • cpg-utils: a set of helpers to build reusable pipelines
  • production-pipelines: our large cohort processing + QC pipeline
  • tob-wgs: analyses related to the Tasmanian Ophthalmic Biobank Whole Genome Sequencing project
  • structural-constraint: calculating missense constraint within protein tertiary structures
  • tob-wgs-browser: the public data browser for the Tasmanian Ophthalmic Biobank Whole Genome Sequencing project