Software Tools
Our infrastructure
Given the vast scale of our data and computing needs, we have optimised our workflow and analysis systems to operate exclusively in the cloud.
We currently use Google Cloud as our primary cloud computing platform, with growing capabilities on Azure. We use Pulumi to automate permission management and to define our cloud deployments as infrastructure as code.
Technology stack & process
We believe in creating free, open-source code with permissive licences: whenever possible, we make our code freely available for reuse through our GitHub organisation under an MIT licence.
Our process for deploying software is unusually robust for an academic research team: all code is reviewed and must pass a rigorous CI pipeline with testing, linting, and type checks to ensure quality. We welcome feedback and pull requests on our code!
Workflow challenges & solutions
Genomic data processing relies on multiple containerised tools, each with different compute requirements, which must be carefully orchestrated into workflows.
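As a rough illustration of what orchestration means here, the sketch below models three containerised steps, each with its own resource request, and derives an execution order that respects their dependencies. The job names, image names, and resource figures are hypothetical, and the code uses only the Python standard library. It is not the Hail Batch or Cromwell API.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Job:
    """One containerised step with its own resource request (values are illustrative)."""

    name: str
    image: str  # hypothetical container image name
    cpu: int
    memory_gib: int
    depends_on: tuple[str, ...] = ()


def execution_order(jobs: list[Job]) -> list[str]:
    """Return job names in an order that respects every declared dependency."""
    graph = {job.name: set(job.depends_on) for job in jobs}
    return list(TopologicalSorter(graph).static_order())


# A hypothetical three-step workflow with very different compute needs per step.
JOBS = [
    Job("annotate", image="example/annotator:1.0", cpu=2, memory_gib=8,
        depends_on=("call_variants",)),
    Job("align", image="example/aligner:1.0", cpu=16, memory_gib=64),
    Job("call_variants", image="example/caller:1.0", cpu=4, memory_gib=16,
        depends_on=("align",)),
]

ORDER = execution_order(JOBS)
```

A real workflow system does much more than this (scheduling, retries, provisioning machines to match each resource request), but the dependency graph is the core idea.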
Production pipelines are how we orchestrate genomics analyses across a combination of workflow systems such as Hail Batch and Cromwell. The pipeline service is integrated with Metamist, which tells it what needs processing and records the status of each analysis.
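The interplay between a pipeline and a metadata store can be sketched as follows: ask the store which samples still lack a completed analysis, process those, and write the outcome back. Everything in this sketch is an assumption made for illustration, including the record shape, the status values, and the function names. It uses an in-memory list in place of a metadata service and does not reflect the actual Metamist API.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Hypothetical analysis states; the real set of states may differ."""

    QUEUED = "queued"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class AnalysisRecord:
    sample_id: str
    analysis_type: str
    status: Status


def samples_to_process(sample_ids, records, analysis_type):
    """Return the samples with no completed analysis of the given type."""
    done = {
        r.sample_id
        for r in records
        if r.analysis_type == analysis_type and r.status is Status.COMPLETED
    }
    return [s for s in sample_ids if s not in done]


def record_status(records, sample_id, analysis_type, status):
    """Append a status record; a real system would call the metadata service here."""
    records.append(AnalysisRecord(sample_id, analysis_type, status))


# Illustrative data: S1 is finished, S2 failed previously, S3 has never been run.
SAMPLES = ["S1", "S2", "S3"]
RECORDS = [
    AnalysisRecord("S1", "alignment", Status.COMPLETED),
    AnalysisRecord("S2", "alignment", Status.FAILED),
]

PENDING = samples_to_process(SAMPLES, RECORDS, "alignment")
```

Because the decision about what to run is derived from recorded status rather than from the pipeline's own memory, a rerun after a failure picks up only the outstanding samples.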
Data challenges & solutions
We currently manage over a petabyte of genomics data, across 20,000 GCP resources and 100 distinct projects.
Metamist is our solution for storing and coordinating access to metadata. It scales to thousands of samples and integrates with analytical tools and with open-source platforms for rare disease genomics such as seqr, offering efficient and secure genomic metadata management.
Security
Security and data privacy are critical to our work at CPG.
We handle community data with care, supported by strong access policies that isolate data between projects and enforce a physical separation of genomic data from personal identifiers.
Our research team can only access and process genomic data through code-reviewed software, and access permissions are tightly controlled. These practices, along with the separation of access permissions for genomic data and personal identifiers, dramatically reduce the risk of unauthorised access even in the event of an account being compromised.
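The effect of separating permissions can be shown with a small model. In the sketch below, every grant is scoped to one principal, one project, and one class of data, so a compromised account exposes only what that account was granted. The principal names, project names, and data classes are hypothetical, and this is a toy model of the idea rather than a description of the actual access-control implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """A single permission: one principal, one project, one class of data."""

    principal: str
    project: str
    data_class: str  # "genomic" or "identifiers" in this toy model


def can_access(grants, principal, project, data_class):
    """Access is allowed only when an exactly matching grant exists."""
    return Grant(principal, project, data_class) in grants


def exposed_if_compromised(grants, principal):
    """Everything an attacker could reach by taking over one principal."""
    return {(g.project, g.data_class) for g in grants if g.principal == principal}


# Hypothetical grants: no single principal holds both genomic data and identifiers.
GRANTS = frozenset({
    Grant("pipeline-a", "project-a", "genomic"),
    Grant("data-manager-a", "project-a", "identifiers"),
    Grant("pipeline-b", "project-b", "genomic"),
})

EXPOSED = exposed_if_compromised(GRANTS, "pipeline-a")
```

In this model, taking over `pipeline-a` yields genomic data for one project only: it reaches neither the identifiers for that project nor any data in `project-b`.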