
Running a Whole Genome Pedigree Dataset (Simplified workflow)

Charles Markello edited this page Jan 7, 2020 · 6 revisions

Setup Instructions

Set up the main working directory

Change into a directory with at least 2 TB of allocated disk space:

cd /data/$USER
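
Before going further, it can help to confirm that the target filesystem actually has the space. The following is a minimal sketch and not part of vg_wdl: has_free_space is a hypothetical helper that compares the available space reported by df against a threshold in terabytes. Note that df reports filesystem free space, which may differ from a per-user quota on a shared cluster.

```shell
# Hypothetical helper (not part of vg_wdl): succeed if the filesystem
# holding DIR reports at least MIN_TB terabytes (2^40 bytes) available.
has_free_space() {
    local dir="$1" min_tb="$2" avail_kb
    # -P forces one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( min_tb * 1024 * 1024 * 1024 )) ]
}

# Example: warn if the current directory has less than 2 TB available.
has_free_space . 2 || echo "Warning: less than 2 TB available here" >&2
```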

Launch an interactive session on Biowulf and load requisite Biowulf modules:

sinteractive
module load cromwell/40 git python/3.6

Create a work directory for running the WDL workflow and clone the GitHub repository into it:

VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
mkdir -p ${VG_WDL_DIR} && cd ${VG_WDL_DIR}
git clone https://github.com/vgteam/vg_wdl.git

Download the workflow inputs and set up the miniwdl virtual environment used to run the vg_wdl workflows, then exit the interactive session:

WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
${VG_WDL_DIR}/vg_wdl/scripts/setup_vg_wdl.sh -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
exit

Input Read Setup Instructions

Set up the cohort working directory and collect the input reads for the cohort (this should take a few minutes). In this template, set COHORT_NAME to the sample name of the proband in the UDP cohort. The COHORT_NAMES_LIST bash array must list the proband, sibling, and parental IDs, separated by spaces.

COHORT_INPUT_DATA should contain the full path to the directory containing all raw read data of the cohort. For example, if the raw reads for PROBAND and SIBLING_1 are located in /data/Udpdata/Individuals/PROBAND/R1_fastq.gz and /data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz respectively, then the path for COHORT_INPUT_DATA should be /data/Udpdata/Individuals/.
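
The layout described above (one subdirectory per sample under COHORT_INPUT_DATA) can be checked before collecting reads. This is a minimal sketch under that assumption; missing_sample_dirs is a hypothetical helper and not part of vg_wdl. It prints every sample ID that has no matching subdirectory.

```shell
# Hypothetical helper (not part of vg_wdl): print each sample ID that has
# no subdirectory under INPUT_DIR. Prints nothing when all are present.
missing_sample_dirs() {
    local input_dir="$1" sample
    shift
    for sample in "$@"; do
        # ${input_dir%/} strips one trailing slash so both path forms work.
        [ -d "${input_dir%/}/${sample}" ] || echo "${sample}"
    done
}

# Example (run once the variables in this section are set):
#   missing_sample_dirs "${COHORT_INPUT_DATA}" "${COHORT_NAMES_LIST[@]}"
```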

COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS/"
${VG_WDL_DIR}/vg_wdl/scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
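
The -l option above receives "${COHORT_NAMES_LIST[*]}", which bash expands to a single space-delimited string rather than one argument per sample. The sketch below illustrates the difference; count_args is a hypothetical helper used only for this demonstration, and the default IFS (which starts with a space) is assumed.

```shell
# Requires bash (arrays are not available in plain POSIX sh).
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")

# Hypothetical helper: print how many arguments it was called with.
count_args() { echo "$#"; }

count_args "${COHORT_NAMES_LIST[*]}"   # prints 1: all IDs joined into one string
count_args "${COHORT_NAMES_LIST[@]}"   # prints 5: one argument per ID

# The joined form is what the setup script sees as the value of -l.
joined="${COHORT_NAMES_LIST[*]}"
echo "$joined"
```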

Running the Workflow

Change into the cohort work directory and set up the input variables. The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband listed first. For example, if the pedigree has one proband UDP_PROBAND and two additional siblings UDP_SIB_1 and UDP_SIB_2: SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2").

The PED_FILE input variable must point to a valid .ped file that is named COHORT_ID.ped or PROBAND_SAMPLE_ID.ped and follows the tab-delimited PED file format. The .ped file must contain only the mother-father-proband trio. For example, the HG002 trio .ped file, where the proband is HG002, the father is HG003, and the mother is HG004, looks like this:

#Family  ID     Father  Mother  Sex[1=M]  Affected[2=A]
HG002    HG002  HG003   HG004   1         2
HG002    HG003  0       0       1         1
HG002    HG004  0       0       2         1
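
Because the fields must be separated by real tab characters, a file pasted from a web page can silently end up space-delimited. The following is a minimal sketch and not part of vg_wdl: check_trio_ped is a hypothetical helper that checks for six tab-delimited fields per record and exactly three records, and the example writes the HG002 trio shown above with printf so the tabs are explicit.

```shell
# Hypothetical helper (not part of vg_wdl): check that a PED file has six
# tab-delimited fields per record and exactly three records (the trio).
check_trio_ped() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }   # skip comment and empty lines
        { n++ }
        NF != 6 {
            printf "line %d: expected 6 tab-delimited fields, found %d\n", NR, NF
            bad = 1
        }
        END {
            if (n != 3) { printf "expected 3 records, found %d\n", n; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}

# Write the HG002 trio with real tab characters, then check it.
# printf reuses the format for each group of six arguments (one line each).
ped_example="$(mktemp)"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    HG002 HG002 HG003 HG004 1 2 \
    HG002 HG003 0     0     1 1 \
    HG002 HG004 0     0     2 1 > "$ped_example"
check_trio_ped "$ped_example" && echo "PED file looks well-formed"
```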

Set up the input variables

MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}.ped"
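
Both derived paths above use ${SIBLING_ID_LIST[0]}. Bash arrays are zero-indexed, so index 0 selects the first entry, which is why the proband must be listed first. As a minimal sketch (preflight_check is a hypothetical helper, not part of vg_wdl), the two paths can be checked before the workflow script is generated:

```shell
# Hypothetical helper (not part of vg_wdl): verify that the cohort working
# directory and the PED file exist; returns non-zero if either is missing.
preflight_check() {
    local workdir="$1" ped="$2" status=0
    [ -d "$workdir" ] || { echo "Missing cohort directory: $workdir" >&2; status=1; }
    [ -f "$ped" ]     || { echo "Missing PED file: $ped" >&2; status=1; }
    return "$status"
}

# Example:
#   preflight_check "${COHORT_WORKFLOW_DIR}" "${PED_FILE}"
```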

Set up the workflow bash script

${VG_WDL_DIR}/vg_wdl/scripts/setup_pedigree_script.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}

Run the cohort mapping and variant calling workflow

cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh
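
Since the job may run for days, it can be convenient to capture the job ID at submission time. The sketch below is not part of vg_wdl: submit_pedigree_workflow is a hypothetical wrapper that assumes the standard Slurm sbatch --parsable option, which prints the job ID (optionally followed by ";" and a cluster name) instead of the usual message.

```shell
# Hypothetical wrapper (not part of vg_wdl): submit the workflow script with
# the resource requests used above and print only the job ID.
submit_pedigree_workflow() {
    local script="$1" job_id
    job_id=$(sbatch --parsable --cpus-per-task=6 --mem=20g \
        --gres=lscratch:200 --time=240:00:00 "$script") || return 1
    # --parsable prints "jobid" or "jobid;cluster"; keep the part before ';'.
    echo "${job_id%%;*}"
}

# Example:
#   job_id=$(submit_pedigree_workflow "${SIBLING_ID_LIST[0]}_pedigree_workflow.sh")
#   squeue -j "$job_id"    # show the job's current state in the queue
```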

The final output files can be found in the following directory:

${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_map_call.final_outputs/output_links