Running a Whole Genome Pedigree Dataset (Simplified workflow)
Change into a directory with at least 2 TB of allocated disk space:
cd /data/$USER
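Before going further, it can help to confirm that the directory really has that much room. The following is a minimal sketch, assuming df reports the relevant limit (on quota-managed storage it may not, so treat the result as a rough check). The has_free_space helper is hypothetical and not part of vg_wdl.

```shell
# Hypothetical helper (not part of vg_wdl): succeeds if the filesystem holding
# DIR has at least MIN_KB KiB available according to df.
has_free_space() (
    dir=$1
    min_kb=$2
    # -P forces one output line per filesystem; column 4 is the available KiB.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "${avail_kb:-0}" -ge "$min_kb" ]
)

# 2 TB expressed in KiB (binary units).
two_tb_kb=$((2 * 1024 * 1024 * 1024))
if has_free_space "$PWD" "$two_tb_kb"; then
    echo "enough space in $PWD"
else
    echo "less than 2 TB available in $PWD"
fi
```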
Launch an interactive session on Biowulf and load requisite Biowulf modules:
sinteractive
module load cromwell/40 git python/3.6
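After loading the modules, a quick check that the expected commands are on PATH can catch a failed module load early. This is only a sketch: missing_cmds is a hypothetical helper, and the command names git and python3 are assumptions about what the loaded modules expose.

```shell
# Hypothetical helper (not part of vg_wdl): print each named command that is
# not found on PATH, one per line.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Assumption: the git and python modules provide `git` and `python3`.
missing=$(missing_cmds git python3)
if [ -n "$missing" ]; then
    echo "not on PATH: $missing"
fi
```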
Clone the GitHub repo and create a work directory for running the WDL workflow:
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
mkdir -p ${VG_WDL_DIR} && cd ${VG_WDL_DIR}
git clone https://github.com/vgteam/vg_wdl.git
Download the workflow inputs and set up the miniwdl virtual environment to run vg_wdl workflows:
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
${VG_WDL_DIR}/vg_wdl/scripts/setup_vg_wdl.sh -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
exit
Set up the cohort working directory and collect the input reads for the cohort (this should take a few minutes). In this template, set COHORT_NAME, COHORT_NAMES_LIST and COHORT_INPUT_DATA for your cohort; the other lines can stay as they are.
COHORT_NAME should be the sample name of the proband in a UDP cohort. The COHORT_NAMES_LIST bash array must list the proband, sibling and parental IDs, separated by spaces. COHORT_INPUT_DATA should contain the full path to the directory that holds all raw read data of the cohort. For example, if the raw reads for PROBAND and SIBLING_1 are located in /data/Udpdata/Individuals/PROBAND/R1_fastq.gz and /data/Udpdata/Individuals/SIBLING_1/R1_fastq.gz respectively, then COHORT_INPUT_DATA should be /data/Udpdata/Individuals/.
COHORT_NAME="UDP****"
COHORT_NAMES_LIST=("UDP_MATERNAL" "UDP_PATERNAL" "UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${COHORT_NAME}_cohort_workdir"
COHORT_INPUT_DATA="/PATH/TO/DIRECTORY/CONTAINING/INPUT/READS/"
${VG_WDL_DIR}/vg_wdl/scripts/setup_input_reads.sh -l "${COHORT_NAMES_LIST[*]}" -w ${COHORT_WORKFLOW_DIR} -c ${COHORT_INPUT_DATA}
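Because the reads are collected from COHORT_INPUT_DATA, a pre-check that every sample has its own subdirectory there can catch a wrong path before the script runs. This sketch assumes the one-directory-per-sample layout from the example above. The missing_sample_dirs helper is hypothetical and not part of vg_wdl.

```shell
# Hypothetical helper (not part of vg_wdl): print each sample name that has no
# subdirectory under INPUT_DIR, one per line.
missing_sample_dirs() (
    input_dir=$1
    shift
    for sample in "$@"; do
        # ${input_dir%/} drops a trailing slash so the path has a single separator.
        [ -d "${input_dir%/}/$sample" ] || echo "$sample"
    done
)

# Example call with the template variables from this section (bash):
# missing_sample_dirs "${COHORT_INPUT_DATA}" "${COHORT_NAMES_LIST[@]}"
```

An empty result means every listed sample has a directory. Any name printed points to either a typo in COHORT_NAMES_LIST or a wrong COHORT_INPUT_DATA path.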
Change into the cohort work directory and set up the input variables.
The SIBLING_ID_LIST bash array must list the proband and sibling IDs, separated by spaces, with the proband first. For example, if the pedigree has one proband UDP_PROBAND and two additional siblings UDP_SIB_1 and UDP_SIB_2:
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
The PED_FILE input variable must point to a valid .ped file that is named after either the cohort or the proband (COHORT_ID.ped or PROBAND_SAMPLE_ID.ped) and follows the tab-delimited PED file format. The .ped file should contain only the mother-father-proband trio. For example, the HG002 trio .ped file looks like the following, where the proband is HG002, the father is HG003 and the mother is HG004:
#Family	ID	Father	Mother	Sex[1=M]	Affected[2=A]
HG002	HG002	HG003	HG004	1	2
HG002	HG003	0	0	1	1
HG002	HG004	0	0	2	1
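Since the file must be tab-delimited, writing it with printf avoids accidentally saving spaces from an editor. The sketch below mirrors the HG002 example: it assumes the proband ID doubles as the family ID, the proband is affected, and both parents are unaffected. Both write_trio_ped and check_ped are hypothetical helpers, not part of vg_wdl.

```shell
# Hypothetical helper (not part of vg_wdl): write a tab-delimited trio PED file.
# Arguments: proband ID, father ID, mother ID, proband sex (1=male, 2=female),
# output path. Assumes proband affected (2) and parents unaffected (1).
write_trio_ped() (
    proband=$1 father=$2 mother=$3 sex=$4 out=$5
    {
        printf '%s\t%s\t%s\t%s\t%s\t2\n' "$proband" "$proband" "$father" "$mother" "$sex"
        printf '%s\t%s\t0\t0\t1\t1\n' "$proband" "$father"
        printf '%s\t%s\t0\t0\t2\t1\n' "$proband" "$mother"
    } > "$out"
)

# Succeeds only if every non-comment line has exactly six tab-separated fields.
check_ped() {
    awk -F'\t' '!/^#/ && NF != 6 { bad = 1 } END { exit bad + 0 }' "$1"
}

# Example (writes HG002.ped in the current directory):
# write_trio_ped HG002 HG003 HG004 1 HG002.ped && check_ped HG002.ped
```

check_ped only verifies the column count. It does not confirm that the sample IDs match the reads or that the parents appear as rows.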
Set up the input variables:
MATERNAL_SAMPLE_NAME="UDP_MOM"
PATERNAL_SAMPLE_NAME="UDP_DAD"
SIBLING_ID_LIST=("UDP_PROBAND" "UDP_SIB_1" "UDP_SIB_2")
VG_WDL_DIR="/data/$USER/test_vg_wdl_run/wdl_tools"
WORKFLOW_INPUT_DIR="/data/$USER/test_vg_wdl_run/workflow_inputs"
COHORT_WORKFLOW_DIR="/data/$USER/test_vg_wdl_run/${SIBLING_ID_LIST[0]}_cohort_workdir"
PED_FILE="${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}.ped"
Set up the workflow bash script:
${VG_WDL_DIR}/vg_wdl/scripts/setup_pedigree_script.sh -s "${SIBLING_ID_LIST[*]}" -m ${MATERNAL_SAMPLE_NAME} -f ${PATERNAL_SAMPLE_NAME} -c ${PED_FILE} -w ${COHORT_WORKFLOW_DIR} -g ${WORKFLOW_INPUT_DIR} -v ${VG_WDL_DIR}
Run the cohort mapping and variant calling workflow:
cd ${COHORT_WORKFLOW_DIR}
sbatch --cpus-per-task=6 --mem=20g --gres=lscratch:200 --time=240:00:00 ${SIBLING_ID_LIST[0]}_pedigree_workflow.sh
The final output files can be found in the following directory:
${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_map_call.final_outputs/output_links
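To see where the final files actually live, the entries in that directory can be resolved to their targets. This is a sketch under the assumption, suggested by the directory name, that output_links contains symbolic links. The list_output_links helper is hypothetical and not part of vg_wdl.

```shell
# Hypothetical helper (not part of vg_wdl): print "name -> resolved target"
# for each entry in DIR.
# Assumption: the directory holds symbolic links to the real output files.
list_output_links() (
    dir=$1
    for link in "$dir"/*; do
        [ -e "$link" ] || continue   # skip broken links and an unmatched glob
        printf '%s -> %s\n' "$(basename "$link")" "$(readlink -f "$link")"
    done
)

# Example call with the variables from this section (bash):
# list_output_links "${COHORT_WORKFLOW_DIR}/${SIBLING_ID_LIST[0]}_cohort_map_call.final_outputs/output_links"
```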