Download the hg19.fa file (GRCh37/hg19 reference genome) and save it in the canonical folder.
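As a minimal sketch, the download and decompression can be scripted as below; the UCSC mirror URL is an assumption, and any GRCh37/hg19 FASTA source works:

```python
import gzip
import shutil
import urllib.request

# Sketch only: the UCSC URL is an assumption; any GRCh37/hg19 FASTA source works.
URL = "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz"
urllib.request.urlretrieve(URL, "hg19.fa.gz")

# Decompress into the canonical folder (assumed to exist already).
with gzip.open("hg19.fa.gz", "rb") as src, open("canonical/hg19.fa", "wb") as dst:
    shutil.copyfileobj(src, dst)
```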
To create the datasets, use a Python 2 environment and install the requirements:
pip install -r requirements/python2.txt
From the canonical folder, run the following command to create the sequence file:
./grabsequence.sh
Then, to create the datasets, run:
./create_datafile.sh
./create_dataset.sh
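To sanity-check the output, the contents of the resulting files can be listed with h5py; this is a sketch, and datafile.h5 is a placeholder for whichever .h5 file the scripts produced:

```python
import h5py

# "datafile.h5" is a placeholder name; point this at the produced .h5 file.
with h5py.File("datafile.h5", "r") as f:
    def show(name, obj):
        # Print every dataset's name, shape and dtype.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```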
For the training and testing steps, use a Python 3 environment and install the requirements:
pip install -r requirements/python3.txt
In the canonical folder, create an Outputs folder for saving the outputs of the different training runs, and a Models folder for saving the trained models.
To train the 5 models with a 10,000-nucleotide context, run the following commands from the canonical folder:
./script_train.sh 10000 1
./script_train.sh 10000 2
./script_train.sh 10000 3
./script_train.sh 10000 4
./script_train.sh 10000 5
To test the model trained with a 10,000-nucleotide context, run from the canonical folder:
./script_test.sh 10000
This file sets the maximum nucleotide context length and the sequence length of the SpliceAI models. It also sets the path of the canonical dataset and the path of the genome FASTA file.
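A hypothetical sketch of such a constants module is shown below; the names CL_max, SL, data_dir and ref_genome are assumptions, not the repository's confirmed identifiers:

```python
# Hypothetical constants module; all names and values are assumptions.
CL_max = 10000                       # maximum nucleotide context length
SL = 5000                            # sequence length of the model targets
data_dir = "./canonical/"            # path of the canonical dataset
ref_genome = "./canonical/hg19.fa"   # path of the genome FASTA file
```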
This file contains the functions for creating the datasets: creating datapoints, reformatting the data, and one-hot encoding the sequences. It also contains a function for printing statistics after model testing.
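For illustration, a minimal one-hot encoder for DNA could look like the sketch below; the base ordering (A, C, G, T) is an assumption, so check the actual function for the exact convention:

```python
import numpy as np

# Base ordering (A, C, G, T) is an assumption about this repository's convention.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    """Encode a nucleotide string as a (len(seq), 4) array.
    Unknown bases such as 'N' stay as all-zero rows."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            out[i, j] = 1.0
    return out

print(one_hot_encode("ACGTN"))
```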
This parser takes as input the text files canonical_dataset.txt and canonical_sequence.txt and produces an .h5 datafile.
This parser takes as input the .h5 file produced by create_datafile.py and outputs an .h5 file with datapoints of the form (X, Y), which can be consumed directly by Keras models.
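As a sketch of how such datapoints could be read back for Keras, assuming (hypothetically) that the chunks are stored under keys like X0 and Y0:

```python
import h5py
import numpy as np

# "dataset_train_all.h5", "X0" and "Y0" are assumed names, not confirmed ones.
with h5py.File("dataset_train_all.h5", "r") as f:
    X = np.asarray(f["X0"])  # one-hot encoded input sequences
    Y = np.asarray(f["Y0"])  # per-position splice-site labels

# A compiled Keras model could then consume the pair directly:
# model.fit(X, Y, batch_size=32)
```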
This file contains the functions to create the SpliceAI model shown in the figure below; a different architecture is created depending on the number of nucleotides used.
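The model is built from dilated residual convolution blocks; a sketch of one such block follows, with illustrative (assumed) filter counts, kernel sizes, and dilation rates rather than the exact values used for each context length:

```python
from tensorflow.keras import layers

def residual_block(x, filters=32, kernel_size=11, dilation=1):
    """One SpliceAI-style residual block: (BatchNorm -> ReLU -> dilated Conv1D)
    twice, plus a skip connection. Assumes x already has `filters` channels."""
    y = layers.BatchNormalization()(x)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, kernel_size, dilation_rate=dilation, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv1D(filters, kernel_size, dilation_rate=dilation, padding="same")(y)
    return layers.add([x, y])
```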
This file contains the code to train the SpliceAI model.
This file contains the code to test the SpliceAI model.
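One statistic commonly reported for splice-site prediction is top-k accuracy; whether this repository prints exactly this metric is an assumption, but a sketch looks like:

```python
import numpy as np

def topk_accuracy(y_true, y_score):
    """Fraction of true sites recovered among the k highest scores,
    where k is the number of true sites (top-k accuracy)."""
    k = int(y_true.sum())
    if k == 0:
        return float("nan")
    top = np.argsort(y_score)[-k:]  # indices of the k highest scores
    return float(y_true[top].mean())
```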