lucafumagalli/spliceai-thesis

SpliceAI

Download the hg19.fa file (hg19/GRCh37 genome assembly) from GRCh37/hg19 and save it in the canonical folder.

Dataset creation

To create the datasets, use a Python 2 environment and install the requirements:

pip install -r requirements/python2.txt

From the canonical folder, run the following command to create the sequence file:

./grabsequence.sh

Then, to create the datasets, run:

./create_datafile.sh  
./create_dataset.sh

Train and test models

Use a Python 3 environment and install the requirements:

pip install -r requirements/python3.txt

Train

In the canonical folder, create an Outputs folder for the outputs of the training runs and a Models folder for the trained models.

To train the 5 models with 10,000 nucleotides, run the following commands from the canonical folder:

./script_train.sh 10000 1
./script_train.sh 10000 2
./script_train.sh 10000 3
./script_train.sh 10000 4
./script_train.sh 10000 5
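
The folder setup and the five training runs above can also be driven from a small Python script. This is a minimal sketch that assumes only what is stated above: script_train.sh takes the number of nucleotides and a model index as its two arguments and is run from the canonical folder. The helper names (`training_commands`, `prepare_folders`, `train_all`) are hypothetical and not part of the repository.

```python
import subprocess
from pathlib import Path


def training_commands(nucleotides=10000, n_models=5):
    """Build one script_train.sh command per model index (1..n_models)."""
    return [
        ["./script_train.sh", str(nucleotides), str(index)]
        for index in range(1, n_models + 1)
    ]


def prepare_folders(base="."):
    """Create the Outputs and Models folders if they do not exist yet."""
    for name in ("Outputs", "Models"):
        Path(base, name).mkdir(parents=True, exist_ok=True)


def train_all(base=".", nucleotides=10000, n_models=5):
    """Run the training commands one after another from the given folder."""
    prepare_folders(base)
    for command in training_commands(nucleotides, n_models):
        # check=True stops at the first failing run instead of continuing.
        subprocess.run(command, cwd=base, check=True)
```

Calling `train_all()` from the canonical folder would run the same five commands listed above, in order.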

Test

To test the model with 10,000 nucleotides, run the following command from the canonical folder:

./script_test.sh 10000

Files description

constants.py

This file sets the maximum nucleotide context length and the sequence length of the SpliceAI models.
It also sets the path of the canonical dataset and the path of the genome FASTA file.
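
To illustrate how the two constants interact, the sketch below splits a sequence into model inputs. It assumes the convention of the original SpliceAI code, where each input covers a block of `SL` positions to be predicted plus `CL_MAX / 2` nucleotides of context on each side. The names `CL_MAX` and `SL`, the values 10000 and 5000, and the helper `split_into_windows` are assumptions and may differ from what constants.py and utils.py define in this repository.

```python
CL_MAX = 10000  # assumed maximum nucleotide context length
SL = 5000       # assumed number of positions predicted per input


def split_into_windows(sequence, sl=SL, cl_max=CL_MAX):
    """Split a sequence into overlapping windows of length sl + cl_max.

    The sequence is padded with 'N' so that every block of sl positions
    has cl_max // 2 nucleotides of context on each side.
    """
    flank = cl_max // 2
    n_blocks = -(-len(sequence) // sl)  # ceiling division
    # Pad on the right up to a whole number of blocks, then add both flanks.
    padded = "N" * flank + sequence.ljust(n_blocks * sl, "N") + "N" * flank
    return [padded[i * sl : i * sl + sl + cl_max] for i in range(n_blocks)]
```

With the assumed values, each window is 15,000 nucleotides long and consecutive windows overlap by 10,000 nucleotides.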

utils.py

This file contains the functions for creating the datasets: creating datapoints, reformatting data, and one-hot encoding sequences.
It also contains a function that prints the statistics after model testing.
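
As an illustration of the one-hot encoding step, the sketch below maps each nucleotide to a four-element vector. It is not the code in utils.py: the function name `one_hot_encode`, the A, C, G, T channel order, and the all-zero vector for unknown bases such as N are assumptions.

```python
# Assumed channel order: A, C, G, T. Any other symbol (e.g. N) maps to zeros.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence):
    """Return a list with one 4-element vector per nucleotide."""
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in sequence.upper()]


print(one_hot_encode("ACNT"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
```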

create_datafile.py

This parser takes the text files canonical_dataset.txt and canonical_sequence.txt as input and produces an .h5 data file.

create_dataset.py

This parser takes the .h5 file produced by create_datafile.py as input and outputs an .h5 file with datapoints of the form (X, Y), which Keras models can consume.

spliceai.py

This file contains the functions that create the SpliceAI model, which is shown in the figure below. A different model is created depending on the number of nucleotides used.

(Figure: SpliceAI model architecture.)
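
As a sketch of how the architecture can depend on the number of nucleotides, the mapping below lists the convolution window sizes and dilation rates that the original SpliceAI code uses for its four context lengths. Whether spliceai.py in this repository uses exactly these values is an assumption. The helper `receptive_context` is hypothetical; it checks that each configuration covers the stated context, given that each residual unit has two convolutions and each convolution adds `dilation * (window - 1)` positions of context.

```python
# Window sizes and dilation rates per residual unit, keyed by context length.
# Values follow the original SpliceAI code; assumed, not read from this repo.
ARCHITECTURES = {
    80: ([11] * 4, [1] * 4),
    400: ([11] * 8, [1] * 4 + [4] * 4),
    2000: ([11] * 8 + [21] * 4, [1] * 4 + [4] * 4 + [10] * 4),
    10000: (
        [11] * 8 + [21] * 4 + [41] * 4,
        [1] * 4 + [4] * 4 + [10] * 4 + [25] * 4,
    ),
}


def receptive_context(windows, dilations):
    """Total context covered by a stack of residual units (two convs each)."""
    return 2 * sum(d * (w - 1) for w, d in zip(windows, dilations))


for context, (windows, dilations) in ARCHITECTURES.items():
    # Each configuration should cover exactly the context length it is keyed by.
    assert receptive_context(windows, dilations) == context
```

Under this assumption, a larger number of nucleotides leads to a deeper network with wider and more dilated convolutions.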

train_model.py

This file contains the code to train the SpliceAI model.

test_model.py

This file contains the code to test the SpliceAI model.

multi_gpu.py
