DLEB

Documentation

1.	Model & layer

1.1.	Model architecture in DLEB
1.2.	Input layer in DLEB
1.3.	Output layer in DLEB
1.4.	Setting training parameter of output layer
1.5.	Hidden layers in DLEB
1.6.	Layer types available in DLEB
1.7.	Custom layer
1.8.	Activation function

2.	Designing and building your deep learning model

2.1.	Adding and removing layers
2.2.	Setting parameters of layers
2.3.	Handling layers
2.4.	Adding and removing links between layers
2.5.	Handling activation function
2.6.	Grouping and ungrouping layers
2.7.	Using template models
2.8.	Importing to Python code

3.	Using a deep learning model recommended by DLEB

3.1.	Choosing the purpose of a deep learning model
3.2.	Selecting the type of data for training a deep learning model
3.3.	Selecting the structure of a deep learning model

4.	Executing the Python code generated by DLEB

4.1.	Preparing the requirements for using the Python code
4.2.	Preparing the input data for a deep learning model
4.3.	Setting config file
4.4.	Running the Python code using users' machine
4.5.	Running the Python code using Google Colaboratory
4.6.	Guidelines for hyperparameter tuning
4.7.	Optimization of deep learning models
4.8.	Overfitting and underfitting
4.9.	Outputs

1. Model & layer

1.1. Model architecture in DLEB

DLEB is designed to build models consisting mainly of three parts: (i) input layers, (ii) hidden layers, and (iii) output layers.
A model must start with an input layer and end with an output layer.

1.2. Input layer in DLEB

The input layer converts input data in matrix format to a tensor.
A tensor is a multi-dimensional array used in all kinds of operations in deep learning models. For more details about tensors, see https://www.tensorflow.org/guide/tensor.
A model can contain multiple input layers. If multiple input layers are not merged by a merging layer, such as the concatenate layer, they are used sequentially to train or test the model.

1.3. Output layer in DLEB

The output layer in DLEB is defined as the layer which functions as the placeholder of model results and calculates losses for the results based on set training parameters including optimizer and loss function.
Generally, the output layer is used to produce the result for given inputs. However, the output layer in DLEB does not perform any operations to calculate the result. This means that the layer following the formal definition of output layer is the hidden layer just before the output layer in DLEB.

The training range parameter of the output layer helps to set layers to be trained by losses calculated in the output layer.

If the training range parameter is not set for the output layer, the output layer remains the model result placeholder.

As with the input layer, a model can contain multiple output layers. By setting different training ranges for different output layers, the layers in a model can be trained differently. This function is useful for building complicated models, such as a GAN.

1.4. Setting training parameter of output layer

Users can set parameters for model training in the "Parameter Settings" panel of the output layer.
Select input layers and hidden layers (or groups for hidden layers) that will be trained using the loss value of the current output layer in the "Training layer" option.

If the current model contains only the output layer, or no layer is set in the "Training layer" option, the output layer functions only as the model result placeholder, and the parameters for training are hidden.
When a set of layers is set in the "Training layer" option, the parameters for training are shown.

The selected input layers and hidden layers have to be connected to each other.

Select the type of optimizer for training hidden layers.

Optimizers are algorithms used for training the selected layers to minimize the loss value.

Select an input layer used to calculate the model result.

Select the label data to be compared with the model result.

Select a loss function used to calculate the loss by comparing the model result with the label data.

If users want to calculate the loss using multiple inputs and label data together, they can add another set of "Model loss" parameters by clicking the plus icon. The final loss value is calculated by summing the loss values calculated with each set of "Model loss" parameters.
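
The summation of several "Model loss" entries can be illustrated with a small, self-contained sketch. It is not the code generated by DLEB: the function names, the two loss functions, and the sample values are assumptions chosen only to show that the final loss is the sum of the individual losses.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between labels and model results."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def total_loss(model_losses):
    """Sum the loss of every (loss function, labels, model results) entry."""
    return sum(fn(y_true, y_pred) for fn, y_true, y_pred in model_losses)


# Two hypothetical "Model loss" entries, each with its own labels and results.
model_losses = [
    (mean_squared_error, [1.0, 0.0], [0.5, 0.5]),
    (binary_crossentropy, [1, 0], [0.5, 0.5]),
]
final_loss = total_loss(model_losses)
print(final_loss)
```

Each entry contributes its own loss value, so adding a new "Model loss" set simply adds one more term to the sum.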

1.5. Hidden layers in DLEB

Hidden layers in DLEB are intermediate layers between the input and output layers where all computations are conducted and the results are produced for given inputs.
All types of layers except the input and output layers can be used as hidden layers.
All hidden layers in the main model have to be connected to each other.
The first and last hidden layers have to be connected to the input and output layers, respectively.
Multiple layers designed to work together for the same function can be grouped and visualized as a layer group.

1.6. Layer types available in DLEB

Layers consisting of multiple nodes are basic building blocks of deep learning models.
Basically, input data fed into a certain layer is combined using weights, transformed, and then transferred to the next layer.
DLEB supports various types of layers including convolution and recurrent layers.
Supported layer types


Type	Name	Description
Core layer	Dense layer	Basic fully connected layer implementing an operation: output = (input * weight) + bias.
	Flatten layer	Layer converting input data in matrix form to a single one-dimensional array.
	Reshape layer	Layer changing the shape of input data.
Convolution layer	1D convolution layer	Convolution layer with one-dimensional input/output data and a kernel moving in one direction.
	2D convolution layer	Convolution layer with two-dimensional input/output data and a kernel moving in two directions.
	3D convolution layer	Convolution layer with three-dimensional input/output data and a kernel moving in three directions.
	1D pooling layer	Layer for down-sampling input data by taking average or maximum value over a one-dimensional window.
	2D pooling layer	Layer for down-sampling input data by taking average or maximum value over a two-dimensional window.
	3D pooling layer	Layer for down-sampling input data by taking average or maximum value over a three-dimensional window.
	1D deconvolution layer	Layer working through an opposite direction of a convolution layer in terms of data transformation for one-dimensional input/output data.
	2D deconvolution layer	Layer working through an opposite direction of a convolution layer in terms of data transformation for two-dimensional input/output data.
	3D deconvolution layer	Layer working through an opposite direction of a convolution layer in terms of data transformation for three-dimensional input/output data.
Recurrent layer	Simple RNN layer	Fully connected recurrent neural network layer.
	GRU layer	Gated recurrent unit layer.
	LSTM layer	Long short-term memory layer.
Merging layer	Concatenate layer	Layer for concatenating the list of input data. * The layer must be connected with at least two layers.
	Average layer	Layer for averaging the list of input data. * The layer must be connected with at least two layers.
	Maximum layer	Layer for computing the maximum value among the list of input data. * The layer must be connected with at least two layers.
	Minimum layer	Layer for computing the minimum value among the list of input data. * The layer must be connected with at least two layers.
	Add layer	Layer for adding the list of input data. * The layer must be connected with at least two layers.
	Subtract layer	Layer for subtracting the list of input data. * The layer must be connected with at least two layers.
	Multiply layer	Layer for multiplying the list of input data. * The layer must be connected with at least two layers.
Normalization / Regularization	Dropout	Randomly selecting and not using some of input data for preventing overfitting.
Normalization / Regularization	Batch normalization	Normalization of output data by using the mean and the standard deviation of input data in a mini batch.
Layer group	Encoder	A group of layers used for learning data representation in an autoencoder.
	Decoder	A group of layers used for reconstructing data based on data features in an autoencoder.
	1D Convolutional encoder	An encoder layer group with 1D convolutional layers.
	1D Convolutional decoder	A decoder layer group with 1D convolutional layers.
	2D Convolutional encoder	An encoder layer group with 2D convolutional layers.
	2D Convolutional decoder	A decoder layer group with 2D convolutional layers.
	3D Convolutional encoder	An encoder layer group with 3D convolutional layers.
	3D Convolutional decoder	A decoder layer group with 3D convolutional layers.
	Generator	A group of layers for generating similar data as input data in a GAN model.
	Discriminator	A group of layers for discriminating real data from faked ones in a GAN model.
Advanced layer	Noise layer	Layer for introducing additive zero-centered Gaussian noise.
	Sampling layer	Layer for adding random noise to input data.
	Bottleneck layer	Layer for generating encoded data in an autoencoder.
Pretrained layer	Xception	Model for instantiating the Xception architecture. The default input size for this model is 299x299. * More description about Xception in this link.
	VGG	Model for instantiating the VGG16 architecture. The default input size for this model is 224x224. * More description about VGG in this link.
	ResNet	Model for instantiating the ResNet50 architecture. The default input size for this model is 224x224. * More description about ResNet in this link.
	Inception	Model for instantiating the Inception V3 architecture. The default input size for this model is 299x299. * More description about Inception in this link.
	MobileNet	Model for instantiating the MobileNetV3Large architecture. The default input size for this model is 224x224. * More description about MobileNet in this link.
	DenseNet	Model for instantiating the Densenet121 architecture. The default input size for this model is 224x224. * More description about DenseNet in this link.
	NASNet	Model for instantiating the NASNet model in the ImageNet mode. The default input size for this model is 331x331. * More description about NASNet in this link.
Custom layer	Custom layer	Layer that can be customized by users.
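
As a concrete illustration of the dense layer operation listed in the table, output = (input * weight) + bias, the following sketch computes the output of a dense layer for one sample in plain Python. The function name and the numbers are assumptions for illustration; in the generated code this operation is performed by the deep learning framework.

```python
def dense_forward(inputs, weights, bias):
    """Compute output = (input * weight) + bias for one sample.

    inputs:  list of n_in values
    weights: nested list with n_in rows and n_out columns
    bias:    list of n_out values
    """
    n_out = len(bias)
    return [
        sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + bias[j]
        for j in range(n_out)
    ]


# Two input values mapped to two output nodes.
output = dense_forward([1.0, 2.0], [[0.5, -1.0], [0.25, 2.0]], [0.1, 0.0])
print(output)
```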

1.7. Custom layer

If users want to make their own layers, they can use a custom layer by following these steps.

Add the "custom" layer while building the model in the web interface.

Define the output shape of the custom layer in the web interface so that the integrity of the model structure and parameter values can be checked before the Python code is imported.

After the Python code for the model containing the “custom” layer is imported, define the operation of the “custom” layer in the “custom_layer.py” file.

The “custom_layer.py” file is downloaded together with the Python code and is located at the following path.

(Directory containing Python code) model/src/custom_layer.py

The shape of the return value must be equal to the output shape of the "custom" layer set while building the model.
An example of a “custom” layer operation (“_example_sampling”):

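import tensorflow as tf
from tensorflow.keras import layers

class _example_sampling(layers.Layer):
    # Sample z from a Gaussian with the given mean and log-variance.
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        # Standard normal noise with the same shape as z_mean.
        epsilon = tf.random.normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon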

1.8. Activation function

An activation function can be used to transform the weighted combination of input data to make the output data of a specific node in a layer.
The choice of the activation function has a large impact on the capability and performance of deep learning models, and different activation functions should be used in different parts of the models.
See 2.5 Handling activation function for setting the activation function for each layer.
Activation functions available in DLEB

Function	Description
Sigmoid	A sigmoid function is an S-shaped activation function producing numbers between 0 and 1.
Softmax	A softmax function is a function producing a vector of values that sum to 1.
Tanh	A Tanh (hyperbolic tangent) activation function is an S-shaped activation function producing numbers between -1 and 1. Recurrent networks commonly use the Tanh activation function.
ReLU	A ReLU (rectified linear unit) activation function is a piecewise linear function that returns the input if it is positive and zero otherwise. It is the most common activation function in deep learning models.
Elu	An Elu (exponential linear unit) activation function returns the input if it is positive and a negative value otherwise. The negative value is calculated by using an exponential function.
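
The behavior of the functions in the table can be checked with a minimal plain-Python sketch. These are the standard textbook definitions, written here for illustration only; the generated code relies on the framework's own implementations, and the `alpha` parameter of `elu` is an assumption set to the common default of 1.0.

```python
import math


def sigmoid(x):
    """S-shaped function with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """S-shaped function with outputs between -1 and 1."""
    return math.tanh(x)


def relu(x):
    """Return the input if it is positive and zero otherwise."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """Return the input if positive, else alpha * (exp(x) - 1), which is negative."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softmax(values):
    """Return a list of positive values that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), tanh(0.0), relu(-2.0), elu(-1.0), softmax([1.0, 2.0, 3.0]))
```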

You can find more information at the following links and papers.

https://training.galaxyproject.org/training-material/topics/statistics/tutorials/FNN/tutorial.html#activation-functions
https://www.tensorflow.org/api_docs/python/tf/keras/activations
Greener, J. G., Kandathil, S. M., Moffat, L., & Jones, D. T. (2022). A guide to machine learning for biologists. Nature Reviews Molecular Cell Biology, 23(1), 40-55.
Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media.

2. Designing and building your deep learning model

2.1. Adding and removing layers in a deep learning model

Users can add layers by clicking the name of the layer in the list in the “Layer” panel.
Users can remove layers by selecting the layers they want to delete and clicking the “Trash can” icon on the toolbar, or by right-clicking on the selected layers and clicking “Delete”.
A group of layers (e.g., encoder, generator) or a pre-trained model (e.g., ResNet, Inception) can also be added in the same manner as other layers.
All models must start with an input layer and end with an output layer. Multiple input and output layers are also allowed.

2.2. Setting the parameters of layers in a deep learning model

Users can set the parameters of layers in the “Parameter Settings” panel. The panel appears by clicking the layer the user wants to set the parameters for.
Parameters in the “Parameter Settings” panel are divided into “Mandatory Parameters” and “Optional Parameters”. All parameters except those related to the dimensions of users' input data are set to default values recommended by DLEB. Users can also change the parameter values according to their preference.
“Mandatory Parameters” must be set before writing the final Python code for a deep learning model.
Users can get statistics for a deep learning model including the number of layers and parameters in the “Model Statistics” panel, which is useful information for estimating the size of a model and the time needed to train the model.
Trainable parameters are parameters updated during training. In contrast, non-trainable parameters are parameters not updated during training, and they can be added by using the “Batch normalization” layers and the “Pre-trained model” layers.
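
To make the distinction between trainable and non-trainable parameters concrete, the sketch below counts parameters for a small hypothetical model (a dense layer, a batch normalization layer, and a second dense layer). The helper names and layer sizes are assumptions; the counting rules follow the usual Keras conventions, in which a dense layer has one weight per input-unit pair plus one bias per unit, and a batch normalization layer has two trainable values (scale and offset) and two non-trainable values (moving mean and moving variance) per feature.

```python
def dense_param_count(n_inputs, n_units, use_bias=True):
    """Trainable parameters of a dense layer: weights plus optional biases."""
    return n_inputs * n_units + (n_units if use_bias else 0)


def batch_norm_param_count(n_features):
    """Return (trainable, non_trainable) parameters of batch normalization."""
    trainable = 2 * n_features      # scale (gamma) and offset (beta)
    non_trainable = 2 * n_features  # moving mean and moving variance
    return trainable, non_trainable


# Hypothetical model: 100 inputs -> Dense(64) -> BatchNormalization -> Dense(10)
bn_trainable, bn_non_trainable = batch_norm_param_count(64)
trainable_total = dense_param_count(100, 64) + bn_trainable + dense_param_count(64, 10)
non_trainable_total = bn_non_trainable

print(trainable_total, non_trainable_total)
```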

2.3. Handling layers in a deep learning model

Users can select a single layer by mouse click or multiple layers by holding down the “Shift” key and dragging. The selected layer(s) are highlighted with a bold border.
The selected layers can be moved together by dragging one of them.

2.4. Adding and removing links between layers in a deep learning model

Users can add links connecting layers by clicking and dragging the white circle on a layer.
- Links going from a hidden layer to an input layer cannot be added.
A link is automatically created when a new layer is added while existing layers in the model are selected. The automatically added links connect all selected layers to the newly added one.
- When the selected layer is a hidden layer and the new layer is an input layer, the link is not added.
Users can remove links by selecting them and clicking the “Trash can” icon on the toolbar, or by right-clicking on the selected links and clicking “Delete”.

2.5. Handling activation function

Users can set the activation function for each layer by clicking the links coming from the layer and selecting a function.

Users cannot set the activation function for layers that are not hidden layers (input and output layers).

If users do not want to use an activation function for a layer, they can select the "None" option for the links coming from the layer. The activation function marks then disappear from the links.

2.6. Grouping and ungrouping layers in a deep learning model

Users can make a group of layers by selecting multiple layers and clicking the “Grouping” button on the toolbar or right-clicking on the selected layers and clicking “Grouping”.
Layers in a group can be ungrouped by clicking the “Ungrouping” button on the toolbar or right-clicking on the selected layer groups and clicking “Ungrouping”.

2.7. Using template models provided by DLEB

Users can use one of the template models by simply clicking the listed models in the “Template” panel.
Users can also edit the structure of the template model and the parameters of the layers.

2.8. Importing to Python code

Users can build only one deep learning model at a time.
Users can create the Python code for the deep learning model designed in DLEB by clicking the “Writing Code” button.
When users set the parameters for model training and click “Importing code” on the popup window, the deep learning model is imported to Python source code.
When the Python source code is successfully created, the popup window is updated. A compressed file containing the source code and an IPYNB file is downloaded when users click “Download code” on the popup window.

3. Using a deep learning model recommended by DLEB

3.1. Choosing the purpose of a deep learning model

Users can choose the purpose of a deep learning model among four different tasks: “Feature extraction”, “Data generation”, “Classification”, and “Regression”. The description of each task is as follows.
“Feature extraction” is the task of reducing the dimension of a dataset by summarizing original features into new informative and non-redundant features. The extracted features can be used as input data for clustering. This task does not require label data for model training.
“Data generation” is the task of generating synthetic data that mimics an input dataset. A generative deep learning model learns patterns in input data and generates new synthetic data based on the learned patterns. This task does not require label data for model training.
“Classification” is the task of organizing data into predefined categories (labels) based on the features of input data. Label data is necessary for this task.
“Regression” is the process of investigating the relationship between input features and outputs (labels). Label data is necessary for this task.

3.2. Selecting the type of data for training a deep learning model

Users can select the type of input data for training a deep learning model.
The Python code generated by DLEB contains preprocessing functions specific to the selected input data type.

3.3. Selecting the structure of a deep learning model

Based on the purpose and the input data type users select, DLEB recommends several deep learning models.
Users choose a final deep learning model structure that is shown on the “Model Design” page. Users can edit the structure of the chosen model and the layer parameters on this page.

4. Executing the Python code generated by DLEB

4.1. Preparing the requirements for using the Python code generated by DLEB

Requirements

Python >= 3.6 (https://www.python.org)
TensorFlow >= 2.0 (https://www.tensorflow.org)
pip (https://pypi.org/project/pip/)
numpy (https://numpy.org)
pillow (https://pillow.readthedocs.io/en/stable/)
argparse (https://pypi.org/project/argparse/)
scikit-learn (https://scikit-learn.org/stable/)
Janggu (https://github.com/BIMSBbioinfo/janggu)

Preparing requirements using conda

If conda is available, users can easily set up the requirements with the following steps.

Download the environment file to your server to prepare the requirements with conda.
Build a new conda environment with the requirements using the downloaded environment file.

$ conda env create -n DLEB -f [FILE_PATH]/environment.yml

When the above command completes successfully, the requirements are set up under the conda environment named "DLEB". The imported Python code can be run after the "DLEB" conda environment is activated.

$ conda activate DLEB # Activate conda environment
$ model.py --config_file [path of config_file] --outdir [path of out_directory] # Run Python code. See "4.4. Running the Python code using users' machine" for details.

4.2. Preparing the input data for a deep learning model

Sequence data (FASTA format)

(Case 1) The length of all sequences in the FASTA file should be equal if users want to use the sequences without any additional preprocessing steps, such as splitting sequences.

# Sequence data
>SEQ1
ACGTTTGCCGGGTGGGGTTCGAAAC
>SEQ2
GCGGTTTGCGCTCTCTCTCTAAATT
>SEQ3
ATGACTCTAGTCTCTCTAGTCTAGT

# Label data
SEQ1 1
SEQ2 0
SEQ3 1
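
Before training, it can be useful to confirm that a FASTA file meets the equal-length requirement of Case 1. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and this is not part of the code generated by DLEB.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def all_same_length(records):
    """Return True if every sequence has the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


fasta_text = """>SEQ1
ACGTTTGCCGGGTGGGGTTCGAAAC
>SEQ2
GCGGTTTGCGCTCTCTCTCTAAATT
>SEQ3
ATGACTCTAGTCTCTCTAGTCTAGT
"""
records = read_fasta(fasta_text)
print(all_same_length(records))
```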

(Case 2) Raw sequences can also be used as an input sequence file if a BED file defining the regions of interest (ROI) is given together. Each region in the BED file should be the same length. The 5th column in the BED file is used as the label for each region.

# Raw sequence data
>SEQ1
TTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCA
TTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTC
...
CCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTT
CATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCA

# An example of label data in BED format defining the region of interest (ROI)
SEQ1 10000 10010 . 0 +
SEQ1 10020 10030 . 1 +
SEQ1 10030 10040 . 1 +

# Processed sequences split by DLEB code
>SEQ1:10000-10010
ACGTCCCGTA
>SEQ1:10020-10030
ACGAAATGTT
>SEQ1:10030-10040
TTGACTCTAT

(Case 3) If the regions in the BED file containing the ROI differ in length, users should provide the BED file together with a bin size and step size for dividing sequences into bins of the same size.

Bin size and step size can be set in the config file. See 4.3. Setting config file.

# An example of raw sequence data
>SEQ1
TTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCA
TTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTC
...
CCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTT
CATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCA

# An example of label data in BED format defining the region of interest (ROI)
SEQ1 10000 15000 . 0 +
SEQ1 25500 28500 . 1 +
SEQ1 30500 31000 . 1 +

# An example of processed sequences split by DLEB code
# Bin size: 2,500 / Step size: 2,500
>SEQ1:10000-12500
ACTGGTG ... ACCCTTGGG
>SEQ1:12500-15000
ACTTGCTT ... ATTGGATCA
>SEQ1:25500-28000
TGGCTAG ... ATGATACTTA
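
The way a bin size and step size divide regions of different lengths can be sketched as follows. This is an illustration under stated assumptions, not the DLEB implementation: the function name is hypothetical, and the sketch assumes that bins which do not fit completely inside a region are dropped.

```python
def split_region(start, end, bin_size, step_size):
    """Return (start, end) bins of length bin_size that fit inside a region."""
    if bin_size <= 0 or step_size <= 0:
        raise ValueError("bin_size and step_size must be positive")
    bins = []
    pos = start
    while pos + bin_size <= end:  # assumption: partial bins are dropped
        bins.append((pos, pos + bin_size))
        pos += step_size
    return bins


# Regions taken from the BED example above: (sequence name, start, end).
regions = [("SEQ1", 10000, 15000), ("SEQ1", 25500, 28500), ("SEQ1", 30500, 31000)]
names = [
    f"{chrom}:{bin_start}-{bin_end}"
    for chrom, start, end in regions
    for bin_start, bin_end in split_region(start, end, 2500, 2500)
]
print(names)
```

With a step size smaller than the bin size, consecutive bins overlap; with equal values, as here, the bins tile each region without overlap.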

Alignment data (BAM format)

(Case 1) The alignment information in the BAM file can be preprocessed, if the BED file defining the ROI is given together. Each region in the BED file should be the same in length, if users want to use the regions without any binning or processing steps. The 5th column in the BED file will be used as labels for each region.

# An example of alignment data
READ1 16 SEQ1 15944 255 36M * 0 0 TTATCACAATGTCATCCGCAGCTAATTTTGAGCCCA ==>:;6;?>>;?8?937?;8;1A?@@>@@A?;8>?= XA:i:0 MD:Z:36 NM:i:0
READ2 16 SEQ1 16076 255 36M * 0 0 AGGTTTCAATAACATCTTTGTCCTCTATTACAACGG AA@@@@ABBABAAB@BBBBBABBABABABACBBCCB XA:i:0 MD:Z:36 NM:i:0
READ3 16 SEQ1 16542 255 36M * 0 0 GGCACCACTCACGATAACCTGGGCACCGGTGTTCCT 55;5=:4;;=5?=>;=6>A;=;:>4?>;=?=A??B@ XA:i:0 MD:Z:36 NM:i:0

# An example of label data in BED format defining ROI
SEQ1 10000 15000 . 1 *
SEQ1 25500 31000 . 0 *
SEQ1 30500 38000 . 1 *

(Case 2) If the regions in the BED file differ in length, users should provide the BED file together with a bin size and step size for dividing the regions into bins of the same size.

Bin size and step size can be set in the config file. See 4.3. Setting config file.

Signal data (bigWig format)

(Case 1) As with the alignment data, the signal information in the bigWig file can be preprocessed if the BED file defining the ROI is given together. Each region in the BED file should be the same length if users want to use the regions without any binning or processing steps. The 5th column in the BED file is used as the label for each region.
(Case 2) If the regions in the BED file differ in length, users should provide the BED file together with a bin size and step size for dividing the regions into bins of the same size.

Bin size and step size can be set in the config file. See 4.3. Setting config file.

Image data (JPEG or PNG format)

All image data in JPEG or PNG format should be the same size.
If the images differ in size, they can be resized to a fixed width and height.

Users can set the image width and height in the config file. See 4.3. Setting config file.

# An example of label data
Img1.jpg 1
Img2.jpg 0
Img3.jpg 1
Img4.jpg 0

Text data (TXT, CSV or TSV format)

Users can also use the matrix-formatted data that are already preprocessed.

# An example of text data
Feature1 Feature2 Feature3 Feature4 Feature5
Sample1 0.5 0.2 0.3 0.1 0.5
Sample2 0.3 0.5 0.1 0.5 0.3
Sample3 0.4 0.3 0.1 0.4 0.2

# An example of label data
Sample1 1
Sample2 0
Sample3 1

4.3. Setting config file

Users then set up a config file for model input and output data.
When the Python code for the designed model is imported, a config file in JSON format is also provided. The config file is automatically formatted with information about the model structure.
Users can add file paths for input and output data. If there are multiple input or output layers in the model, users should add the file path for each layer.

{
    "inputs": {
        "_input_1": {
            "data_type": "txt",
            "input_filepath": "$DIR_PATH/$FILEPATH1"
        },
        "_input_2": {
            "data_type": "txt",
            "input_filepath": "$DIR_PATH/$FILEPATH2"
        }
    },
    "outputs": {
        "_output_1": {
            "label_type": "bed",
            "label_filepath": "$DIR_PATH/$FILEPATH3"
        },
        "_output_2": {
            "label_type": "txt",
            "label_filepath": "$DIR_PATH/$FILEPATH4"
        }
    }
}
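
Because the config file must be valid JSON, one option is to fill in the file paths with Python's json module rather than by hand, so that quotes and commas are always well-formed. The sketch below is an illustration only: the helper name and the file paths are hypothetical, and the keys mirror the example above.

```python
import json


def build_config(inputs, outputs):
    """Build a config dictionary from {layer name: (type, file path)} mappings."""
    config = {"inputs": {}, "outputs": {}}
    for name, (data_type, path) in inputs.items():
        config["inputs"][name] = {"data_type": data_type, "input_filepath": path}
    for name, (label_type, path) in outputs.items():
        config["outputs"][name] = {"label_type": label_type, "label_filepath": path}
    return config


# Hypothetical file paths; replace them with the paths of your own data.
config = build_config(
    inputs={
        "_input_1": ("txt", "data/features_1.txt"),
        "_input_2": ("txt", "data/features_2.txt"),
    },
    outputs={
        "_output_1": ("bed", "data/labels_1.bed"),
        "_output_2": ("txt", "data/labels_2.txt"),
    },
)
config_text = json.dumps(config, indent=4)
print(config_text)
```

Writing `config_text` to config.json then yields a file that any JSON parser can read back.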

If input data should be preprocessed using a BED file, bin size, or step size, these can also be set in the config file.

If all regions in the BED file have the same length, set the bin size and step size to 0:

{
    "inputs": {
        "_input_1": {
            "data_type": "seq",
            "input_filepath": "$DIR_PATH/$FILEPATH1",
            "roi_filepath": "$DIR_PATH/$ROI_FILEPATH1",
            "binsize": 0,
            "stepsize": 0
        }
    }
}

If the regions in the BED file have different lengths, set the bin size and step size used to divide them into bins of the same size:

{
    "inputs": {
        "_input_1": {
            "data_type": "seq",
            "input_filepath": "$DIR_PATH/$FILEPATH1",
            "roi_filepath": "$DIR_PATH/$ROI_FILEPATH1",
            "binsize": 100000,
            "stepsize": 50000
        }
    }
}

If the input data are images that need to be resized to fixed dimensions, the width and height can also be set in the config file.

{
    "inputs": {
        "_input_1": {
            "data_type": "img",
            "input_filepath": "$DIR_PATH/$FILEPATH1",
            "width": 0,
            "height": 0
        }
    }
}

Users can provide label data either as a BED file containing the ROI or as a TXT file. If the labels are strings that need to be encoded as numeric values, the “encoding” option can be set to “true”.
An example of a config file:

{
    "inputs": {
        "_input_1": {
            "data_type": "seq",
            "input_filepath": "$DIR_PATH/$FILEPATH1",
            "roi_filepath": "$DIR_PATH/$ROI_FILEPATH1",
            "binsize": 0,
            "stepsize": 0
        },
        "_input_2": {
            "data_type": "txt",
            "input_filepath": "$DIR_PATH/$FILEPATH2"
        }
    },
    "outputs": {
        "_output_1": {
            "label_type": "txt / bed",
            "label_filepath": "$DIR_PATH/$FILEPATH3",
            "encoding": true
        }
    }
}
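
What encoding string labels as numeric values means can be shown with a short sketch. The exact scheme used by the generated code is not described here, so the mapping below (sorted unique labels assigned consecutive integers) is an assumption, and the label names are hypothetical.

```python
def encode_labels(labels):
    """Map string labels to integer indices.

    Assumption: classes are the sorted unique labels, numbered from 0.
    Returns the encoded labels and the list of classes.
    """
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


# Hypothetical string labels for four regions.
encoded, classes = encode_labels(["enhancer", "promoter", "enhancer", "silencer"])
print(encoded, classes)
```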

4.4. Running the Python code using users' machine

Command

Running the Python code

$ model.py --config_file [path of config_file] --outdir [path of out_directory]

If TensorBoard is available, users can track the loss and accuracy of models in TensorBoard using the following command:

$ tensorboard --logdir $OUTDIR_PATH/log_dir

Options

--config_file: The file path for the config file (config.json).
--outdir: The directory path for the outputs of a deep learning model. If the “--outdir” option is not given, the outputs are created in the current directory.
--print_lyrs: A comma-separated list of the names of hidden layers whose output should be printed. One or more layer names can be passed with this option.
--model_name: The file name for the finally saved model.
--model_format: The format for saving an entire model to disk (SavedModel or h5).
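
For readers who want to see how such options are typically handled, the following is a hypothetical argparse sketch, not the parser shipped in the generated model.py. It uses "--config_file" as in the command above; the default values, the help texts, and the layer names in the usage example are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the options listed above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Train and test a model.")
    parser.add_argument("--config_file", required=True,
                        help="file path for the config file (config.json)")
    parser.add_argument("--outdir", default=".",
                        help="directory for outputs (default: current directory)")
    parser.add_argument("--print_lyrs", default="",
                        help="comma-separated names of hidden layers to print")
    parser.add_argument("--model_name", default="model",
                        help="file name for the saved model")
    parser.add_argument("--model_format", choices=["SavedModel", "h5"],
                        default="SavedModel",
                        help="format for saving the entire model")
    return parser


# Parse an example command line; "dense_1" and "dense_2" are hypothetical names.
args = build_parser().parse_args(
    ["--config_file", "config.json", "--print_lyrs", "dense_1,dense_2"]
)
layer_names = [name for name in args.print_lyrs.split(",") if name]
print(args.config_file, args.outdir, layer_names, args.model_format)
```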

4.5. Running the Python code using Google Colaboratory

Users can use Google Colab for training and testing their deep learning models constructed in DLEB.

Google Colab is an online browser-based platform that provides free computing resources, including GPUs, for deep learning applications.

How to run Python code

Upload the following directory and files required for training deep learning models to users’ Google Drive.

Directory downloaded from DLEB
Files containing input data including training, testing data and label data.

Open the IPYNB file in the uploaded directory with Google Colaboratory.

In the ‘Runtime’ menu, click ‘Change runtime type’ and select ‘GPU’ as the hardware accelerator for training a deep learning model.

Mount Google Drive by running the first cell by clicking the play button.

Run the second cell by clicking the play button for installing Conda and setting the Conda environment.

Open and edit the config.json file in Google Drive.

To open the config.json file, double-click the "config.json" tab in the sidebar.
File paths for input and label data in your Google Drive can be obtained by clicking the "Copy path" button.

For more details about setting the config.json file, see the "4.3. Setting config file" section.

Run the third cell by clicking the play button for training and testing a deep learning model.

The default output directory path in Google Colab is "/content/", which can be changed by using the "--outdir" option. For more details about options for running the Python code, see the 4.4 Running the Python code using users' machine section.

4.6. Guidelines for hyperparameter tuning

A hyperparameter is a parameter used to control the learning process. Selecting proper hyperparameters helps to improve the performance of deep learning models. Here are some guidelines for hyperparameter tuning.
Number of hidden layers

For many problems, starting with just one or two hidden layers will work just fine.
For more complex problems, researchers can gradually increase the number of hidden layers, until overfitting is observed.
Very complex tasks, such as large image classification, typically require networks with dozens of layers, and they need a huge amount of training data.
Reusing parts of a pretrained network that performs a similar task can be a good alternative for such tasks.

Number of nodes in a hidden layer

The number of nodes in the input and output layers is determined by the type of input and output data.
For hidden layers, it is common practice to size them to form a pyramid by gradually increasing or decreasing the number of nodes from layer to layer.
However, simply using the same number of nodes in all hidden layers performs just as well in many cases. Researchers can try increasing the number of nodes gradually until overfitting is observed.

Learning rate, batch size and other hyperparameters

In general, an optimal learning rate is about half of the maximum learning rate, that is, the learning rate above which the training algorithm diverges. A simple approach for finding it is to start with a large value, decrease the value and try again, and repeat until the training algorithm stops diverging.
Choosing a good optimizer for training is also quite important. Detailed information about the optimizers is in the ‘Optimization of deep learning models’ section.
The batch size can also have a significant impact on the performance of deep learning models and on the training time. A small batch size keeps each training iteration short, while a large batch size gives a more precise estimate of the gradients at the expense of training time. If ‘Batch normalization’ is used, the batch size should not be too small.
Selecting an appropriate activation function is also important. In general, the ReLU activation function is a good default for all hidden layers. For the output layer, the choice depends on the task.
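
The learning rate search described above (start large, decrease until training stops diverging, then take about half) can be sketched on a toy problem. The sketch below is an assumption-laden illustration, not DLEB code: it replaces real model training with gradient descent on the one-dimensional loss 0.5 * curvature * w**2, and the names training_diverges and largest_stable_learning_rate are hypothetical.

```python
import math


def training_diverges(learning_rate, curvature=4.0, steps=50):
    """Run gradient descent on the toy loss 0.5 * curvature * w**2.

    Returns True if the weight blows up, which stands in for a
    diverging training run of a real model.
    """
    w = 1.0
    for _ in range(steps):
        gradient = curvature * w
        w -= learning_rate * gradient
        if not math.isfinite(w) or abs(w) > 1e6:
            return True
    return False


def largest_stable_learning_rate(start=10.0, shrink=0.5, max_tries=60):
    """Decrease the learning rate from `start` until training stops diverging."""
    learning_rate = start
    for _ in range(max_tries):
        if not training_diverges(learning_rate):
            return learning_rate
        learning_rate *= shrink
    raise RuntimeError("no stable learning rate found")


# First non-diverging value on the tried grid, used as an estimate of the
# maximum learning rate; the suggested value is about half of it.
max_lr = largest_stable_learning_rate()
suggested_lr = max_lr / 2
print(max_lr, suggested_lr)
```

With a real model, training_diverges would be replaced by a short training run that checks whether the loss grows or becomes NaN.
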
You can find more information at the following links and papers.

https://training.galaxyproject.org/training-material/topics/statistics/tutorials/intro_deep_learning/tutorial.html#introduction
https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams
Greener, J. G., Kandathil, S. M., Moffat, L., & Jones, D. T. (2022). A guide to machine learning for biologists. Nature Reviews Molecular Cell Biology, 23(1), 40-55.
Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O'Reilly Media, Inc.
Lee, B. D., Gitter, A., Greene, C. S., Raschka, S., Maguire, F., Titus, A. J., ... & Boca, S. M. (2022). Ten quick tips for deep learning in biology. PLOS Computational Biology, 18(3), e1009803.

4.7. Optimization of deep learning models

The parameters of a deep learning model are obtained by minimizing the difference between the values predicted by the model and the true values in the training dataset. This process is called optimization, and here is some information about the optimizers available in DLEB.

Optimizer	Description
Stochastic gradient descent	The stochastic gradient descent (SGD) algorithm (with optional momentum optimization) is a variant of the gradient descent (GD) algorithm, which seeks a local minimum of a loss function by moving in the direction opposite to the gradient at each iteration. Instead of using the whole training dataset as the GD algorithm does, the SGD algorithm randomly selects a small portion of the training dataset and uses it for calculating the gradient. Compared with GD, the SGD algorithm usually converges in less time and requires less memory.
AdaGrad	The learning rate is an important parameter that controls the size of each step during optimization. A learning rate that is too small increases the time to converge, whereas one that is too large can make the model converge too quickly to a suboptimal solution. The AdaGrad algorithm automatically adapts the learning rate of each parameter as training goes on, so researchers need less manual tuning of the learning rate. However, because it accumulates the squared gradients of all past iterations, the effective learning rate keeps shrinking and training can stop before reaching the optimum.
Adadelta	Adadelta is an extension of AdaGrad that addresses the decaying learning rate problem of AdaGrad.
RMSprop	The RMSprop algorithm fixes the problem of AdaGrad, which slows down too fast and may never converge to the global optimum, by accumulating only the gradients from the most recent iterations. RMSprop was the preferred optimization algorithm of many researchers until the Adam optimizer was developed.
Adam	Adam (Adaptive moment estimation) combines the ideas of the momentum optimization in SGD and RMSprop. Since Adam is an adaptive learning rate algorithm, it requires less tuning of the learning rate.
Adamax	Adamax is an extension to the Adam algorithm. Adamax scales down the parameter updates based on the infinity norm. This can make Adamax more stable than Adam.
Nesterov Adam	The Nesterov Adam (Nadam) optimizer is simply the Adam optimizer plus the Nesterov trick. It often converges slightly faster than Adam.
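
To make the differences between some of the optimizers in the table concrete, the following sketch applies their standard textbook update rules to a single weight. It is an illustration under stated assumptions, not the code generated by DLEB: the loss (w - 3)**2, the function names, and all hyperparameter values are chosen for the example, and the gradient is computed exactly, so the random mini-batch sampling of SGD is not shown.

```python
import math

TARGET = 3.0  # Minimum of the toy loss (w - TARGET)**2.


def gradient(w):
    """Exact gradient of the toy loss (w - TARGET)**2."""
    return 2.0 * (w - TARGET)


def run_sgd(lr=0.1, steps=500):
    """Plain gradient step: move against the gradient with a fixed rate."""
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient(w)
    return w


def run_adagrad(lr=0.5, steps=500, eps=1e-8):
    """AdaGrad: divide the step by the root of ALL accumulated squared gradients."""
    w, accum = 0.0, 0.0
    for _ in range(steps):
        g = gradient(w)
        accum += g * g
        w -= lr * g / (math.sqrt(accum) + eps)
    return w


def run_rmsprop(lr=0.05, rho=0.9, steps=500, eps=1e-8):
    """RMSprop: use a decaying average, so only recent gradients matter."""
    w, avg_sq = 0.0, 0.0
    for _ in range(steps):
        g = gradient(w)
        avg_sq = rho * avg_sq + (1 - rho) * g * g
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w


def run_adam(lr=0.1, beta1=0.9, beta2=0.999, steps=500, eps=1e-8):
    """Adam: momentum (first moment) combined with RMSprop-style scaling."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # Bias correction for the zero start.
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


for name, run in [("SGD", run_sgd), ("AdaGrad", run_adagrad),
                  ("RMSprop", run_rmsprop), ("Adam", run_adam)]:
    print(name, run())
```

All four runs start from w = 0 and should end close to the minimum at w = 3. In the generated code, the optimizers are expected to come from tf.keras.optimizers rather than being written by hand.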

You can find more information at the following links and paper.

https://training.galaxyproject.org/training-material/topics/statistics/tutorials/intro_deep_learning/tutorial.html#introduction
https://d2l.ai/chapter_optimization/index.html
https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
Lee, B. D., Gitter, A., Greene, C. S., Raschka, S., Maguire, F., Titus, A. J., ... & Boca, S. M. (2022). Ten quick tips for deep learning in biology. PLOS Computational Biology, 18(3), e1009803.

4.8. Overfitting and underfitting

Overfitting occurs when the error measured using a test dataset begins to increase while the error measured using a training dataset still decreases.
Although it is often possible to achieve high accuracy on a training dataset, our final goal should be to develop models that generalize well on a test dataset.
The opposite of overfitting is underfitting. It occurs when the model is not able to obtain a sufficiently low error on a training dataset.
This means the model is not flexible enough to learn relevant patterns in the training dataset.
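
The two definitions above can be turned into simple checks on the error curves recorded during training. The sketch below is illustrative only: the function names, the patience parameter, and the example error values are assumptions made for this example, and the held-out errors may come from a validation or a test dataset.

```python
def first_overfitting_epoch(train_errors, heldout_errors, patience=2):
    """Return the first epoch (0-based) of a run of `patience` consecutive
    epochs in which the held-out error rises while the training error
    still falls, or None if no such run exists.
    """
    if len(train_errors) != len(heldout_errors):
        raise ValueError("both error lists must have the same length")
    streak = 0
    for epoch in range(1, len(train_errors)):
        train_falling = train_errors[epoch] < train_errors[epoch - 1]
        heldout_rising = heldout_errors[epoch] > heldout_errors[epoch - 1]
        streak = streak + 1 if (train_falling and heldout_rising) else 0
        if streak >= patience:
            return epoch - patience + 1
    return None


def is_underfitting(train_errors, acceptable_error):
    """True if the training error never reaches the acceptable level."""
    return min(train_errors) > acceptable_error


# Made-up error curves for illustration.
train = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
heldout = [0.95, 0.70, 0.58, 0.55, 0.57, 0.61, 0.66]
print(first_overfitting_epoch(train, heldout))
print(is_underfitting(train, acceptable_error=0.30))
```

Stopping training around the reported epoch is the idea behind early stopping.
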
You can find more information at the following links and book.

https://towardsdatascience.com/handling-overfitting-in-deep-learning-models-c760ee047c6e
https://www.tensorflow.org/tutorials/keras/overfit_and_underfit
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.

4.9. Outputs

Preprocessed biological input data

File path:

$ OUTDIR_PATH/input_preproc_data/[layer_name]_[train|val|test].npy

The input data is preprocessed into a format that can be used for training and testing a deep learning model. The preprocessed data is provided as one of the output files.
All preprocessed biological data are saved in the NumPy array format.

Biological data	Preprocessed data
Sequence data (FASTA)	One-hot encoded array
Alignment data (BAM)	Read coverage array
Signal data (bigWig)	Signal coverage array
Image data (JPEG, PNG)	Decoded array
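
As an illustration of the first row of the table, a one-hot encoding of a DNA sequence can be sketched as below. The column order A, C, G, T and the all-zero row for other characters (such as N) are assumptions made for this example; the exact encoding used by DLEB may differ, and the function name one_hot_encode is hypothetical.

```python
def one_hot_encode(sequence, alphabet="ACGT"):
    """Encode a sequence as a list of rows, one row per base.

    Each row has one column per letter of `alphabet`. Characters that are
    not in the alphabet (for example N) become all-zero rows, which is an
    assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


for row in one_hot_encode("ACGTN"):
    print(row)
```

The result has one row per base and one column per letter, which corresponds to a two-dimensional array for each sequence.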

The structure of a deep learning model

File path:

$ OUTDIR_PATH/[model_name](.h5)

The structure (architecture) of a model, a set of weights, and optimizer information are saved in the TensorFlow SavedModel format and the Keras H5 format.
This saved model can be reused with new test datasets.

Outputs of the intermediate hidden layers

File path:

$ OUTDIR_PATH/layer_output/[layer_name].output.txt

The outputs of the intermediate layers selected by users with the “--print_lyrs” option are saved in text file format.

The log directory for using TensorBoard

Directory path:

$ OUTDIR_PATH/log_dir/

Users can use TensorBoard for tracking model loss and accuracy and visualizing the model graph. If the users set the “Use Tensorboard” option on the web page, the log files for using TensorBoard are saved into the output directory.
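
The output locations listed in this section can be collected with a small helper such as the one below. This is a sketch, not part of the generated code: the function name expected_output_paths and the example layer and model names are hypothetical, and the reading of "[model_name](.h5)" as a SavedModel entry named [model_name] plus an H5 file named [model_name].h5 is an assumption.

```python
from pathlib import Path


def expected_output_paths(outdir, layer_name, model_name):
    """Build the output paths described in section 4.9 for one layer and model."""
    outdir = Path(outdir)
    return {
        # Preprocessed input data, one file per data split.
        "preprocessed": [
            outdir / "input_preproc_data" / f"{layer_name}_{split}.npy"
            for split in ("train", "val", "test")
        ],
        # Assumed reading of "[model_name](.h5)": both variants are listed.
        "saved_model": outdir / model_name,
        "model_h5": outdir / f"{model_name}.h5",
        # Output of an intermediate layer selected with "--print_lyrs".
        "layer_output": outdir / "layer_output" / f"{layer_name}.output.txt",
        # Log directory for TensorBoard.
        "tensorboard_logs": outdir / "log_dir",
    }


# "/content" is the default output directory in Google Colab; the layer
# and model names here are made-up examples.
paths = expected_output_paths("/content", "input_1", "my_model")
for key, value in paths.items():
    print(key, value)
```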