Enzyme Prediction and Optimization with Enzeptional (GT4SD)

This tutorial shows how to predict catalytic feasibility and turnover number (kcat) for enzymes using the Enzeptional framework from the Generative Toolkit for Scientific Discovery (GT4SD).

The example focuses on hydrolysis reactions of fructans (inulin and levan) and sucrose:

exo-inulinase: inulin to fructose
exo-levanase: levan to fructose
invertase: sucrose to glucose + fructose
endo-inulinase: inulin to inulooligosaccharides such as F4

Technologies

GT4SD v1.5.0 with the Enzeptional framework for enzyme optimization
ESM2 650M (facebook/esm2_t33_650M_UR50D) for protein embeddings
ChemBERTa (seyonec/ChemBERTa-zinc-base-v1) for substrate/product representation
PyTorch 2.5.1 with GPU support, tested on RTX 4090
Docker container execution with NVIDIA GPU support

Requirements

Docker
Base image: drugilsberg/gt4sd-base:v1.4.2-gpu
Working-directory files:
all_mapping_active_sites.fasta: enzyme sequences with mapped active sites
SMILES files such as inulin_substrato.smi, produto_exo_inulin.smi, and related reaction files
run_pred.py: adapted Enzeptional prediction script
run_all_predictions.sh: Bash script for multiple predictions
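
Before starting a run that may take hours, it can help to confirm that the inputs listed above are in place. The following is a minimal sketch, not part of Enzeptional: the helper names `missing_files` and `count_fasta_records` are hypothetical, and the file names are the ones used in this tutorial's batch script.

```python
from pathlib import Path

# File names taken from this tutorial; adjust them to your working directory.
REQUIRED_FILES = [
    "all_mapping_active_sites.fasta",
    "run_pred.py",
    "run_all_predictions.sh",
    "inulin_substrato.smi",
    "produto_exo_inulin.smi",
    "substrato-levan.smi",
    "produto_levan_exo.smi",
    "invertase_substrato.smi",
    "invertase_produto.smi",
    "inulin_f4_produto.smi",
]


def missing_files(workdir, required=REQUIRED_FILES):
    """Return the required file names that are absent or empty in workdir."""
    base = Path(workdir)
    return [
        name for name in required
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


def count_fasta_records(fasta_path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing or empty files:", ", ".join(missing))
    else:
        n = count_fasta_records("all_mapping_active_sites.fasta")
        print(f"All inputs present; {n} sequences in the FASTA file.")
```

The check only looks at file presence and size; it does not validate SMILES strings or active-site annotations.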

Container Setup

Create the GPU-enabled container, mounting the host working directory at /workspace (replace the host path with your own):

docker run --gpus all -dit --name gt4sd-gpu -v /home/node03/Alunos/lucas/gt4sd-core-gpu:/workspace drugilsberg/gt4sd-base:v1.4.2-gpu bash

To restrict the container to a specific GPU, for example device 1, use this command instead:

docker run --gpus '"device=1"' -dit --name gt4sd-gpu -v /home/node03/Alunos/lucas/gt4sd-core-gpu:/workspace drugilsberg/gt4sd-base:v1.4.2-gpu bash

Enter the container:

docker exec -it gt4sd-gpu bash

Run the setup commands inside the container:

# Install dependencies if requirements.txt is available
pip install -r requirements.txt

# Install GT4SD
pip install gt4sd==1.5.0 --upgrade

# Compatibility fix
pip uninstall numpy -y
pip cache purge
pip install numpy==1.23.5 --no-cache-dir --force-reinstall

# Refresh certificates if downloads fail
apt-get update --allow-unauthenticated
apt-get install -y ca-certificates
update-ca-certificates --fresh

# Upgrade PyTorch to a CUDA 11.8 build for the RTX 4090
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118

# Check PyTorch and GPU visibility
python -c "import torch; print(f'Version: {torch.__version__} | CUDA: {torch.cuda.is_available()} | CUDA version: {torch.version.cuda}')"

PyTorch is upgraded after GT4SD is installed because GT4SD's dependency constraints can pull in an older release. Forcing the newer CUDA build enables GPU execution.
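
Because several packages are reinstalled in sequence, it is easy to end up with a version other than the one intended. The sketch below compares installed versions against the pins used in the setup commands. It relies only on the standard-library `importlib.metadata`; the helper names `base_version` and `check_versions` are hypothetical, and the `+cu118` suffix handling assumes that wheels from the PyTorch index carry a local build tag.

```python
from importlib import metadata

# Version pins taken from the setup commands in this tutorial.
EXPECTED = {"gt4sd": "1.5.0", "numpy": "1.23.5", "torch": "2.5.1"}


def base_version(version):
    """Drop a local build suffix such as '+cu118' from a version string."""
    return version.split("+", 1)[0]


def check_versions(expected=EXPECTED):
    """Map each package to (installed version or None, matches expected pin)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[package] = (None, False)
            continue
        report[package] = (installed, base_version(installed) == wanted)
    return report


if __name__ == "__main__":
    for package, (installed, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package:8s} installed={installed} expected={EXPECTED[package]} {status}")
```

Run it inside the container after the setup commands; any MISMATCH line points at the install step to repeat.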

Run Predictions

Make the batch script executable and run it:

chmod +x run_all_predictions.sh
./run_all_predictions.sh

Example `run_all_predictions.sh`

#!/bin/bash

# =========================
# GENERAL SETTINGS
# =========================
FASTA="all_mapping_active_sites.fasta"
SCRIPT="run_pred.py"
PYTHON="python"

OUTDIR="results"
mkdir -p "${OUTDIR}"

# =========================
# EXECUTION FUNCTION
# =========================
run_prediction () {
    local TAG=$1
    local SUBSTRATE=$2
    local PRODUCT=$3

    local OUT_FEAS="${OUTDIR}/${TAG}_feasibility.csv"
    local OUT_KCAT="${OUTDIR}/${TAG}_kcat.csv"

    echo "===================================================="
    echo "   Running optimization: ${TAG}"
    echo "   Substrate: ${SUBSTRATE}"
    echo "   Product:   ${PRODUCT}"
    echo "===================================================="

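    # Quote every argument so paths containing spaces are passed intact,
    # and test the exit status of the command directly.
    if "${PYTHON}" "${SCRIPT}" \
        "${FASTA}" \
        "${SUBSTRATE}" \
        "${PRODUCT}" \
        "${OUT_FEAS}" \
        "${OUT_KCAT}"; then
        echo "Finished successfully: ${TAG}"
        echo "   -> ${OUT_FEAS}"
        echo "   -> ${OUT_KCAT}"
    else
        echo "Execution failed: ${TAG}" >&2
    fi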

    echo
}

# =========================
# RUNS
# =========================

# Inulin, exo reaction
run_prediction \
    "inulin_exo" \
    "inulin_substrato.smi" \
    "produto_exo_inulin.smi"

# Levan, exo reaction
run_prediction \
    "levan_exo" \
    "substrato-levan.smi" \
    "produto_levan_exo.smi"

# Invertase
run_prediction \
    "invertase" \
    "invertase_substrato.smi" \
    "invertase_produto.smi"

# Inulin, endo reaction (inulin to F4 + F4)
run_prediction \
    "inulin_endo" \
    "inulin_substrato.smi" \
    "inulin_f4_produto.smi"

echo "All predictions finished."
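
The Bash script prints "All predictions finished." even when individual runs fail. As an alternative, the same loop can be sketched in Python so that failed tags are collected and reported at the end. This is an illustration only: it assumes `run_pred.py` accepts the five positional arguments in the order the Bash script passes them, and the names `build_command` and `run_all` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

FASTA = "all_mapping_active_sites.fasta"
SCRIPT = "run_pred.py"

# (tag, substrate file, product file), copied from the Bash script.
REACTIONS = [
    ("inulin_exo", "inulin_substrato.smi", "produto_exo_inulin.smi"),
    ("levan_exo", "substrato-levan.smi", "produto_levan_exo.smi"),
    ("invertase", "invertase_substrato.smi", "invertase_produto.smi"),
    ("inulin_endo", "inulin_substrato.smi", "inulin_f4_produto.smi"),
]


def build_command(tag, substrate, product, outdir=Path("results"),
                  python=sys.executable):
    """Build the argument list for one run, in the Bash script's order."""
    return [
        python, SCRIPT, FASTA, substrate, product,
        str(Path(outdir) / f"{tag}_feasibility.csv"),
        str(Path(outdir) / f"{tag}_kcat.csv"),
    ]


def run_all(reactions=REACTIONS, outdir=Path("results"), runner=subprocess.run):
    """Run every reaction and return the tags whose process exited non-zero."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    failed = []
    for tag, substrate, product in reactions:
        result = runner(build_command(tag, substrate, product, outdir=outdir))
        if result.returncode != 0:
            failed.append(tag)
    return failed
```

Calling `run_all()` from the working directory returns a list such as `[]` when every run succeeds, which a wrapper can turn into a non-zero exit code for schedulers or CI.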

Outputs

The script runs predictions for four reactions and writes two CSV files per reaction to ./results/:

${TAG}_feasibility.csv: catalytic feasibility score for each enzyme
${TAG}_kcat.csv: predicted kcat values and evolutionary optimization results
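
A quick way to confirm that all eight files were produced is to count the rows in each one. The sketch below assumes only that the first row of each CSV is a header; it makes no assumption about column names, since those depend on `run_pred.py`. The function name `summarize_outputs` is hypothetical.

```python
import csv
from pathlib import Path

TAGS = ["inulin_exo", "levan_exo", "invertase", "inulin_endo"]
KINDS = ["feasibility", "kcat"]


def summarize_outputs(outdir="results", tags=TAGS, kinds=KINDS):
    """Return {file name: data-row count, or None if the file is missing}.

    The first row is assumed to be a header and is not counted.
    """
    summary = {}
    for tag in tags:
        for kind in kinds:
            path = Path(outdir) / f"{tag}_{kind}.csv"
            if not path.is_file():
                summary[path.name] = None
                continue
            with open(path, newline="", encoding="utf-8") as handle:
                rows = sum(1 for _ in csv.reader(handle))
            summary[path.name] = max(rows - 1, 0)
    return summary


if __name__ == "__main__":
    for name, count in summarize_outputs().items():
        print(f"{name}: {'missing' if count is None else count}")
```

A missing file or an unexpectedly low row count points to the run whose log should be inspected first.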

Runtime Notes

Runtime can range from hours to days depending on the number of sequences and evolutionary iterations. In the reference workflow, about 3,177 sequences were processed.
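
One way to get a rough figure before committing to a full run is to time a small pilot subset and extrapolate. The sketch below assumes that runtime scales roughly linearly with the number of sequences, which may not hold once evolutionary iterations dominate. The function name is hypothetical, and the pilot numbers in the usage note are made up for illustration, not measurements.

```python
def estimate_runtime_hours(pilot_sequences, pilot_seconds, total_sequences,
                           n_reactions=1):
    """Linearly extrapolate total runtime in hours from a timed pilot run."""
    if pilot_sequences <= 0 or pilot_seconds < 0:
        raise ValueError("pilot_sequences must be positive and pilot_seconds non-negative")
    seconds_per_sequence = pilot_seconds / pilot_sequences
    return seconds_per_sequence * total_sequences * n_reactions / 3600.0
```

For example, if a hypothetical pilot of 50 sequences took 600 seconds, `estimate_runtime_hours(50, 600, 3177, n_reactions=4)` gives about 42.4 hours for the four reactions in this tutorial.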

Use nvidia-smi to confirm GPU usage while the workflow is running.
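
For unattended runs, the same check can be scripted. The sketch below calls `nvidia-smi` with its `--query-gpu` and `--format` options and parses the result. The helper names are hypothetical, and the parser assumes every queried field is numeric; some GPUs report placeholder text such as `[N/A]` for certain fields, which this sketch does not handle.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line of the query above into a dict of integers."""
    index, util, used, total = (int(field.strip()) for field in line.split(","))
    return {"index": index, "util_percent": util,
            "mem_used_mib": used, "mem_total_mib": total}


def gpu_status():
    """Return one dict per visible GPU, or [] if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    output = subprocess.run(QUERY, capture_output=True, text=True,
                            check=True).stdout
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]
```

A utilization that stays near zero while `run_pred.py` is running suggests the workload has fallen back to the CPU.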