Hello MIR group,

I'm using Allegro, along with NequIP and FLARE, to build MLIPs for modeling condensed-phase systems and systems for heterogeneous catalysis, and I'm having a bit of difficulty with Allegro. On my laptop I can build smaller Allegro models and training goes as expected. However, for the larger models I'm training on Perlmutter, after the second batch of the first epoch it takes a while for the third batch to process, and I get the message copied below. After this message is printed, training continues as expected. Have you seen this issue before, and if so, is there a way to fix it so training doesn't take so long at the beginning? I've copied the message, my Allegro config file, and my Perlmutter SLURM script below. The SLURM script and config file are set up for a hyperparameter scan, and every hyperparameter combination I have tried so far shows this issue. Any help would be much appreciated. Thanks!
Sincerely,
Woody
Message that appears during training:
# Epoch batch loss loss_f loss_stress loss_e f_mae f_rmse Ar_f_mae psavg_f_mae Ar_f_rmse psavg_f_rmse e_mae e/N_mae stress_mae stress_rmse
0 1 0.951 0.949 1.31e-05 0.00122 0.106 0.203 0.106 0.106 0.203 0.203 1.39 0.00546 0.000341 0.000754
0 2 0.9 0.899 4.69e-06 0.000544 0.101 0.197 0.101 0.101 0.197 0.197 0.414 0.00385 0.000281 0.000451
/global/homes/w/wnw36/.conda/envs/nequip/lib/python3.10/site-packages/torch/autograd/__init__.py:276: UserWarning: operator() profile_node %884 : int[] = prim::profile_ivalue(%882)
does not have profile information (Triggered internally at /opt/conda/conda-bld/pytorch_1659484808560/work/torch/csrc/jit/codegen/cuda/graph_fuser.cpp:104.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
0 3 1.21 1.21 1.24e-05 0.000595 0.114 0.229 0.114 0.114 0.229 0.229 0.652 0.00382 0.000324 0.000732
SLURM script on Perlmutter:
#!/bin/bash
#SBATCH --job-name=nequip
#SBATCH --output=nequip.o%j
#SBATCH --error=nequip.e%j
#SBATCH --nodes=1
#SBATCH --time=24:00:00
#SBATCH --constraint=gpu
#SBATCH --qos=regular
#SBATCH --exclusive
module load python
conda activate nequip
mkdir -p outputs
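# Hyperparameter grid: fill the template.yaml placeholders for each combination and launch one nequip-train run per combination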
for rcut in 4.0 6.0; do
  for learning in 0.001 0.005; do
    for lmax in 4 5; do
      for nfeatures in 32 64; do
        for nlayers in 4; do
          file=gridsearch-$rcut-$learning-$lmax-$nfeatures-$nlayers.yaml
          sed -e "s/rcrcrc/$rcut/g" -e "s/lmaxlmaxlmax/$lmax/g" -e "s/lratelratelrate/$learning/g" -e "s/nfeatnfeatnfeat/$nfeatures/g" -e "s/nlayernlayernlayer/$nlayers/g" template.yaml > $file
          nequip-train $file > outputs/$rcut-$learning-$lmax-$nfeatures-$nlayers.log
        done
      done
    done
  done
done
Allegro config file:

If training continues at a reasonable speed, this is expected behavior due to TorchScript JIT compilation after 3 warmup calls. On the other hand, if you observe sustained degradation in performance, please see issue mir-group/nequip#311 and report the relevant details there.
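For intuition, here is a minimal, self-contained sketch of that warmup effect (a toy scripted function with made-up names and tensor sizes, not anything from the Allegro/NequIP code or the model above): the TorchScript profiling executor records tensor shapes and builds optimized kernels during the first few calls, so those calls are slow, and later calls reuse the compiled code.

import time
import torch

# Minimal TorchScript warmup sketch; toy_model and the tensor sizes are
# chosen purely for illustration and are unrelated to the Allegro model.
@torch.jit.script
def toy_model(x: torch.Tensor) -> torch.Tensor:
    # A few chained elementwise ops that the JIT profiler will try to optimize/fuse.
    return (x.sin() * x + 1.0).relu().sum()

x = torch.randn(2000, 2000)
for i in range(5):
    t0 = time.perf_counter()
    toy_model(x)
    print(f"call {i + 1}: {time.perf_counter() - t0:.4f} s")
# The first few calls are typically noticeably slower while profiles are
# collected and optimizations are applied; after that the timing settles.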