We've found that some workloads with the network models can cause poor rollback behavior.
In this case, we recommend trying the optimistic real-time scheduler (--sync=5 instead of --sync=3).
With the real-time scheduler, try --gvt-interval=32 as a starting point and decrease it if necessary.
If you stick with the traditional optimistic scheduler (--sync=3), --gvt-interval=128 may be a good starting point.
For either scheduler, you will probably want to set --batch to either 1 or 2.
The default number of KPs is 16 per PE. For models with poor rollback behavior, you will want to increase this; how far depends on the number of LPs you have per PE.
Consider --nkp=128 or higher (this setting is per PE, not for the whole simulation).
Alternatively, determine the number of LPs per PE and set --nkp to that number, so that you end up with one LP per KP.
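As a rough illustration of the one-LP-per-KP arithmetic, the sketch below computes a per-PE --nkp value from the total LP count and the number of PEs. The function name and the example counts are hypothetical, and it assumes LPs are spread as evenly as possible across PEs; check your model's actual LP mapping before relying on the result.

```python
def nkp_for_one_lp_per_kp(total_lps, num_pes):
    """Return a per-PE KP count that gives each LP its own KP.

    Assumption (not from the simulator): LPs are distributed as evenly
    as possible across PEs, so the fullest PE holds
    ceil(total_lps / num_pes) LPs.
    """
    if total_lps <= 0 or num_pes <= 0:
        raise ValueError("total_lps and num_pes must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-total_lps // num_pes)


# Hypothetical example: 4096 LPs over 32 PEs is 128 LPs per PE.
nkp = nkp_for_one_lp_per_kp(4096, 32)
print("--nkp=%d" % nkp)
```

When the LP count does not divide evenly, the result is sized for the fullest PE, so some PEs would end up with a few KPs that hold no LP.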
Finally, there is --max-opt-lookahead. For the dragonfly model, we've had success setting this to either 100 or 1000, though the best value can depend on the workload being used.
1000 is probably a good enough starting value for any of the network models and workloads.
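The starting points above can be gathered into a single flag list. The helper below is only a sketch: the function name is hypothetical, the values are the suggested starting points from this section rather than tuned settings, and --batch=1 is picked arbitrarily from the suggested 1 or 2. How the flags are passed to your model binary and MPI launcher is left out.

```python
from typing import List


def starting_flags(real_time, nkp=128):
    # type: (bool, int) -> List[str]
    """Build the suggested starting flags for a model with poor rollback behavior."""
    # --sync=5 is the optimistic real-time scheduler, --sync=3 the traditional one.
    sync = 5 if real_time else 3
    # Suggested starting GVT interval differs by scheduler.
    gvt_interval = 32 if real_time else 128
    return [
        "--sync=%d" % sync,
        "--gvt-interval=%d" % gvt_interval,
        "--batch=1",                 # 1 or 2 is suggested for either scheduler
        "--nkp=%d" % nkp,            # per PE, not for the whole simulation
        "--max-opt-lookahead=1000",  # suggested starting value
    ]


print(" ".join(starting_flags(real_time=True)))
print(" ".join(starting_flags(real_time=False)))
```

From there, adjust one option at a time (for example, lowering --gvt-interval or raising --nkp) so the effect of each change on rollbacks is easier to see.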