To bring things full circle: the cross-entropy loss is the KL divergence. So intuitively, when you're minimizing cross-entropy loss, you're trying to minimize the "divergence" between the true distribution and your model distribution.
This intuition really helped me understand CE loss.
Cross-entropy is not the KL divergence. Cross-entropy has an additional term, the entropy of the data distribution, which is independent of the model: H(p, q) = H(p) + D_KL(p ‖ q). You're right, though, that minimizing one is equivalent to minimizing the other.
Yes, you are totally correct, but I believe that term is effectively dropped from the cross-entropy loss used in machine learning, because it is a constant that does not contribute to the optimization.
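To make the relationship concrete, here is a minimal numeric sketch (the distributions p and q below are made-up examples) showing that H(p, q) = H(p) + D_KL(p ‖ q), where the H(p) term is constant with respect to the model:

```python
import numpy as np

# Made-up example distributions over 3 classes (both sum to 1).
p = np.array([0.7, 0.2, 0.1])   # "true" data distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

cross_entropy = -np.sum(p * np.log(q))      # H(p, q)
entropy       = -np.sum(p * np.log(p))      # H(p), constant w.r.t. the model
kl_divergence = np.sum(p * np.log(p / q))   # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q), so minimizing cross-entropy over q
# minimizes the KL divergence; the H(p) term never changes.
assert np.isclose(cross_entropy, entropy + kl_divergence)
```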