Could you elaborate? I thought DINO took images and outputted segmented objects? Or do you mean that your first step was something like a yolo model to get bounding boxes and you are just using dino to segment to make the classification part easier?
We did indeed get bboxes from YOLO to identify "here is a traffic sign", "here is a traffic light", etc. Then we cropped out those objects of interest and took the DINOv2 embeddings of each crop.
We're not using it to create segmentations (there are YOLO models that do that, so if you need a segmentation you can get it in one pass); we just use DINOv2 to get a single vector representing each crop.
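Roughly this shape of pipeline, as a minimal sketch. The model names, weights file, and the 224px resize are assumptions for illustration, not our exact setup:

```python
import torch
from PIL import Image
from torchvision import transforms
from ultralytics import YOLO

# Detector gives boxes + classes ("traffic sign", "traffic light", ...);
# DINOv2 turns each crop into a single embedding vector.
detector = YOLO("yolov8n.pt")  # placeholder weights
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

# Standard ImageNet-style preprocessing; 224 is divisible by DINOv2's 14px patches.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed_detections(image_path):
    img = Image.open(image_path).convert("RGB")
    records = []
    for result in detector(image_path):
        for box in result.boxes:
            x1, y1, x2, y2 = (int(v) for v in box.xyxy[0].tolist())
            crop = img.crop((x1, y1, x2, y2))
            with torch.no_grad():
                emb = dinov2(preprocess(crop).unsqueeze(0))[0]  # (384,) for vits14
            records.append({
                "class": result.names[int(box.cls[0])],
                "bbox": (x1, y1, x2, y2),
                "embedding": emb.numpy(),
            })
    return records
```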
Our goal was not only to know "this is a traffic sign", but also to do multilabel classification like "has graffiti", "has deformations", "shows discoloration", etc. If you store those embeddings, it becomes pretty trivial (and hella fast) to hand them off to a bunch of data scientists so they can let loose all the classifiers in sklearn on them. See [1] for a substantially similar example.
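The downstream sklearn side can then be as simple as something like this; the file names and the three example labels are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# X: stored DINOv2 embeddings, shape (n_crops, 384) for vits14.
# Y: binary label matrix, one column per condition,
#    e.g. [graffiti, deformation, discoloration].
X = np.load("sign_embeddings.npy")        # hypothetical file names
Y = np.load("sign_condition_labels.npy")

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# One logistic regression per label; swap in any sklearn classifier here.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)

print(f1_score(Y_test, clf.predict(X_test), average="macro"))
```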