
Consider a feature from earlier in the keynote: the thing Notes (and Math Notes) does now where it fixes up your handwriting into a facsimile of your handwriting, with the resulting letters then acting semantically as text (snapping to a baseline grid; being reflowable; being interpretable as math equations) but still having the kind of long-distance context-dependent variations that can't be accomplished by just generating a "handwriting font" with glyph variations selected by ligature.

They didn't say that this is an "AI thing", but I can't honestly see how else you'd do it other than by fine-tuning a vision model on the user's own handwriting.




I didn't see the presentation, but judging by your description, this is achievable with in-context learning.


For everything other than handwriting, I don't think the LoRAs are fine-tuned locally.


Well, here's another one: they promised that your local (non-iCloud) photos don't leave the device. Yet they will now — among many other things they mentioned doing with your photos — allow you to generate "Memoji" that look like the people in your photos. Which includes the non-iCloud photos.

I can't picture any way to use RAG to do that.

I can picture a way to do that that doesn't involve any model fine-tuning, but it'd be pretty ridiculous, and the results probably wouldn't be very good either. (Load a static image2text LoRA tuned to describe the subjects of photos; run that once over each photo as it's imported/taken, and save the resulting descriptions. Later, whenever a photo is classified as a particular subject, load up a static LLM fine-tune that summarizes all the descriptions of photos classified as subject X so far into a single description of the platonic ideal of subject X's appearance. Finally, when asked for a "memoji", load up a static "memoji" diffusion LoRA, and prompt it with that subject-platonic-appearance description.)
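Roughly, that pipeline would look something like the sketch below, using public models as stand-ins. The model names, the memoji LoRA path, and the summarization prompt are all placeholders; nobody outside Apple knows what the on-device stack actually is.

    import torch
    from transformers import BlipProcessor, BlipForConditionalGeneration, pipeline
    from diffusers import StableDiffusionPipeline

    # 1. Describe each photo once, as it's imported/taken.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def describe(image):
        inputs = processor(image, return_tensors="pt")
        out = captioner.generate(**inputs, max_new_tokens=50)
        return processor.decode(out[0], skip_special_tokens=True)

    # 2. Collapse all descriptions of subject X into one "platonic" description.
    summarizer = pipeline("text-generation", model="some-local-llm")  # placeholder model

    def platonic_description(descriptions):
        prompt = ("Summarize these descriptions of the same person into one "
                  "description of their typical appearance:\n" + "\n".join(descriptions))
        return summarizer(prompt, max_new_tokens=100)[0]["generated_text"]

    # 3. Prompt a static memoji-style diffusion LoRA with that description.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
    pipe.load_lora_weights("memoji-style-lora")  # hypothetical static style LoRA

    def memoji(description):
        return pipe(prompt="memoji portrait, " + description).images[0]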

But really, isn't it easier to just fine-tune a LoRA for a regular diffusion base model (one pre-trained on photos of people) by feeding it your photos and their corresponding metadata, including the names of the subjects in each photo; and then load up that LoRA alongside the (static) memoji-style LoRA, and prompt the model with those same people's names plus the "memoji" DreamBooth keyword?
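The inference side of that is just stacked LoRAs, something like the sketch below (diffusers conventions as a stand-in; the LoRA paths, adapter names, and trigger word are made up, and the DreamBooth-style training run over the photo library isn't shown):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

    # LoRA fine-tuned on-device from the user's photos + subject names.
    pipe.load_lora_weights("subject-lora", adapter_name="subjects")
    # Static memoji-style LoRA that ships with the OS.
    pipe.load_lora_weights("memoji-style-lora", adapter_name="memoji")

    # Activate both adapters and prompt with the person's name plus the
    # style trigger word.
    pipe.set_adapters(["subjects", "memoji"], adapter_weights=[1.0, 0.8])
    image = pipe(prompt="memoji of Alice, smiling").images[0]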

(Okay, admittedly, you don't need to do this with a locally-trained LoRA. You could also do it by activating the static memoji-style LoRA and then training a textual-inversion embedding that locates the subject in the memoji LoRA's latent space. But the "hard part" of that is still the training, and it's just as costly!)
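At inference time that alternative only swaps the subject LoRA for a learned token, something like the following (again diffusers conventions as a stand-in, with placeholder file names; the expensive on-device embedding training itself is omitted):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
    pipe.load_lora_weights("memoji-style-lora")  # static style LoRA

    # Embedding trained on-device to reproduce the subject inside that
    # LoRA's latent space.
    pipe.load_textual_inversion("alice-embedding.safetensors", token="<alice>")
    image = pipe(prompt="memoji of <alice>, waving").images[0]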


That's going to be something similar to IP-Adapter FaceID (https://ipadapterfaceid.com). Basically, you use the same facial-structure representation you'd use for face recognition (which of course Apple already computes on all your photos), together with some additional feature representations, to guide the image generation. No need for additional fine-tuning. A similar approach could likely be used for handwriting generation.
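For reference, the open-source version of that flow looks roughly like the sketch below, using the public diffusers IP-Adapter integration as a stand-in for whatever Apple would run on-device (the FaceID variant conditions on a face-recognition embedding rather than a raw reference image, which I've skipped here):

    import torch
    from diffusers import StableDiffusionPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

    # Attach the IP-Adapter image encoder / projection weights.
    pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                         weight_name="ip-adapter_sd15.bin")
    pipe.set_ip_adapter_scale(0.7)  # how strongly the reference face steers generation

    face = load_image("alice_face_crop.png")  # placeholder: a detected face crop
    image = pipe(prompt="memoji-style cartoon portrait",
                 ip_adapter_image=face).images[0]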


I believe this could be achieved by providing a seed image to the diffusion model and generating the memoji based on it. That way, fine-tuning isn't required.
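That's basically an image-to-image pass. A minimal sketch with diffusers (the model name, photo, and strength are arbitrary placeholders):

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

    photo = load_image("alice.jpg")  # placeholder: a photo of the subject
    image = pipe(prompt="memoji-style 3D cartoon portrait",
                 image=photo,
                 strength=0.6).images[0]  # lower strength keeps more of the original face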


Yup, this is pretty much it, and DALL-E and others can do this already.



