You really should look into using the right hardware for the purpose instead. (Disclaimer: I despise Raspberry Pi and their overpriced closed devices, and also HN maniacally trying to use them for stuff they're the wrong choice for.)
A bunch of other SoC manufacturers have working system-sleep implementations, either manufacturer supported or community supported. Never mind much faster boot options (hundreds of ms to kernel).
So you just need to pick one that also supports whatever camera interface you need. For example, any RK3399-based board can be made to boot to a simple userspace in 1-2 s and has working upstream MIPI-CSI camera drivers and ISP. System sleep is ~300 mW, so ~60 mA @ 5 V. Pick one with onboard WiFi if you need that. And it's all open-source software, no binary blobs that can't be optimized.
ESP32 is great, but it simply can't work with the IMX477 camera used in this project. That camera has a resolution of 4072x3176, about 12.9 megapixels, which is way beyond what any ESP32 can handle.
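The back-of-the-envelope arithmetic makes the point concrete (assuming the IMX477's 12-bit raw output; the ESP32 memory figures below are for a classic ESP32 module and the largest commonly available PSRAM):

```python
# Can an ESP32 buffer even one full IMX477 frame? A quick size check.
width, height = 4072, 3176
pixels = width * height                   # total pixel count
raw_bytes = pixels * 12 // 8              # 12-bit raw Bayer, packed

esp32_sram = 520 * 1024                   # internal SRAM on a classic ESP32
esp32_psram_max = 8 * 1024 * 1024         # largest commonly used external PSRAM

print(f"{pixels / 1e6:.1f} Mpx, {raw_bytes / 1e6:.1f} MB per raw frame")
# A ~19 MB frame doesn't fit even in 8 MB of PSRAM:
print(raw_bytes > esp32_psram_max)
```

So a full frame can never sit in ESP32 memory at once, which is exactly why the slice-and-stream idea below is the only plausible approach.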
I can imagine the following should be doable (assuming the IMX477 has its own buffer and doesn't DMA directly):
1) take a picture
2) read some lines
3) stream them via WiFi to some server
4) repeat 2-3 until whole picture is read
5) reconstruct the picture from slices on the server side
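The server side of steps 2-5 could look like this sketch. The slice format is hypothetical (each chunk carries a starting row index plus a band of raw rows), and numpy is assumed on the server:

```python
import numpy as np

WIDTH, HEIGHT = 4072, 3176  # IMX477 full-resolution frame

def reconstruct(slices):
    """Reassemble a frame from (start_row, rows) chunks streamed by the camera node.

    Chunks may arrive out of order; each `rows` array has shape (n, WIDTH).
    """
    frame = np.zeros((HEIGHT, WIDTH), dtype=np.uint16)
    for start_row, rows in slices:
        frame[start_row:start_row + rows.shape[0]] = rows
    return frame

# Quick self-check: chop a synthetic 12-bit frame into 100-row bands and rebuild it
original = np.random.randint(0, 4096, (HEIGHT, WIDTH), dtype=np.uint16)
bands = [(r, original[r:r + 100]) for r in range(0, HEIGHT, 100)]
assert np.array_equal(reconstruct(bands), original)
```

The transport (raw TCP, HTTP chunks, MQTT) doesn't matter much as long as each message carries its row offset, since that makes the stream tolerant of reordering and lets the ESP32 keep only one small band in RAM at a time.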
The sensor doesn’t have a framebuffer (because it’s just a sensor), and the RPi HQ cam is basically just the sensor on a board with a MIPI connector. You might be able to buy a package with an IMX477 sensor plus a microcontroller/FPGA and framebuffer RAM, but that would cost a lot more.