I use EfficientDet and WebSockets. On the LAN, WS can give sub-second latency quite easily.
Unfortunately I don't have a learning accelerator or dedicated NVR machine, so I'm just using tricks like only decoding keyframes and running inference on those, but only if I've detected motion.
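For anyone curious, the keyframe-only trick is just a few lines. A minimal sketch, assuming PyAV and a placeholder camera URL (the motion gate itself is elided):

    import av  # PyAV: pip install av

    container = av.open("rtsp://camera.local/stream")  # placeholder URL
    stream = container.streams.video[0]
    # Tell the decoder to discard everything except keyframes, so we
    # decode roughly one frame per GOP instead of every frame.
    stream.codec_context.skip_frame = "NONKEY"

    for frame in container.decode(stream):
        img = frame.to_image()  # PIL.Image, ready for the motion check
        print("keyframe at t =", float(frame.time or 0))
        # run the cheap motion check here, and only then the detector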
I'd really like to do more with image recognition; edge-computing surveillance has a lot of potential to help people who don't trust the cloud solutions.
I am still waiting for someone to come out with an NLU model that can do the basic home automation tasks most people use Alexa/Google Assistant for, e.g. turn on lights, turn on heat, etc.
I think if you could develop a piece of hardware that could run inference for all of that under $500, a lot of people would ditch the cloud-based voice assistants. Especially if it came with Home Assistant integrations.
Is NLU even needed? It seems like a normal parser generator would be enough; the hard part is all the UI and integrations, since there are bazillions of devices, many of which don't have a local API, and some of which can't have one because they're inherently cloud-bound, like Google Calendar.
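For the narrow command set people actually use, even a few regexes go a long way. A toy sketch (the intent names and phrasings are made up):

    import re

    # Tiny hand-rolled grammar; a real parser generator would just be
    # a more disciplined version of the same idea.
    PATTERNS = [
        (re.compile(r"turn (on|off) (?:the )?(.+)"), "switch"),
        (re.compile(r"set (?:the )?(.+) to (\d+)"), "set_level"),
    ]

    def parse(utterance):
        text = utterance.lower().strip()
        for pattern, intent in PATTERNS:
            m = pattern.fullmatch(text)
            if m:
                return intent, m.groups()
        return None

    print(parse("Turn on the kitchen lights"))  # ('switch', ('on', 'kitchen lights'))
    print(parse("Set heat to 21"))              # ('set_level', ('heat', '21'))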
I think $100 would be about the limit for consumers, if that, but enough of the techies might switch to make it profitable.
I'd be interested in more details of your setup. It sounds like it might be a good approach for something I'm planning to set up in my garden to capture the wildlife. I'd like to use motion detection and MegaDetector to analyse the footage. Currently trying to find some half-decent cameras at a reasonable price.
Not the original poster, but I recommend ZoneMinder (open source). I do the same as the parent: I have an old CPU in a server that does motion detection on multiple h264 HD streams, and only once (zone-based) motion is detected does it feed the footage into an SSDLite_MobileDet object detector running on a PCIe Edge TPU (a cheap Google-made AI accelerator). The same can be done with no accelerator, but I plan on adding many more cameras.
(The zmeventnotification add-on handles the Edge TPU integration.)
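The Edge TPU side of that is pleasantly small. A minimal detection sketch with pycoral, assuming the stock SSDLite MobileDet model from the Coral model zoo (not necessarily my exact setup):

    from PIL import Image
    from pycoral.adapters import common, detect
    from pycoral.utils.edgetpu import make_interpreter

    # Model file from the Coral zoo; paths are placeholders.
    interpreter = make_interpreter(
        "ssdlite_mobiledet_coco_qat_postprocess_edgetpu.tflite")
    interpreter.allocate_tensors()

    img = Image.open("event_frame.jpg").resize(
        common.input_size(interpreter), Image.LANCZOS)
    common.set_input(interpreter, img)
    interpreter.invoke()

    for obj in detect.get_objects(interpreter, score_threshold=0.4):
        print(obj.id, obj.score, obj.bbox)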
Important things to note: streams are recorded at 30 fps, but motion detection runs at only 5 fps, as I found that works better for slow-moving objects. Also, be prepared for the motion detection to run horribly until you tune it in. Many more user-friendly software packages exist, but I recommend ZoneMinder because no other gives so much control over motion detection parameters. I have one camera that looks at a 100 m long field with multiple bushes and trees moving in the wind. It needs to be less sensitive for nearby stuff (bottom of the frame) and a lot more sensitive for far away. It has to exclude rain, snow, switches to IR, etc. Only ZM lets me tweak it properly, and now that I have, it runs with no false positives for days.
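ZoneMinder does the near/far part with per-zone settings in its UI, but just to illustrate the idea outside ZM: a row-weighted score, where a pixel change near the bottom of the frame has to be larger to count, could look like this (the weights are arbitrary):

    import numpy as np

    def weighted_motion(diff):
        # diff: 2D array of per-pixel frame differences (grayscale).
        h = diff.shape[0]
        # Top rows (far field) count more than bottom rows (near field).
        weights = np.linspace(2.0, 0.5, h)[:, None]
        return float((diff.astype(np.float32) * weights).mean())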
I wish ZoneMinder's setup was better - it's frankly abysmal. https://www.how2shout.com/linux/how-to-install-zoneminder-on... - and this may not even be up to date. The official docs are no help, covering only up to 18.04 and not actually working. There are Docker images, but they're deprecated/unsupported. This may all have been de rigueur when ZoneMinder was first being created (in 2002!), but expecting this level of hackery to set up an application in 2023 is somewhat taking the piss. The UI isn't much better - it's the kind of application sysadmins would build for each other, reminding me a lot of a typical router admin app.
It is indeed true; I hated it when I was setting it up, but I love how it works now... So I still recommend it, but only to people who don't mind looking at code when things don't work.
Code is here (specifically under src/thirdparty/iot_devices/NVRChannel).
Basically I stream from RTSP cameras into GStreamer. I have GStreamer constantly convert to MPEG-TS (a wonderful format, because the fixed-size chunks make streaming easy) and output it on a named pipe, plus write the keyframes to ramdisk files.
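Roughly that pipeline, sketched with Gst.parse_launch; the camera URL and pipe path are placeholders, and I've left out the keyframe tee:

    import os
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    FIFO = "/tmp/cam0.ts"  # named pipe the web socket side reads from
    if not os.path.exists(FIFO):
        os.mkfifo(FIFO)

    Gst.init(None)
    # Pull H.264 over RTSP, repackage (no re-encode) into MPEG-TS,
    # and write the stream to the named pipe.
    pipeline = Gst.parse_launch(
        "rtspsrc location=rtsp://camera.local/stream latency=200 "
        "! rtph264depay ! h264parse ! mpegtsmux "
        "! filesink location=" + FIFO)
    pipeline.set_state(Gst.State.PLAYING)
    GLib.MainLoop().run()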
The TS stream I output via a WebSocket and use a JS player to decode it (see kaithem/data/modules/Beholder for the player code).
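The server half of that can be tiny. A hedged sketch with the websockets library (v11+), leaning on the fact that TS packets are a fixed 188 bytes, so the player can resync from any aligned chunk; pipe path and port are placeholders:

    import asyncio
    import websockets

    TS_PACKET = 188          # MPEG-TS packets are always 188 bytes
    CHUNK = TS_PACKET * 64   # send in packet-aligned chunks

    async def stream_ts(ws):
        with open("/tmp/cam0.ts", "rb") as pipe:
            while True:
                data = await asyncio.to_thread(pipe.read, CHUNK)
                if not data:              # writer gone; wait for it
                    await asyncio.sleep(0.1)
                    continue
                await ws.send(data)

    async def main():
        async with websockets.serve(stream_ts, "0.0.0.0", 8765):
            await asyncio.Future()  # run forever

    asyncio.run(main())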
But I also output the video to files in the ramdisk continually and keep the last few segments. When record is triggered, I copy all the existing segments over (to catch things from before recording started), start copying any new ones, and start creating an HLS playlist file. When you go to play a file in the web UI, I use an HLS player directly.
This allows playback while a session is still being recorded, leaves room for future metadata in the file, and lets me use .ts for everything.
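The playlist itself is just a text file, so "playback while still recording" mostly means regenerating it without the end tag. A sketch (segment durations are hard-coded here; the real code presumably tracks them):

    from pathlib import Path

    def write_playlist(segment_dir, target_duration=2):
        # Rebuild an HLS playlist over whatever .ts segments exist so
        # far. Omitting #EXT-X-ENDLIST makes players treat it as live,
        # so it works while the recording is still growing.
        segments = sorted(Path(segment_dir).glob("*.ts"))
        lines = ["#EXTM3U", "#EXT-X-VERSION:3",
                 "#EXT-X-TARGETDURATION:%d" % target_duration,
                 "#EXT-X-MEDIA-SEQUENCE:0"]
        for seg in segments:
            lines.append("#EXTINF:%d.0," % target_duration)
            lines.append(seg.name)
        (Path(segment_dir) / "index.m3u8").write_text(
            "\n".join(lines) + "\n")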
Keyframes go into a motion algorithm that uses PIL to perform an erosion operation and remove small bits of noise, on the assumption that real motion will be connected pixels. If motion is seen, tflite-runtime runs the EfficientDet model. This is probably the weakest part of the system: since I only look at keyframes and the model is not very accurate, I occasionally miss things.
The results get post-processed, because sometimes the model sees things that aren't there. It could definitely use better modeling.
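The tflite-runtime call itself is only a few lines. A minimal sketch (model and image paths are placeholders):

    import numpy as np
    from PIL import Image
    from tflite_runtime.interpreter import Interpreter

    # Any EfficientDet-Lite .tflite export works here.
    interpreter = Interpreter(model_path="efficientdet_lite0.tflite")
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    _, height, width, _ = inp["shape"]

    img = Image.open("keyframe.jpg").convert("RGB").resize((width, height))
    interpreter.set_tensor(
        inp["index"], np.expand_dims(np.asarray(img, np.uint8), 0))
    interpreter.invoke()

    # Output order (boxes/classes/scores/count) varies between exports,
    # so inspect get_output_details() rather than assuming it.
    for d in interpreter.get_output_details():
        print(d["name"], interpreter.get_tensor(d["index"]).shape)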
The motion estimator value is exposed as a tag point for other automation triggers in real time.
I'm using Amcrest at the moment, but they have some big ethical problems, last I heard, so next time I'd probably go with TP-Link Tapo if they still support RTSP.
What are you doing motion detection with? I'm using dvr-scan (which wraps opencv) right now, and it's both slow and hard to tune.
And are you using zones, or do you just have filtering tuned so that ambient movement of leaves etc. doesn't trigger an event? It seems to me that since a camera's viewpoint is static, you ought to be able to train it to identify and disregard "normal" movement, but I haven't dug far enough to figure out how to do that yet.
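Not a full answer, but OpenCV's adaptive background subtractors get partway to "learn the normal movement": MOG2 keeps a per-pixel statistical background model, so repetitive motion gets absorbed into the background over time. A sketch (URL and threshold are arbitrary):

    import cv2

    subtractor = cv2.createBackgroundSubtractorMOG2(
        history=500, detectShadows=False)

    cap = cv2.VideoCapture("rtsp://camera.local/stream")  # placeholder
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)  # nonzero where pixels deviate
        if cv2.countNonZero(mask) > 5000:  # crude "enough pixels" gate
            print("motion")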
I'm using EfficientDet for object detection, which is not that great but pretty fast on a GPU.
I don't have any zones, but I do have a custom detection algorithm. Since I only look at a frame every few seconds, I went with an algorithm a bit more optimized for that application: it has to detect motion with only one frame, or it won't get another chance for several seconds and could miss the event.
Basically I take the difference between this keyframe and the last one, then do an erosion to remove pixel noise and tiny movement.
After the erosion I take the average value of the pixels and discard any pixels not significantly above it, to get rid of small uniform lighting changes and really noisy artifacts in low light.
Then I square every pixel and take the square root of the mean over the whole image, i.e. an RMS motion score.
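Pulled together, that score looks roughly like this with PIL and numpy (the threshold factor is my guess; the real version lives in NVRChannel):

    import numpy as np
    from PIL import Image, ImageChops, ImageFilter

    def motion_score(prev, cur):
        # Absolute difference between consecutive keyframes (grayscale).
        diff = ImageChops.difference(cur.convert("L"), prev.convert("L"))
        # Erosion (min filter): lone noisy pixels vanish, while real
        # motion, being connected blobs of pixels, survives.
        diff = diff.filter(ImageFilter.MinFilter(3))
        arr = np.asarray(diff, dtype=np.float32)
        # Drop pixels not well above the frame's own average change, to
        # reject uniform lighting shifts and low-light noise.
        arr[arr < 2.0 * arr.mean()] = 0.0  # the 2.0 factor is a guess
        # Square every pixel, then take the sqrt of the mean: an RMS.
        return float(np.sqrt((arr ** 2).mean()))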
I forgot how complicated this was and how many tweaks I added!
As far as I have seen, most AI solutions focus on object detection in a single frame. Would temporal memory, i.e. video-level detection, increase the confidence a lot? I haven't seen any solutions that understand larger context over multi-second timespans.
The real kicker would be for it to integrate a 3D model of what it's looking at. But that would require some heuristics about the world, which would probably need a different kind of training data than just a bunch of images. Maybe if/when 3D scans and the corresponding 2D images can be acquired en masse together, or if it could be done in a simulated environment with virtual cameras?