I'm doing exactly this, but will use it with a Raspberry Pi with a 7-inch touch screen as my doorbell. Someone hits the link on the screen, it hits the server, the server texts me a link to join the video session, and that's it really. I got the core code going (I used a simple Tornado [Python] implementation as it has WebSockets built in).
This looks pretty straightforward and makes me wonder if I'm doing too much work creating and juggling two RTCPeerConnection connections. I will try your approach.
This code works fine in iOS Safari (and desktop Chrome, Firefox, and Safari). I've seen a few comments on how mobile Safari has pitfalls or is the new IE, but there isn't any Safari-specific code there. It all just works.
To be fair, this is a very simplistic project just to get the 2-way chat working. There are no fallbacks for anything that doesn't work correctly. There are also numerous edge cases, especially on iOS and Safari, that would add 1,000 more lines of code to properly account for.
Not to mention all the features that people actually want, like muting, toggling video, and noise detection/cancellation.
So yeah, setting up a P2P video chat in 2021 is somewhat easy. Until it's not.
It makes sense that video chat would be complicated though. Skype came out, what, 2003? And it didn't start becoming popular until years later. FaceTime came out 2010. I would bet the vast majority of people had their very first video chat experience some time in the past decade.
So of course it's hard. Nothing is built with video chat in mind, especially nothing that's existed for 30+ years like the Web. Our solutions are janky and feel bolted-on because they are.
Also, I think video (especially live-streamed video) is hands-down the hardest format to work with in computing. It's simultaneously network, disk, memory, and processor intensive, and doubly so with 2+ streams at the same time. We try to fix some of this with compression, but that just makes the codecs more complex, which makes it harder to work with...
Truth is though, you could "just add video chat," if you accept using a video chat vendor, of which there are probably hundreds (WebEx, Google Meet, Microsoft Teams, Discord, off the top of my head). But that means offloading the complexity to someone else. In many cases that's the right call. In OP's case this was clearly meant to be a learning experience so rolling something DIY is of course acceptable. Hard to estimate, maybe, but of course it would be hard to estimate something you don't know anything about and have never done before. Would a 4th grader be good at estimating how long it would take them to learn enough abstract algebra to start publishing papers on it?
I'm not that convinced. I mean yes, if you literally just estimate "add video chat" without any research or further clarification, but that'd get you in hell in every other discipline too.
EDIT: I guess one part might be that people are less likely to recognize specializations than in other disciplines?
You might have a big shed, but it wouldn’t be a very complicated one. Electrical supply is extremely well-abstracted and clearly defined. Software really is worse than other disciplines in how complexity leaks out everywhere.
Blogposts like this don't really help -- someone knocks up a basic WebRTC implementation for a weekend project, reckons it's better than what the specialists do and wants to show it off. That puts the message out that WebRTC is easy, when in reality scaling it to production usage with a full feature set has a hell of a lot of challenges.
It would be better if this sort of thing was heavily caveated ("this is the Hello World of WebRTC") because otherwise a lot of people (non-technical types, junior engineers) see it and think -- well we can do that, should take us a few weeks max.
You made me curious about how large Pion is, 162k lines! I made sure to delete all test files first.
sean@SeanLaptop:~/go/src/github.com/pion$ find . -type f -name '*.go' | xargs wc
47871 162998 1394063 total
pion/webrtc is the largest package with 58k lines. The other packages (ICE, DTLS, SCTP...) are all around 20k lines each. It feels wrong that WebRTC is so large (and not pushed into sub-packages). I will for sure be digging into that for fun in the next few weeks :)
- A lot of examples: https://github.com/pion/webrtc/tree/master/examples
- A lot of tests - if you exclude `_test.go` and `examples/` you are down to ~58k lines, which is only ~3x bigger than the (much simpler!) ICE and SCTP packages.
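For anyone who wants to repeat the count without chaining `find`, `grep`, and `wc`, here is a rough Python sketch. The function name and the default skip list are my own assumptions, not anything from the Pion repo. It counts lines per file, so a final line without a trailing newline is counted too, which `wc -l` would not do.

```python
import os


def count_go_lines(root, skip_tests=True, skip_dirs=("examples",)):
    """Count lines in .go files under root, optionally skipping tests and some directories."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if not name.endswith(".go"):
                continue
            if skip_tests and name.endswith("_test.go"):
                continue
            with open(os.path.join(dirpath, name), "rb") as handle:
                total += sum(1 for _ in handle)
    return total
```

Calling `count_go_lines(".")` from the checkout root gives the filtered total, and `count_go_lines(".", skip_tests=False, skip_dirs=())` gives the unfiltered one for comparison.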
With a naive exclude via grep -v '_test.go' and grep -v 'examples/*' we are down to:
Not quite as easy as the blog makes out... didn't see any mention of TURN and STUN servers, and multi-peer adds layers of complexity...
To stably build a negotiation system you'll probably need an infrastructure of WebSockets and some kind of NoSQL db to handle identity and other quirks around negotiation...
Example... how do you handle refresh from a new tab or after the connection has dropped... some kind of device signature is probably needed too!!
(We've just spent a year building this for ecommerce @ https://yown.it)
BIG thumbs up for the interest in WebRTC though enormous potential...
WebRTC is complicated. It's been around for a while, and browser support has not been great in the past, which might be why Zoom first used WebSockets for video. They use WebRTC now though, and WebRTC is fine now; it is the standard, but "potential" is not the right word.
Have a look at WebTransport to see a future alternative with potential.
For those who are interested, the technical term is signalling (not negotiation), and there are many providers that will help with that (ably.com, pubnub.com, pusher.com), you don't need to build your own infrastructure. WebSockets is also just one option.
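Whatever transport carries it, the signalling job itself is small: pass offers, answers, and ICE candidates between peers until they can talk directly. A minimal in-memory sketch of that relay step is below. The `SignalingRoom` class, the `to`/`from` message fields, and the callback-based `send` are all assumptions made for illustration; a real service would put WebSockets or a hosted pub/sub provider behind `send`.

```python
from typing import Callable, Dict

Send = Callable[[dict], None]


class SignalingRoom:
    """Relays opaque signalling messages between the peers in one room."""

    def __init__(self) -> None:
        self._peers: Dict[str, Send] = {}

    def join(self, peer_id: str, send: Send) -> None:
        self._peers[peer_id] = send

    def leave(self, peer_id: str) -> None:
        self._peers.pop(peer_id, None)

    def relay(self, sender_id: str, message: dict) -> int:
        """Forward to message['to'] if set, else to every other peer. Returns deliveries."""
        target = message.get("to")
        delivered = 0
        for peer_id, send in list(self._peers.items()):
            if peer_id == sender_id:
                continue
            if target is not None and peer_id != target:
                continue
            # Stamp the sender so the receiver knows whom to answer.
            send({**message, "from": sender_id})
            delivered += 1
        return delivered


# Two peers whose "connections" are just lists, to show the flow.
alice_inbox, bob_inbox = [], []
room = SignalingRoom()
room.join("alice", alice_inbox.append)
room.join("bob", bob_inbox.append)
room.relay("alice", {"type": "offer", "sdp": "<offer sdp>"})
room.relay("bob", {"type": "answer", "sdp": "<answer sdp>", "to": "alice"})
```

The server never needs to understand the SDP or candidates it forwards, which is why an off-the-shelf pub/sub service can fill this role.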
Using an SFU/MCU is almost a requirement for multi-person calls, and becomes more important for bigger groups.
I had a look at yown.it, I don't know what it does, your description of it is a bit vague. Those problems you mention are not hard to solve: "device signature"? You just set a cookie. Connection dropped? Cookie got you covered. New tab? Cookie got you covered. Refresh? Cookie got you covered.
Essentially we enable in-browser comms (including but not limited to WebRTC for video and audio streams on top of storefronts).
Given we allow anonymous connections, we need to associate each WebRTC connection with user defined data (read user profile). It's not quite as simple as "a cookie" because one user can have multiple devices, updated user information has to sync across the other connections and for a smooth experience you have to have synced connection statuses.
We did look at syncing all this with RTC data channels, problem... you can't get message history and you also can't depend on the channel until after a successful negotiation, which again for us is only part of the larger infrastructure...
This forces the use of a parallel comms system such as WebSockets, allowing for event-based synchronisation as well as the organisation of the WebRTC metadata both pre and post connection...
Most people don't want "naked JavaScript" with two faces on it, and WebRTC is a fantastic tool for video and audio streaming, however it is limited in its wider use (which is perfectly fine, it does enough!)...
I think the problem is that people associate "video chat" with simply the media streaming, whereas the reality is that integrating it into a feature-rich front end framework is significantly more complicated, and not simply a case of "adding a cookie".
The difference between the solutions you posted and websockets is as far as I can tell, "your own websockets" or "pay someone else to run your websockets".
What do you mean by anonymous connections? Without them being logged in and you actively tracking them, it is anonymous. You'd be reinventing the wheel to de-anonymize the user if you want to track users across devices, which is certainly not anonymous: existing companies use advertising IDs or cookies, depending on the problem. There is no way you can identify users across devices (or solve this problem) better than Google and Facebook, since you run in 1 application and they run in almost all of them.
"We did look at syncing all this with RTC data channels": that's when you use a reliable service with additional functionality like history and presence, not WebRTC data channels, which might be why you struggled. It sounds like you should be using WebSockets for this type of data.
It sounds like you're trying to build chat for ecommerce websites, but isn't that Intercom or tidio.com (free tier alternative)? Agora is lower level, but also solves these problems and more: messaging, audio, video calls. I don't think any of these offer cross-device identification without having users log in on all their devices.
Since you advertised yourself as a solution to some problem, I first wanted to find out what the problem was, and then see how you are solving it. I don't know either at the moment (I've read the Product Hunt page too). I did visit the yown.it website, and still didn't understand. Now I have read that blog post, and I still don't understand. That blog post served to explain that you didn't read enough about WebRTC before trying it. You didn't know that WebRTC doesn't specify signalling, but this is quite literally a basic concept in WebRTC; have a read of the introduction section. https://www.w3.org/TR/webrtc/#introduction
I tried this myself too and when I try p2p with 4 people, out of 10 tests about 50% of the time I won't be able to see all 4 people or someone wouldn't be able to see all 4 people.
It was really hard to make p2p work and debugging the ice connections was even harder.
WebRTC + networking is frustrating. IMO it is a leaky abstraction. There was a hope that ICE+TURN would work everywhere and users would never need to worry. That isn't true so we need to do a better job educating developers about what/why things went wrong.
I am working on an open source book that includes a WebRTC networking chapter[0]. Would love your opinions/feedback on whether this would have actually been helpful when learning this stuff!
I too experimented with a p2p golang webchat setup. All the jargon was confusing and very hard to look up. This post has already given me much more clarity!!
iOS and Safari are riddled with WebRTC bugs like this. Sounds similar to my experience. Everything consistently works great in Chrome and Firefox and then only kinda works in Safari. Worst browser on the planet.
I've jokingly referred to Safari as SafarIE for the last 5 years. It does tick all these boxes:
1. Backed by an OS manufacturer that doesn't care about the web
2. Spends more time working on features that suit itself than meeting standards agreed upon by a body of which they're a part.
3. The only sanctioned/allowed browser on their platform (MS didn't even achieve this holy grail)
4. Lagging behind most other popular browsers by years in some cases
But due to it being the ONLY browser that'll run on iOS, I have no choice but to dumb down user experience for it. This year's lovely issue has been MediaRecorder - but supposedly that's made it into the most recent release.
I use Firefox on iOS; it’s even set as my default browser. Curious what you mean by “safari is the only browser that works in iOS”? For in-browser video chat? Genuinely curious, what are the limitations?
Ah I see, thanks, I had no idea (I assume most average users are in the same situation). I assume iPadOS is the same, and only macOS allows a true non-Safari browser.
iOS binaries that are not signed by Apple are not permitted to mark memory pages as executable if they have previously been writable.
This restriction makes exploiting buffer overruns very difficult on iOS, which is particularly important as Objective-C doesn't give you much help avoiding them.
However, you can't write a runtime compiler unless you can generate bytecode (write) and then execute it, and nobody has found a way to write a performant JavaScript or CSS engine without some form of runtime compilation.
So, Apple does allow you to write your own browser backends, but they won't give their signature, which would permit you to use riskier techniques to gain performance.
In practice, that means any browser not using the Safari engines would be unacceptably slow on the modern web.
Yeah, it made me feel like I had to read the manual but there wasn't any. Seeing such a bad one reminded me how spoiled we are with beautiful abstractions.
That's probably Network Address Translation (NAT), which requires TURN (a fancy name for a central relay for all media) to "punch through". TURN literally stands for "Traversal Using Relays around NAT". And it's just a traditional, centralized, non-p2p fallback for people on paternalistic networks that don't allow them to create UDP connections or TCP connections on any ports other than 80 or 443.
Which, as it turns out, is a lot of users. I've seen estimates in the range of 10 to 20% of users. Which means, for a random selection of 7 users, you pretty much have a 50/50 chance of not being able to peer everyone using just STUN.
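The 50/50 figure falls out of basic probability if you assume each user independently has the same chance of needing TURN (a simplification; real users on the same corporate network are obviously correlated). A quick check, with a hypothetical helper name:

```python
def chance_someone_needs_turn(users: int, turn_rate: float) -> float:
    """Probability that at least one of `users` independent users needs TURN."""
    return 1.0 - (1.0 - turn_rate) ** users


for rate in (0.10, 0.20):
    odds = chance_someone_needs_turn(7, rate)
    print(f"{rate:.0%} TURN rate, 7 users: {odds:.0%}")
# prints:
# 10% TURN rate, 7 users: 52%
# 20% TURN rate, 7 users: 79%
```

So the coin-flip claim matches the low end of the estimate; at 20% it is closer to four calls in five.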
I think it is likely also bandwidth and cpu issues with mesh peer-to-peer.
Unless you're capping the video bitrate, the browser will try to use whatever the browser's default target is, for each connection. On Chrome that's 3mb/s, which is a lot of network bandwidth, and turns out to be a lot of cpu as well just shuffling those packets through the encoding->sending->bandwidth-estimation and receiving->decoding->rendering pipelines.
Capping the video bitrate is more complicated and confusing than it should be. It's better now that the browser implementations are all more or less closing in on "WebRTC 1.0" compliance. But you still need to reach into either the raw SDP you are exchanging during signaling, or the RTCPeerConnection objects, and set the encoding bitrate target.
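As a rough illustration of the "reach into the raw SDP" route, here is a sketch that a signalling server or client could apply to an SDP string before passing it on. It adds a `b=AS:` (kilobits per second) line to each `m=video` section, placed after any `i=`/`c=` lines as the SDP line ordering expects. The function name is made up, and how each browser treats `b=AS` versus `b=TIAS` has varied over time, so treat this as an assumption to verify rather than a guaranteed cap. The other route is setting `maxBitrate` on the sender's encoding parameters in the browser.

```python
def cap_video_bitrate(sdp: str, kbps: int) -> str:
    """Return a copy of `sdp` with one b=AS:<kbps> line in every m=video section."""
    cap = f"b=AS:{kbps}"
    out = []
    in_video = False
    pending = False  # True while the current video section still needs its b= line
    for line in sdp.splitlines():
        if line.startswith("m="):
            if pending:
                out.append(cap)  # previous video section ended before we placed it
            in_video = line.startswith("m=video")
            pending = in_video
        elif in_video and line.startswith("b=AS:"):
            continue  # drop any existing cap so ours is the only one
        elif pending and not line.startswith(("i=", "c=")):
            out.append(cap)  # b= lines belong right after the i=/c= lines
            pending = False
        out.append(line)
    if pending:
        out.append(cap)
    return "\r\n".join(out) + "\r\n"
```

Audio sections and session-level lines pass through untouched, and running it twice with different values leaves only the latest cap in place.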
The SaaS platforms that offer WebRTC APIs and infrastructure all do a lot of work under the covers to set bitrate caps, track constraints (resolution, for example), and other bits and pieces of WebRTC config that work well on a wide variety of networks, devices, and browsers.
The only numbers I have ever seen published are Whereby's[0]: they saw that 17% used TURN.
There is a little more nuance than just paternalistic networks though. In some cases, like NAT mapping exhaustion, you just can't give an individual user multiple long-lived mappings. Address-dependent filtering/mapping also makes sense in some cases. It makes P2P harder, but does give you the ability to provide your users more sessions at least!
We see about the same numbers Whereby does at Daily, globally across our whole user base. Bounces around a little but is usually just under 20%.
Way more for customers that are mostly serving corporate users, of course (firewalls). And more for mobile-heavy user populations.
Actually, that's a good reminder that it would be nice to understand the mobile data networks breakdown in more detail. Most of the US mobile data networks require TURN, as far as I remember when I last looked at this. But I don't know if that's true everywhere in the world.
Thinking about it, v4 addresses and oil have a lot in common. The exhaustion/depletion is coming, but we keep finding ways to circumvent it. For oil you got fracking and sand. For v4 you see wider adoption of wide/carrier-grade NAT. In both cases it temporarily solves the supply problem really effectively. However it also ruins the environment.
I always assumed everyone is behind NAT. You're saying only 10 to 20% of people are, and therefore only they need TURN. I'd love to see where you got that number.
If I were to guess, the problem GP is facing is bandwidth: a mesh network's total bandwidth grows quadratically with the number of participants. For each user, the bandwidth is linear: N more people requires N more streams' worth of bandwidth. This is fine for downloads, but uploading N more streams can be much more challenging on certain networks.
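To put rough numbers on this: in a full mesh each participant sends one stream to every other participant, so per-user upload grows linearly while the number of streams across the whole call grows with N(N-1). The sketch below assumes every stream runs at the same bitrate and uses the roughly 3 Mb/s default mentioned elsewhere in the thread; the helper name is made up.

```python
def mesh_load(participants: int, bitrate_mbps: float) -> dict:
    """Estimate load for a full-mesh call where everyone sends to everyone else."""
    peers = participants - 1
    return {
        "connections_total": participants * peers // 2,  # one per pair of users
        "streams_total": participants * peers,           # one per direction per pair
        "upload_per_user_mbps": peers * bitrate_mbps,
        "download_per_user_mbps": peers * bitrate_mbps,
    }


for n in (2, 4, 8):
    print(n, mesh_load(n, 3.0))
```

For 4 people at 3 Mb/s that is 9 Mb/s of upload per user, which is already more than many home and mobile uplinks can sustain.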
He's mistaken that NAT always requires TURN. Consumer NAT typically still allows incoming UDP, using STUN/punch-through, or TCP with UPnP support. He maybe meant to talk only about more restrictive NAT situations or campus/corporate/ISP/nation-state/scientology-compound firewalls.
I'm not sure about UDP hole punching and how it relates to WebRTC; I don't see it being talked about much. In general, hole punching is a rare thing to hear about, and I cannot find many resources on it.
And UPnP (Universal Plug and Play) sounds like it's for device discovery in the same local network, so again, it doesn't sound related to WebRTC; we can connect directly with each other on the same local network anyway.
The 10-20 percent would reflect people with symmetric NAT, which is rather common with mobile networks, corporate NAT, etc. Symmetric NAT requires TURN relays. Typical home router configurations are not symmetric NAT, and usually work with just STUN.
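In practice this distinction shows up in the list of ICE servers handed to each client: a STUN entry covers the typical home router case, and TURN entries are the relay fallback for the symmetric NAT and firewall cases. Below is a sketch of a server-side helper that builds that list in the shape the browser's `RTCPeerConnection` configuration expects (`iceServers` with `urls`, `username`, `credential`). The hostnames and credentials are placeholders, and the helper name is made up.

```python
import json


def ice_config(stun_url, turn_urls, username, credential):
    """Build a dict shaped like the browser's RTCConfiguration."""
    return {
        "iceServers": [
            # STUN only tells a client its public address; no media flows through it.
            {"urls": stun_url},
            # TURN relays media, so it needs credentials and real bandwidth.
            {"urls": list(turn_urls), "username": username, "credential": credential},
        ]
    }


config = ice_config(
    "stun:stun.example.org:3478",
    [
        "turn:turn.example.org:3478?transport=udp",
        # TLS over TCP on 443 is the usual last resort on locked-down networks.
        "turns:turn.example.org:443?transport=tcp",
    ],
    username="demo-user",
    credential="demo-secret",
)
print(json.dumps(config, indent=2))
```

The JSON can be sent to the page over the signalling channel and passed to the peer connection constructor as-is.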
The worst one for me so far is mDNS not working on my local network so the one circumstance you should basically be able to guarantee an easy P2P connection doesn’t work.
I wonder if WebRTC was built to be intentionally complex, or if a better standard would make adoption easier, perhaps in conjunction with a standard server (like we have httpd for HTML).
Replace `getUserMedia` with `getDisplayMedia`.
If you are looking for a native option, use [0] or [1] and you can send anything from ffmpeg to WebRTC. ffmpeg itself doesn't support WebRTC, so you need to use something for the last part.
This is the version of the JS code that I got going (I couldn't reason about straight inline scripting, so I made unnecessary classes; you don't need them): https://gist.github.com/emehrkay/1ea9a87a91e00b27843d9b71a3c...
You also need to tell nginx to serve the wss connection with HTTP/1.1, or the handshakes fail:
```
location /websocket/path {
    proxy_pass http://whateverSiteDotCom;
    proxy_http_version 1.1;
    proxy_set_header Connection "upgrade";
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Origin '';
}
```