This project owes its existence to my interest in WebRTC (actually, it was because I was avoiding my capstone project).

The most important knowledge came from:

MDN covered most of the required knowledge, while Google's sample code demonstrated the overall logic behind the scenes. Having learned from both, I've written my own implementation. Please visit and give it a try.

Roughly, this project does the following:

Clients exchange session descriptions

SDP (Session Description Protocol) is the standard format for describing a peer-to-peer connection. An SDP payload contains the codec, source address, and timing information for the audio and video.

(Paraphrasing MDN.)

The caller first generates an offer containing its session description, which is transmitted by the signalling server (in many cases a self-hosted backend) to the callee. Once the callee receives the offer, it generates an answer, which the signalling server transmits back to the caller. Now both sides have the other's session description, and they can start trying to contact each other via ICE candidates.

Clients generate, send, and retrieve ICE candidates

Each ICE candidate includes an IP address, a port number, and a transport protocol.

On a typical home IPv4 network, a STUN server is needed so that the client can discover its public address and expose a UDP port over the public internet. This addresses the restrictions imposed by both carrier-grade NAT and LAN firewall policies.

However, if both sides are behind NATs or firewalls that reject connections from unknown source addresses to exposed ports (such as symmetric NAT), the P2P connection will fail, and a TURN server is needed to relay the traffic between the two clients.
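Candidate exchange through the signalling store can be sketched the same way. Here I assume each side polls with a cursor so it only receives candidates it hasn't seen yet; the names and the caller/callee roles are illustrative:

```python
# Sketch of trickle-ICE candidate exchange via the signalling store.
from collections import defaultdict

# candidates[room_id][peer] is the ordered list that peer has published.
candidates: dict[str, dict[str, list[dict]]] = defaultdict(
    lambda: {"caller": [], "callee": []}
)

def add_candidate(room_id: str, peer: str, candidate: dict) -> None:
    # A peer publishes one ICE candidate (address, port, protocol).
    candidates[room_id][peer].append(candidate)

def poll_candidates(room_id: str, peer: str, cursor: int) -> tuple[list[dict], int]:
    # A peer polls for the *other* side's candidates published after `cursor`.
    other = "callee" if peer == "caller" else "caller"
    new = candidates[room_id][other][cursor:]
    return new, cursor + len(new)

# Example: callee publishes two candidates, caller polls twice.
add_candidate("demo", "callee", {"ip": "203.0.113.7", "port": 50000, "protocol": "udp"})
batch, cursor = poll_candidates("demo", "caller", 0)       # one candidate
add_candidate("demo", "callee", {"ip": "10.0.0.2", "port": 50001, "protocol": "udp"})
batch, cursor = poll_candidates("demo", "caller", cursor)  # only the new one
```

The cursor keeps repeated polls cheap: the backend never re-sends candidates a peer already has.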

Backend implementation

It's a fairly straightforward process. I used Python + Flask to handle the HTTP endpoints. Every unique visitor is assigned a randomised user identifier.
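One way to mint such an identifier is a random hex token from the standard library; the post doesn't specify the actual scheme, so this is just an assumption:

```python
# Hypothetical user-id generator: 16 random bytes -> 32 hex characters,
# which is practically collision-free for this use case.
import secrets

def new_user_id() -> str:
    return secrets.token_hex(16)

uid = new_user_id()
```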

The web frontend asks the caller to create a room, in which the backend stores the ICE candidates and session descriptions. The same flow applies to the callee, except they are asked to join a room instead.

To keep the implementation simple, I opted out of WebSockets; instead, the frontend polls the backend at a set interval to retrieve information, and stops polling once the WebRTC connection is established, at which point the backend deletes the room.
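The poll-until-connected loop plus room teardown can be sketched as follows. I've collapsed frontend and backend into one function for illustration, and `is_connected` stands in for whatever the real connection-state check is:

```python
# Sketch of interval polling with a stop condition and room cleanup.
import time

rooms = {"demo": {"answer": {"type": "answer"}, "candidates": []}}

def poll_until_connected(room_id: str, is_connected, interval: float = 1.0,
                         max_polls: int = 30) -> bool:
    # Frontend side: poll the room at a set interval, stop once connected.
    for _ in range(max_polls):
        _ = rooms.get(room_id)          # fetch latest answer/candidates
        if is_connected():
            rooms.pop(room_id, None)    # backend deletes the finished room
            return True
        time.sleep(interval)
    return False                        # gave up; room left for cleanup

# Example with a stand-in check that reports "connected" immediately:
ok = poll_until_connected("demo", is_connected=lambda: True, interval=0)
```

`max_polls` is a guard I added so an abandoned room can't keep a client polling forever; the real project may handle this differently.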

My current implementation only covers one-on-one calls; I have yet to explore video calling, screen sharing, multicast, etc. In addition, the state management on the frontend is very primitive at this stage. What's worse, since UDP traffic is so often throttled by QoS policies (especially on international routes), I've done nothing to re-establish dropped connections, so that's a bummer. Maybe I'll try to finish it and make it somewhat reliable.

I'd like to thank @kegns, @littleboarx, @n3ih7 for the help and testing along the way.