
Why would you need 50 packets per second vs 10? Is 100ms not acceptable but 20ms is?


Default configuration for SIP used to be 20ms. The rationale was that most SIP ran on LANs and inter-campus WANs, which generally had high-bitrate, low-latency connectivity. The lower the packet time window, the sooner the recipient could "hear" your voice, and if a packet were dropped, the impact would be smaller - you'd only lose 20ms of audio vs 100ms. The same applies to high-bitrate but high-latency connectivity (3G, for example) - you want to use the bandwidth to mitigate some of the network-level latency that would add to the audio delay: being "wasteful" to ensure lower latency and higher packet-loss tolerance.

Pointedly - if you had 75ms of one-way latency (150ms RTT) between two parties and used a 150ms audio segment length (ptime), you'd be getting close to the generally accepted 250ms maximum audio delay for smooth two-way communication: the recipient hears your first millisecond of audio roughly 226ms later at best (150ms to fill the packet, 75ms in transit, plus the millisecond itself). And if any packet does get lost, the recipient loses 150ms of your message vs 20ms.
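The trade-off above can be sketched with a toy model (my own simplification: it counts only packetization delay plus one-way transit, ignoring jitter buffers, codec lookahead, and serialization):

```python
def mouth_to_ear_delay_ms(ptime_ms, one_way_latency_ms):
    """First-audio delay: the sender must buffer a full ptime of audio
    before the packet can even be sent, then it transits the network."""
    return ptime_ms + one_way_latency_ms

def loss_impact_ms(ptime_ms):
    """Audio lost when a single packet is dropped."""
    return ptime_ms

# 150ms ptime over a 75ms one-way link: ~225ms before the recipient
# hears anything, and 150ms of speech gone per lost packet.
print(mouth_to_ear_delay_ms(150, 75))  # 225
print(loss_impact_ms(150))             # 150

# 20ms ptime over the same link: 95ms delay, only 20ms lost per drop.
print(mouth_to_ear_delay_ms(20, 75))   # 95
```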

Modern voice apps and VoIP use dynamic ptime (usually via "maxptime", which specifies the highest/worst case) in their protocol for this reason - it lets clients optimize for all combinations of high/low bandwidth, latency and packet loss in real time, since network conditions often change during a call, especially while driving around or roaming between wifi and cellular.
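For reference, these are negotiated as SDP attributes (RFC 4566). A minimal audio media description offering 20ms packets with a 60ms worst case might look like this (port and payload values here are just illustrative):

```
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=ptime:20
a=maxptime:60
```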


> the rationale behind it was actually sourced in the fact that most SIP was done on LANs and inter-campus WAN which had generally high bitrate connectivity and low latency

In addition to that, early VoIP applications mostly used uncompressed G.711 audio, both for interoperability with circuit switched networks and because efficient voice compression codecs weren't yet available royalty-free.

G.711 is 64 kbps, so 12 kbps of overhead are less than 25% – not much point in cutting that down to, say, 10% at the expense of doubling effective latency in a LAN use case.
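The overhead arithmetic is easy to check. Assuming the usual 40 bytes of per-packet headers (IPv4 + UDP + RTP; the parent's 12 kbps figure presumably counts a smaller header set, and layer-2 framing is ignored here):

```python
G711_KBPS = 64
HEADER_BYTES = 40  # IPv4 (20) + UDP (8) + RTP (12)

def overhead_kbps(ptime_ms):
    """Header overhead for a G.711 stream at the given packetization time."""
    packets_per_sec = 1000 / ptime_ms
    return packets_per_sec * HEADER_BYTES * 8 / 1000

for ptime in (20, 100):
    oh = overhead_kbps(ptime)
    print(f"ptime {ptime} ms: {oh:.1f} kbps overhead, "
          f"{oh / (G711_KBPS + oh):.0%} of the total stream")
```

At 20ms that works out to 16 kbps of overhead (~20% of the stream); stretching ptime to 100ms cuts it to 3.2 kbps (~5%) - a modest saving for a fivefold increase in packetization delay.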


> Is 100ms not acceptable but 20ms is?

Yup, pretty much. Doubling it for round trip, 200 ms is a fifth of a second, which is definitely noticeable in conversation.

40 ms is a twenty-fifth of a second, or approximately a single frame of a motion picture. That's not going to be noticeable in conversation at all.

Of course, both of these are on top of other sources of latency, too.


200ms is noticeable but in conversation it's still pretty good, and certainly way better than the average WhatsApp call which is on the order of 0.5-1s.


Anything over 40ms will be noticeable. Just to give you an idea of how sensitive our ears are - there's a max distance certain instruments can sit away from each other in an orchestra pit or they start falling out of sync due to speed-of-sound delay.


To clarify this even further, as someone who professionally plays an instrument that is traditionally placed at the back of an orchestra, you absolutely cannot play by ear: you MUST play by watching a combination of the stick in the conductor’s hand and the bow of the first violinist and cellist at the front. If you play what sounds in sync to you, the conductor and audience will hear you too late; the round trip from the front to the back of a stage, plus the sound traveling through the brass tubes of your instrument, plus the trip from the rear of the stage to the first row of the audience simply takes so long that it will sound noticeably wrong. The same is true for the far sides of an orchestra pit underneath the stage of a musical or opera. It only takes 20 meters/yards to become an issue.


It is generally said that the lowest threshold for people to perceive time delays is around 10-15ms.

The speed of sound is roughly 343 meters per second, which translates to a perceptible delay at a distance difference of about 3.5-5 meters.

Which 100% corresponds with what you are saying. 20 meters is a 58ms-ish delay.

A 200ms delay corresponds to about 70 meters - like having a conversation using one of those accidental sound-projection effects that sometimes happen in large open buildings like sports stadiums.

People talk at a cadence of around 100-200 words per minute. I guess we could call that 300-600 syllables per minute, so about 100-200ms per syllable.

It all kinda lines up.
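The conversions above are just distance divided by the speed of sound:

```python
SPEED_OF_SOUND_M_S = 343  # dry air at ~20 °C

def acoustic_delay_ms(distance_m):
    """Time for sound to travel the given distance, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_S * 1000

for d in (5, 20, 70):
    print(f"{d} m -> {acoustic_delay_ms(d):.0f} ms")
# 5 m  -> ~15 ms (around the perception threshold)
# 20 m -> ~58 ms (the orchestra-stage case)
# 70 m -> ~204 ms (the stadium-echo case)
```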


Yes, 100ms feels horrible. People constantly interrupt each other because they start talking at around the same time and then both say "you go first". Discord has decent latency, and IMO it's a major reason behind their success.



