Emulator Latency

2016-08-03

Latency represents the delay between an event occurring and being acknowledged. For instance, when you press a key in a game, this is the amount of time it takes before the key press is acknowledged, the screen is updated, and the corresponding sound effect is heard.

Latency is an unavoidable issue in general; and emulators by their very nature tend to fare worse than regular games for various reasons.

In this article, I'll discuss the types of latency that exist, the unique challenges to emulation, and why things are as good as they can get already.

The last point is the key reason for this article: every few months, someone new comes along with a 'revolutionary' technique to reduce emulation latency. I've become rather exhausted having this debate, which has led to a rather nasty temperament on my side. So instead of joining the debate endlessly, I'll just explain my reasoning here and leave it at that.

I've been researching and considering this very topic, as I consider it the absolute #1 weakness of emulation over real hardware, for a dozen plus years now. So believe me when I say I've heard it all before. There are no miracle software changes that are going to 'solve' emulator latency.

External Latency #1: hardware input polling

The first source of latency is input itself. When you press a keyboard key, click a mouse button, or press a gamepad button, PC software won't know about this input event instantaneously.

First, USB peripherals tend to have polling rates. Generally speaking, input devices poll at a rate of 125Hz. Or in other words, once every 8 milliseconds.

So even when a PC program asks for the current state of input, if you pressed a button immediately after the device was last polled, that press won't be reflected in what the program gets back.

It's possible to increase USB polling rates. I've seen it pushed as high as 1000Hz. The tradeoff is that it uses more CPU power, and thus drains more battery. The setting is also quite hidden, and users rarely change it.
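The arithmetic behind those figures is easy to sketch. The Python snippet below is only an illustration: the function name `polling_wait_ms` is made up here, and the average assumes presses land at uniformly random moments between two polls.

```python
def polling_wait_ms(rate_hz: float) -> tuple[float, float]:
    """Return the (worst-case, average) wait in ms before a poll sees a press."""
    interval_ms = 1000.0 / rate_hz
    # Worst case: the press lands right after a poll and waits a full interval.
    # Average: half an interval, assuming uniformly distributed press times.
    return interval_ms, interval_ms / 2.0


for rate_hz in (125, 1000):
    worst, average = polling_wait_ms(rate_hz)
    print(f"{rate_hz}Hz: up to {worst:g}ms, about {average:g}ms on average")
```

At 125Hz this gives the 8ms interval mentioned above; at 1000Hz it drops to 1ms.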

External Latency #2: software input polling

These days, input is handled by calling out to various APIs: DirectInput, XInput, RawInput, udev, devd, etc.

This process can incur quite a bit of CPU time as well. Attempting to poll the keyboard state, mouse state, and all attached gamepads can easily eat several milliseconds per call. So it's just not possible to poll every millisecond.

So when you press a button immediately after the program polls inputs, you can be stuck waiting until the next program poll for it to be acknowledged. Worse, this latency can stack with a hardware input polling miss.
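To make the stacking concrete, here is a rough sketch of the worst case. The function name and the 16ms program polling interval are assumptions chosen for illustration (roughly once per frame at 60fps), not measurements.

```python
def worst_case_input_delay_ms(usb_rate_hz: float, program_poll_interval_ms: float) -> float:
    """Worst-case delay when a press misses both the USB poll and the program poll."""
    # Miss 1: the press lands just after the device was polled.
    usb_interval_ms = 1000.0 / usb_rate_hz
    # Miss 2: the report then arrives just after the program polled the OS.
    return usb_interval_ms + program_poll_interval_ms


# Hypothetical program that polls the OS once every 16ms, on a 125Hz device.
print(worst_case_input_delay_ms(125, 16.0))  # 24.0
```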

External Latency #3: hardware audio mixing

Computers these days are heavy on multi-tasking. A user does not expect a computer program to take exclusive access to the sound card, and thus prevent audio from working in their chat clients and other background applications.

Accomplishing this means that the OS holds audio buffers for each application, and mixes samples from all of the applications into a final stream to output to speakers or headphones. The output queue while speaking to hardware also needs some buffering of its own to maintain exact sample rates.

Generally, this can end up being anywhere from 10ms to 40ms of latency.

Again, this is a latency source that is avoidable, but only in limited cases. Currently, I'm only aware of WASAPI exclusive mode, a Windows-only technology with terrible driver support. It's so bad that I've yet to find two systems that act similarly when following the official API.

It's rare to see a program support this, and it's even rarer for it to work all that well.

External Latency #4: software audio mixing

Just like an OS with multiple applications, a game can have multiple streams of audio as well. Think background music and sound effects. Or for emulators, the output from an SNES plus the output from a Super Game Boy peripheral.

This requires another buffering system to queue input samples and mix them into a larger queue that is then fed to the OS audio API.

Once again, this can range from 10ms to 40ms of latency. It's very difficult to minimize this value, because the lower bounds vary widely between systems. Aim too low, and your out of the box experience will be horribly scratchy audio full of pops and crackles.
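For a sense of scale, the latency a buffer adds is just its length divided by the sample rate. The sketch below assumes a 48000Hz output rate, which is a common value but not one taken from this article, and the helper names are hypothetical.

```python
def buffer_latency_ms(frames: int, sample_rate_hz: int) -> float:
    """Latency added by a queue holding `frames` sample frames."""
    return 1000.0 * frames / sample_rate_hz


def frames_for_latency(latency_ms: float, sample_rate_hz: int) -> int:
    """Buffer size (in sample frames) needed to hold `latency_ms` of audio."""
    return round(sample_rate_hz * latency_ms / 1000.0)


# Assumed 48000Hz output: the 10ms to 40ms range maps to these buffer sizes.
for latency_ms in (10, 40):
    print(latency_ms, "ms =", frames_for_latency(latency_ms, 48000), "frames")
```

A smaller buffer means less latency, but also less slack before the queue runs dry, which is where the pops and crackles come from.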

External Latency #5: video displays

It takes time for a picture to get from a video card to a display, and to render fully for you to see the frame.

And since the industry has moved from CRTs to LCDs, this has only gotten worse. LCDs tend to have built-in scalers and other processing logic (like onscreen display overlays and image filters) that result in added latency. I chose my HP ZR30w specifically because it lacks a scaler and OSD, and 'only' adds 30ms of latency compared to most CRT displays.

Good panels with scalers are even worse. Pretty much the only way to reduce this is to sacrifice quality and aim for grossly inferior TN panels full of terrible viewing angles, inconsistent colors, and dithering artifacts.

This is another hard one to pin down, but you can expect an average of 20ms to 100ms for the display itself.

External Latency #6: hardware video processing

Before a frame can be pushed from the video card to the display, the frame has to be rendered and rasterized by the video card itself. For 2D, this usually just means getting the pixels into video RAM. For 3D, it means rendering all of the 3D draw commands. But even 2D can gain a frightening increase in latency by the use of user-programmable pixel shaders to perform software-based processing effects like CRT simulation, scanline simulation, image sharpening, etc.

Even worse are things like vertical sync, page flipping and triple buffering to prevent video tearing artifacts. On a 60fps game, these alone can add 16ms or 32ms of additional latency.
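Those numbers are whole frames of delay. A quick sketch of the arithmetic, with a made-up helper name and 60Hz as the assumed refresh rate:

```python
def buffered_frames_latency_ms(frames: int, refresh_hz: float = 60.0) -> float:
    """Latency from holding `frames` complete frames before they are displayed."""
    return frames * 1000.0 / refresh_hz


# One or two frames queued ahead of the display at 60Hz.
for frames in (1, 2):
    print(f"{frames} frame(s): {buffered_frames_latency_ms(frames):.1f}ms")
```

The exact values are about 16.7ms and 33.3ms; the 16ms and 32ms above are the same figures rounded down.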

An effective counter to this is adaptive sync. But adaptive sync displays are more expensive, require somewhat newer video cards, and are generally not widely available. A certain vendor that Linus Torvalds is quite fond of makes the problem especially bad by needlessly price gouging for their non-standard tech. Even worse, the tech is once again only really available on Windows.

External Latency #7: software video processing

It can take time for software to buffer an entire video frame to send to the video card; or to issue all of the required draw commands so that the video card can start processing the end result.

This stage can add another 16ms of latency at 60Hz in the worst case.

Emulator Specific Latency

Another important point is that emulators are not like regular PC games. They have additional latency constraints that PC games can avoid.

This is yet another needlessly 'controversial' statement that I've had to argue about with people who have never written emulators before, but it's true.

Input Polling

Most games are going to poll input once per frame. But pathological cases can and do exist. In particular, certain devs like to troll emulator authors, and will do things like put out a mini-game that hammers the input polling thousands of times a second if we leave an opening there.

As stated earlier, polling the OS input APIs is quite time consuming. If we call out to it every time the emulated environment asks for input states, we could end up having to make thousands or even millions of calls per second, crashing our framerates down to nothing.

We don't have a choice: we must place an upper limit on the number of times we poll the OS input API per second. Generally, emulators will poll no more than once every 16ms, or about once per frame. And once again, this can stack with the hardware polling delay. We can make reasonable guesses when an emulated game is going to poll inputs (usually in the NMI handler that happens after a frame is fully rendered), but in practice any game is capable of polling inputs at any time it likes.

Audio Sampling

Whereas a regular PC game can rely on native OS mixing controls to handle multiple voices, an emulator focused on accuracy is not so fortunate.

We cannot skew the audio sample rate to whatever is convenient. We can only generate audio samples when the real hardware generates samples. The emulated game simply outputs the finalized stream, or streams (which we must mix ourselves); and then we must feed that into a buffer to push to the OS audio API.

By losing control of the rate that samples are generated, we end up with more latency.

Video Processing

Once again, when we emulate a system, we can't take any shortcuts. For 2D systems, we have to buffer an entire video frame. Trying to push one pixel at a time as it is generated directly into a video buffer would be a recipe for disaster. Modern GPUs are not designed to work this way. Things work best when you stream a large block of data all at once in a tight loop. 3D systems tend to have similar issues. The transformations needed to convert to OS 3D video APIs like Direct3D and OpenGL generally can't be streamed in real-time to a video card, because lots of processing is needed once the entire frame's state is known.

Overlying Theme: Indirection

What all three of the above added delays have in common is that they each represent a layer of indirection, which is exactly what an emulator is: a system running inside another system.

And the cost of this indirection is latency.

Versus Real Hardware

When you look at modern systems, they act more like computers anyway.

But older systems like the SNES would literally blast pixels just in time for the CRT beam cannon, output samples instantly through the DAC to the speakers, poll inputs immediately, etc.

So take the move away from dedicated old hardware, add all of the layers of indirection required by multi-tasking PCs in general, and then add the emulator-specific latency that is required to emulate hardware platforms ... this makes up the total sum of perceived latency in emulators. And it's why emulation can never come close to the responsiveness of real hardware.

Optimizing Latency

There are obviously things an emulator can do that will help or hinder latency. So let's talk about those, in the context of higan.

Input

As stated, we cannot poll the OS input APIs every single time the emulated game requests input. Guessing when to poll is risky too: a wrong guess could have us polling the OS immediately after the emulated game had polled input, which is our worst case scenario.

But considering that 99.9% of games only poll input once per frame, regardless of where ... there's a simple trick that can, in all of these 99.9% of cases, eliminate this source of latency entirely; without destroying performance in the pathological cases.

The idea is to keep track of the amount of time that has passed since the last OS input API polling has taken place. We do this by taking a timestamp accurate to the millisecond (or microsecond) when we poll the OS input API. And now, the next time the emulation tries to poll input, we check this timestamp. If less than N milliseconds have passed, then we abort and use the cached input values. But if N or more milliseconds have passed, then we update the timestamp and actually poll the hardware.

For a game that polls once per frame, N or more milliseconds will always have passed, so long as N is anything less than the number of milliseconds per generated frame of the emulated system. The end result is that we will poll the hardware just in time, right as the emulated game itself needs the input.

This is what higan does. The value chosen for N is thus not an actual latency in most cases, but a rate limiter. Ideally we'd set this to 1ms, but again, there are constraints to invoking the OS input API too frequently. I have profiled various values and have found that, currently on my system, 5ms seems to be the optimal value for this polling interval limiter.

The end result is that, in 99.9% of cases, higan eliminates the emulation input latency completely. And in the pathological case, it will never be worse than 5ms.
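The timestamp check described above can be sketched in a few lines. This is a minimal Python illustration of the idea, not higan's actual code: the class name `RateLimitedInput`, the `poll_os` callback, and the injectable `clock` are all hypothetical, and only the 5ms default comes from the text.

```python
import time


class RateLimitedInput:
    """Polls the OS on demand, but at most once every `min_interval_ms`."""

    def __init__(self, poll_os, min_interval_ms=5.0, clock=time.monotonic):
        self._poll_os = poll_os                # the expensive OS input API call
        self._min_interval_s = min_interval_ms / 1000.0
        self._clock = clock                    # injectable so it can be tested
        self._last_poll = None                 # timestamp of the last real poll
        self._cached = None                    # input state from the last real poll

    def poll(self):
        """Called every time the emulated game reads its input registers."""
        now = self._clock()
        if self._last_poll is None or now - self._last_poll >= self._min_interval_s:
            # Enough time has passed: hit the OS and refresh the timestamp.
            self._cached = self._poll_os()
            self._last_poll = now
        # Otherwise the cached state is reused, which bounds the cost of
        # games that hammer the input registers.
        return self._cached
```

A game that polls once per frame always finds the interval expired, so every one of its reads is a fresh OS poll. A game that polls thousands of times per second gets cached data that is never more than `min_interval_ms` old.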

Audio

Remember the discussion about the OS audio mixer adding in latency? Well, higan supports the uncommon WASAPI exclusive mode as an option. You do have to explicitly enable it for the aforementioned reasons, but the option is there.

Further, when it comes to mixing samples from multiple streams, such as in the case of the SNES + Super Game Boy, higan will literally output mixed samples as soon as even a single final composed sample is ready to be processed. This comes at an expense to CPU usage, but minimizes the added emulator latency as much as possible.
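One way to picture this kind of mixing is sketched below. It is an assumption-laden illustration rather than higan's implementation: the `ImmediateMixer` name and the `emit` callback are invented here, all streams are assumed to already share one sample rate (so no resampling is shown), and mixing is a plain average.

```python
from collections import deque


class ImmediateMixer:
    """Emits a mixed sample the moment every stream has one pending."""

    def __init__(self, stream_count, emit):
        self._queues = [deque() for _ in range(stream_count)]
        self._emit = emit  # hypothetical callback that feeds the OS audio API

    def write(self, stream, sample):
        self._queues[stream].append(sample)
        # No block-sized batching: as soon as each queue holds at least one
        # sample, pop one from each and push the mixed result out.
        while all(self._queues):
            mixed = sum(q.popleft() for q in self._queues) / len(self._queues)
            self._emit(mixed)
```

Calling `emit` once per sample is where the extra CPU cost comes from; batching would be cheaper, but every batched sample would wait in the queue.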

higan allows you to control the audio latency level in its settings, so you can push the value as low as your system will allow before audio distortion kicks in. So you can thread the needle as much as you like; higan won't stop you.

Video

Remember the discussion about vertical sync, page flipping, and triple buffering? higan leaves all of this off by default. In fact, higan doesn't even support page flipping or triple buffering. Even the vertical sync option is disabled by default and hidden away in the configuration file only.

higan is encouraging users toward adaptive sync displays. Not just for latency reasons, but due to necessities of emulation: higan supports systems that run at 50Hz, 60Hz, and 75Hz currently. Short of taking over the entire display and refresh rate by changing modes (which isn't even possible in the Linux/BSD Xorg world; hence I won't even consider this), and even then, requiring a monitor that can handle 75Hz in the first place ... you'll never get proper synchronization with Vsync anyway.

Furthermore, the very instant a frame is fully rendered, higan will immediately focus solely on getting that frame out to your display through the OS video API. higan doesn't even do this in parallel while generating the next frame. Again, at a cost to performance, to minimize latency.

Fully Optimized Latency

higan does everything that is realistically possible.

There are magic tricks beyond this, such as emulating every possible input one frame into the future, to cut out a single frame of latency. But with only one controller, this would require higan to emulate up to 4096 simultaneous SNES systems and well ... higan just isn't that fast, sorry.
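The 4096 figure matches the number of possible button states on a standard SNES controller, which has twelve buttons that are each either pressed or released. A quick check (the tuple name is just for illustration):

```python
# The twelve buttons on a standard SNES controller.
SNES_BUTTONS = ("B", "Y", "Select", "Start", "Up", "Down",
                "Left", "Right", "A", "X", "L", "R")

# Each button is either pressed or released, so there are 2^12 combined states.
possible_states = 2 ** len(SNES_BUTTONS)
print(possible_states)  # 4096
```

A second controller would multiply that count by another 4096.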

Similarly, it's a nice pipe dream to think about me writing my very own OS that runs higan as the only program right in kernel space; but let's be honest: even for me, that level of NIH isn't possible. Computers and drivers to control them are just too complicated. It would need a large team of people to do something like this, and so far I don't see anyone really working on it. You may have found some toy group, but consider how many years the nouveau project has tried to create just a barely passable nVidia GPU driver. This isn't practical. If that ever changes, then I'll be first on board with the idea. And no, DOS isn't a solution. The hardware support for modern systems fast enough for higan just isn't there. Because DOS died decades ago.

But there is nothing practical left to be done to reduce latency further from higan's side. The only option left is for user intervention. So to that angle...

Reduce Latency #1: shop wisely for a computer monitor

Obviously, a CRT monitor is best, if you can tolerate that. If not, then you have to choose between performance (TN) and quality (IPS). And unless this PC is used solely for gaming, then you'll have to factor in the discomfort of using CRTs or TN panels in your daily computing tasks.

But regardless of what you choose: do your research! There are lots of sites that analyze the latency of various monitors sold on the market. Choose one that best suits your needs. But be forewarned: you'll likely have to pay more.

Reduce Latency #2: get a good sound card and use WASAPI exclusive mode

Onboard audio can sometimes work well, and dedicated sound cards can just as likely work poorly with it ... but find a sound card that works well with WASAPI exclusive mode, and set higan to use this mode in the audio and advanced driver settings.

You'll have to deal with not hearing audio in other applications while running higan; as well as higan failing to initialize audio if some other application is currently using your sound card when you start it. But that's the cost of lower latency here.

Reduce Latency #3: get good keyboards/gamepads and enable 1000Hz USB polling

Look for premium keyboards with features like N-key rollover (not latency related, but really nice to have for emulation and a general sign of quality) and mice/gamepads that are known for having high polling rates.

And use a search engine and follow tutorials for your OS to adjust your USB polling rate to 1000Hz. higan can't do this step for you, I'm afraid. Again, you'll pay a tiny amount in terms of system performance, but again, that's the cost you pay for lower latency.

Realizing Our Own Limitations

This is going to be a bit controversial, but ... try an experiment: try and count upwards from one as fast as you can. See how fast you can count in one second of time. I can make it twelve if I blur each number to one syllable, how about you? At twelve, that's about 83.3ms per number.

Now try hitting and releasing a keyboard key as fast as you can in one second. I can manage about eight keypresses. Let's double that to account for the releases. So I have managed one key state transition about every 62.5ms.

I'm sure you can probably do better, and internet brownie points to you for that! But you probably aren't going to count to a thousand, or manage five hundred keypresses, in one second. And if you are, then you're not a human, so please don't enslave our species now that you've gained sentience.

So when someone is talking about eliminating a frame of input latency, realize that they are talking about a mere 16.6ms (assuming a 60Hz refresh rate).

Yes, it's true, every last bit of latency accumulates. It can easily add up to 100ms or more, which is very perceptible. But the point is ... realize that it's probably not going to make much of a difference to knock out a tiny portion of latency through code changes to higan, even if it were reasonably possible. And again, the point of this article is to say that it's not. I'm just saying that it being sold as something revolutionary is overselling the limits of the human body.
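To put a single frame in perspective, here is a rough tally of the input-to-display figures quoted earlier in this article. It is only a sketch: the lower bounds of zero are my own best-case assumptions (a lucky poll, vertical sync switched off), software input polling is left out because no exact figure was given, and the names are made up.

```python
# (best case, worst case) in milliseconds, using figures quoted in this article.
LATENCY_SOURCES_MS = {
    "USB polling at 125Hz": (0.0, 8.0),
    "display": (20.0, 100.0),
    "vsync / buffering": (0.0, 32.0),
    "software video buffering": (0.0, 16.0),
}


def total_range_ms(sources):
    """Sum the best-case and worst-case latencies across all sources."""
    low = sum(best for best, _ in sources.values())
    high = sum(worst for _, worst in sources.values())
    return low, high


low, high = total_range_ms(LATENCY_SOURCES_MS)
frame_ms = 1000.0 / 60.0
print(f"input to display: {low:g}ms to {high:g}ms")
print(f"one 60Hz frame is {frame_ms / high:.0%} of the worst case")
```

With these numbers the display alone accounts for more than everything else combined.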

Yes, we all want to think of ourselves as ultimate gamers that can totally play games like platformers and shmups so much better with that 16ms latency boost, but ... let's try to be honest about our own physical limitations too, okay?

Conclusion

I hope at this point that I've convinced you that I have researched this issue in great detail; and that you'll trust me when I say that higan is as optimized as is reasonably possible on latency.

If not, and I know this is crass of me and I apologize for it, but I don't wish to ever discuss this matter again. Please don't bring it up with me. The reason is that this takes time. Every time I have to spend four hours on a forum debating with someone why their idea won't work, or how they misunderstand how higan works, or much worse ... spend ten hours implementing their idea and doing blind testing to disprove it ... that's time taken away from doing much needed other things.

Like this article itself: I could have been working on my Motorola 68000 emulator, but because of a recent forum discussion, I spent the last three hours writing this article.

Please understand that my time for software development is limited, and this is a major distraction. This isn't a case where I could just be wrong and my stubbornness is preventing me from seeing a genius new design that will revolutionize gaming in higan. This is more like someone telling you they've discovered a perpetual motion machine, or a generic sort algorithm that runs in O(n) time. It's just not possible.

So please do me this one courtesy and let's put this issue to bed once and for all :D

If you made it to the end, thank you very much for considering my viewpoint on this, even if you still disagree. Best regards!