
Accuracy Takes Time

2020-01-22 14:59:36

Hi, I'm byuu. I've been attempting to perfect Super Nintendo emulation for the past 15 years. We are now at a point where that goal is in sight, but there is one last challenge we face: accurate cycle timing of the SNES video processors. In this article, I will recap the progress we've made, explain the problem we face, and finally I'll propose possible solutions so that we can move forward.


Today, SNES emulation is in a very good place. Barring unusual peripherals that are resistant to emulation such as a light-sensor based golf club, an exercise bike, or a dial-up modem used to place real-money bets on live horse races in Japan, every other officially licensed SNES title is fully playable, and no game is known to have any glaring issues.

SNES emulation has gotten so precise that I've even taken to splitting my emulator into two versions: higan, which focuses on absolute accuracy and hardware documentation, and bsnes, which focuses on performance, features, and ease-of-use.

Some amazing things have come out of SNES emulation recently:

... and much more!

So we're done, right? Kudos on a job well done, thanks for all the fish, and all that? Well, ... not quite.

Today, we enjoy cycle-level accuracy for nearly every component of the SNES, with the sole exception of the PPUs (picture processing units), which are used to generate the video frames sent to your monitor. We mostly know how the PPUs work, but we have to take guesses that result in less than total perfection.

SNES Design

Let's start by taking a look at the components that make up the SNES:

SNES system diagram

The arrows indicate the direction that the various processors in the SNES can communicate with one another, and the dotted lines represent memory chip connections.

The key thing to take away for now is that the video and audio outputs are sent directly from the PPU and DSP, respectively. That is to say, they function like black boxes. This will be important later on.

Oscillators and Cycles

The SNES contains two oscillators: a crystal clock that runs at ~21 MHz and controls the CPU and PPUs, and a ceramic resonator that runs at ~24 MHz and controls the SMP and DSP. Cartridge coprocessors will sometimes use the ~21 MHz CPU oscillator, and will sometimes include their own oscillators that run at different frequencies.

A clock is the core timing element of any system, and the SNES is designed to perform various tasks at certain frequencies and times.

If you imagine a 100 Hz clock, it is a device with a digital pin that transitions to logic high (+5 volts, for instance), and then back to logic low (0 volts, or ground) 100 times per second. That is to say, the pin voltage will fluctuate 200 times total: there will be 100 rising clock edges, and 100 falling clock edges.

A clock cycle is generally treated as one full transition, so a 100 Hz clock would generate 100 clock cycles per second. There are some systems that require distinguishing between rising and falling edges, and for those, we break this further down into half-cycles to denote each phase (high or low) of the clock signal.
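To make the edge counting concrete, here is a minimal sketch of an ideal square-wave clock. The names `ClockEdges` and `simulateClock` are made up for this illustration, and the model ignores everything a real oscillator does besides toggling.

```cpp
#include <cstdint>

// Edge counts produced by an ideal square-wave clock.
struct ClockEdges {
  std::uint64_t rising;   // low -> high transitions
  std::uint64_t falling;  // high -> low transitions
};

// Toggles a pin once per half-cycle and counts each kind of edge.
// frequencyHz and seconds are illustrative parameters, not SNES values.
inline ClockEdges simulateClock(std::uint64_t frequencyHz, std::uint64_t seconds) {
  ClockEdges edges{0, 0};
  bool level = false;  // the pin starts at logic low
  const std::uint64_t halfCycles = 2 * frequencyHz * seconds;
  for(std::uint64_t n = 0; n < halfCycles; n++) {
    level = !level;
    if(level) edges.rising++;
    else edges.falling++;
  }
  return edges;
}
```

Calling `simulateClock(100, 1)` walks through 200 half-cycles, split evenly between rising and falling edges, which corresponds to 100 full clock cycles.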

The key goal of an emulator is to perform these tasks in exactly the same ways and at exactly the same times. More specifically, it doesn't so much matter how the tasks are performed, only that given the same inputs, the same outputs will be generated as with real hardware.

We'll break this into two goals: correctness and timing.


Imagine you are emulating a CPU's multiply instruction: it takes two registers (variables), multiplies them together, and produces a result and some flags which represent the status of the result (such as overflow.)

So if we imagine an 8-bit CPU (one that can count from 0 to 255), then given the expression A = 3 * 7, we expect that A will become 21, and since 21 fits inside the 8-bit result (that is to say, it is no greater than 255), the overflow flag should be clear.

This is easy to test for: we can create our own SNES program that multiplies 3 by 7, and then check the A register for 21, and the overflow flag for 0. If the result is as expected, we pass. If not, we fail.

Of course, basic multiplication is trivial. But let's say we had a CPU with a half-overflow flag for detecting 4-bit overflows in an operation, or a secondary overflow flag, or any other such oddity.

In this case, we could devise a software program that multiplies every possible value from 0 - 255 as both the multiplier and multiplicand. And then we would output both the numeric and flag results of the multiplication. This would produce two 65,536-entry tables.

By analyzing these tables, we could determine exactly how and when the CPU results were set certain ways. And then we could modify our emulators so that when running the same test, we produce exactly the same tables.
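The exhaustive approach can be sketched as follows. This is only an illustration: `multiply8` is a stand-in for whatever is being tested (real hardware or an emulator core), its overflow rule is an assumption made for the example, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One row of the log: the 8-bit result plus a hypothetical overflow flag.
struct MulResult {
  std::uint8_t product;  // low 8 bits of a * b
  bool overflow;         // set when the full product does not fit in 8 bits
};

// Assumed reference model standing in for "the CPU under test".
inline MulResult multiply8(std::uint8_t a, std::uint8_t b) {
  const unsigned full = unsigned(a) * unsigned(b);
  return MulResult{std::uint8_t(full & 0xff), full > 0xff};
}

// Runs all 256 * 256 input pairs; the entry for (a, b) lives at index a << 8 | b.
inline std::vector<MulResult> buildTable() {
  std::vector<MulResult> table;
  table.reserve(256 * 256);
  for(unsigned a = 0; a < 256; a++) {
    for(unsigned b = 0; b < 256; b++) {
      table.push_back(multiply8(std::uint8_t(a), std::uint8_t(b)));
    }
  }
  return table;
}

// Counts the entries where two logs (say, hardware and emulator) disagree.
// Entries present in only one log count as mismatches.
inline std::size_t countMismatches(const std::vector<MulResult>& x,
                                   const std::vector<MulResult>& y) {
  const std::size_t common = std::min(x.size(), y.size());
  std::size_t mismatches = std::max(x.size(), y.size()) - common;
  for(std::size_t n = 0; n < common; n++) {
    if(x[n].product != y[n].product || x[n].overflow != y[n].overflow) mismatches++;
  }
  return mismatches;
}
```

In practice one table would be captured from a real console and the other generated by the emulator; a mismatch count of zero is the pass condition.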

Now let's say the CPU had 16-bit x 16-bit multiplications. Testing every possible value would generate four billion results, which is starting to push it. If the CPU had 32-bit x 32-bit multiplication, it would just flat out not be practical.

In cases like this, we would have to get more selective with our tests and try to determine exactly when flags might change, when results might overflow, and so forth. Otherwise we'd have tests that would never complete.

Again, multiplication is a fairly trivial operation, but this is the general process behind reverse-engineering, and it extends to more complex operations such as how the SNES' horizontal blanking DMA (direct memory access) transfers work: we create tests that try and detect what happens on edge cases, and confirm that our emulation behaves identically to a real SNES.

We can never hope to exhaustively test every possible combination of inputs, because that would be more tests than there are atoms in the universe, but we try to deduce as much as we can from as few tests as possible.


Sometimes, operations happen over time. Take SNES CPU multiplication, for instance. Rather than take a longer time to complete, the SNES CPU calculates the multiplication result, one bit at a time, in the background over the next eight CPU opcode cycles. This allows your code to possibly do other things while waiting on the multiplication to complete.

Any commercially released software is likely to wait those eight cycles, because if you try and read the result before it's ready, you will get a partially computed result instead.
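A rough model of a multiplier that resolves one bit per step shows why an early read returns a partial value. This sketch uses plain shift-and-add purely for clarity; it is not claimed to match the algorithm or the intermediate values of the real SNES CPU, and `SteppedMultiplier` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical multiplier that processes one multiplier bit per step.
struct SteppedMultiplier {
  std::uint16_t result = 0;    // running sum, readable at any time
  std::uint16_t shifted = 0;   // multiplicand, shifted left once per step
  std::uint8_t remaining = 0;  // multiplier bits not yet processed
  unsigned stepsLeft = 0;

  // Begins a new 8-bit x 8-bit multiplication.
  void start(std::uint8_t multiplicand, std::uint8_t multiplier) {
    result = 0;
    shifted = multiplicand;
    remaining = multiplier;
    stepsLeft = 8;
  }

  // Advances the computation by one step (one "cycle" in this model).
  void step() {
    if(stepsLeft == 0) return;
    if(remaining & 1) result = std::uint16_t(result + shifted);
    remaining = std::uint8_t(remaining >> 1);
    shifted = std::uint16_t(shifted << 1);
    stepsLeft--;
  }
};
```

Starting 3 * 7 and reading `result` after only two steps yields 9 in this model; the correct answer of 21 is only guaranteed once all eight steps have run.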

Yet unofficial code was often developed on emulators, as was the case with many earlier Super Mario World ROM hacks. Developers who weren't aware of this limitation, and whose emulators did not simulate the delay, went on to create software that only worked in those earlier emulators and not on real SNES hardware. As emulators improved, this old software broke, and we have subsequently had to offer compatibility options in our newer emulators in order to not lose this software to time.

Yes, as surreal as it is to say, these days our emulators emulate other emulators! How meta!

Tangent: I've always wanted to preserve everything that could be preserved, but supporting the flaws of earlier emulators interfered with my goal of having my emulator be a self-documenting representation of a real SNES console. I ultimately solved this problem by forking my emulator into two separate emulators: higan, focused on exactly recreating the original hardware; and bsnes, designed to be as easy-to-use and compatible as possible. Although I must admit, it's quite the added burden to have to maintain multiple emulators for the same system.

The nice thing about the CPU multiplication delay is that it's very predictable: the eight computation cycles start immediately after requesting a multiplication. By writing code to read the results at every cycle, we were eventually able to determine that the SNES CPU was using the Booth algorithm for multiplication. Obvious in hindsight, but it was important to confirm it all the same.

Other operations are not so simple, and happen asynchronously in the background, such as the DRAM refresh that applies to the SNES WRAM chip: every scanline, at approximately the 538th cycle, the entire SNES CPU freezes for the next 40 clock cycles as the contents of the SNES WRAM chip are refreshed. This is needed because the SNES chose to use dynamic RAM (rather than static RAM) for its main CPU memory as a cost-cutting measure. Dynamic RAM must be periodically refreshed in order to not lose its contents over time.

When I say approximately the 538th cycle, I mean the exact position changes from scanline to scanline, and the pattern is also different between the two CPU revisions that were released during the SNES' lifetime.
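As a simplified illustration of what the freeze means for timing, the sketch below pins the refresh window at cycle 538 for 40 clocks. That fixed position is an assumption made for the example (as noted above, the real position drifts between scanlines and CPU revisions), and the names are hypothetical.

```cpp
// Illustrative constants taken from the approximate figures in the text.
constexpr unsigned kRefreshStart = 538;
constexpr unsigned kRefreshLength = 40;

// True while the CPU is frozen for refresh at this clock within a scanline.
constexpr bool cpuFrozen(unsigned hclock) {
  return hclock >= kRefreshStart && hclock < kRefreshStart + kRefreshLength;
}

// A CPU at position hclock wants to spend `clocks` cycles doing work.
// Returns the position at which that work finishes, counting frozen clocks
// as elapsed time that accomplishes nothing. Scanline wraparound is ignored.
inline unsigned advanceCpu(unsigned hclock, unsigned clocks) {
  while(clocks > 0) {
    if(!cpuFrozen(hclock)) clocks--;
    hclock++;
  }
  return hclock;
}
```

Ten clocks of work starting at position 530 finish at 580 rather than 540 in this model, because the refresh window swallows 40 clocks in the middle.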

Yep, as if we didn't have enough to do already, there are even multiple versions of most of the chips inside the SNES, each with their own subtle-yet-important differences and quirks.

Since I was the first to implement DRAM refresh in an emulator, the exact timings of when this happened weren't known; they were only estimated. In fact, there wasn't even a known method yet for figuring out the exact position. I had to come up with that first.

Clock Synchronization

The key insight to being able to time these operations was to take advantage of a feature of the SNES PPU: its horizontal and vertical counters. These counters advance and are reset after each horizontal and vertical blanking period. However, their precision is only a quarter of the SNES' CPU oscillator frequency. That is to say, the horizontal counter increments only once every four clock cycles.

By reading the counters multiple times, it became possible to determine which of the four clock cycles within a counter tick the CPU was aligned to. By combining that insight with a specially crafted function that would step by an exact number of clock cycles passed to it as an argument, it became possible to perfectly align the SNES CPU to any exact clock cycle position I wanted.
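The idea can be illustrated with a toy model. Everything here is an assumption made for the sketch: `ToyConsole` is hypothetical, its counter simply divides the clock by four, and reads cost no time, which is far kinder than real hardware, where every read itself consumes clock cycles.

```cpp
#include <cstdint>

// Toy model: a master clock plus a counter that ticks once every four clocks.
struct ToyConsole {
  std::uint64_t clock = 0;
  unsigned readCounter() const { return unsigned(clock / 4); }
  void stepClocks(unsigned n) { clock += n; }
};

// Steps one clock at a time until the counter changes. Afterwards the console
// sits exactly on a counter boundary. Returns the phase (0..3) it started at.
inline unsigned alignToCounter(ToyConsole& console) {
  const unsigned before = console.readCounter();
  unsigned steps = 0;
  while(console.readCounter() == before) {
    console.stepClocks(1);
    steps++;
  }
  return (4 - steps) % 4;
}
```

Once the boundary is known, stepping by a chosen number of clocks lands on any desired position relative to the counter, even though the counter itself only has four-clock resolution.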


By iterating over a range of clock cycles in a loop, I could determine exactly when certain operations (such as DRAM refresh, HDMA transfers, IRQ polling, etc) would occur, and I was able to reproduce this precisely under emulation.

The SNES SMP has its own timers, which allowed similar reverse engineering to succeed against that processor too. I could spend an entire article talking about the SMP TEST register alone, which allows you to control the clock divider of the SMP and its timers, among other horrible things. Suffice it to say it was not an easy or fast process, but we were ultimately victorious.


There were a whole host of SNES coprocessors used inside various game cartridges that needed to be tamed as well. From dedicated general-purpose CPUs like the SuperFX and SA-1, to digital signal processors like the DSP-1 and Cx4, to decompression accelerators like the S-DD1 and SPC7110, to real-time clocks from Sharp and Epson, and more ...

Emulating the instruction and pixel caches of the SuperFX, the memory bus conflict arbitrator of the SA-1 (which allowed the SNES CPU and SA-1 to share the same ROM and RAM chips simultaneously), the embedded firmware of the DSP-1 and Cx4, the prediction-based arithmetic coders of the S-DD1 and SPC7110, or the odd BCD (binary-coded decimal) edge cases of the real-time clocks ... slowly but surely by applying the above techniques to determine correctness and timing, we were able to near-perfectly emulate all of these chips.

In fact, it took a massive effort and thousands of dollars to decap and extract the programming firmware from the digital signal processors used in various games.

In one instance, emulation of the NEC uPD772x led to code from higan being used to save the late professor Stephen Hawking's voice!

In another, we had to reverse engineer the entire instruction set of the Hitachi HG51B architecture, because there was no documentation on it anywhere.

And in yet another instance, one game (Hayazashi Nidan Morita Shougi 2) ended up containing a full-blown 32-bit, 21 MHz ARM6 CPU to accelerate its Japanese chess engine!

Preserving all of the SNES coprocessors alone was a multi-year journey full of challenges and surprises.


Not to be confused with the DSP-1 cartridge coprocessor, the Sony S-DSP chip is what generated the distinctive sound from the SNES. This chip combined eight voice channels with 4-bit ADPCM encoding to produce a 16-bit stereo signal.

On the surface and per the system diagram from earlier, the DSP initially looks like a black box: you configure the voice channels and mixer settings, and sit back as it generates sound to be sent to your speakers.

But one key feature allowed a developer by the name of blargg to fully reverse engineer this chip: the echo buffer.

The SNES DSP has a feature that mixes the outputs from previous samples together to produce an echo effect. This happens at the very end of the audio generation process (aside from one last final mute flag that can be applied to silence all audio output.)

By writing carefully cycle-timed code, it became possible to devise the exact order of operations the SNES DSP would take to generate each sample and to produce cycle-accurate, bit-perfect audio.


So this leads us to the final piece of the SNES architectural diagram: the PPU-1 and PPU-2 chips:

S-PPU1 die scan
S-PPU2 die scan

Thanks to John McMaster, we have 20x magnification scans of the S-PPU1 (revision 1) and the S-PPU2 (revision 3) chips.

What the die scans above reflect is that, rather obviously, these are not general-purpose CPUs, nor are they custom architectures executing operation codes from an internal firmware program ROM: they are both dedicated, hard-coded logic circuits that take the inputs of various registers and memory, and produce the video output to your monitor, one scanline at a time.

The reason the PPUs remain the final frontier of SNES emulation is that, unlike every other component discussed up until now, the PPUs truly are a black box: you can configure them to any state you want, but from the SNES CPU, you have no way of observing what they generate.

In other words, imagine if in the earlier example, you requested the result of 3 * 7, but you never received an answer. Instead, it was sent as a fuzzy, analog image showing the value of '21' to your monitor. The people running your software could see the 21, but you would have no way of confirming that is what they saw. And as such, you wouldn't be able to write software to verify that the inputs of 3 and 7 produce an output of 21. The pass or fail condition would be up to the human looking at their monitor instead of your software. That doesn't scale beyond a few thousand tests, and we're going to need several million to really hone down the exact PPU behavior.

Now I know what you're probably thinking: "but byuu, wouldn't it be easy to use a capture card, perform a lot of image processing, roughly match it to the emulator's digital screen image, and pass-fail the test based on that?"

Well, probably! Especially if the test were as simple as two giant numbers spanning the entire size of the screen.

But what if our testing was very nuanced, and we were trying to detect a half-shade difference of a single pixel? What if we wanted to run a million consecutive tests and didn't necessarily know what they were going to generate just yet, but still wanted to match it to the output of our emulation?

Nothing beats the convenience and certainty of digital data, an exact stream of bits that you either match or don't match. The analog domain isn't that.

Why Does This Matter?

With the exception of one game, Air Strike Patrol, all officially-licensed SNES software is (intended to be) scanline-based: that is to say, the games do not attempt to change the PPU rendering state in the middle of an actively rendering scanline (which is known as a raster effect), and so very little timing precision is actually needed to run these games.

As much as I hate to admit it (and I've perhaps buried the lede here), if you're not interested in the pursuit of 100% faithful perfectionism for the sake of it, I am not going to be able to convince you. As with any goal in life, the closer we get to perfection, the more diminishing the returns will be.

I can tell you why this is important to me: it's my life's work, and I don't want to say I came this close to finishing, but didn't quite get the last piece of it right. I'm getting older, and I won't be around forever. I want this final piece solved so that I can feel confident in my retirement that the SNES has been faithfully preserved. No stone was left unturned, no area left unfinished. I want to say that it's done.

Sometimes you just want to do things for the technical challenge alone; just to prove to yourself you can; just for the sake of it.

All the same, I'll give you some real-world examples.

Air Strike Patrol screenshots: air-strike-a.png, air-strike-b.png, air-strike-c.png

Above, you see that the "Good Luck" text is being rotated from frame to frame. It does this by modifying the background layer 3's vertical scrolling position. However, the HUD display on the left (where it displays you have 39 missiles available) is also on the same background layer.

The game performs this feat by changing BG3's scroll position after the HUD on the left has rendered, but before the "Good Luck" text begins rendering. It can get away with this because BG3 is transparent outside of the HUD and the text, so there is nothing to really draw between those two points, regardless of the vertical scroll register value. This tells us that the scroll registers can be changed at any point during rendering.


Above is the infamous plane shadow near the bottom of the screen. This effect is rendered by changing the screen display brightness register for short bursts over the span of five scanlines.

While playing the game, you'll note that the shadow is rather erratic. In the above image, it looks a bit like a 'c', but in-game, it's constantly changing in length and start point for each scanline. The reason is that SNES timing is extremely difficult to get absolutely cycle perfect.

Air Strike Patrol just aimed in the general ballpark of where they wanted the shadow to appear, and then went for it, guns blazing. It mostly worked.


Finally, here is the pause screen. This one toggles BG3 on during the yellow and black border on the left, and off again during the same border on the right, to draw a scanline effect of gray lines on the screen. It also alternates between doing this on odd and even scanlines every other frame to give a shaking effect to the overlay.

If you zoom in on the image, you'll notice that on a couple of scanlines, there are a few missing pixels on the left-hand edge. The cause of that is that my emulation of the PPU is not 100% cycle-perfect, and in this case, is triggering the BG3 toggle effect a bit later than it is supposed to.

I could very easily adjust the timings to render this image correctly, but then that is just as likely to have adverse effects in other titles that unintentionally modify PPU display registers mid-scanline.

While Air Strike Patrol is the only game to do this intentionally, there are at least a dozen games that do so by accident: maybe they had an IRQ fire a bit too early or too late, but in the end, PPU settings are sometimes modified in the middle of scanlines. Sometimes this produces some brief visible corruption that was overlooked in development (as with Full Throttle Racing when transitioning between the shop and game); sometimes the writes happen during an otherwise transparent part of the screen and were thus not noticed (such as the HP status display in Dai Kaijuu Monogatari II.)

Even discounting Air Strike Patrol, all of these games accidentally but effectively trigger raster effects, so it is not possible to design a PPU renderer for the SNES that generates an entire scanline at once at any one fixed point in time.

With bsnes, I have a list of these games and custom render positions so that a much faster scanline-based renderer can render all of these games properly (save for Air Strike Patrol, of course), but that essentially works out to per-game hacks.

I also have a cycle-based PPU renderer that doesn't need any of these hacks, but eventually you run into tiny 1-4 pixel differences, as in the last Air Strike Patrol screenshot pictured above.

Latching Internal Registers

The cause of these slight misses comes down to latching behavior timings.

Let's say the SNES is rendering its iconic mode 7, which is an affine texture transformation with per-scanline parameter adjustments. To determine any given pixel on the screen, a computation such as this can be performed:

px = a * clip(hoffset - hcenter) + b * clip(voffset - vcenter) + b * y + (hcenter << 8)
py = c * clip(hoffset - hcenter) + d * clip(voffset - vcenter) + d * y + (vcenter << 8)

But the thing is, a real SNES would struggle to perform these eight multiplications every single pixel. None of these values change from pixel to pixel (or at least, they aren't supposed to), so we only have to compute px,py once at the start of every scanline. And so the PPU caches these values in latches, which are essentially copies of PPU registers that may have been transformed, or may be transformed further as time goes by.

The x,y coordinates are then transformed by mode 7 like so:

ox = (px + a * x) >> 8
oy = (py + c * x) >> 8

Although x changes every pixel, we know that it increments by one every time. By keeping internal accumulators, we can simply add constant values a,c to ox,oy once every pixel instead of having to perform two multiplications here for every pixel.
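The equivalence between the direct form and the accumulator form can be sketched as below. Several things are assumptions of this sketch rather than facts from the text: the `Mode7` and `Point` types, the definition of `clip()` (which the formulas use but do not define), and the choice to keep the accumulators unshifted and apply `>> 8` only on output.

```cpp
#include <vector>

// Hypothetical register bundle; field names mirror the formulas above.
struct Mode7 {
  int a, b, c, d;        // matrix parameters
  int hoffset, voffset;  // scroll offsets
  int hcenter, vcenter;  // center of the transformation
};

struct Point { int x, y; };

// Assumed definition: keeps ten bits and sign-extends when bit 13 is set.
inline int clip(int n) { return (n & 0x2000) ? (n | ~1023) : (n & 1023); }

// Computed once per scanline. Multiplying by 256 stands in for "<< 8" so that
// negative centers stay well-defined in older C++ standards.
inline Point origin(const Mode7& r, int y) {
  const int h = clip(r.hoffset - r.hcenter);
  const int v = clip(r.voffset - r.vcenter);
  return Point{r.a * h + r.b * v + r.b * y + r.hcenter * 256,
               r.c * h + r.d * v + r.d * y + r.vcenter * 256};
}

// Direct form: two multiplications for every pixel.
inline Point direct(const Mode7& r, int x, int y) {
  const Point p = origin(r, y);
  return Point{(p.x + r.a * x) >> 8, (p.y + r.c * x) >> 8};
}

// Accumulator form: one addition per axis for every pixel.
inline std::vector<Point> renderLine(const Mode7& r, int y, int width) {
  std::vector<Point> line;
  Point acc = origin(r, y);  // the values latched at the start of the line
  for(int x = 0; x < width; x++) {
    line.push_back(Point{acc.x >> 8, acc.y >> 8});
    acc.x += r.a;
    acc.y += r.c;
  }
  return line;
}
```

Both forms produce identical coordinates as long as a and c stay constant for the whole scanline. If a game rewrites a or c mid-scanline, the result depends on exactly when the hardware samples them, which is the question that follows.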

The question then becomes, at what exact cycle position does the PPU read the values a,c from the CPU-accessible SNES PPU external registers?

If we guess a time that is too soon, it may break a certain subset of games, and if we guess a time that is too late, it may break a different subset of games.

The easy approach is to just keep waiting for bug reports, and adjusting these positions to resolve issues in any specific game. But by doing this, we will never find out the exact positions, only approximations.

And any time we change one of these variables, we are not feasibly going to be able to retest the entire ~3,500-game SNES library to spot any regressions that our changes might have caused.


In fact, this exact style of "just get the current game of interest working at any cost" methodology led to a phenomenon I call Whack-a-Mole emulation.

Back in the early days of SNES emulation, whenever a game had issues, any fix that resulted in a given game working would be accepted and committed to the emulator. Without fail, that fix would end up breaking a different game. And when that game was fixed, a third game would fail. Fixing the third game would then break the first game again. This went on for years.

The mistake being made here is trying to consider only one variable at a time. Let's say we have a game where an event needs to happen between cycle 20 and 120 in order to work. We don't know the exact cycle, so we go ahead and pick 70, right in the middle.

Later on a bug report comes in for a different game, and we determine that for this game to work, the value needs to be between 10 and 60. So now we adjust it to 40.

It seems sensible, but then along comes a third game that needs the event to trigger between 80 and 160! Now there's no way to get all three games working at the same time!
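The dead end in this example is just an empty interval intersection, which a few lines of code make explicit. The `CycleRange` type and function names are hypothetical, and the ranges are the illustrative ones from the paragraphs above.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Inclusive range of cycles at which an event may fire for one game to work.
struct CycleRange { int first; int last; };

// A range with first > last contains no cycles at all.
inline bool isEmpty(CycleRange r) { return r.first > r.last; }

// Narrows an unbounded range by every game's requirement in turn.
inline CycleRange intersect(const std::vector<CycleRange>& ranges) {
  CycleRange common{std::numeric_limits<int>::min(), std::numeric_limits<int>::max()};
  for(const CycleRange& r : ranges) {
    common.first = std::max(common.first, r.first);
    common.last = std::min(common.last, r.last);
  }
  return common;
}
```

The first two games together allow anything from cycle 20 to cycle 60, which is why 40 looked like a safe pick. Adding the third game leaves nothing: no single value of this one variable can satisfy all three.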

This led emulators to implement game-specific hacks: you don't want to be the person to ship the emulator that doesn't run Mario, Zelda, or Metroid. So you use 40, and then when Metroid is loaded, you make it use 100 instead.

How is it possible that two games need different values? That's because there are more variables in play than just that one value. The timing at which an earlier event was triggered may influence what timing value is needed for the next event.

To put it as a simple algebraic expression, imagine:

2x + y = 120

You can solve this equation with x=10, y=100. Or with x=20, y=80. Or with x=30, y=60. If you're only thinking about what values of x make your subset of games all work simultaneously, then you may be missing the fact that the actual problem is that your y variable is incorrect.

What happened with SNES emulators back then is that, even once it became definitively known that x should be 20, developers who had y wrong without knowing it would leave x intentionally wrong, because that ran more games. And so y never got fixed.

This is an aspect of emulation that is slightly unintuitive to outsiders, but very much true: accuracy is not the same thing as compatibility.

You could have an emulator be 99% accurate, but only run 10% of games. And you could have an emulator be only 80% accurate, and run 98% of games. But one thing is certain: if you don't accept that sometimes doing things right will break popular games in the short-term, you will never get to 100% accuracy and 100% compatibility.

The thing is, the SNES is not just one or two variables in play at a time. Just the SNES PPU alone has 52 external registers comprising ~130 settings. The process of rendering one scanline involves all ~130 settings and an unknown number of internal and latched registers, all in play at the exact same time. It's too much for anyone to fully grasp the entire state of the SNES PPU alone at any given point in time.

Deductive Reasoning

The way we've gotten SNES PPU emulation to the point it is at now has been through deductive reasoning and real-world results.

We know that the PPUs have access to two VRAM chips. We know they can only read so many bytes of data from each chip per scanline. We know the broad details of how each SNES video mode operates. And with that, we can lay out a general pattern of how a design might look:

if(io.bgMode == 0) { /* ... */ }
if(io.bgMode == 1) { /* ... */ }
if(io.bgMode == 2) { /* ... */ }

Above is an abridged example of how the first three SNES video modes might work.


Although the SNES PPU's VRAM (video RAM) is locked out from the SNES CPU during rendering, even for reading ... it turns out that the OAM (sprite memory) and CGRAM (palette memory) are not. The catch is that the SNES PPU controls the address bus during this time. And so by reading OAM and CGRAM while the screen is rendering, I'm able to observe what the SNES PPU is fetching from these two pieces of memory.

It's not the whole piece of the puzzle, but it was enough for me to implement mostly-correct sprite fetching patterns.

PPU Flags

The PPU exposes precious little state: horizontal and vertical blanking flags, horizontal and vertical pixel counters, and range-tile over flags for sprites.

It's not a lot, but again, every little bit of exposed state helps.

Putting It All Together

Combining deductive reasoning, the exposed OAM and CGRAM access patterns, PPU flags, and just general observations (read: guessing) from bug reports in various games has led us to having cycle-based PPU renderers that can almost run all commercially released games perfectly.

However, it's a house of cards: if someone were to start creating homebrew that exploits perfect cycle timing and raster effects, all of our emulations would fall apart. Everywhere. Not just in my emulators. Not just in software emulators. In FPGA implementations as well.

I want to be clear here: everyone is currently guessing the internal order of operations and latching behaviors of the SNES PPUs. Nobody knows how to emulate this perfectly. Not yet, anyway.


So what do we do about this? How do we determine the exact order of operations of the SNES PPU, when it acts as a black box to us from the SNES CPU side of things?

I see three possibilities: logic analyzers, breakout boards, and decapping.

Logic Analyzers

If you look at the PPU die scans from above, you'll notice black pads all around the edges of the chips. These are pads that connect to pins on the actual chips.

These pins hold the state of the PPUs during each clock cycle of execution. Here you will find the current address they are accessing for the video RAM chip, what data value is being transmitted from one PPU to the other, and more.

This is information that is not available to code running on the SNES CPU, but it provides valuable insight into the SNES PPU's internal order of operations.

The critical issue with logic analyzers is that the process is not very well controlled: trying to sample live data on a running system is going to give you a stream of results that are quite difficult to decipher. And you will still have the same problem as with trying to analyze the analog RGB output: you will have to manually run each and every test to capture this data. It's not a good system for creating reproducible regression tests.

Breakout Boards

Given that the SNES PPUs are static, it would be possible to extract the PPUs from a working SNES console, and to place them onto a protoboard or custom PCB, along with the two video RAM chips. From here, a microcontroller could sit between the PPUs and a USB interface to a PC, which would allow you to program all of the video RAM and PPU external registers. Finally, you could manually drive the PPU clock, and sample the resulting PPU I/O pins, registers, and memory during each and every cycle.

By modifying a software emulator to generate these same internal I/O pin values, it would be possible to directly compare real hardware to emulation, even in real-time.
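On the PC side, such a comparison could be as simple as finding the first clock at which the two pin traces disagree. This is a sketch under stated assumptions: `PinSample` is a hypothetical packed snapshot of whichever pins the board exposes, and no real capture format or tool is implied.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One sample per clock: a hypothetical packed snapshot of the observed pins.
using PinSample = std::uint32_t;

// Returns the index of the first clock at which the traces differ, or -1 if
// they match exactly. A trace that ends early diverges where it ends.
inline std::ptrdiff_t firstDivergence(const std::vector<PinSample>& hardware,
                                      const std::vector<PinSample>& emulator) {
  const std::size_t common = std::min(hardware.size(), emulator.size());
  for(std::size_t cycle = 0; cycle < common; cycle++) {
    if(hardware[cycle] != emulator[cycle]) return std::ptrdiff_t(cycle);
  }
  if(hardware.size() != emulator.size()) return std::ptrdiff_t(common);
  return -1;
}
```

Because both traces are plain digital data, a check like this can run unattended across as many register configurations as desired, which is exactly what the analog video output does not allow.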

This will still be very hard work, however. We still don't have visibility into the internal operations of the PPUs.


And so the final, most extreme approach would be to expand upon our decapping efforts. We have 20x die scans, but the resolution is not enough to make out and reconstruct individual logic circuits from them, as was done with the Visual 6502 project (only on a larger scale here). If we can get 100x magnification scans of both PPU dies, we could begin the arduous task of mapping out the entire PPUs, and converting them into netlists or VHDL code. This would be directly usable with FPGAs and, with porting work to C++ or another software language, usable with our software emulators as well.

A ballpark estimate I received from someone who has done work like this in the past is that it would take around 600 hours to do this for both PPUs. That is well past the "let's fundraise and pay someone to do this" level, and falls squarely into the "let's hope someone extremely talented with a unique, in-demand skillset is interested in volunteering to help us" territory.

That is of course not to say that I wouldn't be happy to financially reward anyone able to assist me, as well as to pay for any parts or labor required.

A Call for Help

So basically, I need help to finish this final task.

If you've read this far, hopefully that's you!

If so, I would love to hear from you. Please get in contact with me.

And if not, I still appreciate you reading this, and hopefully it has shed some light on the difficulty that goes into creating cycle-perfect emulators.

Even after 15 years of working as a volunteer in this field, everyone's support and assistance has kept this project interesting and rewarding for me. Thank you all so much for the encouragement over the years! Here's to hoping the final stretch is a success as well.