Nes Emulator

Nes Emulator

This blog covers my journey of emulating the NES, a popular console from the 1980s made by Nintendo.

Yet Another NES Emulator

Why write an emulator for the NES, a project that can be considered well-trodden at this point? For those pursuing low level programming, an emulation project can serve as an excellent appetizer to the field. It gets you familiarized with CPU architecture, the interactions of different components within a computer. It’s a project big enough where authors have to concern themselves with profiling, and unit testing, concepts that may not be well taught in university. With a sufficiently complex computer to emulate, authors will actually have to think about how to structure their project, else be ridden with spaghetti code that can be a nightmare to debug. And why not the NES, which at this point is well documented and has a lot of support from communities such as r/EmuDev.

The intentions of this blog is to document my process in writing this emulator, as well as write about the flow of execution of the PPU, and the APU. I will not go into the specifics, as the nesdev wiki already provides the necessary technical details of the NES. I hope that my blog can serve as guide to what each component does on a higher level.

Core Emulation Logic

CPU - 6502

CPU Components

I would like to preface that when I say CPU components, I am talking about the elements of the 6502 relevant to emulation. A good starting point would be to list the registers of the 6502 CPU.

ℹ️
A register is simply a piece of hardware that contains bits of data. For the sake of emulation, registers can be represented as unsigned integers.

PC (Program Counter)

The PC is a 16 bit register that tracks where the program is at in memory. It can be thought of as an index of an array, with the array being memory.

A (Accumulator)

This is an 8 bit register that is typically used to hold the result of operations performed by instructions.

X, Y

These are 8 bit registers, and typically used as operands for operations.

S

This 8 bit register serves as a stack pointer, letting the CPU know where the top (or bottom in the case of the NES) of the stack is at.

P

This is the status register. It’s not really a register, but can be thought of as one. Each bit in this ‘register’ represents a state after an instruction is performed. Decimal mode can be ignored, as this is not a feature on the NES.

Bit Name Flag Description
7 Negative N Set if the result of an operation is negative (bit 7 set)
6 Overflow V Set if signed overflow occurred
5 Unused Always set to 1 when pushed to the stack, ignored otherwise
4 Break B Set when a BRK instruction was executed
3 Decimal D Decimal mode (ignored on NES)
2 Interrupt Disable I If set, disables IRQ interrupts
1 Zero Z Set if the result of an operation is zero
0 Carry C Set if the last addition or shift resulted in a carry or borrow

Memory

The CPU also has access to 64KB of memory, with each element of this memory being able to store an 8 bit value. The table below is the CPU’s memory layout. Take a careful look at this map, particularly region $8000 to $FFFF, as this is where the games discussed in this blog will be stored.

Address Range Size Purpose
$0000–$07FF $0800 2 KB internal RAM
$0800–$0FFF $0800 Mirrors of $0000–$07FF
$1000–$17FF $0800 Mirrors of $0000–$07FF
$1800–$1FFF $0800 Mirrors of $0000–$07FF
$2000–$2007 $0008 PPU MMIO registers
$2008–$3FFF $1FF8 Mirrors of $2000–$2007 (repeats every 8 bytes)
$4000–$4017 $0018 NES APU and I/O registers
$4018–$401F $0008 APU and I/O functionality (normally disabled — see CPU Test Mode)
$4020–$FFFF
$6000–$7FFF Usually cartridge RAM, when present
$8000–$FFFF $BFE0 Usually cartridge ROM and mapper registers

The code below is a representation of the information relayed so far as a struct in C. There are some values that weren’t covered yet and they can be ignored for now as they do not pertain to the core cpu functionality.

typedef struct Cpu6502 {

  int cycles;

  uint8_t instr;

  // Accumulator
  uint8_t A;

  // X and Y register
  uint8_t X;
  uint8_t Y;

  // Program Counter
  uint16_t PC;

  // Stack pointer
  // Initialize it to the top
  uint8_t S;

  unsigned char P[8];

  PPU *ppu;

  uint8_t ctrl_latch_state;
  int ctrl_bit_index;

  unsigned char nmi_state;
  unsigned char strobe;
} Cpu6502;

Stack

The stack is a part of the memory used to temporarily hold values assigned by the CPU. Its used to save values when the CPU needs to change to a different context. Certain instructions such as the Jump to Subroutine (JSR) instructs the CPU to change the program counter value to another value. The current PC will be saved by pushing its value to the stack.

CPU Instructions

The CPU executes instructions and these instructions can vary in function. The simplest instructions to implement are the read/write, and the register operations. The LDA, LDX, LDY, STA, STX, STY are the read & write instructions of the 6502 ISA. The last letter (A, X, or Y) indicates the register that will be used in the instruction. LD gets a value from memory, where it gets is determined by the Addressing Mode, and the retrieved value is stored in the specified register. For example, LDA # gets a value from the immediate address and stored it in the accumulator register. Most of the CPU’s instructions are quite simple functionality-wise so I won’t cover them. This site, as well the NesDev Wiki fully documents the 6502 ISA.

Addressing Mode

Certain instructions need operand(s) to execute. For example, a load instruction needs an address to specify where to load the value from memory.

ℹ️
In the code implementation in this section, you will notice that I increment PC within the addressing mode logic. This is not accurate but serves my purposes, as I do not increment PC in the previous instruction.
Immediate Addressing
PC++;
address = PC; 
Immediate Addressing
The immediate address is the value itself, located at PC + 1. This addressing mode fetches the operand directly following the opcode.
  graph TD
    A["Mnemonic: LDA #$42"]
    B["$A9 $42"] 
    C["A ← $42"]
    B --> C
Zero Page Addressing
PC++;
address = memory[PC];
Zero Page Addressing
Zero page refers to memory addresses from $0000 to $00FF. This region is faster for the CPU to access. The instruction uses a single-byte operand (from PC + 1) as the full address.
  graph TD 
    A["Mnemonic: LDA $42"]
    B["$A5 $42"] 
    C["Address [$42] → $10"]
    D["A ← $10"]

    B --> C --> D
Zero Page,X
PC++;
address = (memory[PC] + X) & 0xFF;
Zero Page,X Addressing
This uses the value stored in memory at the address of (PC + 1) incremented by the value in register X. It wraps around within the zero page using & 0xFF.
  graph TD 
    M["Mnemonic: LDA $42,X"]
    I["$B5 $42"] 
    Xreg["X → $04"] 
    AddrCalc["Effective address = ($42 + $04) & 0xFF = $46"]
    MemRead["Address [$46] → $99"]
    Result["A ← $99"]

    I --> AddrCalc --> MemRead --> Result
Zero Page,Y
PC++;
address = (memory[PC] + Y) & 0xFF;
Zero Page,Y Addressing
Works the same way as Zero Page,X, but adds the Y register instead. Rarely used in 6502.
Absolute Addressing
lsb = memory[++PC];
msb = memory[++PC];
address = (msb << 8) | lsb;
Absolute Addressing
Absolute addressing requires two operands (bytes) from memory to form a full 16-bit address. The low byte (LSB) comes first, then the high byte (MSB). These are combined using bit shifting and bitwise OR.
  graph TD 
    M["Mnemonic: LDA $1234"]
    I["$AD $34 $12"] 
    Addr["Effective address = $1234"]
    MemRead["Address [$1234] → $56"]
    Result["A ← $56"]

    I --> Addr --> MemRead --> Result
Absolute,X
lsb = memory[++PC];
msb = memory[++PC];
address = ((msb << 8) | lsb) + X;
Absolute,Y
lsb = memory[++PC];
msb = memory[++PC];
address = ((msb << 8) | lsb) + Y;
(Indirect,X)
PC++;
base = (memory[PC] + X) & 0xFF;
lo = memory[base];
hi = memory[(base + 1) & 0xFF];
address = (hi << 8) | lo;
(Indirect,X) Addressing
First adds X to the operand (from zero page), then uses the result as a pointer to a 16-bit address. The pointer wraps in zero page.
  graph TD
    M["Mnemonic: LDA ($20,X)"]
    I["$A1 $20"]
    Xreg["X → $04"]
    AddrCalc["$20 + $04 = $24"]
    PtrLow["Address [$24] → $78 (low byte)"]
    PtrHigh["Address [$24 + $1] → $56 (high byte)"]
    EffAddrLow["Effective address low byte: $78"]
    EffAddrHigh["Effective address high byte: $56"]
    EffAddr["Effective address = $5678"]
    MemRead["Address [$5678] → $9A"]
    Result["Load A ← $9A"]

    M --> I --> Xreg --> AddrCalc
    AddrCalc --> PtrLow --> PtrHigh
    PtrLow --> EffAddrLow
    PtrHigh --> EffAddrHigh
    EffAddrLow --> EffAddr
    EffAddrHigh --> EffAddr
    EffAddr --> MemRead --> Result
(Indirect),Y
PC++;
ptr = memory[PC];
lo = memory[ptr];
hi = memory[(ptr + 1) & 0xFF];
address = ((hi << 8) | lo) + Y;
(Indirect),Y Addressing
Reads a pointer from zero page, builds a base address from it, then adds the Y register to get the effective address.
  graph TD
    M["Mnemonic: LDA ($20),Y"]
    I["$B1 $20"]
    PtrLow["Read low byte at Address [$20]: $34"]
    PtrHigh["Read high byte at Address [$20 + 1]: $12"]
    Yreg["Y register → $05"]
    AddrCombine["Combine bytes: $1234"]
    EffAddrCalc["Add Y: $1234 + $05 = $1239"]
    MemRead["Address [$1239] → $BC"]
    Result["A ← $BC"]

    M --> I --> PtrLow --> AddrCombine 
    I --> PtrHigh --> AddrCombine 
    AddrCombine --> EffAddrCalc --> MemRead --> Result
    Yreg --> EffAddrCalc

PPU - The Picture Processing Unit

This component of the NES is what generates the graphics. The PPU was definitely the most confusing and time consuming part of this emulation process. The documentation I was able to find was in-depth on the individual components of the PPU, but I was confused as to what the PPU was doing at a high level. If you’re having a hard time understanding what the PPU is doing, I hope this initial high-level overview will help.

The main goal of the PPU is to draw a frame, pixel by pixel. The visible frame is the actual display, 256 by 240 pixels. This can also be thought of as 32 by 30 tiles, with each tile being 8 by 8 pixels. To draw each pixel, the PPU will have to know what colour each pixel is. The PPU does 4 memory fetches every 8 cycles to get the colour of a pixel. The first memory fetch is from the Nametable using the value in the PPU’s v register. The Nametable contains the tile ID. For example, lets say the Nametable returns $13 (19 in base 10), this means we have to use the 19th tile from the CHR, which is stored in the Pattern Table. The next two cyles are spent on a memory fetch from the Attribute Table. The Attribute Table determines which Palette out of the four palettes to use for a 4x4 tile region. The next 4 cycles are spent on the Pattern Table. A palette is composed of four possible colours to use. The Pattern Table will determine which of the four colours in a palette to use. All these memory fetches will let us build an address that reveals which colour from the Palette memory to use.

  flowchart TD
    A["PPU Begins Rendering Pixel"] --> V(["v = $2000"])

    V --> B["Fetch Tile ID from Nametable"]
    B --> C(["Tile ID = $13"])

    V --> AT["Fetch Palette Index from Attribute Table"]
    AT --> G(["Palette Index = 3"])

    C --> D["Fetch Pattern from Pattern Table"]
    D --> E(["Tile #19 Pixel Data"])

    E --> F["Combine Pattern Data + Palette Index"]
    G --> F

    F --> H["Fetch Color from Palette Memory"]
    H --> I(["Final Pixel Color = #12"])

This diagram simplifies the flow a bit too much. There are some crucial nuances that the Wiki will cover, and some that I’ll try to cover below.

PPU Memory Mapped & Internal Registers

Memory mapped registers are essentially the channels of communication between the CPU and PPU. These two components have their own memory space separate from each other. For these two components to communicate with each other, the PPU has eight registers which are mapped to certain addresses of the CPU in its memory space. For example, the PPU has a register called PPUCTRL. The CPU is writing to this register when it writes to the address $2000 in its memory space. I won’t cover in detail the purpose of each memory mapped register as the NesDev Wiki details this extensively.

  flowchart LR
    A["A → $80"] 
CPU["STA $2000"]
    CPU --> Write["Stores $80 in CPU's memory address $2000"]
    Map["$2000 is Memory-Mapped to PPUCTRL Register"] --> PPUCTRL["PPUCTRL Register ← $80"]
    PPUCTRL --> PPU["PPU Updates Control Flags"]

Frames, Scanlines, & Cycles

A frame is an instance of the screen. Each frame is composed of 262 scanlines, which can be thought of as rows. Earlier, I mentioned the visible frame, which can also be referred to as the picture region. But I also said there were 240 rows. This discrepancy is due to something called the Vertical Blanking Lines (VBlank). The PPU has essentially finished drawing the frame by scanline 239, but there are additional scanlines, (241 to 260) where the PPU is idle. During this period of time, the CPU can commit operations that can alter PPU’s behaviour, such as changing which sprites to appear on the next frame, or change the palette colours. Scanline 0 to scanline 239 are referred to as the Visible Scanlines. Scanline 240 is the Post-render scanline, where the PPU is idle. The VBlank flag isn’t yet so this is not VBlank. At scanline 241, if enabled in PPUCTRL the PPU triggers an NMI, which notifies the CPU that the PPU is in VBlank.

Scanline # Description
0–239 Visible Scanlines
240 Post-render (idle)
241–260 VBlank (Vertical Blanking)
261 Pre-render scanline

Each scanline requires 341 PPU clock cycles, with each clock cycle producing one pixel. You may notices that there are an additional 86 cycles to draw the 256 pixels per row. Cycle 0 is idle, it does nothing. Cycle 1 to Cycle 256 is where the 4 memory fetches I mentioned above are executed. Cycle 257-320 are pre-fetching the first two tiles of the next scanline. Cycle 337-340 also does memory fetches, but their purpose are unknown and for the sake of emulation can be ignored.

Cycle Range Purpose / Memory Fetch Pattern
0 Idle (no operation)
1–256 Pixel rendering (4 memory fetches every 8 cycles)
257–320 Prefetch tiles for next scanline
321–336 Pipeline warm-up / dummy fetches
337–340 Unused or unknown-purpose fetches (safe to ignore)
Cycle % 8 Memory Fetch Target Description
1 Nametable byte Tile ID for the current tile
3 Attribute table byte Palette attribute for 4×4 tile region
5 Pattern table low byte LSB of tile graphics from CHR
7 Pattern table high byte MSB of tile graphics from CHR

Memory Reads

Nametable

A nametable dictates how the background is laid out. Each nametable is composed of 1024 bytes, each byte storing the tile ID of each of the 30x32 tiles. So 960 bytes (30 * 32) are used to indicate which tile to use from the pattern table, the remaining 64 bytes are used by each nametable’s Attribute Table. The CPU will write to this memory area ($2000 - $2FFF) as instructed by the PRG data, so the CPU (indirectly) directed by the PRG data determines how the background is laid out.

inline uint8_t read_mem(PPU *ppu, uint16_t addr) {
  
  ... 

  else if (addr < 0x3F00) {
    // Nametable region with mirroring
    addr &= 0x2FFF;
    uint16_t offset = addr & 0xFFF;

    if (nametable_mirror_flag) { // vertical
      return ppu_memory[offset % 0x800];
    } else { // horizontal
      if (offset < 0x800) {
        // NT0 or NT1 (2000 or 2400)
        return ppu_memory[(0x2000 | (offset % 0x400))];
      } else {
        if (offset < 0xC00) {
          return ppu_memory[addr];
        } else {
          return ppu_memory[0x2000 | (offset - 0x400)];
        }
      }
    }
  }

  ...

}

The function above is how I fetch the nametable byte using the value in register v. The code handles PPU memory reads in the nametable region ($2000–$2FFF), applying mirroring rules. For vertical mirroring, it mirrors every 2 KB. For horizontal, it redirects specific 1 KB chunks to simulate NT0 and NT1 sharing. The goal is to map the 4 nametable address ranges into the 2 KB of physical VRAM available in the NES.

void background_ppu_render(PPU *ppu) {
  
  ...

  switch (ppu->current_scanline_cycle % 8) {

  // Fetch Nametable byte
  case 1:

    ppu->bg_pipeline.name_table_byte = fetch_name_table_byte(ppu);
    break;

  ...

  }
}

The code above is how the function (fetch_name_table_byte just calls read_mem) is used in the actual background rendering pipeline. The modulo switch statement lets us know which cycle we are on in the 8 cycle pipeline of rendering a tile.

Attribute Table

As mentioned before, the last 64 bytes of each nametable is used as the Attribute table of the corresponding nametable. Each byte controls the palette of a 4x4 tile region of the nametable. Each 2 bits of the the tile controls a 2x2 tile region.

Bits Bit Position Affected Quadrant Description
1 0 Bits 0–1 Bottom Right (BR) Palette index for bottom-right
3 2 Bits 2–3 Bottom Left (BL) Palette index for bottom-left
5 4 Bits 4–5 Top Right (TR) Palette index for top-right
7 6 Bits 6–7 Top Left (TL) Palette index for top-left
void background_ppu_render(PPU *ppu) {
  
  ...

  case 3:
    ppu->bg_pipeline.attribute_byte = fetch_attr_table_byte(ppu);
    // Determines which of the four areas of the attribute byte to use
    // Each attribute byte covers 4x4 tiles → each quadrant is 2x2 tiles
    uint8_t quadrant_x = ((ppu->v & 0x1F) >> 1) & 1; // 0 = left, 1 = right
    uint8_t quadrant_y =
        (((ppu->v >> 5) & 0x1F) >> 1) & 1; // 0 = top,  1 = bottom

    uint8_t shift = (quadrant_y << 1 | quadrant_x) * 2; // 0, 2, 4, or 6
    ppu->bg_pipeline.palette_index =
        (ppu->bg_pipeline.attribute_byte >> shift) & 0x3;
    break;

  ...

}

The code above is the logic I use to parse out which part of the attribute byte I want to use depending on the register v pointer.

ℹ️

It’s crucial that the logic used here is sound. Previously, I was using if statements to get parse out the attribute byte shown below.

  uint8_t tile_area_horizontal = (ppu->v & 0x00F) % 4;
  uint8_t tile_area_vertical = ((ppu->v & 0x0F0) / 0x20) % 4;

  if (tile_area_vertical == 0 && tile_area_horizontal == 0) {
      // Top left quadrant (bit 1 & 0)
      ppu->palette_index = ppu->attribute_byte & 0x03; :> [!WARNING]
      > 
  } else if (tile_area_vertical == 0 && tile_area_horizontal == 1) {
      // Top right quadrant (bit 3 & 2)
      ppu->palette_index = (ppu->attribute_byte & 0x0C) >> 2;
  } else if (tile_area_vertical == 1 && tile_area_horizontal == 0) {
      // Bottom left quadrant (bit 5 & 4)
      ppu->palette_index = (ppu->attribute_byte & 0x30) >> 4;
  } else if (tile_area_vertical == 1 && tile_area_horizontal == 1) {
      // Bottom right quadrant (bit 7 6 6)
      ppu->palette_index = (ppu->attribute_byte & 0xC0) >> 6;
  }

The code wasn’t computing the tile’s quadrant within the 4-tile attribute area correctly. As a result, the following occurred. There was a bag of other issues with my implementation at the time, but the incorrect colours were mainly due to how I was parsing the attribute byte.

Pattern Table (CHR)

Each tile is composed of 16 bytes, made up of two planes. The first plane controls bit 0 of pixel’s palette index, and the corresponding bit in the second plane controls bit 1. The diagram below will hopefully clear up any confusion. In the example below, PPUCTRL sets the base address of the pattern table to 0x1000, and the nametable byte is 0x3A.

  flowchart TD
    NTB["Nametable Byte: 0x2000:0x3A"]
    PTAddr["Pattern Table Addr: 0x1000 + 0x3A×16 + scanline"]
    PTByte1["Pattern Table Byte Lo: 0x13A0:0x5C"]
    PTByte2["Pattern Table Byte Hi: 0x13A1:0xA3"]

    NTB --> PTAddr --> PTByte1
    PTAddr --> PTByte2
Pixel # PT Hi Bit PT Lo Bit Palette Index (binary) Palette Index (decimal)
7 1 0 10 2
6 0 1 01 1
5 1 0 10 2
4 0 1 01 1
3 0 1 01 1
2 0 1 01 1
1 1 0 10 2
0 1 0 10 2
void background_ppu_render(PPU *ppu) {

  ...

  case 5:
    ppu->bg_pipeline.pattern_table_lsb =
        read_mem(ppu, (ppu->bg_pipeline.name_table_byte * 16) + fine_y);
    break;

  // Fetch high byte
  case 7:
    ppu->bg_pipeline.pattern_table_msb =
        read_mem(ppu, (ppu->bg_pipeline.name_table_byte * 16) + 8 + fine_y);
    break;

  ...

}
inline uint8_t read_mem(PPU *ppu, uint16_t addr) {

  ...

  if (addr < 0x2000) {
    // Pattern tables
    uint16_t pt_base_addr = 0;

    // if (ppu->drawing_bg_flag) {
    pt_base_addr = (ppu->PPUCTRL & 0x10) ? 0x1000 : 0x0000;
    //} else {
    //  // TODO: Handle 8x16 sprites separately in sprite code
    //  pt_base_addr = (ppu->PPUCTRL & 0x08) ? 0x1000 : 0x0000;
    //}
    return ppu_memory[pt_base_addr | (addr & 0x0FFF)];
  }

  ...

}
ℹ️

For the longest time, my emulator was outputting garbage visuals and it was due to how I was fetching my pattern table data. Currently, my implementation is fine (with the limited testing I have done so far, so this statement is subject to change). But initially, I wrote the code as below.

           
  ...

  case 5:

    if (drawing_bg_flag) {
        ppu->pattern_table_lsb = read_mem(ppu, (ppu->name_table_byte * 16) + ppu->scanline);
        ppu->pattern_table_msb = read_mem(ppu, (ppu->name_table_byte * 16) + ppu->scanline + 8);

    } else {
        // byte 1 is tile number
        ppu->pattern_table_lsb = read_mem(ppu, (oam_buffer_latches[sprite_index + 1] * 16) + ppu->scanline);
        ppu->pattern_table_msb = read_mem(ppu, (oam_buffer_latches[sprite_index + 1] * 16) + ppu->scanline + 8);
    }

    ...
    
  break;

The crucial error is in the address I’m reading from.

ppu->pattern_table_lsb = read_mem(ppu, (ppu->name_table_byte * 16) + ppu->scanline);
ppu->pattern_table_msb = read_mem(ppu, (ppu->name_table_byte * 16) + ppu->scanline + 8);

I was using the incorrect address here! My goal was to increment the name_table_byte by the current row being rendered in a tile, which can range in value from 0 to 8. However, I was increasing by the total number of scanlines! This outputted some interesting visuals when I tried to run the Donkey Kong rom.

Palette

The PPU memory (0x3F00 - 0x3F1F) contains four palettes, with each palette containing four colours. Memory address $3F00 to $3F0F contains the colours that background will use, while memory address $3F10 to $3F1F contains the colours used by sprites. The table below illustrates this.

Address Range Purpose Notes
$3F00 Universal background color Shared by all background palettes
$3F01–$3F03 Background palette 0 3 colors (color 0 is $3F00)
$3F04–$3F06 Background palette 1
$3F07–$3F09 Background palette 2
$3F0A–$3F0C Background palette 3
$3F0D–$3F0F Sprite palette 0 $3F0F may mirror $3F10
$3F10–$3F12 Sprite palette 1 $3F10 often mirrors $3F00
$3F13–$3F15 Sprite palette 2
$3F16–$3F18 Sprite palette 3
$3F19–$3F1F Mirrors Mirrors previous palette entries

As I mentioned before, each memory fetch plays a part in building the address that will fetch a colour for a pixel.

Bit Meaning
3 Sprite (1) or Background (0)
2-1 Attribute bits (palette selection: 0–3)
0 Pixel value from pattern table (0 or 1)

Hopefully everything comes together with this. It took me a while of reading through the NesDev Wiki forums and documentation to piece it together myself.

void background_ppu_render(PPU *ppu) {

    ...

    case 0:

    ...

    unsigned char is_pre_fetch =
        ppu->current_scanline_cycle >= 321 && ppu->current_scanline_cycle <= 336
            ? 1
            : 0;

    int row = is_pre_fetch ? ppu->scanline + 1 : ppu->scanline;
    int column_base = is_pre_fetch ? ppu->current_scanline_cycle - 328
                                   : ppu->current_scanline_cycle + 8;

    for (int i = 0; i < TILE_SIZE; i++) {
      uint8_t bit = 7 - i;
      uint8_t bg_lo = (ppu->bg_pipeline.pattern_table_lsb >> bit) & 1;
      uint8_t bg_hi = (ppu->bg_pipeline.pattern_table_msb >> bit) & 1;

      ppu->bg_pipeline.tile_pixel_value[i] = (bg_hi << 1) | bg_lo;

      uint8_t palette_index = ppu->bg_pipeline.palette_index;

      uint16_t pal_addr =
          0x3F00 | (palette_index << 2) | ppu->bg_pipeline.tile_pixel_value[i];

      uint8_t palette_data = read_mem(ppu, pal_addr);
      uint8_t alpha = (ppu->bg_pipeline.tile_pixel_value[i] == 0) ? 0x00 : 0xFF;

      uint8_t *color = &ppu_palette[palette_data * 3];
      uint8_t r = color[0], g = color[1], b = color[2];

      int column = column_base + i;

      if (column >= 256 || row >= 240)
        break;

      ppu->frame_buffer[row][column] = (r << 24) | (g << 16) | (b << 8) | alpha;
    }
    break;

    ...

}
  sequenceDiagram
    participant PPU
    participant PPU_Memory
    participant FrameBuffer
    participant Frontend 

    loop 8-cycle fetch
        PPU->>PPU_Memory: Fetch nametable byte
        PPU->>PPU_Memory: Fetch attribute byte
        PPU->>PPU_Memory: Fetch pattern LSB
        PPU->>PPU_Memory: Fetch pattern MSB
        PPU->>FrameBuffer: Fill up framebuffer with 8 pixels
    end

    FrameBuffer->>Frontend: Present frame buffer to render

The Nes outputs a pixel cycle by cycle. I’m doing it a little differently. At every 8 cycle from cycle 1 - 256 (as well as the prefetch cycle 321 - 336), I use the information I have fetched the past 8 cycles rendering the past 8 pixels. Once I have a frame’s worth of pixels filled up in the buffer, I pass it off to Frontend, which is an abstraction I made that does all the user interactions, including rendering the graphics.

OAM (Sprites)

The OAM is another internal memory of the PPU that contains up to 64 sprites, with each sprite taking up 4 bytes. The table below can summarize what each of the 4 bytes determine.

Byte Description Details
0 Y Position Y coordinate of top of sprite minus 1 (due to 1-scanline delay). Values #$EF–#$FF hide the sprite offscreen.
1 Tile Index Number Tile number in pattern table. For 8x16 sprites, bit 0 selects tile bank ($0000 or $1000).
Bit 0: Bank ($0000 or $1000); Bits 1-7: Tile number top of sprite (0–254).
2 Attributes Bit 0-1: Palette (4–7) of sprite
Bit 2-4: Unimplemented (always read 0)
Bit 5: Priority (0=front of BG, 1=behind BG)
Bit 6: Flip sprite horizontally
Bit 7: Flip sprite vertically
3 X Position X coordinate of left side of sprite

From this table, it can be discerned that the same rendering process done for the background pipeline cannot be used for the sprite. At least that was my conclusion. Instead I made a different function to render and detect the sprites. Honestly, I’m still not fully confident in my implementation with sprite rendering. It works with the games I’ve tried so far but I suggest you take this section with a grain of salt, as its most likely subject to change when I realize I did something wrong. Still, I believe my understanding should be able to help somewhat if you’re absolutely lost.

To start with, each scanline from cycle 65-256 looks skims through the OAM memory to fill up a buffer with any sprite that will appear on the next scanline. At HBlank, after the visible rendering cycles of a scanline, this buffer is transferred to the appropriate shift and rendering registers to be rendered on the next scanline. To represent this, I employ two arrays. One will fill up the sprites of the next scanline, and at HBlank, will transfer its content to the second array so that it can be used to render the sprites on the next scanline.

  flowchart TD
    Start([Start: n=0, m=0])

    ReadY["Read sprite Y-coordinate OAM[n][0]"]
    CheckInRange{"Y in range?"}
    CopySprite["Copy sprite data to secondary OAM if < 8 sprites found"]
    WriteIgnore["Ignore write if secondary OAM full"]
    IncrementN["Increment n"]
    CheckNZero{"All 64 sprites checked (n=0)?"}
    CheckSpriteCount{"< 8 sprites found?"}
    EvaluateSprites["Evaluate sprites for overflow"]
    SetOverflow["Set sprite overflow flag if > 8 sprites"]
    IncrementM["Increment m (hardware bug)"]
    ContinueEval["Continue evaluation"]
    CopyFail["Fail to copy sprite to secondary OAM"]
    IncrementNUntilHBLANK["Increment n until HBLANK"]
    End([End at HBLANK])

    Start --> ReadY
    ReadY --> CheckInRange

    CheckInRange -- Yes --> CopySprite
    CopySprite --> IncrementN
    CheckInRange -- No --> IncrementN

    IncrementN --> CheckNZero
    CheckNZero -- No --> CheckSpriteCount
    CheckNZero -- Yes --> CopyFail

    CheckSpriteCount -- Yes --> ReadY
    CheckSpriteCount -- No --> EvaluateSprites

    EvaluateSprites --> SetOverflow
    SetOverflow --> IncrementM
    IncrementM --> ContinueEval
    ContinueEval --> EvaluateSprites

    CopyFail --> IncrementNUntilHBLANK --> End

If this confuses you, don’t worry cause it confused me too. Just take your time reading through the wiki.

This is the code I was able to come up with my current understanding.

void sprite_detect(PPU *ppu) {
  // Reset secondary OAM memory at cycle 1
  if (ppu->current_scanline_cycle == 1) {
    memset(oam_memory_secondary, 0xFF, OAM_SECONDARY_SIZE);
  }

  if (ppu->current_scanline_cycle == 336) {
    ppu->copy_sprite_flag = 0;
    ppu->index_of_sprite = 0;
    ppu->sprite_evaluation_index = 0;
    ppu->oam_memory_top = 0;
  }

  // Bit 5 of PPUCTRL determines sprite size
  int sprite_height = ppu->PPUCTRL & 0x20 ? 16 : 8;

  if (ppu->current_scanline_cycle >= 65 && ppu->current_scanline_cycle <= 256) {
    // int index = ppu->sprite_evaluation_index * 4 + ppu->index_of_sprite;
    int index = ppu->sprite_evaluation_index * 4;
    if (index > OAM_SIZE - 1)
      return;
    // Odd cycle; read OAM
    if (ppu->current_scanline_cycle % 2 == 1) {
      if (ppu->scanline >= oam_memory[index] &&
          ppu->scanline < oam_memory[index] + sprite_height) {
        ppu->copy_sprite_flag = 1;
      }
    } else {
      if (ppu->copy_sprite_flag && ppu->oam_memory_top <= 31) {
        oam_memory_secondary[ppu->oam_memory_top++] = oam_memory[index];
        oam_memory_secondary[ppu->oam_memory_top++] = oam_memory[index + 1];
        oam_memory_secondary[ppu->oam_memory_top++] = oam_memory[index + 2];
        oam_memory_secondary[ppu->oam_memory_top++] = oam_memory[index + 3];
      }
      if (ppu->sprite_evaluation_index == 64) {
        ppu->sprite_evaluation_index = -1;
        ppu->PPUSTATUS |= 0x20;
      }
      ppu->sprite_evaluation_index++;
      ppu->copy_sprite_flag = 0;
    }
  }
}



void ppu_exec_visible_scanline(PPU *ppu) {

  ...

  oam_buffer_latches[(ppu->current_scanline_cycle - 257) % 32] =
    oam_memory_secondary[(ppu->current_scanline_cycle - 257) % 32];
  
  ...
}

The actual rendering code is basically identical to the background rendering. It just uses data from the oam_buffer_latches instead ofppu_memory.

Loop v

v is the current VRAM address used by the PPU to fetch the nametable, and attribute data. It’s a 15-bit register that changes during the rendering process, and each bit of this register has a significance to which tile to retrieve.

Bit(s) Name Description
14–12 fine Y Row offset within a tile (0–7)
11 nametable Y Selects vertical nametable (0 or 1)
10 nametable X Selects horizontal nametable (0 or 1)
9–5 coarse Y Tile row index on the screen (0–29)
4–0 coarse X Tile column index on the screen (0–31)

To get to this point took me quite a long time, mostly a lot of painful debugging. But after getting some coherent gameplay, it was all worth it.

APU - Audio Processing Unit

Now its time to add audio. I found this part to be not too hard compared to the PPU, but the theory behind it can be quite confusing. How does a digital signal get converted to audio? Some basic understanding of digital signal processing will help but basically, sound is just a vibration in the air. The APU generates samples of this vibration as a digital representation. My job is to simply collect these samples and use an existing library, SDL2 in my case, to generate the actual audio. The APU is comprised of five channels, 2 pulse channels, 1 triangle channel, 1 noise channel, and a delta modulation channel. The output of each of these channels are sent to a mixer. Currently, I’m using a simple averaging mixer, but I do plan on exploring other formulas in the future.

uint8_t apu_output(APU *apu) {
  uint8_t p1 = apu_pulse_output(apu->pulse1);
  uint8_t p2 = apu_pulse_output(apu->pulse2);
  uint8_t tri = apu_triangle_output(apu->triangle);
  uint8_t noise = apu_noise_output(apu->noise);
  // At this point, I still have not implemented the Delta modulation channel.
  
  return (p1 + p2 + tri + noise) / 3;
}

Frame Counter

Frame counter basically acts like a clock generator. Similar to how everything acts on the PPU cycle in the PPU, certain components act on the clock signal generated by the frame counter. The frame counter itself derives its generated clock signals by the APU cycle. For example, the frame counter will generate a clock signal at 3728 APU cycles. This would clock the Envelopes and Triangle's linear counter.

Length Counter, Envelope, Sweep

Before I cover the actual channels, I’ll start of with the components they’re comprised. Namely, the length counter, envelope generator and the sweep unit.

The length counter acts like timer. Once it counts down to zero, the sound will stop. The CPU writes a 5 bit values to one of the APU’s memory mapped register, and this value serves as an index to a duration. For example, lets say the value is $1F. The corresponding duration would be 30, which is clocked by the Frame Counter.

  flowchart LR 
    CPU["CPU writes 5-bit value (e.g. $1F) to APU register"] --> IDX["Value used as index into length table"]
    IDX --> DUR["Length counter initialized (e.g. 30)"]
    DUR --> FC["Frame Counter ticks at 240Hz"]
    FC --> DEC["Length counter decrements each tick"]
    DEC --> CHECK{"Length counter == 0?"}
    CHECK -- Yes --> STOP["Sound channel muted"]
    CHECK -- No --> DEC

The Envelope controls the volume output. Depending on certain flags, the volume can either decay with an optional looping to repeat the decay, or it can be a constant volume.

The Sweep unit can adjust a channel’s period up or down. Shorter period creates a higher pitch, while longer period creates a lower pitch.

Pulse Channels

This is a simple square wave generators with adjustable duty cycle, volume and sweep. A period is the length of the smallest unrepeatable values in a signal. Duty cycle determines during the period when a value is 0, or when the signal is 1. Volume can either be constant, or have a decay. It is composed of a length counter, envelope, and a sweep unit.

ℹ️

I have included a little clip to show how important it is to nail the logic of each component. Have a listen!

It took me a long time to pinpoint this issue, as it was due to just a single line.

  if (envelope->constant_volume_flag) {
    envelope->volume = envelope->divider->P;
  } else {
    envelope->volume = envelope->divider->counter;
  }

If the constant volume flag was not set, I was resetting the volume to the value of the divider. This is why each note sounds choppy, as if it the volume restarts! The corrected code is as shown below.

  if (envelope->constant_volume_flag) {
    envelope->volume = envelope->divider->P;
  } else {
    envelope->volume = envelope->envelope_decay_level_counter;
  }

This results in the following.

Triangle Channel

The Triangle channel is composed of a timer, length counter, and a linear counter, and some flags. A linear counter is basically a length counter, but its value is directly used, unlike a length counter which uses an index to get its duration.

Unlike the pulse, the triangle channel has no envelope. Instead, when the timer is clocked, it traverses through a 32 step sequence of values to output to the mixer.

15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0
 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15

Noise

The Noise channel is composed of a envelope generator, a timer, a Linear Feedback Shift Register, and a length counter. When clocked by the timer, a 1 bit feedback value is calculated, the shift register gets shifted to the right by 1 bit, and the leftmost bit is set to the feedback. The mixer will get the value generated by the envelope if bit 0 if the shift register is 1. It will get zero otherwise.

  flowchart TD
    R["Shift Register (R): 100000000000001"]

    Timer["Timer clocks shift register"] --> Feedback["XOR bit 0 (1) and bit 6 (0) equals 1"]

    Feedback --> ShiftAfter["After shift right by 1 bit: 010000000000000"]

    ShiftAfter --> SetBit14["Set bit 14 = feedback (1)"]

    SetBit14 --> Result["New register value: 1100000000000001"]

    R

DMC

To be implemented.



The Frontend - SDL2

Frontend is just the name of my abstraction for SDL2. This is the library that handles all the rendering, control inputs, and audio. It’s a very popular library with a lot of examples, so it shouldn’t be too difficult getting something up and running using this library.

Visuals

I employ a flag to let the renderer know if the buffer is ready. When scanline 260 is reached, the PPU will set this flag.

void ppu_execute_cycle(PPU *ppu) {

  ...

    if (ppu->scanline > 260) {
      ppu->scanline = -1;
      ppu->update_graphics = 1;
    }

  ...

}

Once the flag is set, the Frontend library will update SDL2 with the PPU’s frame buffer and reset the flag.

while (1) {

  ...

    if (ppu_graphics_flag_set(&ppu)) {
      ppu_graphics_flag_reset(&ppu);

      Frontend_DrawFrame(&frontend, ppu.frame_buffer);
      Frontend_SetFrameTickStart(&frontend);
    }

  ...

}

Controls

The CPU sets the a flag, strobe, to indicate when its ready to poll the user inputs. When the user presses a button on the NES controller, the corresponding bit on a register is set, which is fed to the CPU. This register essentially holds the state of the controller per frame.

    if (cpu_strobe_set(&cpu)) {
      if (Frontend_HandleInput(&frontend) != 0)
        break;
      // cpu.ctrl_latch_state = frontend.controller;
      Frontend_SetController(cpu_ctrl_latch_state(&cpu));
    }

The following code shows what keys I use for control.

const static int keyboard[8] = {
    SDL_SCANCODE_Z,  SDL_SCANCODE_X,    SDL_SCANCODE_C,    SDL_SCANCODE_V,
    SDL_SCANCODE_UP, SDL_SCANCODE_DOWN, SDL_SCANCODE_LEFT, SDL_SCANCODE_RIGHT};

int Frontend_HandleInput(Frontend *frontend) {
  SDL_PumpEvents();
  const Uint8 *keystate = SDL_GetKeyboardState(NULL);
  frontend->controller = 0;

  if (keystate[SDL_SCANCODE_Q])
    return 1;
  if (keystate[keyboard[0]])
    frontend->controller |= NES_A;
  if (keystate[keyboard[1]])
    frontend->controller |= NES_B;
  if (keystate[keyboard[2]])
    frontend->controller |= NES_SELECT;
  if (keystate[keyboard[3]])
    frontend->controller |= NES_START;
  if (keystate[keyboard[4]])
    frontend->controller |= NES_UP;
  if (keystate[keyboard[5]])
    frontend->controller |= NES_DOWN;
  if (keystate[keyboard[6]])
    frontend->controller |= NES_LEFT;
  if (keystate[keyboard[7]])
    frontend->controller |= NES_RIGHT;
  return 0;
}

Audio

The standard frequency used for sampling is 44.1 kHz. With that in mind, we need to collect enough samples to generate a 44.1 kHz audio. However, the CPU frequency is 1.79 MHz. This would generate too many samples, at a frequency that would be inaudible to human ears. As such, I had to downsample this high-frequency signal to the standard 44.1 kHz audio rates, which will still preserve the shape of the NES sound waves while remaining audible. The calculation below illustrates how many samples I would need to get per frame.

Audio Sample Rate=44,100samples/secVideo Frame Rate (NTSC)=60frames/secSamples per Frame=44,10060=735samples/frame \begin{aligned} \text{Audio Sample Rate} &= 44{,}100 \, \text{samples/sec} \\ \text{Video Frame Rate (NTSC)} &= 60 \, \text{frames/sec} \\ \\ \text{Samples per Frame} &= \frac{44{,}100}{60} = 735 \, \text{samples/frame} \end{aligned}

The calculation below illustrates how many times I’d have to downsample the NES APU signal to get the audio sample rate of 44.1 kHz. What this means is that I have to call my apu mixer function once approximately every 40 CPU cycles.

CPU Clock Frequency=1,789,773 HzAudio Sample Rate=44,100 HzDownsampling Factor=CPU Clock FrequencyAudio Sample Rate=1,789,77344,10040.57 \begin{aligned} \text{CPU Clock Frequency} &= 1{,}789{,}773 \text{ Hz} \\ \text{Audio Sample Rate} &= 44{,}100 \text{ Hz} \\ \\ \text{Downsampling Factor} &= \frac{\text{CPU Clock Frequency}}{\text{Audio Sample Rate}} \\ &= \frac{1{,}789{,}773}{44{,}100} \\ &\approx 40.57 \end{aligned}

My implementation of the calculation presented above is shown in the code below. I still call the apu mixer output (apu_output) every CPU cycle, but I average it out over every ~40 CPU cycles. I could simply just disregard the other samples and it call it every 40 cycles, but the sound quality improves drastically with this simple technique. The audio_buffer_add is a function in the Frontend abstraction that adds the samples to a ring buffer.

    for (int i = 0; i < cpu.cycles; i++) {
      apu_execute(&apu);

      sample_accumulator += apu_output(&apu);
      apu_accumulator += 1.0;

      if (apu_accumulator >= apu_ticks_per_sample) {
        apu_accumulator -= apu_ticks_per_sample;

        int16_t sample =
            ((int)(sample_accumulator / apu_ticks_per_sample) - 8) * 4096;

        audio_buffer_add(sample);

        sample_accumulator = 0.0;
      }
    }

The code below shows how SDL2 initializes the audio. It sets a frequency of 44100 Hz, expects 735 samples per buffer, repeatedly calls audio_callback (the function the parses through the samples in the ring buffer), and uses signed 16 bit.

SDL_AudioSpec spec = {
      .freq = 44100,
      .format = AUDIO_S16SYS,
      .channels = 1,
      .samples = 735,
      .callback = audio_callback,
      .userdata = NULL,
  };

...

void audio_callback(void *userdata, Uint8 *stream, int len) {
  int16_t *out = (int16_t *)stream;
  int samples = len / sizeof(int16_t);

  for (int i = 0; i < samples; ++i) {
    if (read_pos != write_pos) {
      out[i] = audio_buffer[read_pos];
      read_pos = (read_pos + 1) % AUDIO_BUFFER_SIZE;
    } else {
      out[i] = 0;
    }
  }

Wrap up

That’s about it. I still have some stuff to do, like finishing up the DMC, trying out different apu mixer formulas, implement other NROM mappers. But as to the core functionality, I’m basically done. Of course, I still want to test other NROM0 ROMs, and I’m sure I’ll find something wrong about my code down the line. Overall, this was a pretty fun project. If you want to try something that will get you familiarized with how computer works, buiding any emulator will definitely do it.