*******************************
*NTSC 2C02 technical operation*
*******************************
Brad Taylor (big_time_software@hotmail.com)

1st release: Sept 25th, Y2K
2nd release: Jan  27th, 2K3
3rd release: Feb   4th, 2K3


This document describes the low-level operation and technical details of the 
2C02, the NES's PPU. In general, it contains important information in 
regards to PPU timing, which no NES coder/emulator author should be without. 
This document assumes that you already understand the basics of how the PPU 
works, like how the playfield/object images are generated, and the behaviour 
of scroll/address counters during playfield rendering.

A lot of the concepts behind how the PPU works described here have been 
extracted from Nintendo's patent documentation (U.S.#4,824,106). With block 
diagrams of the PPU's architecture (and even some schematics), these papers 
will definitely aid in the comprehension of this complex device.

Since the first release, this document has been given a major overhaul. Most 
sections of the document have been reworked, and new information has been 
added just about everywhere. If you've read the old version of this document 
before, I recommend that you read this new one in its entirety; there's new 
information even in sections which may look like they haven't changed much.

Topics discussed hereon are as follows.

- Video signal generation
- PPU base timing
- Miscellaneous PPU info
- PPU memory access cycles
- Frame rendering details
- Scanline rendering details
- In-range object evaluation
- Details of playfield render pipeline
- Details of object pattern fetch & render
- Extra cycle frames
- The MMC3's scanline counter
- PPU pixel priority quirk
- Graphical enhancements


+-------+
|History|
+-------+
On the weekend of Sept. 25th, Y2K, I set up an experiment with my NTSC NES MB 
& my PC so I could reverse-engineer the PPU's timing. What I did was (using a PC 
interface) analyse the changes that occur on the PPU's address and data pins 
on every rising & falling edge of the PPU's clock. I was not planning on 
removing the PPU from the motherboard (yet), so basically I just kept 
everything intact (minus the stuff I added onto the MB so I could monitor 
the PPU's signals), and popped in a game, so that it would initialize the 
PPU for me (I used DK classics, since it was only taking something like 4 
frames before it was turning on the background/sprites).

The only change I made was taking out the 21 MHz clock generator circuitry. 
To replace the clock signal, I connected a port controlled latch to the 
NES's main clock line instead. Now, by writing a 0 or a 1 out to a PC ISA 
port of my choice (I was using $104), I was able to control the 21 MHz 
clockline of the NES. After I would create a rise or a fall on the NES's 
clock line, I would then read in the data that appeared on the PPU's address 
and data pins, which included monitoring what PPU registers the game 
read/wrote to (& the data that was read/written).


+-----------------------+
|Video signal generation|
+-----------------------+
A 21.48 MHz clock signal is fed into the PPU. This is the NES's main clock 
line, which is shared by the CPU.

Inside the PPU, the 21.48 MHz signal is used to clock a three-stage Johnson 
counter. The complementary outputs of both master and slave portions of each 
stage are used to form 12 mutually exclusive output phases- all 3.58 MHz 
each (the NTSC colorburst). These 12 different phases form the basis of all 
color generation for the PPU's composite video output.
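
As a rough illustration of the divider described above, here is a minimal Python sketch of a 3-stage Johnson (twisted-ring) counter. It only models the 6 counter states, which is where the divide-by-6 from 21.48 MHz to 3.58 MHz comes from; the doubling to 12 phases via the master and slave outputs is not modelled. The function name and the rounded clock constant are assumptions made for this example.

```python
def johnson_states(stages=3):
    """Return the repeating state sequence of a Johnson (twisted-ring) counter."""
    state = (0,) * stages
    seen = []
    while state not in seen:
        seen.append(state)
        # Shift by one stage, feeding the inverted last output back in.
        state = (1 - state[-1],) + state[:-1]
    return seen


MASTER_CLOCK_MHZ = 21.48  # rounded figure quoted in the text

states = johnson_states(3)
# One full trip around the ring is one colorburst period.
colorburst_mhz = MASTER_CLOCK_MHZ / len(states)
```

Each of the 6 states lasts one master clock, so taking both the master and slave halves of each stage (offset by half a master clock) is what would give 12 distinct phases.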

Naturally, when the user programs the lower 4-bits of a palette register, 
they are essentially selecting any 1 of 12 phases to be routed to the PPU's 
video out pin (this corresponds to chrominance (tint/hue) video information) 
when the appropriate pixel indexes it. Other chrominance combinations (0 & 
13) are simply hardwired to a 1 or 0 to generate grayscale pixels.

Bits 4 & 5 of a palette entry select 1 of 4 linear DC voltage offsets to 
apply to the selected chrominance signal (this corresponds to luminance 
(brightness) video information) for a pixel.

Chrominance values 14 & 15 yield a black pixel color, regardless of any 
luminance value setting.

Luminance value 0, mixed with chrominance value 13 yields a "blacker than 
black" pixel color. This super black pixel has an output voltage level close 
to the vertical/horizontal synchronization pulses. Because of this, some 
video monitors will display warped/distorted screens for games which use 
this color for black (Game Genie is the best example of this). Essentially 
what is happening is the video monitor's horizontal timing is compromised by 
what it thinks are extra synchronization pulses in the scanline. This is not 
damaging to the monitors which are affected by it, but use of the super 
black color should be avoided, due to the graphical distortion it causes.

The amplitude of the selected chrominance signal (via the 4 lower bits of a 
palette register) remains constant regardless of bits 4 or 5. Thus it is not 
possible to adjust the saturation level of a particular color.
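
The palette entry rules above can be summarized in a short Python sketch. The function name and the category labels ("color", "gray", "black", "super black") are made up for this example; only the bit fields and the special chrominance values come from the text.

```python
def decode_palette_entry(value):
    """Split a 6-bit palette entry into (chrominance, luminance, kind)."""
    chroma = value & 0x0F          # lower 4 bits: phase select
    luma = (value >> 4) & 0x03     # bits 4 & 5: DC voltage offset
    if chroma >= 14:
        kind = "black"             # black regardless of luminance
    elif chroma == 13 and luma == 0:
        kind = "super black"       # level close to the sync pulses
    elif chroma in (0, 13):
        kind = "gray"              # hardwired to 1 or 0, no phase
    else:
        kind = "color"             # one of the 12 colorburst phases
    return chroma, luma, kind
```

An emulator could use a check like this to warn about (or remap) the super black entry, which is value $0D.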


+---------------+
|PPU base timing|
+---------------+
Other than the 3-stage Johnson counter, the 21.48 MHz signal is not used 
directly by any other PPU hardware. Instead, the signal is divided by 4 to 
get 5.37 MHz, and is used as the smallest unit of timing in the PPU. All 
following references to PPU clock cycle (abbr. "cc") timing in this document 
will be in respect to this timing base, unless otherwise indicated.

- Pixels are rendered at the same rate as the base PPU clock. In other 
words, 1 clock cycle= 1 pixel.

- 341 PPU cc's make up the time of a typical scanline (or 341/3 CPU cc's).

- One frame consists of 262 scanlines. This equals 341*262 PPU cc's per 
frame (divide by 3 for # of CPU cc's).
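
The arithmetic behind these figures, as a small Python sketch. The constant names are invented for this example, and the master clock uses the rounded 21.48 MHz value from this document rather than an exact crystal frequency.

```python
from fractions import Fraction

MASTER_CLOCK_HZ = 21_480_000             # rounded figure from the text
PPU_CLOCK_HZ = MASTER_CLOCK_HZ / 4       # 5.37 MHz, one pixel per cycle

PPU_CC_PER_SCANLINE = 341
SCANLINES_PER_FRAME = 262
PPU_CC_PER_FRAME = PPU_CC_PER_SCANLINE * SCANLINES_PER_FRAME

# The CPU runs at one third of the PPU rate, so these are not integers.
CPU_CC_PER_SCANLINE = Fraction(PPU_CC_PER_SCANLINE, 3)
CPU_CC_PER_FRAME = Fraction(PPU_CC_PER_FRAME, 3)

# Approximate frame rate, ignoring the shortened scanline on odd frames.
FRAMES_PER_SECOND = PPU_CLOCK_HZ / PPU_CC_PER_FRAME
```

Keeping the CPU figures as fractions avoids the drift that comes from rounding 113.67 cycles per scanline to an integer.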


+------------------------+
|PPU memory access cycles|
+------------------------+
All PPU memory access cycles are 2 clocks long, and can be made back-to-back 
(typically done during rendering). Here's how the access breaks down:

At the beginning of the access cycle, PPU address lines 8..13 are updated 
with the target address. This data remains here until the next time an 
access cycle occurs.

The lower 8-bits of the PPU address lines are multiplexed with the data bus, 
to reduce the PPU's pin count. On the first clock cycle of the access, 
A0..A7 are put on the PPU's data bus, and the ALE (address latch enable) 
line is activated for the first half of the cycle. This loads the lower 
8-bit address into an external 8-bit transparent latch strobed by ALE 
(74LS373 is used).

On the second clock cycle, the /RD (or /WR) line is activated, and stays 
active for the entire cycle. Appropriate data is driven onto the bus during 
this time.
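
A minimal Python sketch of the two-clock read access described above. The class and attribute names are hypothetical, and the model only tracks what ends up on the address pins and in the external latch; it does not model half-cycle timing of ALE.

```python
class PPUBusModel:
    """Toy model of one 2-clock PPU read access (names are illustrative)."""

    def __init__(self, memory):
        self.memory = memory   # 16 KB PPU address space, index 0..0x3FFF
        self.upper = 0         # A8..A13, held on dedicated pins
        self.latch = 0         # external transparent latch holding A0..A7
        self.clocks = 0

    def read(self, address):
        # Clock 1: A8..A13 are updated, A0..A7 go out on the data bus
        # and are captured by the external latch while ALE is active.
        self.upper = (address >> 8) & 0x3F
        self.latch = address & 0xFF
        self.clocks += 1
        # Clock 2: /RD is active for the whole cycle and memory
        # drives the (now free) data bus.
        self.clocks += 1
        return self.memory[(self.upper << 8) | self.latch]
```

The point of the latch is visible here: the data bus carries the low address byte on the first clock and the data byte on the second.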


+----------------------+
|Miscellaneous PPU info|
+----------------------+
- Sprite DMA is 1536 clock cycles long (512 CPU cc's). 256 individual 
transfers are made from CPU memory to a temp register inside the CPU, then 
from the CPU's temp reg, to $2004.

- The PPU makes NO external access to the PPU bus, unless the playfield or 
objects are enabled during a scanline outside vblank. This means that the 
PPU's address and data busses are dead while in this state.

- Palette RAM is accessed internally during playfield rendering (i.e., the 
palette address/data is never put on the PPU bus during this time). 
Additionally, when the programmer accesses palette RAM via $2006/7, the 
palette address accessed actually does show up on the PPU address bus, but 
the PPU's /RD & /WR lines are not activated. This is required to prevent 
writing over name table data falling under the appropriate mirrored area 
(since the name table RAM's address decoder simply consists of an inverter 
connected to the A13 line- effectively decoding all addresses in 
$2000-$3FFF).

- The VINT impulse (NMI) and bit $2002.7 are set simultaneously. Reading 
$2002 will reset bit 7, but it seems that the VINT flag goes down on its 
own. Because of this, when the PPU generates a VINT, it doesn't require any 
acknowledgement whatsoever; it will continue firing off VINTs, regardless of 
whether $2002 is ever serviced. The only way to stop VINTs is to clear 
$2000.7.

- Because the PPU cannot make a read from PPU memory immediately upon 
request (via $2007), there is an internal buffer, which acts as a 1-stage 
data pipeline. As a read is requested, the contents of the read buffer are 
returned to the NES's CPU. After this, at the PPU's earliest convenience 
(according to PPU read cycle timings), the PPU will fetch the requested data 
from the PPU memory, and throw it in the read buffer. Writes to PPU mem via 
$2007 are pipelined as well, but it is unknown to me if the PPU uses this 
same buffer (this could be easily tested by writing something to $2007, and 
seeing if the same value is returned immediately after reading).
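
The $2007 read buffer behaviour in the last item can be sketched in Python as follows. The class name is hypothetical. Two things here are assumptions not stated in this document: the address is stepped by 1 after each access (the increment is really selectable), and palette addresses are treated like any other address.

```python
class ReadBuffer2007:
    """Toy model of the 1-stage $2007 read pipeline (illustrative only)."""

    def __init__(self, vram):
        self.vram = vram       # 16 KB PPU address space
        self.address = 0       # as set through $2006
        self.buffer = 0        # internal read buffer

    def read(self):
        # The CPU gets whatever was fetched by the previous read...
        value = self.buffer
        # ...and the PPU refills the buffer afterwards.
        self.buffer = self.vram[self.address & 0x3FFF]
        # Assumption: auto-increment of 1 after every access.
        self.address = (self.address + 1) & 0x3FFF
        return value
```

This is why code reading PPU memory normally performs one throwaway read after setting the address: the first value returned is stale.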


+-----------------------+
|Frame rendering details|
+-----------------------+
  The following describes the PPU's status during all 262 scanlines of a 
frame. Any scanlines where work is done (like image rendering) consist of 
the steps which will be described in the next section.

0..19:	Starting at the instant the VINT flag is pulled down (when an NMI is 
generated), 20 scanlines make up the period of time on the PPU which I like 
to call the VINT period. During this time, the PPU makes no access to its 
external memory (i.e. name / pattern tables, etc.).

20:	After 20 scanlines worth of time go by (since the VINT flag was set), 
the PPU starts to render scanlines. This first scanline is a dummy one; 
although it will access its external memory in the same sequence it would 
for drawing a valid scanline, no on-screen pixels are rendered during this 
time, making the fetched background data immaterial. Both horizontal *and* 
vertical scroll counters are updated (presumably) at cc offset 256 in this 
scanline. Other than that, the operation of this scanline is identical to 
any other. The primary reason this scanline exists is to start the object 
render pipeline, since it takes 256 cc's worth of time to determine which 
objects are in range or not for any particular scanline.

21..260: after rendering 1 dummy scanline, the PPU starts to render the 
actual data to be displayed on the screen. This is done for 240 scanlines, 
of course.

261:	after the very last rendered scanline finishes, the PPU does nothing 
for 1 scanline (i.e. the programmer gets screwed out of perfectly good VINT 
time). When this scanline finishes, the VINT flag is set, and the process of 
drawing lines starts all over again.
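
The frame layout above maps directly onto a small lookup, sketched here in Python. The function name and the phase labels are invented for this example; the scanline numbering is the one used in this section (0 = first scanline of the VINT period).

```python
def scanline_phase(scanline):
    """Classify a scanline number (0..261) using this document's numbering."""
    if not 0 <= scanline <= 261:
        raise ValueError("scanline out of range")
    if scanline <= 19:
        return "vint"      # no external memory access
    if scanline == 20:
        return "dummy"     # fetches as usual, renders nothing
    if scanline <= 260:
        return "render"    # the 240 visible scanlines
    return "idle"          # scanline 261, PPU does nothing
```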


+--------------------------+
|Scanline rendering details|
+--------------------------+
Naturally, the PPU will fetch data from name, attribute, and pattern tables 
during a scanline to produce an image on the screen. This section details 
the PPU's doings during this time.

As explained before, external PPU memory can be accessed every 2 cc's. With 
341 cc's per scanline, this gives the PPU enough time to make 170 memory 
accesses per scanline (and it uses all of them!). After the 170th fetch, the 
PPU does nothing for 1 clock cycle. Remember that a single pixel is rendered 
every clock cycle.


Memory fetch phase 1 thru 128
-----------------------------
1. Name table byte
2. Attribute table byte
3. Pattern table bitmap #0
4. Pattern table bitmap #1

This process is repeated 32 times (32 tiles in a scanline).


This is when the PPU retrieves the appropriate data from PPU memory for 
rendering the playfield. The first playfield tile fetched here is actually 
the 3rd to be drawn on the screen (the playfield data for the first 2 tiles 
to be rendered on this scanline are fetched at the end of the scanline prior 
to this one).

All valid on-screen pixel data arrives at the PPU's video out pin during 
this time (256 clocks). For determining the precise delay between when a 
tile's bitmap fetch phase starts (the whole 4 memory fetches), and when the 
first pixel of that tile's bitmap data hits the video out pin, the formula 
is (16-n) clock cycles, where n is the fine horizontal scroll offset (0..7 
pixels). This information is relevant for understanding the exact timing 
operation of the "object 0 collision" flag.

Note that the PPU fetches an attribute table byte for every 8 sequential 
horizontal pixels it draws. This essentially limits the PPU's color area 
(the area of pixels which are forced to use the same 3-color palette) to 
only 8 horizontally sequential pixels.

It is also during this time that the PPU evaluates the "Y coordinate" 
entries of all 64 objects in object attribute RAM (OAM), to see if the 
objects are within range (to be drawn on the screen) for the *next* scanline 
(this is why Y-coordinate entries in the OAM must be programmed to a value 1 
less than the scanline the object is to appear on). Each evaluation 
(presumably) takes 4 clock cycles, for a total of 256 (which is why it's 
done during on-screen pixel rendering).


In-range object evaluation
--------------------------
An 8-bit comparator is used to calculate the 9-bit difference between the 
current scanline (minus 21), and each Y-coordinate (plus 1) of every object 
entry in the OAM. Objects are considered in range if the comparator produces 
a difference in the range of 0..7 (if $2000.5 currently = 0), or 0..15 (if 
$2000.5 currently = 1).

(Note that a 9-bit comparison result is generated. This means that object 
scanline coordinates in the range -1..-15 are actually interpreted as 
241..255. For this reason, objects with these coordinates will never be 
considered to be part of any on-screen scanline range, which makes it 
impossible to scroll objects smoothly off the top of the screen.)

Tile index (8 bits), X-coordinate (8 bits), & attribute information (4 bits; 
vertical inversion is excluded) from the in-range OAM element, plus the 
associated 4-bit result of the range comparison accumulate in a part of the 
PPU called the "sprite temporary memory". Logical inversion is applied to 
the loaded 4-bit range comparison result, if the object's vertical inversion 
attribute bit is set.

The sprite temporary memory is large enough to hold the results of 8 found 
objects. Since object range evaluations occur sequentially through the OAM 
(starting from entry 0), the sprite temporary memory always fills in order 
from the highest priority in-range object, to lower ones. If more than 8 are 
found, the results are ignored, and the PPU raises a flag (bit 5 of $2002) 
indicating that it is going to be dropping objects for the next scanline.

An additional memory bit associated with the sprite temporary memory is used 
to indicate that the primary object (#0) was found to be in range. This will 
be used later on to detect primary object-to-playfield pixel collisions.
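
The evaluation described in this section can be sketched in Python as below. The function and field names are hypothetical. The OAM byte order (Y, tile, attributes, X) and the attribute bit positions (bit 7 vertical inversion, bit 6 horizontal inversion, bit 5 priority, bits 0-1 palette) are assumed from the usual NES conventions rather than taken from this document, and the range formula is applied literally as written above.

```python
def evaluate_objects(oam, scanline, tall=False):
    """Fill a model of the sprite temporary memory for one scanline.

    oam: 256 bytes; scanline: 21..260 in this document's numbering;
    tall: state of $2000.5 (8x16 objects when True).
    """
    height = 16 if tall else 8
    temp = []                 # sprite temporary memory, at most 8 entries
    overflow = False          # mirrors bit 5 of $2002
    primary_found = False     # object #0 in range
    for index in range(64):
        y, tile, attr, x = oam[index * 4:index * 4 + 4]
        diff = ((scanline - 21) - (y + 1)) & 0x1FF   # 9-bit result
        if diff >= height:
            continue
        if len(temp) == 8:
            overflow = True   # extra in-range objects are dropped
            continue
        row = diff & 0x0F
        if attr & 0x80:       # vertical inversion inverts the range result
            row ^= 0x0F
        if index == 0:
            primary_found = True
        temp.append({
            "tile": tile,
            "x": x,
            "row": row,
            "hflip": bool(attr & 0x40),
            "behind": bool(attr & 0x20),
            "palette": attr & 0x03,
        })
    return temp, overflow, primary_found
```

Note that a Y value of $FF gives a 9-bit difference of 256 or more for every rendered scanline, so such an object never lands in range, which is the usual way of hiding unused OAM entries.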


Playfield render pipeline details
---------------------------------
As pattern table & palette select data is fetched, it is loaded into 
internal latches (the palette select data is selected from the fetched byte 
via a 2-bit 1-of-4 selector).

At the start of a new tile fetch phase (every 8 cc's), both latched pattern 
table bitmaps are loaded into the upper 8-bits of 2- 16-bit shift registers 
(which both shift right every clock cycle). The palette select data is also 
transferred into another latch during this time (which feeds the serial 
inputs of 2 8-bit right shift registers shifted every clock). The pixel data 
is fed into these extra shift registers in order to implement fine 
horizontal scrolling, since the periods when the PPU fetches tile data are 
fixed.

A single bit from each shift register is selected, to form the valid 4-bit 
playfield pixel for the current clock cycle. The bit selection offset is 
based on the fine horizontal scroll value (this selects bit positions 0..7 
for all 4 shift registers). The selected 4-bit pixel data will then be fed 
into the multiplexer (described later) to be mixed with object data.
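
To make the fine scroll tap selection concrete, here is a Python sketch of a single 16-bit pattern shift register. The class and function names are invented. One detail is an assumption of this sketch rather than something stated above: since the register shifts right and the tap sits in bits 0..7, the fetched bitmap is bit-reversed on load so that the leftmost pixel (bit 7 of the pattern byte) leaves first.

```python
def reverse8(value):
    """Reverse the bit order of an 8-bit value."""
    return int(f"{value:08b}"[::-1], 2)


class PlayfieldShifter:
    """One 16-bit right-shifting pattern register (illustrative model)."""

    def __init__(self):
        self.reg = 0

    def load(self, bitmap):
        # New tile data enters the upper 8 bits; the lower 8 bits still
        # hold the pixels of the previous tile.
        self.reg = (self.reg & 0x00FF) | (reverse8(bitmap) << 8)

    def pixel(self, fine_x):
        # The fine horizontal scroll value picks the tap (bit 0..7).
        return (self.reg >> fine_x) & 1

    def clock(self):
        self.reg >>= 1


def run_playfield(tiles, fine_x, clocks):
    """Load one tile every 8 clocks and return the tapped bit per clock."""
    shifter = PlayfieldShifter()
    output = []
    for cc in range(clocks):
        if cc % 8 == 0 and cc // 8 < len(tiles):
            shifter.load(tiles[cc // 8])
        output.append(shifter.pixel(fine_x))
        shifter.clock()
    return output
```

In this model a tile's first pixel appears (8-n) clocks after it is loaded. If the load happens 8 clocks after that tile's fetch phase began, this agrees with the (16-n) delay quoted earlier.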


Memory fetch phase 129 thru 160
-------------------------------
1. Garbage name table byte
2. Garbage name table byte
3. Pattern table bitmap #0 for applicable object (for next scanline)
4. Pattern table bitmap #1 for applicable object (for next scanline)

This process is repeated 8 times.


This is the period of time when the PPU retrieves the appropriate pattern 
table data for the objects to be drawn on the *next* scanline. Even if no 
sprites exist on the next scanline, a dummy pattern table fetch takes place.

Although the fetched name table data is thrown away, and the name table 
address is somewhat unpredictable, the address does seem to relate to the 
first name table tile to be fetched for the next scanline. This would seem 
to imply that PPU cc #256 is when the PPU's scroll/address counters have 
their horizontal scroll values automatically updated.

It should also be noted that because this fetch is required for objects on 
the next scanline, it is necessary for a garbage scanline to exist prior to 
the very first scanline to be actually rendered, so that object attribute 
RAM entries can be evaluated, and the appropriate bitmap data retrieved.

As far as the wasted fetch phases here, well, what can I say. Either 
Nintendo's engineers were VERY lazy, and didn't want to add the small amount 
of extra circuitry to the PPU so that 16 object fetches could take place per 
scanline, or Nintendo couldn't spot the extra memory required to implement 
16 object scanlines. Thing is though- between the object attribute mem, 
sprite temporary & buffer mem, and palette mem, that's already 2406 bits of 
RAM; I don't think it would've killed them to just add the 400 bits it 
would've taken for an extra 8 objects, which would've made games with 
horrible OAM cycling (Double Dragon 2 w/ 2 players) look half-decent (hell, 
with 16 object scanlines, games would hardly even need OAM cycling).


Details of object pattern fetch & render
----------------------------------------
Where the PPU fetches pattern table data for an individual object is 
conditioned on the contents of the sprite temporary memory element, and 
$2000.5. If $2000.5 = 0, the tile index data is used as usual, and $2000.3 
selects the pattern table to use. If $2000.5 = 1, the MSB of the range 
result value becomes the LSB of the indexed tile, and the LSB of the tile 
index value determines pattern table selection. The lower 3 bits of the 
range result value are always used as the fine vertical offset into the 
selected pattern.
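
The address selection rules above can be written out as a short Python sketch. The function and parameter names are hypothetical. The overall pattern address layout (table select in bit 12, tile index in bits 4..11, bitmap plane in bit 3, row in bits 0..2) is the usual NES pattern table layout and is assumed here rather than taken from this document.

```python
def object_pattern_address(tile, range_result, plane, tall, table_select):
    """Return the pattern table address for one object bitmap fetch.

    tile: tile index from sprite temporary memory
    range_result: 4-bit range comparison result (after any inversion)
    plane: 0 for bitmap #0, 1 for bitmap #1
    tall: state of $2000.5; table_select: state of $2000.3
    """
    row = range_result & 0x07                 # fine vertical offset
    if tall:
        table = tile & 0x01                   # LSB of tile picks the table
        index = (tile & 0xFE) | ((range_result >> 3) & 0x01)
    else:
        table = table_select & 0x01
        index = tile
    return (table << 12) | (index << 4) | (plane << 3) | row
```

For example, with 16-line objects, tile $13 and a range result of 10 select the lower half of the pair (tile $13 in table 1) at row 2.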

Horizontal inversion (bit order swapping) is applied to fetched bitmaps, if 
indicated in the sprite temporary memory element.

The fetched pattern table data (which is 2 bytes), plus the associated 3 
attribute bits (palette select & priority), and the x coordinate byte in 
sprite temporary memory are then loaded into a part of the PPU called the 
"sprite buffer memory". This memory area again, is large enough to hold the 
contents for 8 sprites. (The primary object present bit is also copied.)

The composition of one sprite buffer element here is: 2 8-bit shift 
registers (the fetched pattern table data is loaded in here, where it will 
be serialized at the appropriate time), a 3-bit latch (which holds the color 
& priority data for an object), and an 8-bit down counter (this is where the 
x coordinate is loaded).

The counter is decremented every time the PPU renders a pixel (first 256 
cc's of a scanline; see "Memory fetch phase 1 thru 128" above). When the 
counter reaches 0, the pattern table data in the shift registers will start 
to serialize (1 shift per clock). Before this time, or 8 clocks after, 
consider the outputs of the serializers for each stage to be 0 
(transparency).
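
A single sprite buffer element, as described above, might be modelled like this in Python. The class name is made up, and the shift direction (most significant bit out first, with any horizontal inversion already applied to the bitmaps before loading) is an assumption of this sketch.

```python
class SpriteBufferElement:
    """Toy model of one of the 8 sprite buffer entries."""

    def __init__(self, bitmap0, bitmap1, attr, x):
        self.shift0 = bitmap0      # 8-bit shift register, bitmap #0
        self.shift1 = bitmap1      # 8-bit shift register, bitmap #1
        self.attr = attr           # 3-bit latch: palette select & priority
        self.counter = x           # 8-bit down counter
        self.remaining = 8         # pixels left to serialize

    def clock(self):
        """Advance one pixel clock and return this element's 2-bit pixel."""
        if self.counter > 0:
            self.counter -= 1
            return 0               # not reached yet: transparent
        if self.remaining == 0:
            return 0               # already finished: transparent
        pixel = (((self.shift1 >> 7) & 1) << 1) | ((self.shift0 >> 7) & 1)
        self.shift0 = (self.shift0 << 1) & 0xFF
        self.shift1 = (self.shift1 << 1) & 0xFF
        self.remaining -= 1
        return pixel
```

With x = 2, the element outputs transparency for two clocks and then its 8 pixels, so the object's first pixel lands in screen column 2.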

The streams of all 8 object serializers are prioritized, and ultimately only 
one stream (with palette select & priority information) is selected for 
output to the multiplexer (where object & playfield pixels are prioritized).

The data for the first sprite buffer entry (including the primary object 
present flag) has the first chance to enter the multiplexer, if its output 
pixel is non-transparent (non-zero). Otherwise, priority is passed to the 
next serializer in the sprite buffer memory, and the test for 
non-transparency is made again (the primary object present status will 
always be passed to the multiplexer as false in this case). This is done 
until the last (8th) stage is reached, when the object data is passed 
through unconditionally. Keep in mind that this whole process occurs every 
clock cycle (hardware is used to determine priority instantly).
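
The priority chain just described, sketched as a Python function. The name and the tuple layout are assumptions for illustration; in hardware this is combinational logic evaluated every clock, not a loop.

```python
def select_object_pixel(stages, primary_present):
    """Pick the object pixel that reaches the multiplexer this clock.

    stages: (pixel, palette, foreground) tuples for the sprite buffer
    entries, first entry first. primary_present: whether object #0 was
    found in range for this scanline.
    """
    last = len(stages) - 1
    for i, (pixel, palette, foreground) in enumerate(stages):
        # A non-transparent pixel wins; the last stage passes through
        # unconditionally, transparent or not.
        if pixel != 0 or i == last:
            # Only the first entry can carry the primary object status.
            return pixel, palette, foreground, (primary_present and i == 0)
```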

The multiplexer does 2 things: determines primary object collisions, and 
decides which pixel data to pass through to index the palette RAM- either 
the playfield's or the object's.

Primary object collisions occur when a non-transparent playfield pixel 
coincides with a non-transparent object pixel, while the primary object 
present status entering the multiplexer for the current clock cycle is true. 
This causes a flip-flop ($2002.6) to be set, which remains set (presumably) 
until some time after the VINT occurrence (perhaps up until scanline 21?).

The decision for selecting the data to pass through to the palette index is 
made rather easily. The condition to use object (as opposed to playfield) 
data is:

(OBJpri=foreground OR PFpixel=xparent) AND OBJpixel<>xparent

Since the PPU has 2 palettes, one for objects and one for playfield, the 
appropriate palette will be selected depending on which pixel data is passed 
through.
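
The two jobs of the multiplexer translate almost directly into code. This Python sketch uses invented names and treats a pixel value of 0 as transparent.

```python
def multiplex(pf_pixel, obj_pixel, obj_foreground, primary_present):
    """Return (use_object, collision) for one clock cycle.

    pf_pixel / obj_pixel: 2-bit color values, 0 meaning transparent.
    obj_foreground: True if the selected object has foreground priority.
    primary_present: primary object status entering the multiplexer.
    """
    pf_opaque = pf_pixel != 0
    obj_opaque = obj_pixel != 0
    # (OBJpri=foreground OR PFpixel=xparent) AND OBJpixel<>xparent
    use_object = (obj_foreground or not pf_opaque) and obj_opaque
    # Collision does not depend on which pixel ends up visible.
    collision = pf_opaque and obj_opaque and primary_present
    return use_object, collision
```

Note that a background priority object #0 hidden behind the playfield still sets the collision flag, since the test ignores the priority bit.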

After the palette look-up, the operation of events follows the 
aforementioned steps in the "video signal generation" section.


Memory fetch phase 161 thru 168
-------------------------------
1. Name table byte
2. Attribute table byte
3. Pattern table bitmap #0 (for next scanline)
4. Pattern table bitmap #1 (for next scanline)

This process is repeated 2 times.


It is during this time that the PPU fetches the applicable playfield data 
for the first and second tiles to be rendered on the screen for the *next* 
scanline. These fetches initialize the internal playfield pixel pipelines 
(2- 16-bit shift registers) with valid bitmap data. The rest of the tiles 
(3..32) are fetched at the beginning of the following scanline.


Memory fetch phase 169 thru 170
-------------------------------
1. Name table byte
2. Name table byte


I'm unclear of the reason why this particular access to memory is made. The 
name table address that is accessed 2 times in a row here, is also the same 
nametable address that points to the 3rd tile to be rendered on the screen 
(or basically, the first name table address that will be accessed when the 
PPU is fetching playfield data on the next scanline).


After memory access 170
-----------------------
The PPU simply rests for 1 cycle here (or the equivalent of half a memory 
access cycle) before repeating the whole pixel/scanline rendering process.
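
Putting the fetch phases of this section together, the following Python sketch builds the full list of 170 memory accesses for one scanline and checks that it accounts for all 341 clock cycles. The function name and the label strings are invented for this example.

```python
def scanline_fetch_schedule():
    """Return the 170 memory fetches of one rendered scanline, in order."""
    schedule = []
    # Fetches 1..128: playfield tiles 3..34 of the current scanline.
    for _ in range(32):
        schedule += ["name table", "attribute table",
                     "pattern #0", "pattern #1"]
    # Fetches 129..160: object patterns for the next scanline.
    for _ in range(8):
        schedule += ["garbage name table", "garbage name table",
                     "object pattern #0", "object pattern #1"]
    # Fetches 161..168: first two playfield tiles of the next scanline.
    for _ in range(2):
        schedule += ["name table", "attribute table",
                     "pattern #0", "pattern #1"]
    # Fetches 169..170: the unexplained pair of name table reads.
    schedule += ["name table", "name table"]
    return schedule


FETCHES = scanline_fetch_schedule()
# Each access takes 2 clocks, plus the single idle clock at the end.
CLOCKS_PER_SCANLINE = len(FETCHES) * 2 + 1
```

An emulator can index this list with (cc // 2) to find out which kind of access is in flight at any point in the scanline.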


+------------------+
|Extra cycle frames|
+------------------+
Scanline 20 is the only scanline that has variable length. On every odd 
frame, this scanline is only 340 cycles (the dead cycle at the end is 
removed). This is done to cause a shift in the NTSC colorburst phase.

You see, a 3.58 MHz signal, the NTSC colorburst, is required to be modulated 
into a luminance carrying signal in order for color to be generated on an 
NTSC monitor. Since the PPU's video out consists of basically square waves 
(as opposed to sine waves, which would be preferred), it takes an entire 
colorburst cycle (1/3.58 MHz) for an NTSC monitor to identify the color of a 
PPU pixel accurately.

But now you remember that the PPU renders pixels at 5.37 MHz- 1.5x the rate 
of the colorburst. This means that if a single pixel resides on a scanline 
with a color different to those surrounding it, the pixel will probably be 
misrepresented on the screen, sometimes appearing faintly.

Well, to somewhat fix this problem, they dropped this one cycle from every 
odd frame (shifting the colorburst phase over a bit), changing the way 
the monitor interprets isolated colored pixels each frame. This is why when 
you play games with detailed background graphics, the background seems to 
flicker a bit. Once you start scrolling the screen however, it seems as if 
some pixels become invisible; this is how stationary PPU images would look 
without this cycle removed from odd frames.

Certain scroll rates expose this NTSC PPU color caveat regardless of the 
toggling phase shift. Zelda 2's dungeon backgrounds are a good place to see 
this effect.
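
The effect of the shortened scanline on the colorburst phase can be worked out with exact fractions. This Python sketch assumes, from the figures earlier in this document, that one PPU clock is master/4 and one colorburst cycle is master/6, so each PPU clock spans 2/3 of a colorburst cycle. The function names are made up, and frame 0 is taken to be an even (full length) frame.

```python
from fractions import Fraction

COLORBURST_CYCLES_PER_PPU_CC = Fraction(4, 6)   # (master/6) vs (master/4)


def frame_length(frame_number, short_odd_frames=True):
    """PPU clocks in a frame; odd frames lose one cycle from scanline 20."""
    length = 341 * 262
    if short_odd_frames and frame_number % 2 == 1:
        length -= 1
    return length


def phase_at_frame_start(frames, short_odd_frames=True):
    """Colorburst phase (as a fraction of a cycle) at the start of each frame."""
    phase = Fraction(0)
    phases = []
    for n in range(frames):
        phases.append(phase)
        cycles = frame_length(n, short_odd_frames) * COLORBURST_CYCLES_PER_PPU_CC
        phase = (phase + cycles) % 1
    return phases
```

Under these assumptions, the phase at the start of each frame alternates between two values when odd frames are shortened, and would otherwise step through three values, repeating every third frame.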


+---------------------------+
|The MMC3's scanline counter|
+---------------------------+
As most people know, the MMC3 bases its scanline counter on PPU address 
line A13 (which is why IRQ's can be fired off manually by toggling A13 a 
bunch of times via $2006). What's not common knowledge is the number of 
times A13 is expected to toggle in a scanline (although if you've been 
paying close attention to the doc here, you should already know ;)

A13 was probably used for the IRQ counter (as opposed to using the PPU's 
/READ line) because this address line already needed to be connected to the 
MMC for bankswitching purposes (so in other words, to reduce the MMC3's pin 
count by 1). They also probably used this method of counting (as opposed to 
a CPU cycle counter) since A13 cycles (0 -> 1) exactly 42 times per 
scanline, whereas the CPU count of cycles per scanline is not an exact 
integer (113.67). Having said that, I guess Nintendo wanted to provide an 
"easy-to-use" method of generating special image effects, without making 
programmers have to figure out how many clock cycles to program an IRQ 
counter with (a pretty lame excuse for not providing an IRQ counter with CPU 
clock cycle precision (which would have been more useful and versatile)).

Regardless of any values PPU registers are programmed with, A13 will operate 
in a predictable fashion during image rendering (and if you understand how 
PPU addressing works, you should understand that A13 is the *only* address 
line with fixed behaviour during image rendering).
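
The figure of 42 rises per scanline can be checked against the fetch sequence from the scanline rendering section. This Python sketch only counts A13 transitions in that sequence; it does not model the MMC3 itself. It relies on name and attribute table fetches being in $2000-$3FFF (A13 = 1) and pattern table fetches being in $0000-$1FFF (A13 = 0). The function names are hypothetical.

```python
def a13_levels_for_scanline():
    """Return the A13 level for each of the 170 fetches of a scanline."""
    levels = []
    # 32 playfield tiles + 8 object slots + 2 prefetched tiles: each group
    # is two name/attribute table reads followed by two pattern reads.
    for _ in range(32 + 8 + 2):
        levels += [1, 1, 0, 0]
    # Final pair of name table reads.
    levels += [1, 1]
    return levels


def count_rises(levels):
    """Count 0 -> 1 transitions, treating the sequence as repeating."""
    rises = 0
    previous = levels[-1]          # last fetch of the preceding scanline
    for level in levels:
        if previous == 0 and level == 1:
            rises += 1
        previous = level
    return rises


A13_RISES_PER_SCANLINE = count_rises(a13_levels_for_scanline())
```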


+------------------------+
|PPU pixel priority quirk|
+------------------------+
Object data is prioritized between itself, then prioritized between the 
playfield. There are some odd side effects to this scheme of rendering, 
however. For instance, imagine a low priority object pixel with foreground 
priority, a high priority object pixel with background priority, and a 
non-transparent playfield pixel all coinciding.

Ideally, the playfield is considered to be the middle layer between 
background and foreground priority objects. This means that the playfield 
pixel should hide the background priority object pixel (regardless of object 
priority), and the foreground priority object should appear atop the PF 
pixel.

However, because of the way the PPU renders (as just described), OBJ 
priority is evaluated first, and therefore the background object pixel wins, 
which means that you'll only be seeing the PF pixel after this mess.

A good game to demonstrate this behaviour is Mega Man 2. Go into Air Man's 
stage. First, jump into the energy bar, just to confirm that Mega Man's 
sprite is of a higher priority than the energy bar's. Now, get to the second 
half of the stage, where the clouds cover the energy bar. The energy bar 
will be on top of the clouds, but Mega Man will be behind them. Now, look 
what happens when you jump into the energy bar here... you see the clouds 
where Mega Man _should_ be overlapping the energy bar.
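
The quirk falls out of doing the two priority steps in sequence, which this small Python sketch tries to show. The function name and the returned labels are invented for the example.

```python
def visible_layer(objects, pf_opaque):
    """Return which layer ends up on screen for one pixel.

    objects: (opaque, foreground) pairs, highest object priority first.
    pf_opaque: True if the playfield pixel is non-transparent.
    """
    for opaque, foreground in objects:
        if opaque:
            # Step 1: the first opaque object wins among objects,
            # whatever its foreground/background setting.
            # Step 2: only that winner is compared with the playfield.
            if foreground or not pf_opaque:
                return "object"
            return "playfield"
    return "playfield"


# High priority object with background priority, low priority object with
# foreground priority, over an opaque playfield pixel:
quirk_result = visible_layer([(True, False), (True, True)], pf_opaque=True)
```

In the quirk case the playfield is shown, even though the lower priority object asked for foreground priority and would be visible if the first object were not there.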


+----------------------+
|Graphical enhancements|
+----------------------+
Since an NES cartridge has access to the PPU bus, any number of on-cart 
hardware schemes can be used to enhance the graphic capabilities of the NES. 
After all, the PPU's playfield pipeline is very simple: it fetches 272 
playfield pixels per scanline (as 34*2 byte fetches, in real-time), and 
outputs 256 of them to the screen (with the 0..7 pixel offset determined by 
the fine X scroll register), along with object data combined with it.

Essentially, you can bypass the PPU's simple scrolling system, implement a 
custom one on your cart (fetching bitmap data in your own fashion), and feed 
the PPU bitmap data in your own order.

The possibilities of this are endless (like sporting multiple playfields, or 
even playfield rotation/scaling), but of course what it comes down to is the 
amount of cartridge hardware required.

Generally, playfield rotation/scaling can be done quite easily- it only 
requires a few sets of 16-bit registers and adders (the 16 bits are broken 
up into 8.8 fixed point values). But this kind of implementation is more 
suited for an integrated circuit, since this would require dozens of 
discrete logic chips.

Multiple playfields are another thing which could be easily done. The caveat 
here is that pixel pipelines (i.e., shift registers) and a multiplexer would 
have to be implemented on the cart (not to mention exclusive name table RAM) 
in order to process the playfield bitmaps from multiple sources. The access 
to the CHR-ROM/RAM would also have to be increased- but as it stands, the 
CHR-ROM/RAM bandwidth is 1.34 MHz, a rather low frequency. With a memory 
device capable of a 10.74 MHz bandwidth, you could have 8 playfields to work 
with. Generally, this would be very useful for displaying multiple huge 
objects on the screen- without ever having to worry about annoying flicker.
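
The bandwidth figures above follow from the base timing. This Python sketch assumes the 1.34 MHz figure means 2 pattern bytes per 8-clock tile fetch phase, i.e. one pattern byte per 4 PPU clocks; that reading, and the names used, are assumptions of this example.

```python
PPU_CLOCK_MHZ = 21.48 / 4                  # 5.37 MHz
PATTERN_RATE_MHZ = PPU_CLOCK_MHZ / 4       # one pattern byte per 4 clocks


def required_bandwidth_mhz(playfields):
    """Pattern memory bandwidth needed to feed a number of playfields."""
    return playfields * PATTERN_RATE_MHZ
```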

The only restrictions to doing any of this are that:

- every 8 sequential horizontal pixels sent to the PPU must share the same 
palette select value. Because of this, hardware would have to be implemented 
to decide which palette select value to feed the PPU between 8 horizontally 
sequential pixels, if they do not all share the same palette select value. 
The on-screen results of this may not be too flattering sometimes, but this 
is a small price to pay to do some neat graphical tricks on the NES.

- only the playfield palette can be used. As usual, this pretty much limits 
your randomly accessible colors to about 12+1.

It's a damn shame that Nintendo never created a MMC which would enhance 
graphics on the NES in useful ways as mentioned above. The MMC5 was the only 
device that came close, and its only selling features were the single-tile 
color area, and the vertical split screen mode (which I don't think any game 
ever used). Considering the amount of pins (100) the MMC5 had, and number of 
gates they put in it just for the EXRAM (which was 1K bytes), they could've 
put some really useful graphics hardware inside there instead.

Perhaps the infamous Color Dreams "Hellraiser" cart was the closest the NES 
ever came to seeing such sophisticated graphics. The cart was never 
released, but from what I've read, it was going to use some sort of frame 
buffer, and a Z80 CPU to do the graphical rendering. It had been rumored 
that the game had 3D graphics (or at least 2.5D) in it. If so (and the game 
was actually good), perhaps it would have raised a few eyebrows in the 
industry, and inspired Nintendo to develop a new MMC chip with similar 
capabilities, in order to keep the NES in its profit margin for another few 
years (and allow it to compete somewhat with the more advanced systems of 
the time).

EOF

