A new Atari STE bad DMA investigation


Written by Christian Zietz with support from the ThunderStorm team – Version as of 2023-04-22

The so-called bad DMA phenomenon in the Atari STE has been talked about for many years – even going back to when the Atari STE was released. This phenomenon is known to cause data corruption while writing to ACSI hard disks. However, it seems to have gained more attention in recent years, perhaps as modern hard disk replacements (Gigafile, UltraSatan, …) became more prevalent.

It was always known that early STEs came with the DMA IC C025913-38 (same as in the STf), whereas later STEs were fitted with the updated DMA IC C398739-001. However, without conclusive evidence, people have blamed aging capacitors and bus noise, among other things, for the problems that plagued STEs with the older type of DMA IC. Making minor changes to the system, e.g., replacing the CPU or the power supply, sometimes fixes the problems. This points towards a borderline situation, where a few nanoseconds or millivolts matter. But – given the conflicting findings – a new investigation is worthwhile.

The symptoms

Finally in possession of an Atari STE that reliably exhibits the issue, even when using an MC68HC000 CPU that supposedly “fixes” the problem, I started to investigate. First, I needed to understand the exact type of data corruption. To do so, and to have a concise test case for further investigation, I developed a small program that writes consecutive bytes (0x00, 0x01, 0x02, …, 0xFF) to a sector of the hard disk – or rather: my Gigafile hard disk replacement. Fortunately, the result is very reproducible: the same pattern always ends up on the Gigafile’s SD card. The first 32 bytes (0x00 – 0x1F) are written correctly. However, the next 12 bytes (0x20 – 0x2B) are missing! The number 12 will become important again later.
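The original test program is not reproduced here. As a rough sketch of the idea, the following C functions fill a sector buffer with the repeating byte pattern and locate the first deviation in data read back from the disk. The names `fill_pattern` and `find_first_gap` and the 512-byte sector size are assumptions made for illustration, and the actual disk access is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512  /* assumed sector size */

/* Fill a buffer with the repeating pattern 0x00, 0x01, ..., 0xFF. */
static void fill_pattern(uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(i & 0xFF);
}

/*
 * Compare read-back data with the expected pattern.
 * Returns the offset of the first mismatch, or len if everything matches.
 * On a mismatch, *missing receives (actual - expected) modulo 256, i.e. the
 * number of pattern bytes that appear to have been skipped at that offset.
 */
static size_t find_first_gap(const uint8_t *readback, size_t len,
                             unsigned *missing)
{
    assert(readback != NULL && missing != NULL);
    for (size_t i = 0; i < len; i++) {
        uint8_t expected = (uint8_t)(i & 0xFF);
        if (readback[i] != expected) {
            *missing = (uint8_t)(readback[i] - expected);
            return i;
        }
    }
    *missing = 0;
    return len;
}
```

For a read-back sector in which offset 32 holds 0x2C instead of 0x20, `find_first_gap` returns 32 and reports 12 missing bytes, which corresponds to the corruption described above.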

This explains observations by others that in affected systems, it is still possible to create one directory on a previously empty disk, but not two directories¹. Each directory entry takes up 32 bytes; hence, the first one is not corrupted. But any serious use of the disk would immediately cause data corruption. Imagine 12 bytes missing from the FAT (file allocation table) when it is written back to disk.

Enter the GSTMCU

The GSTMCU – the chip that essentially combines GLUE and MCU of the earlier ST models – is also very much involved in DMA transfers. The DMA IC cannot read from or write to RAM by itself. The GSTMCU handles the task of accessing RAM. For that, some of the DMA registers are actually inside the GSTMCU² (for example, the address counter 0xFF8609 – 0xFF860D).
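To illustrate how the address counter is spread across byte-wide registers, here is a small sketch in C. The assignment of the high, middle and low byte to 0xFF8609, 0xFF860B and 0xFF860D follows common ST hardware documentation rather than this article, and the type and function names are hypothetical. The code only splits and joins the value; it does not touch any hardware.

```c
#include <assert.h>
#include <stdint.h>

/* The three bytes of the 24-bit DMA address counter (assumed layout). */
typedef struct {
    uint8_t high;  /* 0xFF8609: address bits 23..16 */
    uint8_t mid;   /* 0xFF860B: address bits 15..8  */
    uint8_t low;   /* 0xFF860D: address bits 7..0   */
} dma_addr_bytes;

/* Split a 24-bit RAM address into the three register bytes. */
static dma_addr_bytes split_dma_address(uint32_t addr)
{
    dma_addr_bytes b;
    assert(addr <= 0xFFFFFFu);  /* the counter holds at most 24 bits */
    b.high = (uint8_t)(addr >> 16);
    b.mid  = (uint8_t)(addr >> 8);
    b.low  = (uint8_t)addr;
    return b;
}

/* Reassemble the 24-bit address from the three register bytes. */
static uint32_t join_dma_address(dma_addr_bytes b)
{
    return ((uint32_t)b.high << 16) | ((uint32_t)b.mid << 8) | b.low;
}
```

Reading the three registers after a transfer and joining them this way is one means of checking how far the counter in the GSTMCU has advanced.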

The GSTMCU and the DMA IC use the RDY signal to communicate. This signal serves two distinct purposes: acknowledging accesses to the DMA IC registers, and handshaking during data transfers from RAM.

Strange glitches on the way to the root cause

What can cause 12 bytes to go missing? As mentioned in the previous section, the address counter is kept in the GSTMCU, while the DMA IC writes the bytes to the hard disk. What if both get out of sync?

Looking at the schematics of the GSTMCU², the address counter only advances if the internal IXDMA signal is set. Fortunately, the same signal that drives IXDMA also asserts the BGACK (bus grant acknowledge) signal from the GSTMCU. BGACK can be measured easily.

The following logic analyzer trace is a sector write to an ACSI disk on an unaffected system as a reference. One can see BGACK active early, when the 32-byte FIFO inside the DMA IC is pre-filled, and then repeatedly when the FIFO needs to be refilled during the data transfer to the disk. One can also see the dual function of the RDY signal: acknowledging the register accesses (discernible by FCS going low) and handshaking during data transfers from RAM (when BGACK becomes low).

Compare this to a similar sector write on a system affected by the bad DMA phenomenon:

One immediately notices six additional BGACK pulses or “glitches” that coincide with DMA IC register accesses (FCS going low). They are not supposed to happen!

This measurement explains the observation above: The first 32 bytes from the FIFO pre-fill are written to the disk correctly. After that, the address counter inside the GSTMCU erroneously increments by 6 × 16 bits = 12 bytes. But these bytes are never transferred into the DMA FIFO and, therefore, never end up on the hard disk.

Zooming in

Not every DMA IC register access in the picture above is followed by a “glitch” on BGACK. Zooming in reveals the reason: Only long word (32-bit) accesses trigger the problem³. As the 68000 data bus is only 16 bits wide, a long word access causes two transfers in very short succession, as can be seen by the two pulses on FCS.

Atari strongly recommended sending the command bytes using long word accesses to the DMA IC⁴, a recommendation that probably every hard disk driver follows.

The six command bytes that make up an ACSI command trigger the six “glitches” and cause the 12 bytes to get skipped.
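The mechanism can be put into a toy model. The following C sketch is a deliberate simplification and not a description of the actual hardware: it treats the transfer as a byte stream, assumes that all glitches occur right after the 32-byte FIFO pre-fill, and assumes that each glitch advances the address counter by one 16-bit word without moving data. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_PREFILL_BYTES 32  /* FIFO size inside the DMA IC */
#define BYTES_PER_GLITCH    2  /* one spurious 16-bit step per glitch */

/*
 * Toy model of a sector write. 'src' is the RAM buffer, 'disk' receives
 * what would end up on the disk. Returns the number of bytes stored in
 * 'disk'. The model stops at the end of 'src'; what real hardware would
 * fetch beyond the buffer is not modelled.
 */
static size_t simulate_sector_write(const uint8_t *src, size_t src_len,
                                    uint8_t *disk, size_t disk_len,
                                    unsigned glitches)
{
    size_t addr = 0;  /* models the address counter in the GSTMCU */
    size_t out = 0;   /* models the byte stream leaving the DMA IC */

    assert(src != NULL && disk != NULL);

    /* FIFO pre-fill: fetched correctly before any glitch occurs. */
    while (addr < FIFO_PREFILL_BYTES && addr < src_len && out < disk_len)
        disk[out++] = src[addr++];

    /* Each glitch advances the counter, but no data enters the FIFO. */
    addr += (size_t)glitches * BYTES_PER_GLITCH;

    /* Later FIFO refills continue from the now incorrect address. */
    while (addr < src_len && out < disk_len)
        disk[out++] = src[addr++];

    return out;
}
```

With the 0x00 … 0xFF test pattern as input and six glitches, this model stores 0x00 through 0x1F and then continues with 0x2C, matching the corruption seen on the SD card. With zero glitches, the output equals the input.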

Zooming in further and showing the system-wide 8 MHz clock reveals another essential detail: The RDY signal is deasserted at almost the same moment as the clock rises. The following section investigates the relevance of this observation.

A borderline situation

Why is the relation between RDY and the rising clock edge relevant? This becomes apparent by looking inside the GSTMCU²: The RDY signal (called READY here) passes through some logic gates and is then sampled by a flip-flop. This flip-flop is clocked by the 8 MHz clock (called ICLK8), which is, in turn, purposefully delayed by two inverters. This design hinges on the propagation delays inside the GSTMCU⁵.

Knowing that, I found that applying freeze spray to either the DMA IC or the GSTMCU cured the issue for a short while. Hence, in both situations, I captured the exact timing, i.e., the delay between RDY and the rising edge of the clock. The delay is slightly larger in the working case (freeze spray to the DMA IC) than in the non-working case⁶.

This observation solidifies the theory that the bad DMA problem is a matter of borderline timing. It explains why some people find that unrelated changes to the system – e.g., replacing the CPU or the power supply or changing pull-up resistors – alleviate the problem, as all of these changes can potentially shift timings by a few nanoseconds. It also explains why not every STE with a potentially bad C025913 DMA IC is affected: such minute differences can also be caused by aging or process variation during IC manufacturing.

The good DMA IC

Having understood the actual root cause of the bad DMA problem, the question remains: What did Atari change in the C398739 DMA IC to reliably solve the issue? They must have modified something concerning the RDY signal.

A measurement on such a DMA IC confirms this assumption. The RDY signal is deasserted much earlier.

The difference becomes even more conspicuous when comparing a C025913/old DMA IC (top half) and a C398739/new DMA IC (bottom half).

Therefore, Atari did not just make the newer C398739 DMA IC less prone to noise, as is sometimes claimed. Instead, it was modified to significantly change the behavior of the RDY signal to solve the problem. This indicates that Atari’s engineers knew the root cause.

A possible software fix

As described above, only a long word access to the DMA registers at 0xFF8604 and 0xFF8606 triggers the glitch that is the root cause of the bad DMA phenomenon. What about changing hard disk drivers to use two separate word accesses, thereby forcing a longer pause between the two register writes? This workaround was tested with a modified EmuTOS version⁷. On multiple affected STEs, it fixed any issues associated with the bad DMA phenomenon.
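The difference between the two access styles can be sketched in C as follows. This is not the EmuTOS code; the structure and function names are hypothetical, and the register block is passed in as a pointer so that the logic can also be exercised against a plain variable. On real hardware, the pointer would refer to 0xFF8604. Whether the gap between the two word writes is long enough depends on the code the compiler generates, which is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two 16-bit DMA registers in address order (hypothetical layout). */
typedef struct {
    volatile uint16_t data;     /* 0xFF8604 on real hardware */
    volatile uint16_t control;  /* 0xFF8606 on real hardware */
} dma_regs;

/*
 * Workaround: send a command byte and the next mode value using two
 * separate word writes. 'volatile' keeps the compiler from merging the
 * two stores into a single long word access.
 */
static void dma_send_byte_words(dma_regs *regs, uint8_t data, uint16_t control)
{
    assert(regs != NULL);
    regs->data = data;        /* first word access  */
    regs->control = control;  /* second word access */
}

/*
 * For comparison: the value a single long word access would write, with
 * the data in the upper word and the next mode value in the lower word.
 */
static uint32_t dma_long_word(uint8_t data, uint16_t control)
{
    return ((uint32_t)data << 16) | control;
}
```

On the big-endian 68000, writing the result of `dma_long_word` to 0xFF8604 in one `move.l` places the same two values in the same two registers, only in much quicker succession.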

This workaround – as also described above – contradicts Atari’s guidance on accessing these registers, though. However, the source code to Atari’s own hard disk driver AHDI contains an important bit of information:

; 23-Jul-1985 jwt       use a move.l instruction for all wdc/wdl write  :
;                        pairs since it changes A1 quickly enough that  :
;                        the (old) DMA chip does not incorrectly        :
;                        generate two chip selects                      :

From this comment, one can hypothesize that only the very first version of the DMA chip C025913-20 (“old” even by July 1985) needs the long word (move.l) access. But this version was never used in the STE, in contrast to the later C025913-38. Under this assumption – and pending further validation – the proposed software fix is safe to use.

Conclusion

This investigation has found the root cause of the bad DMA phenomenon – hopefully ending years of speculation and guesswork.


1 For example: “Then I formatted and swapped back over to the -38 and created foldernames 11111111.111 and 22222222.222. Again the first write worked, but the second filename vanished.” https://web.archive.org/web/20220728230647/https://www.exxosforum.co.uk/atari/last/DMAfix/index.htm

2 I recovered the schematics of the GSTMCU some years ago, so you can see how this is implemented: https://www.chzsoft.de/asic-web/.

3 This also explains why the following test program – in theory, written especially to detect the bad DMA problem – was not successful in triggering the problem. It only uses word (16-bit) accesses, even for sending the ACSI command bytes. https://web.archive.org/web/20200126200632/http://exxosnews.blogspot.com:80/2017/06/dma-hard-drive-test-program.html

4 “DMA CHIP ANOMALY […] while writing to the DMA Data Port it is necessary to write the DMA Mode Control operation for the next operation. This can be done by writing a long word to the Data Port with the data in the upper word and the next operation in the lower word.” Atari ACSI/DMA Integration Guide. https://archive.org/details/ACSI_DMA_Guide_6-28-1991

5 In the STf, the corresponding circuitry is divided between the GLUE and MCU ICs, with different propagation delays and apparently not prone to the bad DMA problem.

6 Contrary to what it might look like on the screenshot, the clock itself is, of course, not affected by the freeze spray. RDY is deasserted earlier or later in relation to the clock.

7 EmuTOS makes this test easy, as it is not only open-source, but also encapsulates this access in a single function: https://github.com/emutos/emutos/blob/770d3675cc3c63412b1d69e84b878eb76e31b545/bios/acsi.c#L429-L436