Original image from Scratch Wiki

After proving that my computer board actually works, it’s time to switch over to software design. This is my core competency, where I’ve spent years professionally, and something I’ve done as a hobby since before I even had a computer. You’d think this part would be easy…

On the Virtual TRS-20, the boot ROM starts up, accepts a Y-modem upload, and then jumps to it, happy as Larry. On the real TRS-20, the Y-modem upload fails on the first packet, with nothing more than metadata sent.

The emulator is reasonably faithful to the behaviour of the TRS-20 as I discover its nuances. Memory reads now carry an /M1 flag to allow the ROM masking to more accurately replicate the board, though I didn’t go as far as emulating the timing issue. It’s useful when debugging the boot ROM code as I can stop it, inspect memory and registers, single-step it with ease, and in theory get code working before it hits the real device where debugging is much harder.

For the Y-modem upload, that theory didn’t hold: the code worked in emulation and failed on the real board.

On with the debug

During the serial debug described in the last post I found one possibly significant difference between what I expected to happen and what really happens. The /RTS0 signal is not asserted automatically by the Z180 when its own ASCI0 receive buffer is full. The signal is entirely up to software - so I cannot rely on it to prevent overruns when uploading bytes. It may be that I’m dropping bytes by blasting Y-modem packets at the CPU as fast as the UART can handle them.

The heart of my Y-modem code is the recv_packet function. The general idea is that it waits up to five seconds for a packet command byte; if none arrives, it (re)transmits the ACK code on the assumption that the sender didn’t receive it, and waits again. If the command is a data packet, it receives the packet index and then the data, with a timeout of one second for each byte. On completion it checks that the CRC matches.
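To make the packet layout concrete, here is a minimal host-side sketch in Python of what a sender builds and what a receiver has to check. It is not the boot ROM code: the function names are hypothetical, and it assumes the usual Y-modem conventions (SOH = 0x01 for 128-byte packets, STX = 0x02 for 1024-byte packets, a sequence byte followed by its complement, and a big-endian CRC-16 with polynomial 0x1021 and a zero initial value).

```python
SOH, STX = 0x01, 0x02  # assumed command bytes: 128-byte and 1024-byte packets


def crc16_xmodem(data, crc=0):
    """Bitwise CRC-16, polynomial 0x1021, no reflection, no final XOR."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc


def build_packet(seq, payload):
    """Hypothetical sender side: command, seq, ~seq, padded data, CRC."""
    if len(payload) > 1024:
        raise ValueError("payload too large for one packet")
    cmd, size = (SOH, 128) if len(payload) <= 128 else (STX, 1024)
    body = payload.ljust(size, b"\x00")
    crc = crc16_xmodem(body)
    return bytes([cmd, seq & 0xFF, ~seq & 0xFF]) + body + crc.to_bytes(2, "big")


def check_packet(packet):
    """Hypothetical receiver side: return (seq, data) or None on any error."""
    if len(packet) < 3 or packet[0] not in (SOH, STX):
        return None
    cmd, seq, inverted = packet[0], packet[1], packet[2]
    if seq != (~inverted & 0xFF):          # sequence number vs. its complement
        return None
    size = 128 if cmd == SOH else 1024
    body_and_crc = packet[3:3 + size + 2]  # 130 or 1026 bytes
    if len(body_and_crc) != size + 2:
        return None
    if crc16_xmodem(body_and_crc) != 0:    # CRC over data plus CRC is zero
        return None
    return seq, body_and_crc[:size]
```

The last check relies on a property of this CRC: running it over the data followed by its own two CRC bytes leaves zero, so the receiver only needs to compare the final result against zero.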

I can work out a lot about what’s going on using some black-box probing of the behaviour. Not transmitting anything gets ‘C’ repeated every five seconds until it gives up after ten failures (‘C’ is used for the first packet to indicate CRC-mode, rather than an ACK). Sending an invalid command byte does the same, but faster. Sending a valid data packet byte appears to work - but then fails. After that, both the number of retry attempts and the ACK byte are corrupted.

This suggests the CRC function isn’t working.

It also suggests I have a way to keep going without removing the IC and flashing new code. The metadata packet is received before the CRC error kicks in, after which the transfer can be aborted with 2xCAN. The metadata packet, however, is stored right where the first data packet should subsequently get written to - and where the boot monitor will jump when executing uploaded code. Using the 1024-byte packet command this gives me a full 1026 bytes of data (packet plus two CRC bytes) that I can upload and execute.

Using a combination of uploading bits of code to run and replacement editions of recv_packet to patch in place of the boot code version, I tracked down my multiple errors. I present them here in source file order, but this is not the order I discovered them.

Timeouts don’t stop the transfer

Failing ten times should abort the transfer. But it doesn’t - the code just keeps on spitting out retry bytes. Eventually I tracked this down to the following few lines of code:

retry:		djnz	metadata

done:		pop	hl
		pop	de
		ret

A successful run through the code jumps to done with the zero flag reset. An unsuccessful run is supposed to arrive at done with the zero flag set. I thought that djnz would set the zero flag when b reached zero, but it doesn’t affect flags at all: whatever the zero flag was on entry to retry is what it will be after the djnz has decremented b to zero. Adding an xor a after the djnz sets the zero flag before execution falls through to done.
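The flag behaviour is easy to get wrong, so here is a toy Python model of just these few instructions. It is an illustration under stated assumptions rather than an emulator: the function name and return shape are hypothetical, and only the b register and the zero flag are modelled.

```python
def retry_then_done(b, z_on_entry, with_fix=False):
    """Toy model of `retry: djnz metadata` falling through to `done`.

    Returns (next_label, b, zero_flag).
    """
    b = (b - 1) & 0xFF                    # djnz decrements b ...
    if b != 0:
        return "metadata", b, z_on_entry  # ... and loops, flags untouched
    z = z_on_entry                        # b reached zero: flags still untouched
    if with_fix:
        z = True                          # xor a: a ^ a == 0, so Z is set
    return "done", b, z


# Last retry, entered with Z clear: without the fix, done sees Z reset,
# which the caller reads as success.
print(retry_then_done(1, z_on_entry=False))
print(retry_then_done(1, z_on_entry=False, with_fix=True))
```

Entering retry with the zero flag clear is the interesting case, because the original code reaches done looking exactly like a successful receive.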

The retry byte gets corrupted

After attempting to send a valid (meta)data packet the retry byte becomes corrupted. After a failed packet transfer instead of getting a ‘C’ back I start getting NULs. I tracked this down to the code invoked after a data packet command is received.

recv_body:	ld	(cmd), a
		ld	de, 1*100
		call	recv_byte
		jr	z, retry
		ld	c, a
		call	recv_byte
		jr	z, retry
		cpl
		cp	a, c
		jr	nz, retry
		ld	(seq), a

The problem, of course, is that I assumed early on that c would be preserved as my retry byte value, and subsequently also decided to use c to store the sequence number to compare to its complement. Oops. I accepted that there aren’t many registers on the Z180 and stored the retry byte in memory instead.

Going too far the other way

Now we come to the headline problem - the CRC validation is failing. The other problems would happen in emulation, if I’d exhaustively tested all possible code paths. My djnz emulation didn’t set flags. Yes, that’s right, I wrote assembly code that assumed one thing about an instruction, and code to emulate that instruction assuming something else. People are fallible, tests never exhaustively test all possible code paths, and trusting to discipline to catch problems is naive.

But this problem was different: the bytes were being received correctly. I could send binary code and run it, which wouldn’t be possible if the transfer were corrupted. The CRC code works fine when testing. There’s only one answer left: the CRC code isn’t being called correctly.

		push	hl
		push	bc

		ld	bc, 130
		ld	a, (cmd)
		cp	a, YM_SOH
		jr	z, $+5
		ld	bc, 1026

		ld	de, 1*100
		call	recv_wait

		pop	bc
		pop	hl

This little bit of code is responsible for receiving either 130 or 1026 bytes, the payload and its CRC, depending on whether it’s a SOH or STX command. I carefully preserve bc because b contains my retry counter and c contains (hah!) my retry byte. Then I set up bc to hold the size of the payload and CRC, and receive the data.

Then I restore bc.

At this point in receiving the metadata packet bc contains $0a00. There are still ten retries allowed, and the retry byte has been replaced by NUL. The payload length I calculated is gone.

		jr	z, retry
		call	ym_crc		; CRC should be zero

Then I call the CRC code. This expects a pointer to the data to checksum in hl (check: I preserved hl for exactly this reason) and the length of the data in bc … but bc no longer holds the length of the received data. At $0a00, or 2560 bytes, it’s much bigger than the 130 or 1026 bytes actually received.

The Y-modem 16-bit CRC isn’t the world’s best choice of checksum. It’s definitely an improvement over X-modem’s naive 8-bit checksum but it’s a long way short of a reliable hash. In this case one simple flaw plays out: crc16(0, 0) == 0. Once the running CRC is zero, feeding in a zero byte leaves it at zero.

My emulator’s SRAM is initialised to zero. Real SRAM is initialised to the background noise of the universe. On my emulator, I checksum 2560 bytes and come up with the same answer I would if I only checksummed the first 130 or 1026 bytes: zero. On the real board the CRC computes over 130 or 1026 bytes of data, getting zero - and then goes on to incorporate the noise.
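The difference between the two machines can be reproduced on a host with a short Python sketch. It assumes the standard CRC-16 used by Y-modem (polynomial 0x1021, zero initial value); the payload contents and the seeded random "power-on junk" are invented for illustration and are not the real SRAM contents.

```python
import random


def crc16_xmodem(data, crc=0):
    """Bitwise CRC-16, polynomial 0x1021, no reflection, no final XOR."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc


WRONG_LENGTH = 0x0A00  # 2560: what bc holds after the pop, not 130 or 1026

payload = bytes(range(128))                                   # made-up data
packet = payload + crc16_xmodem(payload).to_bytes(2, "big")   # 130 bytes

# Checked over the correct length, data plus CRC comes out as zero.
print(hex(crc16_xmodem(packet)))

# Emulator: the memory after the packet is all zeros, so the CRC stays zero.
emulated = packet + bytes(WRONG_LENGTH - len(packet))
emu_crc = crc16_xmodem(emulated)
print(hex(emu_crc))

# Real board: the memory after the packet holds power-on junk. A fixed seed
# mimics the junk being the same on every power-up. The result is very
# likely non-zero (a 16-bit CRC misses about 1 in 65536 random patterns).
rng = random.Random(1)
junk = bytes(rng.randrange(256) for _ in range(WRONG_LENGTH - len(packet)))
print(hex(crc16_xmodem(packet + junk)))
```

The zero-filled case passes for the wrong reason, which is why the emulator never flagged the bad length.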

Well - SRAM’s initial state isn’t actually noise. It’s the consequence of unstable states in (more-or-less) flip-flops resolving one way or the other, which usually comes down to microscopic manufacturing differences. As a result, I always get the same wrong value, because my SRAM always has the same uninitialised junk in it at power-on.

Nifty.

Fixing it all

I wrote a simple program to transmit a patch for recv_packet, then do a full Y-modem transfer of something bigger.

I called the code for this the in-system programming writer.

At this point, I can patch the bad code to upload up to 52k of arbitrary code and run it. But it’s not too satisfying to just have a low-effort boot monitor with a Y-modem upload for running new code, so rather than focusing on updating the Flash ROM itself the next task will be to upload an operating system.