Notes on ‘44’
“44” is a 4k demo (intro) for the Atari ST. This machine was launched in the 1980s and features 16 colours and a 8Mhz processor. The executable file fits inside 4 kilobytes (4022 bytes, to precise) without any external data sources or cheats.
This demo took a long, long time.
commit 2122c859079bf853d7fbc4d274e85b15d31c934c Author: Steven Tattersall <steve@tattlemuss> Date: Sat Jul 5 14:53:14 2014 +0100 Initial
I won’t bore you with all the details, but it all started with an email from Damo of Reservoir Gods, so it’s his fault. He had a question about how to code a Python tool for his chip-tracker (still not released, hint hint.)
Little was I to realise that behind his cheery Essex exterior Damo is a master manipulator, and before I knew it I was dusting off my Atari and finding that I still remembered all the Motorola 68000 ISA and the ST hardware register addresses.
Hacking away, I had something on screen quite quickly. The original codebase was a simple bitplane engine that was designed to be a small simple intro for Damo’s tracker, but it soon became apparent that this tracker would not be released. So in the meantime I’d just knock out something simple like a 4K. What was I thinking?
You can find the final buildable source on Github.
(The screenshots in here are from randomly-picked Git commits during the creation process, or bugs from running Hatari in the wrong setup.)
The Psychology of Making a 4K
If I have only one tip for doing a 4K, it would be “don’t do it the way I did it.” I started with something small, then built it up to the 4K limit, fitting extra bits in.
Getting a codebase down to around 4K was quite an easy process (in some ways 4K is a lot of memory), but whatever you start with looks like a total mess, and way too short, if you pack it down too early.
Trying to get the 4K to a standard I would be happy with, took a lot more effort. So I got a the stage quite quickly with a tile effect, a 3D engine and a couple of the effects (the “spikeball”, the dodecahedron, the starfish), but it just sort of… well… lay there being boring.
So the loop was:
- pick up the code again with about 200 bytes free but looking crap
- realise 200 bytes isn’t enough to improve it sufficiently
- rewrite some chunk
- add a bit of work, bit better now
- back to 200 bytes free
- still not happy (repeat)
As a result, there were big periods where the codebase was left for several months while I poked about with something else, and a big black cloud hanging over “that darned project”.
So this is a terrible way to do it. Don’t do this kids. Make something really big that you like, then pack it down, stripping things out if it doesn’t fit.
It took a lot of help and advice from the other Avena guys to help get it over the line, with things like the palette work which really livened up the demo. Once again, special thanks to them.
There is literally nothing interesting about this demo’s codebase other than its desperation to save bytes. Everything runs on the edge of being as unoptimised as possible, to run with a barely acceptable framerate while keeping the (packed) code footprint low. No bitmaps are stored; the only data is the tune and the 3d meshes (letters A,V,E,N, 4, a dodecahedron and two triangles). I finally hit on the name “44” so that I could reuse the same mesh on the title shot.
The 3d engine only supports rotation round 2 Euler axes rather than 3, just enough to give interesting motion. All triangles go through a Sutherland-Hodgman clipper routine since I didn’t have the logic space to early-out. That sort of thing.
One other design decision was to have many bits of code pipe into other bits of code. For example, the 3d engine renders the objects into the 16x16 tile bitmaps at startup. There was a generalised “calculate a series of sine values and scale them” function that was used a lot. Some builds had some render-to-texture effects in, although I cut them as they looked fairly “meh.”
Understand Your Packer
Having a good feel for the packer really helps keep the size down. A lot of this has been detailed before. The canonical reference should be this talk from 12 years ago.
The first thing to say is that Shrinkler is completely amazing, and got me around 400 extra bytes just when I needed it most. Full Kudos to Blueberry for such an amazing tool. The final version shipped with Shrinkler, but for a long time I used PackFire/Exomizer (PackFire’s tiny depacker seems to be a direct port of Exomizer’s depack routine, as far as I can tell). I also coded my own (awful) version of Exomizer to see what made it tick, which beat it by a couple of bytes with some tweaking, but wasn’t worth it.
But the bottom line is that all these packers still use LZ-style string matching, which is where all the biggest gains are. So the biggest message is “don’t do stuff that interrupts matchable strings”. Or even better, shuffle your data and code around so that matches are available.
Here’s an example of something you want to do, but packs badly. It clears a one bitplane of one line of the screen:
move.w d0,0(a0) ; on ST, each bitplane move.w d0,8(a0) ; is at an 8-byte offset move.w d0,16(a0) ; from the previous one move.w d0,24(a0) move.w d0,32(a0) ... move.w d0,152(a0) ; last chunk of the line
Since the offset is encoded as part of the instruction, all those changing values mean separate strings to match, and so this causes bad packing. In many cases this works better:
move.w d0,8(a0) ; this won't run until we patch it. move.w d0,8(a0) ; Don't use 0(a0) since over-greedy move.w d0,8(a0) ; assemblers can change it to (a0) move.w d0,8(a0) ; which is a separate opcode move.w d0,8(a0) ... move.w d0,8(a0) ; last chunk
Then at runtime startup, patch those 8’s with increasing values. (That routine can come in handly elsewere too.)
Even so, the 68000 ISA is a pain to pack. Everything is 16-bit values, the registers are packed into 3-bit fields, so any variation in your instruction sequences mean worse packing. If I were actually good at this, I’d write a tool to recode the registers and orders in tail sequences to match other found code bytes… maybe in another 19 years. But simply using similar registers in similar cases helps. For example, I use d7 as my loop counter wherever possible.
Other things that pack badly:
- relative offsets and branches, unless these offsets are small;
- interleaved data (although not always; you need to check how it packs)
- vertex data, unless you try to quantise/match positions to be repeated
- relocation table and relocation code, so make sure you don’t emit one.
When coding with a PC, it’s really easy to set up a pipeline where you pack after every build, so it’s easy to test these things. It’s also possible to write a system that depacks the packed program and matches it to your codebase, so you can see where the hot-spots are.
There are also some other tricks I’ll keep up my sleeve in case there is a next time, although I’m pretty sure these tricks have been done on other platforms. (They are not clever.)
If it helps spurs anyone on, I have a build which is functionally identical to the release version but 3966 bytes in size rather than 4022, and I’m pretty sure it can go smaller.
The music was a big sticking point. Damo did a nice couple of tracker patterns for me, but no matter how much I fiddled with the effects I couldn’t get them to chime. Damo’s tune was a big, bombastic affair which really needed all the effects slamming into the screen. This really did fit the effects I had, which we more floaty and delicate than that. Also the instruments were quite complicated, so while I got the player down to about 300 bytes, the tune data seemed really difficult to pack.
So I vacillated for at least a year before biting the bullet and hacking my own tune together. This meant I could code the player with an eye for size, and exploit my own knowledge of the player to find compositions that fitted how the player worked, rather than the other way round.
The bad part: I totally suck at music (as you can tell) so I had to do a lot of work to convince myself it was worth the switch. I’m still not sure it was.
The player itself is quite interesting. It’s actually a 6-channel player, it’s just that some of the tracks share a YM square channel. I got this idea from seeing Knaecketraecker on the Commodore Plus/4.
The track arrangement is:
- One track for the pure buzzer;
- Channel A: one track for the bass which is synced to the buzzer (and shares the note data) to give the classic squeltchy bass;
- Channel A: one track for the snare
- Channel B: the lead instrument (with SID, limited to timer divider of 50 on the MFP), plus the more aggressive “spangly” high-pitched instrument
- Channel C: the very first spangly instrument, which is just a single VBL square tone at the highest octave
- Channel C: the repeating insistent echoing pulse instrument
The player just loops through each track and overrides that channel’s register data if its instrument is “active”. This way I didn’t need to merge instruments across a beat like a lot of tracker setups need to, and also I can simplify the note data by not having to continually switch instrument.
Some of the design thinking was copied from examining the Tim Follin driver that I blogged about here. So the player also supports a few simple commands.
- jump/gosub and return. with a single “stack” level (jump is just gosub with return not being called)
- transpose - this helps me repeat note data across tracks while still allowing good packer matching
- set a YM register direct (changes the buzzer envelope and noise pitch)
- set a local track data value (used to extend the lead instrument by poking sustain, and setting long note lengths)
Code size is something like 230 bytes, plus 80+ for a very bad SID voice implementation that I added at the last minute. The note data is about 1K in size but packs down to about 400 bytes. I think that could be improved with some analysis.
Oh, and of course the tune data is typed in as data statements, after finding the notes on my son’s Casio keyboard. I went the full old-school there.
Scripting / Design
I wasted so much time on this.
To avoid some of the bloat of using 68000 with its fat ISA, I had grand dreams of writing a tiny P-code-like scripting loop, which would run all the logic and pack really well. And I’d get a sexy live edit process where I could recompile and see the script updated immediately in Hatari.
And it was a disaster. It all worked, and the scripts seemed smaller. But it was so hard to do anything expressive! Every time I’d end up adding a new command to do, say, multiplies, or loops, and the thing would get bloatier and there would be more special cases and… I’d spent a massive amount of effort to do something simple in assembler.
In the end I finally came to my senses, ripped it out and recoded those bits in assembler again. So here is the entirety of my final script update function:
; ----------------------------------------------------------------------------- ; a0 = script position ; returns a0 = next script position run_script: lea .script_yield(pc),a5 jmp (a0) ;jump to function, will jsr (a5) back ;pushing the next instruction after that instruction to the stack... .script_yield: move.l (a7)+,a0 ;fetch "return" address from stack as next to run rts
It uses a register to count the number of VBLs before run_script is called again.
Lesson learned: don’t overcomplicate.
- Code commits: 798
- Final post-processed demo source size: 105k (assembler), 3900 lines
- Unpacked exe size: 6720/6722 bytes (60Hz vs 50Hz)
- Tools: vasm, m4 preprocessor, Hatari, Python, sh, ParCP for hardware testing
- Project babies: 0
I don’t think I’ll ever take as long to do a demo again. It was fun, though. I learned a lot, and it reminded me that coding can just be a journey into “solving interesting problems”. You can choose what you want to do, and when, and how. And it doesn’t matter if you reinvent the wheel because it’s your own time you are spending.
What was I doing this past 19 years?