Introduction:
When the Enigma Z30 program project started, the final program size was a big unknown.
The Atmel Atmega 328 used in the Arduino Pro Mini, which the KIM Uno uses has 1kB of EEPROM space. This is visible in the KIM-1 monitor from address 0400 to 07FF
The KIM Uno had been used in a Hackaday project and already had a couple of programs at 0400, so the Enigma program was started at address 0500, RAM variables at 0050. Version 1, at 703 bytes, occupied the space from 0400 to 06BF.
Initially, the program only supported lever stepping. This stepping mechanism has a double stepping anomaly. When any of the 3 right wheels show a 9, that wheel and the one to its left is incremented. The following sequence shows how a 9 is propagated through the rotors: 8888->8889->8890->8901->9002->9003. Since a lot of numbers are skipped, the actual machine period, the number of unique combinations between 0000 and 9999 is 8100.
After reading the Enigma Z30 page at the cryptomuseum and realizing there were two different models of the machine, the decision was made to add gear stepping to the simulator. Gear stepping behaves like a car odometer and no numbers are skipped in between 0000 and 9999.
Adding gear stepping to the program increased it in size slightly. The problems started with the built in menu system used to change the machine settings.
The enigma program detects when RAM contains the value of 0 and it initializes it with the default machine settings of rotors, 1,2,3 ring settings 1,1,1,1 and initial rotor positions of 4,3,2,1. Those values can be found in RAM starting at 0050 and can be manually edited to change the machine settings. Once the enigma program is started, it can be stopped and the RAM settings edited. The program can be restarted and it will encrypt with the new settings.
A separate menu program was developed to make the process of changing the settings easier. The enigma program does not use the [A],[B] and [GO] keys. A few instructions in the key reading loop detect whether those keys have been pressed and it jumps to special locations in the enigma program. By default those locations contain a jump to the beginning of the key reading loop. When additional code is written, a jump can be inserted in those locations transfering control to the new code. The [GO] key vector has been initialized with a jump to the RAM variable initialization code and serves to zeroise the machine settings to their default values. A quick keypress can erase the current machine settings and prevent others from getting their hands on a machine with actual encryption keys.
The menu system was designed to re-use the enigma program code that changes the rotors. The first thing is must do is backup the rotor values to a temporary space in RAM. Then the desired settings to be changed can be copied to the rotor memory locations, read a keypress and decide what value to change, overflow conditions must be handled, for example, pushing up on rotor 3, selects rotor 1.
Pressing the [B] key copies the settings from the rotor memory location to their actual location and copies the next settings to change to the rotor addresses. Any error checking must happen at that time, for example, the rotor settings must use rotors 1,2,3 in any order exactly once, 132 is a valid combination, 133 is not.
Finally, pushing [B] restores the rotor values from temporary space to their correct memory location and jumps back to the enigma program.
The code sequence was: copy values out, copy values in, read key and handle increment / rollover, jump value checking if [B] is pushed, return to loop if invalid, jump to next setting.
Doing some copy and paste programming, the program quickly filled out the space from 06BF to 07FF. This was a problem, space below 500 had to be found to finish writing the menu system.
This was the initial driving force to optimize the program. Version 2 grouped some of the menu code into routines and with some other optimizations the program, now with dual gear and lever stepping and a menu system with a new step to select the gear mechanism, occupied the space from 0400 to 06CA. This was only 11 bytes more than V1 for a lot more functionality.
V2 was ready in time for arbitrary deadline of March 31 2017, the starting date for the Vintage Computer Festival East
The retrochallenge 2017/04 started that weekend and the idea was born to enter the contest with the goal to further optimize the program so it would be smaller than 703 bytes.
Techniques:
The program was updated as it was being optimized and debugged.
1) Group temporary variable initializations to avoid having to load registers twice.
i.e: if TEMP1 and TEMP2 are going to be used right away, but TEMP6 will be used further down in the code, initialize it when A has the desired value of 0.
LDA 00
STA TEMP1
STA TEMP2
STA TEMP6
2) If an algorithm, ALWAYS sets a register to a certain value and that value is needed later, there is no need to initialize the register.
i.e:
LDA 00
CLC
ADC *TEMP03 ;will have 0 or 1
ADC *TEMP04 ;will have 0 or 1
ADC *TEMP05 ;will have 0 or 1
CMP #$03
BEQ SETUOK
...
SETUOK ; A is NOW 3
3) Ensure the majority of the instructions are 2 bytes. Use RAM variables in page 0 when appropriate. If base+indexing instructions are used, prioritize the use of the X register for indexing
This function was rewritten so the instructions that use $F8 index with the X register (3 instructions (F8,x) @ 2 bytes + 2 instructions (ROTORS,Y) @ 3 bytes = 12 bytes) vs (3 instructions (F8,Y) @ 3 bytes + 2 instructions (ROTORS,X) @ 2 bytes = 13 bytes).
One byte saved is one byte earned.
05C0 LDY #$00 A0 00
05C2 LDX #$03 A2 03
05C4 SHOWND
05C4 LDA ROTORS,Y B9 57 00
05C7 ASL A 0A
05C8 ASL A 0A
05C9 ASL A 0A
05CA ASL A 0A
05CB STA *$F8,X 95 F8
05CD INY C8
05CE LDA ROTORS,Y B9 57 00
05D1 ORA *$F8,X 15 F8
05D3 STA *$F8,X 95 F8
05D5 INY C8
05D6 DEX CA
05D7 BNE SHOWND D0 EB
4) Fallthrough was used a couple of ocassions when two functions (A and B) are called one after the other, the return instruction in the first function is eliminated and control allowed to fall through to function B. Function B returns to the caller. Doing this saves 3 bytes for the call instruction in the calling loop and one byte for the return instruction, for a savings of 4 bytes.
...
call A
call B
...
Entry Point A
...
Return to Caller
Entry Point B
...
Return to Caller
vs:
...
call A
...
Entry Point A
...
Entry Point B
...
Return to Caller
5) The SHOWEN routine that mixes the rotor values into the correct memory locations for the monitor display routine was modified to also call the KIM-1 display and readkey routines after realizing that both the main loop and the manu loop called SHOWEN and inmmediately called SCANS and then GETKEY. This saved 6 bytes in the menu code. Control to GETKEY is transferred with a jump instead of a call, the return instruction in GETKEY will return to SHOWEN caller, saves the one byte return instruction in SHOWEN
;REM_SUB_SHOWENIGMA
SHOWEN
;copy rotor values (ROTOR,X) into correct place for SCANS (F8,F9,FA)
JSR SCANS
JMP GETKEY ; this is a jump, so the return
NOP ; this was a Return to Caller that can now be eliminated
6) Loop Unrolling: sometimes its shorter to write a loop, sometimes it is better to unroll the loop into a series of repetitive instructions. Sometimes, the same code has to be written twice to select the shorter version. The code below used to be in a fancy loop. It was shorter and clearer to rewrite it the following way. This detects any combination of 1,2,3 at memory location $58 by creating a hash table at TMP02+X containing 1, if rotor X, with X being 1,2,3 was used, 0 if not. This is also a huge buffer overrun, writing 1 to arbitrary memory locations in page 0 depending on the value at LROTOR, MROTOR or RROTOR. The program ensures those memory locations will only have the values 1..3. The only way to exceed those values is through direct memory manipulation, and at that point, any location can be changed, so the only way to experience it is to already have full memory access.
0727 LDA #$01 A9 01
0729 LDX *LROTOR A6 58
072B STA *TMP02,X 95 5F
072D LDX *MROTOR A6 59
072F STA *TMP02,X 95 5F
0731 LDX *RROTOR A6 5A
0733 STA *TMP02,X 95 5F
0735 LDA #$00 A9 00
0737 CLC 18
0738 ADC *TMP03 65 60
073A ADC *TMP04 65 61
073C ADC *TMP05 65 62
7) Instead of writing code in the menu system to rollover the rotor values at 1,2,3 instead of the full 0..9, a generic routine was written that will rollover/under at any point. Once the F3 menu was added and restricted to 0..1 to select gear or lever stepping, this generic routine saved 2 bytes from the total program size.
Wrapup:
Implementing all of the previous techiques, the total program size was reduced to 679 bytes, 24 bytes less than v1.
Further optimization could concentrate on reducing the number of 3 byte instructions. This can be accomplished by changing all the jumps (3 bytes) to conditional jumps (2 bytes). To do that, the program needs to be analyzed so a flag that is constant at that point in the program can be used, maybe the overflow flag is a good candidate.
Also, any tables in program space can be moved to page 0 so that a two byte page0+X index instruction can be used instead of a three byte absolute+x index instruction. This modification increases RAM consumption in order to reduce program size.
Below is the breakdown in enigma algorithm vs. menu system program sizes for each version. Version 3 has an enigma engine that is 1 byte bigger, but it enables a significantly smalled menu, for a smaller total size. This byte can be saved by changing the loops to use 2 byte conditional jumps
v1: enigma 460 menu 243 total 703
v2: enigma 451 menu 263 total 714
v3: enigma 452 menu 227 total 679
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. Antoine de Saint-Exupery