ia64 Projects

Baremetal Itanium!

As of April 2017 ChaOS is being actively developed for Itanium platforms, just as the processor is retired from service. There will be a lot of powerful hardware going cheap over the next decade. For example, I recently sourced two HP Integrity machines, an rx2600 with 1 x 1.3GHz CPU and an rx2620 with 2 x 1.6GHz CPUs, for fifty quid each.

The Itanium is not the sort of processor you take home to meet your parents; it is a supermodel. It has 128 64-bit general registers, 128 82-bit floating point registers, a Harvard-style fetch-decode-execute architecture, and predication like the ARM but with triple instruction bundles and dual bundle pipelining - potentially six-at-once instruction execution. Shame it is coming to end-of-life, but no doubt it lives on beneath the Intel x86-64 microcode.

The Itanium's downfall is partly due to this huge register set, making it more difficult to engineer a timer-based task switch - sadly a prerequisite for a mainstream operating system and very much a hangover from 1970s OS design. Working the Itanium at privilege level 0 and without these constraints, it is easy to see how much faster it can be compared to my Core i3 laptop at similar clock speeds, typically three times faster in the register core despite being vintage 2005.

My initial interest in Itanium was sparked in 2013 as I began to develop a 64-bit compiler for ChaOS to migrate to the x64 UEFI platform, and realized that the very first EFI BIOS machines were IA64 architecture.

In the first few months of this project I have discovered that the Itanium is a deceptively simple machine. It is after all a RISC processor. This project runs as a single thread, using the infinite register stack and just one memory mega-stack. For fun.

Random blog on IA64 compiler and platform development follows below:

8.10.17 {xhc} Taking a break from IA64 to develop a driver for the xHC (USB3) controller on my Dell Inspiron laptop.

7.10.17 {vc}/{cc8} convergence getop() code in {cc8} altered to produce 64-bit immediate constants by default, rather than requiring the (earlier) 0x prefix and q suffix.

4.10.17 EFI memory reclaim Starting now to build system heaps, using MEMDESC info supplied by EFI, initially chains of EFI free memory blocks with a small node record at the start of each. It is worth mentioning that memory blocks used in this way need to be allocated as EFI LoaderData (or any other memory type) - otherwise EFI may perform its own allocations/deallocations on the block which result in node records being overwritten.

To get things off the ground I have created three distinct heaps: dosheap (<1MB), lowheap (<4GB) and highheap (>=4GB). Memory allocations can be directed at a particular heap using the named functions dosmalloc(), dosfree(), lowmalloc(), lowfree(), highmalloc() and highfree(), with malloc() and free() using highheap if available, otherwise lowheap.
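
As a rough illustration of the three-heap split, here is a minimal sketch in standard C. The enum and function names are hypothetical, and it assumes a free block is classified purely by its physical start address; how ChaOS handles a block that straddles the 1MB or 4GB boundary is not stated in the post.

```c
#include <stdint.h>

/* Hypothetical heap identifiers; the heap names follow the post. */
enum heap_id { DOSHEAP, LOWHEAP, HIGHHEAP };

/* Classify an EFI free block by its physical start address.
   Assumption: blocks crossing a boundary are not split here. */
enum heap_id heap_for_block(uint64_t phys_start)
{
    if (phys_start < 0x100000ULL)    return DOSHEAP;   /* below 1MB */
    if (phys_start < 0x100000000ULL) return LOWHEAP;   /* below 4GB */
    return HIGHHEAP;                                   /* 4GB and up */
}

/* Default policy described for malloc()/free(): prefer highheap when
   it holds any blocks, otherwise fall back to lowheap. */
enum heap_id default_heap(int highheap_has_blocks)
{
    return highheap_has_blocks ? HIGHHEAP : LOWHEAP;
}
```

On a machine with no memory above 4GB, default_heap(0) selects lowheap, which matches the fallback described above.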

30.9.17 EFI memmap x64 EFI MEMDESC dump ported to IA64 hits {vc} compiler error when attempting, for the first time, to return a pointer from a function with the call64 (or any) function modifier, e.g. CH$ call64 func(VD); Syntax for function modifiers in ANSI C and C++ is unclear to me, so I have as usual made up my own syntax, using (efi * func) to declare EFI function pointers in the BOOTSERV and RUNTSERV structures. It is also worth mentioning the three pointer operators which have now crept into my syntax - '*' is a regular pointer, 32-bit in x64/chaos , 64-bit in ia64/chaos. '$' is an explicit 64-bit pointer, and '#' is an EFI natural size pointer.

Anyway, checking for a function modifier before the pointer operator causes the {vc} compiler to fault on CH$ call64 func(VD); whilst CH call64 $func(VD); works fine - just not as obvious that the function returns CH$. So I have tweaked {vc} to check for a function modifier before and after pointer operators, so CH$ call64 func(VD); and CH call64 $func(VD); are now equivalent.

With this tweak in place, my x64 EFI MEMDESC dump source code ports unchanged to the IA64 project.

23.9.17 {vc}/{cc8} convergence Adding the asm {keyword} syntax to {vc} and {cc8} is not really enough to structure complete compile-time units which will pass through both of these compilers. It is usual to use C preprocessor directives to provide alternate pathways through a source file, so it makes sense to use my arch keyword to qualify preprocessor directives where possible, i.e. instead of

#ifdef IA64
#include {ia64stuff.htm}
#endif
one could use
#include ia64 {ia64stuff.htm}
and
#if ia64
...
#endif
is a no-brainer.

With this tweak added to {vc} and {cc8} (just to #include a slightly different uefi.html for the x64 machine), my EFI BlkIO device sector browser compiles and works identically on x64 and IA64. Conversely the ability to compile and test code within an x64 UEFI environment, then move this easily into the IA64 project is going to be mega-useful.

17.9.17 strtod() Spent a few hours reacquainting myself with the 32-bit versions of strtod which I have, in preparation for writing a 64-bit IA64/x64 library version. Never noticed before the slight downward rounding produced by x32/ChaOS.strtod() - this because I set FPU ROUNDCHOP mode for easier double->integer conversions. The mathematical differences are insignificant, just one or two bits of mantissa, but I am thinking maybe I should use FPU ROUNDNEAREST mode for strtod, especially when called by my compilers.

Most people overlook the fact that floating-point numbers are in the main inexact; only a subset of real numbers drop exactly into floating point encodings. Exact fractions are found where the number is a negative power of 2 multiplied by an integer, e.g. 7*1/2=3.5, 53*1/16=3.3125 etc.
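
The "integer times a negative power of 2" view can be checked directly from the bits of an IEEE 754 double. The sketch below is standard C, not ChaOS code; the function name is made up, and it only handles positive normal values.

```c
#include <stdint.h>
#include <string.h>

/* Decompose a positive, normal double into  num * 2^exp2  with num odd.
   Assumption: IEEE 754 binary64; zero, denormals, infinities, NaNs and
   negative values are not handled in this sketch. */
void dyadic(double x, uint64_t *num, int *exp2)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* raw encoding of x */

    int biased = (int)((bits >> 52) & 0x7ff);    /* 11-bit exponent field */
    uint64_t n = (bits & ((1ULL << 52) - 1))     /* 52 stored mantissa bits */
               | (1ULL << 52);                   /* plus the implied leading 1 */
    int e = biased - 1075;                       /* 1023 bias + 52 mantissa bits */

    while ((n & 1) == 0) {                       /* strip trailing zero bits */
        n >>= 1;
        e++;
    }
    *num = n;
    *exp2 = e;
}
```

For 3.3125 this yields 53 and -4, i.e. 53/16. For 0.1 it yields a 52-bit integer over 2^55, which shows why 0.1 can only be approximated.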

Even with ROUNDNEAREST, for inexact fractions strtod produces a DB value slightly below the input string half of the time. Thus the mantissa needs to be processed beyond a given number of fractional digits to reproduce an input string such as 1.2345, after rounding, when the output of strtod is passed to dtoa. The problem arises when a decimal fraction ends with a 5 and strtod (very often) produces 4999999999. For speed I prefer to process only one digit beyond the requested number of fractional digits for rounding purposes - no good when dtoaing 3 fractional digits, whilst strtod has produced 1.2344999999999etc for input of 1.2345 - we get a display value of 1.234 instead of 1.235. Given that all of my inputs to strtod are less than 10 fractional digits, a useful kludge is to add 1 to the mantissa, when the output seems to be non-recurring (e.g. NOT something like 0.1 = 0x3fb999999999999a). This produces 1.2345000000000something for input of 1.2345, surely a nicer number.
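
A minimal sketch of the "add 1 to the mantissa" kludge in standard C follows. The helper names are hypothetical, and the recurring-pattern test is only a guess at one possible heuristic (three equal hex nibbles just above the lowest one); the post does not say how ChaOS decides that a value looks recurring.

```c
#include <stdint.h>
#include <string.h>

/* Raw IEEE 754 binary64 encoding of a double. */
uint64_t double_bits(double d)
{
    uint64_t b;
    memcpy(&b, &d, sizeof b);
    return b;
}

/* Add 1 to the mantissa of a positive finite double, which moves it to
   the next representable value above. */
double bump_mantissa(double d)
{
    uint64_t b = double_bits(d) + 1;
    memcpy(&d, &b, sizeof d);
    return d;
}

/* Hypothetical heuristic: treat the value as "recurring" when mantissa
   nibbles 1, 2 and 3 are identical, as in 0.1 = 0x3fb999999999999a. */
int looks_recurring(double d)
{
    uint64_t m = double_bits(d) & ((1ULL << 52) - 1);
    unsigned n1 = (unsigned)(m >> 4) & 0xf;
    unsigned n2 = (unsigned)(m >> 8) & 0xf;
    unsigned n3 = (unsigned)(m >> 12) & 0xf;
    return n1 == n2 && n2 == n3;
}

/* Apply the kludge only when the value does not look recurring. */
double strtod_fixup(double d)
{
    return looks_recurring(d) ? d : bump_mantissa(d);
}
```

Under this heuristic 0.1 is left alone, and exact values such as 1.0 (mantissa all zero) are also left alone because their nibbles trivially match.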

13.9.17 EFI BlkIo sector browser Added read/display sector option when probing the EFI BlkIO handle list, with keyboard loop, '+','-' and (g)oto keystroke roam around. Very handy.

13.9.17 {vc} arch id for inline asm{} blocks Added check for arch id string after asm keyword, with ignore flag now switched on if asm block does not match the CPU architecture for the current compilation run. In other words, if {vc} is compiling for ia64 it will skip asm x64{} blocks, and when compiling for x64 it will skip asm ia64{} blocks. This allows CPU-specific tweaks to be grouped together in C source code as fair dinkum assembly language instructions, instead of inside some faraway macro. {vc} now also accepts my '$' 64-bit pointer operator (used in my cc8 Intel64 compiler) as a synonym for the regular '*' operator. {cc8} uses 'call64' and 'call32' function modifiers to switch output between the IA32e and Intel64 mode code, so by changing the default IA64 FTEMPLATE keyname from "def" to "call64", {vc} /a=ia64 can potentially compile {cc8} sources without modification.

To allow {cc8} to compile {vc} sources, I have added a similar asm id{} check, and added code to generate and initialize argc, argsz and argv for ellipsis functions. As a quick test, IA64 wdispf(CH$ format,...) and its companion functions witoa, wdtoa, wstrrev and strtol now compile unchanged on either {cc8}->x64 or {vc}->IA64.

12.9.17 HP Integrity rx2600/rx2620 EFI_DEVICE_PATHs Probing EFI Device Paths for handles with BlkIo protocol, hit the dreaded ACPI Hardware Path type, which points to the ACPI tables to resolve HWP0002 and HWP0003. Anyone who has tried to write an ACPI table decode will know this is a job for another lifetime, but fortunately, there is enough information in EFI_DEVICE_PATH to deduce the encoding. EFI ACPI Hardware Path _HIDs are a 32-bit number I do not understand, but are kindly displayed by the EFI Shell as HWP0002 or HWP0003 and clearly indicate PCI Bus and AGP Bus respectively. EFI ACPI Hardware Path _UIDs will point to some Method or other in the ACPI table, but are easily recognized to be (PCI Bus Number<<3).
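
One plausible reading of the 32-bit _HID, offered as an assumption rather than something the post confirms, is the compressed EISA ID layout used by the EFI device path macros: the low 16 bits pack three letters at 5 bits each ('A' = 1), and the high 16 bits carry the product number. The sketch below decodes that layout in standard C; the function names are hypothetical, and the bus helper simply inverts the (PCI Bus Number<<3) relationship noted above.

```c
#include <stdint.h>
#include <stdio.h>

/* Decode a compressed EISA-style ID into text such as "PNP0A03".
   Assumption: low 16 bits = three 5-bit letters, high 16 bits = product
   number, as in the EFI EISA_ID() convention. out must hold 8 chars. */
void eisa_id_to_text(uint32_t hid, char out[8])
{
    unsigned vendor  = hid & 0xffff;
    unsigned product = hid >> 16;

    out[0] = (char)('@' + ((vendor >> 10) & 0x1f));
    out[1] = (char)('@' + ((vendor >> 5) & 0x1f));
    out[2] = (char)('@' + (vendor & 0x1f));
    snprintf(out + 3, 5, "%04X", product);   /* four hex digits plus NUL */
}

/* Recover the PCI bus number from a _UID, assuming _UID = bus << 3
   as observed on the rx2600/rx2620. */
unsigned bus_from_uid(uint32_t uid)
{
    return uid >> 3;
}
```

Under that assumption, 0x000222F0 decodes to HWP0002, and a _UID of 0x700 would correspond to bus 0xe0.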

10.9.17 Itanium zx1 PCI Managed to reprogram the PCI buses (ropes) to direct I/O cycles to the BMC VGA whilst AGP PCI IO is turned off, this proved by writing and reading back VGA registers in the BMC VGA which are different to those in the AGP VGA device. The MMIO window is more problematic, currently faulting the processor when accessed via bus 0xe0.

9.9.17 Itanium zx1 PCI Experimenting with dual graphics configuration, using AGP backplane with Diamond Fire GL4 dual DVI card. EFI boot disables the BMC VGA and sets up the EFI console on the Diamond Fire DVI-0 output. Setting video modes via the Diamond Fire BIOS hangs the machine, I am guessing because the PCI config accesses in the BIOS are directed to I/O ports 0xcf8/0xcfc, which are not present on the zx1 MMIO controller. Support for a BIOS set mode here would require VM86 mode and I/O traps on these ports, redirected to the zx1 PCI config address and config data ports.

Setting up multiple VGA adapters is possible, provided only one VGA device is connected to the zx1 bus at a time. I have done this before to switch alternate PCI adapter BIOS ROM images into play, and invoke their respective BIOS set mode functions - up to the point where graphics modes/linear apertures are established all the adapters can coexist on the PCI bus. This is a useful exercise in understanding the various buses and PCI bridges in a system.

IA64 SAL provides PCI config read and write functions, but on the rx2600 at least the config registers are directly accessible in the MMIO block at 0xfed00000. Following the Rope Configuration Base register at 0xfed003a8, the register blocks for bus 0x00,0x10,0x20... are at 0xfed20000,0xfed21000,0xfed22000... and are identifiable by a 0x103c:0x122e signature if present in the system. To post PCI config cycles on to a bus, simply write the seg:bus:dev:fn:reg to block+0x40, and read/write config data from/to block+0x48.
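
The addressing scheme above can be sketched in standard C as follows. The block addresses and the +0x40/+0x48 offsets come from the text; everything else is an assumption, in particular the bit layout of the address word (shown here in the conventional PCI bus<<16 | dev<<11 | fn<<8 | reg form, with the segment omitted) and the 32-bit access width. The function names are hypothetical.

```c
#include <stdint.h>

#define ROPE_BLOCK0   0xfed20000ULL  /* register block for bus 0x00 (from the text) */
#define ROPE_CFG_ADDR 0x40           /* config address register offset */
#define ROPE_CFG_DATA 0x48           /* config data register offset */

/* Register block for a bus: 0x00, 0x10, 0x20... map to
   0xfed20000, 0xfed21000, 0xfed22000... */
uint64_t rope_block(unsigned bus)
{
    return ROPE_BLOCK0 + ((uint64_t)(bus >> 4) << 12);
}

/* ASSUMPTION: conventional PCI bus/dev/fn/reg packing; the zx1 encoding
   is not given in the post and may differ. */
uint32_t cfg_address(unsigned bus, unsigned dev, unsigned fn, unsigned reg)
{
    return ((uint32_t)(bus & 0xff) << 16) | ((uint32_t)(dev & 0x1f) << 11)
         | ((uint32_t)(fn & 0x7) << 8)    | (uint32_t)(reg & 0xfc);
}

/* Post a config read: write the address word, then read the data word.
   ASSUMPTION: both registers accept 32-bit accesses. */
uint32_t rope_cfg_read32(volatile uint8_t *block, uint32_t addr)
{
    *(volatile uint32_t *)(block + ROPE_CFG_ADDR) = addr;
    return *(volatile uint32_t *)(block + ROPE_CFG_DATA);
}
```

On real hardware the block pointer would be the uncached mapping of rope_block(bus), and register 0 of a present rope should read back the 0x103c:0x122e signature.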

The BMC console VGA appears on the rx2600 at bus:dev:fn(0xe0,2,0), but disappears from the EFI PCI listing when an AGP graphics card is installed. However the BMC VGA is still present on the PCI bus, with IO and MEM access disabled. So it should be possible to deactivate the AGP card, configure the BMC VGA, display something on the screen, then decouple it from the PCI bus and switch back to the AGP card. There are two or three other registers on the zx1 to direct VGA cycles down the appropriate bus, some trial and error needed no doubt to get this to work.

5.9.17 Itanium SVGA graphics Dug out my old VGA reference manual (Wilton 1987) as used to develop Lotti all those years ago - so much easier to understand second time around. EGA/VGA was a quantum leap in graphics in its day, but confused me greatly at that time. Taking a fresh look I now see the method in the madness, and how it provided a neat way to read or write 32 graphics bits in one cycle of a 16-bit CPU.

Took only an hour or so to knock up writepixel, hline, vline, and gchar functions, along with cgasavescreen, cgarestorescreen, to be able to save and restore the EFI console. So I can now begin to develop base functions for a GUI on the Itanium, so long as I avoid debug breakpoints whilst in graphics mode. In time debug_donkey will substitute EFI console output in favour of graphical output when a graphics mode is active.

3.9.17 Itanium Video BIOS call Running through VESA BIOS calls, modes with resolution higher than 640x480 (apart from mode 0x6a, 4-plane 800x600 SVGA mode) are actually flagged unsupported in hardware, so the bog-standard Itanium VGA output is rather limited. Also the EFI/BMC console locks up when one of these limited graphics modes is selected. Ah well.

27.8.17 Itanium Video BIOS call Fleshed out the .x86r assembler enough to setup parameters and make entries into the Int 0x10 Video BIOS code. This is far easier than poking opcode bytes into Itanium low memory locations. With a whole load more handlers in place (ITLB miss, DTLB miss, alternate ITLB/DTLB etc) and page-not-present DTLBs for video memory at 0xb0000 and 0xa0000, the rogue instructions which take the system down (on Int 0x10/set mode 3) can be narrowed down to OUTSW page B8000 (clear screen memory), MOVSB page c0000 -> page a0000 (program character set) and STOSB page a0000 (character set also).

Emulating these three cases is tricky, especially because my debug donkey uses the Itanium Register Stack Engine to preserve r1->r15. I envisaged that writes to the backing store would be needed to update registers for the IA32 machine, e.g. advancing EDI, decrementing ECX on a STOSW emulation for example. Happily this approach works (I told you the Itanium is a simple beast). The slightest coding error here hangs the machine, but after a good deal of trial and error, I finally have the Video Mode Set running through to the IRET intercept. Curiously this mode set produces a greyscale 80x25 CGA text screen, but I have done plenty enough VGA programming to recognize this is a DAC palette issue. Pushing the usual 64 entries in the VGA DAC flicks the display back into its proper 16-colour CGA glory.

Noticing that my emulations ran without ld.acq/st.rel semantics, I tried replacing the not-present VGA DTLBs with present DTLBs identity mapped into physical uncached memory, i.e. virtual 0xb0000 -> 0x80000000000b0000, then disabled the emulations. With this mapping the BIOS calls run OK, so maybe the simplest of solutions was under my nose! However I now also have a template for instruction emulation with client register tweaking which is mega-useful. Similarly, trapping the IA-32 intercept, matching the IA-32 IRET opcode and poking new values into cr.ipsr and cr.iip produces a smooth return to IA64 mode after a BIOS callout.

As I had hoped, graphics modes can be set on the Itanium VGA hardware without breaking the remote console. So glad now that I picked up an AGP backplane and original Itanium twinhead Radeon card last year, looks like I will be able to make it work. Is it really only 16 weeks since I started my IA64 compiler?

25.8.17 ia32 disassembler Ported about half of my ChaOS x64 disassembler to the IA64 project to produce the first mnemonic ia32 disassembly inside the IA64 debug donkey. Will improve this in parallel with development of the new {vc} assemblers.

22.8.17 ia32 modes for {vc} Restructured {vc} compiler as planned to allow multiple modes for each /arch definition. Quite simply each processor mode can invoke a different assembler in the compiler back-end. For the moment, processor mode is switched by assembler .pmode directives, but eventually will be switched by the C compiler as needed, by #preprocessor directives and maybe a special function modifier. Initial pmodes for IA64 are ia64,x86r and ia32, providing support for a 16-bit and 32-bit IA-32 code segment, in addition to native Itanium mode. To aid with checking and testing, I have added a second processor set for amd/intel64, also with three processor modes x64,x86r and ia32. For 16-bit and 32-bit code, /a=ia64 and /a=x64 now use exactly the same assemblers.

20.8.17 ia64 ia-32 BIOS call Experimenting with ia-32 mode to fathom how to perform a video BIOS call into the VGA card. Tried all sorts of variations, real-mode, protected-mode and particularly vm86 mode, thinking this would be the preferred route. Because IRET is an intercepted instruction (i.e. has to be emulated) I chose PUSH (int 0x10 offset), PUSH (int 0x10 seg), RETF as a route into the Video BIOS entry point, but achieved only a wall of vm86 GP and Stack faults. In 16-bit real mode, the faults are fatal Machine Checks. Maybe I should have listened to my own advice and done a br.ia straight into the Video BIOS, however...

...noticing that the same instructions sometimes run fine, other times cause these Machine-Checks, I gradually focussed in on the RET and RETF instructions, and proved that the Machine Check occurs when these opcode bytes are exactly 10 bytes ahead of the current single-step fault i.e. as they enter a 10-byte instruction pipeline. Furthermore executing a CALL instruction before RET or RETF enters the pipeline prevents the Machine Check fault. Whether there is an error in the Itanium ia32 machine here, or an error in my setups is a question for another day. Clearly the ia-32 machine is decoding, speculating, branch-predicting ahead of the current ia32 instruction, another fascinating insight into the Itanium processor.

In ia-32 real mode, provided KR0 is set to IOBASE, the IN instruction produces recognizable i/o port values from the VGA hardware. Using the direct br.ia entry method, selected Video BIOS calls run to completion, i.e. to the IRET intercept fault. BIOS functions which touch video memory (such as set VESA mode) inevitably fail, because these addresses are not yet mapped into uncached memory pages. Interestingly Int 0x10/AX=0x8003 (set 25x80 CGA mode, retain screen contents) runs to completion, because this is the mode set already for the EFI console.

So there is still work to do to get Video BIOS calls working fully. Added a Page-Not-Present handler and mapped the VGA aperture to a not-present data TLB. This faults any instructions reading or writing VGA memory addresses. Added a handler to emulate the REP STOSW instruction which is used by the int 0x10/9 write char(s) at cursor BIOS call. For this function at least the characters appear on screen. The complete solution could involve emulation of many ia-32 memory read/write instructions. But I have proved it can be done. Emulations can be added as each faulting instruction is discovered, until the BIOS mode set call runs through to the IRET intercept. The goal is to be able to use the BIOS to set VESA graphics modes. With this done, custom routines can read/write to the screen directly using virtual or physical uncached address mappings, using ld.acq and st.rel semantics.

13.8.17 ia64 ia-32 mode debug Added ia-32 intercept handler working towards an attempt at executing a Video BIOS call; seeing this fault in action I now realise the swathe of instructions which have to be emulated to construct a working ia-32 machine. This explains why the ia-32 mode was eventually dropped in favour of pure software emulation - the PSR.is bit only runs the easy ia-32 instructions. ia-32 instructions such as INT and IRET simply fault the Itanium and provide a pointer to help in skipping the offending opcode bytes after an emulator completes its work.

No matter, a call into the Video BIOS should be possible by a br.ia straight into the Video BIOS, with the IRET intercept fault signalling the end of the BIOS call? Who knows.

10.8.17 ia64 ia-32 mode debug Managed to perform first Itanium ia-32 mode switch via br.ia instruction, discovering along the way that my virtual memory mode switch was not quite what I thought. Having used ssm to flick the PSR translation switches dt, it and rt, I missed the fact that ssm can only set bits 0:23, so I have been running a partial vm mode, with only data address translation switched on. This has been sufficient to work out and prove setups for region registers, translation cache entries and virtual addressing for VGA screen memory. Added a keystroke to the debug donkey to flick all these bits on, happily my itc setups must be OK because everything keeps running. Attempting the ia-32 mode switch without dt and it is undefined, but actually results in an instant fatal machine check.
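
The ssm pitfall is easy to see with the bit positions written out. The sketch below is standard C and uses the PSR bit numbers as I understand them from the Itanium architecture manuals (dt=17, rt=27, it=36); treat those numbers and the helper names as assumptions to verify against the manual.

```c
#include <stdint.h>

/* PSR translation bits (positions assumed from the architecture manual). */
#define PSR_DT (1ULL << 17)   /* data address translation */
#define PSR_RT (1ULL << 27)   /* register stack translation */
#define PSR_IT (1ULL << 36)   /* instruction address translation */

/* ssm carries a 24-bit immediate, so it can only reach PSR bits 0..23. */
#define SSM_MASK 0xffffffULL

/* Bits that an ssm with this mask would actually set. */
uint64_t ssm_effective(uint64_t wanted)
{
    return wanted & SSM_MASK;
}

/* Bits that are silently out of reach and need an rfi (via ipsr) instead. */
uint64_t ssm_missed(uint64_t wanted)
{
    return wanted & ~SSM_MASK;
}
```

Asking for dt, rt and it together leaves only dt within reach, which matches the partial vm mode described above.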

To keep the debug donkey from falling over (remember it calls out to the EFI console, which doesn't do virtual mode) I switch psr.dt off for the display and keyboard loop, then rfi restores psr from ipsr on exit. In this way I can effectively single step the Itanium in virtual address mode. You see deep down, the Itanium is deceptively simple. I now have a pretty solid base on which to research and develop the vm handlers. As ever the equation is installhandler + (improve debug donkey).

Switching on virtual addressing is easy from within an interrupt handler, but tricky otherwise (need to set up all the interrupt control registers, then execute rfi). Taking an idea from IA64 Linux, a quick and easy workaround is to sample iim inside the break handler (like a syscall). Added a bit of code to debug donkey to do this, now break 0x12345 can be used to properly start virtual address mode. The EFI shell runs happily with psr.rt and psr.it set (and identity-mapped translations of course), so there is no need to switch these bits off just yet.

With another couple of handlers installed on IA32 exception and IA32 intercept, my machine will single-step into ia-32 mode, to a low memory address where I have placed an x86 NOP and JMPE instruction. Single-step continues back into IA64 mode, eventually falling over due to a loss of RSE context - single-step in ia-32 mode seems to destroy the contents of IFS, not quite sure why. I may need a separate entry path for the ia-32 handlers into the debug donkey. Nevertheless this problem is overcome by using a MSTACK function template for the wrapper function, i.e. ar.pfs is saved on the memory stack rather than the register stack during the ia-32 callout. The mov ar.pfs=reg instruction is now the only one which cannot be single-stepped, but my ia-32 callout now works. Roll on x86 video BIOS calls.

Yes I know ia-32 hardware support disappears from the Itanium after the Madisons, but ChaOS uses only a subset of the x64 instruction set. Writing an ia-32 software emulator is cumbersome but not difficult, with 128 64-bit registers compared to the 8 32-bit registers on ia-32. Indeed this is the approach adopted by HP engineers for PA-RISC as well as IA-32. Has anyone written an AMD64 emulator for IA64 yet?

2.8.17 ia64 debug Whilst far from finished, disassembler now covers all common A,I,M,B,F and X-unit instructions, so is becoming really useful. Added (a)uto single step, to stress-test the debugger by running many thousands of single-step cycles. Runs fine through ChaOS code, but runs into processor faults while single-stepping EFI display functions. Tried saving a few more registers, i.e. bank 1 r16-r31 which generally hold invalid/misaligned addresses at the faulting instructions (though the ChaOS compiler does not touch these registers). This stops the faults but auto-single-step then gets into an endless loop. Of course the debug handlers are invoking EFI for console output, so the single-step might be looping on a display list within the firmware.

1.8.17 ia64 debug fault Added debug fault handler, and switched on IPSR.db bit to enable instruction breakpoints. Added (e)xit key stroke to debug donkey which sets a breakpoint on b0 as saved on entry to the debug handlers, and (n)ext key to set a breakpoint on iip+0x10. So now I can step over a function call, or single-step into it and skip out at any point. Instruction breakpoint faults occur on each of the three instructions in a bundle, if the ibr[ ] registers are not cleared. Single-step through the bundle can be achieved by setting IPSR.id for each step. Frankly it is easier just to clear the ibr[ ]s after the breakpoint has been hit.

30.7.17 ia64 single-step Another fascinating product of my single-step handler is that processor faults can occur on dependency violations, such as when a stop is missing from the code stream for the desired result. Not yet having tried to understand IA64 speculative instructions, I sort-of expected that at interrupt time in-flight values would all have time to arrive at their destination. I had not considered that these are all carefully designed, and finite states in a parallel execution engine. As such the single-step interrupt provides a proper window into this area, to the point of raising the same faults on single-step as would occur when the parallel execution units are free-running.

30.7.17 ia64 debug donkey/Floating point registers Introducing a floating point register display inside debug donkey handler produced some interesting results along the way, all adding to my understanding of the fascinating Itanium processor. Initially using stfd and ldfd to save and restore some FP registers to a memory stack buffer, I became intrigued when FP values seemed to disappear during single-stepping. The obvious reason is the registers in question are being overwritten, maybe during a hardware interrupt, so I tried a static memory save area etc, all with the same result - FP values disappear when the setf.sig is single-stepped.

The answer to this problem comes when stf.spill and ldf.fill are used to save the full 82-bit FP register contents. setf.sig simply copies a 64-bit integer into the significand of the 82-bit register, and sets the exponent to 0x1003e (the biased encoding of 2^63). This creates a valid 82-bit floating point number, but the top bit of the significand is zero. Such a value has no 64-bit FP counterpart (64-bit doubles have an implied '1' as the top bit of the significand). It can be normalized by fma fr=fr,f1,f0, but for the single-step between setf.sig and fma, a 64-bit double cannot save the in-register value. This is an insight into the rawness of the Itanium, from whence comes its speed.
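
To make the unnormalized state concrete, here is a small software model in standard C of the register fields involved. It is only a sketch: the struct and function names are hypothetical, the exponent is taken to be the 17-bit biased value 0x1003e (bias 0xffff, with the significand's binary point after its top bit), and normalize() merely mimics the effect attributed to fma fr=fr,f1,f0 without modelling rounding or special values.

```c
#include <stdint.h>

/* Software model of an FP register: sign, 17-bit biased exponent,
   64-bit significand with an explicit top (integer) bit. */
struct fpreg {
    unsigned sign;
    uint32_t exp;
    uint64_t sig;
};

#define FP_INTEGER_EXP 0x1003eu   /* assumed biased exponent set by setf.sig */

/* Model of setf.sig: integer goes straight into the significand. */
struct fpreg setf_sig(uint64_t v)
{
    struct fpreg r = { 0, FP_INTEGER_EXP, v };
    return r;
}

/* Shift the significand up until its top bit is set, lowering the
   exponent to keep the value unchanged. Zero is returned as a true zero. */
struct fpreg normalize(struct fpreg r)
{
    if (r.sig == 0) {
        r.exp = 0;
        return r;
    }
    while ((r.sig >> 63) == 0) {
        r.sig <<= 1;
        r.exp--;
    }
    return r;
}
```

For the integer 5 the model gives significand 0xa000000000000000 and exponent 0x10001 after normalizing, i.e. 1.25 x 2^2. Before normalizing, the top bit is clear, which is the state a 64-bit double store cannot capture.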

With 82-bit FP save/restore wrapped around the debug handler, and a rough-and-ready dtoa function, I can now single-step mixed integer/floating point calculations such as those generated by {vc} for multiply, divide and modulo operators. How quickly those Newton-Raphsonian iterations converge for small integers! (Don't try baremetal IA64 programming unless you understand Newton-Raphsonian iterations, or are prepared to learn what they are). The FP fault and GE fault handlers have saved hours of reboot time this week by trapping my coding errors along the way.

26.7.17 ia64 debug handlers Added predicate register save/restore to debug handler wrapper, to properly single-step compare instructions. Added ifs decode to discover valid loc registers range, and a function to use this knowledge to locate and display r32->rxx from the RSE backing store. Added General Exception, Unaligned Reference and Floating Point Fault handlers, all using the common debug donkey function. Faulting instructions (like break) can be skipped by pressing j.
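
The ifs decode mentioned above might look something like this in standard C. The field positions (sof in bits 0-6, sol in bits 7-13, sor in bits 14-17, valid in bit 63) are taken from my reading of the Itanium architecture manual and should be checked there; the struct and helper names are hypothetical.

```c
#include <stdint.h>

/* Decoded register frame marker, as held in cr.ifs / ar.pfs. */
struct frame {
    unsigned sof;    /* size of frame: stacked registers in total */
    unsigned sol;    /* size of locals: inputs plus locals */
    unsigned sor;    /* size of rotating portion, in units of 8 */
    unsigned valid;  /* cr.ifs valid bit */
};

struct frame decode_ifs(uint64_t ifs)
{
    struct frame f;
    f.sof   = (unsigned)(ifs & 0x7f);
    f.sol   = (unsigned)((ifs >> 7) & 0x7f);
    f.sor   = (unsigned)((ifs >> 14) & 0xf);
    f.valid = (unsigned)(ifs >> 63);
    return f;
}

/* Last loc register (r32..r32+sol-1); returns 31 when there are none. */
unsigned last_local_reg(struct frame f)
{
    return 32 + f.sol - 1;
}

/* Last stacked register including outputs (r32..r32+sof-1). */
unsigned last_stacked_reg(struct frame f)
{
    return 32 + f.sof - 1;
}
```

A frame with sof=8 and sol=5 therefore has locals r32->r36 and outputs r37->r39.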

These handlers are a great addition to my ia64 debug toolkit, saving the reset/reboot cycle I have had to bear every time I make a mistake. Deliberate processor faults are equally satisfying.

25.7.17 ia64 debug handler and RSE backing store Attempted to use cover and flushrs to make singlestep handler saved registers visible by inspecting memory below BSP/BSPSTORE, but caused processor faults. Settled on flushrs only, but this only flushes the outer function state to the backing store - so inserted an extra function which saves register state to loc registers before calling singlestep donkey function. Registers in the backing store cannot be accessed via a struct because the RSE inserts a NatVal when BSPSTORE&0x1f8 equals 0x1f8. Added a set of #defines for saved registers r1->r15 representing the order in which they were saved, and a helper function to skip the NatVal while repeatedly subtracting 8 from BSP until the required register is reached.
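
A helper of the kind described, walking down the backing store while stepping over the NatVal slot, could be sketched in standard C like this. The function name is hypothetical and it works on plain addresses, so it can be tried without Itanium hardware.

```c
#include <stdint.h>

/* Step back n registers from a backing store address, 8 bytes at a time.
   Any slot whose address has bits 8:3 all set (addr & 0x1f8 == 0x1f8)
   holds the NatVal collection rather than a register, so it is skipped. */
uint64_t rse_step_back(uint64_t addr, unsigned n)
{
    while (n--) {
        addr -= 8;
        if ((addr & 0x1f8) == 0x1f8)
            addr -= 8;          /* hop over the NatVal slot */
    }
    return addr;
}
```

Starting from 0x1208, two steps back land on 0x11f0 rather than 0x11f8, because 0x11f8 is a NatVal slot.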

Registers used in the coding of the function calls (r1,r4,b0,b7) are saved in the interrupt handler before the intermediate function is called - so these are retrieved from further down the RSE backing store. Using these tweaks I now have a register display for the singlestep handler, really useful for verifying that disassembler display matches live register operations.

24.7.17 ia64 disassembler Finished X-unit disassembler, including long branch instructions, though I will struggle to build a program large enough to need them. Started I-unit disassembler, completing the opcode 0 set (break,nop,hint,mov from IP/PR, sxt,zxt,czxl/r).

23.7.17 ia64 disassembler Added the first glimpses of a disassembler to the single-step handler. Starting with the X-unit (with the smallest instruction set!), and decoding just the MOVL instruction, else showing just instruction pointer/slot and raw 41-bit unbundled instructions. Added assembler support for NOP.X and BREAK.X, then used the embryo disassembler to spot some latent bugs in X1 and M44 encodings. Parallel development of an assembler and disassembler for any architecture is always a potent combination for weeding out coding errors. Decoding of the MOVL imm64 is a test for any programmer, even with a mainstream compiler.

Although NOP.X and BREAK.X encodings accommodate an immediate value up to 62 bits long, I observe that only 21 bits make it to cr.iim, i.e. the L slot is ignored by this implementation (rx2600). More interesting is that break instructions encountered whilst single-stepping are no problem, the break handler is re-entered, and single-stepping can be resumed. At this point I should have singlestep->break->singlestep nested on the stack, but have not attempted a stack dump to prove this. In any case, clearing ipsr.ss has the system running as normal so who cares. Using break.x requires a tweak to the break handler, i.e. need to skip 2 slots instead of 1 to get over the faulting instruction. I do this by checking for the MLX template (t&0x1e)==4 AND slot==1.
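
The slot-skipping rule can be written out as a small standard C function. This is a sketch with hypothetical names: it takes the bundle address, the slot number (the value held in ipsr.ri) and the 5-bit bundle template, and applies the MLX check quoted above so that a break.x at slot 1 is skipped as a two-slot instruction.

```c
#include <stdint.h>

/* Advance past the instruction at (bundle address, slot).
   tmpl is the 5-bit template from the low bits of the bundle.
   An MLX bundle (tmpl & 0x1e == 4) holds a two-slot instruction
   starting at slot 1, so that case skips 2 slots instead of 1. */
void skip_instruction(uint64_t *iip, unsigned *slot, unsigned tmpl)
{
    unsigned step = ((tmpl & 0x1e) == 4 && *slot == 1) ? 2 : 1;

    *slot += step;
    if (*slot > 2) {        /* ran off the end of the bundle */
        *slot = 0;
        *iip += 0x10;       /* bundles are 16 bytes */
    }
}
```

In a handler the updated values would be written back to cr.iip and the ri field of cr.ipsr before the rfi.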

21.7.17 {ia64/chaos} Single Step With only 16 bundles available in the IVT for a single-step interrupt, further reduced interrupt handler template to be a bare function dispatcher, with all the tricky code moved to the donkey callout. This sparse interrupt handler occupies only 8 bundles (0x80 bytes), so it can be used for any slot in the Itanium interrupt system. Thus added a single-step interrupt handler.

Added keypress filter to the break handler, to set IPSR.ss on a certain key value. This causes the single-step handler to be entered, which also waits for a keypress. Displaying cr.iipa whilst leaving cr.iip unchanged seems to produce the desired effect, with execution advancing on each keypress through slot0, slot1, slot2 before advancing to the next instruction bundle. Way to go!

16.7.17 {ia64/chaos} Improved interrupt handler further, towards ia64 heavy specification, i.e. to call out to a sophisticated handler with interrupt collection and external interrupts switched back on - aiming to hold execution on a loop calling EFI readkeystroke(). One has to remember that space in the Itanium Interrupt vector table is limited, as I found to my cost as my developing break fault handler grew beyond 0x400 bytes (the handler immediately after is the external hardware interrupt oops). The solution to the size problem is simple enough, just place a wrapper function in the IVT which calls out to the donkey function. Critical to getting external interrupts working for the callout is the switch back to bank 1 registers before any EFI system interrupts happen, otherwise the system crashes. Note that the bsw instruction which performs the bank switch must be the last in an instruction group otherwise its operation is described as undefined. In fact bsw faults the processor if used without a stop immediately after.

Happily after a good deal of trial and error I now have code which can wait for an EFI keystroke, from inside a fault handler. Obviously the ability to accept keyboard input in this situation is a massive step towards interactive debugging on the Itanium platform.

15.7.17 {ia64/chaos} Working on console display suite, centred around wdispf(CH* format,...) a printf-style function producing formatted output to the EFI console. Whilst VGA direct draw functions are technically easier, using the EFI functions carries the advantage that output is mirrored to the serial console or BMC telnet connection, allowing remote viewing - this is pretty much essential given the noise level of the rx2600 fans. wdispf places output into a static ring buffer, to get around the temporal problems with stack addresses passed to EFI (18.6.17). Support functions now working well include strtol, witoa (itoa for wide characters, with enhancements such as max output length, zerofill ...), the ctype functions isdigit(), isalnum() etc, and reduce(number,base), a divmod function to halve the math workload of witoa.
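To illustrate how a divmod helper feeds a wide-character itoa, here is a minimal C sketch. It is an assumption about the general shape only: the signatures, the name witoa_sketch, and the use of uint16_t for EFI wide characters are illustrative and are not taken from the ChaOS sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Divide *number by base in place and return the remainder, so the
   caller gets quotient and digit from a single call. */
static unsigned reduce(uint64_t *number, unsigned base)
{
    unsigned rem = (unsigned)(*number % base);
    *number /= base;
    return rem;
}

/* Convert value to 16-bit wide-character text in base 2..16.
   width is the minimum field width, padded with '0' if zerofill is
   set, otherwise with spaces. out must have room for maxlen
   characters including the terminator. Returns the number of
   characters written, or 0 if the result would not fit. */
static size_t witoa_sketch(uint64_t value, unsigned base, size_t width,
                           int zerofill, uint16_t *out, size_t maxlen)
{
    uint16_t tmp[64];            /* 64 digits covers base 2 */
    size_t n = 0, i = 0, total, pad;

    if (base < 2 || base > 16 || maxlen == 0)
        return 0;
    do {                         /* digits come out least significant first */
        tmp[n++] = (uint16_t)"0123456789abcdef"[reduce(&value, base)];
    } while (value != 0);

    total = n > width ? n : width;
    if (total >= maxlen)         /* no room for digits plus terminator */
        return 0;
    pad = total - n;
    while (pad--)
        out[i++] = zerofill ? '0' : ' ';
    while (n)
        out[i++] = tmp[--n];     /* reverse into the output buffer */
    out[i] = 0;
    return i;
}
```

The max-length check is what makes this safe to call with a fixed-size display buffer.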

The result is a flexible EFI console display function which works inside an interrupt handler. Just a few minutes with this up and running produces a break fault handler which can display the bundle address and slot number of a break instruction, then adjust the slot and bundle address (if at slot 2) by rewriting the IIP and IPSR registers to skip the break instruction. I now have the development template for any Itanium fault/exception/interrupt handler!
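The skip-the-break adjustment can be sketched in C as follows. This is an illustration under stated assumptions, not the handler itself: the struct and function names are hypothetical, the slot number is taken to live in the two-bit ri field of the saved PSR (bits 41-42), and bundles containing a two-slot long instruction (MLX template) are not handled.

```c
#include <stdint.h>

#define PSR_RI_SHIFT 41
#define PSR_RI_MASK  (UINT64_C(3) << PSR_RI_SHIFT)

/* Hypothetical copy of the interruption state the handler rewrites. */
typedef struct {
    uint64_t iip;   /* address of the bundle to resume in (16-byte aligned) */
    uint64_t ipsr;  /* saved PSR, whose ri field selects the slot */
} resume_state;

/* Advance the resume point by one instruction slot: slot 0 -> 1 -> 2,
   then on to slot 0 of the next 16-byte bundle. */
static void skip_one_slot(resume_state *s)
{
    uint64_t slot = (s->ipsr & PSR_RI_MASK) >> PSR_RI_SHIFT;

    if (slot >= 2) {
        slot = 0;
        s->iip += 16;
    } else {
        slot++;
    }
    s->ipsr = (s->ipsr & ~PSR_RI_MASK) | (slot << PSR_RI_SHIFT);
}
```

In a real handler the two fields would be read from and written back to cr.iip and cr.ipsr before rfi.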

8.7.17 {vc} debug info Revised variable-length SY structure (symbols) to align records on 4-byte boundaries, and add a locr member to store RSTK register number, leaving rtaddr free for MSTK shadow address. Added MOVLOCMLOCR pcode to move RSTK value to MSTK when & operator is first used on a register variable. This completes a fix for the address-of-register problem.

7.7.17 {vc} debug info Revised variable-length UT structure (type system) to align records on 4-byte boundaries - this to mitigate problems (IA64 alignment problem 18.6.17) when accessing these structures down the line (for compile and debug purposes). Besides padding UT namestrings, UT->prev is now reduced to one byte, recording the dword count of the previous UT record. This limits a UT record to 1020 bytes, sufficient to describe structures with up to 250 members, should be plenty.

Recompiled armc1 with the new UT regime, happily pi/chaos still works fine.
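The 1020-byte limit follows directly from the one-byte prev field: 255 dwords of 4 bytes each. A minimal C sketch of the padding and encoding, with hypothetical helper names (only the 4-byte alignment and the dword-count idea come from the entry above):

```c
#include <stddef.h>
#include <stdint.h>

#define UT_ALIGN     4u
#define UT_MAX_BYTES (255u * UT_ALIGN)   /* 1020 bytes */

/* Round a raw record size up to the next 4-byte boundary. */
static size_t ut_padded_size(size_t raw)
{
    return (raw + UT_ALIGN - 1u) & ~(size_t)(UT_ALIGN - 1u);
}

/* Encode the padded size of the previous record as a dword count.
   Returns 1 on success, 0 if the record is too large for one byte. */
static int ut_encode_prev(size_t prev_record_bytes, uint8_t *prev)
{
    size_t padded = ut_padded_size(prev_record_bytes);

    if (padded > UT_MAX_BYTES)
        return 0;
    *prev = (uint8_t)(padded / UT_ALIGN);
    return 1;
}
```

Walking backwards through the table is then a matter of subtracting prev * 4 bytes from the current record address.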

1.7.17 {vc} efi FTEMPLATE Mystified temporarily by the corruption of new ellipsis mstack variables by callouts to EFI functions - only to realise that arguments passed in registers to EFI require memory stack shadow space. This boils down to a misunderstanding on my part of the implementation of register variables in a modern compiler. ANSI C states plainly that the address of a register variable cannot be taken, whereas ANSI C++ states that the register keyword will be ignored if the address of the variable is taken. The only way this can work is to allocate memory stack shadow space for all register variables, and to switch to the memory stack whenever the & is used. My first Itanium compiler, {itc1}, uses only the register stack; approaching the problem from the wrong direction is producing interesting results along the way.

28.6.17 {vc} varargs Revised ellipsis handling to cope with variable stack alignment. Rather than defining argv as an array, constrained to the prevailing stack alignment (23.6.17) I have decided to use explicit addressing, dragging the current mstackalign value into the function body itself.

So now ellipsis generates three automatic symbols, UC* argv is a simple pointer to the first stack argument, integer argc is the argument count, and integer argsz is the byte spacing between mstack arguments. argc and argsz are initialized by generating code at call time to pass values in r0/r1 (r8/r7 on IA64) which are moved on function entry into the relevant local variables.

On function entry, argv can be cast to any type appropriate to the conceptual argv[0]; subsequent arguments are accessed by simply adding argsz, e.g. VD* nextarg=(VD*)(argv+=argsz);

Recognizing that the traditional SL main(UL argc,CH* argv[]) prototype only works when mstackalign equals sizeof(CH*) is a revelation.
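The argv/argc/argsz scheme can be mimicked in standard C to show why explicit stride addressing is independent of stack alignment. This is only a sketch: the helper names are hypothetical, the arguments are assumed to be 64-bit values stored at the start of each stack element, and memcpy stands in for the loads the compiler would generate.

```c
#include <stdint.h>
#include <string.h>

/* Fetch the i-th variable argument as a 64-bit value. argv points at
   the first stack argument and argsz is the byte spacing between
   arguments (the stack granularity), which may exceed 8. */
static uint64_t vararg_u64(const unsigned char *argv, uint64_t argsz,
                           uint64_t i)
{
    uint64_t v;

    memcpy(&v, argv + i * argsz, sizeof v);
    return v;
}

/* Example ellipsis body: add up argc arguments spaced argsz apart. */
static uint64_t sum_varargs(const unsigned char *argv, uint64_t argc,
                            uint64_t argsz)
{
    uint64_t total = 0;
    uint64_t i;

    for (i = 0; i < argc; i++)
        total += vararg_u64(argv, argsz, i);
    return total;
}
```

The same function body works unchanged whether argsz is 8 or 16, which a fixed argv[] array type cannot offer.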

23.6.17 {vc} varargs Added ellipsis handling to call FTEMPLATE and fctbody FTEMPLATE by forcing the variable function arguments on to the memory stack. These variable arguments are stored on the memory stack in reverse order, so that they can be addressed from within the function as a simple array with element size equal to FTEMPLATE->mstackalign. Added code to create a local symbol argv to make this array accessible to the ellipsis function body. Whilst it is possible to arrange the memory stack for functions like printf with different-sized elements, this makes no sense going forward with the alignment-check constraints on 64-bit processors.

FTEMPLATEs allow mismatches between naturalsize integers and stack granularity, and I am currently using 16-byte stack alignment even though my Itanium project currently uses at maximum only half of each stack element - this just to make sure no naturalsize dependencies creep into the compiler code. Added a new 128-bit integer to the {vc} inbuilt type system (US). Where stack alignment equals natural pointer size, it is convenient to access the stack via UC* argv[]. Where these sizes are mismatched, UL argv[] (32-bit stack), UQ argv[] (64-bit stack) and US argv[] (128-bit stack) are used instead. This way varargs can be accessed intuitively using argv[0],argv[1],argv[2],... from left to right, with typecasts if appropriate.

This FTEMPLATE system may seem overly complex, but by dissociating stack granularity and natural integer size, all the ingredients are in place to compile IA32 code to use a 64-bit stack, thus making mixed-mode code much more straightforward.

18.6.17 Itanium Data Alignment Beginning memory stack structure, union and array addressing now, aware that PSR.ac is set by default at EFI Boot Manager time, so it is easy to force an alignment check fault. What I did not expect was that clearing PSR.ac does not stop the alignment check faults, i.e. these check faults are an architectural feature, not soft warnings. Having been used to (almost) complete freedom regarding data alignment (i.e. coming from IA32), this comes as a shock. No doubt data member alignment in public structures such as EFI follows rules which match the architectural limitations of IA64, x64 and ARM. Generally, for arithmetic and pointer types the member address should be a multiple of the member size (address mod membersize equal to zero) to avoid this problem, so a compiler must insert padding bytes automatically. As a quick fix, I have just added a compiler warning where this rule is violated; I am not keen to add automatic padding until I have studied how this is done in the wider scheme of things. Meanwhile the Itanium processor will sure let me know when I get the alignment wrong.
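The natural-alignment rule, and the padding a compiler would eventually insert, come down to two small calculations. A minimal C sketch with hypothetical function names, assuming each member is a scalar whose required alignment equals its size:

```c
#include <stddef.h>

/* Returns 1 if a scalar member of member_size bytes placed at the
   given structure offset is naturally aligned, else 0. This is the
   condition a compiler warning would test. */
static int member_is_aligned(size_t offset, size_t member_size)
{
    return member_size == 0 || offset % member_size == 0;
}

/* Number of padding bytes to insert before the member so that its
   offset becomes a multiple of its size. */
static size_t padding_before(size_t offset, size_t member_size)
{
    size_t rem = member_size ? offset % member_size : 0;

    return rem ? member_size - rem : 0;
}
```

For instance, an 8-byte pointer following a 4-byte integer at offset 0 would land at offset 4, fail the check, and need 4 bytes of padding.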

18.6.17 Itanium EFI memory stack Using EFI memory stack to build a string for output to the EFI console produces an interesting yet useless result, i.e. the memory stack content is overwritten before the console gets the display data. This indicates that Itanium EFI console output is asynchronous, being accumulated perhaps as a series of memory pointer/length records to be processed later (maybe on a timer interrupt which uses the same stack?). Therefore display data must be placed in a persistent memory area for it to make it to the console. Furthermore, the garbage output to the screen via the memory stack is different to the garbage sent to BMC, indicating a further time-lapse between screen and remote console display processing.
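One way to give display data a persistent home is a static ring buffer, so that the pointer handed to the console stays valid after the calling function's stack frame is gone. The following C sketch is an assumption about how such a buffer might look; the names, the 4096-character size, and the wrap policy are all illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_CHARS 4096u

static uint16_t ring[RING_CHARS];  /* static storage, outlives any stack frame */
static size_t ring_next;           /* index of the next free character */

/* Copy a zero-terminated wide string into the ring and return a
   pointer to the persistent copy. Each string is kept contiguous:
   if it does not fit in the space left, writing restarts at the
   beginning and overwrites the oldest data. Returns NULL if the
   string is longer than the whole ring. */
static const uint16_t *ring_store(const uint16_t *s)
{
    size_t len = 0, i;
    uint16_t *dst;

    while (s[len] != 0)
        len++;
    if (len + 1 > RING_CHARS)
        return NULL;
    if (ring_next + len + 1 > RING_CHARS)
        ring_next = 0;
    dst = &ring[ring_next];
    for (i = 0; i <= len; i++)     /* copies the terminator too */
        dst[i] = s[i];
    ring_next += len + 1;
    return dst;
}
```

The ring must be large enough that the console has consumed a string before the write position comes round to it again.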

Calling EFI console output functions from within a break interruption handler is similarly frustrating, even if persistent memory locations are used. This effect can be sidestepped by using functions which write directly to the VGA memory buffer. Fortunately, EFI screen console output appears to perform an IO read of the VGA text cursor before each output phase - so custom display functions can mesh with the EFI console by simply updating the VGA cursor position.

17.6.17 Auto Memory Stack Local Variables {vc} default FTEMPLATE for ia64 places local variables on the register stack in the first instance, equivalent to placing the register keyword in front of all local data declarations, and works well, up to a point. Declarations which will not fit in a register naturally go on to the memory stack, which is easy because this can be decided as the data declaration is processed. But using register locals as the default throws up the awkward case where an attempt is made to take the address of such a variable - this is only possible for memory stack variables.

Rather than revert to a full register declaration for all locals, I inserted code in doaddressofoperator() (& operator handler) which switches variables from the register stack to the memory stack when this happens. This violates ANSI C 6.3.3.2, which states that address-of register is just not allowed; however ANSI C also says that the register keyword is no more than a hint to the compiler - i.e. if the compiler runs out of registers then the variable will be on the memory stack anyway, and the address-of operator is perfectly valid.
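The switch from register stack to memory stack on first use of & can be sketched in C as below. The member names locr and rtaddr and the mstackalign granularity echo names used elsewhere in this blog; everything else (the struct layouts, take_address, the need_copy flag) is hypothetical and much simpler than a real symbol table.

```c
#include <stdint.h>

/* Hypothetical, cut-down symbol record. */
typedef struct {
    int     in_register;  /* 1 while the variable lives on the register stack */
    int     locr;         /* register stack slot number */
    int64_t rtaddr;       /* memory stack offset, -1 until allocated */
} symbol;

/* Hypothetical, cut-down frame record. */
typedef struct {
    int64_t mstack_size;  /* bytes of memory stack allocated so far */
    int64_t mstackalign;  /* memory stack granularity in bytes */
} frame;

/* Called when & is applied to sy. On first use the variable is given
   a memory stack slot and *need_copy is set, telling the caller to
   emit code that moves the register value to memory. Returns the
   memory stack offset of the variable. */
static int64_t take_address(symbol *sy, frame *fr, int *need_copy)
{
    *need_copy = 0;
    if (sy->in_register) {
        sy->rtaddr = fr->mstack_size;
        fr->mstack_size += fr->mstackalign;
        sy->in_register = 0;
        *need_copy = 1;
    }
    return sy->rtaddr;
}
```

Only the first & triggers allocation and a copy; later uses simply return the existing offset.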

14.6.17 Itanium break Interruption Working now on FTEMPLATE for ia64 interrupt function modifier, using the software break at IVA+0x2c00. A simple memcpy of a handler function into the IVA, followed by srlz.i;;srlz.d;; seems to work fine. The Itanium Register Stack Engine (RSE) can be used inside this handler - the trick is to insert cover;; before alloc; there is no need to restore ar.pfs (this is done automatically by rfi following cover). Managed to perform EFI display calls within this interrupt handler, hence nesting allocs is no problem, though EFI console display is truncated/garbled somehow. My VGA direct-draw functions work just fine. This facility will make the development of other handlers much easier than I had expected (how could I develop a TLB miss handler without it?).

11.6.17 Itanium SAL call is straightforward, worked first time using cast of salproc entrypoint location in SAL System Table to efi function pointer (entry point is followed by GP for SAL call, hence this is effectively a PLABEL):
salproc=(SQ (efi#)(UQ a,UQ b,UQ c,UQ d,UQ e,UQ f,UQ g,UQ h))&ss->salproc;
(*salproc)(0x01000012,0,0,0,0,0,0,0); //SAL_FREQ_BASE
returns 200000000 in r9 on Integrity rx2600, the 200MHz platform base frequency.

10.6.17 {vc} With FTEMPLATE now allocating a varying number of sys registers, local symbol addressing from within asm{} blocks is variable too. Added _lr# aliasing to register stack, similar to loc# but subtracting sys register count which is known at compile time. Thus asm{} blocks inside fctbody can reference local symbols as _lr0,_lr1, instead of loc3,loc4 which are dependent on the correct FTEMPLATE being in use.

10.6.17 {vc} and Itanium PAL call Attempting to produce a FTEMPLATE for an Itanium PAL call, using some MSTACK control fields to pass arguments on the memory stack.
Of course this does not work, I misread the meaning of stacked registers as opposed to static registers in the PAL sense. So for anyone else who tries this,
PAL static registers are r28,r29,r30,r31
PAL stacked registers are r32,r33,r34,r35, also set r28 equal to r32
PAL return registers are r8,r9,r10,r11.
Took a little while to work that one out. PAL call also needs an assembly language wrapper to calculate a return address, avoiding br.call for the entry, which is easy in the ChaOS IA64 model since code label values generate a simple GP-relative offset, as in this snippet:
movl r8=palreturn;;//offset
add r8=r8,r1;; //add GP
mov b0=r8
mov r28=in3
mov r29=in2
mov r30=in1
mov r31=in0
mov b7=in4;;
br.cond.sptk.many b7;;
palreturn:

10.6.17 {vc} FTEMPLATE developed further, with two FTEMPLATE pointers active in fctbody(), one for the function being compiled, another to control generation and stack positioning of function arguments. FTEMPLATES feed into a FRAME structure, which contains address mapping for five distinct stack areas (ins,syslocals,locals,tmps, outs) on both the register stack and the memory stack. syslocals (things like ar.pfs, function return address etc) are controlled by bit settings in a flags parameter, one for each stack.

Whilst necessarily complicated, FTEMPLATE coding falls almost entirely into three blocks -
(1)in addresslocalsymbols(), to allocate the sys area between ins and locals
(2)in flushpcos(), case ENTERP to generate code on function entry
(3)in flushpcos(), case LEAVEP to generate code on function exit

As a final tweak, temporary copies of FTEMPLATE->rsysflags and msysflags are used by fctbody for code generation. This allows tweaks to the stack frames, by setting or clearing bits as desired - e.g. (ia64) no calls to other functions, so STACKIP (save b0) and STACKGP (save r1) can be skipped. Therefore a function with no stack requirements could generate no code for ENTERP, and just br.ret b0 for LEAVEP.
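The per-function tweak of the sys flags amounts to masking bits on a working copy. A minimal C sketch: STACKIP and STACKGP are the names used above, but their bit values, the function names, and the idea that each set bit costs exactly one sys slot are assumptions made for illustration.

```c
#include <stdint.h>

#define STACKIP 0x1u  /* save return address (b0); bit value assumed */
#define STACKGP 0x2u  /* save global pointer (r1); bit value assumed */

/* Work on a copy of the template flags so the FTEMPLATE itself is
   never modified. A leaf function (no calls out) needs neither the
   return address nor GP saved. */
static uint32_t effective_sysflags(uint32_t template_flags, int makes_calls)
{
    uint32_t flags = template_flags;

    if (!makes_calls)
        flags &= ~(uint32_t)(STACKIP | STACKGP);
    return flags;
}

/* Count the sys slots to reserve, assuming one slot per set flag. */
static unsigned sys_slot_count(uint32_t flags)
{
    unsigned n = 0;

    while (flags) {
        n += flags & 1u;
        flags >>= 1;
    }
    return n;
}
```

With both bits cleared the slot count is zero, which corresponds to the case where no ENTERP code is generated at all.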

28.5.17 {vc} a virtual C compiler Deconstructed {itc1} IA64 C compiler into a group of function pointers, placed in ARCH structure to create {vc} to begin to draw {armc1}, {ebc1}, and {itc1} into one multi-architecture compiler. Compilation is directed by a new command-line argument /a=ia64. I am currently working from an IA64 register-only model, through a mixed-stack model towards a memory-only stack model such as the one I have always used on my ChaOS compilers. Before {vc}, function calling conventions were very much hard-coded in my compilers. These become FTEMPLATE structures which control the addressing of arguments (outs and ins in Itanium-speak) in the outer and inner functions. FTEMPLATES include a namestring and are stored in a table within the ARCH structure, invoked by using the namestring as a function modifier. In this way, a namestring such as efi can invoke architecture-specific behaviour as defined by the UEFI Specification.
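The lookup of a calling convention by its namestring can be sketched in C as follows. The struct layouts are heavily reduced and hypothetical (the _SKETCH suffix marks them as such); the real ARCH structure also carries the code-generation function pointers, which are omitted here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-down calling-convention template. */
typedef struct {
    const char *name;         /* namestring used as a function modifier */
    unsigned    mstackalign;  /* memory stack granularity in bytes */
} FTEMPLATE_SKETCH;

/* Hypothetical cut-down architecture descriptor. */
typedef struct {
    const char             *name;        /* e.g. the value given to /a= */
    const FTEMPLATE_SKETCH *templates;   /* table of calling conventions */
    size_t                  ntemplates;
} ARCH_SKETCH;

/* Find the template whose namestring matches a function modifier.
   Returns NULL if the architecture does not define that modifier. */
static const FTEMPLATE_SKETCH *find_template(const ARCH_SKETCH *arch,
                                             const char *modifier)
{
    size_t i;

    for (i = 0; i < arch->ntemplates; i++)
        if (strcmp(arch->templates[i].name, modifier) == 0)
            return &arch->templates[i];
    return NULL;
}
```

Keeping the table inside the architecture descriptor means the same modifier name can map to different conventions on different targets.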

A FRAME structure, built for each function body expands the FTEMPLATE to map local storage in fine detail, and is inevitably complex. Currently putting the finishing touches to the pal modifier for Itanium firmware PAL function calls.

12.5.17 Virtual address mode Identity-mapped VM now running, using itc.i and itc.d to create TLB cache entries for a handful of 4GB and 64MB descriptors, enough to get things off the ground. Will probably put this on the backburner for now, because EFI SetVirtualAddressMap can only be called after ExitBootServices, which destroys most of the EFI pre-boot environment.

11.5.17 End of the line:Intel announces Itanium 9700 series as being the seventh and final generation of IPF.

6.5.17 Progress:Itanium compiler taking shape, producing ChaOS FTRAW image in PE32+ wrapper, acceptable to EFI firmware as a native executable program (PE32 type 0x200 = IA64). Presently using register stack model only (no memory stack yet), but sufficient to call EFI functions for console input (getkey etc) and output to the EFI Shell environment. Now beginning to get under the hood of this fascinating processor.

Accessed Itanium IO ports and MMIO addresses yesterday for the first time, which require specific variants of ld and st instructions. Also managed to switch to virtual address mode, and access IO via page with UC attribute.

April 2017: Ported {ebc1} -> {itc1} to produce native Itanium IA64 code. Tried to benchmark the Itanium against the i3-4010U in my development laptop, using assembly language routines for both which convert a 64-bit value into 16 hexadecimal digits. First attempts showed the IA64 consuming 300 clock cycles versus 200 for the i3, kind of what I expected given that my Itanium is vintage 2004. However by careful bundling of the instruction triples (ref:Itanium Architecture for Programmers), the Itanium streaks ahead, consuming only 57 clock cycles. What a shame this processor will never be produced on thinner silicon.
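For reference, the benchmarked task (a 64-bit value to 16 hexadecimal digits) looks like this in portable C. This is not the assembly language routine that was timed, just a statement of the workload; the function name is hypothetical.

```c
#include <stdint.h>

/* Write the 16 hexadecimal digits of value into out, most significant
   digit first, followed by a terminating NUL. */
static void u64_to_hex16(uint64_t value, char out[17])
{
    static const char digits[] = "0123456789ABCDEF";
    int i;

    for (i = 15; i >= 0; i--) {
        out[i] = digits[value & 0xF];  /* lowest nibble goes rightmost */
        value >>= 4;
    }
    out[16] = '\0';
}
```

Each of the 16 nibbles is independent of the others, which is the kind of work that careful bundling can spread across parallel instruction slots.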

March 2017: Ported {armc1} -> {ebc1} to produce EFI Byte Code for the EFI pre-boot VM (provided it is supported in firmware). This provided a first glimpse of the Itanium environment.