ia64 Projects


Baremetal Itanium!

As of April 2017 ChaOS is being actively ported to Itanium platforms, just as the processor is retired from service. There will be a lot of powerful hardware going cheap over the next decade. For example, I recently sourced two HP Integrity machines, one rx2600 with 1 x 1.3GHz CPU and one rx2620 with 2 x 1.6GHz CPU, for fifty quid each. My rx2660 cost quite a bit more.

The Itanium is not the sort of processor you take home to meet your parents; it is a supermodel. 128 64-bit general registers, 128 82-bit floating point registers, Harvard-style fetch-decode-execute architecture and predication like the ARM but with triple instruction bundles and dual bundle pipelining - potentially six-at-once instruction execution. Shame it is coming to end-of-life without ever becoming famous.

The Itanium's downfall is well documented, mostly by those who have little idea of what it is. CPU clock speeds have settled in the low Gigahertz range for a couple of decades now. Recognising this fact, the Itanium designers increased the core register count by a factor of 8 over AMD64, and delivered a core capable of executing 5 or 10 parallel operations on each clock cycle, a proper number-cruncher.

My initial interest in Itanium was sparked in 2013 as I began to develop a 64-bit compiler for ChaOS to migrate to the x64 UEFI platform, and realized that the very first EFI BIOS machines were IA64 architecture.

In these first few months of the project I have discovered that the Itanium is a deceptively simple machine. It is after all a RISC processor.

Random blog on IA64 compiler and platform development follows below:

9.8.18 EFI Device Driver Created template for EFI App and Device Driver, essentially a machine-specific startup delivering control to a main() function. The key difference between the two templates is the program subtype which causes the device driver to remain in memory after startimage(), and the unload callback (like an Event callback) which provides a place for cleanup code should the driver be asked to stop.

Also in the driver template is an example of InstallProtocol exporting a small function pointer table. My Itanium compiler does not produce shadow PLABELs for function pointer assignments. However the machine-specifics for assigning values to protocol function pointers can be reduced to a small subfunction or macro. Of course the driver unload function calls DeinstallProtocol!

3.8.18 EFI Event callbacks Exploring EFI timer callbacks. x64 was straightforward enough, I had an on-screen hours:mins:seconds display running within an hour. ia64 proved more challenging. The ia64 Event callback function pointer references a plabel, not a bare function pointer. And some registers need to be saved and restored around the callback, a fact which caught me out. Overwriting r7 in a timer callback handler seems innocuous, but on the rx2660 it will subtly corrupt the EFI event tables, so the second and subsequent events on a periodic timer just don't happen.

Reading up on the subject shows that r4-r7 are designated as preserved, whilst being merrily trashed by my vc /a=ia64 compiler. Added a callback function modifier to vc, with a new function template to save and restore r4-r7. Adding this modifier to the callback function in the C code is a neat fix.

1.8.18 EFI Simple Network Experimenting with the baseline EFI network protocol. Network controllers are already running when EFI Boot Manager runs, but EFI_SIMPLE_NETWORK may be stopped. Generally, if higher level network protocols have been installed (as on later UEFI machines), i.e. MNP, ARP, UDP, DHCP etc, Simple Network will be running at the bottom of the network stack. If stopped, it can easily be started, but will not transfer incoming frames via the ->receive function until receivefilters have been set up. There is a trick to perform (thanks to EFI 1.0 as on the rx2600) if multicast is desired - receivefilters() needs to be called twice: first to enable unicast and broadcast whilst disabling multicast and resetting the multicast filters; second to enable multicasts and supply at least one valid multicast filter.

26.7.18 sha256 in Itanium rotating registers Pushed my 'C' sha256 source code through my Itanium compiler, as an exercise, initially taking 19000 cycles on a Montvale CPU. Recoded the central hashing algorithm in assembly language, as an exercise in using the Itanium rotating registers for the first time.

Rotating register addressing is a little counter-intuitive. After allocating 64 rotating registers (r32->r95), then using a br.ctop loop to load a table from memory into GR32, you will get the first table entry in r32, the second in r95, the third in r94 etc (with rrb.gr back to zero after 64 iterations). Once you get your head round this, fixed register names can be used for the offsets used to pull values into the central sha256 algorithm. After a day of concentrated work, hashing a 256-byte block takes 1450 cycles for the fips-180 "abc" message. Recoding the block memcpy used to bring in the hash data as a quadword move brings the cycle count to 1000 - around 4 cycles per byte hashed.
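The register ordering described above can be checked with a toy model. This is only a sketch: it assumes the 64 rotating registers r32-r95 from the text, models br.ctop as nothing more than a decrement of rrb.gr, and all the names in it are made up for illustration.

```python
# Toy model of Itanium general-register rotation.
# Assumption: 64 rotating registers r32..r95, as in the text above.
ROT_BASE = 32
ROT_SIZE = 64

def physical(logical, rrb):
    """Map a logical register number to a physical one for a given rrb.gr."""
    if logical < ROT_BASE or logical >= ROT_BASE + ROT_SIZE:
        return logical  # static registers are never renamed
    return ROT_BASE + (logical - ROT_BASE + rrb) % ROT_SIZE

def load_table(table):
    """Write each entry to logical r32, then rotate as br.ctop would."""
    regs = {}
    rrb = 0
    for value in table:
        regs[physical(32, rrb)] = value
        rrb = (rrb - 1) % ROT_SIZE  # br.ctop decrements rrb.gr
    return regs, rrb

# Load a 64-entry table; entry n is simply the number n.
regs, rrb = load_table(list(range(64)))
```

After the 64th iteration rrb is back at zero, so logical and physical names coincide again: regs[32] holds entry 0, regs[95] entry 1, regs[94] entry 2, and the last entry lands in r33.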

One striking feature of the Itanium is its predictability. First invocation of this routine runs about 35% slower, as the caches load up. Second and subsequent invocations show ar.itc time intervals which are the same to +/- 2 clock cycles! Messing around with branch hints dptk, sptk etc makes no difference to the clock count.

To take this thread a little further I have added register index notation to my assembler, r[{expression}] to improve readability of code (like the Intel Itanium assembler). There is definitely more speed to be had because my current sha256 algorithm has to swizzle 32-bit big-endian integers, and uses basic 64-bit operations (ignoring the upper 32 bits of each register) rather than the 2x32bit parallel operations. I have just added syntax for explicit bundle templates to my assembler, will be interesting to see the effects described in the textbooks.

24.7.18 Itanium Montvale vs Madison Update on execution timing for my 64-bit binary to hex ascii benchmark (April 2017). Back then I was chuffed to see this assembly routine completing in 57 clock cycles. On the rx2660, Montvale 9040 processor the same code completes in 19 cycles. Both timings include latency for reading ar.itc, not bothered trying to work that out. Fast or what?

12.7.18 ia64 spinlock Added atomic spinlock to serialize access to console output, console getkey and debug_donkey. Interesting to watch control alternating between two CPUs as they contend for debug_donkey whilst single-stepping off a hard breakpoint. Added CPU id to debug output throughout the disassembler to show which CPU has the spinlock. (Much better than the Machine Checks which result when no spinlocks are used!).

This all happens because of the way in which Itanium PAL sets up the APs, all running in spin loops with interrupts off, whilst snooping their respective IPI interrupt 0xff status bits; an example of how civilised the Itanium programming environment is. Of course this translates to x64, where ChaOS could set up the APs in similar fashion ready to launch a user function on receiving an IPI.

This is the first time I have used a spinlock throughout ChaOS, never thought I would need one. Previously I have used APs on x64 as occasional processors, and avoided them primarily because of the extra heat generated which causes the fan to roar on my development laptop.

10.7.18 ia64 AP debugger working AP stack pointer on startup is zero. By setting r12 to the top of a malloc'd memory block, APs will run EFI boot services, provided the BSP is held in a spinloop. Similarly my debugger works on the APs, so I can now breakpoint and single step one AP while holding the BSP on a completion flag.

9.7.18 ia64 APs now working After a few dozen Machine Check crashes whilst fumbling for the correct IPI addresses of the Itanium Application Processors (APs), managed to bring APs on rx2660 out of SAL spin loop to execute a short procedure within chaos.efi. By passing BSP r1 value to SAL_SET_VECTORS, APs are able to store values into the ChaOS memory image. APs call flushcacheline() and this makes stored values visible to the BSP.

APs seem to run my simple register stack function no problem, though I have yet to check whether SAL has set up an RSE backing store and memory stack for each CPU.

The 7 APs on my rx2660 respond to ExInt 0xff IPIs written to uncached addresses 0xfee01000, 0xfee02000, 0xfee03000, 0xfee04000, 0xfee05000, 0xfee06000 and 0xfee07000. Reading cr.lid for each returns 0x1000000,0x2000000, 0x3000000, 0x4000000, 0x5000000, 0x6000000 and 0x7000000 respectively, all of which makes perfect sense now I know the IPI mappings.
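The relationship between the cr.lid values and the IPI addresses listed above can be written down in a few lines. A minimal sketch, assuming the id field sits in cr.lid bits 31:24 and the eid field in bits 23:16, and that the interrupt block base is 0xfee00000 as the addresses above suggest; the function and constant names are illustrative.

```python
IPI_BASE = 0xfee00000  # assumption: interrupt block base implied by the addresses above

def ipi_address(lid):
    """Derive the uncached IPI address for a CPU from its cr.lid value."""
    cpu_id = (lid >> 24) & 0xff   # id field, assumed bits 31:24
    cpu_eid = (lid >> 16) & 0xff  # eid field, assumed bits 23:16
    return IPI_BASE | (cpu_id << 12) | (cpu_eid << 4)

# cr.lid values reported for the seven APs on the rx2660
lids = [n << 24 for n in range(1, 8)]
addresses = [ipi_address(lid) for lid in lids]
```

With eid equal to zero on every AP, only the id field contributes, which is why the addresses step by 0x1000.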

Update: firing IPIs to all 7 APs, each storing their cr.lid in adjacent memory quadwords, and with no explicit cache flush instructions, results in the above (expected) values even though the APs execute concurrently. (I had expected some cache-coherency issues to be demonstrated).

30.6.18 ia64 heap, salproc Pushing new heap allocations to stretch the new memory regime, EFI saved interruption vectors are now on heap. Exploring SAL procedure calls to probe CPU BSP and AP topography. Looking to intercept the EXINT vector to get a handle on external interrupts including the interprocessor interrupts used to kick the APs out of BOOT_RENDEZVOUS.

Interrupt intercept is possible because EFI EXINT is effectively just one bundle containing a br (jump) instruction. This bundle needs to be disassembled to locate the jump target, with brl instructions for the hop to the intercept code, and for the hop back to the original handler jump target. Hopefully I filter out some hardware interrupts and divert them to ChaOS code. Timer interrupts are my favourite initial hooks.

23.6.18 ia64 malloc,free Added memory heaps to ia64/chaos, to get away from EFI allocpages for dynamic memory allocations. Heap code is dual-mode, also compiling for x64. Rather fun to have a 12Gb flat memory space to play with, though return to EFI on program exit is slow because Itanium EFI insists on filling freed blocks with 0xfb bytes (I suppose for security purposes). Takes about 1 second for each Gb.

29.5.18 HP Integrity rx2660 Processor upgrade/downgrade: fitted 2 x AB577-2100B processors. All working, showing 8 CPUs. PAL_BRAND_INFO shows these to be the 1.6GHz dual-core Itanium 2 9040 Montecito chip, which was quickly superseded by Montvale. Spectacular depreciation to $25 each this week on eBay.

25.5.18 HP Integrity rx2660 testbed arrives A year has passed so quickly since I iced my Itanium project. I got bogged down with the IA-32 mode on the Itanium 1. Yesterday added a rx2660 server to my Itanium collection, one dual-core 1.42GHz Itanium 2 9120N with 16Gb RAM. There is no IA-32 mode on this processor, but there are two cores, four threads, 1.7 billion transistors to play with...

BMC is almost identical to the rx2620 for the Telnet and Serial port consoles, so it did not take long to try running chaos.efi. All is well until a processor exception is triggered (e.g. debug break instruction, VM access fault etc), where the processor clearly takes the firmware vector rather than the ChaOS code. This is a cache issue, cleared by installing the ChaOS vectors, and running more code before faulting the processor - the Itanium 2 processor caches are MUCH bigger than my Madisons.

So it is gratifying to see my amateur debugger breakpointing and stepping the Itanium 2 after such a short time.

The rx2660 case seems to be quite rare, so I had to pay a high price to get one. But Itanium 2 processors are more plentiful. I found a pair of 9140Ms (dual-core 1.67GHz) for nineteen quid each. Will they run in my box? Watch this space...

Here is a quick HowTo for new owners to get up and running:

Like any enterprise-level server, you get 2 computers for the price of one; one is the server itself, two is the Management Processor, which allows you to monitor and control the server through a serial port or network connection.

Each vendor has its own flavour of Management Processor and user interface, for the rx2660 HP-Speak gives us the BMC (Baseboard Management Controller), with iLO-2 for the user interface (Integrated-Lights-Out-2...). This provides a Web interface into the BMC, with SSL secure access. At the Ground-Zero level, the BMC can be used to power up or power down the Itanium mainboard remotely, rather essential because these machines are noisy!

So on the back of the server (extreme left) there is a serial port and RJ45 network connection into the BMC. Beside the RJ45 there is a hole to access the BMC reset button. The BMC is active a few seconds after power is applied to the unit. Holding the reset button in for 4+ seconds clears previous username/password combinations from the BMC, which might otherwise prevent a new owner from gaining control. Once this is done, the unit defaults to 9600,8,1 on the serial port and network access defaults to DHCP. Therefore the BMC can be accessed via the first serial port and a terminal program, or something like telnet (look at your router DHCP diagnostic or ARP cache to see where the BMC is on your network). Login to the BMC is simply Admin/Admin, with plenty of warnings about setting up something more secure. The HP Integrity User manuals document the BMC very well from here on.

Modern browsers tend to be locked to TLS 1.3 these days, so the firmware security in these old servers will throw errors when attempting a secure connection via the Web interface. I downshift Firefox to an older TLS version to make things work (open the about:config tab; ignore the warranty warning; set security.tls.version.max = 1, which in Firefox's numbering selects TLS 1.0, whereas 2 selects TLS 1.1). TLS needs to be shifted back up to 1.3 (value 4) when you are done with the iLO-2.

8.10.17 {xhc} Taking a break from IA64 to develop a driver for the xHC (USB3) controller on my Dell Inspiron laptop.

7.10.17 {vc}/{cc8} convergence getop() code in {cc8} altered to produce 64-bit immediate constants by default, rather than requiring the (earlier) 0x prefix and q suffix.

4.10.17 EFI memory reclaim Starting now to build system heaps, using MEMDESC info supplied by EFI, initially chains of EFI free memory blocks with a small node record at the start of each. It is worth mentioning that memory blocks used in this way need to be allocated as EFI LoaderData (or any other memory type) - otherwise EFI may perform its own allocations/deallocations on the block which result in node records being overwritten.

To get things off the ground I have created three distinct heaps, dosheap (<1Mb), lowheap (<4Gb) and highheap (>=4Gb). Memory allocations can be directed at a particular heap using the named functions dosmalloc(), dosfree(), lowmalloc(), lowfree(), highmalloc() and highfree(), with malloc() and free() using highheap if available, otherwise lowheap.
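The heap split above comes down to two address thresholds plus a fallback rule for plain malloc(). A minimal sketch of that routing follows; heap_for_address and default_heap are hypothetical helper names, not ChaOS functions, and a block that straddles a boundary is not considered.

```python
ONE_MB = 1 << 20
FOUR_GB = 1 << 32

def heap_for_address(addr):
    """Classify the start address of a free EFI memory block into one of the three heaps."""
    if addr < ONE_MB:
        return "dosheap"
    if addr < FOUR_GB:
        return "lowheap"
    return "highheap"

def default_heap(available):
    """Pick the heap plain malloc() would use: highheap if present, else lowheap."""
    return "highheap" if "highheap" in available else "lowheap"
```

For example, a block at 0x100000000 goes to highheap, and on a machine with no memory above 4Gb default_heap({"dosheap", "lowheap"}) falls back to lowheap.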

30.9.17 EFI memmap x64 EFI MEMDESC dump ported to IA64 hits a {vc} compiler error when attempting, for the first time, to return a pointer from a function with the call64 (or any) function modifier, e.g. CH$ call64 func(VD); Syntax for function modifiers in ANSI C and C++ is unclear to me, so I have as usual made up my own syntax, using (efi * func) to declare EFI function pointers in the BOOTSERV and RUNTSERV structures. It is also worth mentioning the three pointer operators which have now crept into my syntax - '*' is a regular pointer, 32-bit in x64/chaos, 64-bit in ia64/chaos. '$' is an explicit 64-bit pointer, and '#' is an EFI natural size pointer.

Anyway, checking for a function modifier before the pointer operator causes the {vc} compiler to fault on CH$ call64 func(VD); whilst CH call64 $func(VD); works fine - just not as obvious that the function returns CH$. So I have tweaked {vc} to check for a function modifier before and after pointer operators, so CH$ call64 func(VD); and CH call64 $func(VD); are now equivalent.

With this tweak in place, my x64 EFI MEMDESC dump source code ports unchanged to the IA64 project.

23.9.17 {vc}/{cc8} convergence Adding the asm {keyword} syntax to {vc} and {cc8} is not really enough to structure complete compile-time units which will pass through both of these compilers. It is usual to use C preprocessor directives to provide alternate pathways through a source file, so it makes sense to use my arch keyword to qualify preprocessor directives where possible, i.e. instead of

#ifdef IA64
#include {ia64stuff.htm}
#endif
one could use
#include ia64 {ia64stuff.htm}
Likewise, #if ia64 is a no-brainer.

With this tweak added to {vc} and {cc8} (just to #include a slightly different uefi.html for the x64 machine), my EFI BlkIO device sector browser compiles and works identically on x64 and IA64. Conversely the ability to compile and test code within an x64 UEFI environment, then move this easily into the IA64 project is going to be mega-useful.

17.9.17 strtod() Spent a few hours reacquainting myself with the 32-bit versions of strtod which I have, in preparation for writing a 64-bit IA64/x64 library version. Never noticed before the slight downward rounding produced by x32/ChaOS.strtod() - this because I set FPU ROUNDCHOP mode for easier double->integer conversions. The mathematical differences are insignificant, just one or two bits of mantissa, but I am thinking maybe I should use FPU ROUNDNEAREST mode for strtod, especially when called by my compilers.

Most people misunderstand the fact that floating-point numbers are in the main inexact; only a subset of real numbers drop exactly into floating point encodings. Exact fractions are found where the number is a negative power of 2 multiplied by an integer, e.g. 7*1/2=3.5, 53*1/16=3.3125 etc.
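The exact/inexact distinction is easy to test on any machine with IEEE doubles. A minimal sketch in Python (not ChaOS code), comparing the decimal string as an exact rational against the double it converts to:

```python
from fractions import Fraction

def is_exact(text):
    """True if the decimal string converts to a double with no rounding error."""
    return Fraction(float(text)) == Fraction(text)

exact_examples = [is_exact("3.5"), is_exact("3.3125")]
inexact_examples = [is_exact("0.1"), is_exact("1.2345")]
```

The two examples from the text come out exact, while 0.1 and 1.2345 do not, because their denominators are not powers of 2.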

Even with ROUNDNEAREST, for inexact fractions strtod produces a DB value slightly below the input string half of the time. Thus the mantissa needs to be processed beyond a given number of fractional digits to reproduce an input string such as 1.2345, after rounding, when the output of strtod is passed to dtoa. The problem arises when a decimal fraction ends with a 5 and strtod (very often) produces 4999999999. For speed I prefer to process only one digit beyond the requested number of fractional digits for rounding purposes - no good when dtoaing 3 fractional digits, whilst strtod has produced 1.2344999999999etc for input of 1.2345 - we get a display value of 1.234 instead of 1.235. Given that all of my inputs to strtod are less than 10 fractional digits, a useful kludge is to add 1 to the mantissa, when the output seems to be non-recurring (e.g. NOT something like 0.1 = 0x3fb999999999999a). This produces 1.2345000000000something for input of 1.2345, surely a nicer number.
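The 1.2345 case and the add-one kludge can be reproduced with standard IEEE doubles. This sketch uses Python's own string conversion and formatting rather than the ChaOS strtod and dtoa, so it shows only the underlying arithmetic; bits and bump are hypothetical helper names.

```python
import struct
from fractions import Fraction

def bits(x):
    """Return the raw 64-bit pattern of a double."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def bump(x):
    """Add 1 to the 64-bit pattern, i.e. one unit in the last place for positive finite x."""
    return struct.unpack(">d", struct.pack(">Q", bits(x) + 1))[0]

x = float("1.2345")
below = Fraction(x) < Fraction("1.2345")   # is the nearest double under the input?
shown = "%.3f" % x                          # three fractional digits, no kludge
shown_bumped = "%.3f" % bump(x)             # three fractional digits after the kludge
tenth = hex(bits(0.1))                      # the recurring pattern quoted above
```

Here the nearest double to 1.2345 does sit below the decimal value, so three-digit output gives 1.234, and after bumping the mantissa by one it gives 1.235.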

13.9.17 EFI BlkIo sector browser Added read/display sector option when probing the EFI BlkIO handle list, with keyboard loop, '+','-' and (g)oto keystroke roam around. Very handy.

13.9.17 {vc} arch id for inline asm{} blocks Added check for arch id string after asm keyword, with ignore flag now switched on if the asm block does not match the CPU architecture for the current compilation run. In other words, if {vc} is compiling for ia64 it will skip asm x64{} blocks, and when compiling for x64 it will skip asm ia64{} blocks. This allows CPU-specific tweaks to be grouped together in C source code as fair dinkum assembly language instructions, instead of inside some faraway macro. {vc} now also accepts my '$' 64-bit pointer operator (used in my cc8 Intel64 compiler) as a synonym for the regular '*' operator. {cc8} uses 'call64' and 'call32' function modifiers to switch output between the IA32e and Intel64 mode code, so by changing the default IA64 FTEMPLATE keyname from "def" to "call64", {vc} /a=ia64 can potentially compile {cc8} sources without modification.
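The skip rule for arch-tagged asm blocks can be illustrated outside the compiler with a small text filter. This is only a sketch of the idea, not how {vc} is implemented: it assumes asm blocks contain no nested braces, and the instructions in the sample source are placeholders.

```python
import re

# Matches "asm <arch> { ... }" with no nested braces (a simplifying assumption).
ASM_BLOCK = re.compile(r"asm\s+(\w+)\s*\{[^{}]*\}")

def select_asm(source, arch):
    """Keep asm blocks tagged with the target arch and drop blocks tagged otherwise."""
    def keep(match):
        return match.group(0) if match.group(1) == arch else ""
    return ASM_BLOCK.sub(keep, source)

# Placeholder source with one block per architecture.
source = "asm ia64 { mov r8 = r32 } asm x64 { mov rax, rcx }"
for_ia64 = select_asm(source, "ia64")
for_x64 = select_asm(source, "x64")
```

Compiling for ia64 leaves only the first block in the text, compiling for x64 only the second.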

To allow {cc8} to compile {vc} sources, I have added a similar asm id{} check, and added code to generate and initialize argc, argsz and argv for ellipsis functions. As a quick test, IA64 wdispf(CH$ format,...) and its companion functions witoa, wdtoa, wstrrev and strtol now compile unchanged on either {cc8}->x64 or {vc}->IA64.

12.9.17 HP Integrity rx2600/rx2620 EFI_DEVICE_PATHs Probing EFI Device Paths for handles with BlkIo protocol, hit the dreaded ACPI Hardware Path type, which points to the ACPI tables to resolve HWP0002 and HWP0003. Anyone who has tried to write an ACPI table decode will know this is a job for another lifetime, but fortunately, there is enough information in EFI_DEVICE_PATH to deduce the encoding. EFI ACPI Hardware Path _HIDs are a 32-bit number I do not understand, but are kindly displayed by the EFI Shell as HWP0002 or HWP0003 and clearly indicate PCI Bus and AGP Bus respectively. EFI ACPI Hardware Path _UIDs will point to some Method or other in the ACPI table, but are easily recognized to be (PCI Bus Number<<3).
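The 32-bit _HID number looks like the compressed EISA ID layout used by EFI ACPI device path nodes: three letters packed at 5 bits each into the low 16 bits, with the product number in the high 16 bits. A sketch of the decode follows, assuming that layout applies here; the raw values on the rx2600 were not checked, and the function names are illustrative.

```python
def decode_eisa_id(hid):
    """Turn a 32-bit compressed EISA ID into text such as 'PNP0A03'."""
    name = hid & 0xffff      # three 5-bit letters, 'A' encoded as 1
    number = hid >> 16       # 16-bit product number
    letters = "".join(
        chr(((name >> shift) & 0x1f) + ord("A") - 1) for shift in (10, 5, 0)
    )
    return "%s%04X" % (letters, number)

def encode_eisa_id(text):
    """Inverse of decode_eisa_id, e.g. 'HWP0002' to a 32-bit number."""
    name = 0
    for ch, shift in zip(text[:3], (10, 5, 0)):
        name |= (ord(ch) - ord("A") + 1) << shift
    return (int(text[3:], 16) << 16) | name

pci_root = decode_eisa_id(0x0a0341d0)   # the familiar PCI root bridge ID
hwp2 = encode_eisa_id("HWP0002")
```

Under this layout HWP0002 would appear in the device path as 0x000222f0.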

10.9.17 Itanium zx1 PCI Managed to reprogram the PCI buses (ropes) to direct I/O cycles to the BMC VGA whilst AGP PCI IO is turned off, this proved by writing and reading back VGA registers in the BMC VGA which are different to those in the AGP VGA device. The MMIO window is more problematic, currently faulting the processor when accessed via bus 0xe0.

9.9.17 Itanium zx1 PCI Experimenting with dual graphics configuration, using AGP backplane with Diamond Fire GL4 dual DVI card. EFI boot disables the BMC VGA and sets up the EFI console on the Diamond Fire DVI-0 output. Setting video modes via the Diamond Fire BIOS hangs the machine, I am guessing because the PCI config accesses in the BIOS are directed to I/O ports 0xcf8/0xcfc, which are not present on the zx1 MMIO controller. Support for a BIOS set mode here would require VM86 mode and I/O traps on these ports, redirected to the zx1 PCI config address and config data ports.

Setting up multiple VGA adapters is possible, provided only one VGA device is connected to the zx1 bus at a time. I have done this before to switch alternate PCI adapter BIOS ROM images into play, and invoke their respective BIOS set mode functions - up to the point where graphics modes/linear apertures are established all the adapters can coexist on the PCI bus. This is a useful exercise in understanding the various buses and PCI bridges in a system.

IA64 SAL provides PCI config read and write functions, but on the rx2600 at least the config registers are directly accessible in the MMIO block at 0xfed00000. Following the Rope Configuration Base register at 0xfed003a8, the register blocks for bus 0x00,0x10,0x20... are at 0xfed20000,0xfed21000,0xfed22000... and are identifiable by a 0x103c:0x122e signature if present in the system. To post PCI config cycles on to a bus, simply write the seg:bus:dev:fn:reg to block+0x40, and read/write config data from/to block+0x48.
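The address arithmetic for the rope register blocks is simple enough to sketch. It assumes the pattern quoted above (one 0x1000 block per 0x10 bus numbers, starting at 0xfed20000) continues for the remaining ropes, which was only stated for the first three; the helper names are hypothetical.

```python
ROPE_BLOCK_BASE = 0xfed20000   # block for bus 0x00, from the text
ROPE_BLOCK_SIZE = 0x1000       # spacing between successive rope blocks
CONFIG_ADDR_OFFSET = 0x40      # write seg:bus:dev:fn:reg here
CONFIG_DATA_OFFSET = 0x48      # read/write config data here

def rope_block(bus):
    """MMIO address of the register block for the rope that owns this bus number."""
    return ROPE_BLOCK_BASE + (bus >> 4) * ROPE_BLOCK_SIZE

def config_ports(bus):
    """Return the (config address, config data) MMIO addresses for a bus."""
    block = rope_block(bus)
    return block + CONFIG_ADDR_OFFSET, block + CONFIG_DATA_OFFSET

addr_port, data_port = config_ports(0x10)
```

A real access would still need to check for the 0x103c:0x122e signature before trusting that a block is present.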

The BMC console VGA appears on the rx2600 at bus:dev:fn(0xe0,2,0), but disappears from the EFI PCI listing when an AGP graphics card is installed. However the BMC VGA is still present on the PCI bus, with IO and MEM access disabled. So it should be possible to deactivate the AGP card, configure the BMC VGA, display something on the screen, then decouple it from the PCI bus and switch back to the AGP card. There are two or three other registers on the zx1 to direct VGA cycles down the appropriate bus, some trial and error needed no doubt to get this to work.

5.9.17 Itanium SVGA graphics Dug out my old VGA reference manual (Wilton 1987) as used to develop Lotti all those years ago - so much easier to understand second time around. EGA/VGA was a quantum leap in graphics in its day, but confused me greatly at that time. Taking a fresh look I now see the method in the madness, and how it provided a neat way to read or write 32 graphics bits in one cycle of a 16-bit CPU.

Took only an hour or so to knock up writepixel, hline, vline, and gchar functions, along with cgasavescreen, cgarestorescreen, to be able to save and restore the EFI console. So I can now begin to develop base functions for a GUI on the Itanium, so long as I avoid debug breakpoints whilst in graphics mode. In time debug_donkey will substitute EFI console output in favour of graphical output when a graphics mode is active.

3.9.17 Itanium Video BIOS call Running through VESA BIOS calls, modes with resolution higher than 640x480 (apart from mode 0x6a, 4-plane 800x600 SVGA mode) are actually flagged unsupported in hardware, so the bog-standard Itanium VGA output is rather limited. Also the EFI/BMC console locks up when one of these limited graphics modes is selected. Ah well.

27.8.17 Itanium Video BIOS call Fleshed out the .x86r assembler enough to setup parameters and make entries into the Int 0x10 Video BIOS code. This is far easier than poking opcode bytes into Itanium low memory locations. With a whole load more handlers in place (ITLB miss, DTLB miss, alternate ITLB/DTLB etc) and page-not-present DTLBs for video memory at 0xb0000 and 0xa0000, the rogue instructions which take the system down (on Int 0x10/set mode 3) can be narrowed down to OUTSW page B8000 (clear screen memory), MOVSB page c0000 -> page a0000 (program character set) and STOSB page a0000 (character set also).

Emulating these three cases is tricky, especially because my debug donkey uses the Itanium Register Stack Engine to preserve r1-r15. I envisaged that writes to the backing store would be needed to update registers for the IA32 machine, e.g. advancing EDI and decrementing ECX on a STOSW emulation. Happily this approach works (I told you the Itanium is a simple beast). The slightest coding error here hangs the machine, but after a good deal of trial and error, I finally have the Video Mode Set running through to the IRET intercept. Curiously this mode set produces a greyscale 80x25 CGA text screen, but I have done plenty enough VGA programming to recognize this is a DAC palette issue. Pushing the usual 64 entries in the VGA DAC flicks the display back into its proper 16-colour CGA glory.
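The register bookkeeping for a string-store emulation looks roughly like this. A minimal sketch only: it models REP STOSW with the direction flag clear and a flat byte array standing in for video memory, and ignores segmentation, address-size wrap and the backing-store writes that the real handler needs.

```python
def emulate_rep_stosw(regs, memory):
    """Store AX at EDI, ECX times, advancing EDI by 2 and decrementing ECX each time."""
    ax = regs["eax"] & 0xffff
    while regs["ecx"] != 0:
        memory[regs["edi"]] = ax & 0xff        # low byte first (little-endian)
        memory[regs["edi"] + 1] = ax >> 8
        regs["edi"] += 2
        regs["ecx"] -= 1
    return regs

# Write four space characters with attribute 0x07 into a stand-in text buffer.
regs = {"eax": 0x0720, "ecx": 4, "edi": 0}
memory = bytearray(16)
emulate_rep_stosw(regs, memory)
```

On exit ECX is zero and EDI points just past the last word written, which are the values the emulator has to hand back to the IA-32 machine before skipping the opcode.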

Noticing that my emulations ran without ld.acq/st.rel semantics, I tried replacing the not-present VGA DTLBs with present DTLBs identity mapped into physical uncached memory, i.e virtual 0xb0000 -> 0x80000000000b0000, then disabled the emulations. With this mapping the BIOS calls run OK, so maybe the simplest of solutions was under my nose! However I now also have a template for instruction emulation with client register tweaking which is mega-useful. Similarly, trapping the IA-32 intercept, matching the IA-32 IRET opcode and poking new values into cr.ipsr and cr.iip produces a smooth return to IA64 mode after a BIOS callout.

As I had hoped, graphics modes can be set on the Itanium VGA hardware without breaking the remote console. So glad now that I picked up an AGP backplane and original Itanium twinhead Radeon card last year, looks like I will be able to make it work. Is it really only 16 weeks since I started my IA64 compiler?

25.8.17 ia32 disassembler Ported about half of my ChaOS x64 disassembler to the IA64 project to produce the first mnemonic ia32 disassembly inside the IA64 debug donkey. Will improve this in parallel with development of the new {vc} assemblers.

22.8.17 ia32 modes for {vc} Restructured {vc} compiler as planned to allow multiple modes for each /arch definition. Quite simply each processor mode can invoke a different assembler in the compiler back-end. For the moment, processor mode is switched by assembler .pmode directives, but eventually will be switched by the C compiler as needed, by #preprocessor directives and maybe a special function modifier. Initial pmodes for IA64 are ia64,x86r and ia32, providing support for a 16-bit and 32-bit IA-32 code segment, in addition to native Itanium mode. To aid with checking and testing, I have added a second processor set for amd/intel64, also with three processor modes x64,x86r and ia32. For 16-bit and 32-bit code, /a=ia64 and /a=x64 now use exactly the same assemblers.

20.8.17 ia64 ia-32 BIOS call Experimenting with ia-32 mode to fathom how to perform a video BIOS call into the VGA card. Tried all sorts of variations, real-mode, protected-mode and particularly vm86 mode, thinking this would be the preferred route. Because IRET is an intercepted instruction (i.e. has to be emulated) I chose PUSH (int 0x10 offset), PUSH (int 0x10 seg), RETF as a route into the Video BIOS entry point, but achieved only a wall of vm86 GP and Stack faults. In 16-bit real mode, the faults are fatal Machine Checks. Maybe I should have listened to my own advice and done a br.ia straight into the Video BIOS, however...

...noticing that the same instructions sometimes run fine, other times cause these Machine-Checks, I gradually focussed in on the RET and RETF instructions, and proved that the Machine Check occurs when these opcode bytes are exactly 10 bytes ahead of the current single-step fault i.e. as they enter a 10-byte instruction pipeline. Furthermore executing a CALL instruction before RET or RETF enters the pipeline prevents the Machine Check fault. Whether there is an error in the Itanium ia32 machine here, or an error in my setups is a question for another day. Clearly the ia-32 machine is decoding, speculating, branch-predicting ahead of the current ia32 instruction, another fascinating insight into the Itanium processor.

In ia-32 real mode, provided KR0 is set to IOBASE, the IN instruction produces recognizable i/o port values from the VGA hardware. Using the direct br.ia entry method, selected Video BIOS calls run to completion, i.e to the IRET intercept fault. BIOS functions which touch video memory (such as set VESA mode) inevitably fail, because these addresses are not yet mapped into uncached memory pages. Interestingly Int 0x10/AX=0x8003 (set 25x80 CGA mode, retain screen contents) runs to completion, because this is the mode set already for the EFI console.

So there is still work to do to get Video BIOS calls working fully. Added a Page-Not-Present handler and mapped the VGA aperture to a not-present data TLB. This faults any instructions reading or writing VGA memory addresses. Added a handler to emulate the REP STOSW instruction which is used by the int 0x10/9 write char(s) at cursor BIOS call. For this function at least the characters appear on screen. The complete solution could involve emulation of many ia-32 memory read/write instructions. But I have proved it can be done. Emulations can be added as each faulting instruction is discovered, until the BIOS mode set call runs through to the IRET intercept. The goal is to be able to use the BIOS to set VESA graphics modes. With this done, custom routines can read/write to the screen directly using virtual or physical uncached address mappings, using ld.acq and st.rel semantics.

13.8.17 ia64 ia-32 mode debug Added ia-32 intercept handler working towards an attempt at executing a Video BIOS call; seeing this fault in action I now realise the swathe of instructions which have to be emulated to construct a working ia-32 machine. This explains why the ia-32 mode was eventually dropped in favour of pure software emulation - the PSR.is bit only runs the easy ia-32 instructions. ia-32 instructions such as INT and IRET simply fault the Itanium and provide a pointer to help in skipping the offending opcode bytes after an emulator completes its work.

No matter, a call into the Video BIOS should be possible by a br.ia straight into the Video BIOS, with the IRET intercept fault signalling the end of the BIOS call? Who knows.

10.8.17 ia64 ia-32 mode debug Managed to perform first Itanium ia-32 mode switch via br.ia instruction, discovering along the way that my virtual memory mode switch was not quite what I thought. Having used ssm to flick the PSR translation switches dt, it and rt, I missed the fact that ssm can only set bits 0:23, so I have been running a partial vm mode, with only data address translation switched on. This has been sufficient to work out and prove setups for region registers, translation cache entries and virtual addressing for VGA screen memory. Added a keystroke to the debug donkey to flick all these bits on, happily my itc setups must be OK because everything keeps running. Attempting the ia-32 mode switch without dt and it is undefined, but actually results in an instant fatal machine check.
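Why only data translation got switched on follows from the bit positions. A small sketch, assuming the usual PSR layout of dt at bit 17, rt at bit 27 and it at bit 36, and modelling ssm as an OR limited to the low 24 bits (a real assembler would refuse the wider mask rather than silently trim it):

```python
PSR_DT = 17   # data address translation (assumed bit position)
PSR_RT = 27   # register stack translation (assumed bit position)
PSR_IT = 36   # instruction address translation (assumed bit position)

SSM_REACH = (1 << 24) - 1   # ssm can only touch PSR bits 0:23

def ssm(psr, mask):
    """Model of ssm: set the requested bits, but only those within bits 0:23."""
    return psr | (mask & SSM_REACH)

wanted = (1 << PSR_DT) | (1 << PSR_RT) | (1 << PSR_IT)
psr = ssm(0, wanted)
```

Only the dt bit survives, so rt and it have to be set by another route, such as loading ipsr and executing rfi.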

To keep the debug donkey from falling over (remember it calls out to the EFI console, which doesn't do virtual mode) I switch psr.dt off for the display and keyboard loop, then rfi restores psr from ipsr on exit. In this way I can effectively single step the Itanium in virtual address mode. You see deep down, the Itanium is deceptively simple. I now have a pretty solid base on which to research and develop the vm handlers. As ever the equation is installhandler + (improve debug donkey).

Switching on virtual addressing is easy from within an interrupt handler, but tricky otherwise (need to set up all the interrupt control registers, then execute rfi). Taking an idea from IA64 Linux, a quick and easy workaround is to sample iim inside the break handler (like a syscall). Added a bit of code to debug donkey to do this, now break 0x12345 can be used to properly start virtual address mode. The EFI shell runs happily with psr.rt and psr.it set (and identity-mapped translations of course), so there is no need to switch these bits off just yet.

With another couple of handlers installed on IA32 exception and IA32 intercept, my machine will single-step into ia-32 mode, to a low memory address where I have placed an x86 NOP and JMPE instruction. Single-step continues back into IA64 mode, eventually falling over due to a loss of RSE context - single-step in ia-32 mode seems to destroy the contents of IFS, not quite sure why. I may need a separate entry path for the ia-32 handlers into the debug donkey. Nevertheless this problem is overcome by using a MSTACK function template for the wrapper function, i.e. ar.pfs is saved on the memory stack rather than the register stack during the ia-32 callout. The mov ar.pfs=reg instruction is now the only one which cannot be single-stepped, but my ia-32 callout now works. Roll on x86 video BIOS calls.

Yes I know ia-32 hardware support disappears from the Itanium after the Madisons, but ChaOS uses only a subset of the x64 instruction set. Writing an ia-32 software emulator is cumbersome but not difficult, with 128 64-bit registers compared to the 8 32-bit registers on ia-32. Indeed this is the approach adopted by HP engineers for PA-RISC as well as IA-32. Has anyone written an AMD64 emulator for IA64 yet?

2.8.17 ia64 debug Whilst far from finished, disassembler now covers all common A,I,M,B,F and X-unit instructions, so is becoming really useful. Added (a)uto single step, to stress-test the debugger by running many thousands of single-step cycles. Runs fine through ChaOS code, but runs into processor faults while single-stepping EFI display functions. Tried saving a few more registers, i.e. bank 1 r16-r31 which generally hold invalid/misaligned addresses at the faulting instructions (though the ChaOS compiler does not touch these registers). This stops the faults but auto-single-step then gets into an endless loop. Of course the debug handlers are invoking EFI for console output, so the single-step might be looping on a display list within the firmware.

1.8.17 ia64 debug fault Added debug fault handler, and switched on IPSR.db bit to enable instruction breakpoints. Added (e)xit key stroke to debug donkey which sets a breakpoint on b0 as saved on entry to the debug handlers, and (n)ext key which sets a breakpoint on iip+0x10. So now I can step over a function call, or single-step into it and skip out at any point. Instruction breakpoint faults occur on each of the three instructions in a bundle, if the ibr[ ] registers are not cleared. Single-step through the bundle can be achieved by setting IPSR.id for each step. Frankly it is easier just to clear the ibr[ ]s after the breakpoint has been hit.

30.7.17 ia64 single-step Another fascinating product of my single-step handler is that processor faults can occur on dependency violations, such as when a stop is missing from the code stream for the desired result. Not yet having tried to understand IA64 speculative instructions, I sort-of expected that at interrupt time in-flight values would all have time to arrive at their destination. I had not considered that these are all carefully designed, finite states in a parallel execution engine. As such the single-step interrupt provides a proper window into this area, to the point of raising the same faults on single-step as would occur when the parallel execution units are free-running.

30.7.17 ia64 debug donkey/Floating point registers Introducing a floating point register display inside debug donkey handler produced some interesting results along the way, all adding to my understanding of the fascinating Itanium processor. Initially using stfd and ldfd to save and restore some FP registers to a memory stack buffer, I became intrigued when FP values seemed to disappear during single-stepping. The obvious reason is the registers in question are being overwritten, maybe during a hardware interrupt, so I tried a static memory save area etc, all with the same result - FP values disappear when the setf.sig is single-stepped.

The answer to this problem comes when stf.spill and ldf.fill are used to save the full 82-bit FP register contents. setf.sig simply copies a 64-bit integer into the significand of the FP register, and sets the biased exponent to 0x1003E. This creates a valid register-format floating point number, but the top bit of the significand is zero. Such a value has no 64-bit FP counterpart (64-bit doubles have an implied '1' as the top bit of the significand). It can be normalized by fma fr=fr,f1,f0, but for the single-step between setf.sig and fma, a 64-bit double cannot save the in-register value. This is an insight into the rawness of the Itanium, from whence comes its speed.
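
The normalization step can be modelled in a few lines of C. This is only a sketch under stated assumptions: the register is reduced to a biased exponent plus a 64-bit significand (sign omitted), the exponent is written as the full biased value 0x1003E (0xFFFF + 63, low byte 0x3e), and the type and function names are hypothetical:

```c
#include <stdint.h>

/* Simplified model of an FP register: sign omitted, biased exponent,
   64-bit significand with an explicit integer bit (bit 63). */
typedef struct {
    uint32_t exp;   /* biased by 0xFFFF */
    uint64_t sig;
} fpreg;

/* What setf.sig produces: the integer goes in the significand unchanged. */
static fpreg setf_sig(uint64_t n)
{
    fpreg r = { 0x1003E, n };   /* 0xFFFF + 63 */
    return r;
}

/* What fma fr=fr,f1,f0 achieves for a non-zero value: shift the
   significand up until bit 63 is set, lowering the exponent to match. */
static fpreg normalize(fpreg r)
{
    if (r.sig == 0) { r.exp = 0; return r; }
    while (!(r.sig >> 63)) { r.sig <<= 1; r.exp--; }
    return r;
}
```

For the integer 5 the raw value has bit 63 clear, so a double, which assumes that bit is set, cannot represent it until normalize() has run.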

With 82-bit FP save/restore wrapped around the debug handler, and a rough-and-ready dtoa function, I can now single-step mixed integer/floating point calculations such as those generated by {vc} for multiply, divide and modulo operators. How quickly those Newton-Raphsonian iterations converge for small integers! (Don't try baremetal IA64 programming unless you understand Newton-Raphsonian iterations, or are prepared to learn what they are). The FP fault and GE fault handlers have saved hours of reboot time this week by trapping my coding errors along the way.

26.7.17 ia64 debug handlers Added predicate register save/restore to debug handler wrapper, to properly single-step compare instructions. Added ifs decode to discover valid loc registers range, and a function to use this knowledge to locate and display r32->rxx from the RSE backing store. Added General Exception, Unaligned Reference and Floating Point Fault handlers, all using the common debug donkey function. Faulting instructions (like break) can be skipped by pressing j.

These handlers are a great addition to my ia64 debug toolkit, saving the reset/reboot cycle I have had to bear every time I make a mistake. Deliberate processor faults are equally satisfying.

25.7.17 ia64 debug handler and RSE backing store Attempted to use cover and flushrs to make singlestep handler saved registers visible by inspecting memory below BSP/BSPSTORE, but caused processor faults. Settled on flushrs only, but this only flushes the outer function state to the backing store - so inserted an extra function which saves register state to loc registers before calling singlestep donkey function. Registers in the backing store cannot be accessed via a struct because the RSE inserts a NatVal (NaT collection) when BSPSTORE&0x1f8 equals 0x1f8. Added a set of #defines for saved registers r1->r15 representing the order in which they were saved, and a helper function to skip the NatVal while repeatedly subtracting 8 from BSP until the required register is reached.
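
A helper of the kind described might look like the following. This is a sketch, not the ChaOS code: the names are hypothetical, and it assumes the starting address is a register slot rather than a NaT collection slot:

```c
#include <stdint.h>

/* Every 64th backing-store slot (address bits 8:3 all ones) holds a NaT
   collection rather than a stacked register. */
static int rse_is_nat_slot(uint64_t addr)
{
    return (addr & 0x1f8) == 0x1f8;
}

/* Step back 'n' registers from 'addr', skipping NaT collection slots. */
static uint64_t rse_step_back(uint64_t addr, unsigned n)
{
    while (n--) {
        addr -= 8;
        if (rse_is_nat_slot(addr))
            addr -= 8;
    }
    return addr;
}
```

Stepping back three registers from 0x1010 lands on 0xff0, because the slot at 0xff8 is a NaT collection and is passed over.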

Registers used in the coding of the function calls (r1,r4,b0,b7) are saved in the interrupt handler before the intermediate function is called - so these are retrieved from further down the RSE backing store. Using these tweaks I now have a register display for the singlestep handler, really useful for verifying that disassembler display matches live register operations.

24.7.17 ia64 disassembler Finished X-unit disassembler, including long branch instructions, though I will struggle to build a program large enough to need them. Started I-unit disassembler, completing the opcode 0 set (break,nop,hint,mov from IP/PR, sxt,zxt,czxl/r).

23.7.17 ia64 disassembler Added the first glimpses of a disassembler to the single-step handler. Starting with the X-unit (with the smallest instruction set!), and decoding just the MOVL instruction, else showing just instruction pointer/slot and raw 41-bit unbundled instructions. Added assembler support for NOP.X and BREAK.X, then used the embryo disassembler to spot some latent bugs in X1 and M44 encodings. Parallel development of an assembler and disassembler for any architecture is always a potent combination for weeding out coding errors. Decoding of the MOVL imm64 is a test for any programmer, even with a mainstream compiler.
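
For reference, the MOVL imm64 reassembly can be sketched in C as below. The field positions are taken from the X2 instruction format as I understand it (i at bit 36, imm9d at 35:27, imm5c at 26:22, ic at 21, imm7b at 19:13 of the X slot, imm41 in the L slot); the function names are hypothetical and this is not the ChaOS disassembler:

```c
#include <stdint.h>

/* Reassemble the 64-bit immediate of movl from the two 41-bit slots of an
   MLX bundle. Field positions follow the X2 instruction format. */
static uint64_t movl_imm64(uint64_t slot_l, uint64_t slot_x)
{
    uint64_t i     = (slot_x >> 36) & 0x1;
    uint64_t imm9d = (slot_x >> 27) & 0x1ff;
    uint64_t imm5c = (slot_x >> 22) & 0x1f;
    uint64_t ic    = (slot_x >> 21) & 0x1;
    uint64_t imm7b = (slot_x >> 13) & 0x7f;
    uint64_t imm41 = slot_l & UINT64_C(0x1ffffffffff);

    return (i << 63) | (imm41 << 22) | (ic << 21) |
           (imm5c << 16) | (imm9d << 7) | imm7b;
}

/* Destination register r1 sits in bits 12:6 of the X slot. */
static unsigned movl_dest(uint64_t slot_x)
{
    return (unsigned)((slot_x >> 6) & 0x7f);
}
```

The scattered immediate fields are what make this a good test: an assembler that encodes them and a disassembler that decodes them will disagree at once if either has a field misplaced.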

Although NOP.X and BREAK.X encodings accommodate an immediate value up to 62 bits long, I observe that only 21 bits make it to cr.iim, i.e. the L slot is ignored by this implementation (rx2600). More interesting is that break instructions encountered whilst single-stepping are no problem, the break handler is re-entered, and single-stepping can be resumed. At this point I should have singlestep->break->singlestep nested on the stack, but have not attempted a stack dump to prove this. In any case, clearing ipsr.ss has the system running as normal so who cares. Using break.x requires a tweak to the breakhandler, i.e. need to skip 2 slots instead of 1 to get over the faulting instruction. I do this by checking for the MLX template (t&0x1e)==4 AND slot==1.
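
The skip logic, including the MLX case, can be sketched as follows. The struct and function names are hypothetical; a real handler would read the slot from ipsr.ri and the template from the bundle at cr.iip, then write the results back before rfi:

```c
#include <stdint.h>

typedef struct {
    uint64_t iip;   /* bundle address, 16-byte aligned */
    unsigned ri;    /* slot number 0..2, as held in ipsr.ri */
} ipos;

/* Templates 0x04 and 0x05 are MLX: slots 1 and 2 form one instruction. */
static int is_mlx(unsigned template5)
{
    return (template5 & 0x1e) == 4;
}

/* Advance past the instruction at 'p'. 'template5' is the low 5 bits of
   the bundle at p.iip. */
static ipos skip_instruction(ipos p, unsigned template5)
{
    if (p.ri == 2 || (p.ri == 1 && is_mlx(template5))) {
        p.iip += 16;
        p.ri = 0;
    } else {
        p.ri++;
    }
    return p;
}
```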

21.7.17 {ia64/chaos} Single Step With only 16 bundles available in the IVT for a single-step interrupt, further reduced interrupt handler template to be a bare function dispatcher, with all the tricky code moved to the donkey callout. This sparse interrupt handler occupies only 8 bundles (0x80 bytes), so it can be used for any slot in the Itanium interrupt system. Thus added a single-step interrupt handler.
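
The space budget per vector can be worked out from the IVT layout. A small sketch, assuming the layout from the architecture manual (vectors below offset 0x5000 spaced 0x400 bytes apart, the remainder 0x100 bytes apart) and using hypothetical function names:

```c
#include <stdint.h>

/* IVT layout assumed from the architecture manual: vectors below offset
   0x5000 are 0x400 bytes apart, the rest are 0x100 bytes apart. */
static unsigned ivt_entry_bytes(unsigned offset)
{
    return offset < 0x5000 ? 0x400 : 0x100;
}

/* One bundle is 16 bytes. */
static unsigned ivt_entry_bundles(unsigned offset)
{
    return ivt_entry_bytes(offset) / 16;
}
```

On that assumption the break vector at 0x2c00 has room for 64 bundles, while a vector in the upper part of the table gets only 16, so an 8-bundle dispatcher fits anywhere.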

Added keypress filter to the break handler, to set IPSR.ss on a certain key value. This causes the single-step handler to be entered, which also waits for a keypress. Displaying cr.iipa whilst leaving cr.iip unchanged seems to produce the desired effect, with execution advancing on each keypress through slot0, slot1, slot2 before advancing to the next instruction bundle. Way to go!

16.7.17 {ia64/chaos} Improved interrupt handler further, towards ia64 heavy specification, i.e. to call out to a sophisticated handler with interrupt collection and external interrupts switched back on - aiming to hold execution on a loop calling EFI readkeystroke(). One has to remember that space in the Itanium Interrupt vector table is limited, as I found to my cost as my developing break fault handler grew beyond 0x400 bytes (the handler immediately after is the external hardware interrupt oops). The solution to the size problem is simple enough, just place a wrapper function in the IVT which calls out to the donkey function. Critical to getting external interrupts working for the callout is the switch back to bank 1 registers before any EFI system interrupts happen, otherwise the system crashes. Note that the bsw instruction which performs the bank switch must be the last in an instruction group otherwise its operation is described as undefined. In fact bsw faults the processor if used without a stop immediately after.

Happily after a good deal of trial and error I now have code which can wait for an EFI keystroke, from inside a fault handler. Obviously the ability to accept keyboard input in this situation is a massive step towards interactive debugging on the Itanium platform.

15.7.17 {ia64/chaos} Working on console display suite, centred around wdispf(CH* format,...) a printf-style function producing formatted output to the EFI console. Whilst VGA direct draw functions are technically easier, using the EFI functions carries the advantage that output is mirrored to the serial console or BMC telnet connection, allowing remote viewing - this is pretty much essential given the noise level of the rx2600 fans. wdispf places output into a static ring buffer, to get around the temporal problems with stack addresses passed to EFI (18.6.17). Support functions now working well include strtol, witoa (itoa for wide characters, with enhancements such as max output length, zerofill ...), the ctype functions isdigit(), isalnum() etc, and reduce(number,base), a divmod function to halve the math workload of witoa.
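
To illustrate how a divmod helper feeds a wide-character itoa, here is a minimal sketch in standard C. The signatures are guesses for illustration only (the real reduce and witoa are not shown in this blog), and witoa_sketch is a hypothetical name:

```c
#include <stdint.h>
#include <stddef.h>

/* Divide *n by base in place and return the remainder (one divmod step). */
static unsigned reduce(uint64_t *n, unsigned base)
{
    unsigned rem = (unsigned)(*n % base);
    *n /= base;
    return rem;
}

/* Write 'n' in 'base' (2..16) as 16-bit characters, right-aligned in
   exactly 'width' characters, padded with 'fill'. Returns 0 on success,
   -1 if the number does not fit. 'out' needs width+1 elements. */
static int witoa_sketch(uint64_t n, unsigned base, uint16_t *out,
                        size_t width, uint16_t fill)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t i = width;

    if (base < 2 || base > 16 || width == 0)
        return -1;
    out[width] = 0;
    do {
        out[--i] = (uint16_t)digits[reduce(&n, base)];
    } while (n != 0 && i > 0);
    if (n != 0)
        return -1;              /* ran out of room */
    while (i > 0)
        out[--i] = fill;
    return 0;
}
```

Each digit costs one call to reduce(), which yields quotient and remainder together instead of a separate divide and modulo.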

The result is a flexible EFI console display function which works inside an interrupt handler. Just a few minutes with this up and running produces a break fault handler which can display the bundle address and slot number of a break instruction, then adjust the slot and bundle address (if at slot 2) by rewriting the IIP and IPSR registers to skip the break instruction. I now have the development template for any Itanium fault/exception/interrupt handler!

8.7.17 {vc} debug info Revised variable-length SY structure (symbols) to align records on 4-byte boundaries, and add a locr member to store RSTK register number, leaving rtaddr free for MSTK shadow address. Added MOVLOCMLOCR pcode to move RSTK value to MSTK when & operator is first used on a register variable. This completes a fix for the address of register problem.

7.7.17 {vc} debug info Revised variable-length UT structure (type system) to align records on 4-byte boundaries - this to mitigate problems (IA64 alignment problem 18.6.17) when accessing these structures down the line (for compile and debug purposes). Besides padding UT namestrings, UT->prev is now reduced to one byte, recording the dword count of the previous UT record. This limits a UT record to 1020 bytes, sufficient to describe structures with up to 250 members, should be plenty.

Recompiled armc1 with the new UT regime, happily pi/chaos still works fine.

1.7.17 {vc} efi FTEMPLATE Mystified temporarily by the corruption of new ellipsis mstack variables by callouts to EFI functions - only to realise that arguments passed in registers to EFI require memory stack shadow space. This boils down to a misunderstanding on my part of the implementation of register variables in a modern compiler. ANSI C states plainly the address of a register variable cannot be taken, whereas ANSI C++ states that the register keyword will be ignored if the address of the variable is taken. The only way this can work is to allocate memory stack shadow space for all register variables, and to switch to the memory stack whenever the & is used. My first Itanium compiler, {itc1} uses only the register stack; approaching the problem from the wrong direction is producing interesting results along the way.

28.6.17 {vc} varargs Revised ellipsis handling to cope with variable stack alignment. Rather than defining argv as an array, constrained to the prevailing stack alignment (23.6.17) I have decided to use explicit addressing, dragging the current mstackalign value into the function body itself.

So now ellipsis generates three automatic symbols, UC* argv is a simple pointer to the first stack argument, integer argc is the argument count, and integer argsz is the byte spacing between mstack arguments. argc and argsz are initialized by generating code at call time to pass values in r0/r1 (r8/r7 on IA64) which are moved on function entry into the relevant local variables.

On function entry, argv can be cast to any type appropriate to the conceptual argv[0], subsequent arguments are accessed by simply adding argsz, e.g VD* nextarg=(VD*)argv+=argsz;
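
The explicit-stride access can be sketched in standard C as below. The helper names are hypothetical and the sketch reads only the low 8 bytes of each stack element; argv, argc and argsz play the roles described above:

```c
#include <stdint.h>
#include <string.h>

/* Fetch argument 'index' from a block of stack arguments that starts at
   'argv' and is spaced 'argsz' bytes apart. Only the low 8 bytes of each
   element are read here. */
static uint64_t arg_u64(const unsigned char *argv, int argsz, int index)
{
    uint64_t v;
    memcpy(&v, argv + (size_t)index * (size_t)argsz, sizeof v);
    return v;
}

/* Sum 'argc' integer arguments, whatever the stack granularity. */
static uint64_t sum_args(const unsigned char *argv, int argc, int argsz)
{
    uint64_t total = 0;
    for (int i = 0; i < argc; i++)
        total += arg_u64(argv, argsz, i);
    return total;
}
```

Because the stride is a run-time value, the same function body works for an 8-byte or a 16-byte stack element without recompiling.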

Recognizing that the traditional SL main(UL argc,CH* argv[]) prototype only works when mstackalign equals sizeof(CH*) is a revelation.

23.6.17 {vc} varargs Added ellipsis handling to call FTEMPLATE and fctbody FTEMPLATE by forcing the variable function arguments on to the memory stack. These variable arguments are stored on the memory stack in reverse order, so that they can be addressed from within the function as a simple array with element size equal to FTEMPLATE->mstackalign. Added code to create a local symbol argv to make this array accessible to the ellipsis function body. Whilst it is possible to arrange the memory stack for functions like printf with different-sized elements, this makes no sense going forward with the alignment-check constraints on 64-bit processors.

FTEMPLATEs allow mismatches between naturalsize integers and stack granularity, and I am currently using 16-byte stack alignment even though my Itanium project currently uses at maximum only half of each stack element - this just to make sure no naturalsize dependencies creep into the compiler code. Added a new 128-bit integer to the {vc} inbuilt type system (US). Where stack alignment equals natural pointer size, it is convenient to access the stack via UC* argv[]. Where these sizes are mismatched, UL argv[] (32-bit stack), UQ argv[] (64-bit stack) and US argv[] (128-bit stack) are used instead. This way varargs can be accessed intuitively using argv[0],argv[1],argv[2],... from left to right, with typecasts if appropriate.

This FTEMPLATE system may seem overly complex, but by dissociating stack granularity and natural integer size, all the ingredients are in place to compile IA32 code to use a 64-bit stack, thus making mixed-mode code much more straightforward.

18.6.17 Itanium Data Alignment Beginning memory stack structure, union and array addressing now, aware that PSR.ac is set by default at EFI Boot Manager time, so it is easy to force an alignment check fault. What I did not expect was that clearing PSR.ac does not stop the alignment check faults, i.e. these check faults are an architectural feature, not soft warnings. Having been used to (almost) complete freedom regarding data alignment (i.e. coming from IA32), this comes as a shock. No doubt data member alignment in public structures such as EFI follows rules which match the architectural limitations of IA64, x64 and ARM. Generally, structure member addresses should be a multiple of the member size for arithmetic and pointer types to avoid this problem, so a compiler must insert padding bytes automatically. As a quick fix, I have just added a compiler warning where this rule is violated; I am not keen to add automatic padding until I have studied how this is done in the wider scheme of things. Meanwhile the Itanium processor will sure let me know when I get the alignment wrong.
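
The natural-alignment rule reduces to a few lines of arithmetic. A sketch with hypothetical helper names, assuming member sizes are powers of two (1, 2, 4, 8, 16):

```c
#include <stddef.h>

/* Round 'offset' up to the next multiple of 'align' (a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Number of padding bytes a compiler would insert before a member of
   'size' bytes placed at 'offset', using natural alignment. */
static size_t padding_before(size_t offset, size_t size)
{
    return align_up(offset, size) - offset;
}

/* True when a member at 'offset' already satisfies offset mod size == 0. */
static int is_naturally_aligned(size_t offset, size_t size)
{
    return (offset & (size - 1)) == 0;
}
```

The third function is all a compiler warning needs; the first two are what automatic padding would add on top.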

18.6.17 Itanium EFI memory stack Using EFI memory stack to build a string for output to the EFI console produces an interesting yet useless result, i.e. the memory stack content is overwritten before the console gets the display data. This indicates that Itanium EFI console output is asynchronous, being accumulated perhaps as a series of memory pointer/length records to be processed later (maybe on a timer interrupt which uses the same stack?). Therefore display data must be placed in a persistent memory area for it to make it to the console. Furthermore, the garbage output to the screen via the memory stack is different to the garbage sent to BMC, indicating a further time-lapse between screen and remote console display processing.

Calling EFI console output functions from within a break interruption handler is similarly frustrating, even if persistent memory locations are used. This effect can be sidestepped by using functions which write directly to the VGA memory buffer. Fortunately, EFI screen console output appears to perform an IO read of the VGA text cursor before each output phase - so custom display functions can mesh with the EFI console by simply updating the VGA cursor position.

17.6.17 Auto Memory Stack Local Variables {vc} default FTEMPLATE for ia64 places local variables on the register stack in the first instance, equivalent to placing register keyword in front of all local data declarations, and works well, up to a point. Declarations which will not fit in a register naturally go on to the memory stack, which is easy because this can be decided as the data declaration is processed. But using register locals as the default throws up the awkward case where an attempt is made to take the address of that variable - this is only possible for memory stack variables.

Rather than revert to a full register declaration for all locals, I inserted code in doaddressofoperator() (& operator handler) which switches variables from the register stack to the memory stack when this happens. This violates ANSI C, which states that address-of register is just not allowed; however ANSI C also says that the register keyword is no more than a hint to the compiler - i.e. if the compiler runs out of registers then the variable will be on the memory stack anyway, and the address-of operator is perfectly valid.

14.6.17 Itanium break Interruption Working now on FTEMPLATE for ia64 interrupt function modifier, using the software break at IVA+0x2c00. A simple memcpy of a handler function into the IVA, followed by srlz.i;;srlz.d;; seems to work fine. The Itanium Register Stack Engine (RSE) can be used inside this handler - the trick is to insert cover;; before alloc; there is no need to restore ar.pfs (this is done automatically by rfi following cover). Managed to perform EFI display calls within this interrupt handler, hence nesting allocs is no problem, though EFI console display is truncated/garbled somehow. My VGA direct-draw functions work just fine. This facility will make the development of other handlers much easier than I had expected (how could I develop a TLB miss handler without it?).

11.6.17 Itanium SAL call is straightforward, worked first time using cast of salproc entrypoint location in SAL System Table to efi function pointer (the entry point is followed by GP for the SAL call, hence this is effectively a PLABEL):
salproc=(SQ (efi#)(UQ a,UQ b,UQ c,UQ d,UQ e,UQ f,UQ g,UQ h))&ss->salproc;
(*salproc)(0x01000012,0,0,0,0,0,0,0); //SAL_FREQ_BASE
returns 200000000 in r9 on Integrity rx2600, the 200MHz platform base frequency.

10.6.17 {vc} With FTEMPLATE now allocating a varying number of sys registers, local symbol addressing from within asm{} blocks is variable too. Added _lr# aliasing to register stack, similar to loc# but subtracting sys register count which is known at compile time. Thus asm{} blocks inside fctbody can reference local symbols as _lr0,_lr1, instead of loc3,loc4 which are dependent on the correct FTEMPLATE being in use.

10.6.17 {vc} and Itanium PAL call Attempting to produce a FTEMPLATE for an Itanium PAL call, using some MSTACK control fields to pass arguments on the memory stack.
Of course this does not work, I misread the meaning of stacked registers as opposed to static registers in the PAL sense. So for anyone else who tries this,
PAL static registers are r28,r29,r30,r31
PAL stacked registers are r32,r33,r34,r35, also set r28 equal to r32
PAL return registers are r8,r9,r10,r11.
Took a little while to work that one out. PAL call also needs an assembly language wrapper to calculate a return address, avoiding br.call for the entry, which is easy in the ChaOS IA64 model since code label values generate a simple GP-relative offset, as in this snippet:
movl r8=palreturn;;//offset
add r8=r8,r1;; //add GP
mov b0=r8
mov r28=in3
mov r29=in2
mov r30=in1
mov r31=in0
mov b7=in4;;
br.cond.sptk.many b7;;

10.6.17 {vc} FTEMPLATE developed further, with two FTEMPLATE pointers active in fctbody(), one for the function being compiled, another to control generation and stack positioning of function arguments. FTEMPLATES feed into a FRAME structure, which contains address mapping for five distinct stack areas (ins,syslocals,locals,tmps, outs) on both the register stack and the memory stack. syslocals (things like ar.pfs, function return address etc) are controlled by bit settings in a flags parameter, one for each stack.

Whilst necessarily complicated, FTEMPLATE coding falls almost entirely into three blocks -
(1)in addresslocalsymbols(), to allocate the sys area between ins and locals
(2)in flushpcos(), case ENTERP to generate code on function entry
(3)in flushpcos(), case LEAVEP to generate code on function exit

As a final tweak, temporary copies of FTEMPLATE->rsysflags and msysflags are used by fctbody for code generation. This allows tweaks to the stack frames, by setting or clearing bits as desired - e.g. (ia64) no calls to other functions, so STACKIP (save b0) and STACKGP (save r1) can be skipped. Therefore a function with no stack requirements could generate no code for ENTERP, and just br.ret b0 for LEAVEP.

28.5.17 {vc} a virtual C compiler Deconstructed {itc1} IA64 C compiler into a group of function pointers, placed in ARCH structure to create {vc} to begin to draw {armc1}, {ebc1}, and {itc1} into one multi-architecture compiler. Compilation is directed by a new command-line argument /a=ia64. I am currently working from an IA64 register-only model, through a mixed-stack model towards a memory-only stack model such as the one I have always used on my ChaOS compilers. Before {vc}, function calling conventions were very much hard-coded in my compilers. These become FTEMPLATE structures which control the addressing of arguments (outs and ins in Itanium-speak) in the outer and inner functions. FTEMPLATES include a namestring and are stored in a table within the ARCH structure, invoked by using the namestring as a function modifier. In this way, a namestring such as efi can invoke architecture-specific behaviour as defined by the UEFI Specification.

A FRAME structure, built for each function body expands the FTEMPLATE to map local storage in fine detail, and is inevitably complex. Currently putting the finishing touches to the pal modifier for Itanium firmware PAL function calls.

12.5.17 Virtual address mode Identity-mapped VM now running, using itc.i and itc.d to create TLB cache entries for a handful of 4GB and 64MB descriptors, enough to get things off the ground. Will probably put this on the backburner for now, because EFI SetVirtualAddressMap can only be called after ExitBootServices, which destroys most of the EFI pre-boot environment.

11.5.17 End of the line: Intel announces Itanium 9700 series as being the seventh and final generation of IPF.

6.5.17 Progress:Itanium compiler taking shape, producing ChaOS FTRAW image in PE32+ wrapper, acceptable to EFI firmware as a native executable program (PE32 type 0x200 = IA64). Presently using register stack model only (no memory stack yet), but sufficient to call EFI functions for console input (getkey etc) and output to the EFI Shell environment. Now beginning to get under the hood of this fascinating processor.

Accessed Itanium IO ports and MMIO addresses yesterday for the first time, which require specific variants of ld and st instructions. Also managed to switch to virtual address mode, and access IO via page with UC attribute.

April 2017: Ported {ebc1} -> {itc1} to produce native Itanium IA64 code. Tried to benchmark the Itanium against the i3-4010U in my development laptop, using assembly language routines for both which convert a 64-bit value into 16 hexadecimal digits. First attempts showed the IA64 consuming 300 clock cycles versus 200 for the i3, kind of what I expected