ia64 Projects


Baremetal Itanium!

Around April 2017 I began developing a ChaOS variant for the Itanium platform, just as the processor was being retired from service. There will be a lot of powerful hardware going cheap over the next decade; for example, I recently sourced two HP Integrity machines, one rx2600 with 1 x 1.3GHz CPU and one rx2620 with 2 x 1.6GHz CPU, for fifty quid each. As Itanium depreciation has accelerated, my i2 and i4 blade servers, sourced in my home city, together cost less than the rx2660 I sourced in Australia.

As of August 2018 I am having great fun with my home-grown Itanium compiler, a small development OS and remote booting for kicks.

The Itanium is not the sort of processor you take home to meet your parents; it is a radiation-hardened supermodel. 128 x 64-bit general registers, 128 x 82-bit floating point registers, Harvard-style fetch-decode-execute architecture and predication like the ARM, but with triple instruction bundles and quad bundle pipelining - potentially twelve-at-once instruction execution. Over 100 amps of electrons flowing through 3.1 billion transistors, 4000-plus registers in the i4 9560 Poulson. Shame it is coming to end-of-life without ever becoming famous.

The Itanium's downfall is well documented, mostly by those who have little idea of what it is. CPU clock speeds have settled in the low gigahertz range for years now - a clock speed ceiling imposed by the physics of heat generation and signal propagation. The wavelength of a 6GHz signal is two inches, or five centimetres. Intel QuickPath Interconnect is 6.4GT/s, SATA is 6Gb/s and USB3.0 is 5Gb/s, and a 5GHz overclocked CPU needs nitro cooling, so this ceiling is very real today. The problem was long foreseen by researchers in the 1980s and 1990s, resulting in thorough research into parallel processing. One such parallel-computing model developed at Hewlett Packard became the template for Itanium, a radical design to supersede Intel's ageing 32-bit 486/Pentium. Itanium would have a 64-bit core with a 16-times uplift in register count (compared with IA32), and with scheduling for parallel instructions to be despatched to multiple execution units on each clock cycle. This processor would need a clever compiler to analyze programs and identify opportunities for parallelism (opportunities increased by having such a large register file). Instructions would be bundled together in groups of three, packaged where possible for simultaneous execution. Programmers would help by providing hints to the processor about the likely path of execution, to reduce time wasted by mispredicted branches.

In 1999, as the Itanium project neared its public debut, AMD proposed a 64-bit opcode extension set for the x86, called AMD64. This would sit on top of the fast and stable 32-bit Pentium. AMD64 would boot in the conventional manner, before switching to a 64-bit execution mode. Itanium (IA64) tackled the issue of backward compatibility the other way up - with 16-bit and 32-bit hardware emulation modes sitting on top of the brand new IA64 machine. IA64 would start up in 64-bit mode with a newly-designed pre-boot environment called EFI, with no route to boot legacy 32-bit code. It would be quite a leap forward. After many delays Itanium was released in 2001, but it was raw, untested, and hamstrung by slow motherboard circuitry. When AMD64 came to market in 2003, its ability to provide 64-bit features on top of the legacy boot quickly gained wide acceptance, and forced Intel to release its own version of this architecture, called Intel64. Although the clock speed ceiling was soon reached, Intel64 and AMD64 have since pursued routes other than parallel execution to add performance: notably widening and adding more SIMD registers (which perform simultaneous operations on small arrays of numbers), thinner silicon, lots of opcode extensions, more processor cores and out-of-order execution. Itanium too has acquired more cores and thinner silicon, but its processor core (with limited 64-bit SIMD) has survived 20 years almost unchanged, having made that brave leap into the unknown at day one.

Some features of IA64 have found their way into Intel64 during its progress. Intel64 now uses complex circuitry to route some work to parallel execution units, or to handle branch prediction; Itanium always expected these issues to be considered by programmers and compilers. (Who better to predict whether a branch is likely to be taken than the programmer himself?) As time has marched on, Intel64 has acquired hint instructions to allow a programmer to assist the circuitry; Itanium has had hint bits encoded into every load, store and branch instruction since day one. Intel64 more recently added a 3-register fused multiply-add (FMA3) XMM instruction (AMD64 now has FMA4) with a latency of 5 clock cycles. The Itanium has had the 4-register FMA4 instruction for 20 years now, with a latency of - you guessed it - 5 clock cycles.

My initial interest in Itanium was sparked in 2013 as I began to develop a 64-bit compiler for ChaOS, producing code to run in the x64 UEFI pre-boot environment. I discovered that the early Itanium servers and workstations contained EFI BIOS, and searched the second-hand market for four years before one came along.

Today almost every laptop, desktop or server using the Intel64/AMD64 platform starts up with code running in the UEFI BIOS - a direct descendant of the early Itanium pre-boot code. It is ironic that the many who have mocked IA64 as The Itanic have blogged so from machines running the one part of the Itanium project which is truly unsinkable.

In the first few months of this project I discovered that the Itanium is a deceptively simple machine. It is after all a RISC processor. Its small instruction set is suited to amateur compiler writers. Its strictly in-order execution units and dependable timestamp counter provide a window into the dark world of cache misses, pipeline stalls and resource trade-offs for threads in the same core.

Random blog on IA64 compiler and platform development follows below:

29.1.21 rx2800 i4 and WinS 2008 R2 Experimenting now with external SAS enclosures, so managed to install and activate the last-of-the-line Windows Server for fun. Whatever I do I can't get CPU usage to register above 0%!

13.12.20 rx2800 i4 and P410 RAID Using drvcfg -s multiple RAID volumes can be created in the 8 SAS drive bays at the front of this server. RAID0, RAID1+0 (+spare) and RAID5 (+spare) are all possible.

EFI shell on the rx2800 has an inbuilt FTP command. Once the EFI system partition is created on a RAID volume (just two steps in ia64/ChaOS blk utilities), development boot images can be downloaded via VPN to the system partition using a quick login followed by Ftp>GET chaos.efi.

For configuration from scratch after a RAID reformat, I use a ChaOS image on the local network, and a PXE boot launched AFTER an EFI Shell session so that the EFI blk devices for the RAID volumes are visible. After creating the EFI system partition the system needs a reboot for the FAT32 filesystem to become visible. Then FTP becomes a valuable addition to the configuration toolkit.

Having now created two RAID volumes on the rx2800, it becomes possible to trash one during further development of GPT/EFI partitioning code, whilst running .EFI images from the other.

Interestingly I had to fix a slight problem with x86\ftpsrv, where the rx2800's SYN-ACK response to our SYN came back before ftpsrv had set up a callback function to catch it. This was a latent bug, where I filled in the address of a handler function AFTER sending the SYN packet. It had never been an issue so far, but as machine speeds increase such bugs become exposed. In considering timing problems with network comms (as I mentioned two days ago regarding the Cisco 3560g switch) it should be remembered that the rx2800 is a very fast machine.
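The race boils down to the order of two statements. A minimal C sketch (names hypothetical, not the real ChaOS code) with a simulated wire that answers a SYN instantly, as a fast peer can:

```c
/* Sketch of the ftpsrv race and its fix. The simulated "wire" delivers the
   SYN-ACK before send_syn() even returns: if the handler is not already
   installed at that moment, the reply is lost. */
typedef struct tcb {
    void (*on_synack)(struct tcb *);  /* called when the SYN-ACK arrives */
    int established;
} tcb;

static void send_syn(tcb *t)
{
    if (t->on_synack)                 /* reply arrives "instantly" */
        t->on_synack(t);
}

static void handle_synack(tcb *t) { t->established = 1; }

/* Fixed ordering: install the callback BEFORE the SYN goes on the wire. */
static void tcp_connect(tcb *t)
{
    t->on_synack = handle_synack;
    send_syn(t);
}
```

With the original ordering (send first, register after) the simulated connection never establishes, no matter how fast the host.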

11.12.20 ChaOS on rx2800 i4 I am now comfortable toggling between HBA and RAID mode on the P410i controller, using drvcfg -s. A reset is needed after switching to RAID mode, before using drvcfg -s a second time to delete/create RAID volumes.

The rx2800 i4 iLO is flaky when connected directly to a Cisco 3560g switch, as I did from the get-go. Telnet into the switch itself only works on a single LAN segment if ip routing is on. Despite trying lots of different Cisco IOS configurations I cannot get my rx2800 to hook up via telnet. There are examples on the internet of similar problems with HP iLO, and not just with Itanium servers.

Because PXE boot is initiated by a broadcast packet, boot image files need to be uploaded into a server in the LAN for network booting. But with a LAN-to-LAN VPN I can now quickly transfer a newly-compiled boot image across the internet and fire a reboot using the UEFI Boot Manager.

Curiously, using the telnet command within the Cisco IOS (I only discovered this command by happy accident) works fine, which proves that configuration and wiring is not the problem. I suspected it may be a timing problem, that the Cisco switch is too fast for the iLO firmware, but that is difficult to prove. (Update:13/12/20: It is also possible that the rx2800 is too fast!) However by placing another non-Cisco switch between the Cisco 3560g and the rx2800 i4, everything works fine.

Installed in my rx2800 i4 are one PCIe P411i Smart Array controller and two PCIe Fibre-Channel controllers, all of which have UEFI drvcfg -s capability in their option ROMs. Configuring an external RAID array is the obvious next step.

10.12.20 ChaOS on rx2800 i4 As Itanium depreciation gathers pace, I recently acquired an rx2800 i4, still in box and absolutely unused, for 350 quid. Not really interested in the factory-installed OpenVMS operating system, due to the crazy per-core license regime, so those disks will be labelled and tossed in a box. Clearly the HP Integrity Sales Department has been focussed on extracting the last bit of juice from the product, so we need an entitlement (i.e. some kind of service contract) to download the latest firmware, which includes the EFI command saupdate to alter the operating mode of the Smart Array disk controllers. Specifically I need to change the P410i controller from factory-set RAID mode over to HBA.

Fortunately I tried the EFI command drvcfg -s for this controller. This brings up a menu to toggle the mode between RAID and HBA.

I can see the firmware is little changed from the rx2660, so within a couple of hours I have ChaOS booting from local disk and via PXE.

29.6.20 rx2620 mainboard replacement Sourced some old rx2620 mainboards with the object of fixing my rx2620, which was destroyed during Storm Emma a couple of years ago. Ran up against security in the HP firmware, which matches the serial number, model number and UUID on the mainboard with values stored in the front LED panel PCB. There are lots of people out there with this problem, resulting from HP forcing their own factory replacement/refurbished parts into the market in favour of salvaged spares. The HP firmware cheekily disables EFI filesystems, which makes it impossible to boot an operating system.

Took me a while to understand where the serial numbers are stored, involving ripping the eeproms off my dead rx2620 mainboard and soldering them on to circuit boards with patch leads to a Raspberry Pi. Pushing my chaos/pi project to the limit, I developed software to read and write these eeproms from files on the Raspberry Pi boot SD card. I also learned what happens when you drive the 24c02 address lines with outputs from the Raspberry Pi (i.e. no current limiting resistors!). You WILL fry the chip.

The key to rectifying mismatched components is in the rx2620 EFI shell, specifically the commands sysmode service and sysset. Some say there is a fourth sysmode level, besides user, admin and service, to get at the eeproms and update them with correct values. However, when motherboard and LED panel are mismatched, in service mode, sysset displays the values stored in the motherboard eeprom. These numbers need to be duplicated in the LED panel eeprom.

I tried desoldering and removing the LED panel eeprom for reprogramming, then soldering the chip back on to the LED panel PCB. But for a single-byte error I had the job done; two out of three values were matched by the firmware - not bad, as Meat Loaf would say. At the second attempt I lost a leg from the 24c02WP chip, making it impossible to reconnect. Other 24c02 eeproms sourced from eBay just don't seem to have the same spec as the HP ones, and mess up I2C timing for all the eeproms on the mainboard bus. Also by this time the LED panel PCB tracks were badly damaged by repeated soldering.

Luckily I sourced another rx2620 LED panel within the UK, evidently brand new as the eeprom was blank. I figured the eeprom should be accessible through the connecting cable, rather than messing with the soldering on this PCB. I inserted fine copper wires into the LED panel connector cable at the motherboard plug, and hooked them up to the Raspberry Pi. This is a much kinder way of accessing the I2C bus, and reprogramming the eeprom is simple and stable, because the PCB has pullup/pulldown resistors in all the right places. Experts out there will know that just four wires are needed - Vcc, Vss, SDA and SCL, at 3.3v. The LED panel eeprom is at address 4. The data format in the eeprom is not hard to work out; UUID/GUID bytes need to be in the correct order.

27.5.2019 vc2 compiler Still off on the new compiler tangent, having introduced a range of new inbuilt numeric types, most notably floating point in 16,32,64,80 and 128-bit flavours. This is a major departure from my ageing 32-bit compilers, and introduces a matrix of conversions which need to be invoked implicitly during compilation. First of all, numeric string scanning has been modified to produce 128-bit integers and 128-bit floating point (quad-precision) by default. Secondly, a set of conversion functions now produces the range of number formats required to initialize these new inbuilt types.

Thirdly, the matrix of conversions is now beginning to take shape. I have opted for an in-register conversion rather than library function calls for the more complex conversions, to avoid having to produce different libraries for different function template scenarios.

After spending time studying all these number formats, imaginative ways of converting between, for example, double and half-precision floating point begin to emerge. In this case the difference in exponent bias is 0x3f0 (1008), so the FPU can be used to multiply a double-precision floating point value by 2^1008 (0x7fe0000000000000 in that format). The half-precision exponent and mantissa are then contiguous and can be simply shifted into their final bit-positions. Two conditional bit tests on the input double provide propagation of sign information, and optional rounding-up according to the last half-bit.
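The bit-level mechanics of a double-to-half conversion can be sketched in portable C. This is my illustration of the exponent re-biasing idea, not the compiler's actual output, and it handles only normal numbers in range (no subnormals, infinities, NaNs or rounding):

```c
#include <stdint.h>
#include <string.h>

/* Minimal double -> half conversion by exponent re-biasing. The bias
   difference between the two formats is 1023 - 15 = 1008 = 0x3f0.
   Normal, in-range inputs only. */
static uint16_t double_to_half(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);          /* well-defined type pun */

    uint16_t sign = (uint16_t)((bits >> 48) & 0x8000);       /* bit 63 -> 15 */
    int32_t  e    = (int32_t)((bits >> 52) & 0x7ff) - 0x3f0; /* re-bias    */
    uint16_t mant = (uint16_t)((bits >> 42) & 0x3ff);        /* top 10 bits */

    return (uint16_t)(sign | ((uint16_t)e << 10) | mant);
}
```

For example 1.0 (exponent field 0x3ff) re-biases to 15, giving the half pattern 0x3C00, and -1.5 comes out as 0xBE00.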

2.3.2019 vc2 compiler T-codes With the t-code experiment looking good as the basis of a better-structured C compiler, mirrored t-code support to /a=x64 modules. See vc2.

21.2.2019 IA64 C compiler T-codes A small but significant question when allocating registers to t-codes for a complex C language expression is how to ensure the value of that expression is directed to one or more fixed "accumulator" registers. A recursive-descent parser, which is the norm for the C language, ensures that assignments are coded in right-to-left order where there are multiple assignments in the expression. So the answer is in the id of the value temp of the rvalue at assignment time. By writing this id out to a global or static variable, when the recursion unwinds we have the value of the final (left-hand) assignment, i.e. b out of a=b=c=d=e.
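The unwind can be modelled in a few lines of C. This is a toy illustration with hypothetical temp-id numbering, not the vc2 internals:

```c
/* Toy model of right-to-left assignment coding: assign() "parses"
   vars[i] = vars[i+1] = ... = vars[n-1], allocating one temp id per level.
   Each level records its rvalue temp in last_rvalue as the recursion
   unwinds; the outermost '=' writes last, so last_rvalue ends up holding
   the temp for b in a=b=c=d=e. */
static int last_rvalue;          /* id of the final assignment's rvalue temp */
static int next_temp = 100;      /* hypothetical temp-id allocator */

static int assign(const char **vars, int i, int n)
{
    if (i == n - 1)
        return next_temp++;            /* leaf: temp loaded from vars[n-1] */
    int rv = assign(vars, i + 1, n);   /* code the right-hand chain first */
    last_rvalue = rv;                  /* outermost '=' overwrites this last */
    return next_temp++;                /* temp holding this level's result */
}
```

For a=b=c=d=e the leaf (e) gets temp 100, each assignment allocates the next id on the way back out, and last_rvalue finishes as the temp of the rvalue at the a= level.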

20.2.2019 IA64 C compiler T-codes Moved T-code dependency analysis and automatic stop insertion to the compiler back-end, easier to manage after locations for register variables have been allocated. Mapping currently covers r1-r63, extendable to cover the whole 128-register Itanium range. Any T-code can be generated with a hard stop to override auto-stop behaviour. Tried a "move-up" function, to reorder the T-code buffer before flushing it to the back-end, again surprisingly easy and productive. I find every successful "move-up" pretty much saves a stop and a NOP.

19.2.2019 IA64 C compiler T-codes Experimenting with alternative ways of generating IA64 code, added a #pragma DEV to the compiler to generate a short t-code (three-code) buffer and accompanying temporary variable allocation, wrapped around expression in expression-list and init-declarator. Working with a subset of expression analysis comprising indirect load and store, constant load, register stack variables and just the one additive binary operator, I had a working prototype in just two days.

I tried a similar exercise in the early ChaOS IA32 days, which I dropped when I found that a PUSH/POP method allows effectively infinite expression depth without demanding more than two accumulator registers.

These days my compiler output is buffered by C function, so this t-code experiment uses a similar technique, sub-buffering at expression level. T-codes are easily visualised, simply by displaying or printing each t-code and its associated temporaries as pseudo-code. Opportunities for code improvement, elimination of temporaries etc., are easy to see and map before assigning processor registers ready for the backend code generator. And these t-codes can exploit the full potential of three-register instructions on RISC processors such as ARM and Itanium.

By padding my existing p-code structure to the same size as the larger t-codes, I can submit t-codes to the compiler backend using a p-code alias. And by flushing the t-codes to the backend, the code stream contains alternating old and new code generator output for each test expression. This is really useful when animated by single-step debug, to ensure the new output is technically correct.

But most interesting is the ease with which dependencies can be spotted in the t-code buffer, especially by a simple algorithm. My IA64 back-end cannot do this, and relies on an oversupply of stops from the p-codes to avoid dependency errors.

The abstract terms write-after-read (WAR), write-after-write (WAW) and read-after-write (RAW) are daunting until you realise that Itanium handles all these hazards in hardware for memory dependencies, as well as WAR for register dependencies. This leaves just the RAW and WAW register hazards, where the programmer or compiler is expected to mark an instruction group boundary (called a stop) after a register write, and before a subsequent read-from or write-to the same register. This is not a difficult job for an algorithm acting on a table of t-codes and associated operands.

I now have a small bitmap in each t-code to represent WAW and RAW dependencies, and run these rules on a register usage map for each t-code in the buffer to determine where stops need to be placed in the code stream - exactly the way the Itanium is intended to be programmed. A little automation goes a long way. Straightaway I see instructions scheduled to execute in parallel, and a 25%-40% reduction in stops, NOPs and code size. Dependency mapping also shows which instructions might be "moved up", an area of further performance gain which I will no doubt explore.
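The core of such a stop-placement pass can be sketched in a few lines of C. This is a simplification, with a hypothetical t-code of one destination and two sources over r0-r63 (the real pass would also special-case the read-only r0 and track predicates):

```c
#include <stdint.h>

/* Hypothetical t-code: one destination and two source registers (r0-r63). */
typedef struct { int dst, src1, src2; } tcode;

/* Mark where a stop (;;) must precede each t-code: a stop is needed when an
   instruction reads (RAW) or writes (WAW) a register already written in the
   current instruction group. WAR and all memory hazards are left to the
   Itanium hardware. */
static void place_stops(const tcode *t, int n, int *stop_before)
{
    uint64_t written = 0;                 /* regs written in current group */
    for (int i = 0; i < n; i++) {
        uint64_t reads  = (1ull << t[i].src1) | (1ull << t[i].src2);
        uint64_t writes = 1ull << t[i].dst;
        stop_before[i] = ((written & (reads | writes)) != 0);
        if (stop_before[i])
            written = 0;                  /* the stop closes the group */
        written |= writes;
    }
}
```

For the sequence add r3=r1,r2; add r4=r1,r2; add r5=r3,r4 the scan leaves the first two instructions in one group and forces a stop only before the third, which reads the two freshly written registers.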

15.2.2019 IA64 C compiler Floating point expressions Method now in place for free mixing of 64-bit floating point, 80-bit floating point, and 64-bit integer types in expressions, with arithmetic accumulation switching between GRs and FPU as necessary. Integer and floating point constants also properly type-checked and converted to the appropriate format for inclusion into initializers and expressions. Working implementation in place for the four primary arithmetic operations, add, subtract, multiply and divide. Next: extend this method for modulus and comparative operators.

11.2.2019 Conversion from 80-bit double extended FP to native IA64 82-bit FP Cracking on with compiler enhancements to mix 64-bit doubles and 80-bit extended doubles freely in expressions. This tweak requires the compiler to place conversion code inline where necessary to load values from GR pairs into FPU registers. The exponent and significand can be loaded from two 64-bit GRs into two FPU registers, then combined using fmerge.se. However the sign:exponent for this operation needs to be in native IA64 82-bit format.

One way is to tweak the in-register sign:exponent before loading it into the FPU. The exponent bias needs to be changed from 0x3fff to 0xffff, and the sign (if set, i.e. negative) from bit 15(79) to bit 17(81). Adding 0xc000 adjusts the bias, but corrupts the sign when set; however, by testing for this case, the sign can be moved to bit 17(81) by further adding 0x18000, which is just 0xc000 times 2. shladd means the same scratch GR (r2) can produce these two offsets. tbit can be used on the sign, to predicate the extra addition. If the exponent is in r37, for example:

     addl   r2=0xc000,r0 ;;
     tbit   p16,p0=r37,15
     add    r37=r2,r37 ;;
(p16)shladd r37=r2,1,r37 ;;

This takes just a couple of clock cycles. But three 5-cycle FP operations are required to construct the FPU value. setf.sig and setf.exp will issue in parallel, followed by fmerge.se. I measure 11 cycles in total.

Another way is to store the two GRs to a memory address, then use the ldfe instruction to load the FPU. This conversion is potentially slower, with three memory latencies plus one 5-cycle FPU latency. But the common memory address will bring the caches into play. For repetitive operations, execution time is less than the first method! (I measure 8 cycles, excluding the stack address arithmetic which may be needed for a stack temporary.)

6.2.2019 IA64 compiler local C variables now visible to assembler This is a feature which has been in my IA32 compiler for at least 15 years, where local C data declarations and stack arguments are visible to the inline assembler by their C data names. This strategy keeps C code and inline asm blocks in sync, because assembler addresses are recalculated if local C data declarations are changed. Errors can be thrown for symbolic names which are not found; much better than mentally translating between a C function template and the registers assigned by a compiler.

Also added a new preprocessor directive #inline - really just an #include file to hold small chunks of code which I use repeatedly. Because the Itanium lacks a div instruction, for instance, a small Newton-Raphsonian code nugget is one such favourite chunk. These chunks are fed to the C statement() function, which will recognise asm{...} blocks of course, so a combination of C and assembly language can be used if necessary. With local symbols now being visible to the assembler, these chunks provide the functionality of an assembly macro, but are less cryptic to the reader.

3.2.2019 Itanium L2 cache Much of the perceived speed of a modern processor is down to the way caches are arranged around the CPU core. Whether it is fetching instructions from a program in memory, or fetching and storing data, half the transistors in the CPU are buffering access to the memory DIMMs. Getting down and dirty with my range of Itanium testbeds reveals the Itanium 9500 series to be slower than expected, despite a 50% CPU clock speed uplift compared to the 9300 series. On closer inspection (i.e. via PAL system calls) it can be seen that the Level 2 data cache read latency on my Poulson 9560s is 8 clocks, compared to 5 on my Tukwila 9350s. This manifests as a higher clock count for the later processors - in other words the core has increased in speed but the actual time taken to access the Level 2 cache has not really changed. This could be to reduce power consumption, or maybe there is no headroom to make the circuits go any faster.

What a dog of a processor is the Itanium. Level 1 cache takes 1 clock. Level 2 cache takes 5-8 clocks. Haswell, Broadwell, Skylake Level 1 cache access is 4 clock cycles, Level 2 is 12 plus...

2.2.2019 i4 bl870c Itanium 9560 does what it says on the tin Like most blade servers on the second-hand market, my i4 was stripped of memory DIMMs before sale and arrived with the usual 0Gb. This twin interlinked blade has a total of 48 DIMM slots and a maximum memory capacity of 768Gb. Memory is recommended to be installed in matching DIMM quads. Having no such DIMMs to hand, I initially installed a matched pair of DIMMs in the CPU0 bank of each blade, but of different capacities. Because I was happy to see the unit boot successfully, I left the memory in this configuration.

In recent weeks I had noticed my i4 bl870c 9560 producing higher itc clock counts for software tests than my i2 bl860c 9350, one of the reasons I completed a full review and produced the pclock() function to apply the correct multiplier to base itc clock counts. When this scaling was applied, the i4 clock counts were approximately twice those measured on the i2. Today I changed the memory configuration to two matching DIMM quads, one on bank CPU0, the other on bank CPU1 of the bl870 Monarch blade. Clock counts for my i2 and i4 are now in the same range; the earlier slow counts demonstrate why a non-interleaved memory configuration needs to be avoided, hence the advice to install memory in quads of DIMMs.

The i4 still lags the i2 by a ratio of about 5:6. I think this is a reflection of a higher load latency for the 9560 L2 Data cache. Of course the 9560 runs a test routine quicker in absolute time because the clock speed is nearly 50% faster, at 2.53GHz versus 1.72GHz. Multiply these two ratios together and you see a figure about 1.2 times faster, which correlates with the 2.4 times performance uplift claimed by Intel for Poulson over Tukwila (factoring in the doubling of cores from 4 to 8).

31.1.2019 IA64 ar.itc Spent some time with the SAL and PAL calls to develop a reliable way of scaling the Itanium interval counter ar.itc to properly represent processor clocks. Anyone who has tried to obtain stable readings from the Intel64 rdtsc instruction will know how inexact this sort of science can be. I settled on a small list of stock ratios which represent the Itaniums in my current collection, such as 17/8 for the i2 9350 1.73GHz, 25/8 for the i4 9560 2.53GHz and 32/8 or 28/8 for the 9000/9100 series. The translation from itc to processor clocks can be done by a set of predicated shladd instructions (having placed the numerator into the lower bits of PR), followed by a shift right by 3 to represent the common denominator of 8. I call this function pclock().
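In C the scaling amounts to a shift-and-add multiply by the numerator, then the common shift right by 3. A sketch (the real pclock() does this with predicated shladds rather than a loop):

```c
#include <stdint.h>

/* Scale an ar.itc reading to processor clocks using a stock ratio num/8,
   e.g. 17/8 for the 9350 or 25/8 for the 9560. The multiply is built from
   one shift-and-add per set bit of the numerator, mirroring the predicated
   shladd sequence. */
static uint64_t pclock_scale(uint64_t itc, unsigned num)
{
    uint64_t acc = 0;
    for (int bit = 5; bit >= 0; bit--)   /* stock numerators fit in 6 bits */
        if (num & (1u << bit))
            acc += itc << bit;           /* one shladd's worth of work */
    return acc >> 3;                     /* common denominator of 8 */
}
```

So an itc diff of 1000000 on the 9350 scales to 1000000 x 17 / 8 = 2125000 processor clocks.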

The latency of pclock() is simply -(pclock()-pclock()), or b-a where a=pclock(); b=pclock();. This can be subtracted from other pclock() diffs for a realistic elapsed processor clock count. Interrupts must be switched off for this calibration, of course.

As with any attempt at micro-timing, a routine must be repeated several times in succession, also with interrupts switched off, before dividing by the repeat count to obtain the count for one iteration. This integration is needed because the granularity of ar.itc is greater than the precision we are trying to achieve. Invocations will start and finish at random points in between ar.itc clock ticks, so for a pclock:itc ratio of 4, differences in timing of 4 or even 8 are perfectly normal. Beware the first invocation, which will show a higher count, reflecting the work needed to pull code and data into the processor caches. This count is only available one time per program load. Subsequent invocations produce a lower, steady figure which reflects the routine running from within the Level 1 and Level 2 caches. Clock counts produced by this method are remarkably stable on the Itanium, resolving reliably to one clock cycle with an oversampling count of 128.
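The recipe generalises to any coarse counter. A sketch with an injectable counter standing in for pclock(); the fake counter below is purely for illustration (4 ticks per read, 12 per routine call):

```c
#include <stdint.h>

/* Oversampled micro-timing: read the counter, run the routine reps times,
   read again, subtract the counter's own read latency, divide by reps. */
static uint64_t time_routine(uint64_t (*counter)(void),
                             void (*routine)(void),
                             unsigned reps)
{
    uint64_t t0 = counter();
    uint64_t overhead = counter() - t0;   /* back-to-back read latency */

    uint64_t a = counter();
    for (unsigned i = 0; i < reps; i++)
        routine();
    uint64_t b = counter();

    return (b - a - overhead) / reps;
}

/* Deterministic stand-ins for pclock() and a routine under test. */
static uint64_t fake_clock;
static uint64_t fake_counter(void) { fake_clock += 4;  return fake_clock; }
static void     fake_routine(void) { fake_clock += 12; }
```

With reps at 128 the fake routine resolves to exactly 12 "clocks" per iteration, the same way a real routine resolves to one clock cycle on the Itanium.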

26.1.2019 IA64 math functions Math function development now under way, starting with f2xm1, i.e. (2^x)-1, and fyl2x, i.e. y.log2(x), at double precision. These use Taylor expansions and do benefit from pipelined loops when loading memory tables of constants, typically 60% faster than a non-pipelined routine. Initial implementation was relatively straightforward, quickly producing sensible numbers. Double precision is reached at 9 iterations for f2xm1, and 11 iterations for fyl2x. Slightly modifying these two functions produces f2x (which is f2xm1 plus 1) and fl2x (which is fyl2x without the final multiply by y).

For any positive number n, f2x(fl2x(n)) should equal n, i.e. 2^log2(n) = n, and this is a good test for any math library. At first this test showed declining accuracy the further n is from 1. Accuracy can be restored by adding iterations to the Taylor expansions, but this is wasteful. Power-of-2 scaling is the way to go for accuracy across a wide range of n. With such scaling implemented, f2x(fl2x(6.0)) returns 5.99999999999999911, which is the limit of double precision: 0x4018000000000000 in produces 0x4017ffffffffffff out. The difference is just one bit of rounding.

For f2x this scaling involves splitting a logarithm (power) into its integer and fractional parts, performing the inner loop on just the fractional part, which has an exponent of -1, then adding the integer part back to the original exponent (which is after all just a power of 2). For fl2x the reverse is the case: first the exponent is saved and substituted with an exponent of 0. After calculation of the logarithm, the exponent is added in to the floating point result. The Itanium extr and dep instructions are purpose-made for manipulating bitfields within a general register, which can be moved as a floating point bitmap directly into an FP register, or into its exponent or significand. I will use similar techniques for 80-bit floats for sure.

23.1.2019 IA64 modulo-scheduled loop, with software pipelining Taking the ubiquitous 100 square modulo-scheduled loop in the Itanium textbooks (i.e. load floating point value from memory, square the number, store back to memory, loop 100 times) my initial encoding produces a clock count of 1600, which is comparable to the 1800 cycles taken by my x64 Haswell laptop. The Itanium texts suggest this loop should take 117 cycles, so there is a way to go here.

Some explicit bundling to remove nops inserted by my crummy assembler tightens the code to two bundles with no stops, and brings the clock count down to 240. By chance this exercise placed the loop across a 32-byte memory boundary. Because this processor has 6-wide issue, instructions enter the pipeline in pairs of bundles, i.e. 32 bytes. Therefore aligning our two bundles on a 32-byte boundary will double the issue rate compared to the non-aligned case.

So a quick hack to my assembler (i.e. an align 32 directive) inserts the requisite number of NOPs to nail the loop to a 32-byte boundary. The clock count drops to 124, which is bang on the textbook prediction of 117 (give or take the interval time counter read latency).

22.1.2019 Itanium ar.itc frequency on rx2660 I have been amazed by the interval timer counts coming from the Montvale Itanium for some months, but have become increasingly perplexed in recent days by the implausibility of timings coming back for pipelined loops, especially when compared to Itanium i2 and i4.

A simple test, using EFI->bootservices->stall(1000000) as a one second time delay, produces an ar.itc diff of 399000000, indicating that the interval time counter on this platform runs at one quarter of the 1.6GHz processor frequency. PAL_FREQ_RATIOS returns a series of clock ratios for the platform, from which the 1:4 ratio can be gleaned. Added a quick calculation to my PAL_FREQ_RATIOS display to show this ratio, which should work for my other Itanium platforms.

21.1.2019 IA64 modulo-scheduled loops Took the time to improve debug_donkey support for rotating floating point registers, and rotating predicate registers, including rrb.pr evaluation and colour coding of predicated instructions on the single step (i.e. red = predicate false). Animating a simple modulo-scheduled loop with these tools really brings the Itanium to life, how a 100 x 18-cycle loop taking 1800 cycles on a conventional processor might be reduced to less than 120.

How might a C compiler generate such code automatically, for a while(...){} loop for example? That is a difficult question. I will try some hand-coded loops of my own first to see the performance gain to be had.

20.1.2019 IA64 128-bit basic types Carrying on after the relative ease of adding double floating point types to my IA64 compiler, added extended double floats and 128-bit integers to the mix. As with the 64-bit doubles, these types are allocated as GRs, but in pairs. Small changes to the call mechanism, and the allocation of extra input GRs, mean these types can straightaway be passed as parameters to function calls. The main reason behind this push is to use extended double floating point precision as the default in my math library.

13.1.2019 IA64 Large Integer Arithmetic Interestingly there are no processor flags on the Itanium to detect arithmetic overflow, like the carry flag on x86. In a parallel processing situation several overflows could happen at the same time, so a single carry flag would be of limited use. Instead the Itanium has 63 writable 1-bit predicate registers, any pair of which can be set to the true and false results of a conditional test. To perform unsigned 64-bit addition, the method is to compare the result of an addition to the two operands; i.e. after a=b+c, if((a < b)&&(a < c)) then we have the equivalent of carry flag set. Itanium has the circuitry to perform these two comparisons in one clock cycle, so is competitive when programmed correctly.
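A Python model of the carry test (the helper name is mine; the mask stands in for 64-bit register wraparound):

```python
MASK64 = (1 << 64) - 1

def add_with_carry(b, c):
    """64-bit unsigned add with the Itanium-style carry test:
    after a = b + c (mod 2**64), carry out occurred exactly when
    the wrapped result is smaller than an operand - the condition
    the two predicate compares detect."""
    a = (b + c) & MASK64
    carry = a < b and a < c     # the two predicate comparisons
    return a, carry
```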

This technique corrects the output from my 80-bit floating point display routine, which uses 128-bit shift and add to multiply the 64-bit mantissa by 10 when generating decimal digits. 80-bit pi being in binary 0x4000.c90fdaa22168c235 now correctly produces 3.14159265358979323851280895940618620443274267017841339111328125. Note that this value is correct for pi in only the first 19 decimal digits. Two bits of 64-bit mantissa here encode the 3 to the left of the decimal point, so the trailing decimal digits are an exact representation of a fraction which is 2^-62 granular.
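The digit generation can be modelled exactly with Python big integers; this is a sketch of the multiply-by-10 idea, not the actual 128-bit shift-and-add assembly:

```python
def extended_pi_decimal():
    """The 80-bit value 0x4000.c90fdaa22168c235 has a 64-bit
    significand whose top two bits encode the 3, so the fraction
    is an exact (odd) multiple of 2**-62.  Generating decimal
    digits by repeated multiply-by-10 therefore terminates after
    exactly 62 fractional digits."""
    m = 0xC90FDAA22168C235            # 64-bit significand of 80-bit pi
    ipart, frac = m >> 62, m & ((1 << 62) - 1)
    digits = []
    while frac:                       # exact: stops when frac == 0
        frac *= 10
        digits.append(str(frac >> 62))
        frac &= (1 << 62) - 1
    return f"{ipart}." + "".join(digits)
```

Because the low bit of the significand is set, the fraction is odd, and an odd multiple of 2^-62 always yields exactly 62 decimal digits ending in 5.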

6.1.2019 IA64 C compiler Floating point support is working Initial implementation complete, supporting +,-,*,/,=,+=,-=,*=,/=,==,!=,>,>=,<,<= operators with automatic conversion between 64-bit integer and 64-bit double FP according to convention.

C version of F2XM1 (0.5) compiled by vc1 returns 0.414213562373094923 which is an acceptable result for calculations in double precision floating point. More to the point this shows this simple floating point model for IA64 is viable.

5.1.2019 IA64 compiler Floating point support Tried a small hack to my vc1 C compiler to allow floating point type declarations, and made a giant leap. Since double floating point numbers are 8 bytes in size, the compiler simply treats these as 64-bit integers for load, store and arithmetic expressions. Each local FP double declaration allocates a GR from the register stack, so these variables work fine as arguments to functions with no changes to the compiler model. A tweak to getop(), to look for a decimal point after an integer constant, causes a rescan of the string and conversion to a 64-bit IEEE bitmap. The expression analyzer is exposed to these values for initialization of local variables. Just need to remember to use 1.0 (or just 1.) when initializing a double and 1 when initializing an integer, for example. Then I realized the converse is true: 0x3ff0000000000000 can be used to load 1.0, and 0x400921fb54442d18 can be used to load pi (yes, I know I need more accuracy for a math library). Neat!
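The bitmap/double correspondence is easy to check; here struct stands in for the compiler's rescan-and-convert step (helper names are mine):

```python
import struct

def double_bits(x):
    """Raw IEEE-754 bitmap of a 64-bit double, as the compiler
    would embed for a floating point constant."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_double(n):
    """The converse: a 64-bit integer constant reinterpreted
    as a double."""
    return struct.unpack("<d", struct.pack("<Q", n))[0]
```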

One small change to the expression analyzer, for the additive operator, now detects floating point operands and outputs a new FPADD pcode. This simply moves the relevant GRs into the FPU (setf.d fr=gr), performs a fused-multiply-add, and then stores the result back into the accumulator GR. My first working IA64 floating point C expression.

Only f2,f3 and f4 are used, these FP registers are already saved and restored by debug_donkey so the new code is debugger single-step safe. In the course of one afternoon, I have the design and implementation of floating point for my ia64 C compiler almost done and dusted. This is a great boost towards developing and debugging a math library.

3.1.2019 F2XM1 Managed to code the 2^x-1 (F2XM1) function, on IA32 ChaOS, using a table of reciprocal factorials and a multiply-add loop. Now I see clearly how to write my IA64 math library.
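A sketch of the approach, under the assumption that 2^x-1 is computed as e^(x ln 2)-1: a Horner loop over a reciprocal-factorial table, each step having the a*b+c shape of a single multiply-add (the 20-term cutoff and names are my own):

```python
import math

# Table of reciprocal factorials 1/1!, 1/2!, ..., 1/20!
RECIP_FACT = [1.0 / math.factorial(k) for k in range(1, 21)]

def f2xm1(x):
    """Sketch of F2XM1 (2**x - 1) via the e**y - 1 Taylor series
    with y = x*ln(2).  The Horner recurrence acc = acc*y + 1/k! is
    exactly the a*b+c shape of one fma instruction per term."""
    y = x * math.log(2.0)
    acc = 0.0
    for c in reversed(RECIP_FACT):
        acc = acc * y + c
    return acc * y               # = sum of y**k / k!, k = 1..20
```

Subtracting the 1 analytically (the series starts at k=1) avoids the catastrophic cancellation of computing 2^x first and then subtracting 1 for small x.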

2.1.2019 IA64 Floating point Currently researching polynomial expansions in preparation for my IA64 math library. With no high-level instructions on the Itanium FPU this is a stretch for an amateur programmer. Functions such as sin, cos, atan, log, ln and pow have to be built up from the correct Taylor series using the ubiquitous FMA and FNMA instructions.

Interestingly the 128 Itanium internal FPU registers are 82 bits wide, slightly different to the conventional 80-bit double extended precision registers on x87. The exponent is 17 bits wide compared to the 15 bits on the x87, giving more range when working close to zero or +/- infinity. 82-bit values can be entered or extracted to/from the FPU using the ldf.fill and stf.spill instructions, should one ever need to introduce/save these exotic numbers. More easily, 80-bit double extended precision, 64-bit double precision and 32-bit single precision bitmaps can be loaded up from memory or from any of the 128 general 64-bit registers. Raw bitmaps can also be transferred from the GRs to FR significands or signs/exponents. There are plenty of ways to load high precision constants into the FPU, as will be needed for any half-decent maths library.

Read-only registers f0 and f1 are hardwired to zero and +1 allowing these values to be incorporated into calculations without the overhead of an explicit register load.

28.12.18 ChaOS on Poulson! Just like the Itanium 9350 (Tukwila), clock counts recorded for my SHA-256 test routine running on the 9560 (Poulson) are 50-60% more than on Montecito (i.e. SLOWER EXECUTION), which is initially a disappointment. Temporal (coarse-grained) multi-threading in Montecito switches between the two threads typically on a cache miss, so it is useful to understand that whilst one thread here executes in fewer clock cycles than on Tukwila or Poulson, the second thread on that core may be completely halted. The higher clock count on the later Itanium processors simply reflects the change to interleaved (fine-grained) multi-threading, switching rapidly between pipeline advancement on both threads in each core. In this scenario Tukwila/Poulson ought to produce results from a pair of threads in FEWER clock cycles (e.g. 160% divided by two = 80%). I have yet to write a test routine to prove this interesting point.

There is a marked difference in heat output between i2 and i4 Itanium blades, no doubt reflecting the change from 65nm to 32nm silicon, and despite the doubling of core count from four to eight.

Clock count on Poulson is slightly better than Tukwila. I am guessing this reflects the change from six-at-once to twelve-at-once instruction execution. A cleverer compiler might exploit this potential (explicit quad-bundles aligned on 64-byte boundaries, with NOP insertion?).

So it seems Itanium development pretty much stopped ten years ago. The only difference between my Itanium i4 Poulsons and the i6 Kittson (released May 2017) is a part number change and a small CPU speed uplift. I am stoked to have captured some top-level Itanium 9560s barely a year after fumbling around with IA32 mode on first-generation Itanium processors.

29.11.18 ChaOS on Poulson! Itanium depreciation has exceeded my wildest expectations, having just acquired an i4 bl870c blade complete with 3x 9560 processors (each 8-core, 16 thread) - a mere 6 months after sourcing my first Montecito testbed. I feel like the Old Woman In The Shoe, so many Itaniums I don't know what to do.

The bl870c is basically two bl860c blades supplied with a 'scalable link' which, along with some clever firmware, conjoins the two adjacent blades into one computing unit. The left-hand blade assumes the role of Monarch, taking control of all resources; in this case my three 9560s appear as 48 CPU threads in a common memory space totalled from DIMMs present in the two blades. Although the maximum memory configuration for the i4 bl870c is 768Gb, the blade will boot to EFI shell with just a few Gb, in my case just 12Gb of mismatched PC3-10600R.

Second-hand memory prices for PC3-10600R are falling through 2GBP per Gb, whilst slower PC3-8500R can be half of that price. The slower memory is below specification for i2 and i4 Itanium blades, but DOES run in the i2. Meanwhile iLo3 on the i4 blades happily detects slower memory DIMMs, but firmware fails them during the power-on sequence.

19.10.18 ChaOS on Tukwila! Finally succeeded in configuring the Cisco 3020 to switch those Vlan 1000 frames on the internal NIC interfaces to an external switch port. ChaOS remote boot now possible down any of the 4 NICs on each blade. email chaos@ctpp.co.uk to find out how this is done.

9.10.18 ChaOS on Tukwila So a couple of weeks later I have 4 x i2 bl860c blades, each with those twin 9350 4-core multithreading CPUs, a total of 64 Itanium threads. First job has been to learn how to configure the network interfaces, made easier of course by the UEFI BIOS on each blade. I found some Cisco 3020 blade switches (4 for 15GBP!) not realizing that you need a university degree to set them up. So I tried a GbE pass-through module, which gives 1:1 connectivity to the blade NICs. (NB these modules do NOT auto-switch down to 10/100!!) Three of the four bl860c blades power up with the NICs in Vlan 1000, and I have no idea why. With a tweak to the network stack on my ChaOS TFTP server, untagging incoming frames and retagging the replies, I soon had PXE remote boot to each blade on eth0.

7.10.18 HP c7000 BladeServer I have some explaining to do. How is ChaOS now running on Tukwila? Well, after seeing the HP c7000 Bladeserver chassis dropping below 500 GBP, I researched this subject thoroughly before picking one up for just 350 GBP - the idea being that further Itanium testbeds in bladeserver form could be purchased when the prices crash further. These would likely be Montecito bl860c/bl870c or bl890c blades, typically 200-300 GBP.

I had barely powered up the c7000 before discovering bl860c i2 blades, complete with twin Itanium 9350 Tukwila processors in my local area for less than 100GBP. An opportunity not to be missed...

Like the Itanium itself, the c7000 is a supermodel, and took a couple of weeks to fully understand, given that I had never encountered a Vlan or managed network switch before. The chassis Onboard Administrator modules also needed a firmware upgrade just to recognize Itanium blades. (Note that this upgrade will also throw errors for 2250W power supplies made before a certain date in 2008).

28.9.18 ChaOS on Tukwila I have been on a very steep learning curve the last week. Small bug fix to see ia64/chaos running on this monster processor, including all debug handlers, single step etc. i2 UEFI firmware starts dozens of MNP, ARP, IPv4 and TCP4 handles, the proper network stack. For ChaOS this will be a quantum leap. More details when I catch my breath.

28.9.18 rx2660 from Proliant DL380 G5 One last tweak to this build of an rx2660 into the Proliant G5 chassis: the twelve fans were roaring at full speed today as I tried various tweaks of ia64/chaos on a SAS drive swapped between this machine and my Itanium 9350 testbed. After half an hour this rx2660 performed a hard shutdown, because the mainboard thought the cover was off. There is a 2-pin jumper on the mainboard for a cover microswitch. Shorting these pins makes the fans spool normally. Alleluia!

10.9.18 rx2660 EFI device drivers I have been gently probing the EFI firmware with a development network driver, trying to switch off the PxeBc and gain control of the NIC through SIMPLE_NETWORK_PROTOCOL. Sadly it looks like someone at HP took a coffee break when writing the firmware..

Apparently the PxeBc and Net protocols are bound within one driver handle, so calling DisconnectController(ControllerHandle,PxeBaseCodeDriverHandle,..) removes the PxeBc, PxeDhcp and Net protocols from the NIC. No worries, PxeBc->Stop() should leave EFI_SIMPLE_NETWORK_PROTOCOL unfettered for our development driver. So if we install a DriverBinding protocol it should be invoked by ConnectController, allowing additional protocols to be installed on the NIC, such as Arp, Dhcp4 etc. ConnectController(ControllerHandle,DriverHandle,...) works OK to bind this extra network driver, and EFI dutifully calls DriverBinding.Start after receiving EFI_SUCCESS from DriverBinding.Supported. But calling the matching DisconnectController(ControllerHandle,DriverHandle,...) returns EFI_UNSUPPORTED, and DriverBinding.Stop is never called, so unbinding a development driver according to the UEFI model is not possible.

Unbinding can be done programmatically in development driver Unload(), so this is not insurmountable, just a whole load of wasted time. About that coffee?

10.9.18 rx2660 from Proliant DL380 G5 Just a handful of people have wondered whether the similarity between these two HP servers allows one to be built from the other. Having just managed to create a working rx2660 in a DL380 chassis, the short answer is yes, if you have a metalworking toolkit. The system board mounting points are in completely different places, apart from the three power cage mounting holes. The rear i/o panels are riveted/welded in place and very different. The PCI cages are different, as are their chassis mounting points. The 8-way fanboard mounting points are different. The 4-way I/O fan boards are different. The front bezels and Insight panels are obviously different. The memory DIMMs are different.

Having said all that, the power cage and PSUs are interchangeable, as are all the fans and the 8-way fan cage. The SAS backplane, DVDRW and hard drives remain in place.

All the bits needed can be eBayed here and there but mainly from the USA. Shipping a full server across the Atlantic to the UK costs more than shipping from Australia (why? the special relationship??). So shipping the lighter rarer parts from afar makes some sense. You will need the rx2660 cable kit - the DL380 has a larger systemboard which eliminates several cables. The 34-way half pitch I/O fan board cable is a particular oddball.

All told I managed the build for less than 450 GBP, way under half the cost of a machine from TrumpClintonLand.

27.8.18 BlkIo Tools Added a diskcopy function for drive cloning, with an option to increment MBR, GPT and ESP partition GUIDs on the destination drive (so disk copies do not carry duplicate identifiers). Added a wipedisk function to clean drives of development formatting, with early-out and backup GPT wipe in all cases. Added a byte-level editor to the BlkIo sector browser, for fine-grained disk editing. Tried corrupting GPT checksums to see how Itanium EFI might react. Although I have been careful to construct correct GPT structures with the mandated checksums, Itanium EFI apparently ignores the crc32 fields when creating BlkIo and ESP protocols.
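The mandated header checksum can be sketched with Python's zlib, which implements the same CRC-32 polynomial the UEFI spec requires; the helper and the 92-byte header size follow the spec, not the text above:

```python
import struct
import zlib

def gpt_header_crc(header92):
    """CRC32 of a 92-byte GPT header, computed with the header's
    own crc32 field (4 bytes at offset 16) zeroed first, as the
    UEFI spec mandates.  zlib.crc32 is the same CRC-32 EFI uses."""
    zeroed = header92[:16] + b"\x00\x00\x00\x00" + header92[20:]
    return zlib.crc32(zeroed) & 0xFFFFFFFF
```

A formatter computes this over the zeroed header, patches the result into offset 16, and a verifier repeats the zero-and-recompute to check it.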

Passed the new BlkIo code through cc9 into the x64/chaos project, and the disk sector editor works straight away - a testament to the benefits of building projects over EFI.

26.8.18 PXE->mtftp synch Perfected TFTP file transfers using the EFI_PXE_BASE_CODE protocol for synching of other development files to fs0: on the Itaniums. With a simple control file on fs0:, synch pulls any number of files (test apps and drivers) across the eth0: interface and writes them to chosen directories in fs0:. A complementary launch function now allows me to run these apps or drivers from the local filesystem.

PXE->mtftp takes 10 function arguments - P64C means args 9 and 10 are passed on the memory stack at SP+0x10 and SP+0x18. This cornered me for a while as EFI expects 8-byte alignment for args 9 and 10 here. ia64/chaos has been running with a stack granularity of 16 bytes for quite some time. However anything is possible with a home-grown compiler!

20.8.18 EFI disk format Wrote a quick routine to clean-format an EFI BlkIO device with a GPT partition table containing one 512Mb FAT32 EFI System Partition. A couple of hours careful programming has the Itanium writing the FAT32 boot sector, File Allocation Tables and a blank root directory - enough for EFI to recognise this as an EFI_SIMPLE_FILESYSTEM fs0: The Itanium EFI Shell has a tftp command which can be used to pull files through the remote boot network card and write them into this FAT32 drive. Full steam ahead!

18.8.18 VPN for remote boot Set up a VPN between two locations to try remote boot between two segments of a LAN-LAN connection. Of course this cannot work, because the BOOTP broadcast packets are not relayed through the VPN for good reason. The Telnet connection to the Itanium MP however works fine this way. With a small ChaOS server on the remote LAN (running BOOTP and TFTP services, plus FTP to receive boot images through the VPN), I can now run the Itanium servers at a remote location, much as they would have been used in a datacentre.

14.8.18 Itanium remote boot Probed BOOTP network packet stream from PXE network card boot, eventually realizing that a DHCPOFFER can be sent to the rx2600 or rx2660 provided the client IP address offered is OUTSIDE the range of the DHCP server in my LAN router. Replying in kind with 'PXEServer' option instead of 'PXEClient', plus a boot filename results in an immediate UDP TFTP Filesize request, which if correctly answered begins the TFTP upload to the Itanium Client. Of course I have had to add a barebones TFTP service to my ChaOS network stack to do this.
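The TFTP request that follows is a very simple packet; a sketch per RFC 1350 (the filesize query mentioned above is the option extension of RFC 2349, which appends further NUL-terminated strings in the same style):

```python
import struct

def tftp_rrq(filename, mode="octet"):
    """Minimal TFTP read-request packet: big-endian opcode 1,
    then the filename and transfer mode as NUL-terminated
    strings.  This is the shape of request a PXE client sends
    once it has accepted the boot filename from the offer."""
    return (struct.pack(">H", 1)
            + filename.encode() + b"\x00"
            + mode.encode() + b"\x00")
```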

At last I can run my nascent Itanium OS at a more reasonable distance from the roaring fans, and without swapping a USB stick between my development machines and the Integrity servers.

9.8.18 EFI Device Driver Created template for EFI App and Device Driver, essentially a machine-specific startup delivering control to a main() function. The key difference between the two templates is the program subtype which causes the device driver to remain in memory after startimage(), and the unload callback (like an Event callback) which provides a place for cleanup code should the driver be asked to stop.

Also in the driver template is an example of InstallProtocol exporting a small function pointer table. My Itanium compiler does not produce shadow PLABELs for function pointer assignments. However the machine-specifics for assigning values to protocol function pointers can be reduced to a small subfunction or macro. Of course the driver unload function calls DeinstallProtocol!

3.8.18 EFI Event callbacks Exploring EFI timer callbacks. x64 was straightforward enough, I had an on-screen hours:mins:seconds display running within an hour. ia64 proved more challenging. The ia64 Event callback function pointer references a plabel, not a bare function pointer. And some registers need to be saved and restored around the callback, a fact which caught me out. Overwriting r7 in a timer callback handler seems innocuous, but on the rx2660 will subtly corrupt the EFI event tables, so the second and subsequent events on a periodic timer just don't happen.

Reading up on the subject shows that r4-r7 are designated as preserved, whilst being merrily trashed by my vc /a=ia64 compiler. Added a callback function modifier to vc, with a new function template to save and restore r4-r7. Adding this modifier to the callback function in the C code is a neat fix.

1.8.18 EFI Simple Network Experimenting with the baseline EFI network protocol. Network controllers are already running when EFI Boot Manager runs, but EFI_SIMPLE_NETWORK may be stopped. Generally, if higher level network protocols have been installed (as on later UEFI machines), i.e. MNP, ARP, UDP, DHCP etc, Simple Network will be running at the bottom of the network stack. If stopped, it can easily be started, but will not transfer incoming frames via the ->receive function until receivefilters have been set up. There is a trick to perform (thanks to EFI 1.0 as on the rx2600) if multicast is desired - receivefilters() needs to be called twice: once to enable unicast and broadcast whilst disabling multicast and resetting the multicast filters; then a second time to enable multicasts and supply at least one valid multicast filter.

26.7.18 sha256 in Itanium rotating registers Pushed my 'C' sha256 source code through my Itanium compiler as an exercise, initially taking 19000 cycles on a Montvale CPU. Then recoded the central hashing algorithm in assembly language, using the Itanium rotating registers for the first time.

Rotating register addressing is a little counter-intuitive. After allocating 64 rotating registers (r32->r95), then using a br.ctop to load a table from memory into GR32, you will get the first table entry in r32, the second in r95, the third in r94 etc (with rrb.gr back to zero after 64 iterations). Once you get your head round this, fixed register names can be used for the offsets used to pull values into the central sha256 algorithm. After a day of concentrated work, hashing a 256-byte block takes 1450 cycles for the fips-180 "abc" message. Recoding the block memcpy used to bring in the hash data as a quadword move brings the cycle count to 1000 - around 4 cycles per byte hashed.
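The renaming can be modelled in a few lines of Python - a toy sketch of the rrb bookkeeping (not cycle-accurate, and the function name is my own):

```python
def rotation_targets(iterations, size=64):
    """Which physical register receives a value written to logical
    r32 on each iteration of a br.ctop loop with a 64-register
    rotating region (r32-r95).  br.ctop decrements rrb modulo the
    region size, so successive writes land in r32, r95, r94, ..."""
    rrb = 0
    targets = []
    for _ in range(iterations):
        targets.append(32 + rrb % size)  # physical reg for logical r32
        rrb = (rrb - 1) % size           # br.ctop rotates the base
    return targets
```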

One striking feature of the Itanium is its predictability. First invocation of this routine runs about 35% slower, as the caches load up. Second and subsequent invocations show ar.itc time intervals which are the same to +/- 2 clock cycles! Messing around with branch hints dptk sptk etc makes no difference to the clock count.

To take this thread a little further I have added register index notation to my assembler, r[{expression}] to improve readability of code (like the Intel Itanium assembler). There is definitely more speed to be had because my current sha256 algorithm has to swizzle 32-bit big-endian integers, and uses basic 64-bit operations (ignoring the upper 32 bits of each register) rather than the 2x32bit parallel operations. I have just added syntax for explicit bundle templates to my assembler, will be interesting to see the effects described in the textbooks.

24.7.18 Itanium Montvale vs Madison Update on execution timing for my 64-bit binary to hex ascii benchmark on the rx2620 (April 2017). Back then I was chuffed to see this assembly routine completing in 57 clock cycles. On the rx2660, Montvale 9040 processor the same code completes in 19 cycles. Both timings waste a few cycles reading ar.itc. Fast?

12.7.18 ia64 spinlock Added atomic spinlock to serialize access to console output, console getkey and debug_donkey. Interesting to watch control alternating between two CPUs as they contend for debug_donkey whilst single-stepping off a hard breakpoint. Added CPU id to debug output throughout the disassembler to show which CPU has the spinlock. (Much better than the Machine Checks which result when no spinlocks are used!).

This all happens because of the way in which Itanium PAL sets up the APs, all running in spin loops with interrupts off, whilst snooping their respective IPI interrupt 0xff status bits; an example of how civilised the Itanium programming environment is. Of course this translates to x64, where ChaOS could set up the APs in similar fashion, ready to launch a user function on receiving an IPI.

This is the first time I have used a spinlock throughout ChaOS, never thought I would need one. Previously I have used APs on x64 as occasional processors, and avoided them primarily because of the extra heat generated which causes the fan to roar on my development laptop.

10.7.18 ia64 AP debugger working AP stack pointer on startup is zero. By setting r12 to the top of a malloc'd memory block, APs will run EFI boot services, provided the BSP is held in a spinloop. Similarly my debugger works on the APs, so I can now breakpoint and single step one AP while holding the BSP on a completion flag.

9.7.18 ia64 APs now working After a few dozen Machine Check crashes whilst fumbling for the correct IPI addresses of the Itanium Application Processors (APs), managed to bring APs on rx2660 out of SAL spin loop to execute a short procedure within chaos.efi. By passing BSP r1 value to SAL_SET_VECTORS, APs are able to store values into the ChaOS memory image. APs call flushcacheline() and this makes stored values visible to the BSP.

APs seem to run my simple register stack function no problem, though I have yet to check whether SAL has set up an RSE backing store and memory stack for each CPU.

The 7 APs on my rx2660 respond to ExInt 0xff IPIs written to uncached addresses 0xfee01000, 0xfee02000, 0xfee03000, 0xfee04000, 0xfee05000, 0xfee06000 and 0xfee07000. Reading cr.lid for each returns 0x1000000,0x2000000, 0x3000000, 0x4000000, 0x5000000, 0x6000000 and 0x7000000 respectively, all of which makes perfect sense now I know the IPI mappings.

Update: firing IPIs to all 7 APs, each storing their cr.lid in adjacent memory quadwords, and with no explicit cache flush instructions, results in the above (expected) values even though the APs execute concurrently. (I had expected some cache-coherency issues to be demonstrated).

30.6.18 ia64 heap, salproc Pushing new heap allocations to stretch the new memory regime, EFI saved interruption vectors are now on the heap. Exploring SAL procedure calls to probe CPU BSP and AP topology. Looking to intercept the EXINT vector to get a handle on external interrupts, including the interprocessor interrupts used to kick the APs out of BOOT_RENDEZVOUS.

Interrupt intercept is possible because EFI EXINT is effectively just one bundle containing a br (jump) instruction. This bundle needs to be disassembled to locate the jump target, with brl instructions for the hop to the intercept code, and for the hop back to the original handler jump target. Hopefully I can filter out some hardware interrupts and divert them to ChaOS code. Timer interrupts are my favourite initial hooks.

23.6.18 ia64 malloc,free Added memory heaps to ia64/chaos, to get away from EFI allocpages for dynamic memory allocations. Heap code is dual-mode, also compiling for x64. Rather fun to have a 12Gb flat memory space to play with, though return to EFI on program exit is slow because Itanium EFI insists on filling freed blocks with 0xfb bytes (I suppose for security purposes). Takes about 1 second for each Gb.

29.5.18 HP Integrity rx2660 Processor upgrade/downgrade: fitted 2 x AB577-2100B processors. All working, showing 8 CPUs. PAL_BRAND_INFO shows these to be the 1.6GHz dual-core Itanium 2 9040 Montecito chip, which was quickly superseded by Montvale. Spectacular depreciation to $25 each this week on eBay.

25.5.18 HP Integrity rx2660 testbed arrives A year has passed so quickly since I iced my Itanium project. I got bogged down with the IA-32 mode on the Itanium 1. Yesterday added a rx2660 server to my Itanium collection, one dual-core 1.42GHz Itanium 2 9120N with 16Gb RAM. There is no IA-32 mode on this processor, but there are two cores, four threads, 1.7 billion transistors to play with...

BMC is almost identical to the rx2620 for the Telnet and Serial port consoles, so it did not take long to try running chaos.efi. All is well until a processor exception is triggered (e.g. debug break instruction, VM access fault etc), where the processor clearly takes the firmware vector rather than the ChaOS code. This is a cache issue, cleared by installing the ChaOS vectors, and running more code before faulting the processor - the Itanium 2 processor caches are MUCH bigger than my Madisons.

So it is gratifying to see my amateur debugger breakpointing and stepping the Itanium 2 after such a short time.

The rx2660 case seems to be quite rare, so I had to pay a high price to get one. But Itanium 2 processors are more plentiful. I found a pair of 9140Ms (dual-core 1.67GHz) for nineteen quid each. Will they run in my box? Watch this space...

Here is a quick HowTo for new owners to get up and running:

Like any enterprise-level server, you get 2 computers for the price of one; one is the server itself, two is the Management Processor, which allows you to monitor and control the server through a serial port or network connection.

Each vendor has its own flavour of Management Processor and user interface, for the rx2660 HP-Speak gives us the BMC (Baseboard Management Controller), with iLO-2 for the user interface (Integrated-Lights-Out-2...). This provides a Web interface into the BMC, with SSL secure access. At the Ground-Zero level, the BMC can be used to power up or power down the Itanium mainboard remotely, rather essential because these machines are noisy!

So on the back of the server (extreme left) there is a serial port and RJ45 network connection into the BMC. Beside the RJ45 there is a hole to access the BMC reset button. The BMC is active a few seconds after power is applied to the unit. Holding the reset button in for 4+ seconds clears previous username/password combinations from the BMC, which might prevent a new owner from gaining control. Once this is done, the unit defaults to 9600,8,1 on the serial port and network access defaults to DHCP. Therefore the BMC can be accessed via the first serial port and a terminal program, or telnet for example (look at your router DHCP diagnostic or ARP cache to see where the BMC is on your network). Login to the BMC is simply Admin/Admin, with plenty of warnings about setting up something more secure. The HP Integrity User manuals document the BMC very well from here on.

Modern browsers refuse the older TLS versions these days, so the firmware security in these old servers will throw errors when attempting a secure connection via the Web interface. I downshift Firefox to TLS 1.1 to make things work (open about:config in a tab; ignore the warranty warning; set security.tls.version.max = 1). TLS needs to be shifted back up when you are done with the iLO-2.

8.10.17 {xhc} Taking a break from IA64 to develop a driver for the xHC (USB3) controller on my Dell Inspiron laptop.

7.10.17 {vc}/{cc8} convergence getop() code in {cc8} altered to produce 64-bit immediate constants by default, rather than requiring the (earlier) 0x prefix and q suffix.

4.10.17 EFI memory reclaim Starting now to build system heaps, using MEMDESC info supplied by EFI, initially chains of EFI free memory blocks with a small node record at the start of each. It is worth mentioning that memory blocks used in this way need to be allocated as EFI LoaderData (or any other memory type) - otherwise EFI may perform its own allocations/deallocations on the block which result in node records being overwritten.

To get things off the ground I have created three distinct heaps: dosheap (<1Mb), lowheap (<4Gb) and highheap (>=4Gb). Memory allocations can be directed at a particular heap using the named functions dosmalloc()/dosfree(), lowmalloc()/lowfree() and highmalloc()/highfree(), with malloc() and free() using highheap if available, otherwise lowheap.

30.9.17 EFI memmap x64 EFI MEMDESC dump ported to IA64 hits {vc} compiler error when attempting, for the first time, to return a pointer from a function with the call64 (or any) function modifier, e.g. CH$ call64 func(VD); Syntax for function modifiers in ANSI C and C++ is unclear to me, so I have as usual made up my own syntax, using (efi * func) to declare EFI function pointers in the BOOTSERV and RUNTSERV structures. It is also worth mentioning the three pointer operators which have now crept into my syntax - '*' is a regular pointer, 32-bit in x64/chaos , 64-bit in ia64/chaos. '$' is an explicit 64-bit pointer, and '#' is an EFI natural size pointer.

Anyway, checking for a function modifier before the pointer operator causes the {vc} compiler to fault on CH$ call64 func(VD); whilst CH call64 $func(VD); works fine - just not as obvious that the function returns CH$. So I have tweaked {vc} to check for a function modifier before and after pointer operators, so CH$ call64 func(VD); and CH call64 $func(VD); are now equivalent.

With this tweak in place, my x64 EFI MEMDESC dump source code ports unchanged to the IA64 project.

23.9.17 {vc}/{cc8} convergence Adding the asm {keyword} syntax to {vc} and {cc8} is not really enough to structure complete compile-time units which will pass through both of these compilers. It is usual to use C preprocessor directives to provide alternate pathways through a source file, so it makes sense to use my arch keyword to qualify preprocessor directives where possible, i.e. instead of

#ifdef IA64
#include {ia64stuff.htm}
#endif

one could use

#include ia64 {ia64stuff.htm}

and #if ia64 is a no-brainer.

With this tweak added to {vc} and {cc8} (just to #include a slightly different uefi.html for the x64 machine), my EFI BlkIO device sector browser compiles and works identically on x64 and IA64. The ability to compile and test code within an x64 UEFI environment, then move it easily into the IA64 project, is going to be mega-useful.

17.9.17 strtod() Spent a few hours reacquainting myself with the 32-bit versions of strtod which I have, in preparation for writing a 64-bit IA64/x64 library version. Never noticed before the slight downward rounding produced by x32/ChaOS.strtod() - this is because I set FPU ROUNDCHOP mode for easier double->integer conversions. The mathematical differences are insignificant, just one or two bits of mantissa, but I am thinking maybe I should use FPU ROUNDNEAREST mode for strtod, especially when it is called by my compilers.

Most people overlook the fact that floating-point numbers are in the main inexact; only a subset of real numbers drop exactly into floating-point encodings. Exact fractions are found where the number is a negative power of 2 multiplied by an integer, e.g. 7*1/2=3.5, 53*1/16=3.3125 etc.

Even with ROUNDNEAREST, for inexact fractions strtod produces a DB value slightly below the input string half of the time. Thus the mantissa needs to be processed beyond a given number of fractional digits to reproduce an input string such as 1.2345, after rounding, when the output of strtod is passed to dtoa. The problem arises when a decimal fraction ends with a 5 and strtod (very often) produces ...4999999999. For speed I prefer to process only one digit beyond the requested number of fractional digits for rounding purposes - no good when dtoaing 3 fractional digits, whilst strtod has produced 1.2344999999999etc for an input of 1.2345 - we get a display value of 1.234 instead of 1.235. Given that all of my inputs to strtod have fewer than 10 fractional digits, a useful kludge is to add 1 to the mantissa when the output seems to be non-recurring (e.g. NOT something like 0.1 = 0x3fb999999999999a). This produces 1.2345000000000something for an input of 1.2345, surely a nicer number.

13.9.17 EFI BlkIo sector browser Added a read/display sector option when probing the EFI BlkIO handle list, with a keyboard loop - '+', '-' and (g)oto keystrokes roam around the disk. Very handy.

13.9.17 {vc} arch id for inline asm{} blocks Added check for arch id string after asm keyword, with ignore flag now switched on if the asm block does not match the CPU architecture for the current compilation run. In other words, if {vc} is compiling for ia64 it will skip asm x64{} blocks; when compiling for x64 it will skip asm ia64{} blocks. This allows CPU-specific tweaks to be grouped together in C source code as fair dinkum assembly language instructions, instead of inside some faraway macro. {vc} now also accepts my '$' 64-bit pointer operator (used in my cc8 Intel64 compiler) as a synonym for the regular '*' operator. {cc8} uses 'call64' and 'call32' function modifiers to switch output between IA32e and Intel64 mode code, so by changing the default IA64 FTEMPLATE keyname from "def" to "call64", {vc} /a=ia64 can potentially compile {cc8} sources without modification.
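As an illustration of the arch-id check (this is not the actual {vc} lexer - the function name and the no-nested-braces simplification are mine), a toy filter that drops mismatched asm blocks might look like:

```c
#include <assert.h>
#include <string.h>

/* Copy src to dst, dropping any "asm <id>{...}" block whose id does not
   match target. Real {vc} does this inside the lexer while tokenizing;
   this sketch assumes no nested braces inside asm blocks. */
void filter_asm(const char *src, const char *target, char *dst)
{
    while (*src) {
        if (strncmp(src, "asm ", 4) == 0) {
            const char *p = src + 4;
            char id[16];
            int n = 0;
            while (*p && *p != '{' && *p != ' ' && n < 15)
                id[n++] = *p++;            /* collect the arch id */
            id[n] = 0;
            if (*p == '{' && strcmp(id, target) != 0) {
                while (*p && *p != '}')    /* skip the mismatched block */
                    p++;
                src = *p ? p + 1 : p;
                continue;
            }
        }
        *dst++ = *src++;                   /* matching arch: pass through */
    }
    *dst = 0;
}
```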

To allow {cc8} to compile {vc} sources, I have added a similar asm id{} check, and added code to generate and initialize argc, argsz and argv for ellipsis functions. As a quick test, IA64 wdispf(CH$ format,...) and its companion functions witoa, wdtoa, wstrrev and strtol now compile unchanged on either {cc8}->x64 or {vc}->IA64.

12.9.17 HP Integrity rx2600/rx2620 EFI_DEVICE_PATHs Probing EFI Device Paths for handles with the BlkIo protocol, I hit the dreaded ACPI Hardware Path type, which points into the ACPI tables to resolve HWP0002 and HWP0003. Anyone who has tried to write an ACPI table decode will know this is a job for another lifetime, but fortunately there is enough information in the EFI_DEVICE_PATH to deduce the encoding. EFI ACPI Hardware Path _HIDs are a 32-bit number I do not understand, but are kindly displayed by the EFI Shell as HWP0002 or HWP0003, and clearly indicate PCI Bus and AGP Bus respectively. EFI ACPI Hardware Path _UIDs will point to some Method or other in the ACPI table, but are easily recognized to be (PCI Bus Number<<3).

10.9.17 Itanium zx1 PCI Managed to reprogram the PCI buses (ropes) to direct I/O cycles to the BMC VGA whilst AGP PCI IO is turned off, proved by writing and reading back VGA registers in the BMC VGA which differ from those in the AGP VGA device. The MMIO window is more problematic, currently faulting the processor when accessed via bus 0xe0.

9.9.17 Itanium zx1 PCI Experimenting with a dual graphics configuration, using the AGP backplane with a Diamond Fire GL4 dual DVI card. EFI boot disables the BMC VGA and sets up the EFI console on the Diamond Fire DVI-0 output. Setting video modes via the Diamond Fire BIOS hangs the machine, I am guessing because the PCI config accesses in the BIOS are directed to I/O ports 0xcf8/0xcfc, which are not present on the zx1 MMIO controller. Support for a BIOS set mode here would require VM86 mode and I/O traps on these ports, redirected to the zx1 PCI config address and config data ports.

Setting up multiple VGA adapters is possible, provided only one VGA device is connected to the zx1 bus at a time. I have done this before to switch alternate PCI adapter BIOS ROM images into play, and invoke their respective BIOS set mode functions - up to the point where graphics modes/linear apertures are established, all the adapters can coexist on the PCI bus. This is a useful exercise in understanding the various buses and PCI bridges in a system.

IA64 SAL provides PCI config read and write functions, but on the rx2600 at least the config registers are directly accessible in the MMIO block at 0xfed00000. Following the Rope Configuration Base register at 0xfed003a8, the register blocks for bus 0x00,0x10,0x20... are at 0xfed20000,0xfed21000,0xfed22000... and are identifiable by a 0x103c:0x122e signature if present in the system. To post PCI config cycles on to a bus, simply write the seg:bus:dev:fn:reg to block+0x40, and read/write config data from/to block+0x48.
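The address/data access pattern above can be sketched in C. The layout of the address word written to block+0x40 is an assumption here (I have used the conventional PCI bus:dev:fn:reg bit positions, not anything taken from zx1 documentation), and a simulated register block stands in for the volatile MMIO pointer a real driver would use:

```c
#include <assert.h>
#include <stdint.h>

/* Post a config cycle: write the address to block+0x40, read data at
   block+0x48, as described above for the zx1 rope register blocks. */
#define CFG_ADDR  (0x40 / 8)
#define CFG_DATA  (0x48 / 8)

/* ASSUMED encoding: conventional PCI bus/dev/fn/reg bit positions. */
static uint64_t cfg_addr(unsigned bus, unsigned dev, unsigned fn, unsigned reg)
{
    return ((uint64_t)bus << 16) | (dev << 11) | (fn << 8) | reg;
}

/* Simulated register block standing in for the 0xfed2x000 MMIO window,
   so the access pattern can be exercised on a host machine. */
static uint64_t fake_block[0x1000 / 8];

static uint64_t fake_read_data(void)
{
    /* pretend dev 2 fn 0 responds with the 0x103c:0x122e id from above */
    if (fake_block[CFG_ADDR] == cfg_addr(0, 2, 0, 0))
        return 0x122e103c;
    return 0xffffffff;                    /* no device at that address */
}

uint64_t pci_cfg_read(unsigned bus, unsigned dev, unsigned fn, unsigned reg)
{
    fake_block[CFG_ADDR] = cfg_addr(bus, dev, fn, reg);   /* post address */
    return fake_read_data();                              /* fetch data  */
}
```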

The BMC console VGA appears on the rx2600 at bus:dev:fn(0xe0,2,0), but disappears from the EFI PCI listing when an AGP graphics card is installed. However the BMC VGA is still present on the PCI bus, with IO and MEM access disabled. So it should be possible to deactivate the AGP card, configure the BMC VGA, display something on the screen, then decouple it from the PCI bus and switch back to the AGP card. There are two or three other registers on the zx1 to direct VGA cycles down the appropriate bus, so some trial and error will no doubt be needed to get this to work.

5.9.17 Itanium SVGA graphics Dug out my old VGA reference manual (Wilton 1987) as used to develop Lotti all those years ago - so much easier to understand second time around. EGA/VGA was a quantum leap in graphics in its day, but confused me greatly at that time. Taking a fresh look I now see the method in the madness, and how it provided a neat way to read or write 32 graphics bits in one cycle of a 16-bit CPU.

Took only an hour or so to knock up writepixel, hline, vline, and gchar functions, along with cgasavescreen, cgarestorescreen, to be able to save and restore the EFI console. So I can now begin to develop base functions for a GUI on the Itanium, so long as I avoid debug breakpoints whilst in graphics mode. In time debug_donkey will substitute EFI console output in favour of graphical output when a graphics mode is active.

3.9.17 Itanium Video BIOS call Running through VESA BIOS calls, modes with resolution higher than 640x480 (apart from mode 0x6a, 4-plane 800x600 SVGA mode) are actually flagged unsupported in hardware, so the bog-standard Itanium VGA output is rather limited. Also the EFI/BMC console locks up when one of these limited graphics modes is selected. Ah well.

27.8.17 Itanium Video BIOS call Fleshed out the .x86r assembler enough to setup parameters and make entries into the Int 0x10 Video BIOS code. This is far easier than poking opcode bytes into Itanium low memory locations. With a whole load more handlers in place (ITLB miss, DTLB miss, alternate ITLB/DTLB etc) and page-not-present DTLBs for video memory at 0xb0000 and 0xa0000, the rogue instructions which take the system down (on Int 0x10/set mode 3) can be narrowed down to OUTSW page B8000 (clear screen memory), MOVSB page c0000 -> page a0000 (program character set) and STOSB page a0000 (character set also).

Emulating these three cases is tricky, especially because my debug donkey uses the Itanium Register Stack Engine to preserve r1-r15. I envisaged that writes to the backing store would be needed to update registers for the IA32 machine, e.g. advancing EDI and decrementing ECX on a STOSW emulation. Happily this approach works (I told you the Itanium is a simple beast). The slightest coding error here hangs the machine, but after a good deal of trial and error, I finally have the Video Mode Set running through to the IRET intercept. Curiously this mode set produces a greyscale 80x25 CGA text screen, but I have done plenty enough VGA programming to recognize this as a DAC palette issue. Pushing the usual 64 entries into the VGA DAC flicks the display back into its proper 16-colour CGA glory.

Noticing that my emulations ran without ld.acq/st.rel semantics, I tried replacing the not-present VGA DTLBs with present DTLBs identity-mapped into physical uncached memory, i.e. virtual 0xb0000 -> 0x80000000000b0000, then disabled the emulations. With this mapping the BIOS calls run OK, so maybe the simplest of solutions was under my nose! However I now also have a template for instruction emulation with client register tweaking, which is mega-useful. Similarly, trapping the IA-32 intercept, matching the IA-32 IRET opcode and poking new values into cr.ipsr and cr.iip produces a smooth return to IA64 mode after a BIOS callout.

As I had hoped, graphics modes can be set on the Itanium VGA hardware without breaking the remote console. So glad now that I picked up an AGP backplane and an original Itanium twinhead Radeon card last year - looks like I will be able to make it work. Is it really only 16 weeks since I started my IA64 compiler?

25.8.17 ia32 disassembler Ported about half of my ChaOS x64 disassembler to the IA64 project to produce the first mnemonic ia32 disassembly inside the IA64 debug donkey. Will improve this in parallel with development of the new {vc} assemblers.

22.8.17 ia32 modes for {vc} Restructured {vc} compiler as planned to allow multiple modes for each /arch definition. Quite simply each processor mode can invoke a different assembler in the compiler back-end. For the moment, processor mode is switched by assembler .pmode directives, but eventually will be switched by the C compiler as needed, by #preprocessor directives and maybe a special function modifier. Initial pmodes for IA64 are ia64,x86r and ia32, providing support for a 16-bit and 32-bit IA-32 code segment, in addition to native Itanium mode. To aid with checking and testing, I have added a second processor set for amd/intel64, also with three processor modes x64,x86r and ia32. For 16-bit and 32-bit code, /a=ia64 and /a=x64 now use exactly the same assemblers.

20.8.17 ia64 ia-32 BIOS call Experimenting with ia-32 mode to fathom how to perform a video BIOS call into the VGA card. Tried all sorts of variations, real-mode, protected-mode and particularly vm86 mode, thinking this would be the preferred route. Because IRET is an intercepted instruction (i.e. has to be emulated) I chose PUSH (int 0x10 offset), PUSH (int 0x10 seg), RETF as a route into the Video BIOS entry point, but achieved only a wall of vm86 GP and Stack faults. In 16-bit real mode, the faults are fatal Machine Checks. Maybe I should have listened to my own advice and done a br.ia straight into the Video BIOS, however...

...noticing that the same instructions sometimes run fine, other times cause these Machine-Checks, I gradually focussed in on the RET and RETF instructions, and proved that the Machine Check occurs when these opcode bytes are exactly 10 bytes ahead of the current single-step fault i.e. as they enter a 10-byte instruction pipeline. Furthermore executing a CALL instruction before RET or RETF enters the pipeline prevents the Machine Check fault. Whether there is an error in the Itanium ia32 machine here, or an error in my setups is a question for another day. Clearly the ia-32 machine is decoding, speculating, branch-predicting ahead of the current ia32 instruction, another fascinating insight into the Itanium processor.

In ia-32 real mode, provided KR0 is set to IOBASE, the IN instruction produces recognizable i/o port values from the VGA hardware. Using the direct br.ia entry method, selected Video BIOS calls run to completion, i.e. to the IRET intercept fault. BIOS functions which touch video memory (such as set VESA mode) inevitably fail, because these addresses are not yet mapped into uncached memory pages. Interestingly Int 0x10/AX=0x8003 (set 25x80 CGA mode, retain screen contents) runs to completion, because this is the mode already set for the EFI console.

So there is still work to do to get Video BIOS calls working fully. Added a Page-Not-Present handler and mapped the VGA aperture to a not-present data TLB. This faults any instructions reading or writing VGA memory addresses. Added a handler to emulate the REP STOSW instruction which is used by the int 0x10/9 write char(s) at cursor BIOS call. For this function at least, the characters appear on screen. The complete solution could involve emulation of many ia-32 memory read/write instructions, but I have proved it can be done. Emulations can be added as each faulting instruction is discovered, until the BIOS mode set call runs through to the IRET intercept. The goal is to be able to use the BIOS to set VESA graphics modes. With this done, custom routines can read/write the screen directly using virtual or physical uncached address mappings, with ld.acq and st.rel semantics.

13.8.17 ia64 ia-32 mode debug Added an ia-32 intercept handler, working towards an attempt at executing a Video BIOS call; seeing this fault in action I now realise the swathe of instructions which have to be emulated to construct a working ia-32 machine. This explains why ia-32 mode was eventually dropped in favour of pure software emulation - the PSR.is bit only runs the easy ia-32 instructions. ia-32 instructions such as INT and IRET simply fault the Itanium and provide a pointer to help in skipping the offending opcode bytes after an emulator completes its work.

No matter, a call into the Video BIOS should be possible by a br.ia straight into the Video BIOS, with the IRET intercept fault signalling the end of the BIOS call? Who knows.

10.8.17 ia64 ia-32 mode debug Managed to perform my first Itanium ia-32 mode switch via the br.ia instruction, discovering along the way that my virtual memory mode switch was not quite what I thought. Having used ssm to flick the PSR translation switches dt, it and rt, I missed the fact that ssm can only set bits 0:23, so I have been running a partial vm mode, with only data address translation switched on. This has been sufficient to work out and prove setups for region registers, translation cache entries and virtual addressing for VGA screen memory. Added a keystroke to the debug donkey to flick all these bits on; happily my itc setups must be OK because everything keeps running. Attempting the ia-32 mode switch without dt and it is undefined, but actually results in an instant fatal machine check.

To keep the debug donkey from falling over (remember it calls out to the EFI console, which doesn't do virtual mode) I switch psr.dt off for the display and keyboard loop, then rfi restores psr from ipsr on exit. In this way I can effectively single-step the Itanium in virtual address mode. You see, deep down, the Itanium is deceptively simple. I now have a pretty solid base on which to research and develop the vm handlers. As ever the equation is installhandler + (improve debug donkey).

Switching on virtual addressing is easy from within an interrupt handler, but tricky otherwise (need to set up all the interrupt control registers, then execute rfi). Taking an idea from IA64 Linux, a quick and easy workaround is to sample iim inside the break handler (like a syscall). Added a bit of code to debug donkey to do this, now break 0x12345 can be used to properly start virtual address mode. The EFI shell runs happily with psr.rt and psr.it set (and identity-mapped translations of course), so there is no need to switch these bits off just yet.

With another couple of handlers installed on IA32 exception and IA32 intercept, my machine will single-step into ia-32 mode, to a low memory address where I have placed an x86 NOP and JMPE instruction. Single-step continues back into IA64 mode, eventually falling over due to a loss of RSE context - single-step in ia-32 mode seems to destroy the contents of IFS, not quite sure why. I may need a separate entry path for the ia-32 handlers into the debug donkey. Nevertheless this problem is overcome by using an MSTACK function template for the wrapper function, i.e. ar.pfs is saved on the memory stack rather than the register stack during the ia-32 callout. The mov ar.pfs=reg instruction is now the only one which cannot be single-stepped, but my ia-32 callout now works. Roll on x86 video BIOS calls.

Yes, I know ia-32 hardware support disappears from the Itanium after the Madisons, but ChaOS uses only a subset of the x64 instruction set. Writing an ia-32 software emulator is cumbersome but not difficult, with 128 64-bit registers compared to the 8 32-bit registers on ia-32. Indeed this is the approach adopted by HP engineers for PA-RISC as well as IA-32. Has anyone written an AMD64 emulator for IA64 yet?

2.8.17 ia64 debug Whilst far from finished, the disassembler now covers all common A, I, M, B, F and X-unit instructions, so is becoming really useful. Added (a)uto single-step, to stress-test the debugger by running many thousands of single-step cycles. Runs fine through ChaOS code, but runs into processor faults while single-stepping EFI display functions. Tried saving a few more registers, i.e. bank 1 r16-r31, which generally hold invalid/misaligned addresses at the faulting instructions (though the ChaOS compiler does not touch these registers). This stops the faults, but auto-single-step then gets into an endless loop. Of course the debug handlers are invoking EFI for console output, so the single-step might be looping on a display list within the firmware.

1.8.17 ia64 debug fault Added a debug fault handler, and switched on the IPSR.db bit to enable instruction breakpoints. Added an (e)xit keystroke to the debug donkey which sets a breakpoint on b0 as saved on entry to the debug handlers, and (n)ext to set a breakpoint on iip+0x10. So now I can step over a function call, or single-step into it and skip out at any point. Instruction breakpoint faults occur on each of the three instructions in a bundle if the ibr[ ] registers are not cleared. Single-step through the bundle can be achieved by setting IPSR.id for each step. Frankly it is easier just to clear the ibr[ ]s after the breakpoint has been hit.

30.7.17 ia64 single-step Another fascinating product of my single-step handler is that processor faults can occur on dependency violations, such as when a stop is missing from the code stream for the desired result. Not yet having tried to understand IA64 speculative instructions, I sort-of expected that at interrupt time in-flight values would all have time to arrive at their destination. I had not considered that these are all carefully designed, finite states in a parallel execution engine. As such the single-step interrupt provides a proper window into this area, to the point of raising the same faults on single-step as would occur when the parallel execution units are free-running.

30.7.17 ia64 debug donkey/Floating point registers Introducing a floating point register display inside debug donkey handler produced some interesting results along the way, all adding to my understanding of the fascinating Itanium processor. Initially using stfd and ldfd to save and restore some FP registers to a memory stack buffer, I became intrigued when FP values seemed to disappear during single-stepping. The obvious reason is the registers in question are being overwritten, maybe during a hardware interrupt, so I tried a static memory save area etc, all with the same result - FP values disappear when the setf.sig is single-stepped.

The answer to this problem comes when stf.spill and ldf.fill are used to save the full 82-bit FP register contents. setf.sig simply copies a 64-bit integer into the significand of the register, and sets the (biased) exponent to 0x1003e. This creates a valid floating point number, but the top bit of the significand is zero. Such a value has no 64-bit FP counterpart (64-bit doubles have an implied '1' as the top bit of the significand). It can be normalized by fma fr=fr,f1,f0, but for the single-step between setf.sig and fma, a 64-bit double cannot save the in-register value. This is an insight into the rawness of the Itanium, from whence comes its speed.
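The arithmetic behind setf.sig can be checked on any machine: the FP register value is significand * 2^(exponent - bias - 63), with a 64-bit significand carrying an explicit integer bit and a 17-bit exponent biased by 0xFFFF. Plugging in the setf.sig exponent of 0x1003E gives a scale factor of exactly 1, so the register value equals the integer. A sketch (the helper name is mine):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define FP_BIAS 0xFFFF   /* 17-bit exponent bias in the FP register file */

/* Value of an Itanium FP register with the given 64-bit significand
   (explicit integer bit) and biased 17-bit exponent, ignoring sign. */
double fpreg_value(uint64_t significand, unsigned exponent)
{
    /* significand is an integer scaled by 2^(exponent - bias - 63) */
    return ldexp((double)significand, (int)exponent - FP_BIAS - 63);
}
```

With exponent 0x1003E the scale is 2^(65598-65535-63) = 2^0, so fpreg_value(42, 0x1003E) is 42.0 - an unnormalized encoding whose top significand bit is clear whenever the integer is below 2^63, exactly the value a 64-bit double cannot round-trip.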

With 82-bit FP save/restore wrapped around the debug handler, and a rough-and-ready dtoa function, I can now single-step mixed integer/floating point calculations such as those generated by {vc} for the multiply, divide and modulo operators. How quickly those Newton-Raphsonian iterations converge for small integers! (Don't try baremetal IA64 programming unless you understand Newton-Raphsonian iterations, or are prepared to learn what they are). The FP fault and GE fault handlers have saved hours of reboot time this week by trapping my coding errors along the way.

26.7.17 ia64 debug handlers Added predicate register save/restore to debug handler wrapper, to properly single-step compare instructions. Added ifs decode to discover valid loc registers range, and a function to use this knowledge to locate and display r32->rxx from the RSE backing store. Added General Exception, Unaligned Reference and Floating Point Fault handlers, all using the common debug donkey function. Faulting instructions (like break) can be skipped by pressing j.

These handlers are a great addition to my ia64 debug toolkit, saving the reset/reboot cycle I have had to bear every time I make a mistake. Deliberate processor faults are equally satisfying.

25.7.17 ia64 debug handler and RSE backing store Attempted to use cover and flushrs to make the singlestep handler's saved registers visible by inspecting memory below BSP/BSPSTORE, but this caused processor faults. Settled on flushrs only, but this only flushes the outer function state to the backing store - so inserted an extra function which saves register state to loc registers before calling the singlestep donkey function. Registers in the backing store cannot be accessed via a struct, because the RSE inserts a NatVal collection whenever BSPSTORE&0x1f8 equals 0x1f8. Added a set of #defines for saved registers r1->r15 representing the order in which they were saved, and a helper function to skip the NatVal while repeatedly subtracting 8 from BSP until the required register is reached.

Registers used in the coding of the function calls (r1,r4,b0,b7) are saved in the interrupt handler before the intermediate function is called - so these are retrieved from further down the RSE backing store. Using these tweaks I now have a register display for the singlestep handler, really useful for verifying that disassembler display matches live register operations.

24.7.17 ia64 disassembler Finished X-unit disassembler, including long branch instructions, though I will struggle to build a program large enough to need them. Started I-unit disassembler, completing the opcode 0 set (break,nop,hint,mov from IP/PR, sxt,zxt,czxl/r).

23.7.17 ia64 disassembler Added the first glimpses of a disassembler to the single-step handler. Starting with the X-unit (with the smallest instruction set!), and decoding just the MOVL instruction, else showing just instruction pointer/slot and raw 41-bit unbundled instructions. Added assembler support for NOP.X and BREAK.X, then used the embryo disassembler to spot some latent bugs in the X1 and M44 encodings. Parallel development of an assembler and disassembler for any architecture is always a potent combination for weeding out coding errors. Decoding the MOVL imm64 is a test for any programmer, even with a mainstream compiler.

Although the NOP.X and BREAK.X encodings accommodate an immediate value up to 62 bits long, I observe that only 21 bits make it to cr.iim, i.e. the L slot is ignored by this implementation (rx2600). More interesting is that break instructions encountered whilst single-stepping are no problem: the break handler is re-entered, and single-stepping can be resumed. At this point I should have singlestep->break->singlestep nested on the stack, but have not attempted a stack dump to prove this. In any case, clearing ipsr.ss has the system running as normal, so who cares. Using break.x requires a tweak to the break handler, i.e. the need to skip 2 slots instead of 1 to get over the faulting instruction. I do this by checking for the MLX template (t&0x1e)==4 AND slot==1.
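The slot-skipping rule can be expressed in a few lines of C - a sketch of the logic only, not the actual handler code (which rewrites cr.iip and cr.ipsr.ri):

```c
#include <assert.h>

/* Advance past a faulting instruction: normally bump the slot, rolling
   into the next 16-byte bundle after slot 2. For an MLX bundle
   ((template & 0x1e) == 4) the X instruction at slot 1 consumes the L
   slot too, so 2 slots are skipped - the break.x tweak described above. */
void skip_slot(unsigned long long *iip, unsigned *slot, unsigned template_)
{
    if ((template_ & 0x1e) == 4 && *slot == 1)
        *slot = 3;                 /* X instruction occupies slots 1+2 */
    else
        *slot += 1;
    if (*slot > 2) {               /* roll over into the next bundle */
        *slot = 0;
        *iip += 16;
    }
}
```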

21.7.17 {ia64/chaos} Single Step With only 16 bundles available in the IVT for a single-step interrupt, further reduced interrupt handler template to be a bare function dispatcher, with all the tricky code moved to the donkey callout. This sparse interrupt handler occupies only 8 bundles (0x80 bytes), so it can be used for any slot in the Itanium interrupt system. Thus added a single-step interrupt handler.

Added a keypress filter to the break handler, to set IPSR.ss on a certain key value. This causes the single-step handler to be entered, which also waits for a keypress. Displaying cr.iipa whilst leaving cr.iip unchanged seems to produce the desired effect, with execution advancing on each keypress through slot0, slot1, slot2 before advancing to the next instruction bundle. Way to go!

16.7.17 {ia64/chaos} Improved the interrupt handler further, towards the ia64 heavy specification, i.e. to call out to a sophisticated handler with interrupt collection and external interrupts switched back on - aiming to hold execution in a loop calling EFI readkeystroke(). One has to remember that space in the Itanium Interrupt Vector Table is limited, as I found to my cost when my developing break fault handler grew beyond 0x400 bytes (the handler immediately after is the external hardware interrupt, oops). The solution to the size problem is simple enough: just place a wrapper function in the IVT which calls out to the donkey function. Critical to getting external interrupts working for the callout is the switch back to bank 1 registers before any EFI system interrupts happen, otherwise the system crashes. Note that the bsw instruction which performs the bank switch must be the last in an instruction group, otherwise its operation is described as undefined. In fact bsw faults the processor if used without a stop immediately after.

Happily, after a good deal of trial and error, I now have code which can wait for an EFI keystroke from inside a fault handler. Obviously the ability to accept keyboard input in this situation is a massive step towards interactive debugging on the Itanium platform.

15.7.17 {ia64/chaos} Working on a console display suite, centred around wdispf(CH* format,...), a printf-style function producing formatted output to the EFI console. Whilst VGA direct draw functions are technically easier, using the EFI functions carries the advantage that output is mirrored to the serial console or BMC telnet connection, allowing remote viewing - pretty much essential given the noise level of the rx2600 fans. wdispf places output into a static ring buffer, to get around the temporal problems with stack addresses passed to EFI (18.6.17). Support functions now working well include strtol; witoa (itoa for wide characters, with enhancements such as max output length, zerofill ...); the ctype functions isdigit(), isalnum() etc; and reduce(number,base), a divmod function to halve the math workload of witoa.
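The static ring buffer trick can be sketched in standard C - here narrow chars and vsnprintf stand in for the wide-character EFI formatting, the function name and sizes are illustrative, and the point is simply that the address handed to the console is always static, never on the stack:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define RINGSIZE 1024
static char ring[RINGSIZE];       /* static storage, safe to pass to EFI */
static unsigned ringpos;

/* Format into the ring and return a pointer to the static copy.
   Recent messages survive until the ring wraps around. */
const char *ringf(const char *fmt, ...)
{
    char tmp[256];
    va_list ap;
    va_start(ap, fmt);
    int len = vsnprintf(tmp, sizeof tmp, fmt, ap);
    va_end(ap);
    if (len < 0)
        return NULL;
    if (len >= (int)sizeof tmp)
        len = sizeof tmp - 1;                 /* truncated */
    if (ringpos + len + 1 > RINGSIZE)
        ringpos = 0;                          /* wrap to the start */
    char *out = ring + ringpos;
    memcpy(out, tmp, len + 1);
    ringpos += len + 1;
    return out;
}
```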

The result is a flexible EFI console display function which works inside an interrupt handler. Just a few minutes with this up and running produces a break fault handler which can display the bundle address and slot number of a break instruction, then adjust the slot and bundle address (if at slot 2) by rewriting the IIP and IPSR registers to skip the break instruction. I now have the development template for any Itanium fault/exception/interrupt handler!

8.7.17 {vc} debug info Revised the variable-length SY structure (symbols) to align records on 4-byte boundaries, and added a locr member to store the RSTK register number, leaving rtaddr free for the MSTK shadow address. Added a MOVLOCMLOCR pcode to move the RSTK value to MSTK when the & operator is first used on a register variable. This completes a fix for the address-of-register problem.

7.7.17 {vc} debug info Revised the variable-length UT structure (type system) to align records on 4-byte boundaries - this to mitigate problems (IA64 alignment problem, 18.6.17) when accessing these structures down the line (for compile and debug purposes). Besides padding UT namestrings, UT->prev is now reduced to one byte, recording the dword count of the previous UT record. This limits a UT record to 1020 bytes, sufficient to describe structures with up to 250 members - should be plenty.

Recompiled armc1 with the new UT regime, happily pi/chaos still works fine.

1.7.17 {vc} efi FTEMPLATE Mystified temporarily by the corruption of new ellipsis mstack variables by callouts to EFI functions - only to realise that arguments passed in registers to EFI require memory stack shadow space. This boils down to a misunderstanding on my part of the implementation of register variables in a modern compiler. ANSI C states plainly that the address of a register variable cannot be taken, whereas ANSI C++ states that the register keyword will be ignored if the address of the variable is taken. The only way this can work is to allocate memory stack shadow space for all register variables, and to switch to the memory stack whenever the & is used. My first Itanium compiler, {itc1}, uses only the register stack; approaching the problem from the wrong direction is producing interesting results along the way.

28.6.17 {vc} varargs Revised ellipsis handling to cope with variable stack alignment. Rather than defining argv as an array, constrained to the prevailing stack alignment (23.6.17) I have decided to use explicit addressing, dragging the current mstackalign value into the function body itself.

So now ellipsis generates three automatic symbols, UC* argv is a simple pointer to the first stack argument, integer argc is the argument count, and integer argsz represents the byte spacing between mstack arguments. argc and argsz are initialized by generating code at call time to pass values in r0/r1 (r8/r7 on IA64) which are moved on function entry into the relevant local variables.

On function entry, argv can be cast to any type appropriate to the conceptual argv[0]; subsequent arguments are accessed by simply adding argsz, e.g. VD* nextarg=(VD*)(argv+=argsz);

Recognizing that the traditional SL main(UL argc,CH* argv[]) prototype only works when mstackalign equals sizeof(CH*) is a revelation.
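The scheme above can be simulated in plain C - a hedged sketch, with the buffer and sum function invented for illustration; only the argv/argc/argsz names come from the text. The caller lays variable arguments out at a fixed byte spacing (argsz), and the callee walks them with explicit pointer arithmetic rather than an array type:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Walk argc variable arguments spaced argsz bytes apart, summing them
   as 64-bit integers (the conceptual argv[0], argv[1], ...). */
static uint64_t sum_args(uint8_t* argv, int argc, int argsz) {
    uint64_t total = 0;
    for (int i = 0; i < argc; i++) {
        uint64_t v;
        /* conceptual argv[i]: copy the slot into the expected type */
        memcpy(&v, argv + (size_t)i * (size_t)argsz, sizeof v);
        total += v;
    }
    return total;
}
```

With a 16-byte mstackalign, each 8-byte integer argument occupies only the low half of its slot, exactly the naturalsize/stack-granularity mismatch the FTEMPLATE system is designed to tolerate.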

23.6.17 {vc} varargs Added ellipsis handling to call FTEMPLATE and fctbody FTEMPLATE by forcing the variable function arguments on to the memory stack. These variable arguments are stored on the memory stack in reverse order, so that they can be addressed from within the function as a simple array with element size equal to FTEMPLATE->mstackalign. Added code to create a local symbol argv to make this array accessible to the ellipsis function body. Whilst it is possible to arrange the memory stack for functions like printf with different-sized elements, this makes no sense going forward with the alignment-check constraints on 64-bit processors.

FTEMPLATEs allow mismatches between naturalsize integers and stack granularity, and I am currently using 16-byte stack alignment even though my Itanium project currently uses at maximum only half of each stack element - this just to make sure no naturalsize dependencies creep into the compiler code. Added a new 128-bit integer to the {vc} inbuilt type system (US). Where stack alignment equals natural pointer size, it is convenient to access the stack via UC* argv[]. Where these sizes are mismatched, UL argv[] (32-bit stack), UQ argv[] (64-bit stack) and US argv[] (128-bit stack) are used instead. This way varargs can be accessed intuitively using argv[0],argv[1],argv[2],... from left to right, with typecasts if appropriate.

This FTEMPLATE system may seem overly complex, but by dissociating stack granularity and natural integer size, all the ingredients are in place to compile IA32 code to use a 64-bit stack, thus making mixed-mode code much more straightforward.

18.6.17 Itanium Data Alignment Beginning memory stack structure, union and array addressing now, aware that PSR.ac is set by default at EFI Boot Manager time, so it is easy to force an alignment check fault. What I did not expect was that clearing PSR.ac does not stop the alignment check faults, i.e. these check faults are an architectural feature, not soft warnings. Having been used to (almost) complete freedom regarding data alignment (i.e. coming from IA32), this comes as a shock. No doubt data member alignment in public structures such as EFI follows rules which match the architectural limitations of IA64, x64 and ARM. Generally, structure member addresses should be mod membersize for arithmetic and pointer types to avoid this problem, so a compiler must insert padding bytes automatically. As a quick fix, I have just added a compiler warning where this rule is violated; I am not keen to add automatic padding until I have studied how this is done in the wider scheme of things. Meanwhile the Itanium processor will surely let me know when I get the alignment wrong.
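The "offset mod membersize == 0" rule reduces to one line of arithmetic - a sketch of what an automatic-padding pass would compute for each scalar member (function name invented):

```c
#include <assert.h>
#include <stdint.h>

/* Given the running offset within a structure and the next member's
   size, return the padded offset that keeps the member naturally
   aligned - i.e. offset mod membersize == 0, so IA64's alignment
   check (PSR.ac or not) never fires on a plain load/store. */
static uint32_t align_member(uint32_t offset, uint32_t membersize) {
    return (offset + membersize - 1u) / membersize * membersize;
}
```

A compiler inserting padding would place each member at align_member(current_offset, sizeof member) and advance from there.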

18.6.17 Itanium EFI memory stack Using EFI memory stack to build a string for output to the EFI console produces an interesting yet useless result, i.e. the memory stack content is overwritten before the console gets the display data. This indicates that Itanium EFI console output is asynchronous, being accumulated perhaps as a series of memory pointer/length records to be processed later (maybe on a timer interrupt which uses the same stack?). Therefore display data must be placed in a persistent memory area for it to make it to the console. Furthermore, the garbage output to the screen via the memory stack is different to the garbage sent to BMC, indicating a further time-lapse between screen and remote console display processing.
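The workaround implied above can be sketched as follows - copy the string into storage that outlives the caller's memory stack frame before handing it to the console. The CHAR16 typedef and function names are illustrative; only SIMPLE_TEXT_OUTPUT_PROTOCOL's OutputString (which takes a CHAR16 string) is real EFI:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t CHAR16;

/* Persistent buffer: static storage survives memory stack reuse,
   so the (possibly asynchronous) console still sees valid data. */
static CHAR16 conbuf[256];

static const CHAR16* persist(const CHAR16* s) {
    size_t n = 0;
    while (s[n] && n < 255) { conbuf[n] = s[n]; n++; }
    conbuf[n] = 0;
    return conbuf; /* pass this, not the stack copy, to OutputString */
}
```

The cost is that only one string can be in flight at a time; a ring of such buffers would be the obvious next step if output overlaps.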

Calling EFI console output functions from within a break interruption handler is similarly frustrating, even if persistent memory locations are used. This effect can be sidestepped by using functions which write directly to the VGA memory buffer. Fortunately, EFI screen console output appears to perform an IO read of the VGA text cursor before each output phase - so custom display functions can mesh with the EFI console by simply updating the VGA cursor position.
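Meshing with the EFI console via the VGA cursor boils down to standard VGA text-mode arithmetic - a hedged sketch (the firmware must actually route the legacy VGA ranges for this to work on Integrity hardware): cell offsets into the 0xB8000 text buffer, and the two CRTC cursor-location bytes written via index 0x0E/0x0F at ports 0x3D4/0x3D5:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a character cell in the 80-column text buffer
   (2 bytes per cell: character + attribute). */
static uint32_t vga_cell(uint32_t row, uint32_t col) {
    return (row * 80u + col) * 2u;
}

/* CRTC cursor location high byte (written at index 0x0E). */
static uint8_t cursor_hi(uint32_t row, uint32_t col) {
    return (uint8_t)(((row * 80u + col) >> 8) & 0xFFu);
}

/* CRTC cursor location low byte (written at index 0x0F). */
static uint8_t cursor_lo(uint32_t row, uint32_t col) {
    return (uint8_t)((row * 80u + col) & 0xFFu);
}
```

After drawing directly into the buffer, writing these two bytes keeps the EFI console's subsequent IO read of the cursor in sync with the custom output.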

17.6.17 Auto Memory Stack Local Variables {vc} default FTEMPLATE for ia64 places local variables on the register stack in the first instance, equivalent to placing register keyword in front of all local data declarations, and works well, up to a point. Declarations which will not fit in a register naturally go on to the memory stack, which is easy because this can be decided as the data declaration is processed. But using register locals as the default throws up the awkward case where an attempt is made to take the address of that variable - this is only possible for memory stack variables.

Rather than revert to a full register declaration for all locals, I inserted code in doaddressofoperator() (& operator handler) which switches variables from the register stack to the memory stack when this happens. This violates ANSI C, which states that address-of register is just not allowed; however ANSI C also says that the register keyword is no more than a hint to the compiler - i.e. if the compiler runs out of registers then the variable will be on the memory stack anyway, and the address-of operator is perfectly valid.
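The demotion performed in doaddressofoperator() can be sketched as a symbol-table operation - all names here are invented for illustration; only doaddressofoperator itself is from the text:

```c
#include <assert.h>

/* Where a local symbol currently lives. */
typedef enum { ON_RSTACK, ON_MSTACK } Home;
typedef struct { Home home; int moffset; } Sym;

static int nextmoffset = 0; /* running memory stack allocation */

/* Called from the & operator handler: a register-stack local gains a
   memory stack slot (and hence an address) the first time its address
   is taken; repeated & is harmless. */
static void demote_to_mstack(Sym* s, int size) {
    if (s->home == ON_RSTACK) {
        s->home = ON_MSTACK;
        s->moffset = nextmoffset;
        nextmoffset += size;
    }
}
```

This matches the C++ reading of register-as-hint: the keyword costs nothing, and the address-of operator simply changes where the variable lives.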

14.6.17 Itanium break Interruption Working now on FTEMPLATE for ia64 interrupt function modifier, using the software break at IVA+0x2c00. A simple memcpy of a handler function into the IVA, followed by srlz.i;;srlz.d;; seems to work fine. The Itanium Register Stack Engine (RSE) can be used inside this handler - the trick is to insert cover;; before alloc; there is no need to restore ar.pfs (this is done automatically by rfi following cover). Managed to perform EFI display calls within this interrupt handler - hence nesting allocs is no problem - though EFI console display is truncated/garbled somehow. My VGA direct-draw functions work just fine. This facility will make the development of other handlers much easier than I had expected (how could I develop a TLB miss handler without it?).

11.6.17 Itanium SAL call is straightforward - worked first time, using a cast of the salproc entrypoint location in the SAL System Table to an efi function pointer (the entry point is followed by the GP for the SAL call, hence this is effectively a PLABEL):
salproc=(SQ (efi#)(UQ a,UQ b,UQ c,UQ d,UQ e,UQ f,UQ g,UQ h))&ss->salproc;
(*salproc)(0x01000012,0,0,0,0,0,0,0); //SAL_FREQ_BASE
This returns 200000000 in r9 on the Integrity rx2600, the 200MHz platform base frequency.

10.6.17 {vc} With FTEMPLATE now allocating a varying number of sys registers, local symbol addressing from within asm{} blocks is variable too. Added _lr# aliasing to the register stack, similar to loc# but offset by the sys register count, which is known at compile time. Thus asm{} blocks inside fctbody can reference local symbols as _lr0,_lr1 instead of loc3,loc4, which are dependent on the correct FTEMPLATE being in use.
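As I read it, the alias arithmetic is a one-liner - a sketch with invented names: with three sys registers allocated, _lr0 resolves to loc3, _lr1 to loc4, and so on.

```c
#include <assert.h>

/* Resolve an _lrN alias to its loc# register number, given the number
   of sys registers the active FTEMPLATE allocates ahead of the user
   locals (known at compile time). */
static int lr_to_loc(int n, int syscount) {
    return n + syscount;
}
```

The point of the alias is that asm{} blocks stay valid when the FTEMPLATE changes the sys register count underneath them.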

10.6.17 {vc} and Itanium PAL call Attempting to produce a FTEMPLATE for an Itanium PAL call, using some MSTACK control fields to pass arguments on the memory stack.
Of course this does not work, I misread the meaning of stacked registers as opposed to static registers in the PAL sense. So for anyone else who tries this,
PAL static registers are r28,r29,r30,r31
PAL stacked registers are r32,r33,r34,r35, also set r28 equal to r32
PAL return registers are r8,r9,r10,r11.
Took a little while to work that one out. A PAL call also needs an assembly language wrapper to calculate a return address, avoiding br.call for the entry - which is easy in the ChaOS IA64 model, since code label values generate a simple GP-relative offset, as in this snippet:
movl r8=palreturn;;   //offset
add  r8=r8,r1;;       //add GP
mov  b0=r8
mov  r28=in3
mov  r29=in2
mov  r30=in1
mov  r31=in0
mov  b7=in4;;
br.cond.sptk.many b7;;

10.6.17 {vc} FTEMPLATE developed further, with two FTEMPLATE pointers active in fctbody(), one for the function being compiled, another to control generation and stack positioning of function arguments. FTEMPLATES feed into a FRAME structure, which contains address mapping for five distinct stack areas (ins,syslocals,locals,tmps, outs) on both the register stack and the memory stack. syslocals (things like ar.pfs, function return address etc) are controlled by bit settings in a flags parameter, one for each stack.

Whilst necessarily complicated, FTEMPLATE coding falls almost entirely into three blocks -
(1) in addresslocalsymbols(), to allocate the sys area between ins and locals
(2) in flushpcos(), case ENTERP, to generate code on function entry
(3) in flushpcos(), case LEAVEP, to generate code on function exit

As a final tweak, temporary copies of FTEMPLATE->rsysflags and msysflags are used by fctbody for code generation. This allows tweaks to the stack frames, by setting or clearing bits as desired - e.g. (ia64) no calls to other functions, so STACKIP (save b0) and STACKGP (save r1) can be skipped. Therefore a function with no stack requirements could generate no code for ENTERP, and just br.ret b0 for LEAVEP.
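The flag-tweaking idea can be sketched in a few lines of C - the bit names STACKIP and STACKGP come from the text, but their values and the structure layout are invented:

```c
#include <assert.h>
#include <stdint.h>

#define STACKIP 0x01u /* save b0 (return address) */
#define STACKGP 0x02u /* save r1 (global pointer) */

/* Cut-down FTEMPLATE: just the sys flag words for the two stacks. */
typedef struct { uint32_t rsysflags, msysflags; } FTemplate;

/* fctbody works on a temporary copy of the flags, so a leaf function
   (no calls out) can drop the b0/r1 saves without touching the
   template itself. */
static uint32_t leaf_rsysflags(const FTemplate* ft) {
    return ft->rsysflags & ~(STACKIP | STACKGP);
}
```

With all bits cleared, ENTERP generates no code at all and LEAVEP is just br.ret b0, as described above.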

28.5.17 {vc} a virtual C compiler Deconstructed the {itc1} IA64 C compiler into a group of function pointers placed in an ARCH structure, creating {vc} - the beginning of drawing {armc1}, {ebc1} and {itc1} into one multi-architecture compiler. Compilation is directed by a new command-line argument /a=ia64. I am currently working from an IA64 register-only model, through a mixed-stack model, towards a memory-only stack model such as the one I have always used in my ChaOS compilers. Before {vc}, function calling conventions were very much hard-coded in my compilers. These become FTEMPLATE structures which control the addressing of arguments (outs and ins in Itanium-speak) in the outer and inner functions. FTEMPLATEs include a namestring and are stored in a table within the ARCH structure, invoked by using the namestring as a function modifier. In this way, a namestring such as efi can invoke architecture-specific behaviour as defined by the UEFI Specification.

A FRAME structure, built for each function body, expands the FTEMPLATE to map local storage in fine detail, and is inevitably complex. Currently putting the finishing touches to the pal modifier for Itanium firmware PAL function calls.
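The ARCH dispatch idea reduces to a named table of function pointers - a sketch under invented names (only ARCH and /a=ia64 come from the text):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Per-target behaviour hangs off a table of function pointers,
   selected once from the /a=<arch> command-line argument. */
typedef struct {
    const char* name;         /* "ia64", "arm", ... */
    int (*naturalsize)(void); /* bytes in a natural integer */
} Arch;

static int ia64size(void) { return 8; }
static int armsize(void)  { return 4; }

static const Arch archtab[] = { {"ia64", ia64size}, {"arm", armsize} };

static const Arch* selectarch(const char* name) {
    for (size_t i = 0; i < sizeof archtab / sizeof archtab[0]; i++)
        if (strcmp(archtab[i].name, name) == 0) return &archtab[i];
    return NULL; /* unknown architecture */
}
```

Everything downstream calls through the selected Arch pointer, so adding a target means filling in one table row rather than threading conditionals through the compiler.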

12.5.17 Virtual address mode Identity-mapped VM now running, using itc.i and itc.d to create TLB cache entries for a handful of 4GB and 64MB descriptors, enough to get things off the ground. Will probably put this on the back burner for now, because EFI SetVirtualAddressMap can only be called after ExitBootServices, which destroys most of the EFI pre-boot environment.

11.5.17 End of the line: Intel announces the Itanium 9700 series as the seventh and final generation of IPF.

6.5.17 Progress: Itanium compiler taking shape, producing a ChaOS FTRAW image in a PE32+ wrapper, acceptable to EFI firmware as a native executable program (PE machine type 0x200 = IA64). Presently using the register stack model only (no memory stack yet), but sufficient to call EFI functions for console input (getkey etc.) and output to the EFI Shell environment. Now beginning to get under the hood of this fascinating processor.

Accessed Itanium IO ports and MMIO addresses yesterday for the first time, which require specific variants of ld and st instructions. Also managed to switch to virtual address mode, and access IO via page with UC attribute.
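Since ia64 has no in/out instructions, legacy IO ports are reached through a memory-mapped IO block with ordinary (uncached) loads and stores. A hedged sketch of the address arithmetic - the sparse layout below is the one Linux uses on ia64; the mapping used here may differ, and the block's base address comes from firmware:

```c
#include <assert.h>
#include <stdint.h>

/* Map a legacy 16-bit port number into the memory-mapped IO block:
   one 4KB uncacheable page per group of four ports, with the port's
   low bits preserved as the offset within the page. */
static uint64_t io_port_addr(uint64_t io_base, uint16_t port) {
    return io_base | ((uint64_t)(port >> 2) << 12) | (port & 0xFFFu);
}
```

A byte write to a port then becomes a plain st1 to the computed address through a UC (uncacheable) translation, which matches the virtual-mode IO access described above.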

April 2017: Ported {ebc1} -> {itc1} to produce native Itanium IA64 code. Tried to benchmark the Itanium against the i3-4010U in my development laptop, using assembly language routines for both which convert a 64-bit value into 16 hexadecimal digits. First attempts showed the IA64 consuming 300 clock cycles versus 200 for the i3, kind of what I expected given that my Itanium is vintage 2004. However, by careful bundling of the instruction triples (ref: Itanium Architecture for Programmers), the Itanium streaks ahead, consuming only 57 clock cycles. What a shame this processor will never be produced on thinner silicon.

March 2017: Ported {armc1} -> {ebc1} to produce EFI Byte Code for the EFI pre-boot VM (provided it is supported in firmware). This provided a first glimpse of the Itanium environment.