Post by Stephen Hoffman
Post by John Dallman
Post by Scott Dorsey
The whole idea of the VLIW system is that the compiler will be able
to optimize the code to gain parallelism of units inside the single
processor. This is a very very ingenious idea but nobody has yet been
able to make a compiler that could do it well enough for it to be a
real win.
Sadly, the job is *impossible*.
The fundamental problem in optimisation for modern computers is the
slowness of main RAM, which isn't currently solvable at a reasonable
cost. We use caches to mitigate it.
Out-of-order execution addresses this problem by tracking the data
dependencies on memory and registers in real time and executing
instructions when their data is available....
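
To make the point above concrete, here is a toy sketch (not a model of any real Itanium or Alpha pipeline). It assumes a single-issue machine, unique destination names standing in for register renaming, and made-up latencies of 2 cycles for a cache hit and 20 for a miss. The names Instr, run_in_order, run_out_of_order and build_program are hypothetical, invented for this illustration. The instruction order is what a compiler might pick if it assumes both loads hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str      # unique result name (stands in for a renamed register)
    srcs: tuple    # names this instruction reads
    latency: int   # cycles from issue until the result is usable


def run_in_order(program):
    """Issue in program order, one per cycle; stall when a source is not ready."""
    ready_at = {}          # result name -> cycle at which it becomes usable
    cycle = 0              # earliest cycle the next instruction may issue
    finish = 0
    for ins in program:
        # Inputs never written by the program are treated as ready at cycle 0.
        start = max([cycle] + [ready_at.get(s, 0) for s in ins.srcs])
        ready_at[ins.dest] = start + ins.latency
        finish = max(finish, ready_at[ins.dest])
        cycle = start + 1  # everything behind a stalled instruction waits too
    return finish


def run_out_of_order(program):
    """Each cycle, issue the oldest pending instruction whose sources are ready."""
    produced = {ins.dest for ins in program}
    ready_at = {}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        for ins in pending:
            if all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                   for s in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, ready_at[ins.dest])
                pending.remove(ins)
                break      # single issue: at most one instruction per cycle
        cycle += 1
    return finish


def build_program(load_a_latency):
    """Two independent chains, interleaved as a compiler assuming hits might do."""
    return [
        Instr("a", ("p",), load_a_latency),  # load that may miss the cache
        Instr("c", ("q",), 2),               # independent load, assumed to hit
        Instr("b", ("a",), 1),               # needs the first load
        Instr("d", ("c",), 1),               # needs only the second load
        Instr("e", ("d",), 3),
        Instr("f", ("e",), 1),
    ]


if __name__ == "__main__":
    for label, latency in (("hit", 2), ("miss", 20)):
        prog = build_program(latency)
        print(label, run_in_order(prog), run_out_of_order(prog))
```

With these made-up numbers both models finish in 8 cycles when the first load hits. When it misses, the in-order model takes 26 cycles because the independent chain queues up behind the stalled instruction, while the out-of-order model takes 21 because it runs that chain under the miss. The compiler had no way to know at build time which case it would get.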
The Itanium compiler optimizer just doesn't (and can't) know enough
about the system memory state, yes. Among other (no pun intended) issues.
The attempt to address that included providing run-time feedback into
the executables; providing post-link, post-execution tuning. (Caliper /
Atom / OM / etc.)
https://www.cs.tufts.edu/comp/150PAT/tools/caliper/wiess-rev-4.pdf
This Alpha versus IA-64 Itanium paper from 1999 describes the issues:
https://web.archive.org/web/20010611202933/http://www.compaq.com/hpc/ref/ref_alpha_ia64.doc
Clearly that old Alpha/IA64 comparison was written with an agenda.
There is no clear attribution in the document, but all the "we did" and
"we designed" clearly indicate authorship in the Alpha hardware group.
Some of their assumptions, like the claim that it would be impossible to
do out-of-order on IA64, are wrong, since the last Itaniums actually
implemented OOO and existing images saw an immediate benefit.
They were comparing the Itanium of the day to what they thought Alpha
could someday do. The Itanium of the day was pretty bad compared to the
Alpha of the day (or of the next 2 years). And it is more than just the
architecture. It is the chip, the process, the interface chips, etc.
And yes, it was a challenge for compilers. The GEM implementation is a
good V1 but is lacking. GEM wasn't designed around such a hardware
model. I'm sure with additional time/money/people that subsequent
versions would be better. Of all the backends I've seen, the HPUX one
is the best. During the Itanium port, I had some of the COBOL RTL
routines for datatype conversion. We had C code and the performance was
horrible out of GEM. We were considering our own assembly versions, but
I was directed to some of the HPUX compiler folks. I gave them the C
code and in a few weeks, I had Itanium assembly code that I could not
recognize. It used all sorts of Itanium features. It was several times
faster (I'm thinking 10x but I don't remember). That code is in the
COBOL RTL today. That was on those early Itaniums without OOO. How
good would the GEM code be on "modern" Itanium? Don't know. Never
tried. Doesn't matter.
As you say, cache is king. Intel doesn't price their chips based on
clock speed. They price them based on cache size.
I'll agree that Alpha was the better floating point system. The weird
bundling rules in the Itanium architecture make it difficult for a
floating point application.
Not to relitigate the argument (though that is what c.o.v does best), but
it was clear to many that upper Digital management didn't want to hear
technical arguments about the decision. Turning around to ask your
choir doesn't give you any information about a transformational change
in the underlying technology.