After the launch of Bulldozer, which disappointed many, it remains unclear, at least to me, why it was designed this way.
If any eyeballs here are on AMD's payroll, please, can you answer a few questions and give your opinion on a few suggestions?
Here it goes:
I'll skip the obvious questions, like why you didn't just make a shrunken Thuban, etc.
Given that AMD chose to go for an entirely new approach and warned the public in advance that you are not going after absolute IPC but for the cheapest core (in dollar and watt terms) and the biggest bang-for-the-buck performance (maybe a bit slower per thread, but many more threads, etc.):
1. Did you really manage to squeeze just four modules onto a die that has more than 90% of the area of the fattest Thuban?
Logic would suggest that, for the best model, 32nm would give you circa 2x the transistor budget in the same area compared to what you had at 45nm.
If, as you say, a BD module costs you just 18% more area than a Thuban core, and if you managed to squeeze 6 such cores onto 346 mm2 at 45nm, then you should be able to cram at least 10 such modules (with 2 "cores" each) into the same area. And you would still be left with proportionally the same area for the L3 cache, etc.
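A quick back-of-envelope check of that claim. The numbers below are the ones quoted in this post (346 mm2, 6 cores, +18% module area, ideal 2x density from the shrink), not official die measurements, and the "whole die / 6" core area is a deliberately generous upper bound:

```python
# Back-of-envelope check of the area argument above.
# Assumptions: figures quoted in the post, ideal 2x density scaling.
thuban_die_mm2 = 346.0      # Thuban die size at 45 nm
thuban_cores = 6
density_gain = 2.0          # ideal area scaling, 45 nm -> 32 nm
module_vs_core = 1.18       # claimed BD module area vs. one Thuban core

# Generous upper bound: charge the whole die to the 6 cores.
core_area_45 = thuban_die_mm2 / thuban_cores
core_area_32 = core_area_45 / density_gain      # shrunk footprint
module_area_32 = core_area_32 * module_vs_core  # BD module at 32 nm

modules_fit = thuban_die_mm2 / module_area_32
print(round(modules_fit, 1))   # ~10.2 modules in the same 346 mm2
```

So even with the pessimistic "cores take the whole die" accounting, the arithmetic lands on roughly 10 modules, which is where the "at least 10" figure above comes from.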
2. If you were going for a massively parallel concept, why did you have to load this thing with mountains of cache? Why not use as many cache-lean cores as possible? The public already knew that single-thread performance with BD would be so-so. It seems to me that you could easily have optimized BD for heavily threaded apps while leaving the existing Thuban lines to those who need time to adapt. A cache-lean BD with, say, 12 modules (= 24 "cores"!) would be a chip with a punch.
And a good reason for folks to optimize and multithread their apps.
Even if it had to run at much lower frequencies.
3. You had three years for BD (at least that much has passed since the first public announcement) and this is what you managed to come up with?
I don't mind failures as an integral part of risky, innovative products, but I can hardly see all that much innovation here. An interesting starting idea, but frankly, a far cry from the impact x86_64 had when it came out.
4. You constantly talk about Fusion, but without a thorough redesign it is hard to see a concrete benefit, except somewhat lower power consumption. What I miss the most in pixel shaders is intelligence. They are effective only in lockstep and cannot execute code autonomously. This means they are an effective tool only for a narrow spectrum of problems, and only in big swarms. It also means that I cannot hand just any problem to the GPU the way I can to the FPU or an SSE unit.
Handing work to the GPU incurs some overhead and latency, which means it really doesn't matter that much that the GPU is actually on the same silicon. It is probably more of a tradeoff, since you obviously had problems optimizing the production parameters to satisfy both GPU and CPU needs.
B --- wishlist
1. AMD is cash-strapped, at least compared to Intel. It's stupid to expect them to beat Intel in all disciplines, so wishes of the type "get to 12nm before Intel" are unreasonable (last but not least because making the actual chips is GlobalFoundries' job).
2. The whole SOI idea was not that great overall.
3. Since GlobalFoundries is now quite far behind Intel (and presumably will be for the foreseeable future), a full frontal assault through a process-geometry war seems like madness.
4. AMD's market share is small and declining. At this point it could be beneficial to specialize in some niche market. Linux users, for example, care much less about the whole x86 concept. If there were, for example, an effective ARM-based PC, I'd be using it as my next machine.
5. Despite that, AMD has its patent portfolio and x86 licenses, and it would be stupid to throw them away now.
1. Keep it simple and modular. The whole "memory controller on the chip" business didn't go over that well. It cost you more than a year (if not two!) of delay, and your competition managed to wipe out much of that latency difference with a discrete solution anyway. Not to mention that without it, you could optimize the production process for the main component alone, whatever it might be (CPU/MEM/GPU/etc.).
2. HT links seem like a viable idea; keep them. But keep it simple: each component small and simple, as one HT node. Where you absolutely want top performance, you might have multiple dies on the same chip, connected through internal fat-and-fast HT links.
I know that copper interconnect demands some power and has its limitations, but neither needs to be relevant for an on-chip interconnect where signals travel just a few mm through a very high quality micro-PCB with practically no noise. With such a design you might have a chip with, say, four memory banks, where on top of it sits a memory controller with some cache on one die, a GPU unit on a second, and e.g. two CPU nodes on the other two dies. There might even be some GDDR5 on the board for the GPU, or some cache, so that periodic picture generation doesn't burden main memory, etc. All units interconnected through internal wide (64-bit?) HT links.
3. For smaller, configurable systems, one might want to combine all units externally. One (or more) chip as a memory controller, one (or more) as a CPU, another (or more) as a GPU. Each preferably on its own PCB card, with cards connected through a simple backplane or even simple ribbon HT links. Somewhat like in the days of good ol' SCSI.
4. WRT the core itself: what you have shown so far with BD seems more an evolutionary than a revolutionary step.
Please show us:
- more cores, less cache
- some innovation inside the cores
- stop selling one module as two cores
Since we are at the end of what x86 and x86_64 have to offer, and since your market share has fallen to roughly that of the Linux market, this might be a good time to risk something radically different.
Here are a few ideas:
1. Redesign the instruction decoder so that you can flip it from x86 mode into some clean RISC mode, possibly with a "Thumb-like" submode, or even into a VLIW mode.
2. It'd be cool if one could slip into a vector mode within the general registers. Today I can do ops on anything from a byte to a long long, but on a single operand, not on packed operands.
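Today this has to be faked in software with SIMD-within-a-register (SWAR) tricks. A minimal sketch of what such a "vector mode" would replace: adding eight packed bytes held in ordinary 64-bit integers, with masking to stop carries from leaking between lanes (the function name `paddb` and the mask constants are my own, chosen to mirror the SSE packed-byte add):

```python
# SWAR sketch: packed byte addition inside a plain 64-bit integer.
# Hardware "vector mode" would make this a single instruction.
MASK_HI = 0x8080808080808080   # top bit of each byte lane
MASK_LO = 0x7F7F7F7F7F7F7F7F   # low 7 bits of each byte lane

def paddb(a, b):
    """Add eight packed bytes lane by lane, wrapping inside each byte
    (no carry into the neighbouring lane)."""
    low = (a & MASK_LO) + (b & MASK_LO)   # add low 7 bits per lane; carries stay in-lane
    return low ^ ((a ^ b) & MASK_HI)      # recombine top bits carry-free

a = 0x01_02_03_04_05_06_07_FF
b = 0x01_01_01_01_01_01_01_01
print(hex(paddb(a, b)))   # 0x203040506070800 -> lanes 02 03 04 05 06 07 08 00
```

Note the last lane: 0xFF + 0x01 wraps to 0x00 inside its own byte instead of carrying into the neighbour, which is exactly the packed semantics a GPR vector mode would give for free.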
3. It'd be cool if one could do threading within the register set. Something like what we already see in GPU pixel shaders, but less constrained.
For example, within a 64x64-bit register set one would be able to execute a loop on each consecutive group of 4 or 8 registers. Not necessarily in lockstep, but freely within, say, one cacheline of 64 bytes. This means that all banks would share all but the lowest 6 bits of the program counter.
As I understand it, modern chips already have many more registers internally than the user can see (register renaming etc.). IIRC, the number of actual GPRs is more like 100 or more, instead of the 8 (i386) or 16 "official" ones. It would be great if they could all be made visible and organized into, say, 8 register banks (R0-R7, R8-R15, etc.) with the ability to join registers into pairs/quads/octuplets/hexaplets (R0,R8,R16,...) on which operations could be performed in the same cycle.
Otherwise, as long as one stayed inside one's own register set and executed within the same cacheline (a 64-byte chunk), one should be able to execute up to 8 "microthreads". On second thought, having 64-bit native registers might be overkill for most apps. It might be nicer to have a 32- or even 16-bit register as the native building block, but of course in correspondingly fatter and more numerous banks.
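To make the idea concrete, here is a toy software model of it: one 64-entry register file split into 8 banks of 8 registers, with each "microthread" running the same little loop over its own bank. Everything here is hypothetical (no real ISA works this way); it only illustrates the partitioning being proposed:

```python
# Toy model of "microthreads over register banks": a shared 64-register
# file, 8 banks of 8 registers, one microthread per bank.
NBANKS, BANK = 8, 8
regs = [0] * (NBANKS * BANK)     # the shared 64-register file

def microthread(bank, data):
    """Sum a slice of data using only this bank's registers
    (indices bank*8 .. bank*8+7)."""
    base = bank * BANK
    regs[base] = 0                   # this bank's R0 = accumulator
    for v in data:
        regs[base + 1] = v           # this bank's R1 = loaded value
        regs[base] += regs[base + 1]

data = list(range(64))               # 64 values, 8 per microthread
for bank in range(NBANKS):
    microthread(bank, data[bank * 8:(bank + 1) * 8])

partials = [regs[b * BANK] for b in range(NBANKS)]
print(partials[0], sum(partials))    # 28 2016
```

Each bank ends up holding an independent partial sum, computed without ever touching a neighbouring bank's registers, which is the property that would let the banks run "not necessarily in lockstep".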
4. With K10 we have seen the advantage of the L3 cache being used as an inter-core communication buffer. You could now give it another good use - for matrix operations or tiling, for example. You already have this with GPUs, where rows in RAM are wrapped into tiles to better suit the data locality of objects/polygons.
The CPU by itself is quite inept at rotations etc., and it would be nice to have at least cheap, partial HW support for such things.
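What we do today, lacking such HW support, is cache-block the work in software. A minimal sketch of a tiled transpose (the tile size here is arbitrary; real code would match it to the cacheline/L3 geometry that the hardware would otherwise handle for us):

```python
# Software cache-blocking of a matrix transpose -- the kind of tiling
# the hardware could do for us in the L3.
def transpose_tiled(m, n, a, tile=8):
    """Transpose an m x n matrix stored row-major in the flat list a."""
    out = [0] * (m * n)
    for ti in range(0, m, tile):
        for tj in range(0, n, tile):
            # process one tile at a time so reads and writes stay local
            for i in range(ti, min(ti + tile, m)):
                for j in range(tj, min(tj + tile, n)):
                    out[j * m + i] = a[i * n + j]
    return out

a = list(range(12))               # 3x4 matrix: rows [0..3], [4..7], [8..11]
print(transpose_tiled(3, 4, a))   # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```

The point of the tiling loops is that both the reads and the writes of one tile fit in cache at the same time; a naive row-by-row transpose would stride through the output and miss on nearly every store for large matrices.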
5. Why bother with IEEE754/FPU at all? FP is used all over, but it always comes up short somewhere. Either some quirk of the standard causes slowdowns, or one runs out of either the exponent or the mantissa part. Wouldn't it be better if we could treat the exponent and mantissa parts as separate integers (or even vectors of integers) and use them accordingly? There could be some instructions to ease FP calculations (say, to adjust the exponent after an "FP op", etc.), but even those could be made so general that they would apply to a much wider spectrum of cases (complex numbers, rational numbers, etc.).
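A rough sketch of what that separation looks like in software, using Python's `math.frexp`/`math.ldexp` to split a float into an integer mantissa and a base-2 exponent. The representation and helper names are my own, purely illustrative, and this is nothing like full IEEE 754 (no rounding, no specials):

```python
# Sketch: a value as a (mantissa, exponent) pair of plain integers,
# so "FP" arithmetic becomes ordinary integer ops.
import math

def make(x, bits=53):
    """Split float x into an integer mantissa and a base-2 exponent."""
    m, e = math.frexp(x)                  # x == m * 2**e, 0.5 <= |m| < 1
    return int(m * (1 << bits)), e - bits

def mul(a, b):
    """Multiply two pairs: one integer multiply, one integer add."""
    return a[0] * b[0], a[1] + b[1]

def to_float(v):
    """Reassemble: mantissa * 2**exponent."""
    return math.ldexp(v[0], v[1])

x, y = make(1.5), make(2.0)
print(to_float(mul(x, y)))   # 3.0
```

Note that `mul` never normalizes: the mantissa just grows, exactly the "adjust the exponent after an FP op, but only when you choose to" flexibility argued for above.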
All in all, I would __LOVE__ to see BD be just as much an x86 CPU as the newest, biggest bad-ass bulldozer is a street-legal vehicle.
Strictly speaking, one could drive it on some roads, but this usually isn't the way to drive to work or school (unless you want to make a statement, of course ;o). It's a seriously big tool for solving seriously big problems.