AMD Processors
Topic Title: Bulldozer - some questions and suggestions for AMD
Created On: 10/17/2011 04:02 PM
Status: Read Only
 10/17/2011 04:02 PM
brain_pain
Lurker

Posts: 9
Joined: 10/14/2011

Bulldozer questions:

After the launch of Bulldozer, which disappointed many, it is unclear to me why it was designed this way.

If any eyeballs here are on AMD's payroll, could you please answer a few questions and give your opinion on a few suggestions?

Here it goes:


A--- questions

I'll skip the obvious questions, like why you didn't just make a shrunk Thuban, etc.:

Given that AMD chose an entirely new approach and warned the public in advance that it was not going after absolute IPC but for the cheapest core (in dollar and watt terms) and the biggest bang-for-the-buck performance (maybe a bit slower per thread, but many more threads, etc.),

why:

1. Did you manage to squeeze only four modules onto a die that has more than 90% of the area of the fattest Thuban?
Logic suggests that, in the best case, 32nm would give you roughly 2x the transistor budget at the same area compared to 45nm.

If, as you say, a BD module costs you just 18% more area than a Thuban core, and you managed to squeeze 6 such cores into 346 mm² at 45nm, then you should be able to cram at least 10 such modules (with 2 "cores" each) into the same area, and still be left with proportionally the same area for the L3 cache, etc.

2. If you were going for an extensively parallel concept, why did you have to load this thing with mountains of cache? Why not use as many cache-lean cores as possible? The public already knew that single-thread performance with BD would be so-so. It seems to me that you could easily have optimized BD for heavily threaded apps while leaving the existing Thuban lines to those who need time to adapt. A cache-lean BD with, say, 12 modules (= 24 "cores"!) would be a chip with a punch, and a good reason for folks to optimize and multithread their apps, even if it had to run at much lower frequencies.


3. You had three years for BD (at least that much has passed since the first public announcement), and this is what you managed to come up with?
I don't mind failures as an integral part of risky, innovative products, but I can hardly see all that much innovation here. An interesting starting idea, but frankly a far cry from the impact x86_64 had when it came out.

4. You constantly talk about Fusion, but without a thorough redesign it is hard to see a concrete benefit beyond somewhat lower power consumption. What I miss the most in pixel shaders is intelligence. They are effective only in lockstep and cannot execute code autonomously. This means they are an effective tool only for a narrow spectrum of problems, and only in big swarms. I cannot hand just any problem to the GPU the way I can to the FPU or an SSE unit.
It incurs overhead and latency, which means it doesn't really matter that much that the GPU is on the same silicon. It is probably more of a tradeoff, since you obviously had problems optimizing production parameters to satisfy both GPU and CPU needs.

B --- wishlist

Assumptions:

1. AMD is cash-strapped, at least compared to Intel. It's unreasonable to expect them to beat Intel in all disciplines, so wishes like "get to 12nm before Intel" are unrealistic (last but not least because making the actual chips is GlobalFoundries' job).

2. The whole SOI idea was not that great overall.

3. Since GlobalFoundries is now quite far behind Intel (and presumably will be for the foreseeable future), a full frontal assault in a process-geometry war seems like madness.

4. AMD's market share is small and declining. At this point it could be beneficial to specialize for some niche market. Linux users, for example, care much less about the whole x86 concept. If there were an effective ARM-based PC, for example, I'd be using it as my next machine.

5. Despite that, AMD has its patent portfolio and x86 licences, and it would be foolish to throw them away now.


So:

1. Keep it simple and modular. The whole "memory controller on the chip" business didn't go over that well. It cost you more than a year (if not two!) of delay, and your competition managed to wipe out much of that latency difference with a discrete solution anyway. Not to mention that without it, you could optimize the production process for the main component alone, whatever it might be (CPU/MEM/GPU/etc.).

2. HT links seem like a viable idea; keep them. But keep it simple: each component small and simple, as one HT node. Where you absolutely want top performance, you might have multiple dies on the same chip, connected through internal fat-and-fast HT links.
I know that copper interconnect demands some power and has its limitations, but neither needs to be relevant for on-chip interconnect, where signals travel just a few mm through very high-quality micro-PCB with practically no noise. With such a design you might have a chip with, say, four memory banks, topped by a memory controller with some cache on one die, a GPU unit on a second, and e.g. two CPU nodes on the other two dies. There might even be some GDDR5 on the board for the GPU, or some cache so that periodic picture generation doesn't burden main memory, etc. All units interconnected through internal wide (64-bit?) HT links.

3. For smaller, configurable systems, one might want to combine all units externally: one (or more) chip as memory controller, one (or more) as CPU, another (or more) as GPU. Each preferably on its own PCB card, with cards connected through a simple backplane or even simple ribbon HT links, somewhat as in the days of good ol' SCSI.

4. With regard to the core itself: what you have shown so far with BD seems more an evolutionary than a revolutionary step.

Please show us:

- more cores, less cache
- some innovation inside cores.
- stop selling one module as two cores

Since we are at the end of what x86 and x86_64 have to offer, and since your market share has fallen to roughly that of the Linux market, this might be a good point to risk something radically different.

Here are a few ideas:

1. Redesign the instruction decoder so that you can flip it from x86 mode into some clean RISC mode, possibly with a "Thumb-like" submode, or even into a VLIW mode.

2. It'd be cool if one could slip into a vector mode within the general registers. Currently I can do ops on anything from byte to long long, but only on a single operand, not packed operands.

3. It'd be cool if one could do threading within the register set. Something like what we already see in GPU pixel shaders, but less constrained.
For example, within a 64 x 64-bit register set one could execute a loop for each consecutive 4 or 8 registers. Not necessarily in lockstep, but freely within, say, one cache line of 64 bytes. This means that all banks would share all but the lowest 6 bits of the program counter.

As I understand it, modern chips already have many more registers internally than the user can see (register renaming, etc.). IIRC, the number of actual GPRs is more like 100 or more, instead of the 8 (i386) or 16 "official" ones. It would be great if they could all be made visible and organized into, say, 8 register banks (R0-R7, R8-R15, etc.) with the ability to join registers into pairs/quads/octuplets (R0, R8, R16...) on which operations could be performed in the same cycle.

Otherwise, as long as one stayed inside one's own register set and executed within the same cache line (64-byte chunk), one should be able to execute up to 8 "microthreads". On second thought, 64-bit native registers might be overkill for most apps. It might be nicer to have a 32- or even 16-bit register as the native building block, but of course in correspondingly fatter and more numerous banks.

4. With K10 we saw the advantage of the L3 cache being used as an inter-core communication buffer. You could now give it another good use: for matrix operations or tiling, for example. You already do this with GPUs, where rows in RAM are wrapped into tiles to better suit the data locality of objects/polygons.
The CPU by itself is quite inept at rotations etc., and it would be nice to have at least cheap, partial HW support for such things.

5. Why bother with IEEE 754/FPU at all? FP is used all over, but it always comes up short somewhere. Either some quirk of the standard causes slowdowns, or one runs short on either the exponent or the mantissa part. Wouldn't it be better if we could treat the exponent and mantissa as separate integers (or even vectors of integers) and use them accordingly? There could be some instructions to ease FP calculations (say, to adjust the exponent after an "FP op", etc.), but even those could be made so general that they apply to a much wider spectrum of cases (complex numbers, rational numbers, etc.).


All in all, I would __LOVE__ to see BD be as much an x86 CPU as the newest, biggest bad-ass bulldozer is a street-legal vehicle.
Strictly speaking, one could drive it on some roads, but that usually isn't how you drive to work or school (unless you want to make a statement, of course ;o). It's a seriously big tool for solving seriously big problems.
 10/18/2011 09:09 AM
Super XP
Deprecated

Posts: 336
Joined: 12/29/2003

Despite some negative reviews along with some positive ones, here is my take on the matter.
- AMD needs to further tweak this design.
- Software needs to polish up to take advantage of this new design.
- Driver Updates.
- Windows 7 Update for Bulldozer to ensure further performance and consistency.

AMD's Innovation = Intel's Success. Without the two, the CPU Industry would fall far behind in Innovation and Technology.
Good Job AMD, just fix the Quirks. Thanks...
------------
Other Thoughts: What NEXT?
With Piledriver (Bulldozer II / AMD FX2, due in Q1 2012 on Socket AM3+) being released so soon, I believe AMD knew about Bulldozer's minor design issues (branch prediction, pipeline flushing, cache thrashing, a decode unit that isn't wide enough, etc., all of which need to be TWEAKED) but instead counted on higher frequencies to make up the difference until Piledriver is released on Socket AM3+.

Anandtech's review also shows that cache latency is worse than Phenom II.

Both problems can be blamed on GlobalFoundries' poor 32nm process yields.

Cache latency can be increased and clock speeds lowered to get higher yields.

-------------------------
AMD FX 8120 @ 4.40GHz (8-Cores)
Asus Crosshair V
G-Skill RipJaw DDR3-1866 16GB (4 x 4GB)
Corsair H100
Corsair 180GB SSD Force Series 3
Windows 7 Ult. x64
 01/02/2012 09:02 PM
RamJam
Lurker

Posts: 7
Joined: 12/29/2011

I'd like to suggest AMD start being responsible and advertise their products with a degree of honesty. I finally found that the official line on the Dull Dozers is that yes, they can run DRAM up to 1866, but you are limited to 1 DIMM per channel, and about the only place to find this is some arcane document hidden away that no one knows about or could ever find (it took their tech support almost two weeks to provide an answer on this for another builder). I know a large number of people who waited and waited and bought mobos and DRAM before the CPUs were even released, only to find they can't run all their DRAM... and most of them have now dumped AMD and moved on to Intel.

-------------------------
RamJam
 01/02/2012 10:11 PM
Canis-X
The Frozen One

Posts: 4142
Joined: 01/19/2009

.....and so what about Deneb and Thuban? Did you run them at 1600MHz? Because they were rated for nothing more than 1333. I'm sure you could have gotten them to run at their rated speed if you or they took the time to learn the components' behavior. This is the nature of the beast.

-------------------------
The opinions expressed above do not represent those of Advanced Micro Devices or any of their affiliates.
 01/02/2012 10:51 PM
RamJam
Lurker

Posts: 7
Joined: 12/29/2011

Which ones? The 1090 and 1100 aren't bad; I've had those to 2000 and better. But head back just a tad: maybe the early 965 that wouldn't even run 1600, or the C3 965 that would run 1600 (1 stick per channel, comfortably) or with 4 sticks, where it really gets hot. And even then, you could take a setup with one C3 and run 16GB, pull it, try 3 other C3s, and maybe 1 or 2 of them might run the 1600s at the exact same settings (quality control, maybe?). My point is, if they are going to tout something at 1866, it should be able to run on one of their mobos with the slots filled. I'll dump this and go to Intel: they advertise the 2500K as realistically 1333-capable, but everyone knows it can easily run 2133 DRAM (16GB) and more, overclock higher, and outperform the higher-priced 8150.

-------------------------
RamJam
 01/03/2012 12:44 AM
QB the Slayer
Case Modder

Posts: 1304
Joined: 01/23/2010

My C2 965 can run my RAM at 1872, and that is with 4 DIMMs populated... I am guessing this is another case of a user's lack of knowledge, and then blaming the HW for it.

Also, this is Forum Feedback, NOT AMD feedback. This entire thread doesn't even belong here.

QB

-------------------------

The MONSTER HTPC

CPU: AMD Phenom II X4 965 C2 (140W).||.Cooler: Corsair H80i
MB: Gigabyte 990FXA-UD7.||.RAM: 8 GB Mushkin Blackline DDR3 2000MHz (7-10-8-27-1T)
Case: CoolerMaster HAF 932.||.PSU: Corsair HX750
GPU:HIS IceQ 5 Radeon HD 5770 Turbo 1GB.||.Audio: Creative X-Fi Titanium Fatal1ty Pro w/ Logitech Z-5300e (5.1, 280W-RMS)
Drive: 2xKingston SSD 40GB RAID0.||.Storage: 6TB (4x500GB Caviar Black RAID0, 2TB Hitachi & 2TB Caviar Green)
 01/03/2012 12:56 AM
Canis-X
The Frozen One

Posts: 4142
Joined: 01/19/2009

Hell, my old CPUs, a 955 (C2 & C3) and a 965, all ran with all 4 DIMMs populated at 1600MHz with no issues. My current 1090T runs all 4 DIMMs populated just fine as well, at 2000. Are you sure your motherboard's VRMs and PSU were providing clean juice? TBH though, I haven't owned a Bulldozer yet and don't intend to get the first iteration either. I'll wait until the next one; hopefully Microsoft will have the scheduler patches out by then, and I'll pick one up.

-------------------------
The opinions expressed above do not represent those of Advanced Micro Devices or any of their affiliates.