Help requested for Terragen render thread scalability test

Started by Tangled-Universe, August 29, 2019, 06:38:35 am

Previous topic - Next topic

Tangled-Universe

September 03, 2019, 07:12:58 am #15 Last Edit: September 03, 2019, 07:22:00 am by Tangled-Universe
Thank you all for your time and effort!
Contrary to good scientific practice, I'd say the 4 measurements performed so far are already pretty clear and don't raise doubts that would warrant more tests.

I plotted all non-hyperthreaded/non-SMT data in a graph.
Please note that n=1 for 2 and >16 threads.
Error bars are standard deviation.

What you see is that for all 4 participants I set the render time at 4 threads to a performance index of 1 (index = render time at 4 threads divided by render time at n threads).
With more threads you render in fewer seconds, so the higher the value on Y, the higher the performance.
In an ideal, but far from realistic, world each doubling of threads would cut render time in half.
By normalizing the render times to the 4-thread result I cancel out the machine-specific variables that affect absolute performance.
I hope people understand this concept of normalization.
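To make the normalization concrete, here is a minimal sketch; the render times below are made-up placeholder numbers, not actual results from this thread:

```python
# Performance index relative to the 4-thread baseline:
#   index(n) = time(4 threads) / time(n threads)
# Ideal scaling doubles the index with every doubling of threads.

render_times = {4: 800.0, 8: 410.0, 12: 290.0, 16: 240.0}  # seconds, hypothetical

baseline = render_times[4]
index = {n: baseline / t for n, t in render_times.items()}

for n in sorted(index):
    ideal = n / 4  # perfect linear scaling from the 4-thread baseline
    print(f"{n:2d} threads: index {index[n]:.2f} (ideal {ideal:.2f})")
```

Because everything is divided by that machine's own 4-thread time, two machines with very different absolute speeds land on the same curve if their scaling behavior is the same.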

Conclusions:
1) Pretty good scalability up to 12 threads.
2) The plateau starts at 16 threads and is fully reached at 24 threads (which fits with some previous forum discussions).
3) If you like rendering with 16 threads on 16 physical cores and still want to use your workstation for other tasks, then a workstation with >16 physical cores can pay off, but for TG itself the return on investment of using >16 threads would be minimal.

Discussion:
Moore's Law is in an undead state: the industry's performance gains can now mostly be attributed to multi-core CPUs and less to clock speed, IPC improvements, or die shrinks, the classic mechanisms that drove Moore's Law. IPC improvements still occur every generation but do not follow Moore's Law, clock speed hasn't for a while either, and die shrinks are way behind past roadmaps (EUV).
Therefore, insofar as Moore's Law is still alive, it is mostly due to increased core counts, and to make good use of those you need well-parallelized software.
For the future, TG's architecture should scale up better, perhaps up to 32 cores.
I also wonder whether TG makes good use of SIMD and other instruction set extensions like AVX; Luc asked about that some 10 years ago, I remember.

Yes, previous discussion here hinted at possible performance differences between the 3 render methods currently available, but those were outside the scope of this test.
I will then add 2 lines to the graph below, though expect much longer testing times (for PT).
We can perform these tests again if anyone is interested, but I don't expect vast differences in scalability; that would be weird/unexpected, unless Matt has incorporated different coding concepts over the years.
Perhaps that touches on my question of how well TG uses SIMD and the like, and whether there's a way to improve thread scalability to make good use of where the CPU industry is heading.

D.A. Bentley

I'm wondering if there is any difference between limiting cores in the BIOS vs. directly in Terragen?  It would be a bit more work going into the BIOS to limit the number of cores used.

-Derek

Tangled-Universe

The max thread setting is not a soft limit. TG really does not create more than the set number of render threads, and right now I don't see how a BIOS-disabled core would differ from not allowing TG to use it.
Given this data I would not think so either... Can you tell from the graph which system had its cores disabled in the BIOS?
There's some variation at 16 threads, but otherwise it's all pretty much on top of each other.

Matt

Hi Martin,

Thanks for putting this together. I will reply in more detail later, but there are a couple of things I wanted to say first, which may help to get a more accurate conclusion.

Quote from: Tangled-Universe on September 03, 2019, 08:40:18 am
The max thread setting is not a soft limit. TG really does not create more than the set number of render threads, and right now I don't see how a BIOS-disabled core would differ from not allowing TG to use it.
Given this data I would not think so either... Can you tell from the graph which system had its cores disabled in the BIOS?
There's some variation at 16 threads, but otherwise it's all pretty much on top of each other.

Terragen uses Embree, and Embree spawns a large number of threads which are not limited by TG's thread settings. I don't know how this will affect performance compared to limiting threads in the BIOS, but I think this should be tested before drawing a conclusion. We also don't know what else is affecting performance that is independent of the TG thread settings but dependent on the number of physical cores.

Shouldn't your graph go through (0,0)? This is an implicit data point, and I don't know what standard scientific practice recommends here, but it seems to me that we know for sure the line must go through the origin; therefore the curve you fit to the data is not plausible if it does not. It also seems excessively curved, missing the first 3 data points, and those are the ones we expect to be the most reliable. If it went through (0,0) I would not mention this.

We definitely have a scalability problem and I don't mean to deflect away from that, I just want to be more confident about this curve.

Just because milk is white doesn't mean that clouds are made of milk.

Tangled-Universe

September 03, 2019, 01:03:03 pm #19 Last Edit: September 03, 2019, 02:52:06 pm by Tangled-Universe
Hi Matt,

Thanks for replying to this thread and glad you appreciate the effort!

I'm not sure either about Embree's thread handling vs TG's.
Given the results I'd tend to say that Embree respects the "host's" thread limit, if that makes sense.
Either way, I still feel confident that limiting threads in the BIOS does not make much of a difference; otherwise the normalized data for that machine would deviate at every point on the curve.

About the curve fit... I included the R-squared to indicate how well the curve describes the input data.
As you may know, a value of 1 means the curve describes the data perfectly, and this one comes pretty close at about 0.95.
In other words, about 5% of the variance in performance is unexplained when reading from the curve.
The reason I wanted this curve is to make it easier to visualize the plateau, since that's the intention of this topic: where does TG's multi-threading performance flatten out?

I don't know by heart which curve-fit I used here, but I will check tomorrow.
I do know that I can force the curve through the 2-thread point, but that point has just 1 measurement, whereas the 4-thread point has 4.
If you are really concerned about this fit, which I'm not really, then we would need 3 more data points for rendering with 2 threads.
Ideally you would do it with 1 thread as well, but people probably have better things to do :P
I don't see a clear rationale for forcing it through the origin; that point of the curve is literally meaningless (zero threads render nothing) and not required by every fitting algorithm.
Changing the fitting in any of these ways would not affect the plateau; would you think otherwise?

The simplest answer to your remark: yes, it should go through the origin if you fit it linearly.
I can fit it with linear regression, but the fit would be bad and would not help visualize the plateau.

These fitting discussions matter especially when you wish to interpolate the data. For empirical data an R-squared of 0.95 is still pretty good, although generally we'd like >0.98.
All that the R-squared tells you is how confident you can be that your interpolation is accurate, but again, that is not the scope of this endeavor.
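For reference, R-squared is just 1 minus (residual sum of squares / total sum of squares). A minimal sketch with placeholder numbers (not the actual thread-test data):

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot

# Hypothetical normalized performance indices and a candidate fit's predictions
observed  = [0.52, 1.00, 1.85, 2.40, 2.90]
predicted = [0.50, 1.00, 1.90, 2.45, 2.80]
print(f"R^2 = {r_squared(observed, predicted):.3f}")
```

Note that a high R-squared says the curve tracks these particular points well; it says nothing about how safe it is to extrapolate beyond them, which is the crux of the plateau question.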

Oshyan

Uh... unless I am mistaken, there were no *actual* 32-core machines used in this test. 32 threads, yes, but that means hyperthreading on all the CPUs I saw reported. So... of course performance plateaus at the number of physical cores, since Hyperthreading can only ever add a max of ~20% *total* performance in typical usage scenarios. Or am I missing something? :D

Edit: in fact we have a possible conundrum. On the current Terragen 4 benchmark there is only one entry with what would be 32 physical cores. The problem with that CPU, the 2990WX, is that it is known to have some scalability issues under Windows (still, I believe). What we ideally need is someone with either dual Xeons or dual Epycs with at least 16 cores each. Probably best if it's a dual Xeon machine, just because they're a more known quantity, so to speak.

- Oshyan

Tangled-Universe

I wrongly labeled KlausK's machine as native 32-core, but indeed @Oshyan it's 16 cores!

I will remove it in the next graph, but that will not change the conclusion; if anything it strengthens it.
You can already tell from the graph that removing it will make matters worse for scalability, since the 16-thread data point is slightly below the current fit.
The new graph will therefore show a more pronounced plateau than the current one.

@Oshyan : yes, hyperthreading adds a max of ~20% performance gain, but that can differ between CPU models/brands and whatnot. That's why I asked people not to use it, because then the normalization no longer works.
That may be reflected in this graph, but to see it we would indeed need a true 32-core machine, as you all pointed out.

Oshyan

Yes, performance at 16 cores and above would seem to be leveling off, which is not surprising. But with the current graphed result I don't know that I'd conclude that 32 threads is just not worthwhile for Terragen.

I know other renderers are more efficient with higher thread counts, but it would be good to know just *how* efficient, i.e. what is the realistic target. Maybe we can plot some Cinebench results as well? 

- Oshyan

Tangled-Universe

September 04, 2019, 06:07:57 am #23 Last Edit: September 04, 2019, 10:35:14 am by Tangled-Universe
Yes, it would be premature at this stage to conclude that 32 threads are not worthwhile, but I made that statement in my original post, before you pointed out the flaw of including the 24 and 32 thread results ;)

However, I'm still confident that it holds true once we can get our hands on true 24 and 32 core results. More on that later.

Please look at this improved graph where I excluded the 24 and 32 thread data.
The flattening of the curve becomes more pronounced, logically, but there's more to this than you may think. More on that below.
Then, using linear regression forced through the origin to satisfy @Matt ;) we get a black line trying to describe our data linearly.
The slope of the ideal line is y/x = 1/4 = 0.25, and this is also what my software tells me (GraphPad Prism).
The measured slope of our data = 0.1949.
Say we assume the linear fit is robust and fine; then we could safely extrapolate to predict the performance at any given number of threads, and back-calculate the predicted render time for that number of threads.
So let's fill in 32 threads:

Ideal: 32 x 0.25 = 8
Ours: 32 x 0.1949 = 6.2368

Then calculate the expected render time: each participant's 4-thread render time in seconds divided by the performance index, then averaged +/- standard deviation across the 4 participants:
Ideal: 170 seconds +/- 32 seconds
Ours: 218 seconds +/- 41 seconds
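The through-origin slope and the back-calculation can be sketched as follows. The slope of a zero-intercept least-squares fit is sum(x*y)/sum(x*x); the 0.1949 slope is the one reported above, while the 4-thread render time is a placeholder chosen only to illustrate the arithmetic:

```python
def slope_through_origin(xs, ys):
    """Least-squares slope for a line forced through (0, 0): b = sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

ideal_slope = 0.25     # index 1 at 4 threads => 1/4 per thread
fitted_slope = 0.1949  # measured slope from the thread test above

threads = 32
print(f"ideal index:  {threads * ideal_slope:.4f}")   # 8.0000
print(f"fitted index: {threads * fitted_slope:.4f}")  # 6.2368

# Back-calculate a predicted render time from a hypothetical 4-thread time:
t4 = 1360.0  # seconds at 4 threads (placeholder, not a real measurement)
print(f"predicted render time: {round(t4 / (threads * fitted_slope))} s")
```

The `slope_through_origin` helper is a hypothetical name; GraphPad Prism does this fit internally when you constrain the intercept to zero.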

Hurray! Yes? No!?
No! Why no Martin, you critical sucker!?

Linear regression is not fitting our data well (RMSE = 0.25, which is high given that the dependent variable is a normalized index of order 1).
You can clearly see that the linear regression under-estimates performance at low thread counts and over-estimates it at higher thread counts, starting at around 12 cores.
That's not subjective, that's objective.
The linear fit intersects the non-linear exponential fit just after 12 cores.
I'm too lazy to calculate the exact intersection of the two, but it looks to be at about 13 threads; we cannot buy such CPUs, and 13 is closer to 12 than to 16, so saying the decline starts at around 12 makes the discussion easier. 

Or another way: I can perform a t-test on the 12-thread data point and show it is significantly worse than ideal.
Probably the 8-thread point would be statistically different too.
In fact, 8 threads is already enormously statistically different from theory, with p = 0.0001.
Is that meaningful? No. The 8-thread data point is almost right on top of the ideal curve, as you can clearly see in the graph, but statistics would make you believe otherwise! Wrong!
The observed values are just very tightly clustered, and that makes it easy for a statistical test to declare 8 threads significantly worse than ideal.
So this test is not the right one for this purpose. A statistical difference does not always mean a practical/noticeable difference.
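The statistical-vs-practical point is easy to see with a one-sample t statistic: with very tight replicates, even a tiny deviation from the ideal value produces a huge |t| and thus a tiny p. The sample values below are placeholders, not the actual measurements:

```python
import math

def t_statistic(samples, ideal):
    """One-sample t statistic against a fixed ideal value."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    return (mean - ideal) / math.sqrt(var / n)

# Four hypothetical 8-thread indices, all within ~1.5% of the ideal value 2.0
samples = [1.97, 1.98, 1.97, 1.98]
print(f"t = {t_statistic(samples, 2.0):.2f}")  # large |t| despite a negligible deviation
```

With 3 degrees of freedom this |t| corresponds to p well below 0.01, even though the shortfall is practically invisible on the graph.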

Back to the curve fits.
Linear regression over-estimates performance from 12-ish threads on.
The non-linear regression, which fits the data much better, describes a flattening curve which predicts declining per-thread gains as the thread count increases.
I'll see if I can extrapolate this curve up to 32 threads, but it doesn't take much imagination to see that 32 threads will not reach a score of 4, the equivalent of 16 ideal cores.
This brings me back to my earlier statement about 32 threads not being worthwhile.

However, as you said, we need native 24- and 32-core machines to verify this and construct an experimentally derived plateau to prove me wrong, but so far I'm very confident that at 32 threads we're deep into the plateau.
That probably goes for most software, by the way; this is in no way meant to criticize TG's architecture!
I just want to make a well-informed decision about whether to go for a sub-3k dollar render machine or a just-sub-4k one with all the bells and whistles.

Thank you all again!

Tangled-Universe

Quote from: Tangled-Universe on September 04, 2019, 06:07:57 am
blablabla...

Probably like all software, by the way, this is no way meant to criticize TG's architecture!

I wonder how well other renderers perform in this type of test.
I would not be surprised if they aren't any better, which brings me to this idea...

Many online reviews test CPUs for multi-threading performance using benchmarks like Cinebench.
To translate those findings to what you can expect in TG, it might be nice to perform a similar thread test with the Cinebench R20 benchmark?

Oshyan

Quote from: Tangled-Universe on September 04, 2019, 06:39:07 am
Many online reviews test CPUs for multi-threading performance using benchmarks like Cinebench.
To translate those findings to what you can expect in TG, it might be nice to perform a similar thread test with the Cinebench R20 benchmark?
Er, yes, isn't that what I said above?



Quote from: Oshyan on September 03, 2019, 03:34:36 pm
I know other renderers are more efficient with higher thread counts, but it would be good to know just *how* efficient, i.e. what is the realistic target. Maybe we can plot some Cinebench results as well? 
:D

- Oshyan

Tangled-Universe

Oh oops, I completely overlooked those remarks!
Good we're at least on the same page about that!
How do you feel about the updated graph and analysis?

cyphyr

Well as a slight aside I am currently rendering out a preview for a client with two instances of Terragen running in parallel on the same computer (Ryzen - see sig for spec).
Each instance is limited to 12 threads in the Terragen render dialogue.

I'm getting two frames output every 30 sec (super low res/quality) as opposed to one frame every 20 sec.

So definitely useful as long as one has the spare memory.
www.richardfraser.co.uk
https://www.facebook.com/RichardFraserVFX/
/|\

Ryzen 9 3900X @3.79Ghz, 64Gb (TG4 benchmark 6:20)
i7 5930K @3.5Ghz, 32Gb (TG4 benchmark 13.44)

Tangled-Universe

What you are doing makes perfect sense to me. Not a slight aside @cyphyr , but actually quite on topic.
I think your experience fits with the results we found here.
Rendering 1 frame with one instance using 24 threads on 12 physical cores = 20 sec.
Rendering 1 frame per instance with 12 threads on effectively 6 physical cores each = 30 sec per frame.
Of course it's slower per frame, but if you look at the graph, threading inefficiency starts to hurt render times from 12 threads on, while 6 threads is still close to ideal.
The index score at 6 threads is about 1.5 and at 12 threads about 2.4; that's also roughly a 1.5-fold difference.
(Hyperthreading/SMT is irrelevant here, since its performance gain is the same for each TG instance, so it effectively cancels out.)
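As a quick sanity check on the throughput arithmetic, using the numbers from @cyphyr's post:

```python
# One instance, 24 threads: 1 frame per 20 s
single = 1 / 20.0  # frames per second
# Two instances, 12 threads each: 2 frames per 30 s
dual = 2 / 30.0

print(f"single-instance throughput: {single * 60:.1f} frames/min")
print(f"dual-instance throughput:   {dual * 60:.1f} frames/min")
print(f"speed-up: {dual / single:.2f}x")
```

So even though each frame takes 50% longer, total throughput is about 1.33x higher, which is exactly the kind of gain you'd expect when the per-instance thread count sits on the efficient part of the scaling curve.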

archonforest

September 05, 2019, 09:22:12 am #29 Last Edit: September 05, 2019, 10:08:56 am by archonforest
Very interesting topic. I am shocked by the graph. Never thought about this problem. I guess TG will have to be optimized for these insane new core counts. I am downloading the Cinebench test.
Dell T5500 with Dual Hexa Xeon CPU 3Ghz, 32Gb ram
Amiga 1200 8Mb ram, 8Gb ssd