In the process of preparing a new system build for TG/3D I'm kind of on the fence about whether to wait for the Ryzen 3950X or the later-to-be-released 32-core Threadripper.
I'm well aware that a lot of software is not capable of fully utilizing that many cores, but that differs for each piece of software.
I'd like to ask a favor of all owners of systems with 16 physical cores or more (preferably!): to perform a scalability test for TG using the Terragen 4 benchmark scene.
I have been wondering how valuable and insightful it would be to also test SMT/hyperthreading at these core counts, but for now I think that is more relevant to CPU testing than software testing, and I'd like to keep things as simple as possible to begin with.
I'd like to plot render times for rendering with 4, 8, 12, 16, 24, 28 and 32 physical cores - if such systems are available and people feel up to it, of course!
You can do this by setting max threads in the render settings to the numbers above, but please do not render with more threads than the physical cores available in your system (see above).
As usual with benchmark testing, don't change anything else.
It would be great if around 8 people would manage to do this, so I can plot average and standard deviations.
Please report your CPU model and clock speeds; they are not critical, since I know how to normalize this kind of data thanks to my day job in a lab.
However, you never know with Intel/AMD's differences, so it might be useful in the end.
Constructing such a curve would be very informative for people considering building a hefty render machine who still want to hit the sweet spot for price/performance.
At some point more cores are not helpful anymore and a potential waste of money.
I'm shooting for a system with 4 or 8 more cores than TG can handle, so I can still do other relatively demanding tasks while rendering with the optimal thread count.
Thanks in advance to everyone willing to invest time - I know I'm asking for quite a bit of it! - in helping me with this, and I'm happy to hear any discussion on this endeavour!
Cheers,
Martin
I'd be happy to help - I guess you'll provide a test .tgd for the exercise.
I have a 16 core Threadripper and like you, have heard of scalability issues on the newer AMDs with more cores.
As a side note, I recently tried a Realflow fluid sim on a 16 core / 32 thread 1950X and it ran surprisingly slowly. It was only when I restricted the threads to 16 that it ran at the speed I would have expected.
Thanks!
Probably due to TL;DR effects you missed that I intend to use the Terragen 4 benchmark scene for it:
https://planetside.co.uk/terragen-4-benchmark/
The reason I'd like to use the official benchmark is that whatever results come out of it, it will also somehow relate/translate to what's already in the database.
The issue with the database itself for investigating scalability is that each user renders the benchmark full throttle, of course!
But all these results differ because of different CPU builds/speeds, RAM builds/speeds and what not. These matter, but they tell you about the machine rather than about TG.
These factors all cancel out when, on one machine, you render with 4 cores as a baseline and then increase the core count.
In an ideal situation render time should halve with each core doubling, giving a straight line in the graph.
The 4-core base render time incorporates all these differences in CPU/RAM model/speed etc., so normalization of these variables happens all at once.
The only potentially wrong assumption I'm making is that TG scales very well from 1 to 4 threads, but I don't want people to go through the long process of rendering the benchmark at 1 and 2 cores.
No probs - should take just over a couple of hours to run on my machine I think - I'll set up the files to run with a render manager (to get some accurate stats).
Here you go:
4 threads 17m 38s
8 threads 9m 43s
12 threads 7m 29s
16 threads 6m 46s
24 threads 6m 11s
64 threads 6m 06s (just left to run unchanged)
Ryzen 9 3900X oc'd @ 4.15 GHz
64 GB RAM
There is up to 40s+ variance on the benchmark (going on previous tests running at the default 64 threads).
The fastest I have had it render is 4m 55s, the slowest about 6m 40s, with the same system settings.
Thanks guys!
@cyphyr Richard, thanks for kicking off this test! :)
Your CPU is a 12-core capable of SMT/hyperthreading. I will use your results up to 12 threads for the main purpose, since SMT/hyperthreading complicates matters quickly, but I definitely appreciate that you tried it!
I think SMT/hyperthreading is probably a large cause of the variance you describe.
Let's wait for more results, but yours are already interesting!
Here is some additional data. If I understood correctly you wanted SMT/Hyper-Threading turned off to see just physical core performance differences, so I turned off SMT on my Threadripper 1950X and got these numbers rendering the TG4 Benchmark Scene:
16 Cores / 16 Threads: 7:01
12 Cores / 12 Threads: 9:28
8 Cores / 8 Threads: 12:23
4 Cores / 4 Threads: 22:56
Here you go - very interesting results, not what I'd expected, but looking at cyphyr's results they seem to follow a similar pattern: (and just seeing D A Bentley's results similar with Hyperthreading turned on)
Machine details:
AMD Threadripper 1950X 16 cores / 32 threads - 3.7 GHz overclock
96 GB Corsair DDR4 Vengeance LPX 3000MHz
Windows 10 Professional
Bios and Chipset drivers up to date
SMT (Hyperthreading) Enabled
Terragen 4.4.18 Frontier
Rendered via command line render queued in Deadline
Threads Time Peak RAM CPU load
32 06m56s 7.884 GB 100% (control render)
16 08m51s 5.687 GB 51%
12 08m36s 5.249 GB 39%
8 11m51s 4.344 GB 26%
4 21m57s 3.238 GB 14%
Quote from: Tangled-Universe on August 29, 2019, 09:09:32 AM
In an ideal situation render time should halve with each core doubling, giving a straight line in the graph.
Obviously not from these results; it follows the curve from 4 to 8 threads, but I'm surprised at the results from 12 to 16!
I seem to remember render times being a bit more predictable on the last project I rendered, but I wasn't using Defer all or Path tracing.
The test scene uses Defer All - I wonder if that's a factor. I might try the same test with the Terragen 3 benchmark to see if the results are similar.
My results were with hyper-threading turned OFF (AMD calls it SMT).
Also used Terragen 4.4.18 Frontier and have 128GB RAM, but only 9GB was needed.
-Derek
Quote from: digitalguru on August 29, 2019, 02:22:42 PM
(and just seeing D A Bentley's results similar with Hyperthreading turned on)
I could have phrased that better :)
Here are my results:
- 32 cores - 00h08m29s HyperThreading ON / all other sessions with HT OFF
- 16 cores - 00h09m08s
- 12 cores - 00h11m22s
- 08 cores - 00h16m32s
- 06 cores - 00h20m30s
- 04 cores - 00h27m59s
- 02 cores - 00h51m54s
Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz / 2 processors
Installed Memory RAM - 64GB
Windows 7 Ultimate SP1 64bit
I disabled the cores in the BIOS, so I restarted every time before rendering.
Hope that is useful for you.
CHeers, Klaus
ps: this test reminded me why I gave up on TG for a while when I only had a quad-core Pentium CPU (laptop) to work with.
TG3.x was sloooow but when I installed the Beta version of TG4.x it felt like it did not move at all.
I "saw the light again" after building this Dual Xeon machine a few years back...still happy with it for my needs.
Thanks guys!!
Keep them coming please :) :) :)
Also, please, keep hyperthreading/SMT out of the tests, unless curiosity is irresistible :P It will save you time as well.
@KlausK, why did you disable cores in the BIOS for these tests? If you have 16 physical cores and set max threads in the renderer to 12, then you will utilize 12 of your 16 cores. No BIOS tweaking needed, I'd say, but thanks a lot for your effort and willingness to do this so exactly!
@D.A. Bentley, thanks for running those tests again without SMT! Intel's hyperthreading and AMD's SMT aim to achieve the same thing, but performance differences between the two implementations could potentially skew the results of these tests, because they would make TG seem more capable of running many threads on Intel or on AMD (I don't know which of the two has the better implementation). That would be more meaningful for testing CPUs than for testing TG's capability of managing threads.
@digitalguru , your remarks about defer all/PT are interesting. I hope these aspects are not at play too much. As long as people render the same benchmark scene this is all fine. If it turns out that certain aspects of the renderer are multithreaded more/less efficiently then all we can say is that the render thread scalability data is valid for this benchmark.
Foremost, it would mean that the renderer needs work, and the results so far suggest that's definitely needed. Let's wait and see.
Thanks all again!
"...but thanks a lot for your effort and willingness to do this so exact!" Exactly ;)
I restarted in between renders anyway so...
CHeers, Klaus
It would indeed be interesting to see if different render methods scale better than others. We can probably do this with the existing TG4 benchmark scene. It doesn't matter so much what the absolute time differences are (i.e. PT takes a lot longer than the Standard renderer), what matters is the relative difference between # of cores used. So we should be able to test Standard, Defer All, and PT all with the same scene.
I would say this is relevant to but separate from Martin's request here. If people are similarly willing to run some tests (I would limit the core tests even further than Martin probably, since 3 rendering methods need to be tested), then I will start a new thread to plan and discuss results. "Like" this post if you are interested in participating!
- Oshyan
Thank you all for your time and effort!
Contrary to good scientific conduct, I'll say the 4 measurements performed so far are already pretty clear and don't raise doubts that would warrant more tests.
I plotted all non-hyperthreaded/non-SMT data in a graph.
Please note that n=1 for 2 and >16 threads.
Error bars are standard deviation.
What you see is that for each of the 4 participants I set the 4-thread render time to a performance index of 1; the index for n threads is the 4-thread render time divided by the n-thread render time.
With more threads you render in fewer seconds, so the higher the value on Y, the higher the performance.
In an ideal, but far from realistic, world each thread doubling should cut render time in half.
By normalizing the render time in seconds to 4 threads I cancel out all performance-affecting variables present in the machine.
I hope people understand this concept of normalization.
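For anyone wanting to reproduce this, the normalization is trivial - a minimal sketch in Python, using Richard's posted times as an example:

```python
# Performance index = (4-thread time) / (n-thread time), so every machine
# is only compared against itself and hardware differences cancel out.
times = {4: 17*60 + 38, 8: 9*60 + 43, 12: 7*60 + 29, 16: 6*60 + 46}  # cyphyr's times, in seconds

base = times[4]
for threads in sorted(times):
    print(f"{threads:>2} threads: index = {base / times[threads]:.2f}")
# -> 1.00, 1.81, 2.36, 2.61
```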
Conclusions:
1) Pretty good scalability up to 12 threads.
2) The plateau starts at 16 threads and is fully reached at 24 threads (which fits with some previous forum discussions).
3) If you like rendering with 16 threads on 16 physical cores and still want to use your workstation for other tasks, then it can pay off to have a workstation with >16 physical cores; but for TG itself the return on investment of using >16 threads would be minimal.
Discussion:
Moore's Law is in an undead state: the industry's performance increases can now mostly be attributed to multi-core CPUs and less to clock speed/IPC improvements/die shrinkage, the classic mechanisms that drove Moore's Law. IPC improvements occur every generation but do not follow Moore's Law, clock speed hasn't for a while either, and die shrinking is way behind past roadmaps (EUV).
Therefore, as long as you wish to argue Moore's Law is still alive, it is mostly due to increased core count, and to make good use of that you need well-parallelized software.
In the future TG's architecture should scale up better, perhaps up to 32 cores.
I wonder whether TG makes good use of SIMD and other instruction set extensions like AVX and such? Luc asked about that some 10 years ago, I remember.
Yes, previous discussion here hinted at possible performance differences between the 3 render methods currently present, but those were outside the scope of this test.
I could then add 2 more lines to the graph below, but expect the testing to take much longer (especially for PT).
We can perform these tests again if anyone is interested, but I don't expect vast differences in scalability; that would be weird/unexpected. Unless Matt incorporated different coding concepts throughout the years?
Perhaps that touches on my question of how well TG uses SIMD and the like, and whether there's a way to improve thread scalability to make good use of where the CPU industry is heading.
I'm wondering if there is any difference between limiting cores in the BIOS vs. directly in Terragen? It would be a bit more work going into the BIOS to limit the # of cores used.
-Derek
The max threads setting is not a soft limit. TG really does not create more than the set number of render threads, and right now I don't see how a BIOS-disabled core would differ from simply not allowing TG to use it.
Given this data I would not think so either... Can you tell from the graph which system had the cores disabled in the BIOS?
There's some variation at 16 threads, but otherwise it's all pretty much on top of each other.
Hi Martin,
Thanks for putting this together. I will reply in more detail later, but there are a couple of things I wanted to say first, which may help to get a more accurate conclusion.
Quote from: Tangled-Universe on September 03, 2019, 08:40:18 AM
The max threads setting is not a soft limit. TG really does not create more than the set number of render threads, and right now I don't see how a BIOS-disabled core would differ from simply not allowing TG to use it.
Given this data I would not think so either... Can you tell from the graph which system had the cores disabled in the BIOS?
There's some variation at 16 threads, but otherwise it's all pretty much on top of each other.
Terragen uses Embree, and Embree spawns a large number of threads which are not limited by TG's thread settings. I don't know how this will affect performance compared to limiting threads in the BIOS, but I think this should be tested before drawing a conclusion. We also don't know what else is affecting performance that is independent of the TG thread settings but dependent on the number of physical cores.
Shouldn't your graph go through (0,0)? This is an implicit data point and I don't know what standard scientific practice recommends here, but it seems to me that we know for sure that the line must go through the origin; therefore the curve you fit to the data is not plausible if it does not. It also seems excessively curved to miss the first 3 data points, and those are the ones we expect to be the most reliable. If it went through (0,0) I would not mention this.
We definitely have a scalability problem and I don't mean to deflect away from that, I just want to be more confident about this curve.
Hi Matt,
Thanks for replying to this thread and glad you appreciate the effort!
I'm not sure either about Embree's thread handling vs TG.
Given the results I'd tend to say that Embree respects the "host's" thread limit, if that makes sense.
In any case, I still feel confident that limiting threads in the BIOS does not make much of a difference; otherwise the normalized data for that machine would be different at each point on the curve.
About the curve fit... I plotted the R-squared to indicate how well the curve is defined by the input data.
As you may know, a value of 1 means the data is described perfectly, so this one comes pretty close to it, at about 0.95.
In this case, about 5% of the variance in performance is left unexplained when reading from the curve.
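(For those less familiar with it: R-squared is just the fraction of variance explained by the fit - a minimal sketch:)

```python
def r_squared(y_obs, y_fit):
    """Coefficient of determination: the fraction of variance in the
    observations that is explained by the fitted curve."""
    mean = sum(y_obs) / len(y_obs)
    ss_tot = sum((y - mean) ** 2 for y in y_obs)               # total variance
    ss_res = sum((y - f) ** 2 for y, f in zip(y_obs, y_fit))   # residual variance
    return 1.0 - ss_res / ss_tot
```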
The reason I wanted this curve is to make it easier to visualize the plateau, since that's the intention of this topic: where does TG's multi-threading performance flatten out?
I don't know by heart which curve-fit I used here, but I will check tomorrow.
I do know that I can force the curve to go through the point for 2 threads, but that has just 1 measurement instead of the point for 4 threads, which has 4.
If you are really concerned about this fit, which I'm not really, then we would need 3 more data points for rendering with 2 threads.
Ideally you would do it with 1 thread as well then, but people probably have better things to do :P
I don't see a clear rationale for why it must go through the origin; such a point on the curve is literally meaningless and not required by all fitting algorithms.
Changing the fitting to accommodate all of this will not affect the plateau - or would you think otherwise?
The simplest answer to your remark: yes it should go through the origin if you fit it linearly.
I can fit it with linear regression, but the fit would be bad and not help visualizing the plateau.
These fitting discussions are especially important when you wish to interpolate the data. For empirical data 0.95 is still pretty good then, although generally we wish for >0.98.
All that this R-squared tells you then is that you can be confident that your interpolation is accurate, but again this is not the scope of this endeavor.
Uh... unless I am mistaken, there were no *actual* 32-core machines used in this test. 32 threads, yes, but that uses hyperthreading on all the CPUs I saw reported. So... of course performance plateaus at the number of physical cores, since hyperthreading can only ever add a max of ~20% *total* performance in typical usage scenarios. Or am I missing something? :D
Edit: in fact we have a possible conundrum. On the current Terragen 4 benchmark there is only one entry with what would be 32 physical cores. The problem with that CPU, the 2990WX, is that it is known to have some scalability issues under Windows (still, I believe). What we ideally need is someone with either dual Xeons or dual Epycs of at least 16 cores each. Probably best if it's a dual Xeon machine, just because they're a more known quantity, so to speak.
- Oshyan
I wrongly labeled KlausK's machine as native 32 cores, but indeed @Oshyan it's 16 cores!
I will remove it in the next graph, but that will not make a difference - if anything it will make matters worse.
You can already tell by the look of the graph, since the 16-thread data point sits slightly below the current fit.
The new graph will therefore have a more pronounced plateau compared to the current one.
@Oshyan: yes, hyperthreading adds a max of ~20% performance gain, but it can differ per CPU model/brand and what not. That's why I requested people not use it, because then the normalization no longer works.
That may be reflected in this graph, but for that we indeed would need a true 32 core machine. As you all pointed out, yes.
Yes, performance at 16 cores and above would seem to be leveling off, which is not surprising. But with the current graphed result I don't know that I'd conclude that 32 threads is just not worthwhile for Terragen.
I know other renderers are more efficient with higher thread counts, but it would be good to know just *how* efficient, i.e. what is the realistic target. Maybe we can plot some Cinebench results as well?
- Oshyan
Yes, it would be a premature conclusion at this stage to say 32 threads are not worthwhile, but I made that statement in my original post, not after you pointed out to me the flaw of including the 24 and 32 thread results ;)
However, I'm still confident that it holds true once we can get our hands on true 24 and 32 core results. More on that later.
Please look at this improved graph where I excluded the 24 and 32 thread data.
The flattening of the curve improves, logically, but there's way more to this than you may think. More on that below.
Then, using linear regression forcing the line through the origin to satisfy @Matt ;) we get a black line trying to describe our data linearly.
The slope of the ideal curve is y/x = 1/4 = 0.25, and this is also what my software tells me (GraphPad Prism).
The measured slope of our data = 0.1949
Say we assume the linear fit is robust and fine; then we could extrapolate safely to predict the performance at any given number of threads, and back-calculate the predicted render time for that thread count.
So let's fill in 32 threads:
Ideal: 32 x 0.25 = 8
Ours: 32 x 0.1949 = 6.2368
Then calculate the expected render time: each participant's 4-thread render time in seconds divided by the performance index, then averaged (+/- standard deviation) over the 4 participants:
Ideal: 170 seconds +/- 32 seconds
Ours: 218 seconds +/- 41 seconds
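For anyone who wants to check these numbers, a minimal sketch (using the averaged performance indices of the 4 participants and the 4-thread times posted earlier in the thread):

```python
# Through-origin least squares and back-calculated 32-thread render times.
threads = [4, 8, 12, 16]
index = [1.0, 1.803, 2.448, 2.854]      # averaged performance indices (non-SMT data)
t4 = [1058, 1376, 1317, 1679]           # each participant's 4-thread time in seconds

# Least squares with the intercept fixed at 0: slope = sum(x*y) / sum(x*x)
slope = sum(x * y for x, y in zip(threads, index)) / sum(x * x for x in threads)
print(f"measured slope: {slope:.4f} (ideal: 0.25)")             # ~0.195

# Predicted index at 32 threads, then expected render time per participant
idx32 = 32 * slope
times = [t / idx32 for t in t4]
mean = sum(times) / len(times)
sd = (sum((t - mean) ** 2 for t in times) / (len(times) - 1)) ** 0.5
print(f"expected 32-thread time: {mean:.0f} s +/- {sd:.0f} s")  # ~218 +/- 41
```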
Hurray! Yes? No!?
No! Why no Martin, you critical sucker!?
Linear regression is not fitting our data well (RMSE = 0.25, which is high given that the dependent variable is of order 1, the normalized value).
You can clearly see that the linear regression under-estimates performance at low thread counts and over-estimates performance at higher thread counts, starting at around 12 cores.
That's not subjective, that's objective.
The linear fit intersects the non-linear exponential fit just after 12 cores.
I'm too lazy to calculate the exact intersection, but it looks to be at about 13 threads; we cannot buy such CPUs, and 13 is closer to 12 than to 16, so saying the decline starts at 12-ish makes the discussion easier.
Or another way: I can perform a t-test on the 12-thread data point and show it is significantly worse than ideal.
Probably the 8-thread point would be statistically different too.
In fact, 8 threads is already enormously statistically different from theory, with p = 0.0001.
Is that meaningful? No. The 8-thread data point is almost right on top of the ideal situation - you can clearly see that in the graph - but statistics would make you believe otherwise! Wrong!
The observed values are just very tightly grouped, which makes it easy for a statistical model to isolate the 8-thread point and tell you it is significantly worse than the ideal situation.
So this test is not the right one for this purpose. Statistical difference does not always mean practical/noticeable difference.
Back to the curve fits.
Linear regression over-estimates performance from 12-ish threads on.
The non-linear regression, which fits the data much better, describes a flattening curve which predicts declining gains as the thread number increases.
I'll see if I can extrapolate this curve up to 32 threads, but it doesn't take a lot of imaginative power to realize that 32 threads will not reach a score of 4, the equivalent of 16 ideal cores.
This brings me back to my prior statement about 32 threads not being worthwhile.
However, as you said, we need native 24 and 32 core machines to verify this and construct an experimentally derived plateau to prove me wrong, but so far I'm very confident that 32 threads is a dead plateau.
Probably like with all software, by the way - this is in no way meant to criticize TG's architecture!
I just want to make a very informed decision about whether to go for a sub-3k dollar render machine or a just-sub-4k one with all the bells and whistles.
Thank you all again!
Quote from: Tangled-Universe on September 04, 2019, 06:07:57 AM
blablabla...
Probably like with all software, by the way - this is in no way meant to criticize TG's architecture!
I wonder how well other renderers perform in these types of tests.
I would not be surprised if they aren't any better, which brings me to this idea...
Many online reviews test CPUs for multi-threading performance using benchmarks like CineBench.
To translate those findings to what you can expect in TG, it might be nice to perform a similar thread test with the CineBench 20 benchmark?
Quote from: Tangled-Universe on September 04, 2019, 06:39:07 AM
Many online reviews test CPUs for multi-threading performance using benchmarks like CineBench.
To translate those findings to what you can expect in TG, it might be nice to perform a similar thread test with the CineBench 20 benchmark?
Er, yes, isn't that what I said above?
Quote from: Oshyan on September 03, 2019, 03:34:36 PM
I know other renderers are more efficient with higher thread counts, but it would be good to know just *how* efficient, i.e. what is the realistic target. Maybe we can plot some Cinebench results as well?
:D
- Oshyan
Oh oops, I completely overlooked those remarks!
Good we're at least on the same page about that!
How do you feel about the updated graph and analysis?
Well as a slight aside I am currently rendering out a preview for a client with two instances of Terragen running in parallel on the same computer (Ryzen - see sig for spec).
Each instance is limited to 12 threads in the Terragen render dialogue.
I'm getting two frames output every 30 sec (super low res/quality) as opposed to one frame every 20 sec.
So definitely useful as long as one has the spare memory.
What you are doing makes perfect sense to me. Not a slight aside, @cyphyr, but actually quite on topic.
I think your experience kind of fits with the results we found here.
Rendering 1 frame with 24 threads on 12 physical cores = 20 sec.
Rendering 1 frame with 12 threads on 6 physical cores = 30 sec per frame.
Of course it's slower per frame, but if you look at the graph, threading inefficiency starts to hurt render times from 12 threads on, while 6 threads still seems close to ideal.
The index score at 6 cores is about 1.5 and at 12 threads about 2.4; that's also around a 1.5-fold difference.
(Hyperthreading/SMT is irrelevant here, since its performance gain is the same for each TG instance, so it effectively cancels out.)
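As a quick back-of-the-envelope check of the throughput gain in Richard's numbers:

```python
# One full-thread instance: 1 frame / 20 s; two 12-thread instances: 2 frames / 30 s
single = 1 / 20                  # frames per second
dual = 2 / 30
print(f"throughput gain from two instances: {dual / single:.2f}x")  # ~1.33x
```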
Very interesting topic. I am shocked about the graph. Never thought about this problem. I guess TG will have to be optimized for these new insane amount of cores. I am downloading the Cinebench test.
Jumped to it just now. Looks like more is not always better...! :o
Quote from: archonforest on September 05, 2019, 09:22:12 AM
Very interesting topic. I am shocked about the graph. Never thought about this problem. I guess TG will have to be optimized for these new insane amount of cores. I am downloading the Cinebench test.
Not so shocking, I think. I don't have the impression other 3D software packages will do that much better. Pretty much all CPU renderers use some kind of bucket approach and are thus intrinsically limited in how well everything can be parallelized.
It can be improved, like Matt said, but I'm not shocked by it.
Quote from: N-drju on September 05, 2019, 12:30:55 PM
Jumped to it just now. Looks like more is not always better...! :o
I think 16 cores is still a good choice for rendering and perhaps 24 cores is still justifiable at this moment. Why not design your TG scene while another is rendering ;)
I hope AMD will release a 24-core Threadripper; otherwise I will have to see whether I should go for the 16-core 3950X or the 32-core Threadripper and how well I can utilize the excess cores. It depends a bit on price, but this little test people helped me with definitely helped me understand how the renderer is multi-threaded and what I should look for.
Interesting. When I render something and look at what the CPU cores are doing, I usually get an ALL 100 percent busy report. But are they really busy? Hehehe... or just slacking off? :)
Quote from: Tangled-Universe on September 05, 2019, 12:50:49 PM
I think 16 cores is still a good choice for rendering and perhaps 24 cores is still justifiable at this moment. Why not design your TG scene while another is rendering ;)
Oh, yes. That would be sweet. Or at least launch your favorite game while you wait!
I like the updated curve - it is a much better fit for the data. The comparison with the green theoretical line really says a lot about the performance falloff.
Quote from: archonforest on September 05, 2019, 02:00:22 PM
Interesting. When I render something and look at what the CPU cores are doing, I usually get an ALL 100 percent busy report. But are they really busy? Hehehe... or just slacking off? :)
Oh yes they are absolutely busy and definitely not slacking off!
TG always fully utilizes your CPU, no worries about that.
Thread scalability is different. It's a "measure" of how well the software is capable of performing its tasks on an increasing number of cores.
Perfect software, pretty much non-existent, scales like the green line in my graph. For each core doubling the duration of the task is cut in half.
That's not reality though and pretty much all multi-threaded software has certain intrinsic limits to how well its tasks can be split over threads.
For instance, TG renders in buckets, and each bucket needs information from other buckets. This exchange of, and/or access to, information - like the GI cache - can suffer from certain bottlenecks: multiple threads accessing the same cache or the same parts of memory, things like that. In the end your CPU is still super busy, but inside the software things are not running efficiently across multiple threads. (Perhaps that's not an entirely correct way to explain it.)
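The textbook way to reason about this kind of falloff is Amdahl's law - my framing here, not a statement about how TG's renderer is actually structured. If even a small fraction of the work is effectively serial (shared cache access, bucket bookkeeping), the achievable speedup flattens out quickly:

```python
def amdahl_speedup(n_threads, parallel_fraction):
    """Ideal speedup over 1 thread when only `parallel_fraction` of the work parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_threads)

# Even with 95% of the work perfectly parallel, 32 threads give ~12.5x, not 32x:
for n in (4, 8, 16, 32):
    print(f"{n:>2} threads: {amdahl_speedup(n, 0.95):.2f}x")
```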
Quote from: Matt on September 05, 2019, 04:09:18 PM
I like the updated curve - it is a much better fit for the data. The comparison with the green theoretical line really says a lot about the performance falloff.
Thanks :) The exponential fit is actually only ~2 percentage points better than the previous one, but at least we're 100% sure now that it is constructed from correct data!
I do not mean to demotivate you by showing or discussing the falloff! I discuss it from the point of view of the graph.
Neither do I mean to demotivate by discussing in length what the graph tells us.
It's just that I took a lot of time and text to convince Oshyan that my initial interpretation and statement was not pulled out of thin air.
Perhaps my initial statement that 32 cores are not justifiable seemed bold and premature, but this data supports it and there's nothing I can do to change that. I'm sorry.
About extrapolation, if I still may:
I exported the non-linear regression curve to .csv and imported it into this online tool: http://www.xuru.org/rt/NLR.asp
Then I limited the parameters to 2.
The best fit it returned for the curve was this formula: y = 7.326788565 x / (x + 24.53756167)
If we back-fit our own data we get this:
4 threads = 1.027 => measured = 1
8 threads = 1.801 => measured = 1.803
12 threads = 2.405 => measured = 2.448
16 threads = 2.892 => measured = 2.854
The back fit is OK, but not utterly fantastic, as it under-estimates 12 threads and over-estimates 16 threads.
The 16 threads point has the highest variation, so any extrapolation from this wobbly point will have a relatively large uncertainty range.
Knowing this and proceeding into the unknown:
24 threads = 3.623
32 threads = 4.147
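(A small sketch that reproduces both the back-fit and the extrapolation from that formula:)

```python
def perf_index(threads):
    # 2-parameter saturation curve returned by the online tool (xuru.org)
    return 7.326788565 * threads / (threads + 24.53756167)

for n in (4, 8, 12, 16, 24, 32):
    print(f"{n:>2} threads: predicted index = {perf_index(n):.3f}")
# -> 1.027, 1.801, 2.405, 2.892, 3.623, 4.147
```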
Given the original curve (a pity that my software does not give me the fitting formula, so we need this online tool!) this seems a little over-estimated.
If I extend the curve with the mind's eye I cannot imagine it exceeding a score of 4 at 32 threads.
Anyway... let's be gentle this time and say that 32 threads scores 4.147 out of 8.
It's up to each person to decide whether such efficiency is worth it, and perhaps I should have refrained from stating it's not justifiable to buy a 32-core; but given the results I think it was not completely unfair to do so. You buy 32 but get the performance of 16 - that's what the data tells us thus far. In the end that's why I wanted to do this test.
Like I said, I may still end up with a 24-core machine, if such a CPU is released soon. Or perhaps even 32 cores, but then at least I know what to expect and can think about what else to do with the excess cores. Running 2 instances of TG, like Richard does, is not a bad idea at all!
In some way I regret this test as I feel a bit like I'm bringing bad or unwanted news here, while I have different intentions.
I just want to clarify and emphasize that my intention with this topic is gathering data to make an informed buying decision.
Not to criticize the software or open up a can of worms about its performance.
And definitely not to make its creator feel bad about it!
I genuinely think it's not that much better with other renderers. Just check the Vray benchmark database and perhaps more.
Multi-threading is challenging. Just look at genomic assembly, for instance: by itself it's not hugely complex (churning and chopping A, C, T and G's into little bits and mapping them to a static reference genome), and yet multi-threading it is far from perfect. Let alone a renderer, which deals with so many dependencies in scene space and then chops that up into buckets which need low-level scene-space data, but also high-level neighbouring-bucket data, for their own calculations.
It's vastly complicated, and definitely a testament to Matt's achievement so far, I think.
Cheers to that!
I understood the extrapolation from the beginning. I just think, in a world where 32 core machines (if not CPUs) are readily available, why extrapolate when we can just test? OK, yes, we do not have someone right here ready to test for us, but the only way to really know is to find such a person. So I'd rather focus on that than debate about the extrapolation because *only* that can give us the real data/answer. All the rest of this seems like neat statistics and math to me but... is still just extrapolation and needs testing anyway. The only time I'd really settle for relying on extrapolation is if we are trying to project what a non-existing CPU might be like, for example 64 core or even 128.
Also, while you may "only" get a 16-core equivalent from a 32-core machine, the 16-core machine does not give you idealized 16-core performance either! So while the relative efficiency may be less and less, with the 32 you are getting what you wanted from the 16. So OK, it could be better, but 32 is still > 16. Of course it is a personal choice whether the price premium of 32 cores is worth it.
Personally I'd go for the 3950x given total system cost, but I knew that before this test. :D
- Oshyan
There's no need to feel bad about this or worry about how I'll feel about it, Martin. But thank you for mentioning that other software has problems like these and that it's difficult work. Thanks :) I welcome this kind of study, especially if it leads to discovering where we can make improvements. I now have a better idea of how inefficient it is.
I look forward to seeing more data if we can get it.
I think another point to bear in mind is RAM quantity and speed (actually I "think" speed is only relevant from a benchmark perspective and would not make much of a noticeable difference in real-world rendering).
The ONLY reason I am able to run two instances of Terragen on my system is that I had enough RAM (64GB) to load all the assets twice (there are a LOT of assets in the scene).
I should do some tests with the benchmark scene probably when I get the chance.
The point being that the extra cores do not necessarily give you the extra rendering cycles if you don't have the RAM overhead to support it.
Also, there is the fact that RAM now costs more, as a proportion of the system build, than the CPU. At a certain point the investment may as well go into a second (or third) PC.