Help requested for Terragen render thread scalability test

Started by Tangled-Universe, August 29, 2019, 06:38:35 AM


N-drju

Jumped to it just now. Looks like more is better not always...! :o
"This year - a factory of semiconductors. Next year - a factory of whole conductors!"

Tangled-Universe

Quote from: archonforest on September 05, 2019, 09:22:12 AM
Very interesting topic. I am shocked about the graph. Never thought about this problem. I guess TG will have to be optimized for this new insane number of cores. I am downloading the Cinebench test.

Not so shocking, I think. I don't have the impression other 3D software packages will do that much better. Pretty much all CPU renderers use some kind of bucket approach and are thus intrinsically limited by how well it all can be parallelized.
It can be improved, like Matt said, but I'm not shocked by it.

Tangled-Universe

Quote from: N-drju on September 05, 2019, 12:30:55 PM
Jumped to it just now. Looks like more is better not always...! :o

I think 16 cores is still a good choice for rendering and perhaps 24 cores is still justifiable at this moment. Why not design your TG scene while another is rendering ;)

I hope AMD will release a 24-core Threadripper. Otherwise I will have to choose between the 16-core 3950X and the 32-core Threadripper and see how well I can utilize the excess cores. It depends a bit on price, but this little test people helped me with definitely helped me understand the way the renderer is multi-threaded and what I should look for.

archonforest

Interesting. When I render something and look at what the CPU cores are doing, I usually get an ALL 100 percent busy report. But are they really busy? Hehehe... or just slacking off? :)
Dell T5500 with Dual Hexa Xeon CPU 3Ghz, 32Gb ram, GTX 1080
Amiga 1200 8Mb ram, 8Gb ssd

N-drju

Quote from: Tangled-Universe on September 05, 2019, 12:50:49 PM
I think 16 cores is still a good choice for rendering and perhaps 24 cores is still justifiable at this moment. Why not design your TG scene while another is rendering ;)

Oh, yes. That would be sweet. Or at least launch your favorite game while you wait!
"This year - a factory of semiconductors. Next year - a factory of whole conductors!"

Matt

I like the updated curve - it is a much better fit for the data. The comparison with the green theoretical line really says a lot about the performance falloff.
Just because milk is white doesn't mean that clouds are made of milk.

Tangled-Universe

Quote from: archonforest on September 05, 2019, 02:00:22 PM
Interesting. When I render something and look at what the CPU cores are doing, I usually get an ALL 100 percent busy report. But are they really busy? Hehehe... or just slacking off? :)

Oh yes they are absolutely busy and definitely not slacking off!
TG always fully utilizes your CPU, no worries about that.

Thread scalability is different. It's a "measure" of how well the software is capable of performing its tasks on an increasing number of cores.
Perfect software, which is pretty much non-existent, scales like the green line in my graph. For each doubling of the core count, the duration of the task is cut in half.
That's not reality though and pretty much all multi-threaded software has certain intrinsic limits to how well its tasks can be split over threads.
For instance, TG renders in buckets and each bucket needs information from other buckets. This exchange of information and/or access to information, like the GI cache, can suffer from certain bottlenecks. Multiple threads accessing the same cache or the same parts of memory, stuff like that. In the end your CPU is still super busy, but in the software itself things are not running efficiently across multiple threads. Perhaps the way I explain this is not entirely correct.
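
To put a number on the difference between "busy" and "scaling well", here is a minimal Python sketch. It assumes the relative scores reported in this thread (normalised so that 4 threads = 1) and treats the green line as perfect linear scaling. The names `ideal_score` and `efficiency` are purely illustrative and have nothing to do with TG itself.

```python
# Relative render scores reported in this thread, normalised so 4 threads = 1.
MEASURED = {4: 1.0, 8: 1.803, 12: 2.448, 16: 2.854}
BASELINE_THREADS = 4


def ideal_score(threads, baseline=BASELINE_THREADS):
    """Perfect scaling: doubling the thread count doubles the score."""
    return threads / baseline


def efficiency(threads, score, baseline=BASELINE_THREADS):
    """Fraction of the ideal score that was actually achieved."""
    return score / ideal_score(threads, baseline)


for threads, score in MEASURED.items():
    print(f"{threads:2d} threads: ideal {ideal_score(threads):.2f}, "
          f"measured {score:.3f}, efficiency {efficiency(threads, score):.0%}")
```

Every core can report 100% busy on each of those rows while the efficiency figure still drops, which is the distinction between CPU utilisation and thread scalability.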

Tangled-Universe

Quote from: Matt on September 05, 2019, 04:09:18 PM
I like the updated curve - it is a much better fit for the data. The comparison with the green theoretical line really says a lot about the performance falloff.

Thanks :) The exponential fit is actually only ~2%-point better than the previous one, but at least we're 100% sure now that it is constructed using correct data!

I do not mean to demotivate you by showing or discussing the falloff! I discuss it from the point of view of the graph.
Neither do I mean to demotivate by discussing at length what the graph tells us.
It's just that I took a lot of time and text to convince Oshyan that my initial interpretation and statement were not pulled out of thin air.
Perhaps my initial statement that 32 cores are not justifiable seemed bold and premature, but this data supports it and there's nothing I can do to change that. I'm sorry.

About extrapolation, if I still may:
I exported the non-linear regression curve to .csv and imported it into this online tool: http://www.xuru.org/rt/NLR.asp
Then I limited the parameters to 2.
The best fit it returned from the curve was this formula: y=7.326788565 x / (x + 24.53756167)
If we back-fit our own data we get this:
4 threads = 1.027 => measured = 1
8 threads = 1.801 => measured = 1.803
12 threads = 2.405 => measured = 2.448
16 threads = 2.892 => measured = 2.854
The back fit is OK, but not utterly fantastic, as it under-estimates 12 threads and over-estimates 16 threads.
The 16 threads point has the highest variation, so any extrapolation from this wobbly point will have a relatively large uncertainty range.
Knowing this and proceeding into the unknown:
24 threads = 3.623
32 threads = 4.147
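
The back-fit and extrapolation above can be reproduced with a few lines of Python. This is only a sketch: the two coefficients are copied from the formula quoted above, the measured scores are the ones listed in this post, and the names `A`, `B` and `fitted_score` are hypothetical labels chosen for the example.

```python
# Coefficients of the two-parameter fit quoted above: y = A * x / (x + B)
A = 7.326788565
B = 24.53756167

# Measured relative scores from this thread (4 threads = 1).
MEASURED = {4: 1.0, 8: 1.803, 12: 2.448, 16: 2.854}


def fitted_score(threads):
    """Score predicted by the fitted curve for a given thread count."""
    return A * threads / (threads + B)


# Back-fit: compare the curve with the measured points.
for threads, score in MEASURED.items():
    fit = fitted_score(threads)
    print(f"{threads:2d} threads: fit {fit:.3f}, measured {score:.3f}, "
          f"residual {fit - score:+.3f}")

# Extrapolation: thread counts nobody in the thread has measured yet.
for threads in (24, 32):
    print(f"{threads:2d} threads: extrapolated {fitted_score(threads):.3f}")
```

Small last-digit differences from the hand-calculated numbers are possible. Note also that a curve of this shape levels off towards `A` (about 7.33) as the thread count grows, so any extrapolation inherits that assumption from the chosen formula rather than from the data.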

Given the original curve (pity that my software does not give me the fitting formula and that we need this online tool!) this seems a little bit over-estimated.
If I extend the curve with the mind's eye I cannot imagine it exceeding a score of 4 at 32 threads.
Anyway... let's be gentle this time and say that 32 threads scores 4.147 out of 8.
It's up to each person to decide whether such an efficiency is worth it, and perhaps I should have refrained from stating it's not justifiable to buy a 32-core, but given the results I think it was not completely unfair to do so. You buy 32, but get the performance of 16; that's what the data tells us thus far. In the end that's why I wanted to do this test.

As I said myself, I may still end up with a 24-core machine, if such a CPU is released soon. Or perhaps even 32 cores, but then at least I know what to expect and I can think about what else to do with the excess cores. Running 2 instances of TG, like Richard does, is not a bad idea at all!

In some way I regret this test as I feel a bit like I'm bringing bad or unwanted news here, while I have different intentions.
I just want to clarify and emphasize that my intention with this topic is gathering data to make an informed buying decision.
Not to criticize the software or open up a can of worms about its performance.
And definitely not to have its creator feel bad about it!
I genuinely think it's not that much better with other renderers. Just check the Vray benchmark database and perhaps more.
Multi-threading is challenging. Just look at genomic assembly, for instance: by itself that is not hugely complex (churning and chopping A's, C's, T's and G's into little bits and mapping them to a static reference genome), and yet multi-threading it is far from perfect. Let alone a renderer, which deals with so many dependencies in scene space and then chops that up into buckets, which need low-level data derived from scene space but also high-level data from neighbouring buckets for their own calculations.
It's vastly complicated and definitely a testament to Matt's achievement so far I think.

Cheers to that!

Oshyan

I understood the extrapolation from the beginning. I just think, in a world where 32 core machines (if not CPUs) are readily available, why extrapolate when we can just test? OK, yes, we do not have someone right here ready to test for us, but the only way to really know is to find such a person. So I'd rather focus on that than debate about the extrapolation because *only* that can give us the real data/answer. All the rest of this seems like neat statistics and math to me but... is still just extrapolation and needs testing anyway. The only time I'd really settle for relying on extrapolation is if we are trying to project what a non-existing CPU might be like, for example 64 core or even 128.

Also, while you may "only" get a 16-core equivalent CPU with a 32 core machine, the 16 core machine does not give you 16-core idealized performance! So while the relative efficiency may be less and less, with the 32 you are getting what you wanted from the 16. So OK, could be better, but 32 is still > 16. Of course it is a personal choice whether the price premium of 32 cores is worth it.

Personally I'd go for the 3950x given total system cost, but I knew that before this test. :D

- Oshyan

Matt

There's no need to feel bad about this or worry about how I'll feel about it, Martin. But thank you for mentioning that other software has problems like these and that it's difficult work. Thanks :) I welcome this kind of study, especially if it leads to discovering where we can make improvements. I now have a better idea of how inefficient it is.

I look forward to seeing more data if we can get it.
Just because milk is white doesn't mean that clouds are made of milk.

cyphyr

I think another point to bear in mind is RAM quantity and speed (actually I "think" speed is only relevant from a benchmark perspective and would not make much of a noticeable difference from a "real-world" rendering perspective).

The ONLY reason I am able to run two instances of Terragen on my system is that I have enough RAM (64Gb) to load all the assets twice (there are a LOT of assets in the scene).
I should probably do some tests with the benchmark scene when I get the chance.
The point being that the extra cores do not necessarily give you the extra rendering cycles if you don't have the RAM overhead to support it.

Also there is the fact that the RAM is now costing more than the CPU as a proportion of the system build. At a certain point the investment may as well go into a second (or third) PC.
www.richardfraservfx.com
https://www.facebook.com/RichardFraserVFX/
/|\

Ryzen 9 5950X OC@4Ghz, 64Gb (TG4 benchmark 4:13)