
Re: Parallel queue implementation

Posted: Wed Jan 22, 2020 4:25 am
by nhyone
andrewk89 wrote: Mon Jan 20, 2020 5:33 pm The simulation software we use at work is reported to hit CPU starvation somewhere between 2-3 cores per memory channel. Beyond that, there isn't enough RAM bandwidth to keep the CPUs busy. A colleague recently bought a 48-core 384 GB workstation and it kind of runs like a turd (chipset gives 12 memory channels the way his RAM is installed). Unfortunately, he found this research only after making the purchase.
I would think the vendor should know the proper memory configuration? e.g. the Intel Xeon Scalable Family Balanced Memory Configurations guide from Lenovo.

In that example, the system can take 12 DIMMs. The good configurations (relative memory bandwidth):
2 DIMMs, single 2-channel interleave set: 35%
4 DIMMs, single 4-channel interleave set: 67%
6 DIMMs, single 6-channel interleave set: 97%
12 DIMMs: 100%


Maybe the CPU starvation is due to something else, such as cache thrashing or I/O?

Re: Parallel queue implementation

Posted: Wed Jan 22, 2020 5:31 pm
by andrewk89
nhyone wrote: Wed Jan 22, 2020 4:25 am Maybe CPU starvation is due to something else, maybe cache thrashing or I/O?
Here is the reference he found on the topic.

384 GB = 12 x 32 GB, so all 12 channels are populated, i.e. 1 memory channel per 4 cores. Putting the temp directory on a RAM disk to eliminate I/O didn't really help. If you believe the conclusion given in the paper, the RAM drive is counter-productive since there wasn't enough RAM bandwidth in the first place.
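
To make the arithmetic explicit, here is a back-of-the-envelope sketch (Python purely for illustration, numbers taken from the posts above, not from the paper):

cores = 48
memory_channels = 12               # 12 x 32 GB, every channel populated
cores_per_channel = cores / memory_channels
print(cores_per_channel)           # 4.0 -> past the 2-3 cores per channel rule of thumb,
                                   # so the workload is expected to be bandwidth-bound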

Re: Parallel queue implementation

Posted: Fri Mar 06, 2020 1:36 pm
by Umadevi
Dear all,
As we mentioned earlier, we were observing low CPU utilization on large machines with HandBrake. We have managed to bring up a stable version of our proposed parallel implementation of HandBrake, and the initial results are promising. We are currently testing our implementation on an AMD Threadripper 2990WX (64 threads), and our parallel versions show higher CPU and memory utilization compared to the sequential version.
Please find our average execution time (https://drive.google.com/file/d/14GKewj ... sp=sharing), CPU utilization (https://drive.google.com/file/d/1A_H4sU ... sp=sharing) and memory utilization (https://drive.google.com/file/d/1e1YV5v ... sp=sharing) charts for the serial, dynamic parallel and static parallel modes. In static parallel mode the user selects the number of jobs to run in parallel; the charts show results for 3, 4 and 5 parallel jobs (marked as parallel 3, parallel 4 and parallel 5 respectively). In dynamic parallel mode (marked as just parallel in the charts), our online dynamic cost-function framework decides how many jobs to run in parallel.

The queues have been randomly selected to cover as many use cases as possible.

We are still actively working to improve our parallel logic; in the meantime, it would be great if you could try our parallel implementation and share your suggestions.
Please find our CLI version of parallel HandBrake here: https://drive.google.com/file/d/1nOr-vF ... sp=sharing
Sequential mode: HandBrakeCLI.exe --queue-import-file file.json (the execution time in this mode is similar to traditional HandBrake)
Dynamic parallel mode: HandBrakeCLI.exe --queue-import-file file.json --parallel
Static parallel mode: HandBrakeCLI.exe --queue-import-file file.json --parallel=n (where n is the number of jobs to run in parallel)
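
In case anyone wants to script a quick comparison of the three modes, a minimal sketch in Python (assuming the CLI build above is on your PATH and file.json is an exported HandBrake queue):

import subprocess

QUEUE = "file.json"   # an exported HandBrake queue

# Sequential mode: behaves like traditional HandBrake
subprocess.run(["HandBrakeCLI.exe", "--queue-import-file", QUEUE], check=True)

# Dynamic parallel mode: the cost-function framework picks the job count
subprocess.run(["HandBrakeCLI.exe", "--queue-import-file", QUEUE, "--parallel"], check=True)

# Static parallel mode: run exactly 4 jobs at a time
subprocess.run(["HandBrakeCLI.exe", "--queue-import-file", QUEUE, "--parallel=4"], check=True)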

Re: Parallel queue implementation

Posted: Sun Jun 07, 2020 8:58 pm
by spaceharfang
I can see the limitations of using fps as a metric over a larger set of encodes of different resolutions: 1 fps on a 4K encode certainly doesn't have the same value as 1 fps on an SD encode.
One way to bring every encode to the same level would be to expand the concept of fps to "pixels per second". The global value of that metric would be much more representative of the work done, while requiring few changes to implement. A single 4K frame pushes 4 times more pixels than a 1080p frame, and so on. It would also adapt to custom resolutions.
There is still the issue of 8-bit color vs 10-bit HDR color, but that can in turn be expanded to "pixel bits per second".
Those values can be scaled to be more readable in the UI (megapixels, gigapixels, etc.).
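
To make that concrete, a rough sketch of the metric (the function name and numbers are just illustrative, not an actual HandBrake API):

def pixel_bits_per_second(width, height, bit_depth, fps):
    # Throughput for one encode, measured in pixel-bits per second.
    return width * height * bit_depth * fps

# 1 fps of 4K 10-bit vs 1 fps of SD 8-bit:
uhd = pixel_bits_per_second(3840, 2160, 10, 1.0)   # ~83 million pixel-bits/s
sd = pixel_bits_per_second(720, 480, 8, 1.0)       # ~2.8 million pixel-bits/s

# The global figure across concurrent encodes is just the sum, which the UI
# could display as megapixel-bits or gigapixel-bits per second.
total = uhd + sd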

Personally, I have a 3rd-gen Ryzen CPU with 8 SMT cores that could use this feature. A crazy high core count CPU is not needed to benefit from it.
A rough estimate would be that it could handle:
1 4K encode at a time,
2 1080p encodes at a time,
3-4 SD encodes at a time.

Such a feature would also need a manual option to limit the number of cores used by the encodes via thread affinity. That would free up some cores for other uses. On my old Sandy Bridge CPU, I regularly had to remove 1 or 2 cores from HandBrake's affinity list to keep the system responsive, especially after the Spectre/Meltdown patches.
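
For what it's worth, a minimal sketch of doing that externally today (assuming the psutil Python package is installed; the core list is just an example):

import psutil

# Launch HandBrakeCLI and pin it to the first six logical cores,
# leaving the remaining cores free for the rest of the system.
proc = psutil.Popen(["HandBrakeCLI.exe", "--queue-import-file", "file.json"])
proc.cpu_affinity([0, 1, 2, 3, 4, 5])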

Re: Parallel queue implementation

Posted: Mon Jun 08, 2020 11:18 pm
by spaceharfang
It would definitely be interesting to me since I tend to queue a wide range of resolutions. Being able to change the number of encodes on the fly would be really useful.