Parallel queue implementation

HandBrake for Windows support
nhyone
Bright Spark User
Posts: 252
Joined: Fri Jul 24, 2015 4:13 am

Re: Parallel queue implementation

Post by nhyone »

andrewk89 wrote: Mon Jan 20, 2020 5:33 pm The simulation software we use at work is reported to hit CPU starvation somewhere between 2-3 cores per memory channel. Beyond that, there isn't enough RAM bandwidth to keep the CPUs busy. A colleague recently bought a 48-core 384 GB workstation and it kind of runs like a turd (chipset gives 12 memory channels the way his RAM is installed). Unfortunately, he found this research only after making the purchase.
I would think the vendor should know the proper memory configuration? e.g. Intel Xeon Scalable Family Balanced Memory Configurations (from Lenovo).

In that example, the system can take 12 DIMMs. The good configurations:
2 DIMM, single 2-channel interleave set: 35%
4 DIMM, single 4-channel interleave set: 67%
6 DIMM, single 6-channel interleave set: 97%
12 DIMM: 100%


Maybe CPU starvation is due to something else, maybe cache thrashing or I/O?
andrewk89
Novice
Posts: 65
Joined: Thu Jun 13, 2013 4:29 pm

Re: Parallel queue implementation

Post by andrewk89 »

nhyone wrote: Wed Jan 22, 2020 4:25 am Maybe CPU starvation is due to something else, maybe cache thrashing or I/O?
Here is the reference he found on the topic.

384 GB = 12 x 32 GB, so all 12 channels are populated, i.e. 1 memory channel per 4 cores. He put a temp drive on a RAM disk to eliminate I/O; it didn't really help. If you believe the conclusion in the paper, the RAM disk is counter-productive, since there wasn't enough RAM bandwidth in the first place.
Umadevi
Posts: 7
Joined: Mon Jan 13, 2020 1:06 pm

Re: Parallel queue implementation

Post by Umadevi »

Dear all,
As we mentioned earlier, we were observing low CPU utilization on large machines with HandBrake. We have managed to bring up a stable version of our proposed parallel implementation of HandBrake, and the initial results are promising. We are currently testing the implementation on an AMD Threadripper 2990WX (64 threads), where the parallel version shows higher CPU and memory utilization than the sequential version.
Please find attached our average execution time (https://drive.google.com/file/d/14GKewj ... sp=sharing), CPU utilization (https://drive.google.com/file/d/1A_H4sU ... sp=sharing) and memory utilization (https://drive.google.com/file/d/1e1YV5v ... sp=sharing) charts for the serial, dynamic parallel and static parallel modes. In static parallel mode the user selects the number of jobs to run in parallel; the charts show results for 3, 4 and 5 parallel jobs (marked as parallel 3, parallel 4 and parallel 5 respectively). In dynamic parallel mode (marked simply as parallel in the charts), our online dynamic cost function framework decides how many jobs to run in parallel.
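As a rough illustration only, a dynamic mode of this kind can be thought of as a feedback loop that raises or lowers the job count from observed CPU utilization. A minimal sketch, assuming psutil is available and using made-up thresholds (this is not the actual cost function framework described above, which is not published here):

# Illustrative only: a toy heuristic that picks how many encode jobs to run
# concurrently from observed CPU utilization. Not the cost function framework
# described above; psutil and the thresholds are assumptions.
import psutil

def suggest_parallel_jobs(current_jobs, min_jobs=1, max_jobs=6):
    """Return a new concurrency level based on recent CPU utilization."""
    cpu = psutil.cpu_percent(interval=1.0)  # average utilization over 1 s
    if cpu < 70 and current_jobs < max_jobs:
        return current_jobs + 1   # CPU is under-used: start another job
    if cpu > 95 and current_jobs > min_jobs:
        return current_jobs - 1   # CPU is saturated: shrink the pool as jobs finish
    return current_jobs           # otherwise keep the current level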

The queues were randomly selected to cover as many use cases as possible.

We are still actively improving the parallel logic; in the meantime, it would be great if you could try the implementation and share your suggestions.
Our CLI build of parallel HandBrake is available here: https://drive.google.com/file/d/1nOr-vF ... sp=sharing
Sequential mode: HandBrakeCLI.exe --queue-import-file file.json (execution time in this mode is similar to traditional HandBrake)
Dynamic parallel mode: HandBrakeCLI.exe --queue-import-file file.json --parallel
Static parallel mode: HandBrakeCLI.exe --queue-import-file file.json --parallel=n (where n is the number of jobs to run in parallel)
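For anyone wondering what static parallel mode amounts to conceptually, here is a minimal sketch that keeps a fixed number of encodes running at once by driving the stock HandBrakeCLI from a process pool. The input files, preset and output naming are made-up examples, and this is not the modified build linked above:

# Illustrative sketch of "static parallel" behaviour with the stock HandBrakeCLI:
# run at most N_PARALLEL encodes at the same time. File names and the preset are
# examples, not part of the build linked above.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SOURCES = ["ep01.mkv", "ep02.mkv", "ep03.mkv", "ep04.mkv"]  # hypothetical inputs
N_PARALLEL = 3  # fixed job count, analogous to --parallel=3 above

def encode(src):
    out = src.rsplit(".", 1)[0] + ".m4v"
    # Standard HandBrakeCLI options: input, output, preset.
    return subprocess.run(
        ["HandBrakeCLI.exe", "-i", src, "-o", out, "--preset", "Fast 1080p30"],
        check=False,
    ).returncode

with ThreadPoolExecutor(max_workers=N_PARALLEL) as pool:
    exit_codes = list(pool.map(encode, SOURCES))
print(exit_codes)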
spaceharfang
Posts: 2
Joined: Sun Jun 07, 2020 8:35 pm

Re: Parallel queue implementation

Post by spaceharfang »

I can see the limitations of using fps as a metric over a larger set of encodes at different resolutions. 1 fps on a 4K encode certainly doesn't represent the same amount of work as 1 fps on an SD encode.
One way to bring every encode onto the same level would be to expand the concept of fps to "pixels per second". The global value of that metric would be far more representative of the work done, while requiring few changes to implement. A single 4K frame pushes 4 times more pixels than a 1080p frame, and so on. It would also adapt to custom resolutions.
There is still the issue of 8-bit color vs 10-bit HDR color, but the metric can also be expanded to "pixel bits per second".
Those can be scaled to be more readable in the UI (megapixels, gigapixels, etc.).
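A minimal sketch of how such a metric could be aggregated, assuming each encode's resolution, bit depth and current fps are known (all numbers below are made up):

# Toy calculation of the proposed "pixel bits per second" metric.
# Resolutions, bit depths and fps values are made-up examples.
encodes = [
    {"width": 3840, "height": 2160, "bit_depth": 10, "fps": 6.0},   # 4K, 10-bit
    {"width": 1920, "height": 1080, "bit_depth": 8,  "fps": 25.0},  # 1080p
    {"width": 720,  "height": 480,  "bit_depth": 8,  "fps": 90.0},  # SD
]

def pixel_bits_per_second(e):
    return e["width"] * e["height"] * e["bit_depth"] * e["fps"]

total = sum(pixel_bits_per_second(e) for e in encodes)
print(f"combined throughput: {total / 1e9:.2f} gigapixel-bits/s")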

Personally, I have a Ryzen 3 CPU with 8 hyperthreaded cores that could use this feature; a crazy-high core count isn't needed to benefit from it.
A rough estimate would be that it could handle:
1 4K encode at a time,
2 1080p encodes at a time,
3-4 SD encodes at a time.

Such a feature would also need a manual option to limit the number of cores used by the encodes via thread affinity, so some cores can be kept free for other uses. On my old Sandy Bridge CPU, I regularly had to remove 1 or 2 cores from HandBrake's affinity list to keep the system responsive, especially after the Spectre/Meltdown patches.
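For reference, restricting already-running HandBrake processes to a subset of cores can be scripted today; a minimal sketch using psutil, where the process names and the core list are just examples and sufficient privileges are assumed:

# Illustrative: pin running HandBrake processes to cores 0-5 so the remaining
# cores stay free for the rest of the system. Process names and the core list
# are examples; requires psutil (Windows/Linux).
import psutil

ALLOWED_CORES = [0, 1, 2, 3, 4, 5]

for proc in psutil.process_iter(["name"]):
    if proc.info["name"] in ("HandBrake.exe", "HandBrakeCLI.exe"):
        proc.cpu_affinity(ALLOWED_CORES)            # set affinity
        print(proc.pid, "->", proc.cpu_affinity())  # read it back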
spaceharfang
Posts: 2
Joined: Sun Jun 07, 2020 8:35 pm

Re: Parallel queue implementation

Post by spaceharfang »

It would definitely be interesting to me, since I tend to queue a wide range of resolutions. Being able to change the number of encodes on the fly would be really useful.