Parallel queue implementation

HandBrake for Windows support
Forum rules
An Activity Log is required for support requests. Please read How-to get an activity log? for details on how and why this should be provided.
Umadevi
Posts: 7
Joined: Mon Jan 13, 2020 1:06 pm

Parallel queue implementation

Post by Umadevi »

Hi,

We are running the HandBrake application on a machine with a large number of cores, where we found that the cores are not fully utilized. So we thought of adding a feature that allows multiple videos to be transcoded in parallel from a single instance of HandBrake (a parallel queue implementation).
With this feature we can make use of all available cores (for the videos in the queue), particularly on machines with a large number of cores.

We would like to know if something of this sort is already under development, and whether we can collaborate with you to make this a feature in HandBrake.

rollin_eng
Veteran User
Posts: 3469
Joined: Wed May 04, 2011 11:06 pm

Re: Parallel queue implementation

Post by rollin_eng »

You can already run multiple instances of HB to accomplish something similar.

User avatar
BradleyS
Moderator
Posts: 1859
Joined: Thu Aug 09, 2007 12:16 pm

Re: Parallel queue implementation

Post by BradleyS »

We have an open issue for this: https://github.com/HandBrake/HandBrake/issues/1445

I believe this is being worked on for Windows at the moment.

User avatar
s55
HandBrake Team
Posts: 9785
Joined: Sun Dec 24, 2006 1:05 pm

Re: Parallel queue implementation

Post by s55 »

Per @rollin_eng's comment -> Yes, you can just run multiple instances. They each have a separate queue. I'm not sure how many instances you can run before memory bandwidth becomes the limiting factor.

I am working on a number of re-architecture projects that make this a possibility in a single instance with a single queue. The UI was never designed with this in mind, so I'm re-working a number of areas currently to make it less of a hack job.

DrXenos
Bright Spark User
Posts: 233
Joined: Sat Mar 16, 2013 1:19 pm

Re: Parallel queue implementation

Post by DrXenos »

What I've done to handle simultaneous transcodes is create my own tool that processes a queue (generated by another tool) for the command-line HandBrake, plus other tools I need such as mkvmerge or ffmpeg, or even a queue generated by the Windows GUI HandBrake. It will run whatever number of simultaneous transcodes I tell it to. Each opens its own output window, so I can watch what each one is doing. It's not very polished looking, but it serves my needs.

Umadevi
Posts: 7
Joined: Mon Jan 13, 2020 1:06 pm

Re: Parallel queue implementation

Post by Umadevi »

Thanks for your replies.

As @s55 pointed out, there is a limit beyond which we see no advantage in launching new instances of HandBrake.
Hence, we propose launching a single instance of HandBrake that runs (transcoder) jobs in parallel to get better performance.
With a single instance we can dynamically add or remove (pause) jobs running in parallel for better CPU utilization.

@s55 we have a static parallel queue working in a single instance from the command line via a "--parallel" flag right now, but GUI work is certainly still needed. By "static" we mean the user chooses how many parallel encodes to launch.
The main functionality we are hoping to add is dynamic adjustment of parallelism. Dynamic adjustment is important for us because a queue may contain 20-30 different formats and resolutions, so resource utilization changes drastically as the queue is processed. The current idea is to adjust how many encodes run simultaneously by monitoring the global FPS throughput of the queue. Every X seconds we try to launch an additional encode from the queue: if global FPS goes up, we continue increasing the number of parallel encodes; if global FPS goes down, we reduce (pause) them. This avoids having to determine in advance how many parallel encodes a system can handle, by implicitly utilizing as much of the system as possible (even if the limit is disk, memory, or CPU) while the queue is processed. It's a simple algorithm that still processes the queue in order and just tries to maximize immediate FPS throughput; more optimizations that re-order the queue could be added in the future.
This also uses the same single JSON file supported by the HandBrake GUI via the queue option. We are trying to implement this as an OS-agnostic parallelization framework that runs on Windows and Linux.
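A minimal sketch of the heuristic described above. Everything here is illustrative, not actual HandBrake code: `SimJob` is a hypothetical stand-in for a HandBrakeCLI worker process, and `settle` stands in for the "wait X seconds, then re-read FPS" step.

```python
class SimJob:
    """Hypothetical stand-in for one encode process; the real
    integration (spawning HandBrakeCLI, parsing its FPS output)
    is not shown."""
    def __init__(self, solo_fps):
        self.solo_fps = solo_fps  # FPS when running alone
        self.fps = 0.0            # current measured FPS
        self.paused = False

    def start(self):
        self.fps = self.solo_fps

    def pause(self):
        self.paused = True


def global_fps(jobs):
    """Global throughput: sum of FPS over all live (unpaused) jobs."""
    return sum(j.fps for j in jobs if not j.paused)


def try_add(running, queue, settle):
    """One step of the heuristic: launch the next queued encode and
    keep it only if global FPS improved; otherwise pause it."""
    baseline = global_fps(running)
    job = queue.pop(0)
    job.start()
    running.append(job)
    settle(running)  # in the real tool: sleep X seconds, re-read FPS
    if global_fps(running) < baseline:
        job.pause()  # back off; the job can be resumed later
    return job
```

For example, if an SD encode at 100 fps drops to 50 fps when a 4K encode at 5 fps joins, global FPS falls from 100 to 55 and the heuristic pauses the new encode; in the reverse order (4K at 10 fps, SD joining at 50 fps while the 4K drops to 5 fps), global FPS rises from 10 to 55 and the new encode is kept.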

rollin_eng
Veteran User
Posts: 3469
Joined: Wed May 04, 2011 11:06 pm

Re: Parallel queue implementation

Post by rollin_eng »

I’m not sure global FPS would be a good metric if you have different encode sources and settings.

Ideally you would want to monitor CPU/Memory/Disk access then once one of those max out you will be at your limit.
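As a sketch of this approach (the thresholds are illustrative, not HandBrake settings; the actual sampling could come from a library such as psutil, which is omitted here so the decision logic stays OS-agnostic):

```python
# Percent-utilisation thresholds at which a resource counts as saturated.
SATURATION = {"cpu": 90.0, "mem": 90.0, "disk": 90.0}

def bottleneck(samples):
    """Return the first resource at/above its threshold, or None.
    `samples` maps resource name -> utilisation percentage."""
    for name, limit in SATURATION.items():
        if samples.get(name, 0.0) >= limit:
            return name
    return None

def may_add_encode(samples):
    """Only launch another parallel encode while no resource is maxed out."""
    return bottleneck(samples) is None
```

Once `bottleneck()` reports a saturated resource, the controller stops launching encodes; that resource is the system's limit for the current mix of jobs.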

DrXenos
Bright Spark User
Posts: 233
Joined: Sat Mar 16, 2013 1:19 pm

Re: Parallel queue implementation

Post by DrXenos »

rollin_eng wrote:
Sat Jan 18, 2020 9:03 am
I’m not sure global FPS would be a good metric if you have different encode sources and settings.

Ideally you would want to monitor CPU/Memory/Disk access then once one of those max out you will be at your limit.
I absolutely LOVE your idea of monitoring the bounding resources. That's very intelligent.

nhyone
Bright Spark User
Posts: 249
Joined: Fri Jul 24, 2015 4:13 am

Re: Parallel queue implementation

Post by nhyone »

How many cores are we talking about? Suppose it is a 40-core machine and we want to allocate an average of 8 cores per encode; that's just 5 concurrent jobs. Just run 5 instances?

User avatar
s55
HandBrake Team
Posts: 9785
Joined: Sun Dec 24, 2006 1:05 pm

Re: Parallel queue implementation

Post by s55 »

@nhyone -> Don't forget you need to consider memory bandwidth (not just quantity). Under certain conditions, this could be a severe bottleneck.

DrXenos
Bright Spark User
Posts: 233
Joined: Sat Mar 16, 2013 1:19 pm

Re: Parallel queue implementation

Post by DrXenos »

You'll also want to implement hysteresis, so you're not constantly pausing/unpausing a job when you hit some limit.
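A sketch of that hysteresis: two thresholds with a dead band between them, so a job near a single limit is not paused and resumed repeatedly. The thresholds and the metric are illustrative, not HandBrake settings.

```python
class HysteresisController:
    """Pause above one threshold, resume only below a lower one."""
    def __init__(self, pause_above=95.0, resume_below=80.0):
        assert resume_below < pause_above  # the gap is the dead band
        self.pause_above = pause_above
        self.resume_below = resume_below
        self.paused = False

    def update(self, utilisation):
        """Return 'pause', 'resume', or None for a utilisation sample."""
        if not self.paused and utilisation > self.pause_above:
            self.paused = True
            return "pause"
        if self.paused and utilisation < self.resume_below:
            self.paused = False
            return "resume"
        return None  # inside the dead band: do nothing
```

A sample of 90% triggers nothing in either state, which is exactly what prevents the pause/unpause loop around a single cutoff.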

nhyone
Bright Spark User
Posts: 249
Joined: Fri Jul 24, 2015 4:13 am

Re: Parallel queue implementation

Post by nhyone »

s55 wrote:
Sun Jan 19, 2020 11:32 am
@nhyone -> Don't forget you need to consider memory bandwidth (not just quantity). Under certain conditions, this could be a severe bottleneck.
It's worth testing if this is a bottleneck. I've always assumed we'll hit CPU limit first. :D

In any case, I ran mbw 2048 on two systems:

2-CPU w/ 32 GB (Intel Xeon E5-2670 v2 @ 2.50GHz):

Code:

AVG     Method: MEMCPY  Elapsed: 0.58332        MiB: 2048.00000 Copy: 3510.951 MiB/s
AVG     Method: DUMB    Elapsed: 0.44924        MiB: 2048.00000 Copy: 4558.829 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.33086        MiB: 2048.00000 Copy: 6189.843 MiB/s
2-CPU w/ 64 GB (Intel Xeon E5-2690 v4 @ 2.60GHz):

Code:

AVG     Method: MEMCPY  Elapsed: 0.59372        MiB: 2048.00000 Copy: 3449.435 MiB/s
AVG     Method: DUMB    Elapsed: 0.35582        MiB: 2048.00000 Copy: 5755.789 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.41009        MiB: 2048.00000 Copy: 4994.082 MiB/s
If the numbers are right, there should be at least 3 GB/s of bandwidth.
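As a rough sanity check of those numbers (assumptions: 8-bit 4:2:0 frames at 1.5 bytes per pixel, and that each frame is moved once per pass; real encoders touch frames several times, so this is only a lower bound on actual memory traffic):

```python
def frame_bytes(width, height, bytes_per_pixel=1.5):
    """Size of one uncompressed 8-bit 4:2:0 frame in bytes."""
    return width * height * bytes_per_pixel

def raw_traffic_mib_s(width, height, fps, passes=1):
    """MiB/s of raw pixel data moved at a given encode speed."""
    return frame_bytes(width, height) * fps * passes / (1024 ** 2)
```

One 1080p encode at 100 fps moves roughly 297 MiB/s of pixel data per pass, well under the ~3 GiB/s measured above; that suggests raw frame traffic alone is unlikely to be the first limit, though encoder-internal memory traffic (motion search, reference frames) is considerably larger.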

Umadevi
Posts: 7
Joined: Mon Jan 13, 2020 1:06 pm

Re: Parallel queue implementation

Post by Umadevi »

The cost function we mention is independent of the parallelization implementation and can evolve in the future. We are considering FPS as an initial cost function. We expect the overall FPS (the sum of the FPS of all live jobs) to be affected by the CPU utilization limit and the memory limit, so to start with we chose FPS as the cost function.
DrXenos wrote:
Sun Jan 19, 2020 2:15 pm
You'll also want to implement hysteresis, so you're not constantly pausing/unpausing a job when you hit some limit.
As mentioned, we check the cost function every X seconds and record the impact of our changes on it. Based on the cost function value, we decide whether to pause, resume, or create a new handle. Right now, we ensure that a job is paused at most once (we pause/resume the last launched job, and once resumed we allow it to finish without further interruptions).
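The pause-at-most-once bookkeeping might look like this (a sketch; the names are illustrative, not the actual implementation):

```python
class JobState:
    """Per-job flags for the pause-at-most-once policy."""
    def __init__(self):
        self.pause_count = 0
        self.paused = False

def maybe_pause(job):
    """Pause only if this job has never been paused before."""
    if job.pause_count == 0 and not job.paused:
        job.paused = True
        job.pause_count += 1
        return True
    return False

def resume(job):
    """After resuming, maybe_pause() will always refuse, so the job
    runs to completion without further interruption."""
    job.paused = False
```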

rollin_eng
Veteran User
Posts: 3469
Joined: Wed May 04, 2011 11:06 pm

Re: Parallel queue implementation

Post by rollin_eng »

The problem with using FPS as your metric is that different sources and settings will produce vastly different FPS.

If you have a SD video encoding at 100 FPS and then start a 4K encode at 5 FPS your SD might drop to 50 FPS giving you a total of 55 FPS. This is a reduction from your original 100 FPS.

But if you reverse this and have a 4K encode running at 10 FPS and start a SD encode at 50 FPS your 4K encode might drop to 5 FPS giving you a total of 55 FPS. This is an increase from your original 10 FPS.

Umadevi
Posts: 7
Joined: Mon Jan 13, 2020 1:06 pm

Re: Parallel queue implementation

Post by Umadevi »

rollin_eng wrote:
Mon Jan 20, 2020 10:34 am
The problem with using FPS as your metric is that different sources and settings will produce vastly different FPS.

If you have a SD video encoding at 100 FPS and then start a 4K encode at 5 FPS your SD might drop to 50 FPS giving you a total of 55 FPS. This is a reduction from your original 100 FPS.

But if you reverse this and have a 4K encode running at 10 FPS and start a SD encode at 50 FPS your 4K encode might drop to 5 FPS giving you a total of 55 FPS. This is an increase from your original 10 FPS.
It is a great point, and right now these cases are handled differently because we are only considering FPS, which should not be the case eventually. However, the primary objective of parallelization is to increase throughput on CPUs with a large number of cores (for example, CPUs with more than 32 cores). Right now, all the parallelism techniques mentioned above (multiple HandBrake instances, wrapper tools) delegate a lot of the load-balancing and thread-creation work to the user, which we want to avoid.

We introduce the `--parallel` flag (in the CLI) to enable parallelism. If the flag is not set, HandBrake follows the traditional sequential workflow.

rollin_eng
Veteran User
Posts: 3469
Joined: Wed May 04, 2011 11:06 pm

Re: Parallel queue implementation

Post by rollin_eng »

Nothing wrong with trying to max out your computer's usage; it's just that using FPS as your metric is probably not the best way to do this.

andrewk89
Regular User
Posts: 59
Joined: Thu Jun 13, 2013 4:29 pm

Re: Parallel queue implementation

Post by andrewk89 »

nhyone wrote:
Mon Jan 20, 2020 12:39 am
It's worth testing if this is a bottleneck. I've always assumed we'll hit CPU limit first. :D
Tests would be interesting.

The simulation software we use at work is reported to hit CPU starvation somewhere between 2-3 cores per memory channel. Beyond that, there isn't enough RAM bandwidth to keep the CPUs busy. A colleague recently bought a 48-core 384 GB workstation and it kind of runs like a turd (chipset gives 12 memory channels the way his RAM is installed). Unfortunately, he found this research only after making the purchase.

That got me thinking: if simulation tools run up against RAM bandwidth at relatively modest core counts, what kinds of workloads actually benefit from high core counts?

Umadevi
Posts: 7
Joined: Mon Jan 13, 2020 1:06 pm

Re: Parallel queue implementation

Post by Umadevi »

rollin_eng wrote:
Sat Jan 18, 2020 9:03 am
I’m not sure global FPS would be a good metric if you have different encode sources and settings.

Ideally you would want to monitor CPU/Memory/Disk access then once one of those max out you will be at your limit.
We also had the idea of monitoring various system metrics at first, but (1) determining which resource is actually limiting is much more complicated than just monitoring usage through the OS, and (2) designing this for every OS might be complicated. But I agree that knowing where your limit is and working with it would be better. However, the issue with different encode settings is still present; e.g. if one encode is much more disk-bound and another is CPU-bound, running the disk-bound one may still be fine even if the CPU is almost maxed out.
rollin_eng wrote:
Mon Jan 20, 2020 10:34 am
The problem with using FPS as your metric is that different sources and settings will produce vastly different FPS.

If you have a SD video encoding at 100 FPS and then start a 4K encode at 5 FPS your SD might drop to 50 FPS giving you a total of 55 FPS. This is a reduction from your original 100 FPS.

But if you reverse this and have a 4K encode running at 10 FPS and start a SD encode at 50 FPS your 4K encode might drop to 5 FPS giving you a total of 55 FPS. This is an increase from your original 10 FPS.
Using global FPS is certainly not the ideal metric, but it is a metric that tries to capture the system limits implicitly as much as possible, since those limits are very difficult to measure accurately or to predict for a specific encode. And from a global perspective, the goal of the queue is to process all of the frames, so the rate of frame processing is at least a first-order approximation of the final goal.

I understand the example where global FPS increases or decreases depending on the order in which encodes are launched but ends at 55 FPS either way -- however, note that in this scenario parallelization isn't helping at all, so it doesn't matter in what order things happen. I.e., if the speeds of both encodes are cut exactly in half by running them in parallel, then you could (1) run both serially, (2) start the 4K one first and the SD one second, or (3) start the SD one first and the 4K one second, and the queue would finish at about the same time in all cases. The point is that in this scenario it didn't matter what you did, so the FPS metric doesn't hurt.

Since the goal of parallelization is to take advantage of untapped resources, it is not a zero-sum game. Whichever way it happened, if overall FPS didn't increase, you may not be taking advantage of more resources, and you could be causing resource contention in some part of the system (bandwidth, disk, CPU, cache, etc.), so it is better not to start that last encode (be it the SD or the 4K one in the previous example) until the previous encode has finished.

The FPS metric is definitely not optimal, but the goals of this (rather simple) heuristic are to (1) help most of the time, and (2) not be slower than the default serial case -- which could happen in a memory/disk-thrashing scenario with too many parallel encodes.

Consider a similar scenario where your SD encode is going at 100 fps, and adding the 4K encode causes the SD to run at 75 fps and the 4K to run at 5 fps, for a total of 80 fps. Should we add the extra encode? The answer is that we don't know, so we do nothing. Doing nothing is okay, since we're not making things any worse than the default serial processing would have.

And in the reverse scenario, where your 4K encode is running at 8 fps and adding an SD encode makes the SD run at 75 fps while slowing the 4K down to 5 fps, again for a total of 80 fps -- should we add the extra encode? Yes, because we know more overall frames are being processed, so the queue is likely to finish faster.
DrXenos wrote:
Sun Jan 19, 2020 2:15 pm
You'll also want to implement hysteresis, so you're not constantly pausing/unpausing a job when you hit some limit.
Definitely! It should have some mechanism to prevent a pausing/unpausing loop.

Umadevi
Posts: 7
Joined: Mon Jan 13, 2020 1:06 pm

Re: Parallel queue implementation

Post by Umadevi »

On further consideration, FPS might work, but as we mentioned, there are scenarios where you are not sure whether it is helping (e.g. an SD encode at 100 fps where adding a 4K encode decreases total FPS), so you might end up leaving a lot of performance on the table.

I think the issue is that the FPS measure does not fully capture the throughput of the queue. An alternative would be to measure pixels/sec or MB/sec. These metrics are encoder- and preset-invariant, so it will not matter that an SD encode runs at 100 fps and a 4K encode at 5 fps: if starting another encode lets you push more pixels/sec or MB/sec, it should always be better to start it, and the opposite should also hold.
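A sketch of that metric (pure arithmetic; the job tuples are illustrative):

```python
def pixels_per_sec(jobs):
    """Queue throughput in pixels/second.
    jobs: iterable of (width, height, fps) tuples for live encodes."""
    return sum(w * h * fps for (w, h, fps) in jobs)
```

Applying it to the earlier FPS example (assuming SD means 720x480 and 4K means 3840x2160): the SD encode alone at 100 fps yields about 34.6 Mpx/s, while SD at 50 fps plus 4K at 5 fps yields about 58.8 Mpx/s. The pixel metric correctly reports that the parallel pair is doing more work, even though raw FPS fell from 100 to 55.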

nhyone
Bright Spark User
Posts: 249
Joined: Fri Jul 24, 2015 4:13 am

Re: Parallel queue implementation

Post by nhyone »

I'm still skeptical about how scalable this is.

I did an experiment on a 40-core machine (of which 20 cores are HT). Out of the box, the slow preset on a 1080p video consumed 30% CPU. This means three concurrent encodes will saturate the CPU.

A small tweak of the x265 options increased its efficiency, consuming 60% CPU. (*) That allows just two videos to be encoded concurrently.

(*) CPU usage is not linear on HyperThreaded CPUs. HT doubles the number of logical cores but not actual CPU performance -- the gain is in the region of 15% for Haswell. So 50% reported usage really means about 85% of real capacity.
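That footnote can be expressed as a small model (assuming a uniform ~15% HT throughput gain, the Haswell figure quoted; real scaling varies by workload):

```python
def effective_load(reported, ht_gain=0.15):
    """Map reported usage (0..1 over logical cores) to the fraction
    of the machine's real throughput being used."""
    if reported <= 0.5:
        # only physical cores busy: each logical-core unit is a full core
        return 2 * reported / (1 + ht_gain)
    # physical cores saturated; extra logical usage adds only ht_gain
    return (1 + (2 * reported - 1) * ht_gain) / (1 + ht_gain)
```

Under this model, 50% reported usage corresponds to about 87% of real throughput, in the same ballpark as the 85% quoted above.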

To scale x265 itself, the video would need to be broken up into small segments (preferably at I-frames) and encoded in a distributed fashion.

DrXenos
Bright Spark User
Posts: 233
Joined: Sat Mar 16, 2013 1:19 pm

Re: Parallel queue implementation

Post by DrXenos »

Forgive my naivete, as encoding/transcoding software is not my area of expertise, but is all this complexity really necessary? Wouldn't it be easier to let the user set the number of simultaneous transcodes? Wouldn't a given set of hardware typically be able to handle N transcodes at a time? I realize it would fluctuate with factors such as resolution or bitrate, but would that really be enough to dramatically change the value of N on average?

Just curious. No offense meant.

demonsavatar
Posts: 2
Joined: Mon Jan 20, 2020 11:23 pm

Re: Parallel queue implementation

Post by demonsavatar »

DrXenos wrote:
Tue Jan 21, 2020 6:10 pm
Forgive my naivete, as encoding/transcoding software is not my area of expertise, but is all this complexity really necessary? Wouldn't it be easier to let the user set the number of simultaneous transcodes? Wouldn't a given set of hardware typically be able to handle N transcodes at a time? I realize it would fluctuate with factors such as resolution or bitrate, but would that really be enough to dramatically change the value of N on average?

Just curious. No offense meant.
You can certainly find N by experimentation if you typically do similar kinds of encodes. The goal here seems to be an automatic "parallel" setting so the user doesn't need to think about how many parallel encodes their system can handle -- especially if different presets and formats are queued up; e.g. CPU usage is very different for x265, x264, and VP9.

DrXenos
Bright Spark User
Posts: 233
Joined: Sat Mar 16, 2013 1:19 pm

Re: Parallel queue implementation

Post by DrXenos »

I realize that. I'm just wondering how widely the parallelism would swing.

demonsavatar
Posts: 2
Joined: Mon Jan 20, 2020 11:23 pm

Re: Parallel queue implementation

Post by demonsavatar »

On a given machine, probably not that much if you're often targeting the same formats and resolutions. But if I could just check an "enable parallel processing" box and not have to worry about finding the right number of encodes to launch for my 4-core laptop, 8-core desktop, or 32-core Threadripper workstation, I would welcome that :)

DrXenos
Bright Spark User
Posts: 233
Joined: Sat Mar 16, 2013 1:19 pm

Re: Parallel queue implementation

Post by DrXenos »

Yeah, I get it. It's a very interesting problem.

Post Reply