NVENC versus CUDA and Hardware Encoding Performance

Discuss encoding for devices and presets.
Forum rules
An Activity Log is required for support requests. Please read How-to get an activity log? for details on how and why this should be provided.
metaldave
Posts: 36
Joined: Mon Apr 10, 2017 6:40 pm

NVENC versus CUDA and Hardware Encoding Performance

Post by metaldave »

Hey, Friends.

I just came across a thread from a few months back where someone asked about creating a database of performance benchmarks for different CUDA iterations of NVIDIA hardware. While the OP's premise was flawed, it was an interesting conversation for a couple of reasons. First, I have shared in the confusion between CUDA cores and the NVENC hardware. Second, I would indeed like to know whether I've got the best combination of equipment for offline encoding performance with HandBrake, FFmpeg, etc.

Through my research, I came to the realization that the GPU includes the encoder and decoder as modules separate from the CUDA cores. In that discussion, it was mentioned that the encoder and decoder are separate, on-chip integrated ASICs. I had no idea they were separate ASICs, but that makes sense. (If there's a source that describes this, I'd love to see it.)

NVIDIA has some great articles on their developer site, and their NVIDIA FFmpeg Transcoding Guide provides an overview of the encoding and decoding hardware integrated into the GPU. It also reviews the relevant FFmpeg options for leveraging the NVENC and NVDEC functionality. It's a great primer for understanding the capabilities of NVIDIA GPUs.
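For anyone following along, the guide's end-to-end example boils down to something like the sketch below. This is not HandBrake's pipeline; the file names and bitrate are placeholders, and it assumes an FFmpeg build compiled with NVENC/NVDEC support.

```shell
# Full-hardware transcode sketch per NVIDIA's FFmpeg guide: decode on
# NVDEC, keep frames in GPU memory, encode on NVENC. Assumes an NVIDIA
# GPU and an FFmpeg build with NVENC/NVDEC enabled; names are placeholders.
ffmpeg -y -vsync 0 \
  -hwaccel cuda -hwaccel_output_format cuda \
  -i input.mp4 \
  -c:v h264_nvenc -preset slow -b:v 5M \
  -c:a copy \
  output.mp4
```

`-hwaccel_output_format cuda` is what keeps decoded frames in GPU memory rather than copying them back to system RAM between decode and encode.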

I have put together a dedicated encoding workstation with a GTX 1650 Super (a Turing-generation GPU). With my new understanding of the integrated encoder, it seems that all Turing-generation cards use the same NVENC ASIC. This means my entry-level GTX 1650 Super has the same encoding capability as any of the higher-end GPUs with more CUDA cores. There is some comfort in knowing I didn't short myself by spending as little as possible.

I would like to ensure that I am optimizing the encoding process as well as possible. From a HandBrake perspective, it would be good to know how to best take advantage of the Turing NVENC encoder. Should I be using some of the available command-line options? What do the presets do, by contrast? The HandBrake documentation is always a work in progress (which I would gladly contribute to myself), but there must be some technical documentation that would help guide the optimization.
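For the command-line side, something like this is the shape I have in mind (a sketch; paths are placeholders, and the exact encoder names and presets available depend on your HandBrake build):

```shell
# Hypothetical HandBrakeCLI invocation using the NVENC HEVC encoder.
# Check `HandBrakeCLI --help` for the options your build supports.
HandBrakeCLI --input input.mkv --output output.mkv \
  --encoder nvenc_h265 \
  --quality 24

# List the NVENC presets this build exposes:
HandBrakeCLI --encoder-preset-list nvenc_h265
```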

As an aside, I know there is a strident camp that prefers software encoding with x264 and x265, but I believe the Turing generation offers quality and speed that were not previously available. With the right combination of settings, I'm sure comparable results can be achieved more efficiently (with at least a huge time savings). That's why I put together my encoding workstation, and I believe the time savings will be as worthwhile as the power savings I'll gain from a purpose-driven build.

This is meant more as a discussion for sharing resources to help guide others down the path of hardware encoding.

Thanks,

- Dave
musicvid
Veteran User
Posts: 3753
Joined: Sat Jun 27, 2009 1:19 am

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by musicvid »

Your choice of a 1650 is a good one, according to at least two engineers I know.
mduell
Veteran User
Posts: 7257
Joined: Sat Apr 21, 2007 8:54 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by mduell »

CUDA encoding has been dead and gone for many generations; it never worked well.
NVDEC/NVENC is on the main GPU ASIC, not a separate ASIC.

The NVENC documentation is a bit garbage, but the available options don't result in much variation anyway. It's great for quick throwaway encodes and realtime captures; it's not worthwhile for archival material.
Rodeo
HandBrake Team
Posts: 12617
Joined: Tue Mar 03, 2009 8:55 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by Rodeo »

It's still a bit confusing for end users because NVIDIA's API for NVENC does use CUDA (even though the encoding is still done on the dedicated hardware rather than the GPU itself).
metaldave
Posts: 36
Joined: Mon Apr 10, 2017 6:40 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by metaldave »

I really appreciate the veteran user responses; thanks for taking the time.

I don't think there's much to debate on whether software encoding will provide a better encode (à la "archival quality"). In my workflow, I'm working with commercial discs. If I wanted to truly "archive," I'd stick with the original material as ripped from the disc. However, I'm looking at playback quality, which is highly subjective, and the target audience is usually the one to grade the results.

I believe my use case is a very common one. We're just trying to keep the 6-year-old from touching the physical media, make it super convenient for those less inclined to learn the home theater (e.g., "the Mrs."), and still be "good enough" for the audio/videophile (me). As long as I don't see pixelation, weird light and dark spots, or unintended blur in the action... I'm good.

To support these efforts, video encoding provides the advantage of using less storage. There are a lot of ways to encode with combinations of hardware, software, and codecs. As mentioned, software encoding generally provides the baseline or "gold standard" we're trying to match with hardware encoding. There are advantages in quality and storage size, but the time and power requirements are a consideration. One could certainly invest in a Threadripper to address the speed. However, the hardware cost is in the thousands for the CPU alone, and that's a lot of cash for something dedicated to just making sure Back to the Future looks as good as possible.

Leveraging a relatively inexpensive NVIDIA card, I can take advantage of older (i.e., cheaper) hardware in a purpose-built system. I don't feel obligated to keep the system on to do something else with all those CPU cores (as in the Threadripper alternative). The Turing generation of GPUs provides the latest and greatest hardware encoder and decoder available, and the fact that these same ASICs come standard in every version of the GPU puts us on an even playing field for troubleshooting and optimization.

This brings us to the glaring lack of resources and information available for leveraging Turing GPU encoding. The majority use case for the encoders is video game streaming (Twitch and YouTube). That's followed by the smaller base of Plex users looking for transcoding, and then there's us folks looking for an offline encoding solution for our libraries. There are video mastering and editing professionals in that space as well, but, again, they're part of the minority here, and software encoders are back on the table in that use case anyway. Thus, there's not a lot of data or information to consume in terms of optimization for the offline encoding workflow.

By participating in groups like this and documenting my own findings, I'm hoping to change that. Ideally, I'd love to write a primer on the minimum configuration needed to get superior results. At worst, I'll have somewhere to refer back to my own notes when I put this down and come back to it months later. At best, I could help others looking for the same workflow.
metaldave
Posts: 36
Joined: Mon Apr 10, 2017 6:40 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by metaldave »

mduell wrote: Sat Aug 29, 2020 10:09 pm NVDEC/ENC is on the main GPU ASIC, not a separate ASIC.
The ASIC is integrated into the GPU, but it is discrete from the CUDA cores. This seems supported by the available NVIDIA documentation.
[image from the NVIDIA documentation]
mduell wrote: Sat Aug 29, 2020 10:09 pm The NVENC documentation is a bit garbage
You have to start somewhere. :wink:
mduell wrote: Sat Aug 29, 2020 10:09 pm the available options don't result in much variation anyway.
As with the general HandBrake guidance, you probably don't want to worry about tweaking settings beyond the defaults. However, I'm looking to confirm that certain GPU features are enabled, and the command-line options help ensure this. The NVIDIA papers above use FFmpeg examples to compare software and hardware encoding, which allows them to make (ideally) equivalent comparisons.
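For example, FFmpeg can report whether the NVENC encoders were compiled in and which options they expose (this assumes `ffmpeg` is on the PATH; the output varies by build):

```shell
# Confirm the NVENC encoders are present in this FFmpeg build
ffmpeg -hide_banner -encoders | grep nvenc

# Dump every option the HEVC NVENC encoder accepts (presets, rate
# control modes, B-frame settings, etc.)
ffmpeg -hide_banner -h encoder=hevc_nvenc
```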
mduell wrote: Sat Aug 29, 2020 10:09 pm It's great for quick throwaway encodes and realtime captures; it's not worthwhile for archival material.
One person's "throwaway encode" is another's Sunday afternoon viewing.

With hardware encoding, we're looking for that balance of quality versus performance (speed, energy consumption). NVIDIA even mentions this advantage in their propaganda... I mean, documentation:
  • Turing GPUs come equipped with powerful NVENC video encoding units which delivers higher video compression efficiency compared to sophisticated software encoders like libx264, due to the combination of higher performance and lower energy consumption. The ideal solution for transcoding needs to be cost effective (dollars/stream) and power efficient (watts/stream).
One of the beautiful things about HandBrake is the ability to queue jobs with presets galore. This allows you to leave a machine running x264/x265 encodes until they're done (however long that takes). In the interim, you're dealing with cooling, CPU throttling, etc. Yes, I've got a netbook with an 8th-gen Intel i7 that can sit there and churn until it dies for all I care, but it's definitely not the most efficient way to do it (and certainly not the fastest). Going back to a desktop allows for various cooling options to prevent throttling and ensure maximum processing power. However, you're back to the cost issue, always looking for a better CPU to do the job (and we know Intel performance per watt is an issue with every generation).

My purpose-built machine is running a third-generation Intel Xeon E3-1245 V2 @ 3.40 GHz (quad-core with Hyper-Threading) and an NVIDIA GeForce GTX 1650 Super (Turing TU116). If I can ensure the video decode and encode take place within the GPU, I have the advantage of using GPU memory (4 GB of GDDR6), so the system specs are, theoretically, irrelevant. Granted, using software filters will change the workflow a bit (using a mix of CPU and GPU processing), but I won't hurt that CPU in the least (especially because it's well cooled and Turbo-capable up to 3.80 GHz).

Hey, we all gotta have hobbies. Getting maximum price-to-performance is always a challenge I like to take on.
metaldave
Posts: 36
Joined: Mon Apr 10, 2017 6:40 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by metaldave »

musicvid wrote: Sat Aug 29, 2020 3:44 am Your choice of a 1650 is a good one, according to at least two engineers I know.
My friends at the PLEX forums recommended this when I broached the subject of a low-cost, purpose-built encoding "appliance."

I'm tickled that it doesn't matter which Turing GPU you buy, as they all have the same NVENC and NVDEC ASICs. The GTX 1650 Super has the TU116 GPU (the same as the GTX 1660), so, in addition to the equivalent efficiency and performance, it also has HEVC B-frame support (per the NVIDIA Video Encode and Decode GPU Support Matrix). I'm not sure how to ensure I'm taking advantage of that capability (yet), but it's nice to know I'm not missing a feature compared to the other GPUs.
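From what I can tell, the relevant FFmpeg knob for this is `-b_ref_mode` on the NVENC encoders (a sketch under that assumption; file names are placeholders, and the option only works on hardware whose NVENC supports using B-frames as references):

```shell
# Hypothetical HEVC encode using B-frames as reference frames, which
# the support matrix lists as a Turing NVENC feature.
ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i input.mkv \
  -c:v hevc_nvenc -bf 4 -b_ref_mode middle \
  -c:a copy output.mkv
```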

As an aside: I'm not really a gamer, but I'm sure the 1,280 CUDA cores would do a fine job of keeping up if I ever cared to fire something up.

I bought the MSI GeForce GTX 1650 Super Gaming X version of the card. Of the three GTX 1650 Super models they offer, this was the longest card, but it offered the best cooling solution. It has a huge heatsink and two large fans. The cooling is so efficient, the fans never seem to run. I think the case fans do a decent job of keeping the air flowing, so the card's fans almost operate as a backup.
metaldave
Posts: 36
Joined: Mon Apr 10, 2017 6:40 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by metaldave »

Rodeo wrote: Sat Aug 29, 2020 10:22 pm It's still a bit confusing for end users because NVIDIA's API for NVENC does use CUDA (even though the encoding is still done on the dedicated hardware rather than the GPU itself).
To add to this confusion, the NVIDIA X Server Settings panel (Linux) has a GPU Utilization (percentage) and a Video Engine Utilization (percentage) diagnostic. Video Engine Utilization spikes at 100% when running an encode using NVENC, but GPU Utilization hovers in the mid-twenties. I can get the GPU Utilization percentage to tick up just by moving a window, so I'm sure the GPU is always involved in some manner.

Honestly, I expected everything to be considered under the umbrella of "GPU Utilization," so it's interesting they break out the Video Engine as a separate reading. At least I now understand which number to watch to ensure we're using NVENC or NVDEC.
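On the command line, `nvidia-smi` exposes the same split (assuming a driver recent enough to report the video engine counters):

```shell
# Per-second utilization: the `enc` and `dec` columns are the video
# engine, separate from the `sm` (CUDA core) column.
nvidia-smi dmon -s u

# One-shot view of the same counters
nvidia-smi -q -d UTILIZATION
```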
s55
HandBrake Team
Posts: 9850
Joined: Sun Dec 24, 2006 1:05 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by s55 »

HandBrake doesn't use NVDec, and we don't currently have plans to. It's fine with spec-compliant files but struggles when you throw error-prone OTA recordings, NVR files, and anything that isn't compliant at it. As such, the reliability just isn't high enough to justify it at this time.

As such, you do need a reasonable CPU and decently fast RAM to feed NVEnc, especially when you go above 1080p.

metaldave wrote: To add to this confusion, the NVIDIA X Server Settings panel (Linux) has a GPU Utilization (percentage) and a Video Engine Utilization (percentage) diagnostic. Video Engine Utilization spikes at 100% when running an encode using NVENC, but GPU hovers in the mid-twenties. I can get the GPU Utilization percentage to toggle just by moving a window, so I'm sure it's always involved in some manner.

Honestly, I expected everything to be considered under the umbrella of "GPU Utilization," so it's interesting they break out the Video Engine as a separate reading. At least I now understand which number I'm looking at to ensure we're using NVENC or NVDEC.
At least on Windows, the encode utilisation is under a separate graph as well, and that makes sense given it's discrete hardware.

I'd wager any "GPU" utilisation is orchestration, memory bandwidth, and PCIe link bandwidth, which are all shared.
metaldave
Posts: 36
Joined: Mon Apr 10, 2017 6:40 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by metaldave »

s55 wrote: Sun Aug 30, 2020 9:28 am At least on Windows, the encode utilisation is under a separate graph as well and it makes sense given it's discrete hardware.
Yeah, I miss my little Task Manager GPU charts (but not enough to abandon ship on Linux).

s55 wrote: Sun Aug 30, 2020 9:28 am I'd wager any "GPU" utilisation is orchestration, memory bandwidth, PCI-E link bandwidth that are all shared.
Agreed; I'm sure it's more noticeable on lower end hardware from 2013 as well. :wink:

s55 wrote: Sun Aug 30, 2020 9:28 am HandBrake doesn't use NVDec, and we don't currently have plans to. It's fine with spec-compliant files but struggles when you throw error-prone OTA recordings, NVR files, and anything that isn't compliant at it. As such, the reliability just isn't high enough to justify it at this time.

As such, you do need a reasonable CPU and decently fast RAM to feed NVEnc, especially when you go above 1080p.
This sheds a lot of light on the subject! I just assumed Handbrake was engaging FFmpeg to use hardware acceleration on both ends.

I can appreciate the desire to keep the workflow as foolproof as possible, accounting for a wide variety of input files. Regardless of what's doing the decode, the goal would be to ensure the decoded data goes into the GPU and stays there (without having to shuffle back through the PCIe bus, into system memory, etc.) throughout the transcode. Filters aside, is this happening when we're using hardware encoding?

Ideally, we'd start this transcode process in the GPU with the decode on hardware. Assuming we've got a compliant file (and understand the risks), is there a way to override and use NVDEC?
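Outside HandBrake, FFmpeg can keep the whole chain resident in GPU memory, including scaling (a sketch; it assumes a build with the CUDA filters enabled, and the file names and target size are placeholders):

```shell
# Decode (NVDEC) -> scale on the GPU -> encode (NVENC), with no
# round-trip through system memory for the video frames.
ffmpeg -y -vsync 0 \
  -hwaccel cuda -hwaccel_output_format cuda \
  -i input.mkv \
  -vf scale_cuda=1920:1080 \
  -c:v hevc_nvenc -preset slow \
  -c:a copy \
  output.mkv
```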

In terms of justification for adding the NVDEC capability: it's this exact scenario I'm describing, wherein we're using light system hardware and a qualified graphics card. Realizing the majority of enthusiasts are using x264/x265 software encoding, I expect there would, similarly, be little demand for hardware decoding. However, the GPU capability is advancing with each generation, so this is becoming more relevant.
s55
HandBrake Team
Posts: 9850
Joined: Sun Dec 24, 2006 1:05 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by s55 »

metaldave wrote: This sheds a lot of light on the subject! I just assumed Handbrake was engaging FFmpeg to use hardware acceleration on both ends.
I can appreciate the desire to keep the workflow as foolproof as possible and accounting for a wide variety of input files. Regardless of what's doing the decode, the goal would be to ensure the decoded data goes into the GPU and stays there (without having to shuffle back through the PCIe bus, into system memory, etc.) throughout the transcode. Filters aside, is this happening when we're using hardware encoding?
HandBrake is not an ffmpeg front-end. We have our own engine and pipeline spanning various libraries (including libavformat and libavcodec from ffmpeg). (Listed here if interested: https://github.com/HandBrake/HandBrake/ ... er/contrib)

Very crudely, it looks something like:

Decode -> Filters -> Video Encoding -> Mux
                  -> Audio Encoding ->
                  -> A/V Sync Management ->

all happening at the same time.

Honestly, if all you want is a "pure" GPU encode experience, HandBrake is not the tool for the job. There isn't much point using NVEnc in HandBrake if you're just going to cripple much of the key functionality on offer, since most of it won't run on the GPU. There are much simpler "pure"-path UIs out there that will give a better overall experience.

metaldave wrote: Ideally, we'd start this transcode process in the GPU with the decode on hardware. Assuming we've got a compliant file (and understand the risks), is there a way to override and use NVDEC?
Nope. Requires code to be altered to support it.
metaldave wrote: In terms of justification to add the NVDEC capability: it's this exact scenario I'm describing wherein we're using light system hardware and a qualified graphics card. Realizing the majority of enthusiasts are using x264/x265 software encoding, I expect there would, similarly, little demand for hardware decoding. However, the GPU capability is advancing with each generation, so this is becoming more relevant.
NVEnc had a vocal community demanding it; then, when it was added, virtually no one ended up using it due to poor results. So we are now in a situation where we have a feature that very few people are actually using, with no developer interest and no active upstream support from the vendor. We have no planned improvements for it, so it's really just ticking over in maintenance mode.

At this point in time, Intel QuickSync is the only platform getting investment, and that's mostly down to some volunteers from Intel who are helping out, so the pipeline there is much more advanced (it includes decode support and in some cases can utilise zero-copy, as it's all operating in system memory).

While QuickSync is not quite as fast currently, the output quality and file sizes are generally superior to NVEnc, and hopefully with Tiger Lake and Xe graphics we'll see a decent boost there too.
tlindgren
Bright Spark User
Posts: 244
Joined: Sun May 03, 2009 2:14 pm

Re: NVENC versus CUDA and Hardware Encoding Performance

Post by tlindgren »

metaldave wrote: Sat Aug 29, 2020 1:43 am I have put together a dedicated encoding workstation with a GTX 1650 Super (Turing generation GPU). With my new understanding of the integrated encoder, it seems that all Turing generation cards use the same NVENC ASIC. This means my entry level GTX 1650 Super has the same capability for encoding as any of the higher end GPU with more CUDA cores. There is some comfort that I didn't short myself by spending as little as possible.
To avoid this misleading someone later, it's important to note that this is NOT true: the standard 1650 cards have the much inferior Volta NVEnc encoder (which is nearly identical to the Pascal/10xx NVEnc). This can be confirmed by checking the official Support Matrix posted in this thread: there's an asterisk on the family field for the 1650 which, if you follow it, reveals this.

With regard to the 1650 Super cards, Nvidia doesn't actually say which NVEnc block they have: the specification doesn't reveal it, and the Support Matrix doesn't have an entry, which means Nvidia doesn't officially commit to anything above the standard 1650 (Volta NVEnc).

However, all 1650 Supers I've heard of so far use the TU116 die (also used in the 1660), which means they have the Turing NVEnc block, but Nvidia makes no promises there. So: slight caution on the 1650 Super, especially in the mobile/laptop sector, where there are sometimes SERIOUS shenanigans going on with model branding (I think the historical record is three very different Nvidia dies used for the "same" GPU!).

For physical cards, I hope we'll never see 1650 Supers that have the Volta NVEnc: the TU117 die just isn't large enough for the 1650 Super, and it seems unlikely they'll make a new chip variant specifically to build a cheaper 1650 Super at this point... Still, it is a high-volume product where cost-saving redesigns could happen, but that seems unlikely, especially given that their chip designers are likely busy with Ampere chip variants now!
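One empirical way to check which NVEnc block a given card actually has (a sketch resting on the assumption that HEVC B-frames-as-reference is exclusive to the Turing block): attempt a tiny test encode that requires the feature and see whether it errors out.

```shell
# Synthetic one-second test encode using HEVC B reference frames.
# The expectation is that this succeeds on a Turing NVEnc block and
# fails on the Volta/Pascal block found in the plain 1650 (TU117).
ffmpeg -y -f lavfi -i testsrc2=duration=1:size=640x360:rate=30 \
  -c:v hevc_nvenc -bf 3 -b_ref_mode middle -f null - \
  && echo "Turing NVEnc features available" \
  || echo "B-frame reference mode rejected"
```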