ATI gpu support

jeylm
Posts: 1
Joined: Mon Mar 30, 2009 11:58 am

ATI gpu support

Post by jeylm »

Hello, AMD released the instruction set architecture for R700 today: http://developer.amd.com/gpu_assets/R70 ... ecture.pdf
If this is implemented, HandBrake's speed could be wonderful :) Is it possible?
nightstrm
Veteran User
Posts: 1887
Joined: Fri Mar 23, 2007 5:43 am

Re: ATI gpu support

Post by nightstrm »

jeylm wrote:Hello, AMD released the instruction set architecture for R700 today: http://developer.amd.com/gpu_assets/R70 ... ecture.pdf
If this is implemented, HandBrake's speed could be wonderful :) Is it possible?
You'd have to talk with the x264 guys about implementing this, as HandBrake uses their library for encoding.
s55
HandBrake Team
Posts: 9900
Joined: Sun Dec 24, 2006 1:05 pm

Re: ATI gpu support

Post by s55 »

I wouldn't get your hopes up. CUDA was a bust, and I doubt this will be any different.
mrreload
Posts: 10
Joined: Mon Jan 05, 2009 10:38 pm

Re: ATI gpu support

Post by mrreload »

CUDA is a bust? I think not. GPU acceleration is just a newer technology that is taking longer to catch on.
I think putting GPU acceleration under the "When Hell Freezes Over" section is naive. We now live in a day when the common video card has much more power available for these types of tasks than the common CPU.
FYI, I love HandBrake and have encoded over 300 DVDs with it (and counting).
TedJ
Veteran User
Posts: 5388
Joined: Wed Feb 20, 2008 11:25 pm

Re: ATI gpu support

Post by TedJ »

The only API that has any chance of being implemented in future is the much-touted OpenCL... no one working on libx264 is willing to spend much time on an API that is limited to one OS and/or GPU architecture. Unfortunately, there is very little to OpenCL at the moment besides some press releases and demo code - if (and I stress if) OpenCL goes mainstream it may be implemented, but that won't happen for at least 12 months.

Based on some early tests with CUDA in libx264, the benefit isn't as great as the PR flacks would have you believe - GPUs excel at parallelism, which is better suited to spatial (intra-frame) compression than to temporal (inter-frame) compression.
s55
HandBrake Team
Posts: 9900
Joined: Sun Dec 24, 2006 1:05 pm

Re: ATI gpu support

Post by s55 »

mrreload wrote:I think putting GPU acceleration under the "When Hell Freezes Over" section is naive.
No one said that.
mrreload wrote:CUDA is a bust? I think not.
I think you're implying that I'm saying CUDA is useless for everything. I'm not.
In terms of x264, it was a bust, which is why we don't have an x264 lib with CUDA support. Hopes are that OpenCL will be easier to work with, but given the lack of info on OpenCL, that remains to be seen.
jbrjake
Veteran User
Posts: 4805
Joined: Wed Dec 13, 2006 1:38 am

Re: ATI gpu support

Post by jbrjake »

s55 wrote:
mrreload wrote:I think putting GPU acceleration under the "When Hell Freezes Over" section is naive.
No one said that.
I did, in the requests forum readme.

And mrreload, if you don't understand why the HandBrake team implementing GPU acceleration is about as likely as hell freezing over, you clearly have not read this thread and have no understanding of how HB works. There is practically no code capable of being accelerated that way in our project, and what little there is runs only in optional filters that have absolutely nothing to do with encoding. Requesting it from us is beyond useless and only demonstrates fundamental ignorance of what HandBrake is and how it works.

The only person being naive here is you.
s55
HandBrake Team
Posts: 9900
Joined: Sun Dec 24, 2006 1:05 pm

Re: ATI gpu support

Post by s55 »

jbrjake wrote:I did, in the requests forum readme.
I stand corrected.
tha_specializt
Posts: 2
Joined: Mon Aug 09, 2010 11:17 pm

Re: ATI gpu support

Post by tha_specializt »

jbrjake wrote:There is practically no code capable of being accelerated that way in our project,
That is the most enormous [Censored] I have read in the past 2 years. A GPU is capable of doing a [Censored] of atomic (!) operations per second, MUCH more than a standard CPU will ever dream of - hence EVERY CODE YOU CAN THINK OF fits in a GPU - especially (!) with frameworks like OpenCL, Brook+, Stream, CUDA and whatnot.
One would have to break down the code into atomic operations - which is a [Censored] of annoying work, but a not-so-small team of developers (therefore : $1 != megalomaniac scriptkiddies) could transpose the essential code in a few weeks - months, maybe. Oh, and before you start crying: companies have already caught on, CUDA is very well supported - how does "1080p / ~1000s" sound to you? Yes, that's what I thought. For the future: stop trolling, stop assuming and start KNOWING things, thank you.
TedJ
Veteran User
Posts: 5388
Joined: Wed Feb 20, 2008 11:25 pm

Re: ATI gpu support

Post by TedJ »

You've obviously read this thread, but like mrreload above you fail to comprehend... WE DO NOT DEVELOP OR MAINTAIN LIBX264! In order to support GPU acceleration, the upstream libx264 developers will have to implement it, not us.
mduell
Veteran User
Posts: 7329
Joined: Sat Apr 21, 2007 8:54 pm

Re: ATI gpu support

Post by mduell »

You're spewing in the wrong forum. The HB project does not maintain most of the processing-intensive code. Post on doom9 to talk to the x264 developers.
tha_specializt wrote:companies have already caught on, CUDA is very well supported
And yet all they've produced are a couple of mediocre encoders. You can get the same speed and quality on your CPU using settings similar to x264cli's superfast preset.
tha_specializt wrote:how does "1080p / ~1000s" sound to you? Yes, that's what I thought.
I don't even know what that means. 1000 seconds per 1080p frame?
mduell
Veteran User
Posts: 7329
Joined: Sat Apr 21, 2007 8:54 pm

Re: ATI gpu support

Post by mduell »

A CPU is a woman. A GPU is 9 women.

How long does it take a woman to gestate a baby?

How long does it take 9 women to gestate a baby?

Even with 9 times the resources you can't get it done faster because the good algorithms don't parallelize well.

There are algorithms that parallelize well. They're not very good, so even running on 9x the hardware they're still slower.
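
To put rough numbers on that, here is a minimal sketch using Amdahl's law, which nobody in this thread cites by name. The function name and the fractions are made up for illustration; they are not measurements of x264 or of any GPU.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Overall speedup when only part of the runtime can use extra processors."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical fractions, chosen only to illustrate the shape of the curve.
for fraction in (0.0, 0.5, 0.9, 1.0):
    print(f"{fraction:.0%} parallel on 9 units: {amdahl_speedup(fraction, 9):.2f}x")
```

With nothing parallelizable the speedup stays at 1x no matter how many units are added, which is the nine-women case; the full 9x only appears when every part of the job can be split.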
mduell
Veteran User
Posts: 7329
Joined: Sat Apr 21, 2007 8:54 pm

Re: ATI gpu support

Post by mduell »

A CPU is like fresh water, a GPU is like sea water.

You have a gallon of seawater and a cup of fresh water. Which one will keep you alive longer if you drink it?

But OMG, you have so much more seawater! And it's 99.9% the same as the fresh water! You can do some things, like weighing down a bucket, with either. Yet other things, like staying alive, you can do with just one.
saintdev
Regular User
Posts: 146
Joined: Wed Dec 20, 2006 4:17 am

Re: ATI gpu support

Post by saintdev »

TedJ said it best with respect to x264 and HandBrake, now on to the rest of your post.


First, please don't drink the Kool-Aid provided by the GPU vendors' marketing departments. It will lead to delusions. I've done some work on OpenCL lookahead motion estimation for x264. No, it is not finished, and I've kind of put it in a corner for now.
tha_specializt wrote:A GPU is capable of doing a [Censored] of atomic (!) operations per second, MUCH more than a standard CPU will ever dream of - hence EVERY CODE YOU CAN THINK OF fits in a GPU - especially (!) with frameworks like OpenCL, Brook+, Stream, CUDA and whatnot.
I'm kind of confused why you would bring up atomic operations here. Maybe you misunderstand what an atomic operation is? They're really rather slow on a GPU, because they usually require access to global memory. This gives you a penalty of several hundred clock cycles. It really would be more beneficial to avoid atomic operations if you can.
tha_specializt wrote:One would have to break down the code into atomic operations - which is a [Censored] of annoying work, but a not-so-small team of developers (therefore : $1 != megalomaniac scriptkiddies) could transpose the essential code in a few weeks - months, maybe. Oh, and before you start crying: companies have already caught on, CUDA is very well supported - how does "1080p / ~1000s" sound to you?
I'm not quite sure what you mean by the "1080p / ~1000s" here.

Anyway, the problem with GPUs is that they are massively parallel. This is also their greatest benefit. For algorithms that can be parallelized easily they are amazingly fast (Folding@Home is a great example of this type of computation). Most of the video encoding process is very linear: you have to have the results of one step before moving on to the next. On a GPU this is very limiting, because they basically work "all or nothing". You're either using all of the parallel processors or one (although the others still do the work, their results are wasted). So in this case you end up with a ~700 MHz processor that has VERY high memory latency. This ends up being many times slower than just running the code on the CPU.

Your next problem is transferring information to the GPU. While the PCI-E bus probably has enough bandwidth to do the transfer, it has very high latency compared to system memory. This severely limits where GPU acceleration can be used. For the most part you can just do the computation on the CPU and have the results before they would be transferred back from the GPU.
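
A back-of-the-envelope model of that trade-off follows. Every number in it (bus bandwidth, round-trip latency, GPU speedup, task sizes) is a hypothetical placeholder, not a measurement of any real card or bus, and the function name is invented for this sketch.

```python
def gpu_offload_seconds(cpu_seconds, gpu_speedup, bytes_moved,
                        bus_bytes_per_second, round_trip_latency):
    """Estimated wall time when a task is shipped to the GPU and back."""
    compute = cpu_seconds / gpu_speedup            # time spent on the GPU itself
    transfer = bytes_moved / bus_bytes_per_second  # time spent on the bus
    return round_trip_latency + transfer + compute


# Hypothetical hardware figures, for illustration only.
BUS = 4e9          # bytes per second
LATENCY = 50e-6    # seconds per round trip
SPEEDUP = 10.0     # GPU compute assumed 10x faster than the CPU

# A small task: 20 microseconds of CPU work on 64 KiB of data.
small_cpu = 20e-6
small_gpu = gpu_offload_seconds(small_cpu, SPEEDUP, 64 * 1024, BUS, LATENCY)

# A large task: 10 milliseconds of CPU work on 8 MiB of data.
large_cpu = 10e-3
large_gpu = gpu_offload_seconds(large_cpu, SPEEDUP, 8 * 1024 * 1024, BUS, LATENCY)

print("small task: CPU", small_cpu, "s, GPU path", small_gpu, "s")
print("large task: CPU", large_cpu, "s, GPU path", large_gpu, "s")
```

Under these assumed figures the small task is finished on the CPU before the round trip alone would complete, while the large one still comes out ahead on the GPU, so the latency only stops mattering when there is a lot of independent work per transfer.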

Now let's take the most ideal candidate for GPU acceleration in x264, the look-ahead motion search. Generally we will have the time to wait here, so the latency isn't too big of an issue. For the most part motion estimation is very well suited to parallel processing. Now, on to the issues here. Every thread within a (in OpenCL terms) work group must do the same calculation. If you have any divergence, your code ends up getting serialized, which drastically hurts performance. So this eliminates using any sort of efficient algorithm (even the simplest, diamond search, falls prey to this), because they must be able to make decisions independently of other threads. You are limited to 'dumb' algorithms, such as an exhaustive search. This is rather wasteful of resources, as you have to do a lot more work than is necessary to get the same result.
Next you have predictors to deal with. Even if you do an exhaustive search where each thread represents one MV, all the predictors in a macroblock need to be the same, so this part needs to be serialized for each macroblock.
Next you run into GPU memory access. GPUs have a lot of memory bandwidth, but it is still fairly high latency. If you have specific memory access patterns, loads and stores can be coalesced to take advantage of the bandwidth available. The memory access patterns used in motion searches do not match any of these patterns. You can use the texture cache to reduce memory latency; however, according to an NVIDIA engineer, this reduces available memory bandwidth by a factor of 5 (this may or may not only apply to NVIDIA hardware).
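
The difference between a data-dependent search and a 'dumb' one can be sketched as follows. This is not x264's code: `make_cost` is a toy stand-in for a block's SAD (it simply measures the distance to a known "true" vector), and the search radius of 8 is an arbitrary assumption.

```python
def make_cost(true_mv):
    """Toy stand-in for a block's SAD: distance from the candidate to the true vector."""
    tx, ty = true_mv
    return lambda dx, dy: abs(dx - tx) + abs(dy - ty)


def exhaustive_search(cost, radius):
    """Evaluate every vector in the window; the work is identical for every block."""
    best_mv, best_cost, evaluations = None, None, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            candidate_cost = cost(dx, dy)
            evaluations += 1
            if best_cost is None or candidate_cost < best_cost:
                best_mv, best_cost = (dx, dy), candidate_cost
    return best_mv, evaluations


def diamond_search(cost, radius):
    """Step to the cheapest of four neighbours until none beats the centre."""
    mv, best_cost, evaluations = (0, 0), cost(0, 0), 1
    while True:
        neighbours = []
        for sx, sy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = mv[0] + sx, mv[1] + sy
            if abs(nx) <= radius and abs(ny) <= radius:
                neighbours.append((cost(nx, ny), (nx, ny)))
                evaluations += 1
        if not neighbours or min(neighbours)[0] >= best_cost:
            return mv, evaluations
        best_cost, mv = min(neighbours)


# Four hypothetical blocks with different amounts of motion.
for true_mv in ((0, 0), (1, 0), (3, -2), (6, 5)):
    cost = make_cost(true_mv)
    print(true_mv, "diamond:", diamond_search(cost, 8)[1],
          "exhaustive:", exhaustive_search(cost, 8)[1])
```

The diamond search does far fewer evaluations, but how many depends on the block, so threads handling different blocks would branch and finish at different times. The exhaustive search does the same fixed amount of work for every block, which is why it fits lockstep hardware despite the waste.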

I'm sure I've missed a few more points, but I don't feel like typing any longer.

Note, Fermi does change some of this slightly, but not enough.
jbrjake
Veteran User
Posts: 4805
Joined: Wed Dec 13, 2006 1:38 am

Re: ATI gpu support

Post by jbrjake »

tha_specializt wrote:
jbrjake wrote:There is practically no code capable of being accelerated that way in our project,
That is the most enormous [Censored] I have read in the past 2 years. A GPU is capable of doing a [Censored] of atomic (!) operations per second, MUCH more than a standard CPU will ever dream of - hence EVERY CODE YOU CAN THINK OF fits in a GPU - especially (!) with frameworks like OpenCL, Brook+, Stream, CUDA and whatnot.
One would have to break down the code into atomic operations - which is a [Censored] of annoying work, but a not-so-small team of developers (therefore : $1 != megalomaniac scriptkiddies) could transpose the essential code in a few weeks - months, maybe. Oh, and before you start crying: companies have already caught on, CUDA is very well supported - how does "1080p / ~1000s" sound to you? Yes, that's what I thought. For the future: stop trolling, stop assuming and start KNOWING things, thank you.
Wow. As everyone else has noted, you are a moron.

I spent weeks porting our filter code to OpenCL - the part that makes the most sense on the GPU because it's full of nested loops that iterate the same instructions over large multidimensional arrays. It was a total waste of time. Why? First off, coalesced loads are a pain in the ass when you need to do things like compare a pixel in one line against a pixel in another line/frame that's got a horizontal offset from the first. Second, the memory latency everyone keeps harping on. If you're going to work with data from, say, three frames at once, you have to do a ton of transfers from host memory. It means these "atomic" operations you're so sure work well on the GPU don't, at all. The benefits of the GPU totally disappear when you have to keep moving data back and forth across the bus. It's not like the demo apps they use to show off GPU processing, where you have two huge arrays, compare the same value in the same space in each array, and then output a single result for the entire comparison which can be quickly shot back at the CPU's memory. Most steps in video filtering will take in one or more whole frames and output one or more whole frames, and this is exacerbated by trying to do things "atomically" with the GPU as you suggest instead of keeping intermediary buffers entirely on host or GPU. Third, there are lots of branches involved, which significantly slow down GPU threading.
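
A toy model of why the horizontal offset hurts coalescing: count how many aligned memory segments a run of adjacent threads touches. The 64-byte segment size, the one-byte-per-thread layout, and the 768-byte row stride are all assumptions for illustration, and both helper names are invented; real coalescing rules differ between GPU generations.

```python
def row_address(row, x, stride):
    """Byte address of pixel (x, row) in a frame stored row by row."""
    return row * stride + x


def segments_touched(first_byte, thread_count, bytes_per_thread=1, segment_bytes=64):
    """Number of aligned memory segments read by a run of adjacent threads."""
    last_byte = first_byte + thread_count * bytes_per_thread - 1
    return last_byte // segment_bytes - first_byte // segment_bytes + 1


STRIDE = 768   # hypothetical padded row width in bytes
THREADS = 64   # hypothetical group of adjacent threads, one pixel each

# Each thread reads "its own" pixel in row 10, starting at an aligned x.
aligned = segments_touched(row_address(10, 128, STRIDE), THREADS)

# The same threads read the pixel 3 columns to the right in row 11.
shifted = segments_touched(row_address(11, 128 + 3, STRIDE), THREADS)

print("aligned read touches", aligned, "segment(s)")
print("offset read touches", shifted, "segment(s)")
```

In this model the shifted read straddles a segment boundary and costs twice the memory transactions for the same number of useful bytes, and a filter that compares against several offsets pays that on every comparison.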
tha_specializt
Posts: 2
Joined: Mon Aug 09, 2010 11:17 pm

Re: ATI gpu support

Post by tha_specializt »

In this thread: 99.999% of the people don't know how CPUs and/or GPUs work, nor do most people know what gates are and how they work .... Well, I refuse to respond to trolls and wannabes who start their "arguments" (if applicable) with insults; you kids need to learn about discussions.

There WAS _one_ real argument, though - the increased number of cycles needed to access the storage ... but even this problem can be reduced by a quite significant amount - with asynchronous data distribution, inline structures which are able to wait on an event (even if that means waiting 99% of the time), probability distribution in terms of accessing parts of the memory (where COULD my data be? Where IS it? Where CAN it be in the next cycle?) and many more techniques which are rather advanced and ... well ... are way too complex to explain to a few trolls :-) (yes, I don't understand some of them myself - yes, I am no superhuman; thank you, Cpt. Obvious)

jbrjake wrote:First off, coalesced loads are a pain in the ass when you need to do things like compare a pixel in one line against a pixel in another line/frame that's got a horizontal offset from the first
lol ... comparing pixels in hardware ... what brilliant evidence of the fact that you don't even know what you're doing.
You are actually using large structures, references and the like ON HARDWARE?? Lololololol, no wonder you guys never managed to harness the power of GPUs

By the way: "1000s / 1080p" = a whole movie, transcoded in ~1000s; 1080p. I thought that would be obvious ....
s55
HandBrake Team
Posts: 9900
Joined: Sun Dec 24, 2006 1:05 pm

Re: ATI gpu support

Post by s55 »

Thread locked.

Go find some other forum to troll on. Next time, try doing some real research before having an argument about something you clearly don't understand.
nightstrm
Veteran User
Posts: 1887
Joined: Fri Mar 23, 2007 5:43 am

Re: ATI gpu support

Post by nightstrm »

tha_specializt wrote:In this thread: 99.999% of the people don't know how CPUs and/or GPUs work, nor do most people know what gates are and how they work .... Well, I refuse to respond to trolls and wannabes who start their "arguments" (if applicable) with insults; you kids need to learn about discussions.

There WAS _one_ real argument, though - the increased number of cycles needed to access the storage ... but even this problem can be reduced by a quite significant amount - with asynchronous data distribution, inline structures which are able to wait on an event (even if that means waiting 99% of the time), probability distribution in terms of accessing parts of the memory (where COULD my data be? Where IS it? Where CAN it be in the next cycle?) and many more techniques which are rather advanced and ... well ... are way too complex to explain to a few trolls :-) (yes, I don't understand some of them myself - yes, I am no superhuman; thank you, Cpt. Obvious)

jbrjake wrote:First off, coalesced loads are a pain in the ass when you need to do things like compare a pixel in one line against a pixel in another line/frame that's got a horizontal offset from the first
lol ... comparing pixels in hardware ... what brilliant evidence of the fact that you don't even know what you're doing.
You are actually using large structures, references and the like ON HARDWARE?? Lololololol, no wonder you guys never managed to harness the power of GPUs

By the way: "1000s / 1080p" = a whole movie, transcoded in ~1000s; 1080p. I thought that would be obvious ....
I suggest you get in touch with the x264 project team (again, you're on the wrong forum) and enlighten them on how to properly support GPU acceleration. Until I see an x264 commit thanking you for showing them the way, I think I'll trust those who have a proven track record and have demonstrated their abilities.

I'm pretty sure this thread should be locked now. EDIT: Someone beat me to it.