TedJ said it best with respect to x264 and HandBrake; now on to the rest of your post.
First, please don't drink the Kool-Aid provided by GPU vendors' marketing departments. It will lead to delusions. I've done some work on OpenCL lookahead motion estimation for x264. No, it is not finished, and I've kind of put it in a corner for now.
tha_specializt wrote:A GPU is capable of doing a [Censored] of atomic (!) operations per second, MUCH more than a standard CPU will ever dream of - hence EVERY CODE YOU CAN THINK OF fits in a GPU - especially (!) with frameworks like OpenCL, Brook+, Stream, CUDA and whatnot.
I'm kind of confused why you would bring up atomic operations here; maybe you misunderstand what an atomic operation is? They're actually rather slow on a GPU, because they usually require access to global memory, which carries a penalty of several hundred clock cycles. It's really more beneficial to avoid atomic operations if you can.
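To make that concrete, here's a hypothetical CUDA sketch (the counting task and all the names are mine, not anything from x264): the first kernel issues one global-memory atomic per thread, while the second reduces in fast shared memory first and issues only one atomic per block.

Code:
    #include <cuda_runtime.h>

    __global__ void count_naive(const int *data, int n, int *count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > 0)
            atomicAdd(count, 1);          // one global-memory atomic per thread: slow
    }

    __global__ void count_reduced(const int *data, int n, int *count)
    {
        __shared__ int partial[256];      // assumes blockDim.x == 256
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        partial[tid] = (i < n && data[i] > 0) ? 1 : 0;
        __syncthreads();
        // Tree reduction within the block, entirely in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            atomicAdd(count, partial[0]); // only one global atomic per block
    }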
tha_specializt wrote:One would have to break down the code into atomic operations - which is a [Censored] of annoying work but a not-so-small team of developers (therefore : $1 != megalomaniac scriptkiddies) could transpose the essential code in a few weeks - months, maybe. Oh and before you start crying : companys have already catched on, CUDA is very well supported - how does "1080p / ~1000s" sound to you?
I'm not quite sure what you mean by the "1080p / ~1000s" here.
Anyway, the problem with GPUs is that they are massively parallel. This is also their greatest benefit: for algorithms that can be parallelized easily they are amazingly fast (Folding@Home is a great example of this type of computation). Most of the video encoding process, however, is very linear; you have to have the results of one step before moving on to the next. On a GPU this is very limiting, because they basically work "all or nothing": either all of the parallel processors are doing useful work, or effectively only one is (the others still execute, but their results are wasted). In the serial case you end up with a ~700MHz processor that has VERY high memory latency, which is many times slower than just running the code on the CPU.
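Here's a hypothetical CUDA sketch of that "all or nothing" behaviour (the recurrence is just a stand-in for "step i needs step i-1", not real encoder code):

Code:
    __global__ void serial_chain(float *x, int n)
    {
        // Each element depends on the previous one, so no matter how many
        // threads get launched, only one can make forward progress.
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            for (int i = 1; i < n; i++)
                x[i] = 0.5f * x[i - 1] + 1.0f;  // stand-in for a serial dependency
        }
        // Every other thread has nothing useful to do; you're running serial
        // code on one slow core, paying global-memory latency on each step.
    }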
Your next problem is transferring data to the GPU. While the PCI-E bus probably has enough bandwidth to do the transfer, its latency is very high compared to system memory. This severely limits where GPU acceleration can be used: in many cases the CPU can finish the computation itself before the results would even be transferred back from the GPU.
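Here's a hypothetical host-side sketch of that round trip (the kernel and all names are illustrative). For a small piece of work, the two PCI-E copies can easily cost more than just doing the computation on the CPU:

Code:
    #include <cuda_runtime.h>

    __global__ void some_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;   // trivial stand-in for the real work
    }

    void process_on_gpu(const float *host_in, float *host_out, int n)
    {
        float *dev_in, *dev_out;
        cudaMalloc(&dev_in,  n * sizeof(float));
        cudaMalloc(&dev_out, n * sizeof(float));

        // Host -> device across PCI-E: high latency compared to system memory.
        cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

        some_kernel<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);

        // Device -> host: a second round trip before the CPU can use the result.
        cudaMemcpy(host_out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(dev_in);
        cudaFree(dev_out);
    }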
Now let's take the most ideal candidate for GPU acceleration in x264: the lookahead motion search. Generally we have time to wait here, so latency isn't too big an issue, and for the most part motion estimation is very well suited to parallel processing. Now, on to the problems. Every thread within a work-group (in OpenCL terms) must do the same calculation; if you have any divergence, your code ends up getting serialized, which drastically hurts performance. This rules out any sort of efficient algorithm (even the simplest, diamond search, falls prey to this), because those algorithms must be able to make decisions independently of other threads. You are limited to 'dumb' algorithms, such as an exhaustive search, which is rather wasteful of resources: you have to do a lot more work than necessary to get the same result.
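As a sketch of what the GPU actually tolerates, here's a hypothetical CUDA kernel for an exhaustive SAD search: one thread per candidate MV, every thread executing exactly the same instructions, so nothing diverges. The sizes, layout, and names are illustrative, not x264's, and it assumes the frames are padded so out-of-frame reads are safe.

Code:
    #define MB 16            // macroblock size
    #define RANGE 16         // candidate MVs in [-RANGE, RANGE)

    __global__ void exhaustive_sad(const unsigned char *cur,  // current frame
                                   const unsigned char *ref,  // reference frame
                                   int stride, int mb_x, int mb_y,
                                   int *sad_out)              // one SAD per candidate MV
    {
        // Each thread owns exactly one candidate MV; no thread makes a
        // decision that depends on another thread, so nothing serializes.
        int mvx = (int)threadIdx.x - RANGE;
        int mvy = (int)blockIdx.x  - RANGE;

        const unsigned char *c = cur + mb_y * MB * stride + mb_x * MB;
        const unsigned char *r = ref + (mb_y * MB + mvy) * stride + (mb_x * MB + mvx);

        int sad = 0;
        for (int y = 0; y < MB; y++)
            for (int x = 0; x < MB; x++)
                sad += abs((int)c[y * stride + x] - (int)r[y * stride + x]);

        sad_out[(mvy + RANGE) * (2 * RANGE) + (mvx + RANGE)] = sad;
        // Picking the minimum (and adding the MV cost against the predictor)
        // would be a separate reduction pass; omitted here.
    }

Launched as exhaustive_sad<<<2 * RANGE, 2 * RANGE>>>(...), that's 1024 candidate SADs per macroblock just to match what a diamond search would find in a handful of iterations.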
Next you have predictors to deal with. Even if you do an exhaustive search where each thread evaluates one MV, all the threads working on a macroblock need to use the same predictor, and the predictor depends on neighbouring macroblocks, so this part needs to be serialized for each macroblock.
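For reference, this is roughly what that dependency looks like (a hypothetical sketch; H.264 predicts from the median of the left, top, and top-right neighbours' MVs):

Code:
    struct MV { short x, y; };

    __device__ short median3(short a, short b, short c)
    {
        return (short)max(min((int)a, (int)b),
                          min(max((int)a, (int)b), (int)c));
    }

    // The predicted MV for a macroblock comes from neighbours that must
    // already be finished, which is exactly the serialization a GPU hates.
    __device__ MV predict_mv(const MV *mvs, int mb_x, int mb_y, int mb_stride)
    {
        MV left = mvs[mb_y * mb_stride + (mb_x - 1)];        // needs left MB done
        MV top  = mvs[(mb_y - 1) * mb_stride + mb_x];        // needs top MB done
        MV topr = mvs[(mb_y - 1) * mb_stride + (mb_x + 1)];  // needs top-right MB done
        MV pred;
        pred.x = median3(left.x, top.x, topr.x);
        pred.y = median3(left.y, top.y, topr.y);
        return pred;                                         // frame-edge cases omitted
    }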
Next you run into GPU memory access. GPUs have a lot of memory bandwidth, but the latency is still fairly high. With specific memory access patterns, loads and stores can be coalesced to take advantage of the available bandwidth; the access patterns used in motion searches don't fit any of those patterns. You can use the texture cache to reduce memory latency, but according to an nVidia engineer this reduces available memory bandwidth by a factor of 5 (this may or may not apply only to nVidia hardware).
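For illustration, here's a hypothetical pair of CUDA kernels showing the difference (the exact coalescing rules vary by hardware generation):

Code:
    __global__ void coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];            // adjacent threads read adjacent words:
                                       // combined into one wide transaction
    }

    __global__ void strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];   // scattered reads: many separate
                                       // transactions, each paying full latency
    }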
I'm sure I've missed a few more points, but I don't feel like typing any longer.
Note: Fermi does change some of this slightly, but not enough.