Testing how well x264 scales

Post by nhyone » Tue Mar 01, 2016 1:51 am

The question of how well x264 scales has come up again and again. I finally decided to do a small test to see for myself.

Source: 5-minute clip of a somewhat grainy 1080p BD, no audio
Encoder: HandBrake 0.10.2, x264, slower preset, CRF 20, ref=4:bframes=6
CPU: Intel Xeon CPU E5-2670 v2 @ 2.50 GHz (Ivy Bridge)
OS: RHEL 6.7

#proc   1 CPU     w/HT   2 CPUs  2 CPUs w/HT
1       1.197        –        –            –
2       2.248    1.343    2.369            –
4       4.247    2.586    4.497        2.806
6       6.519             6.577
8       8.666    4.912    8.522        5.376
10     10.753            10.564
12          –    7.428   12.837
16          –    9.821   16.935        9.984
20          –   12.233   21.037
24          –        –        –       14.784
32          –        –        –       19.225
40          –        –        –       22.864
(Numbers are fps.)

CPU affinity is set using taskset. I/O is handled by the kernel and is not bound by the affinity mask, so runs restricted to fewer cores get a small relative advantage. The effect is negligible as long as there is not much I/O, which is true for this test.
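As a rough illustration of how a run can be pinned, the sketch below builds a taskset command line from a list of logical processor IDs. `taskset -c` takes a comma-separated CPU list. The helper name `taskset_command` and the `encode_cmd` argument are made up for this example; `encode_cmd` stands in for the actual encoder invocation, which is not shown in this post.

```python
def taskset_command(cpus, encode_cmd):
    """Prefix encode_cmd so that it runs only on the given logical CPUs."""
    if not cpus:
        raise ValueError("need at least one logical processor")
    # taskset -c accepts a comma-separated list of logical CPU IDs.
    cpu_list = ",".join(str(c) for c in sorted(set(cpus)))
    return ["taskset", "-c", cpu_list] + list(encode_cmd)


if __name__ == "__main__":
    # Hypothetical placeholder command; substitute the real encoder invocation.
    cmd = taskset_command([0, 1, 2, 3], ["./encode.sh", "clip.m2ts"])
    print(" ".join(cmd))
    # To actually launch it: subprocess.run(cmd, check=True)
```

The child process inherits the affinity mask, so every encoder thread stays on the listed processors.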

This is a straightforward transcode. No scaling, decomb, denoise, or other filters are used.

I did not set the x264 threads (by default 1.5x the logical processors).

How to interpret the table:
  1. First column: the number of logical processors used, from 1 to 40
  2. Second column: all processors are physical cores on one CPU
  3. Third column: half are physical cores on one CPU, half are their HyperThread siblings
  4. Fourth column: physical cores only, half from each CPU
  5. Fifth column: from each CPU, 1/4 are physical cores and 1/4 are their HyperThread siblings
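The column definitions above can be sketched as a small helper that returns the logical processor IDs for a given column and processor count. The numbering is an assumption, not something stated in the post: IDs 0-9 are taken to be the cores of CPU 0, 10-19 the cores of CPU 1, and ID n+20 the HyperThread sibling of core n. The real layout should be checked with lscpu or /sys/devices/system/cpu/cpu*/topology/thread_siblings_list.

```python
CORES_PER_CPU = 10                    # the E5-2670 v2 has 10 cores per socket
NUM_CPUS = 2
PHYSICAL = CORES_PER_CPU * NUM_CPUS   # 20 physical cores in total

# column name -> (number of sockets used, whether HT siblings are included)
LAYOUTS = {
    "1 CPU": (1, False),
    "1 CPU w/HT": (1, True),
    "2 CPUs": (2, False),
    "2 CPUs w/HT": (2, True),
}


def cpu_set(column, n):
    """Return the sorted logical CPU IDs for n processors in a table column.

    ASSUMED numbering: core c of socket s is ID s*10 + c, and its
    HyperThread sibling is that ID + 20.
    """
    sockets, use_ht = LAYOUTS[column]
    groups = sockets * (2 if use_ht else 1)
    if n % groups:
        raise ValueError(f"{n} processors cannot be split evenly for {column!r}")
    cores_per_socket = n // groups
    if not 1 <= cores_per_socket <= CORES_PER_CPU:
        raise ValueError(f"{n} processors do not fit in {column!r}")
    cores = [s * CORES_PER_CPU + c
             for s in range(sockets)
             for c in range(cores_per_socket)]
    if use_ht:
        cores += [c + PHYSICAL for c in cores]
    return sorted(cores)


if __name__ == "__main__":
    for column in LAYOUTS:
        print(column, cpu_set(column, 4))
```

Combinations that raise ValueError here, such as 12 processors in the "1 CPU" column, line up with the "–" cells in the table.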
If you read the table carefully, you'll find:
  1. HT increases performance by only ~15%.
  2. Using cores from two CPUs is slightly faster than using the same number of cores from one CPU, up to 6 cores (3+3).
  3. It scales linearly all the way to 20 cores (about 1.1 fps per core).
I did not expect linear scaling. Based on my previous experience, I expected a huge hit when a single encode uses cores from both CPUs. But back then I was encoding several videos at the same time, so the memory bandwidth could have been maxed out.
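The figures behind these observations can be recomputed from the table. The sketch below copies the fps values from the slower-preset table; the function names are made up for this example. The HT gain compares N cores plus their N siblings against the same N cores alone.

```python
# fps values copied from the slower-preset table, keyed by logical processors.
SLOWER = {
    "1 CPU": {1: 1.197, 2: 2.248, 4: 4.247, 6: 6.519, 8: 8.666, 10: 10.753},
    "1 CPU w/HT": {2: 1.343, 4: 2.586, 8: 4.912, 12: 7.428, 16: 9.821,
                   20: 12.233},
    "2 CPUs": {2: 2.369, 4: 4.497, 6: 6.577, 8: 8.522, 10: 10.564,
               12: 12.837, 16: 16.935, 20: 21.037},
    "2 CPUs w/HT": {4: 2.806, 8: 5.376, 16: 9.984, 24: 14.784, 32: 19.225,
                    40: 22.864},
}


def ht_gain(base_col, ht_col, n_logical):
    """Fractional speedup from adding HT siblings to n_logical/2 cores."""
    with_ht = SLOWER[ht_col][n_logical]
    cores_only = SLOWER[base_col][n_logical // 2]
    return with_ht / cores_only - 1.0


def fps_per_processor(column):
    """fps divided by the number of logical processors, per row."""
    return {n: fps / n for n, fps in SLOWER[column].items()}


if __name__ == "__main__":
    for n in sorted(SLOWER["1 CPU w/HT"]):
        gain = ht_gain("1 CPU", "1 CPU w/HT", n)
        print(f"{n // 2:2d} cores + HT: {gain:+.1%}")
    for n, v in sorted(fps_per_processor("2 CPUs").items()):
        print(f"{n:2d} cores on 2 CPUs: {v:.3f} fps per core")
```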

A sample encoding log:

[09:09:06] 1 job(s) to process
[09:09:06] starting job
[09:09:06] sync: expecting 7224 video frames
[09:09:06] job configuration:
[09:09:06]  * source
[09:09:06]    + God of Gamblers
[09:09:06]    + title 1, start 00:05:0.00 stop 00:10:0.00
[09:09:06]  * destination
[09:09:06]    + test_x264_cores/x264_preset_slower_17.mkv
[09:09:06]    + container: Matroska (libavformat)
[09:09:06]  * video track
[09:09:06]    + decoder: h264
[09:09:06]      + bitrate 200 kbps
[09:09:06]    + filters
[09:09:06]      + Framerate Shaper (0:27000000:1125000)
[09:09:06]        + frame rate: same as source (around 24.000 fps)
[09:09:06]      + Crop and Scale (1920:1080:0:0:0:0)
[09:09:06]        + source: 1920 * 1080, crop (0/0/0/0): 1920 * 1080, scale: 1920 * 1080
[09:09:06]    + dimensions: 1920 * 1080, mod 0
[09:09:06]    + encoder: H.264 (libx264)
[09:09:06]      + preset:  slower
[09:09:06]      + options: ref=4:bframes=6
[09:09:06]      + quality: 20.00 (RF)
[09:09:06] encx264: min-keyint: 24, keyint: 240
[09:09:06] encx264: encoding at constant RF 20.000000
[09:09:06] encx264: unparsed options: ref=4:bframes=6:b-adapt=2:direct=auto:analyse=all:me=umh:subme=9:trellis=2:rc-lookahead=60
x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX
[09:09:06] reader: first SCR 80977500 id 0x1011 DTS 80977500
[h264 @ 0x7fd7080008c0] Application has requested 21 threads. Using a thread count greater than 16 is not recommended.
x264 [info]: profile High, level 4.0
[09:09:06] h264: "Chapter 1" (1) at frame 0 time 3750
[09:09:06] sync: first pts is 3750
[09:09:06] sync: video time didn't advance - dropped 1 frames (delta 0 ms, current 7500, next 11250, dur 3750)
[09:13:48] h264: "Chapter 2" (2) at frame 6500 time 24375000
[09:14:21] sync: reached pts 27000000, exiting early
[09:14:32] work: average encoding speed for job is 22.863874 fps
[09:14:32] mux: track 0, 7200 frames, 687896383 bytes, 18343.90 kbps, fifo 256
[09:14:33] reader: done. 2 scr changes
[09:14:33] sync: got 7200 frames, 7224 expected
[09:14:33] render: lost time: 0 (0 frames)
[09:14:33] render: gained time: 0 (0 frames) (0 not accounted for)
[09:14:33] h264-decoder done: 11897 frames, 0 decoder errors, 0 drops
x264 [info]: frame I:46    Avg QP:17.36  size:218003
x264 [info]: frame P:1504  Avg QP:21.15  size:131713
x264 [info]: frame B:5650  Avg QP:22.97  size: 84916
x264 [info]: consecutive B-frames:  2.9%  2.1%  3.9%  9.5% 14.9% 59.2%  7.5%
x264 [info]: mb I  I16..4: 10.2% 81.7%  8.1%
x264 [info]: mb P  I16..4:  3.2% 29.1%  0.8%  P16..4: 33.2% 20.2%  6.9%  0.1%  0.0%    skip: 6.5%
x264 [info]: mb B  I16..4:  0.5%  4.8%  0.1%  B16..8: 49.2% 21.2%  4.0%  direct: 7.1%  skip:13.1%  L0:49.9% L1:43.4% BI: 6.7%
x264 [info]: 8x8 transform intra:87.7% inter:76.6%
x264 [info]: direct mvs  spatial:99.4% temporal:0.6%
x264 [info]: coded y,uvDC,uvAC intra: 88.4% 71.5% 36.5% inter: 60.5% 32.8% 3.5%
x264 [info]: i16 v,h,dc,p: 27% 15% 34% 23%
x264 [info]: i8 v,h,dc,ddl,ddr,vr,hd,vl,hu:  8%  7% 10% 10% 15% 12% 14% 11% 13%
x264 [info]: i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 11% 10%  7%  9% 13% 11% 13% 10% 15%
x264 [info]: i8c dc,h,v,p: 44% 25% 17% 14%
x264 [info]: Weighted P-Frames: Y:6.4% UV:1.0%
x264 [info]: ref P L0: 45.6% 10.3% 25.6% 16.4%  2.1%  0.0%
x264 [info]: ref B L0: 86.2% 10.1%  3.8%
x264 [info]: ref B L1: 93.3%  6.7%
x264 [info]: kb/s:18343.95
[09:14:33] stream: 11940 good frames, 0 errors (0%)
[09:14:33] libhb: work result = 0
Note that HandBrake (or x264) requests a maximum of 21 threads, even though there are 40 logical processors in this case.
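For collecting results from many runs, the numbers of interest can be pulled out of a log like the one above with a couple of regular expressions. This is only a sketch: the patterns match the two lines as they appear in this particular log, `summarize` is a made-up name, and the bitrate check assumes the 24 fps source rate reported in the log.

```python
import re

FPS_RE = re.compile(r"work: average encoding speed for job is ([0-9.]+) fps")
MUX_RE = re.compile(r"mux: track 0, (\d+) frames, (\d+) bytes")


def summarize(log_text, source_fps=24.0):
    """Extract encoding speed, frame count, and average bitrate from a log."""
    speed = FPS_RE.search(log_text)
    mux = MUX_RE.search(log_text)
    if speed is None or mux is None:
        raise ValueError("log does not contain the expected work/mux lines")
    frames, size_bytes = int(mux.group(1)), int(mux.group(2))
    duration_s = frames / source_fps          # assumes a constant frame rate
    return {
        "encode_fps": float(speed.group(1)),
        "frames": frames,
        "kbps": size_bytes * 8 / duration_s / 1000,
    }


if __name__ == "__main__":
    sample = (
        "[09:14:32] work: average encoding speed for job is 22.863874 fps\n"
        "[09:14:32] mux: track 0, 7200 frames, 687896383 bytes, "
        "18343.90 kbps, fifo 256\n"
    )
    print(summarize(sample))
```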
Last edited by nhyone on Sat Mar 19, 2016 2:14 am, edited 1 time in total.

Re: Testing how well x264 scales

Post by nhyone » Fri Mar 18, 2016 2:49 pm

Tested on the same machine, but with the veryfast preset, CRF 20.

#proc   1 CPU     w/HT   2 CPUs  2 CPUs w/HT
1       9.924        –        –            –
2      18.030   11.326   18.666            –
4      34.689   21.781   35.659       22.895
6      53.247            52.421
8      69.725   41.361   67.529       43.473
10     85.808            84.234
12          –   61.215   99.423
16          –   80.083  127.156       78.357
20          –   98.226  148.766
24          –        –        –      112.682
32          –        –        –      141.861
40          –        –        –      163.451
veryfast scales all the way up to 40 processors on 2 CPUs for 1080p encoding. I have no idea about 4 CPUs, though.

In general, I find that veryfast is ~3x faster than medium, and medium in turn is ~3x faster than slower.
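The two tables allow a quick cross-check of the veryfast-to-slower ratio and of how far each preset falls from perfect scaling. The sketch below uses three matching rows copied from the tables; medium was not measured here, so only the combined ratio can be checked. The function names are made up for this example.

```python
# (column, logical processors) -> fps, copied from the two tables.
SLOWER = {
    ("1 CPU", 1): 1.197,
    ("2 CPUs", 20): 21.037,
    ("2 CPUs w/HT", 40): 22.864,
}
VERYFAST = {
    ("1 CPU", 1): 9.924,
    ("2 CPUs", 20): 148.766,
    ("2 CPUs w/HT", 40): 163.451,
}


def speed_ratio(key):
    """How many times faster veryfast is than slower for the same row."""
    return VERYFAST[key] / SLOWER[key]


def efficiency(table, key):
    """Measured fps relative to n times the single-processor fps."""
    _, n = key
    return table[key] / (n * table[("1 CPU", 1)])


if __name__ == "__main__":
    for key in SLOWER:
        print(key,
              f"ratio {speed_ratio(key):.2f}x,",
              f"slower eff. {efficiency(SLOWER, key):.0%},",
              f"veryfast eff. {efficiency(VERYFAST, key):.0%}")
```

On these rows the ratio comes out somewhat below the 9x that two 3x steps would imply, and veryfast loses more efficiency than slower at 20 cores, although its fps still rises with every added processor.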
