Testing how well x264 scales

Post by nhyone » Tue Mar 01, 2016 1:51 am

The question of how well x264 scales has come up again and again. I finally decided to do a small test to see for myself.

Source: 5-minute clip of a somewhat grainy 1080p BD, no audio
Encoder: HandBrake 0.10.2, x264, slower preset, CRF 20, ref=4:bframes=6
CPU: Intel Xeon CPU E5-2670 v2 @ 2.50 GHz (Ivy Bridge)
OS: RHEL 6.7

#proc   1 CPU     w/HT   2 CPUs  2 CPUs w/HT
1       1.197        –        –            –
2       2.248    1.343    2.369            –
4       4.247    2.586    4.497        2.806
6       6.519             6.577
8       8.666    4.912    8.522        5.376
10     10.753            10.564
12          –    7.428   12.837
16          –    9.821   16.935        9.984
20          –   12.233   21.037
24          –        –        –       14.784
32          –        –        –       19.225
40          –        –        –       22.864
(Numbers are fps.)

CPU affinity is set with taskset. Affinity does not restrict I/O, which is handled by the kernel, so runs with fewer cores get a slight relative advantage. The effect is negligible as long as there is little I/O, which is the case in this test.
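
A minimal sketch of how such a pinned run could be launched. Only `taskset -c` with a CPU list is taken from the post; the helper name `taskset_command`, the file names and the bare HandBrakeCLI arguments are placeholders for illustration, not the exact command used in the test.

```python
import shlex


def taskset_command(cpus, encoder_cmd):
    """Prefix encoder_cmd with `taskset -c <cpu-list>` to pin it to the given logical CPUs."""
    if not cpus:
        raise ValueError("need at least one logical CPU")
    # taskset accepts a comma-separated list of logical CPU numbers.
    cpu_list = ",".join(str(c) for c in sorted(set(cpus)))
    return ["taskset", "-c", cpu_list, *encoder_cmd]


# Placeholder encoder invocation; the file names are hypothetical.
encode = ["HandBrakeCLI", "-i", "source.m2ts", "-o", "out.mkv"]
cmd = taskset_command([0, 1, 2, 3], encode)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to `subprocess.run` as-is; child threads inherit the affinity mask, so the whole encode stays on the chosen processors.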

This is a straightforward transcode: no scaling, decomb, denoise or any other filters are used.

I did not set the number of x264 threads, so x264 uses its default (1.5x the number of logical processors).

How to interpret the table:
  1. First column: the number of logical processors used, from 1 to 40.
  2. Second column (1 CPU): all processors are physical cores on one CPU.
  3. Third column (w/HT): half are physical cores and half are their HyperThreading siblings, all on one CPU.
  4. Fourth column (2 CPUs): physical cores only, half from each CPU.
  5. Fifth column (2 CPUs w/HT): from each CPU, 1/4 are physical cores and 1/4 are their HyperThreading siblings.
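
The four column layouts can be sketched as a small selection function. The CPU numbering below (0 to 19 are physical cores, 20 to 39 their HT siblings) is an assumption made up for the example; on a real Linux box the pairing can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The names `make_topology` and `pick_cpus` are hypothetical.

```python
def make_topology(sockets=2, cores_per_socket=10):
    """Build a hypothetical topology: one (core_id, sibling_id) pair per physical core."""
    total = sockets * cores_per_socket
    return [
        [(s * cores_per_socket + c, total + s * cores_per_socket + c)
         for c in range(cores_per_socket)]
        for s in range(sockets)
    ]


def pick_cpus(topology, n, sockets_used, use_ht):
    """Pick n logical processors, spread evenly over the first sockets_used sockets."""
    per_socket, rem = divmod(n, sockets_used)
    if rem:
        raise ValueError("n does not divide evenly across sockets")
    threads_per_core = 2 if use_ht else 1
    cores_needed, rem = divmod(per_socket, threads_per_core)
    if rem:
        raise ValueError("odd processor count per socket cannot be paired with HT siblings")
    chosen = []
    for socket in topology[:sockets_used]:
        if cores_needed > len(socket):
            raise ValueError("not enough physical cores on one socket")
        for core in socket[:cores_needed]:
            # Take the core itself, plus its sibling when HT is wanted.
            chosen.extend(core[:threads_per_core])
    return sorted(chosen)


topo = make_topology()
print(pick_cpus(topo, 4, 1, False))  # column 2: 1 CPU
print(pick_cpus(topo, 4, 1, True))   # column 3: 1 CPU w/HT
print(pick_cpus(topo, 4, 2, False))  # column 4: 2 CPUs
print(pick_cpus(topo, 4, 2, True))   # column 5: 2 CPUs w/HT
```

Under this model the combinations that raise ValueError (for example 12 processors on one CPU without HT, or 2 processors on 2 CPUs with HT) coincide with the "–" cells in the table.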
If you read the table carefully, you'll find:
  1. HT increases performance by only ~15%.
  2. Using cores from two CPUs is slightly faster than using the same number of cores from one CPU, up to 6 cores (3+3).
  3. It scales linearly all the way to 20 cores (about 1.1 fps per core).
I did not expect linear scaling: based on previous experience, I expected a big hit when a single encode uses cores from two CPUs. But back then I was encoding several videos at the same time, so memory bandwidth could have been maxed out.
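
The HT gain and the per-core throughput can be recomputed from the table with a short script. The fps values are copied from the slower-preset table; the function names are made up for this sketch.

```python
# fps values copied from the slower-preset table, keyed by logical processor count.
ONE_CPU = {1: 1.197, 2: 2.248, 4: 4.247, 6: 6.519, 8: 8.666, 10: 10.753}
ONE_CPU_HT = {2: 1.343, 4: 2.586, 8: 4.912, 12: 7.428, 16: 9.821, 20: 12.233}
TWO_CPUS = {2: 2.369, 4: 4.497, 6: 6.577, 8: 8.522, 10: 10.564,
            12: 12.837, 16: 16.935, 20: 21.037}


def ht_gain(cores_fps, ht_fps):
    """Relative gain from adding HT siblings: n cores versus the same n cores plus siblings."""
    return {
        logical: ht_fps[logical] / cores_fps[logical // 2] - 1
        for logical in ht_fps
        if logical // 2 in cores_fps
    }


def fps_per_core(table):
    """Throughput divided by the number of physical cores used."""
    return {n: fps / n for n, fps in table.items()}


gains = ht_gain(ONE_CPU, ONE_CPU_HT)
per_core = fps_per_core(TWO_CPUS)

for logical, gain in sorted(gains.items()):
    print(f"{logical // 2:2d} cores + HT: {gain:+.1%}")
for cores, value in sorted(per_core.items()):
    print(f"{cores:2d} cores on 2 CPUs: {value:.3f} fps/core")
```

With these numbers the HT gain lands between roughly 12% and 16%, and the two-CPU runs stay between about 1.05 and 1.18 fps per core, which is what the near-linear scaling claim rests on.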

A sample encoding log:

[09:09:06] 1 job(s) to process
[09:09:06] starting job
[09:09:06] sync: expecting 7224 video frames
[09:09:06] job configuration:
[09:09:06]  * source
[09:09:06]    + God of Gamblers
[09:09:06]    + title 1, start 00:05:0.00 stop 00:10:0.00
[09:09:06]  * destination
[09:09:06]    + test_x264_cores/x264_preset_slower_17.mkv
[09:09:06]    + container: Matroska (libavformat)
[09:09:06]  * video track
[09:09:06]    + decoder: h264
[09:09:06]      + bitrate 200 kbps
[09:09:06]    + filters
[09:09:06]      + Framerate Shaper (0:27000000:1125000)
[09:09:06]        + frame rate: same as source (around 24.000 fps)
[09:09:06]      + Crop and Scale (1920:1080:0:0:0:0)
[09:09:06]        + source: 1920 * 1080, crop (0/0/0/0): 1920 * 1080, scale: 1920 * 1080
[09:09:06]    + dimensions: 1920 * 1080, mod 0
[09:09:06]    + encoder: H.264 (libx264)
[09:09:06]      + preset:  slower
[09:09:06]      + options: ref=4:bframes=6
[09:09:06]      + quality: 20.00 (RF)
[09:09:06] encx264: min-keyint: 24, keyint: 240
[09:09:06] encx264: encoding at constant RF 20.000000
[09:09:06] encx264: unparsed options: ref=4:bframes=6:b-adapt=2:direct=auto:analyse=all:me=umh:subme=9:trellis=2:rc-lookahead=60
x264 [info]: using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX
[09:09:06] reader: first SCR 80977500 id 0x1011 DTS 80977500
[h264 @ 0x7fd7080008c0] Application has requested 21 threads. Using a thread count greater than 16 is not recommended.
x264 [info]: profile High, level 4.0
[09:09:06] h264: "Chapter 1" (1) at frame 0 time 3750
[09:09:06] sync: first pts is 3750
[09:09:06] sync: video time didn't advance - dropped 1 frames (delta 0 ms, current 7500, next 11250, dur 3750)
[09:13:48] h264: "Chapter 2" (2) at frame 6500 time 24375000
[09:14:21] sync: reached pts 27000000, exiting early
[09:14:32] work: average encoding speed for job is 22.863874 fps
[09:14:32] mux: track 0, 7200 frames, 687896383 bytes, 18343.90 kbps, fifo 256
[09:14:33] reader: done. 2 scr changes
[09:14:33] sync: got 7200 frames, 7224 expected
[09:14:33] render: lost time: 0 (0 frames)
[09:14:33] render: gained time: 0 (0 frames) (0 not accounted for)
[09:14:33] h264-decoder done: 11897 frames, 0 decoder errors, 0 drops
x264 [info]: frame I:46    Avg QP:17.36  size:218003
x264 [info]: frame P:1504  Avg QP:21.15  size:131713
x264 [info]: frame B:5650  Avg QP:22.97  size: 84916
x264 [info]: consecutive B-frames:  2.9%  2.1%  3.9%  9.5% 14.9% 59.2%  7.5%
x264 [info]: mb I  I16..4: 10.2% 81.7%  8.1%
x264 [info]: mb P  I16..4:  3.2% 29.1%  0.8%  P16..4: 33.2% 20.2%  6.9%  0.1%  0.0%    skip: 6.5%
x264 [info]: mb B  I16..4:  0.5%  4.8%  0.1%  B16..8: 49.2% 21.2%  4.0%  direct: 7.1%  skip:13.1%  L0:49.9% L1:43.4% BI: 6.7%
x264 [info]: 8x8 transform intra:87.7% inter:76.6%
x264 [info]: direct mvs  spatial:99.4% temporal:0.6%
x264 [info]: coded y,uvDC,uvAC intra: 88.4% 71.5% 36.5% inter: 60.5% 32.8% 3.5%
x264 [info]: i16 v,h,dc,p: 27% 15% 34% 23%
x264 [info]: i8 v,h,dc,ddl,ddr,vr,hd,vl,hu:  8%  7% 10% 10% 15% 12% 14% 11% 13%
x264 [info]: i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 11% 10%  7%  9% 13% 11% 13% 10% 15%
x264 [info]: i8c dc,h,v,p: 44% 25% 17% 14%
x264 [info]: Weighted P-Frames: Y:6.4% UV:1.0%
x264 [info]: ref P L0: 45.6% 10.3% 25.6% 16.4%  2.1%  0.0%
x264 [info]: ref B L0: 86.2% 10.1%  3.8%
x264 [info]: ref B L1: 93.3%  6.7%
x264 [info]: kb/s:18343.95
[09:14:33] stream: 11940 good frames, 0 errors (0%)
[09:14:33] libhb: work result = 0
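Note that the log shows a request for 21 threads even though there are 40 logical processors in this case. The message carries the [h264 @ ...] prefix, so it appears to come from the H.264 decoder rather than from x264 itself.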
Last edited by nhyone on Sat Mar 19, 2016 2:14 am, edited 1 time in total.

Re: Testing how well x264 scales

Post by nhyone » Fri Mar 18, 2016 2:49 pm

Tested on the same machine, but with the veryfast preset, CRF 20.

#proc   1 CPU     w/HT   2 CPUs  2 CPUs w/HT
1       9.924        –        –            –
2      18.030   11.326   18.666            –
4      34.689   21.781   35.659       22.895
6      53.247            52.421
8      69.725   41.361   67.529       43.473
10     85.808            84.234
12          –   61.215   99.423
16          –   80.083  127.156       78.357
20          –   98.226  148.766
24          –        –        –      112.682
32          –        –        –      141.861
40          –        –        –      163.451
veryfast scales all the way up to 40 logical processors on 2 CPUs for 1080p encoding. I have no idea about 4 CPUs, though.

In general, I find that veryfast is ~3x faster than medium, and that in turn is ~3x faster than slower.
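
As a rough cross-check, the two presets can be compared directly from the tables in this thread. The three configurations below (1 core, 20 cores on 2 CPUs, 40 logical processors) are picked for illustration, and the variable names are made up for this sketch.

```python
# fps values copied from the two tables in this thread.
SLOWER = {"1 core": 1.197, "20 cores, 2 CPUs": 21.037, "40 logical, 2 CPUs w/HT": 22.864}
VERYFAST = {"1 core": 9.924, "20 cores, 2 CPUs": 148.766, "40 logical, 2 CPUs w/HT": 163.451}


def speedup(table, baseline="1 core"):
    """Speedup of each configuration relative to the single-core run."""
    return {name: fps / table[baseline] for name, fps in table.items()}


# How much faster veryfast is than slower in the same configuration.
ratio = {name: VERYFAST[name] / SLOWER[name] for name in SLOWER}
slower_speedup = speedup(SLOWER)
veryfast_speedup = speedup(VERYFAST)

for name in SLOWER:
    print(f"{name}: veryfast/slower = {ratio[name]:.2f}x, "
          f"speedup slower = {slower_speedup[name]:.1f}x, "
          f"veryfast = {veryfast_speedup[name]:.1f}x")
```

On these numbers veryfast is roughly 7x to 8.3x faster than slower, in the same ballpark as the ~3x times ~3x rule of thumb. The speedup over a single core is somewhat lower for veryfast (about 15x at 20 cores) than for slower (about 17.6x), so veryfast keeps gaining up to 40 logical processors but a little less efficiently.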
