Making docker pull faster: from layer-parallel to chunk-parallel

This post is the day 15 entry in the Recruit Advent Calendar 2021.

TL;DR

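Compared with the conventional layer-parallel pull, a chunk-parallel pull using Range requests can be 2 to 5 times faster.

The speed of ECR Public varies widely by region, so use ECR Private if you need consistent speed. (The pull through cache feature is a good fit.)

Update 2022/10/9: ECR Public now responds from an appropriate PoP. I ran the benchmark again and added the results. In ap-northeast-1 it is nearly 6 times faster than before, and the difference between regions has shrunk.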

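Background and motivation

A container image consists of one or more manifests, plus the config and layers they reference.

Pulling a container image means fetching this information from a container registry.

See the following for what the process does in detail.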

https://knqyf263.hatenablog.com/entry/2019/11/29/052818

https://github.com/moby/moby/blob/8955d8da8951695a98eb7e15bead19d402c6eb27/contrib/download-frozen-image-v2.sh
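
As a rough illustration of the structure described above, the sketch below lists the blobs a pull has to fetch from a single image manifest. The manifest is a made-up example (digests and sizes are placeholders, not taken from a real image), and `blobs_to_fetch` is a hypothetical helper name. The field names follow the image manifest format, where `config` and each entry of `layers` carry a `digest` and a `size`.

```python
import json

# Hypothetical manifest for illustration only; digests and sizes are made up.
MANIFEST_JSON = """
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 1500,
    "digest": "sha256:aaaa"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 30000000,
      "digest": "sha256:bbbb"
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 218000000,
      "digest": "sha256:cccc"
    }
  ]
}
"""


def blobs_to_fetch(manifest_json):
    """Return (digest, size) for the config and every layer, in manifest order."""
    manifest = json.loads(manifest_json)
    descriptors = [manifest["config"], *manifest["layers"]]
    return [(d["digest"], d["size"]) for d in descriptors]


for digest, size in blobs_to_fetch(MANIFEST_JSON):
    print(digest, size)
```

Each digest is then fetched from the registry as a blob, and the `size` field is what makes it possible to plan byte ranges before any layer data has been downloaded.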

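A pull is often needed when running or building containers.

Pulls happen at many points in CI/CD, so making them faster is worthwhile.

By default, dockerd and containerd fetch layers with a concurrency of 3. (layer-parallel)

Some distributed images consist of a single layer, and for those layer parallelism offers no benefit.

Even with multiple layers, layer sizes are almost always uneven.

The time to fetch a container image is dominated by the size of the largest layer.

You may well have felt that waiting for the largest layer takes a long time.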
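
To make the effect of the largest layer concrete, here is a toy model, not a measurement. It assumes every connection has the same fixed throughput and that the network link itself is not the bottleneck, which is the situation where parallelism helps. The layer sizes, the 20 MB/s per-connection rate, and the 10 MB chunk size are made-up numbers, and `makespan` and `split` are hypothetical helper names.

```python
import heapq


def makespan(sizes_mb, workers, mb_per_sec):
    """Time to download items in order over `workers` connections of fixed throughput."""
    free_at = [0.0] * workers  # min-heap: when each connection becomes free
    heapq.heapify(free_at)
    for size in sizes_mb:
        start = heapq.heappop(free_at)  # earliest available connection
        heapq.heappush(free_at, start + size / mb_per_sec)
    return max(free_at)


def split(size_mb, chunk_mb):
    """Split one layer into fixed-size chunks; the last chunk may be smaller."""
    full, rest = divmod(size_mb, chunk_mb)
    return [chunk_mb] * full + ([rest] if rest else [])


layers = [10, 20, 218]  # MB, made-up and deliberately skewed
chunks = [c for layer in layers for c in split(layer, 10)]

layer_parallel = makespan(layers, workers=3, mb_per_sec=20)
chunk_parallel = makespan(chunks, workers=3, mb_per_sec=20)
print(f"layer-parallel: {layer_parallel:.1f} s, chunk-parallel: {chunk_parallel:.1f} s")
```

Under these made-up numbers the model gives 10.9 s for layer-parallel, which is exactly the time for the 218 MB layer on one connection, and 4.4 s for chunk-parallel with the same three connections.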

chunk-parallel

A chunk is a piece of a layer obtained by splitting it into a fixed number of bytes.

The idea is to speed up pulls by fetching chunks in parallel with Range requests [RFC7233].
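
A minimal sketch of the splitting step, assuming the layer size is already known from the manifest. `chunk_ranges` and `range_header` are hypothetical helper names. Note that byte ranges in a Range header are inclusive on both ends.

```python
def chunk_ranges(blob_size, chunk_size):
    """Yield (start, end) byte offsets, inclusive on both ends as in a Range header."""
    for start in range(0, blob_size, chunk_size):
        end = min(start + chunk_size, blob_size) - 1  # last chunk may be shorter
        yield start, end


def range_header(start, end):
    """Format the value of an HTTP Range header for one chunk."""
    return f"bytes={start}-{end}"


for start, end in chunk_ranges(blob_size=10, chunk_size=4):
    print(range_header(start, end))
```

A server that honors the header answers each such request with 206 Partial Content and only the requested bytes.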

The Docker Registry API v2 spec describes Range request support at the MAY and SHOULD level.

https://docs.docker.com/registry/spec/api/#pulling-a-layer

To allow for incremental downloads, Range requests should be supported, as well.

https://docs.docker.com/registry/spec/api/#fetch-blob-part

This endpoint may also support RFC7233 compliant range requests. Support can be detected by issuing a HEAD request.
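
The detection step mentioned in the spec could look roughly like the sketch below. It assumes the registry advertises support through the standard `Accept-Ranges: bytes` response header. A server may support ranges without advertising them, so a ranged GET that returns 206 is the more definitive check. `head_blob` and `accepts_byte_ranges` are hypothetical helper names, and authentication and redirect handling are deliberately simplified.

```python
import urllib.request


def accepts_byte_ranges(headers):
    """True if the response headers advertise byte-range support."""
    lowered = {key.lower(): value for key, value in headers.items()}
    units = lowered.get("accept-ranges", "").split(",")
    return "bytes" in [unit.strip().lower() for unit in units]


def head_blob(blob_url, token=None):
    """Issue a HEAD request for a blob URL and return the response headers.

    Requires network access. `token` is an optional bearer token (assumption:
    the registry uses bearer authentication).
    """
    request = urllib.request.Request(blob_url, method="HEAD")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return dict(response.headers.items())


# Offline examples of the header check:
print(accepts_byte_ranges({"Accept-Ranges": "bytes"}))
print(accepts_byte_ranges({"Accept-Ranges": "none"}))
```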

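The OCI Distribution Spec does not mention it. It was once mentioned in detail.md, but that has since been removed.

ECR is known to support it. Registries backed by object storage are likely to support it as well.

Switching from layer-parallel to chunk-parallel pulls makes parallel fetching possible even for a single-layer image. It also makes the pull less sensitive to skew in layer sizes.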
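
The following is a minimal sketch of a chunk-parallel fetch for a single layer, not the implementation used in the experiment. The `fetch_range` callback is an assumption: it stands in for whatever sends a GET with a `Range: bytes=start-end` header and returns the body. Here it is replaced by an in-memory blob so the example runs without a registry. The reassembled layer is checked against its digest, as a pull would do.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fetch_blob_in_chunks(fetch_range, blob_size, chunk_size, parallelism):
    """Fetch a blob as fixed-size chunks in parallel and return the joined bytes.

    fetch_range(start, end) must return bytes start..end, inclusive.
    """
    def fetch(start):
        end = min(start + chunk_size, blob_size) - 1
        return fetch_range(start, end)

    starts = range(0, blob_size, chunk_size)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() yields results in submission order, so chunks arrive already ordered.
        return b"".join(pool.map(fetch, starts))


def verify_digest(data, digest):
    """Check data against a digest string of the form 'sha256:<hex>'."""
    algorithm, _, expected = digest.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == expected


# Stand-in for a registry: serve ranges out of an in-memory "layer".
blob = bytes(range(256)) * 100
digest = "sha256:" + hashlib.sha256(blob).hexdigest()

fetched = fetch_blob_in_chunks(lambda s, e: blob[s:e + 1], len(blob), 1000, 4)
print(verify_digest(fetched, digest))
```

This version buffers every chunk in memory until the join. That keeps the code short but is not suitable for very large layers.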

Experiment

I wrote a tool that fetches the manifest and then the layers, and ran experiments with chunk parallelism (1, 2, 4, and 8 chunks).

https://github.com/orisano/chunpull

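I used ECR, which has no pull limit, with the amd64 image of datadog/agent from public.ecr.aws (248 MB compressed).

ECR Public and ECR Private are built differently, so I tested them separately.

The CloudFront PoP returned by ECR Public was not the expected one, so I ran comparison experiments in ap-northeast-1 and us-west-2.

Benchmarks were run with hyperfine, with 2 warmup runs and 10 measured runs.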

ECR Public - ap-northeast-1 (SF, SEA)

Benchmark 1: ./chunpull -i datadog/agent -c 10 -n 1
  Time (mean ± σ):     12.521 s ±  1.063 s    [User: 0.564 s, System: 0.559 s]
  Range (min … max):   11.193 s … 14.220 s    10 runs

Benchmark 2: ./chunpull -i datadog/agent -c 10 -n 2
  Time (mean ± σ):      8.562 s ±  0.301 s    [User: 0.604 s, System: 0.587 s]
  Range (min … max):    7.874 s …  8.982 s    10 runs

Benchmark 3: ./chunpull -i datadog/agent -c 10 -n 4
  Time (mean ± σ):      5.936 s ±  0.086 s    [User: 0.624 s, System: 0.580 s]
  Range (min … max):    5.794 s …  6.047 s    10 runs

Benchmark 4: ./chunpull -i datadog/agent -c 10 -n 8
  Time (mean ± σ):      4.263 s ±  0.037 s    [User: 0.640 s, System: 0.564 s]
  Range (min … max):    4.223 s …  4.344 s    10 runs

Summary
  './chunpull -i datadog/agent -c 10 -n 8' ran
    1.39 ± 0.02 times faster than './chunpull -i datadog/agent -c 10 -n 4'
    2.01 ± 0.07 times faster than './chunpull -i datadog/agent -c 10 -n 2'
    2.94 ± 0.25 times faster than './chunpull -i datadog/agent -c 10 -n 1'

Update 2022/10/09: ECR Public - ap-northeast-1 (NRT57-P3)

Benchmark 1: ./chunpull -i datadog/agent -c 10 -n 1
  Time (mean ± σ):      2.083 s ±  0.060 s    [User: 0.512 s, System: 0.421 s]
  Range (min … max):    1.995 s …  2.188 s    10 runs

Benchmark 2: ./chunpull -i datadog/agent -c 10 -n 2
  Time (mean ± σ):      1.739 s ±  0.101 s    [User: 0.462 s, System: 0.421 s]
  Range (min … max):    1.641 s …  1.944 s    10 runs

Benchmark 3: ./chunpull -i datadog/agent -c 10 -n 4
  Time (mean ± σ):      2.145 s ±  0.610 s    [User: 0.483 s, System: 0.446 s]
  Range (min … max):    1.872 s …  3.871 s    10 runs

Benchmark 4: ./chunpull -i datadog/agent -c 10 -n 8
  Time (mean ± σ):      1.849 s ±  0.037 s    [User: 0.470 s, System: 0.413 s]
  Range (min … max):    1.810 s …  1.926 s    10 runs

Summary
  './chunpull -i datadog/agent -c 10 -n 2' ran
    1.06 ± 0.07 times faster than './chunpull -i datadog/agent -c 10 -n 8'
    1.20 ± 0.08 times faster than './chunpull -i datadog/agent -c 10 -n 1'
    1.23 ± 0.36 times faster than './chunpull -i datadog/agent -c 10 -n 4'

Now that the correct PoP is returned, chunk-parallel appears to have no effect. Compared with 2021/12 it is nearly 6 times faster, which ECR Public users may have noticed.

ECR Private - ap-northeast-1

Benchmark 1: ./chunpull -r https://xxx.dkr.ecr.ap-northeast-1.amazonaws.com -i chunpull -c 10 -n 1
  Time (mean ± σ):      2.889 s ±  0.289 s    [User: 0.285 s, System: 0.285 s]
  Range (min … max):    2.761 s …  3.709 s    10 runs

  Warning: The first benchmarking run for this command was significantly slower than the rest (3.709 s). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Benchmark 2: ./chunpull -r https://xxx.dkr.ecr.ap-northeast-1.amazonaws.com -i chunpull -c 10 -n 2
  Time (mean ± σ):      1.508 s ±  0.021 s    [User: 0.300 s, System: 0.315 s]
  Range (min … max):    1.484 s …  1.542 s    10 runs

Benchmark 3: ./chunpull -r https://xxx.dkr.ecr.ap-northeast-1.amazonaws.com -i chunpull -c 10 -n 4
  Time (mean ± σ):     862.5 ms ±  13.8 ms    [User: 287.9 ms, System: 311.0 ms]
  Range (min … max):   843.1 ms … 883.8 ms    10 runs

Benchmark 4: ./chunpull -r https://xxx.dkr.ecr.ap-northeast-1.amazonaws.com -i chunpull -c 10 -n 8
  Time (mean ± σ):     574.0 ms ±  40.6 ms    [User: 301.4 ms, System: 343.8 ms]
  Range (min … max):   527.6 ms … 655.9 ms    10 runs

Summary
  './chunpull -r https://xxx.dkr.ecr.ap-northeast-1.amazonaws.com -i chunpull -c 10 -n 8' ran
    1.50 ± 0.11 times faster than './chunpull -r https://xxx.dkr.ecr.ap-northeast-1.amazonaws.com -i chunpull -c 10 -n 4'
    2.63 ± 0.19 times faster than './chunpull -r https://xxx.dkr.ecr.ap-northeast-1.amazonaws.com -i chunpull -c 10 -n 2'
    5.03 ± 0.62 times faster than './chunpull -r https://xxx.dkr.ecr.ap-northeast-1.amazonaws.com -i chunpull -c 10 -n 1'

ECR Public - us-west-2 (HIO)

Benchmark 1: ./chunpull -i datadog/agent -c 10 -n 1
  Time (mean ± σ):      2.359 s ±  0.192 s    [User: 0.452 s, System: 0.355 s]
  Range (min … max):    2.194 s …  2.847 s    10 runs

Benchmark 2: ./chunpull -i datadog/agent -c 10 -n 2
  Time (mean ± σ):      1.725 s ±  0.100 s    [User: 0.428 s, System: 0.374 s]
  Range (min … max):    1.586 s …  1.878 s    10 runs

Benchmark 3: ./chunpull -i datadog/agent -c 10 -n 4
  Time (mean ± σ):      1.490 s ±  0.144 s    [User: 0.419 s, System: 0.368 s]
  Range (min … max):    1.342 s …  1.857 s    10 runs

Benchmark 4: ./chunpull -i datadog/agent -c 10 -n 8
  Time (mean ± σ):      1.451 s ±  0.341 s    [User: 0.431 s, System: 0.361 s]
  Range (min … max):    1.289 s …  2.412 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './chunpull -i datadog/agent -c 10 -n 8' ran
    1.03 ± 0.26 times faster than './chunpull -i datadog/agent -c 10 -n 4'
    1.19 ± 0.29 times faster than './chunpull -i datadog/agent -c 10 -n 2'
    1.63 ± 0.40 times faster than './chunpull -i datadog/agent -c 10 -n 1'

Benchmark 1: ./chunpull -i datadog/agent -c 10 -n 1
  Time (mean ± σ):      2.431 s ±  0.178 s    [User: 0.449 s, System: 0.367 s]
  Range (min … max):    2.145 s …  2.681 s    10 runs

Benchmark 2: ./chunpull -i datadog/agent -c 10 -n 2
  Time (mean ± σ):      1.730 s ±  0.103 s    [User: 0.422 s, System: 0.386 s]
  Range (min … max):    1.611 s …  1.921 s    10 runs

Benchmark 3: ./chunpull -i datadog/agent -c 10 -n 4
  Time (mean ± σ):      1.675 s ±  0.554 s    [User: 0.442 s, System: 0.366 s]
  Range (min … max):    1.339 s …  3.238 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 4: ./chunpull -i datadog/agent -c 10 -n 8
  Time (mean ± σ):      1.317 s ±  0.064 s    [User: 0.395 s, System: 0.361 s]
  Range (min … max):    1.246 s …  1.404 s    10 runs

Summary
  './chunpull -i datadog/agent -c 10 -n 8' ran
    1.27 ± 0.42 times faster than './chunpull -i datadog/agent -c 10 -n 4'
    1.31 ± 0.10 times faster than './chunpull -i datadog/agent -c 10 -n 2'
    1.85 ± 0.16 times faster than './chunpull -i datadog/agent -c 10 -n 1'

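Results

In environments on AWS, chunk parallelism improved speed.

In ap-northeast-1, speed improves as parallelism increases for both Public and Private (up to 2.9x and 5.0x at 8 chunks, respectively).

In us-west-2, Public is already fast enough, and the benefit of parallelization is small.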
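
The arithmetic behind these statements can be reproduced from the mean times in the hyperfine output above. The sketch below assumes that `-n 8` means 8 chunks fetched in parallel, so the ideal speedup over `-n 1` would be 8x. `speedup_and_efficiency` is a hypothetical helper name.

```python
# Mean wall-clock seconds from the hyperfine output above: (-n 1, -n 8).
MEANS = {
    "ECR Public ap-northeast-1 (2021/12)": (12.521, 4.263),
    "ECR Public ap-northeast-1 (2022/10)": (2.083, 1.849),
    "ECR Private ap-northeast-1": (2.889, 0.574),
    "ECR Public us-west-2 (first run)": (2.359, 1.451),
}


def speedup_and_efficiency(serial_sec, parallel_sec, workers):
    """Return (speedup, fraction of the ideal `workers`-fold speedup)."""
    speedup = serial_sec / parallel_sec
    return speedup, speedup / workers


for name, (t1, t8) in MEANS.items():
    speedup, efficiency = speedup_and_efficiency(t1, t8, workers=8)
    print(f"{name}: {speedup:.2f}x speedup, {efficiency:.0%} of ideal")
```

The speedups match the Summary lines printed by hyperfine. The efficiency figure shows how far each case falls short of scaling linearly with the number of chunks.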

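ECR characteristics?

Even for pulls from public.ecr.aws, the time differs by about 5x between ap-northeast-1 and us-west-2. (mean 12 s => mean 2.4 s)

This is probably because, even when pulling from ap-northeast-1, the CloudFront PoP that responds is SEA or SF.

Comparing ECR Public with ECR Private, Private has no CloudFront in front of it, yet it is about 4 times faster. (mean 12 s => mean 2.9 s)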

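Unresolved issues

From my home network, chunk parallelism did not improve speed for some reason, perhaps because of throttling somewhere.

The ECR pull through cache did not work.

Why does ECR Public respond from the SF and SEA CloudFront PoPs whether it is accessed from ap-northeast-1 or from my home network?

The distribution of layer counts and layer sizes still needs to be investigated.

On registries with pull limits, the limit may be reached sooner.

Either memory usage increases or the number of file accesses increases.

ECR latency spikes at a certain rate, so the larger number of requests from chunk parallelism may increase latency.

I would like to optimize how the number of chunks, the chunk size, and the execution order are chosen, but could not implement that right away.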
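
On the memory versus file access tradeoff noted above: one way to avoid holding a whole layer in memory is to write each chunk to its own offset in a preallocated file as soon as it arrives. The sketch below shows that idea under simple assumptions. `preallocate` and `write_chunk` are hypothetical helper names, and the file is reopened for every chunk, which is exactly the extra file access the tradeoff refers to.

```python
import os
import tempfile


def preallocate(path, size):
    """Create a file of the final layer size so chunks can land in any order."""
    with open(path, "wb") as f:
        f.truncate(size)


def write_chunk(path, offset, data):
    """Write one chunk at its byte offset; only this chunk is held in memory."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


# Chunks arriving out of order, as they could with parallel requests.
path = os.path.join(tempfile.mkdtemp(), "layer.bin")
preallocate(path, 10)
write_chunk(path, 5, b"world")
write_chunk(path, 0, b"hello")

with open(path, "rb") as f:
    content = f.read()
print(content)
```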

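Closing

This post implements an idea I had shelved, and it got done thanks to lightning talk (LT) driven development.

Thanks to Sudo-san for signing me up for the LT.

Thanks to Tori-san for reaching out while I was tweeting about the experiments.

Thanks to Fukuda-san for writing the article on what pull does, since I was reluctant to dig into it myself.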
