Injecting a mount into a disjoint mount namespace behind a private mount propagation?
Question
As part of some work I'm doing on container diagnostics tooling for Linux container systems like docker and containerd/runc, I've been looking for a way to inject or bind a mount from one mount namespace into another disjoint mount namespace.
Problem statement
Consider the following scenario
hostdir                           nsdir
-------                           -----
/                                 /           [mountns 1, pidns 1]
/var/containers/container1-root   /           [mountns 2, pidns 2, propagation=private]
[not visible]                     /c1volume   [mountns 2, pidns 2]
/var/containers/container2-root   /           [mountns 3, pidns 1, propagation=private, privileged]
container1 is a regular container. It has a volume mounted on c1volume. Due to mount propagation rules, the host cannot see c1volume, because it is mounted after the new mount namespace is entered.
container2 runs in the host's pid namespace, so it can "see" out of the container to interact with the host. It's privileged, and can use nsenter to break out into the host mount namespace too.
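To confirm which namespaces two processes actually share (for example, that container2's shell shares the host pid namespace but not its mount namespace), the namespace links under /proc can be compared. This is a minimal sketch; the function names ns_id and same_ns are made up for illustration, and reading another process's links generally requires root.

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identity of a process, e.g. "mnt:[4026531840]".
# TYPE is one of the entries under /proc/<pid>/ns (mnt, pid, net, ...).
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed only if both processes share that namespace.
# Returns 2 if either link cannot be read (missing process or no permission).
same_ns() {
    a=$(ns_id "$1" "$3") || return 2
    b=$(ns_id "$2" "$3") || return 2
    [ "$a" = "$b" ]
}
```

With the scenario above, same_ns 1 PID pid would be expected to succeed and same_ns 1 PID mnt to fail when PID is a process inside container2.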
The goal is to make the filesystem at /var/containers/container2-root visible to processes running in container1's mount namespace (mount namespace 2), e.g. so that processes in container1 can access additional injected tools or utilities not usually included in their container image, while still seeing the pid numbers of pidns 2 (container1).
I haven't been able to figure out a way to do this.
Mount propagation rules mean that bind-mounting from the host's mount namespace does not make the bind mount visible to processes in container1's mount namespace:
mkdir /var/containers/container1-root/container2
mount -o bind /var/containers/container2-root /var/containers/container1-root/container2
Changing the mount propagation of /var/containers/container1-root appears to have no effect on this.
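When experimenting with propagation settings, it helps to check what the kernel actually reports for each mount. The optional fields in /proc/<pid>/mountinfo carry this: shared:N marks a shared peer group, master:N marks a slave, and a mount with neither is private. The following is a minimal sketch; the function names propagation_of and list_propagation are made up for illustration.

```shell
#!/bin/sh
# propagation_of LINE: classify one /proc/<pid>/mountinfo line.
# Field 5 is the mount point; optional fields run from field 7 up to the "-" separator.
propagation_of() {
    printf '%s\n' "$1" | awk '{
        kind = "private"
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/)         shared = 1
            else if ($i ~ /^master:/)    slave = 1
            else if ($i == "unbindable") unbind = 1
        }
        if (shared && slave) kind = "shared,slave"
        else if (shared)     kind = "shared"
        else if (slave)      kind = "slave"
        else if (unbind)     kind = "unbindable"
        print $5, kind
    }'
}

# list_propagation [PID]: print "mountpoint propagation" for every mount
# visible to PID (default: the calling process).
list_propagation() {
    while IFS= read -r line; do
        propagation_of "$line"
    done < "/proc/${1:-self}/mountinfo"
}
```

Running list_propagation with the pid of a process in container1 shows the mounts as that process sees them; findmnt -o +PROPAGATION gives similar information for the current namespace.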
I could create a new mount and process namespace that sees /var/containers/container1-root as / and has a bind mount visible for /var/containers/container2-root, but it won't see any of the processes in the original container1 pid namespace, and it won't see the mount of /c1volume.
I've tried a great many variations of tricks with pivot_root, unshare, nsenter, mount -o bind etc., as yet to no avail.
The co-operation of the leader process (pid 1) of container1 is not available; this is an external injection from the container tooling layer.
Demo setup
Here's a setup recipe to create a demo environment with handmade containerization using low-level Linux primitives so you can see what's going on.
# create "container images" (static)
mkdir images
cd images
mkdir -p container1-root/{bin,proc,sys,dev,etc}
curl -sSLf -o container1-root/bin/busybox https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
chmod +x container1-root/bin/busybox
for cmd in ls mount sh ; do ln -s busybox container1-root/bin/$cmd; done
cat > container1-root/enter <<'__END__'
#!/bin/sh
mount -t sysfs none /sys
exec /bin/busybox sh -i
__END__
chmod +x container1-root/enter
cp -aR container1-root container2-root
touch container1-root/container1
touch container2-root/container2
mkdir container1-root/c1volume
cd ..
# Create a volume for c1
mkdir -p volumes/c1volume
touch volumes/c1volume/i-see-c1volume
# create the container runtime dirs
for c in container1-root container2-root; do
mkdir -p {containers,workdirs,scratch}/$c
mount -t overlay overlay -o lowerdir=$PWD/images/$c,upperdir=$PWD/scratch/$c,workdir=$PWD/workdirs/$c $PWD/containers/$c
mount --make-rprivate $PWD/containers/$c
done
# [Terminal session 1: container1]
# Launch container1, with mounted volume not visible to the host and new pid namespace.
unshare -m
mount -o bind volumes/c1volume containers/container1-root/c1volume
ls containers/container1-root/c1volume/
unshare -p -m --mount-proc --fork --propagation private --wd=containers/container1-root --root=containers/container1-root /enter
PS1='container1 # '
ls /c1volume
echo $$
# [Terminal session 2: container2]
# This container shares the host pid namespace, but not mount namespace, and does not
# have a mounted volume.
unshare -m
unshare -m --mount-proc --fork --propagation private --wd=containers/container2-root --root=containers/container2-root /enter
PS1='container2 # '
Demo
Now, from the host, you will see
host # findmnt | egrep 'c1volume|container[12]'
├─/root/containers/container1-root overlay overlay rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root
└─/root/containers/container2-root overlay overlay rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root
No c1volume mount is visible, and
host # ls /root/containers/container1-root/c1volume/
host #
its bind-mounted contents are not visible.
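The same check can be scripted per process: a mount is visible to a process only if it appears in that process's /proc/<pid>/mountinfo, with mount points shown relative to that process's root directory. A minimal sketch follows; the function name mount_visible_to and the MOUNTINFO override variable are made up for illustration (the override only exists so the parser can be tried against a saved file).

```shell
#!/bin/sh
# mount_visible_to PID TARGET: succeed if TARGET is a mount point as seen by PID.
# Reads /proc/PID/mountinfo unless MOUNTINFO (hypothetical override) names another file.
mount_visible_to() {
    awk -v target="$2" '
        $5 == target { found = 1 }   # field 5 is the mount point
        END { exit !found }
    ' "${MOUNTINFO:-/proc/$1/mountinfo}"
}
```

In the demo, mount_visible_to with the container1 leader's pid and /c1volume would be expected to succeed, while mount_visible_to $$ /root/containers/container1-root/c1volume run from a host shell would be expected to fail.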
A process in container2 can break out to the host and then nsenter container1:
container2 # /bin/busybox nsenter -t 1 -m -p /bin/bash -w /root
host # nsenter -t "$(lsof -t containers/container1-root)" --all -w -r /bin/sh
# ls /c1volume
i-see-c1volume
but has no way to access container2-root from there.
It's possible to mount -o bind into /proc/$(lsof -t containers/container1-root)/root/, but mount propagation means this won't be seen by the existing processes in container1-root. And if nsenter or unshare is used to first enter the mount namespace of container1, the container2-root filesystem is no longer visible, so it cannot be bind-mounted.
Answer 1
Score: 2
Of course, I worked it out right after finally writing this up, at least for my demo environment; I still have to compare against a real containerd to see whether it holds there.
The trick is that nsenter without --root or --wd keeps the host root directory and working directory, but enters the guest mount namespace. It is not necessary to enter the guest (container1) pid namespace as well.
host # c1leader="$(lsof -t containers/container1-root)"
host # nsenter -t $c1leader -m
host # findmnt -o +PROPAGATION | egrep 'container[12]|c1volume'
├─/root/containers/container1-root overlay overlay rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root private
│ ├─/root/containers/container1-root/c1volume /dev/mapper/vgubuntu-root[/root/volumes/c1volume] ext4 rw,relatime,errors=remount-ro private
│ ├─/root/containers/container1-root/proc proc proc rw,nosuid,nodev,noexec,relatime private
│ │ └─/root/containers/container1-root/proc none proc rw,relatime private
│ └─/root/containers/container1-root/sys none sysfs rw,relatime private
└─/root/containers/container2-root overlay overlay rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root private
host # mkdir /root/containers/container1-root/container2-root
host # mount -o bind,ro /root/containers/container2-root /root/containers/container1-root/container2-root
Now, in container1's session:
container1 # ls /
bin c1volume container1 container2-root dev enter etc foo proc sys
container1 # ls /c1volume/
i-see-c1volume
container1 # ls container2-root/
bin container2 dev enter etc proc sys
container1 # busybox ps
PID USER TIME COMMAND
1 0 0:00 /bin/busybox sh -i
24 0 0:00 busybox ps
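The steps above can be wrapped in a small helper. This is a sketch under the same assumptions as the demo: it relies on the host directory tree still being reachable inside container1's mount namespace, which holds here because the demo container was started with unshare --root; a runtime that pivots the root and detaches the old one may behave differently. The names inject_mount, run, and DRY_RUN are made up for illustration, and the real run needs root.

```shell
#!/bin/sh
# run CMD...: execute the command, or just print it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# inject_mount LEADER_PID SRC DST
# Enter only the mount namespace of LEADER_PID (no --root, no --wd, no pid namespace),
# then bind-mount host path SRC read-only onto host path DST inside that namespace.
inject_mount() {
    leader=$1; src=$2; dst=$3
    run nsenter -t "$leader" -m -- mkdir -p "$dst" &&
    run nsenter -t "$leader" -m -- mount -o bind,ro "$src" "$dst"
}
```

For the demo this would be called as inject_mount "$c1leader" /root/containers/container2-root /root/containers/container1-root/container2-root; setting DRY_RUN=1 first prints the two nsenter commands instead of running them.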