Injecting a mount into a disjoint mount namespace behind a private mount propagation?

Question

As part of some work I'm doing on container diagnostics tooling for Linux container systems like docker and containerd/runc, I've been looking for a way to inject or bind a mount from one mount namespace into another disjoint mount namespace.

Problem statement

Consider the following scenario

hostdir                                   nsdir
-------                                   -----
/                                         /         [mountns 1, pidns 1]
  /var/containers/container1-root         /         [mountns 2, pidns 2, propagation=private]
    [not visible]                         /c1volume [mountns 2, pidns 2]
  /var/containers/container2-root         /         [mountns 3, pidns 1, propagation=private, privileged]

container1 is a regular container. It has a volume mounted on c1volume. Due to mount propagation rules, the host cannot see c1volume, as it's mounted after the new mount namespace is entered.
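
One quick way to confirm that two processes really are in different mount namespaces is to compare their /proc/PID/ns/mnt links. The following is a minimal sketch; the helper name same_mntns is made up for illustration, and reading another process's link normally requires root or ptrace permission.

```shell
#!/bin/sh
# same_mntns PID_A PID_B  (hypothetical helper, not part of any tool)
# Exit 0 if both processes share a mount namespace, 1 if they differ,
# 2 if either namespace link cannot be read.
same_mntns() {
    a=$(readlink "/proc/$1/ns/mnt") || return 2
    b=$(readlink "/proc/$2/ns/mnt") || return 2
    # The link target looks like "mnt:[4026531840]"; equal targets
    # mean the same namespace.
    [ "$a" = "$b" ]
}
```

For example, from the host, `same_mntns 1 <container1 leader pid> || echo "different mount namespaces"` should print the message for the scenario above.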

container2 is run with the pid namespace of the host, so it can "see" out of the container to interact with the host. It's privileged, and can use nsenter to container-break into the host mount namespace too.

The goal is to make the filesystem at /var/containers/container2-root visible to processes running in container1's mount namespace (mount namespace 2), e.g. so that processes in container1 can access additional injected tools or utilities not usually included in their container image, while still seeing the pid numbers of pidns 2 (container1).

I haven't been able to figure out a way to do this.

Mount propagation rules mean that bind-mounting from the host's mount namespace does not make the bind mount visible to processes in container1's mount namespace:

mkdir /var/containers/container1-root/container2
mount -o bind /var/containers/container2-root /var/containers/container1-root/container2

Changing the mount propagation of /var/containers/container1-root appears to have no effect on this.
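
To check what propagation a mount actually has, the optional fields in /proc/PID/mountinfo (shared:N, master:N, unbindable) can be read directly; a mount with no optional fields is private. This is a small sketch with a made-up function name, mount_propagation, that reads mountinfo-format lines from stdin.

```shell
#!/bin/sh
# mount_propagation  (hypothetical helper)
# Reads mountinfo-format lines on stdin and prints
# "<mount point> <propagation>" for each one.
mount_propagation() {
    awk '{
        prop = ""
        # Optional fields start at field 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++)
            prop = prop (prop == "" ? "" : ",") $i
        # No optional fields means the mount is private.
        if (prop == "") prop = "private"
        print $5, prop
    }'
}
# Usage on a live system:
#   mount_propagation < /proc/self/mountinfo
```

Reading /proc/PID/mountinfo for the container's leader process instead of /proc/self/mountinfo shows the view from inside that process's mount namespace.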

I could create a new mount and process namespace that can see /var/containers/container1-root as / and has a bind mount visible for /var/containers/container2-root, but it won't see any of the processes in the original container1 pid namespace, and it won't see the mount of /c1volume.

I've tried a great many variations of tricks with pivot_root, unshare, nsenter, mount -o bind, etc., so far to no avail.

The co-operation of the leader process (pid 1) of container1 is not available; this is an external injection from the container tooling layer.

Demo setup

Here's a setup recipe to create a demo environment with handmade containerization using low-level Linux primitives so you can see what's going on.

# create "container images" (static)
mkdir images
cd images
mkdir -p container1-root/{bin,proc,sys,dev,etc}
# download a static busybox binary (-o takes the output path)
curl -sSLf -o container1-root/bin/busybox https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox
chmod +x container1-root/bin/busybox
for cmd in ls mount sh ; do ln -s busybox container1-root/bin/$cmd; done
cat > container1-root/enter <<'__END__'
#!/bin/sh
mount -t sysfs none /sys
exec /bin/busybox sh -i
__END__
chmod +x container1-root/enter
cp -aR container1-root container2-root
touch container1-root/container1
touch container2-root/container2
mkdir container1-root/c1volume
cd ..

# Create a volume for c1
mkdir -p volumes/c1volume
touch volumes/c1volume/i-see-c1volume

# create the container runtime dirs
for c in container1-root container2-root; do
mkdir -p {containers,workdirs,scratch}/$c
mount -t overlay overlay -o lowerdir=$PWD/images/$c,upperdir=$PWD/scratch/$c,workdir=$PWD/workdirs/$c $PWD/containers/$c
mount --make-rprivate $PWD/containers/$c
done

# [Terminal session 1: container1]
# Launch container1, with mounted volume not visible to the host and new pid namespace.
unshare -m 
mount -o bind volumes/c1volume containers/container1-root/c1volume
ls containers/container1-root/c1volume/
unshare -p -m --mount-proc --fork --propagation private --wd=containers/container1-root --root=containers/container1-root /enter
PS1='container1 # '
ls /c1volume
echo $$

# [Terminal session 2: container2]
# This container shares the host pid namespace, but not mount namespace, and does not
# have a mounted volume.
unshare -m
unshare -m --mount-proc --fork --propagation private --wd=containers/container2-root --root=containers/container2-root /enter
PS1='container2 # '

Demo

Now, from the host, you will see

host # findmnt | egrep 'c1volume|container[12]'
├─/root/containers/container1-root                  overlay                                        overlay         rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root
└─/root/containers/container2-root                  overlay                                        overlay         rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root

no c1volume is visible, and

host # ls /root/containers/container1-root/c1volume/
host # 

its bind-mounted contents are not visible.

A process in container2 can container-break to the host and then nsenter into container1:

container2 # /bin/busybox nsenter -t 1 -m -p /bin/bash -w /root
host # nsenter -t "$(lsof -t containers/container1-root)" --all -w -r /bin/sh
# ls /c1volume
i-see-c1volume

but has no way to access container2-root from there.

It's possible to mount -o bind into /proc/$(lsof -t containers/container1-root)/root/, but mount propagation means this won't be seen from the existing processes in container1-root. And if nsenter or unshare are used to first enter the mount namespace for container1, the container2-root file system is no longer visible so it cannot be bind-mounted.

Answer 1

Score: 2


So of course I worked it out right after finally writing this up, at least for my demo environment. I still have to compare against a real containerd setup to see whether it holds there.

The trick is that nsenter without any --root or --wd will remain in the host rootdir and workdir, but enter the guest mount namespace. It is not necessary to enter the guest (container1) pid namespace as well.

host # c1leader="$(lsof -t containers/container1-root)"
host # nsenter -t $c1leader -m
host # findmnt -o +PROPAGATION | egrep 'container[12]|c1volume'
├─/root/containers/container1-root                  overlay                                           overlay         rw,relatime,lowerdir=/root/images/container1-root,upperdir=/root/scratch/container1-root,workdir=/root/workdirs/container1-root private
│ ├─/root/containers/container1-root/c1volume       /dev/mapper/vgubuntu-root[/root/volumes/c1volume] ext4            rw,relatime,errors=remount-ro                                                                                                   private
│ ├─/root/containers/container1-root/proc           proc                                              proc            rw,nosuid,nodev,noexec,relatime                                                                                                 private
│ │ └─/root/containers/container1-root/proc         none                                              proc            rw,relatime                                                                                                                     private
│ └─/root/containers/container1-root/sys            none                                              sysfs           rw,relatime                                                                                                                     private
└─/root/containers/container2-root                  overlay                                           overlay         rw,relatime,lowerdir=/root/images/container2-root,upperdir=/root/scratch/container2-root,workdir=/root/workdirs/container2-root private
host # mkdir /root/containers/container1-root/container2-root
host # mount -o bind,ro /root/containers/container2-root /root/containers/container1-root/container2-root

Now, in container1's session:

container1 # ls /
bin              c1volume         container1       container2-root  dev              enter            etc              foo              proc             sys
container1 # ls /c1volume/
i-see-c1volume
container1 # ls container2-root/
bin         container2  dev         enter       etc         proc        sys
container1 # busybox ps
PID   USER     TIME  COMMAND
    1 0         0:00 /bin/busybox sh -i
   24 0         0:00 busybox ps
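
The steps above could be wrapped in a small helper. This is only a sketch under the demo's assumptions: inject_mount and the DRY_RUN variable are names invented here, and it relies on the host directory tree still being reachable inside container1's mount namespace, which appears to hold in the demo because container1 was entered with unshare --root rather than a pivot_root. A runtime that pivots the root may behave differently.

```shell
#!/bin/sh
# inject_mount LEADER_PID SRC DST  (hypothetical helper)
# Bind-mounts SRC read-only at DST from inside LEADER_PID's mount
# namespace. Only the mount namespace is entered (-m): no --root,
# no --wd, no pid namespace. Both paths are given as seen from the
# host-side tree. Set DRY_RUN to print the commands instead of
# running them (running them requires root).
inject_mount() {
    leader=$1 src=$2 dst=$3
    run=${DRY_RUN:+echo}
    $run nsenter -t "$leader" -m -- mkdir -p "$dst" &&
    $run nsenter -t "$leader" -m -- mount -o bind,ro "$src" "$dst"
}
```

With the demo layout, the call would be `inject_mount "$c1leader" /root/containers/container2-root /root/containers/container1-root/container2-root`; running it with DRY_RUN=1 first shows the two nsenter commands without touching any mounts.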

huangapple
  • Posted on January 9, 2023, 11:30:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75052934.html