Calling setns from Go returns EINVAL for mnt namespace
Question
The C code works fine and correctly enters the namespace, but the Go code always seems to return EINVAL from the setns call when entering the mnt namespace. I've tried a number of permutations (including embedded C code with cgo and an external .so) on Go 1.2, 1.3, and the current tip.
Stepping through the code in gdb shows that both sequences call setns in libc in exactly the same way (or so it appears to me).
I have boiled what seems to be the issue down to the code below. What am I doing wrong?
Setup
I have a shell alias for starting quick busybox containers:
alias startbb='docker inspect --format "{{ .State.Pid }}" $(docker run -d busybox sleep 1000000)'
After running this, startbb will start a container and output its PID.
lxc-checkconfig outputs:
Found kernel config file /boot/config-3.8.0-44-generic
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: missing
Network namespace: enabled
Multiple /dev/pts instances: enabled
--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: missing
Cgroup cpuset: enabled
--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
File capabilities: enabled
uname -a produces:
Linux gecko 3.8.0-44-generic #66~precise1-Ubuntu SMP Tue Jul 15 04:01:04 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Working C code
The following C code works fine:
#define _GNU_SOURCE /* exposes setns() in <sched.h> */
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int i;
    char nspath[1024];
    char *namespaces[] = { "ipc", "uts", "net", "pid", "mnt" };

    if (geteuid()) { fprintf(stderr, "%s\n", "abort: you want to run this as root"); exit(1); }
    if (argc != 2) { fprintf(stderr, "%s\n", "abort: you must provide a PID as the sole argument"); exit(2); }

    /* Join each namespace of the target PID in turn. */
    for (i = 0; i < 5; i++) {
        snprintf(nspath, sizeof(nspath), "/proc/%s/ns/%s", argv[1], namespaces[i]);
        int fd = open(nspath, O_RDONLY);
        if (setns(fd, 0) == -1) {
            fprintf(stderr, "setns on %s namespace failed: %s\n", namespaces[i], strerror(errno));
        } else {
            fprintf(stdout, "setns on %s namespace succeeded\n", namespaces[i]);
        }
        close(fd);
    }
    return 0;
}
After compiling with gcc -o checkns checkns.c, the output of sudo ./checkns <PID> is:
setns on ipc namespace succeeded
setns on uts namespace succeeded
setns on net namespace succeeded
setns on pid namespace succeeded
setns on mnt namespace succeeded
Failing Go code
Conversely, the following Go code (which should be identical) doesn't work quite as well:
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	if syscall.Geteuid() != 0 {
		fmt.Println("abort: you want to run this as root")
		os.Exit(1)
	}
	if len(os.Args) != 2 {
		fmt.Println("abort: you must provide a PID as the sole argument")
		os.Exit(2)
	}
	namespaces := []string{"ipc", "uts", "net", "pid", "mnt"}
	for i := range namespaces {
		fd, _ := syscall.Open(filepath.Join("/proc", os.Args[1], "ns", namespaces[i]), syscall.O_RDONLY, 0644)
		// RawSyscall returns (r1, r2, errno); r1 is non-zero when setns fails.
		ret, _, errno := syscall.RawSyscall(308, uintptr(fd), 0, 0) // 308 == setns on x86_64
		if ret != 0 {
			fmt.Println("setns on", namespaces[i], "namespace failed:", errno)
		} else {
			fmt.Println("setns on", namespaces[i], "namespace succeeded")
		}
	}
}
Instead, running sudo go run main.go <PID> produces:
setns on ipc namespace succeeded
setns on uts namespace succeeded
setns on net namespace succeeded
setns on pid namespace succeeded
setns on mnt namespace failed: invalid argument
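A return value of zero only says that the call was accepted. One way to double-check which namespaces two processes share is to compare the symlink targets under /proc/<pid>/ns. The following is a minimal sketch of that idea; the helper names nsID and sameNamespace are hypothetical, and it assumes a kernel (3.8 or later) where these entries are symlinks. Note that /proc/<pid>/ns reflects the thread-group leader, not any other thread of the process.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// nsID returns the symlink target of /proc/<pid>/ns/<name>.
// (Hypothetical helper, not part of the original programs.)
func nsID(pid, name string) (string, error) {
	return os.Readlink(filepath.Join("/proc", pid, "ns", name))
}

// sameNamespace reports whether two processes refer to the same
// namespace of the given kind by comparing their symlink targets.
func sameNamespace(pidA, pidB, name string) (bool, error) {
	a, err := nsID(pidA, name)
	if err != nil {
		return false, err
	}
	b, err := nsID(pidB, name)
	if err != nil {
		return false, err
	}
	return a == b, nil
}

func main() {
	// Compare against the PID given on the command line, or against
	// ourselves when no argument is supplied.
	target := "self"
	if len(os.Args) == 2 {
		target = os.Args[1]
	}
	for _, name := range []string{"ipc", "uts", "net", "pid", "mnt"} {
		same, err := sameNamespace("self", target, name)
		if err != nil {
			fmt.Println(name, "namespace: could not compare:", err)
			continue
		}
		fmt.Printf("%s namespace shared with %s: %v\n", name, target, same)
	}
}
```

Reading the links of another process generally needs the same privileges as the setns calls themselves, so this would be run with sudo as well.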
Answer 1
Score: 8
(There is an issue filed on the Go project)
So, the answer to this question is that you have to call setns from a single-threaded context. This makes sense since setns should join the current thread to the namespace. Since Go is multi-threaded, you need to make the setns call before the Go runtime threads start.
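To see how many OS threads a Go program is running at a given moment, the "Threads:" field of /proc/self/status can be read from inside the program. This is only a small diagnostic sketch (the function name threadCount is hypothetical, and it assumes Linux with /proc mounted); the exact number depends on the Go version and on what the program has done so far.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// threadCount returns the value of the "Threads:" field in
// /proc/self/status, i.e. the number of OS threads in this process.
func threadCount() (int, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Threads:") {
			value := strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
			return strconv.Atoi(value)
		}
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("no Threads: line found in /proc/self/status")
}

func main() {
	n, err := threadCount()
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read thread count:", err)
		os.Exit(1)
	}
	fmt.Println("OS threads in this process:", n)
}
```

Calling threadCount right before the setns loop in the failing program would show whether the process is still single-threaded at that point.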
I think this is because the thread in which the call to syscall.RawSyscall executes is not the main thread. Even with runtime.LockOSThread the result is not what you would expect (i.e. that the goroutine is "locked" to the main C thread and is therefore equivalent to the constructor trick explained below).
The reply I got after filing the issue suggested using "the cgo constructor trick". I couldn't find any "proper" documentation on this "trick", but it is used in nsinit by Docker/Michael Crosby, and even though I went over that code line by line, I didn't try running it this way (see below for frustration).
The "trick" is basically that you can get cgo to execute a C function prior to starting the Go runtime.
To do this, you decorate the function you want to run before Go starts up with the __attribute__((constructor)) attribute:
/*
__attribute__((constructor)) void init() {
    // this code will execute before Go starts up;
    // it runs in a single-threaded C context
    // before Go's threads start running
}
*/
import "C"
Using this as a template, I modified checkns.go like this:
/*
#define _GNU_SOURCE // exposes setns() in <sched.h>
#include <sched.h>
#include <stdio.h>
#include <fcntl.h>

__attribute__((constructor)) void enter_namespace(void) {
    setns(open("/proc/<PID>/ns/mnt", O_RDONLY, 0644), 0);
}
*/
import "C"

... rest of file is unchanged ...
This code works, but the PID has to be hardcoded since it's not being read properly from the command-line input. Still, it illustrates the idea (and works if you provide the PID of a container started as described above).
It's frustrating because I wanted to call setns multiple times, but since this C code executes before the Go runtime starts, there is no Go code available.
Update: Shlepping around in the kernel mailing lists turned up a conversation that documents this. I can't seem to find it in any actually published manpages, but here's the quote from a patch to setns(2), confirmed by Eric Biederman:
> A process may not be reassociated with a new mount namespace if
> it is multi-threaded. Changing the mount namespace requires
> that the caller possess both CAP_SYS_CHROOT and CAP_SYS_ADMIN
> capabilities in its own user namespace and CAP_SYS_ADMIN in the
> target mount namespace.