BPF 在 Golang 中 Crash 导致用户进程奔溃

进程奔溃现场丢失对于排查问题来说就等于是白给了 🐶

感觉好像可行！

事情的起因是这样的，最近线上有一个服务的 RPC 调用出现了问题，虽然调用频率不高，但耗时偶尔会高于预期，且我们并没有将每次调用的耗时都记录下来，所以并不知道具体每次调用到底持续了多久。刚好最近在研究 BPF，所以就萌生了一个想法，能不能在保留现场的情况下，直接在进程的用户态进行插桩，记录方法的执行时间呢。

实践出真知，决定动手试一试。

BPF 的众多新特性得 Linux 内核版本 5.0 以上才支持，秉持着一步到位的态度，用虚拟机梭哈了一个 Fedora33，内核版本 5.10。

[root@localhost ~]# uname -r
5.10.10-200.fc33.x86_64

本次实验使用的是 bpftrace，bpftrace 是 BPF 一门前端语言，设计灵感来自 awk 和 C 语言，按照官方介绍把套件装完之后，开码！这里得吐槽一点，Fedora 用 yum 安装官方 BCC 套件会出现头文件缺失的问题… 最后我还是用源码自己 make 了一遍搞定。

试试到底行不行？

先准备一套 RPC 服务的 Client/Server

Client 端代码

// client/client.go
package main

import (
	"fmt"
	"net/rpc"
)

type Args struct {
	A, B int
}

func main() {
	client, err := rpc.Dial("tcp", "localhost:7788")
	if err != nil {
		panic(err)
	}

	for i := 0; i < 1000; i++ {
		args := &Args{A: 10, B: 100}
		var reply int

		// <-- 目的就是为了记录 Call 的耗时
		err = client.Call("Arith.Multiply", args, &reply)
		if err != nil {
			panic(err)
		}

		fmt.Printf("Arith: %d*%d=%d\n", args.A, args.B, reply)
	}
}
// go build -gcflags="-N -l" -o client client.go

Server 端代码

// server/server.go
package main

import (
	"log"
	"net"
	"net/http"
	"net/rpc"
	"time"
)

type Args struct {
	A, B int
}

type Arith int

func (t *Arith) Multiply(args *Args, reply *int) error {
	time.Sleep(3000 * time.Millisecond) // 假设每次调用服务端需要处理 3s
	*reply = args.A * args.B
	return nil
}

func main() {
	server := rpc.NewServer()
	server.Register(new(Arith))

	l, err := net.Listen("tcp", "127.0.0.1:7788")
	if err != nil {
		log.Fatalf("failed to connect address: [%s], error: %v", "127.0.0.1:7788", err)
	}

	for {
		conn, err := l.Accept()

		log.Printf("accept new connection: %v\n", conn.RemoteAddr().String())

		if err != nil {
			log.Println("listener Accept error")
			time.Sleep(100 * time.Millisecond)
			continue
		}

		go server.ServeConn(conn)
	}
}
// go build -gcflags="-N -l" -o server server.go

接下来把两者 build 起来，为了避免 Golang 的内联优化，构建时指定 -gcflags="-N -l"。最后就是 BPF 出场了。

不过在写 bpftrace 代码之前，还得根据 client 的二进制包的符号表对应的 Call 方法的栈地址，这个可以通过 objdump 命令找到。

[root@localhost client]# objdump --syms ./client | grep "(*Client).Call" | grep text
00000000005b8280 g     F .text	00000000000000ed              net/rpc.(*Client).Call  # <-- 就是它

bpftrace.bt 代码如下

BEGIN
{
  print("Start tracing client rpc-call...");
  @ct = nsecs;
}

// 使用 uprobe 探针在进入函数调用的时候记录起始时间
uprobe:/root/go/src/rpcexam/client/client:0x5b8280
{
    printf("Call begin at: ");
    time("%Y-%m-%d %H:%M:%S");
    @ct = nsecs;
    print("");
}

// 使用 uretprobe 探针在退出函数调用的时候记录结束时间
uretprobe:/root/go/src/rpcexam/client/client:0x5b8280
{
    printf("     finish at: ");
    time("%Y-%m-%d %H:%M:%S");
    printf("\n     call take: %d ms\n\n", (nsecs-@ct)/1000000)
}

END
{
  clear(@ct);
}

OK! Go!

启动服务端

[root@localhost server]# ./server
2021/02/06 06:09:49 accept new connection: 127.0.0.1:39320

启动客户端

[root@localhost client]# ./client
Arith: 10*100=1000
Arith: 10*100=1000
Arith: 10*100=1000
Arith: 10*100=1000

加载启动 BPF 程序，BPF 的输出日志看起来是符合预期的，每个调用耗时为 3s。

[root@localhost trace]# bpftrace trace.bt
Attaching 4 probes...
Start tracing client rpc-call...
Call begin at: 2021-02-06 06:13:11
     finish at: 2021-02-06 06:13:14
     call take: 3002 ms

Call begin at: 2021-02-06 06:13:14
     finish at: 2021-02-06 06:13:17
     call take: 3001 ms

Call begin at: 2021-02-06 06:13:17
     finish at: 2021-02-06 06:13:20
     call take: 3002 ms

Call begin at: 2021-02-06 06:13:20
     finish at: 2021-02-06 06:13:23
     call take: 3001 ms

使用 bpftool 查看正在运行的 BPF 程序，没毛病。

[root@localhost ~]# bpftool prog show | grep kprobe
228: kprobe  name uretprobe__5997  tag 16fb156ad25a6b14  gpl
229: kprobe  name uprobe__5997184  tag 55496f0b3bc88275  gpl

还真的不行！

正当我在感叹 BPF 技术还是有点东西的时候，很快啊，啪的一下就打脸了，我大意了，没有闪。Client 进程奔溃了？？？

SIGILL: illegal instruction
PC=0x7fffffffe001 m=0 sigcode=128
instruction bytes: 0xbb 0x5e 0x1 0x14 0xa7 0xff 0xff 0x90 0xbb 0x5e 0x1 0x14 0xa7 0xff 0xff 0x0

goroutine 1 [running]:
runtime: unknown pc 0x7fffffffe001
stack: frame={sp:0xc0000d9e20, fp:0x0} stack=[0xc0000d6000,0xc0000da000)
000000c0000d9d20:  0000000000000000  0000000000000000
000000c0000d9d30:  000000c000052720  000000c000052720
000000c0000d9d40:  000000c0000d9d70  000000c000100718
000000c0000d9d50:  000000c000000180  000000c000000208
000000c0000d9d60:  000000c000052770  000000c0000002b8
000000c0000d9d70:  000000c0000d9db0  00000000005b8170 <net/rpc.(*Client).Go+208>
000000c0000d9d80:  000000c0000d9db0  0000000000407ccb <runtime.chanrecv1+43>
000000c0000d9d90:  000000c0001006c0  000000c0000d9e08
000000c0000d9da0:  00000000005db301  000000c000100101
000000c0000d9db0:  000000c0000d9e10  00000000005b833c <net/rpc.(*Client).Call+188>
000000c0000d9dc0:  000000c0001006c0  000000c0000d9e08
000000c0000d9dd0:  000000000000000e  00000000005cae00
000000c0000d9de0:  000000c000017620  00000000005ca680
000000c0000d9df0:  000000c000017618  000000c0001006c0
000000c0000d9e00:  000000c00005dae0  000000c00005dae0
000000c0000d9e10:  000000c0000d9f78  00007fffffffe000
000000c0000d9e20: <000000c000052960  0000000000616d71
000000c0000d9e30:  000000000000000e  00000000005cae00
000000c0000d9e40:  000000c000017620  00000000005ca680
000000c0000d9e50:  000000c000017618  0000000000000000
000000c0000d9e60:  0000000000000000  0000000000000043
000000c0000d9e70:  00000000000003e8  0000000000000064
000000c0000d9e80:  000000000000000a  000000c000052960
000000c0000d9e90:  000000c000017620  000000c000017618
000000c0000d9ea0:  000000c000017620  000000c000052960
000000c0000d9eb0:  000000c000017610  000000000075abe0
000000c0000d9ec0:  000000000075a910  000000c0000d9f48
000000c0000d9ed0:  000000c00005c190  000000c000017620
000000c0000d9ee0:  00000000005b8509 <net/rpc.init+297>  000000c000017618
000000c0000d9ef0:  0000000000000000  0000000000000000
000000c0000d9f00:  0000000000000000  0000000000000000
000000c0000d9f10:  0000000000000000  000000c0000d9f48

首先第一个问题是， SIGILL 是什么信号？在 StackOverflow 上找到一个回答。

大概就是 CPU 执行了一个它并不理解的指令或者是寄存器跳到了一个非法的地址去了。

也就是这段指令

instruction bytes: 0xbb 0x5e 0x1 0x14 0xa7 0xff 0xff 0x90 0xbb 0x5e 0x1 0x14 0xa7 0xff 0xff 0x0

在 Golang 源码中定位到具体逻辑。

print("PC=", hex(c.sigpc()), " m=", _g_.m.id, " sigcode=", c.sigcode(), "\n")
if _g_.m.lockedg != 0 && _g_.m.ncgo > 0 && gp == _g_.m.g0 {
	print("signal arrived during cgo execution\n")
	gp = _g_.m.lockedg.ptr()
}
if sig == _SIGILL || sig == _SIGFPE {
	// It would be nice to know how long the instruction is.
	// Unfortunately, that's complicated to do in general (mostly for x86
	// and s930x, but other archs have non-standard instruction lengths also).
	// Opt to print 16 bytes, which covers most instructions.
	const maxN = 16
	n := uintptr(maxN)
	// We have to be careful, though. If we're near the end of
	// a page and the following page isn't mapped, we could
	// segfault. So make sure we don't straddle a page (even though
	// that could lead to printing an incomplete instruction).
	// We're assuming here we can read at least the page containing the PC.
	// I suppose it is possible that the page is mapped executable but not readable?
	pc := c.sigpc()
	if n > physPageSize-pc%physPageSize {
		n = physPageSize - pc%physPageSize
	}
	print("instruction bytes:")        // <-- 在这里打印的非法指令
	b := (*[maxN]byte)(unsafe.Pointer(pc))
	for i := uintptr(0); i < n; i++ {
		print(" ", hex(b[i]))
	}
	println()
}
print("\n")

level, _, docrash := gotraceback()

// 接下来程序进入 level>0 的分支
if level > 0 {
	// 打印 goroutine 现场
	goroutineheader(gp)
	tracebacktrap(c.sigpc(), c.sigsp(), c.siglr(), gp)
	if crashing > 0 && gp != _g_.m.curg && _g_.m.curg != nil && readgstatus(_g_.m.curg)&^_Gscan == _Grunning {
		// tracebackothers on original m skipped this one; trace it now.
		goroutineheader(_g_.m.curg)
		traceback(^uintptr(0), ^uintptr(0), 0, _g_.m.curg)
	} else if crashing == 0 {
		tracebackothers(gp)
		print("\n")
	}
	dumpregs(c)
}

// 一旦到了这里 那进程就必定会退出了
if docrash {
	crashing++
	if crashing < mcount()-int32(extraMCount) {
		// There are other m's that need to dump their stacks.
		// Relay SIGQUIT to the next m by sending it to the current process.
		// All m's that have already received SIGQUIT have signal masks blocking
		// receipt of any signals, so the SIGQUIT will go to an m that hasn't seen it yet.
		// When the last m receives the SIGQUIT, it will fall through to the call to
		// crash below. Just in case the relaying gets botched, each m involved in
		// the relay sleeps for 5 seconds and then does the crash/exit itself.
		// In expected operation, the last m has received the SIGQUIT and run
		// crash/exit and the process is gone, all long before any of the
		// 5-second sleeps have finished.
		print("\n-----\n\n")
		raiseproc(_SIGQUIT)
		usleep(5 * 1000 * 1000)
	}
	crash()
}

printDebugLog()

exit(2)

BPF 说好的是经过重重校验的，安全是有保证的，为什么这么轻易会把用户进程给搞挂了。在 Github Golang 的官方项目的 issue 里找到了相关的解释。

Golang 进程在 GC 和调度的时候是栈地址是会变化的，而 BPF 在使用 uretporbe 的时候需要提前指定好符号表中的栈地址，这时候就会导致进程奔溃了（所以应该是上面讲的寄存器跳到了一个非法的地址去了）。

要怎么样才行？

答案是，换门语言吧。Go 不支持… 😂

我非得让它行！

不就是一个二进制程序吗，Golang 不行，上 Rust！不得不吐槽，rust 的编译速度简直是（挺好的）…

为了简单起见，这里使用 reqwest 构建一个 HTTP 的客户端，就不用 async 了（可以用但没必要 🐶，blocking 即可）。

use reqwest::blocking;

fn httpget() {
    reqwest::blocking::get("https://chenjiandongx.me");
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    for i in 0..100 {
        httpget();
        println!("request count: {}", i);
    }
    Ok(())
}

使用 cargo build 将程序构建起来（这里同样只使用 debug 模式，该模式下没有内联优化），再使用 objdump 获取 httpget 方法的栈地址。

[root@localhost debug]# objdump --syms ./http | grep httpget
00000000000ab070 l     F .text	000000000000002e              _ZN4http7httpget17h72748326e6c579c8E

修改 trace.bt 代码，主要是变更 uprobe/uretprobe 的追踪路径

BEGIN
{
  print("Start tracing client rpc-call...");
  @ct = nsecs;
}

// 这里
uprobe:/root/rust/http/target/debug/http:0xab070
{
    printf("Call begin at: ");
    time("%Y-%m-%d %H:%M:%S");
    @ct = nsecs;
    print("");
}

// 还有这里
uretprobe:/root/rust/http/target/debug/http:0xab070
{
    printf("     finish at: ");
    time("%Y-%m-%d %H:%M:%S");
    printf("\n     call take: %d ms\n\n", (nsecs-@ct)/1000000)
}

END
{
  clear(@ct)
}

奇迹再现

启动 Client

[root@localhost debug]# ./http
request count: 0
request count: 1
request count: 2
request count: 3
request count: 4
request count: 5
request count: 6
request count: 7
request count: 8
request count: 9
request count: 10
request count: 11
request count: 12
request count: 13
request count: 14
request count: 15
request count: 16
request count: 17
request count: 18
request count: 19
request count: 20
request count: 21
request count: 22

加载并启动 BPF 程序，这回没事了吧。

[root@localhost trace]# bpftrace trace.bt
Attaching 4 probes...
Start tracing client rpc-call...
Call begin at: 2021-02-06 07:36:16
     finish at: 2021-02-06 07:36:16
     call take: 530 ms

Call begin at: 2021-02-06 07:36:16
     finish at: 2021-02-06 07:36:17
     call take: 426 ms

Call begin at: 2021-02-06 07:36:17
     finish at: 2021-02-06 07:36:17
     call take: 167 ms

Call begin at: 2021-02-06 07:36:17
     finish at: 2021-02-06 07:36:17
     call take: 266 ms

Call begin at: 2021-02-06 07:36:17
     finish at: 2021-02-06 07:36:17
     call take: 158 ms

Call begin at: 2021-02-06 07:36:17
     finish at: 2021-02-06 07:36:18
     call take: 171 ms

Call begin at: 2021-02-06 07:36:18
     finish at: 2021-02-06 07:36:18
     call take: 369 ms

结论：Rust 并不会像 Golang 那般在运行时修改寄存器地址，所以看起来问题不大。BPF 在我看来真的是一项具有革命意义的新技术，内核可编程 听起来是不是就有点厉害，在接下来的开发中有机会的话也会逐渐探索这项新技术，看看还能踩出什么坑来。