ostringstream多线程下性能问题分析
最近在某 CPU 比较密集的高并发项目中,加了一段看起来影响不大的代码,上线后 CPU 占用率从 10% 上涨到 11%,觉得在预期内,于是没有太在意。到晚上 8 点左右访问高峰时期,CPU 占用率突然发生雪崩,暴涨至 50% 以上,大量请求失败。赶紧回滚代码,检查原因。
最终定位在一个看起来很简单的函数中,函数的功能是将几个整数拼成一个字符串,使用 std::ostringstream
生成字符串。这个类在多线环境下性能似乎特别差,为了验证这种推测,写了一个测试程序:
#include <sstream>
#include <string>
#include <stdio.h>
#include <sys/time.h>
std::string use_snprintf(int a) {
char buf[64];
snprintf(buf, sizeof(buf), "%d", a);
return buf;
}
std::string use_stringstream(int a) {
std::ostringstream oss;
oss << a;
return oss.str();
}
const int LOOPS = 1000000;
void *thread(void *p) {
std::string (*foo)(int) = (std::string (*)(int))p;
for (int i = 0; i < LOOPS; ++i)
foo(i + 1);
return p;
}
double run_with_threads(int threads, std::string (*foo)(int)) {
timeval start, end;
gettimeofday(&start, nullptr);
pthread_t *tids = new pthread_t[threads];
for (int i = 0; i < threads; ++i)
pthread_create(&tids[i], nullptr, thread, (void *)foo);
for (int i = 0; i < threads; ++i)
pthread_join(tids[i], nullptr);
delete[] tids;
gettimeofday(&end, nullptr);
return (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) * 1e-6;
}
void test_with_threads(int threads) {
printf("%d threads:\n", threads);
double time_snprintf = run_with_threads(threads, use_snprintf);
double time_stringstream = run_with_threads(threads, use_stringstream);
printf("snprintf: %f\n", time_snprintf);
printf("stringstream: %f\n", time_stringstream);
printf("stream/snprintf: %f\n", time_stringstream / time_snprintf);
printf("\n");
}
int main(int argc, char **argv) {
if (argc > 1) {
test_with_threads(atoi(argv[1]));
} else {
test_with_threads(1);
test_with_threads(2);
test_with_threads(4);
test_with_threads(10);
test_with_threads(20);
}
}
在我个人的四核电脑上,运行结果如下:
1 threads:
snprintf: 0.088347
stringstream: 0.369389
stream/snprintf: 4.181115
2 threads:
snprintf: 0.082788
stringstream: 0.865383
stream/snprintf: 10.453000
4 threads:
snprintf: 0.085714
stringstream: 1.529721
stream/snprintf: 17.846804
10 threads:
snprintf: 0.216703
stringstream: 3.827154
stream/snprintf: 17.660826
20 threads:
snprintf: 0.428851
stringstream: 7.649718
stream/snprintf: 17.837706
可见,单线程时,ostringstream
的效率比 snprintf
差,但也没有到数量级的差距,而线程数增加后,ostringstream
的效率下降非常快。这时候,我们已经可以合理推测是 ostringstream
使用了线程锁所致了。
使用 perf 跑一把看看,前几名是:
Samples: 25K of event 'cycles:u', Event count (approx.): 21735936055
Overhead Command Shared Object Symbol
16.15% bench.out libstdc++.so.6.0.21 [.] std::locale::operator=
16.04% bench.out libstdc++.so.6.0.21 [.] std::locale::locale
15.06% bench.out libstdc++.so.6.0.21 [.] std::locale::~locale
6.72% bench.out libstdc++.so.6.0.21 [.] std::has_facet<std::ctype<char> >
6.44% bench.out libstdc++.so.6.0.21 [.] __dynamic_cast
3.93% bench.out libstdc++.so.6.0.21 [.] std::use_facet<std::ctype<char> >
可以看到前三名都与 std::locale
有关,而且我们知道 stringstream
是依赖 locale 的。查阅 libstdc++ 源代码:
locale::locale(const locale& __other) throw()
: _M_impl(__other._M_impl)
{ _M_impl->_M_add_reference(); }
locale::~locale() throw()
{ _M_impl->_M_remove_reference(); }
const locale&
locale::operator=(const locale& __other) throw()
{
__other._M_impl->_M_add_reference();
_M_impl->_M_remove_reference();
_M_impl = __other._M_impl;
return *this;
}
引用计数!
继续往下查,_M_impl
的引用计数实现的核心就是简单的加一、减一原子操作。原子操作的效率是比普通读写低一些,但也不至于那么夸张,我只能想到 cache thrashing 一种解释。(Cache thrashing:多个核频繁写同一块内存,导致相互 invalidate 对方的 cache,引起大量 cache miss。)
为了验证这种想法,打开 perf 再看一下,用 annotate 功能,看哪条指令最耗 CPU 时间:
│ 0000003d8dcc13c0 <std::locale::~locale()@@GLIBCXX_3.4>:
0.65 │ cmpq $0x0,vtable for std::messages_byname<wchar_t
│ push %rbx
│ mov (%rdi),%rbx
0.59 │ ↓ je 40
│ mov $0xffffffff,%eax
│ lock xadd %eax,(%rbx)
98.42 │17: cmp $0x1,%eax
│ ↓ jne 38
│ test %rbx,%rbx
│ ↓ je 38
│ mov %rbx,%rdi
│ → callq std::locale::_Impl::~_Impl()@plt
│ mov %rbx,%rdi
│ pop %rbx
│ ↓ jmpq fffffffffffccec0
│ nop
0.21 │38: pop %rbx
0.13 │ ← retq
│ nop
│40: mov (%rbx),%eax
│ lea -0x1(%rax),%edx
│ mov %edx,(%rbx)
│ ↑ jmp 17
这上面显示 98.42% 的 CPU 时间耗在 cmp $0x1,%eax
上,这没有道理,于是看上一条,是 lock xadd
,果然,原子操作!确认了!(由于 CPU 流水线的原因,采样结果偏差一条指令是非常正常的事情。)
以上实验都是在四核电脑上做的,尚且如此明显,在服务器上几十个核争抢一把锁,那就更严重得多,发生雪崩也就好理解了。