Faster version of strstr for a fixed string
Question
I have a short string `s` (max 8 characters) which I want to search for in many strings. More precisely, I want to find the first occurrence of `s` in each string of a stream. Finding that first index has to be as fast as possible for my use case, because a massive number of strings is processed per second and latency is critical. Of course, machines can be scaled out, but a big point is to reduce costs (and latency).

In general I want to create a C (or C++) function which behaves like `strstr`, but for a fixed "needle". The needle is not known at compile time, only at runtime (at startup). However, it's okay to generate code at runtime and compile it (any other "expensive" initialization is fine too). Once the needle is known, it won't change anymore.

Another detail: the needle will be in almost every string of the input stream, so it's okay if the algorithm is slower when the needle is absent (because that will almost never happen). Also maybe important: the input strings always have an extra 64 bytes allocated at the end (which might be helpful for SIMD operations).

I was actually surprised that `strstr` is already quite fast, but I guess there might be a better algorithm for the case where the needle does not change?

Thanks a lot
Answer 1
Score: 1
If your target handles unaligned reads gracefully, you could use this approach:

```c
#include <stddef.h>
#include <stdint.h>

char *mystrstr8(const char *s, uint64_t str8, uint64_t mask8) {
    for (const char *p = s; *p; p++) {
        /* unaligned 8-byte load; may read up to 7 bytes past the terminator */
        const uint64_t *p64 = (const uint64_t *)(uintptr_t)p;
        if ((*p64 & mask8) == str8)
            return (char *)(uintptr_t)p;
    }
    return NULL;
}
```

If the string is modifiable, has extra slack and its length is provided, you can remove the terminator test:

```c
#include <stddef.h>
#include <stdint.h>

/* requires at least 8 writable bytes after s[len] */
char *mystrstr8_len(char *s, size_t len, uint64_t str8, uint64_t mask8) {
    char *end = s + len;
    uint64_t *e64 = (uint64_t *)(uintptr_t)end;
    uint64_t ee = *e64;     /* save the 8 bytes that follow the string */
    *e64 = str8;            /* plant the needle there as a sentinel */
    for (const char *p = s;; p++) {
        const uint64_t *p64 = (const uint64_t *)(uintptr_t)p;
        if ((*p64 & mask8) == str8) {
            *e64 = ee;      /* restore the saved bytes */
            /* count the needle bytes so a match overlapping the sentinel is rejected */
            size_t n = 0;
            for (uint64_t m = mask8; m; m >>= 8)
                n += (m & 0xFF) != 0;
            if ((size_t)(end - p) >= n)
                return (char *)(uintptr_t)p;
            else
                return NULL;
        }
    }
}
```

`str8` and `mask8` must be precomputed from the bytes of the needle string, according to the target endianness. For example, to search for `Hello` on a little-endian machine, `str8` is `0x6f6c6c6548` and `mask8` is `0xffffffffff`.
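The precomputation described above could be sketched as follows. This is only an illustration: the `needle8` struct and `make_needle8` helper are hypothetical names that do not appear in the answer. The sketch assumes the needle is 1 to 8 bytes long and builds both values through `memcpy`, so the byte order follows the host automatically.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the answer): holds the packed needle
   and the mask that selects only the needle's bytes. */
typedef struct {
    uint64_t str8;   /* needle bytes, zero-padded to 8 bytes */
    uint64_t mask8;  /* 0xFF in every byte position the needle occupies */
    size_t len;      /* needle length in bytes */
} needle8;

static needle8 make_needle8(const char *needle) {
    needle8 n = { 0, 0, 0 };
    unsigned char ones[8] = { 0 };

    n.len = strlen(needle);
    assert(n.len <= 8);               /* the technique covers at most 8 bytes */
    memcpy(&n.str8, needle, n.len);   /* same layout as an 8-byte load from memory */
    memset(ones, 0xFF, n.len);
    memcpy(&n.mask8, ones, sizeof ones);
    return n;
}
```

On a little-endian host, `make_needle8("Hello")` yields the `0x6f6c6c6548` / `0xffffffffff` pair quoted above; the resulting `str8` and `mask8` can then be passed to `mystrstr8` unchanged.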
For short strings, this simplistic brute-force approach might perform better than a tailored Boyer-Moore implementation, depending on your specific data: array and needle lengths, contents, and so on. You can start by comparing its performance with that of your standard library's `strstr` function.
Here is a benchmark for various string lengths:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

char *mystrstr8(const char *s, uint64_t str8, uint64_t mask8) {
    for (const char *p = s; *p; p++) {
        const uint64_t *p64 = (const uint64_t *)(uintptr_t)p;
        if ((*p64 & mask8) == str8)
            return (char *)(uintptr_t)p;
    }
    return NULL;
}

/* special case for needles of exactly 8 bytes: no mask needed */
char *mystrstr8_8(const char *s, uint64_t str8) {
    for (const char *p = s; *p; p++) {
        const uint64_t *p64 = (const uint64_t *)(uintptr_t)p;
        if (*p64 == str8)
            return (char *)(uintptr_t)p;
    }
    return NULL;
}

char *mystrstr8_len(char *s, size_t len, uint64_t str8, uint64_t mask8) {
    char *end = s + len;
    uint64_t *e64 = (uint64_t *)(uintptr_t)end;
    uint64_t ee = *e64;
    *e64 = str8;    /* sentinel: guarantees the loop stops at end */
    for (char *p = s;; p++) {
        uint64_t *p64 = (uint64_t *)(uintptr_t)p;
        if ((*p64 & mask8) == str8) {
            *e64 = ee;
            /* reject matches that overlap the sentinel */
            size_t n = 0;
            for (uint64_t m = mask8; m; m >>= 8)
                n += (m & 0xFF) != 0;
            if ((size_t)(end - p) >= n)
                return p;
            else
                return NULL;
        }
    }
}

char *mystrstr8_len8(char *s, size_t len, uint64_t str8) {
    char *end = s + len;
    uint64_t *e64 = (uint64_t *)(uintptr_t)end;
    uint64_t ee = *e64;
    *e64 = str8;
    for (char *p = s;; p++) {
        uint64_t *p64 = (uint64_t *)(uintptr_t)p;
        if (*p64 == str8) {
            *e64 = ee;
            /* reject matches that overlap the sentinel */
            if ((size_t)(end - p) >= 8)
                return p;
            else
                return NULL;
        }
    }
}

int benchmark(int len, const char *needle, char *a) {
    char buf[9] = { 0 };
    strncat(buf, needle, 8);
    int needle_len = strlen(buf);
    /* mask built by shifting: assumes a little-endian target */
    uint64_t mask8 = needle_len ? 0xFFFFFFFFFFFFFFFF >> (64 - needle_len * 8) : 0;
    uint64_t str8;
    memcpy(&str8, buf, 8);
    memset(a, 'x', len);
    a[len] = '\0';
    /* place the needle at the very end of the haystack */
    int pos = len - needle_len;
    if (pos >= 0 && pos <= len - needle_len)
        memcpy(a + pos, needle, needle_len);
    clock_t c;
    long c1, c2, c3;
    long b1 = 1000000, b2 = 1000000, b3 = 1000000;  /* best timings so far */
    long n1 = 0, n2 = 0, n3 = 0;
    int rep = 100000 / len;
    rep += rep == 0;
    int res = 0;
    void *p1[rep], *p2[rep], *p3[rep];
    while (n1 < 10000) {
        c = clock();
        for (int i = 0; i < rep; i++)
            p1[i] = strstr(a, needle);
        c1 = clock() - c;
        if (needle_len == 8) {
            c = clock();
            for (int i = 0; i < rep; i++)
                p2[i] = mystrstr8_8(a, str8);
            c2 = clock() - c;
            c = clock();
            for (int i = 0; i < rep; i++)
                p3[i] = mystrstr8_len8(a, len, str8);
            c3 = clock() - c;
        } else {
            c = clock();
            for (int i = 0; i < rep; i++)
                p2[i] = mystrstr8(a, str8, mask8);
            c2 = clock() - c;
            c = clock();
            for (int i = 0; i < rep; i++)
                p3[i] = mystrstr8_len(a, len, str8, mask8);
            c3 = clock() - c;
        }
        n1 += c1;
        n2 += c2;
        n3 += c3;
        /* keep the minimum of each timing */
        b1 -= (b1 - c1) * (b1 > c1);
        b2 -= (b2 - c2) * (b2 > c2);
        b3 -= (b3 - c3) * (b3 > c3);
        res = (p1[rep - 1] != p2[rep - 1] || p1[rep - 1] != p3[rep - 1]);
    }
    if (p2[0] != p1[0]) {
        printf("bench(%d, '%s'): mystrstr8 failure: %p, expected %p\n",
               len, needle, p2[0], p1[0]);
    }
    if (p3[0] != p1[0]) {
        printf("bench(%d, '%s'): mystrstr8_len failure: %p, expected %p\n",
               len, needle, p3[0], p1[0]);
    }
    if (res == 0) {
        printf("%-8d %-8s %13.3f %13.3f %13.3f\n", len, needle,
               (double)b1 / rep, (double)b2 / rep, (double)b3 / rep);
    }
    return res;
}

#define MAX_LEN 1000000

int main(int argc, char *argv[]) {
    /* 8 extra bytes: slack for the 8-byte loads and the sentinel */
    char *a = malloc(MAX_LEN + 8);
    if (a == NULL)
        return 1;
    // ensure full output is buffered
    setvbuf(stdout, NULL, _IOFBF, 16384);
    printf("%-8s %-8s %13s %13s %13s\n",
           "len", "needle", "strstr", "mystrstr8", "mystrstr8_len");
    for (int len = 10; len <= MAX_LEN; len *= 10) {
        benchmark(len, "a", a);
        benchmark(len, "ab", a);
        benchmark(len, "abc", a);
        benchmark(len, "abcd", a);
        benchmark(len, "abcde", a);
        benchmark(len, "abcdef", a);
        benchmark(len, "abcdefg", a);
        benchmark(len, "abcdefgh", a);
    }
    free(a);
    return 0;
}
```
Here are the results on my 2015 Mac x86_64 laptop (best time per call, in `clock()` ticks):
```none
len needle strstr mystrstr8 mystrstr8_len
10 a 0.013 0.005 0.008
10 ab 0.013 0.005 0.008
10 abc 0.014 0.005 0.008
10 abcd 0.013 0.004 0.007
10 abcde 0.013 0.004 0.007
10 abcdef 0.013 0.003 0.007
10 abcdefg 0.012 0.003 0.007
10 abcdefgh 0.012 0.002 0.002
100 a 0.076 0.057 0.046
100 ab 0.076 0.056 0.045
100 abc 0.077 0.056 0.045
100 abcd 0.076 0.055 0.044
100 abcde 0.077 0.055 0.044
100 abcdef 0.076 0.054 0.044
100 abcdefg 0.076 0.054 0.043
100 abcdefgh 0.076 0.045 0.040
1000 a 0.610 0.480 0.410
1000 ab 0.610 0.470 0.410
1000 abc 0.610 0.480 0.410
1000 abcd 0.610 0.480 0.410
1000 abcde 0.610 0.470 0.400
1000 abcdef 0.610 0.470 0.410
1000 abcdefg 0.610 0.470 0.400
1000 abcdefgh 0.610 0.400 0.370
10000 a 5.900 4.800 4.100
10000 ab 5.900 4.800 4.100
10000 abc 5.900 4.800 4.100
10000 abcd 5.900 4.800 4.100
10000 abcde 5.900 4.800 4.100
10000 abcdef 5.900 4.800 4.100
10000 abcdefg 5.900 4.800 4.100
10000 abcdefgh 5.900 4.000 3.800
100000 a 59.000 50.000 41.000
100000 ab 59.000 49.000 41.000
100000 abc 59.000 49.000 41.000
100000 abcd 59.000 49.000 41.000
100000 abcde 59.000 49.000 41.000
100000 abcdef 59.000 49.000 41.000
100000 abcdefg 59.000 50.000 41.000
100000 abcdefgh 59.000 40.000 39.000
1000000 a 593.000 493.000 415.000
1000000 ab 589.000 472.000 415.000
1000000 abc 592.000 496.000 413.000
1000000 abcd 590.000 496.000 416.000
1000000 abcde 589.000 495.000 415.000
1000000 abcdef 589.000 495.000 416.000
1000000 abcdefg 589.000 495.000 417.000
1000000 abcdefgh 589.000 406.000 385.000
```

This hack consistently improves performance by 15 to 30% on long strings and even more on shorter ones. I made a special case for 8-byte needles that could be adapted for 1-, 2- and 4-byte needles too.
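As an illustration of that last remark, a 4-byte variant might look like the sketch below. The names `make_str4` and `mystrstr8_4` are hypothetical, and the load goes through `memcpy` rather than a pointer cast, which is a choice made for this sketch and not something the answer does. It assumes the haystack has at least 3 readable bytes after its terminator, which the 64 bytes of slack mentioned in the question would cover.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: packs a needle of exactly 4 bytes in host byte order. */
static uint32_t make_str4(const char *needle) {
    uint32_t k;
    assert(strlen(needle) == 4);
    memcpy(&k, needle, sizeof k);
    return k;
}

/* Hypothetical 4-byte counterpart of mystrstr8_8: no mask is needed
   because the needle fills the whole 32-bit word. */
static char *mystrstr8_4(const char *s, uint32_t str4) {
    for (const char *p = s; *p; p++) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* may read up to 3 bytes past the terminator */
        if (w == str4)
            return (char *)(uintptr_t)p;
    }
    return NULL;
}
```

Since the needle contains no null byte, a match can never extend past the terminator, so no extra bounds check is needed here. Whether this variant is actually faster than the masked 8-byte version would have to be measured.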
Answer 2
Score: 1
If the needle searched starts with a byte value that is rare in the haystack, a simplistic implementation will beat more complicated alternatives:

```c
#include <stdint.h>
#include <string.h>

char *mystrstr_naive(const char *s, const char *needle) {
    char *p = (char *)(uintptr_t)s;
    int c = *needle++;          /* first needle byte; needle now points to the rest */
    if (c == '\0')
        return p;
    if (*needle == '\0')
        return strchr(p, c);
    size_t len = strlen(needle);
    while ((p = strchr(p, c)) != NULL) {
        p++;
        /* memcmp may look past the terminator: relies on slack after the string */
        if (!memcmp(p, needle, len))
            return p - 1;
    }
    return NULL;
}
```

And even faster if the string length is known:

```c
#include <stdint.h>
#include <string.h>

char *mystrstr_naive_len(const char *s, size_t slen, const char *needle) {
    char *p = (char *)(uintptr_t)s;
    int c = *needle++;          /* first needle byte; needle now points to the rest */
    if (c == '\0')
        return p;
    if (*needle == '\0')
        return memchr(p, c, slen);
    size_t len = strlen(needle);
    if (len < slen) {
        char *e = p + slen - len;   /* one past the last valid start position */
        while ((p = memchr(p, c, e - p)) != NULL) {
            p++;
            if (!memcmp(p, needle, len))
                return p - 1;
        }
    }
    return NULL;
}
```
On my system, if the needle searched starts with a byte value that is rare in the haystack, this is 10 to 20 times faster than `strstr` and the alternatives presented in my other answer.

To improve performance on specific data, you must study the data carefully. The needle being known in advance is an interesting clue, but some other characteristics might be more fruitful.
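Since the needle is fixed at startup, the parts of `mystrstr_naive_len` that depend only on the needle could be computed once. The sketch below is one possible way to do that; `fixed_needle`, `fixed_needle_init` and `fixed_needle_find` are hypothetical names, and the sketch assumes a non-empty needle and a known haystack length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical startup-time state: everything that depends only on the needle. */
typedef struct {
    int first;          /* first needle byte, the one handed to memchr */
    const char *tail;   /* remaining needle bytes */
    size_t tail_len;    /* strlen(tail), computed once */
} fixed_needle;

static fixed_needle fixed_needle_init(const char *needle) {
    fixed_needle n;
    assert(needle[0] != '\0');   /* this sketch assumes a non-empty needle */
    n.first = (unsigned char)needle[0];
    n.tail = needle + 1;
    n.tail_len = strlen(n.tail);
    return n;
}

/* Same search as mystrstr_naive_len, minus the per-call strlen. */
static char *fixed_needle_find(const fixed_needle *n, const char *s, size_t slen) {
    if (n->tail_len >= slen)
        return NULL;             /* needle longer than the haystack */
    const char *p = s;
    const char *e = s + slen - n->tail_len;   /* one past the last valid start */
    while ((p = memchr(p, n->first, (size_t)(e - p))) != NULL) {
        p++;
        if (memcmp(p, n->tail, n->tail_len) == 0)
            return (char *)(p - 1);
    }
    return NULL;
}
```

A caller would run `fixed_needle_init` once at startup and then call `fixed_needle_find` for every string in the stream. For needles of at most 8 bytes the saved `strlen` is small, so any gain would need to be confirmed by measurement on real data.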