Invalid uninitialized jump or move memory error while trying to split a char32_t string into tokens manually

Question


I am trying to split a char32_t string into tokens separated by a delimiter. I am not using strtok or any other standard library function because it is guaranteed that the input string and the delimiter will be multibyte Unicode strings.

Here is the function I have written:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

char32_t **sp(char32_t *str, char32_t delim, int *len) {
  *len = 1;
  char32_t *s = str;
  while (*s != U'\0') {
    if (*s == delim) {
      (*len)++;
    }
    s++;
  }
  char32_t **tokens = (char32_t **)malloc((*len) * sizeof(char32_t *));
  if (tokens == NULL) {
    exit(111);
  }

  char32_t * p = str;
  int i = 0;
  while (*p != U'\0') {
    int tok_len = 0;
    while (p[tok_len] != U'\0' && p[tok_len] != delim) {
      tok_len++;
    }
    tokens[i] = (char32_t *)malloc(sizeof(char32_t) * (tok_len + 1));
    if (tokens[i] == NULL) {
      exit(112);
    }
    memcpy(tokens[i], p, tok_len * sizeof(char32_t));
    tokens[i][tok_len] = U'\0';
    p += tok_len + 1;
    i++;
  }
  return tokens;
}

And here is the driver code:

int main() {
  char32_t *str = U"Hello,World,mango,hey,";
  char32_t delim = U',';
  int len = 0;
  char32_t ** tokens = sp(str, delim, &len);
  wprintf(L"len -> %d\n", len);
  for (int i = 0; i < len; i++) {
    if (tokens[i]) {
      wprintf(L"[%d] %ls\n", i, tokens[i]);
    }
    free(tokens[i]);
  }
  free(tokens);
}

Here is the output:

len -> 5
[0] Hello
[1] World
[2] mango
[3] hey
[4] (null)

But when I check the program with valgrind, it shows multiple memory errors:

valgrind -s --leak-check=full --track-origins=yes ./x3
==7703== Memcheck, a memory error detector
==7703== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==7703== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==7703== Command: ./x3
==7703== 
tok -> 5
tok -> 5
tok -> 5
tok -> 3
len -> 5
[0] Hello
[1] World
[2] mango
[3] hey
==7703== Conditional jump or move depends on uninitialised value(s)
==7703==    at 0x48FDAF8: __wprintf_buffer (vfprintf-process-arg.c:396)
==7703==    by 0x48FF421: __vfwprintf_internal (vfprintf-internal.c:1459)
==7703==    by 0x490CFAE: wprintf (wprintf.c:32)
==7703==    by 0x1093C9: main (main.c:51)
==7703==  Uninitialised value was created by a heap allocation
==7703==    at 0x4841888: malloc (vg_replace_malloc.c:393)
==7703==    by 0x1091FC: sp (main.c:17)
==7703==    by 0x109384: main (main.c:47)
==7703== 
[4] (null)
==7703== Conditional jump or move depends on uninitialised value(s)
==7703==    at 0x4844225: free (vg_replace_malloc.c:884)
==7703==    by 0x1093DA: main (main.c:52)
==7703==  Uninitialised value was created by a heap allocation
==7703==    at 0x4841888: malloc (vg_replace_malloc.c:393)
==7703==    by 0x1091FC: sp (main.c:17)
==7703==    by 0x109384: main (main.c:47)
==7703== 
==7703== 
==7703== HEAP SUMMARY:
==7703==     in use at exit: 0 bytes in 0 blocks
==7703==   total heap usage: 7 allocs, 7 frees, 5,248 bytes allocated
==7703== 
==7703== All heap blocks were freed -- no leaks are possible
==7703== 
==7703== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
==7703== 
==7703== 1 errors in context 1 of 2:
==7703== Conditional jump or move depends on uninitialised value(s)
==7703==    at 0x4844225: free (vg_replace_malloc.c:884)
==7703==    by 0x1093DA: main (main.c:52)
==7703==  Uninitialised value was created by a heap allocation
==7703==    at 0x4841888: malloc (vg_replace_malloc.c:393)
==7703==    by 0x1091FC: sp (main.c:17)
==7703==    by 0x109384: main (main.c:47)
==7703== 
==7703== 
==7703== 1 errors in context 2 of 2:
==7703== Conditional jump or move depends on uninitialised value(s)
==7703==    at 0x48FDAF8: __wprintf_buffer (vfprintf-process-arg.c:396)
==7703==    by 0x48FF421: __vfwprintf_internal (vfprintf-internal.c:1459)
==7703==    by 0x490CFAE: wprintf (wprintf.c:32)
==7703==    by 0x1093C9: main (main.c:51)
==7703==  Uninitialised value was created by a heap allocation
==7703==    at 0x4841888: malloc (vg_replace_malloc.c:393)
==7703==    by 0x1091FC: sp (main.c:17)
==7703==    by 0x109384: main (main.c:47)
==7703== 
==7703== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

I am unable to figure out what the problem is. Any help will be appreciated.

I have also tried with Unicode strings, and the same error occurs.

Answer 1

Score: 3

valgrind reports those errors because your program reads uninitialised memory in the last iteration of this for loop in main() (that is, when it accesses tokens[4] while len is 5):

  for (int i = 0; i < len; i++) {
     if (tokens[i]) {
        wprintf(L"[%d] %ls\n" , i , tokens[i]);  
     }
     free(tokens[i]);
  }  

malloc allocates memory and leaves it uninitialised. In sp(), the array of pointers that your program allocates here is therefore uninitialised:

  char32_t **tokens = (char32_t **)malloc((*len) * sizeof(char32_t *));

The while loop in sp() allocates a buffer and copies a token into it for every member of the tokens array except the last one, which is left uninitialised. In main(), your program reads that uninitialised member, and hence valgrind reports the errors.
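
To make the mismatch concrete, here is a minimal sketch that compares how many slots sp() allocates with how many the copy loop writes. The helper names count_slots and count_filled are hypothetical and not part of the original program, and count_filled uses a bounds-safe pointer advance rather than the question's exact statement:

```c
#include <stddef.h>
#include <uchar.h>

/* Number of slots sp() allocates: one more than the number of delimiters. */
int count_slots(const char32_t *str, char32_t delim) {
    int n = 1;
    for (size_t i = 0; str[i] != U'\0'; ++i) {
        if (str[i] == delim) n++;
    }
    return n;
}

/* Number of slots a copy loop shaped like the one in the question writes.
 * The loop stops as soon as it sees the terminator, so no token is
 * produced for the empty text after a trailing delimiter.
 * The delimiter is skipped only when one is present, which keeps this
 * helper in bounds for any input. */
int count_filled(const char32_t *str, char32_t delim) {
    int filled = 0;
    const char32_t *p = str;
    while (*p != U'\0') {
        while (*p != U'\0' && *p != delim) p++;
        if (*p != U'\0') p++;
        filled++;
    }
    return filled;
}
```

For U"Hello,World,mango,hey," the first helper returns 5 and the second returns 4, so tokens[4] is allocated as a slot but never written.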

To fix the problem, in sp(), after allocating memory for tokens:
Either set the last pointer in the tokens array to NULL:

// this is the bare minimum change required to fix the problem
tokens [*len - 1] = NULL;

Or set all the pointers to NULL:

for (int i = 0; i < *len; ++i) {
   tokens[i] = NULL;
}

Or use calloc to allocate the memory for tokens, which zero-fills the array so that every pointer starts out as NULL (true on all mainstream platforms, where a null pointer is all-bits-zero):

char32_t **tokens = calloc((*len), sizeof(char32_t *));
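
As a rough sketch of the calloc option together with a matching clean-up routine, something like the following could be used. The names alloc_token_slots and free_tokens are hypothetical helpers introduced only for illustration:

```c
#include <stdlib.h>
#include <uchar.h>

/* Allocate `len` token slots, all starting out as null pointers.
 * Assumption: calloc's all-bits-zero memory reads back as NULL, which
 * holds on mainstream platforms but is not promised by the C standard. */
char32_t **alloc_token_slots(int len) {
    if (len <= 0) return NULL;
    return calloc((size_t)len, sizeof(char32_t *));
}

/* Release every token and then the array itself.
 * free(NULL) is a no-op, so slots that were never filled are harmless. */
void free_tokens(char32_t **tokens, int len) {
    if (tokens == NULL) return;
    for (int i = 0; i < len; ++i) {
        free(tokens[i]);
    }
    free(tokens);
}
```

Because free(NULL) does nothing, the loop in main() can keep calling free(tokens[i]) for every index without checking which slots were filled.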

With any of the above solutions, the valgrind output is:

# valgrind -s --leak-check=full --track-origins=yes ./a.out 
==9761== Memcheck, a memory error detector
==9761== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9761== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==9761== Command: ./a.out
==9761== 
len -> 5
[0] Hello
[1] World
[2] mango
[3] hey
==9761== 
==9761== HEAP SUMMARY:
==9761==     in use at exit: 0 bytes in 0 blocks
==9761==   total heap usage: 7 allocs, 7 frees, 5,248 bytes allocated
==9761== 
==9761== All heap blocks were freed -- no leaks are possible
==9761== 
==9761== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

There is one more problem in your code: when the last character of the input string is not the delimiter, your program ends up reading the input string beyond its end, which results in undefined behaviour. Look at this statement in the while loop of sp():

p += tok_len + 1;

Assume the input string is U"Hello,World,mango,hey" (note that the last character of the string is not the delimiter ,). For the last token, the nested while loop stops when p[tok_len] equals U'\0', and the following statement, p += tok_len + 1;, then makes p point one element past the terminating null character, beyond the end of the input string. The outer while loop condition then dereferences p, which leads to undefined behaviour.

Replace this statement in the while loop of sp():

    p += tok_len + 1;

with this:

    p += tok_len;
    p += (*p != '\0') ? 1 : 0;

This first makes p point to the character just past the end of the current token in the input string. Only if that character is not the terminating null character is 1 added to p.
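
The same advance can be written as a small standalone function, which makes the two cases easier to see. This is only a sketch, and next_token_start is a hypothetical name that does not appear in the original code:

```c
#include <uchar.h>

/* Return a pointer to where the next token would start.
 * Skips the current token, then steps over one delimiter only if the
 * token was ended by a delimiter rather than by the terminator. */
const char32_t *next_token_start(const char32_t *p, char32_t delim) {
    while (*p != U'\0' && *p != delim) {
        p++;
    }
    if (*p != U'\0') {
        p++;  /* *p is the delimiter here, so stepping over it is safe */
    }
    return p;
}
```

When the token ends at the terminator, the function returns a pointer to the terminator itself, so a caller that tests *p != U'\0' stops without ever reading past the end of the string.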

The while loop body can be implemented in a much better way, and it can also be extended to handle other scenarios, for example spaces between the words in the input string, or an input string that contains only delimiters. I am leaving it up to you to improve the implementation and to take care of those scenarios.


EDIT:

Your requirement is this: if the last character of the input string is the delimiter, then the last member of the tokens array should point to an empty string instead of being NULL.
You don't need to handle this as a special case after the loop, as you have shown in your comment. You can handle it in the body of the loop that processes the input string and extracts the tokens from it, like this:

char32_t **sp(const char32_t *str, const char32_t delim, int *len) {
    /* One token more than the number of delimiters. */
    *len = 1;
    for (int i = 0; str[i] != U'\0'; ++i) {
        if (str[i] == delim) (*len)++;
    }

    char32_t **tokens = malloc((*len) * sizeof(char32_t *));
    if (tokens == NULL) {
        exit(111);
    }

    int start = 0, end = 0, i = 0;
    do {
        /* A token ends at a delimiter or at the terminating null character. */
        if ((str[end] == delim) || (str[end] == U'\0')) {
            tokens[i] = malloc(sizeof(char32_t) * (end - start + 1));
            if (tokens[i] == NULL) {
                exit(112);
            }
            memcpy(tokens[i], &str[start], sizeof(char32_t) * (end - start));
            tokens[i][end - start] = U'\0';
            start = end + 1;
            i++;
        }
    } while (str[end++] != U'\0');

    return tokens;
}

A few test cases:

Input string:

char32_t *str = U"Hello,World,mango,hey,";

Output:

# ./a.out 
len -> 5
[0] Hello
[1] World
[2] mango
[3] hey
[4] 

Input string:

char32_t *str = U"Hello,World,mango,hey";

Output:

# ./a.out 
len -> 4
[0] Hello
[1] World
[2] mango
[3] hey

Input string:

char32_t *str = U",,, , u";

Output:

# ./a.out 
len -> 5
[0] 
[1] 
[2] 
[3]  
[4]  u

Input string:

char32_t *str = U" ";

Output:

# ./a.out 
len -> 1
[0]  
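
The cases above can also be checked automatically instead of by reading the printed output. The following is a sketch under a few assumptions: sp() is restated from the answer so that the block compiles on its own, and u32_equal and split_matches are hypothetical helpers introduced only for this check:

```c
#include <stdlib.h>
#include <string.h>
#include <uchar.h>

/* sp() as given in the answer, restated so this block compiles alone. */
char32_t **sp(const char32_t *str, const char32_t delim, int *len) {
    *len = 1;
    for (int i = 0; str[i] != U'\0'; ++i) {
        if (str[i] == delim) (*len)++;
    }
    char32_t **tokens = malloc((*len) * sizeof(char32_t *));
    if (tokens == NULL) exit(111);

    int start = 0, end = 0, i = 0;
    do {
        if (str[end] == delim || str[end] == U'\0') {
            tokens[i] = malloc(sizeof(char32_t) * (end - start + 1));
            if (tokens[i] == NULL) exit(112);
            memcpy(tokens[i], &str[start], sizeof(char32_t) * (end - start));
            tokens[i][end - start] = U'\0';
            start = end + 1;
            i++;
        }
    } while (str[end++] != U'\0');
    return tokens;
}

/* Hypothetical helper: compare two null-terminated char32_t strings. */
int u32_equal(const char32_t *a, const char32_t *b) {
    while (*a != U'\0' && *a == *b) { a++; b++; }
    return *a == *b;
}

/* Hypothetical helper: split `str`, compare with `expected`, free everything.
 * Returns 1 when both the count and every token match, otherwise 0. */
int split_matches(const char32_t *str, char32_t delim,
                  const char32_t *const *expected, int expected_len) {
    int len = 0;
    char32_t **tokens = sp(str, delim, &len);
    int ok = (len == expected_len);
    for (int i = 0; i < len; ++i) {
        if (ok && !u32_equal(tokens[i], expected[i])) ok = 0;
        free(tokens[i]);
    }
    free(tokens);
    return ok;
}
```

Comparing code units directly also avoids relying on wprintf with %ls, which only prints char32_t data correctly where wchar_t happens to be a 32-bit type.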

huangapple
  • Posted on April 11, 2023, 13:39:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75982681.html