C: How do I loop over multiple lines, tokenize them, and return a single array containing all the tokens?

mbzjlibv  posted on 2023-10-16  in Other

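I previously posted a (poorly formatted and lackluster) question asking how to pass an array as an input argument and return the modified array. After some tinkering, I found that the function works fine for single-line input, but it starts misbehaving when the file contains multiple lines separated by newlines.
I would appreciate any suggestions on how to improve my code (and my post). Thanks.
The code so far: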

#include <stdio.h>
#include <string.h>

#define MAX_LINE_LEN 1000

const char delimiter[] = " \t\r\n\v\f";

void tokenize(char *string, char *ret[MAX_LINE_LEN]) {
    char *ptr;
    ptr = strtok(string, delimiter);
    int i = 0;
    while (ptr != NULL) {
        ret[i] = ptr;
        i++;
        ptr = strtok(NULL, delimiter);
    }
}

int main(void) {
    char line[MAX_LINE_LEN];
    static char *temparr[MAX_LINE_LEN] = {0};

    while (fgets(line, sizeof(line), stdin)) {
        tokenize(line, temparr);
    }

    int i = 0;
    while (temparr[i]) {
        printf("%s\n", temparr[i]);
        i++;
    }
}

Input (all on a single line):

There was nothing else to do. The deed had already been done and there was no going back. It now had been become a question of how they were going to be able to get out of this situation and escape.

The output seems to be correct:

There
was
nothing
else
to
do.
The
deed
had
already
been
done
and
there
was
no
going
back.
It
now
had
been
become
a
question
of
how
they
were
going
to
be
able
to
get
out
of
this
situation
and
escape.

But when each sentence is on its own line, separated by newlines:

There was nothing else to do. 
The deed had already been done and there was no going back. 
It now had been become a question of how they were going to be able to get out of this situation and escape.

it only returns the tokenized array for the last line:

It
now
had
been
become
a
question
of
how
they
were
going
to
be
able
to
get
out
of
this
situation
and
escape.

And when the last line is shorter:

There was nothing else to do. 
The deed had already been done and there was no going back. 
It now had been become a question of how they were going to be able to 
get out of this situation and escape.

the returned array is:

get
out
of
this
situation
and
escape.
pe.

they
were
going
to
be
able
to

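I assume I am doing something wrong when looping over fgets(), but I am not sure why, or how to keep getting the first output. I tried including "\n" as one of the delimiters, but it does not seem to have any effect. I have also been told that strtok() is unsafe (not thread-safe, modifies the original string, ...). I do not fully understand what that means, but are there alternatives?
(Test paragraph taken from https://randomwordgenerator.com/paragraph.php)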

vfhzx4xs  #1

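There are multiple problems:

  • The array of token pointers is defined with length MAX_LINE_LEN, which is a bit confusing. Use a more descriptive name such as MAX_TOKEN_COUNT. Similarly, if MAX_LINE_LEN is the maximum line length, the input buffer should have at least 2 extra bytes for the newline and the null terminator.
  • You must allocate memory for each token. Otherwise all the token pointers point to locations inside the input buffer, which every fgets() call overwrites, and that causes the incorrect output you observed.
  • You should pass the initial index at which to append tokens, along with the array length, to avoid writing beyond the array's bounds.

Here is a modified version: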

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_LEN 1000
#define MAX_TOKEN_COUNT 1000

const char delimiter[] = " \t\r\n\v\f";

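// Append the tokens found in str to ret[], starting at index i.
// n is the capacity of ret[]; the new token count is returned.
size_t tokenize(char *str, char *ret[], size_t i, size_t n) {
    char *ptr;
    while (i < n && (ptr = strtok(str, delimiter)) != NULL) {
        char *copy = strdup(ptr);  // private copy that survives the next fgets()
        if (copy == NULL)
            break;                 // out of memory: keep the tokens stored so far
        ret[i] = copy;
        i++;
        str = NULL;                // later strtok() calls continue in the same string
    }
    return i;
}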

int main(void) {
    char line[MAX_LINE_LEN + 2];
    char *temparr[MAX_TOKEN_COUNT];
    size_t n = 0;

    while (fgets(line, sizeof(line), stdin)) {
        n = tokenize(line, temparr, n, MAX_TOKEN_COUNT);
    }

    for (size_t i = 0; i < n; i++) {
        printf("%s\n", temparr[i]);
    }
    for (size_t i = 0; i < n; i++) {
        free(temparr[i]);
    }
    return 0;
}

5uzkadbs  #2

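You cannot use the same buffer for multiple lines of input. The buffer is overwritten by each new input line, which invalidates the pointers assigned earlier. You need to either define a fixed (and therefore limited) set of buffers, or use heap storage.
New to C? Then perhaps the concept of "dynamic memory" is new to you as well.
The goal seems to be to buffer the input from stdin until it runs out. You can simply accumulate all of the input and tokenise it only at the end: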

#include <stdio.h>
#include <stdlib.h> // realloc, free, exit, EXIT_FAILURE
#include <string.h>

#define MAX_LINE_LEN 1000

int main(void) {
    char line[MAX_LINE_LEN]; // a single input buffer
    char *lines = NULL;
    size_t current = 0;

    while( fgets( line, sizeof line, stdin ) ) {
        size_t len = strlen( line );
        char *p = realloc( lines, current + len + 1 ); // growing storage of text input
        if( p == NULL ) {
            fprintf( stderr, "realloc failed\n" );
            exit( EXIT_FAILURE );
        }
        lines = p; // update to point to grown block
        memcpy( lines + current, line, len ); // preserve this line of input
        current += len; // adjust offset to keep track of end
    }
    if( lines == NULL ) // Nothing entered?
        return 0;

    lines[ current ] = '\0'; // terminate making entire block a single C string

    // Below is a "tokenising" mill growing an array of pointers as needed.
    const char delimiter[] = " \t\r\n\v\f";
    char **pArr = NULL; // growing array of token pointers
    size_t i = 0;
    for( char *cp = lines; ( cp = strtok( cp, delimiter ) ) != NULL; cp = NULL ) {
        char **p = realloc( pArr, (i + 2) * sizeof *p ); // notice "+ 2" for NULL also
        if( p == NULL ) {
            fprintf( stderr, "realloc failed\n" );
            exit( EXIT_FAILURE );
        }
        pArr = p;
        pArr[ i++ ] = cp; // store this token
    }
    if( pArr == NULL ) { // Mischievous user entered ONLY whitespace!!
        free( lines );
        return 0;
    }

    pArr[ i++ ] = NULL; // terminate array of pointers with a NULL.

    for( i = 0; pArr[i]; i++ ) // Loop outputting tokens until NULL (end)
        puts( pArr[i] );

    free( pArr ); // release array of ptrs
    free( lines ); // release buffer of text

    return 0;
}

This should be self-explanatory. If anything is unclear, please ask in the comments below.
"What are the alternatives?"
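strtok() is fine for a simple application like this one. Its younger sibling strtok_r() does not keep the tokenising state internally; the caller supplies it instead. strtok_r() should be used in multithreaded applications, or when you want to "sub-tokenise" a string. For example, you could tokenise on '\n' to extract one sentence at a time from the buffer, then tokenise each word in that sentence using the other whitespace characters.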
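Another option is to develop your own version of strtok(), perhaps using strspn() and strcspn() to locate the tokens quickly. Doing so lets you decide whether or not to clobber the delimiters the way strtok() does.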
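Note: strtok() treats consecutive delimiters as a single instance. It never returns a pointer to an empty string. With "," as the delimiter, "a,,b,,,c" is treated the same as "a,b,c".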
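Homework: refactor this program so that "input", "tokenising", and "printing" each live in their own function, with appropriate parameters and return values.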
