
How to easily process texts from a CSV file in Tensorflow?

Question


I have a small dataset that I'm trying to process so that I can later train a model on it. It is a CSV file with two columns, Category and Message: a simple dataset of messages that may or may not be spam. I'd like to transform it so that both the categories and the messages become numbers, but I don't quite understand how to do that.

Example data from file:

ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"

And then I load it like this:

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="directory_to_file",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)

For example, I tried something like this:

def standarize_dataset(dataset):
        lowercase = tf.strings.lower(dataset)
        return tf.strings.regex_replace(lowercase, &#39;[$s]&#39; % re.escape(string.punctuation), &#39;&#39;)

vectorization = layers.TextVectorization(
            standardize=standarize_dataset,
            max_tokens=1000,
            output_mode=&#39;int&#39;,
            output_sequence_length=200,
        )
dataset_unbatched = dataset.unbatch().map(lambda x, y: x)
        
vectorization.adapt(dataset_unbatched)

But then I get an error:

TypeError: Expected string, but got Tensor("IteratorGetNext:0", shape=(None,), dtype=string) of type 'Tensor'.

Looping over this dataset shows that a Message entry looks like:

OrderedDict([('Message', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carry on not disturbing both of you'], dtype=object)>)])

and Category:

[b'ham']

I could probably just write a loop that extracts only the message from each OrderedDict, but I feel like there is a better way to read and then process this data. Hence the question: how to easily process texts from a CSV file in TensorFlow?

Answer 1

Score: 1


By modifying the .unbatch().map() operation, I got the code running.
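The key change, shown in isolation:

# before: x is the whole OrderedDict of feature columns
dataset.unbatch().map(lambda x, y: x)
# after: select the 'Message' tensor explicitly
dataset.unbatch().map(lambda x, y: x['Message'])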

Please note that your standarize_dataset() function still did not work after my modification: it raised TypeError: not all arguments converted during string formatting, because '[$s]' contains no %s placeholder for the %-formatting to fill. However, your function can be substituted by specifying standardize="lower_and_strip_punctuation" in layers.TextVectorization().
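If you do want to keep a custom standardizer instead, a corrected sketch would be the following (besides renaming the misleading dataset parameter, the only fix is '[%s]' in place of '[$s]'):

def standarize_dataset(text):
    lowercase = tf.strings.lower(text)
    # '[%s]' gives the %-formatting a placeholder to insert the escaped punctuation set
    return tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')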

Full code below:

import re
import string

import tensorflow as tf
import tensorflow.keras.layers as layers

file_as_str = """
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"
"""

with open("example.txt", "w") as f:
    f.write(file_as_str)

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="example.txt",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)

# kept from the question for reference; unused (and '[$s]' would need to be '[%s]')
def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[$s]' % re.escape(string.punctuation), '')

vectorization = layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200,
)

# extract only the 'Message' tensor from each feature dict
dataset_unbatched = dataset.unbatch().map(lambda x, y: x['Message'])

vectorization.adapt(dataset_unbatched)

vectorized_text = vectorization(next(iter(dataset_unbatched)))
print(vectorized_text)

# prints:
# tf.Tensor(
# [46  8  5 68 64 10 54  2 11  7 52 47 17 66 33 67 22  7  2 65  2 24  8 26
#  16 25 61 69  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0], shape=(200,), dtype=int64)
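The question also asks for the categories to become numbers, and the code above only vectorizes the messages. One way to encode the labels as well is a StringLookup layer; a minimal sketch (the hand-listed vocabulary and the encode helper are illustrative assumptions, not part of the original answer):

# map 'ham' -> 0 and 'spam' -> 1; vocabulary listed by hand for this dataset
label_lookup = layers.StringLookup(vocabulary=['ham', 'spam'], num_oov_indices=0)

def encode(features, label):
    # vectorize the message text and integer-encode the label in one step
    return vectorization(features['Message']), label_lookup(label)

encoded_dataset = dataset.map(encode)

Each batch of encoded_dataset then yields an (integer sequence, integer label) pair that can be passed straight to model.fit().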
