How to easily process texts from a CSV file in Tensorflow?


Question


I have a small dataset that I'm trying to process so that I can later train a model with it. It is a CSV file with two columns, Category and Message: a simple collection of messages that may or may not be spam. I'd like to transform this dataset so that both the categories and the messages become numbers, but I don't quite understand how to do that.

Example data from file:

ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"

And then I load it like this:

import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="directory_to_file",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)
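
(For reference, a minimal sketch of how to inspect what this dataset yields: since label_name is set, each element is a pair of an OrderedDict of feature tensors keyed by column name and a label tensor.)

for features, label in dataset.take(1):
    print(features['Message'])  # string tensor of shape (batch_size,)
    print(label)                # the 'Category' column, still strings at this point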

For example, I tried something like this:

import re
import string

import tensorflow.keras.layers as layers

def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[$s]' % re.escape(string.punctuation), '')

vectorization = layers.TextVectorization(
    standardize=standarize_dataset,
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200,
)

dataset_unbatched = dataset.unbatch().map(lambda x, y: x)

vectorization.adapt(dataset_unbatched)

But then I get an error:

TypeError: Expected string, but got Tensor("IteratorGetNext:0", shape=(None,), dtype=string) of type 'Tensor'.

Looping over this dataset shows that Message is, for example:

OrderedDict([('Message', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carry on not disturbing both of you'], dtype=object)>)])

and Category:

[b'ham']

I can probably just write a loop that extracts only the message from each OrderedDict, but I feel like there is a better way to read this data and then process it. Hence the question: how do I easily process texts from a CSV file in TensorFlow?

Answer 1

Score: 1


By modifying the .unbatch().map() operation, I got the code running.

Please note that your standarize_dataset() function did not work after my modification and returned TypeError: not all arguments converted during string formatting. However, it can be replaced by specifying standardize="lower_and_strip_punctuation" in layers.TextVectorization().
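
(The root cause of that TypeError is the '[$s]' pattern, which contains no %s placeholder for the % operator to fill. If a custom callable is preferred over the built-in option, a minimal sketch of a corrected version, assuming the intent was to lowercase and strip punctuation:)

def standarize_dataset(text):
    lowercase = tf.strings.lower(text)
    # '[%s]' (not '[$s]') so re.escape(string.punctuation) is substituted into the pattern
    return tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')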

Full code below:

import re
import string

import tensorflow as tf
import tensorflow.keras.layers as layers

# Write the sample data to disk so make_csv_dataset can read it
file_as_str = """
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"
"""

with open("example.txt", "w") as f:
    f.write(file_as_str)

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="example.txt",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)

# Kept for reference; not used below — see the note above about the string-formatting bug
def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[$s]' % re.escape(string.punctuation), '')

vectorization = layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200,
)

# Extract only the 'Message' tensor from each feature dict before adapting
dataset_unbatched = dataset.unbatch().map(lambda x, y: x['Message'])

vectorization.adapt(dataset_unbatched)

vectorized_text = vectorization(next(iter(dataset_unbatched)))
print(vectorized_text)

# prints:
# tf.Tensor(
# [46  8  5 68 64 10 54  2 11  7 52 47 17 66 33 67 22  7  2 65  2 24  8 26
#  16 25 61 69  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
#   0  0  0  0  0  0  0  0], shape=(200,), dtype=int64)
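
The question also asked for the categories to become numbers. A minimal sketch of one way to do that, assuming the two-label ham/spam vocabulary above, is a StringLookup layer applied to the labels:

# Hypothetical extension: encode labels as integers (ham -> 0, spam -> 1)
label_lookup = layers.StringLookup(vocabulary=['ham', 'spam'], num_oov_indices=0)

numeric_dataset = dataset.unbatch().map(
    lambda features, label: (vectorization(features['Message']), label_lookup(label))
)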
