How to easily process texts from a CSV file in Tensorflow?
Question
I have a small dataset that I'm trying to process so that I can later train a model with it. It is a CSV file with two columns, Category and Message: a simple dataset of messages that may or may not be spam. I'd like to transform this dataset so that both the categories and the messages become numbers, but I don't quite understand how to do that.
Example data from file:
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"
And then I load it like this:
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="directory_to_file",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)
For example, I tried something like this:
def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[$s]' % re.escape(string.punctuation), '')
vectorization = layers.TextVectorization(
    standardize=standarize_dataset,
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200,
)
dataset_unbatched = dataset.unbatch().map(lambda x, y: x)
vectorization.adapt(dataset_unbatched)
But then I get an error:
TypeError: Expected string, but got Tensor("IteratorGetNext:0", shape=(None,), dtype=string) of type 'Tensor'.
Looping over this dataset shows that Message is e.g.
OrderedDict([('Message', <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Carry on not disturbing both of you' ], dtype=object)>)])
and Category:
[b'ham']
I can probably just write a loop that extracts only the message from each OrderedDict, but I feel like there is a better way to read this data and then process it; hence the question: how to easily process texts from a CSV file in TensorFlow?
Answer 1
Score: 1
By modifying the .unbatch().map() operation, I got the code running. Please note that your standarize_dataset() function did not work after my modification and returned TypeError: not all arguments converted during string formatting. However, your function can be substituted by specifying standardize="lower_and_strip_punctuation" in layers.TextVectorization().
Full code below:
import re
import string
import tensorflow as tf
import tensorflow.keras.layers as layers

file_as_str = """
ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
ham,Ok lar... Joking wif u oni...
spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham,U dun say so early hor... U c already then say...
ham,"Nah I don't think he goes to usf, he lives around here though"
"""

# Write the example data to disk so make_csv_dataset can read it.
with open("example.txt", "w") as f:
    f.write(file_as_str)

dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="example.txt",
    batch_size=32,
    column_names=['Category', 'Message'],
    column_defaults=[tf.string, tf.string],
    label_name='Category',
    field_delim=',',
    header=True,
    num_epochs=1,
)

# Unused: fails with the string-formatting TypeError mentioned above.
def standarize_dataset(dataset):
    lowercase = tf.strings.lower(dataset)
    return tf.strings.regex_replace(lowercase, '[$s]' % re.escape(string.punctuation), '')

vectorization = layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    max_tokens=1000,
    output_mode='int',
    output_sequence_length=200
)

# Key change: pull only the 'Message' tensor out of the feature dict
# before adapting, instead of passing the OrderedDict itself.
dataset_unbatched = dataset.unbatch().map(lambda x, y: x['Message'])
vectorization.adapt(dataset_unbatched)

vectorized_text = vectorization(next(iter(dataset_unbatched)))
print(vectorized_text)
# prints:
# tf.Tensor(
# [46 8 5 68 64 10 54 2 11 7 52 47 17 66 33 67 22 7 2 65 2 24 8 26
# 16 25 61 69 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0], shape=(200,), dtype=int64)
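This vectorizes the messages; the question also asked for the categories to become numbers. One way to do that (a sketch of my own, not part of the answer above) is a tf.keras.layers.StringLookup layer adapted to the label column. The literal label list below is a stand-in for dataset.unbatch().map(lambda x, y: y):

```python
import tensorflow as tf

# Stand-in for the real label column,
# i.e. dataset.unbatch().map(lambda x, y: y)
labels = tf.data.Dataset.from_tensor_slices(
    [b"ham", b"ham", b"spam", b"ham", b"ham"]
)

# num_oov_indices=0: every label seen during adapt() gets an index
# starting at 0, ordered by frequency (most frequent first).
label_lookup = tf.keras.layers.StringLookup(num_oov_indices=0)
label_lookup.adapt(labels)

# ham (most frequent) maps to 0, spam to 1
print(label_lookup(tf.constant([b"ham", b"spam"])).numpy())
```

The integer labels can then be zipped back together with the vectorized messages to build a training-ready dataset.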