Tracing source of error in difflib due to very different string comparison
Question
I am processing a large amount of text data (11 million rows) and get the error below. Is there a way to trace the row of text that is causing it?
My code:
from difflib import ndiff  # find differences between strings
import pandas as pd
from tqdm import tqdm  # add a timer to pandas apply()
tqdm.pandas()  # start timer
# read in all keystrokes
dat = pd.read_csv("all_ks_dat_good.csv", delimiter="|",
                  encoding="ISO-8859-1")
# use the ndiff function to find additions to strings, i.e. c[0]=='+'
def diff(x):
    s1 = str(x['last_text'])
    s2 = str(x['scrubbed_text'])
    l = [c[-1] for c in ndiff(s1, s2) if c[0] == '+']
    return ''.join(l)
# add a column for the additional keystrokes,
# using tqdm's progress_apply() instead of apply()
dat['add_ks'] = dat.progress_apply(diff, axis=1)
dat.to_csv('all_ks_word_dat.csv', sep="|", encoding="utf-8")
The abridged error:
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 997, in _fancy_helper
    yield from g
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 985, in _fancy_replace
    yield from self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi)
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 997, in _fancy_helper
    yield from g
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 915, in _fancy_replace
    cruncher = SequenceMatcher(self.charjunk)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 182, in __init__
    self.set_seqs(a, b)
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 194, in set_seqs
    self.set_seq2(b)
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 248, in set_seq2
    self.__chain_b()
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 288, in __chain_b
    for elt in b2j.keys():
Answer 1
Score: 1
For debugging purposes, you could iterate over the DataFrame with pandas iterrows() and print each row that raises an error, like this:
for _, row in dat.iterrows():
    try:
        diff(row)
    except Exception:
        # this row made diff() fail, so show it
        print(row)
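If printing whole rows gets noisy on 11 million rows, a variation is to collect only the label of each failing row together with the exception type. The following is a minimal sketch, not a tested solution for the original data: find_failing_rows is a hypothetical helper name, and the sample rows are made-up plain dicts standing in for DataFrame rows so that the snippet runs without the CSV file.

```python
from difflib import ndiff


def diff(row):
    """Return the characters added going from last_text to scrubbed_text."""
    s1 = str(row["last_text"])
    s2 = str(row["scrubbed_text"])
    return "".join(c[-1] for c in ndiff(s1, s2) if c[0] == "+")


def find_failing_rows(labelled_rows, func):
    """Hypothetical helper: return a list of (label, exception type name).

    labelled_rows is any iterable of (label, row) pairs, which is the
    shape that dat.iterrows() yields.
    """
    failures = []
    for label, row in labelled_rows:
        try:
            func(row)
        except Exception as exc:
            # Record where it failed and what was raised, then keep going.
            failures.append((label, type(exc).__name__))
    return failures


# Made-up stand-in data (assumption): dicts with the question's column names.
sample = [
    {"last_text": "abc", "scrubbed_text": "abcd"},
    {"last_text": "hello", "scrubbed_text": "hello!"},
]
print(find_failing_rows(enumerate(sample), diff))
```

With the real data the call would presumably be find_failing_rows(dat.iterrows(), diff), and each returned label could then be passed to dat.loc[...] to inspect the offending text. Note that iterrows() is slow on a frame this size, so it may be worth running it on slices of the data first.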