Tracing source of error in difflib due to very different string comparison


Question

I am processing a large amount of text data (11 million rows) and get the error below. Is there a way to trace the row of text that is causing this error?

My code:

from difflib import ndiff  # find differences between strings
import pandas as pd
from tqdm import tqdm  # progress bars, including for pandas apply()
tqdm.pandas()  # register progress_apply() on pandas objects

# read in all keystrokes
dat = pd.read_csv("all_ks_dat_good.csv", delimiter="|",
                  encoding="ISO-8859-1")

# use ndiff to find the characters added to a string, i.e. c[0] == '+'
def diff(x):
    s1 = str(x['last_text'])
    s2 = str(x['scrubbed_text'])
    added = [c[-1] for c in ndiff(s1, s2) if c[0] == '+']
    return ''.join(added)

# add a column for the additional keystrokes,
# using tqdm's progress_apply() instead of apply()
dat['add_ks'] = dat.progress_apply(diff, axis=1)

dat.to_csv('all_ks_word_dat.csv', sep="|", encoding="utf-8")

The abridged error:

  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 997, in _fancy_helper
    yield from g
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 985, in _fancy_replace
    yield from self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi)
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 997, in _fancy_helper
    yield from g
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 915, in _fancy_replace
    cruncher = SequenceMatcher(self.charjunk)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 182, in __init__
    self.set_seqs(a, b)
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 194, in set_seqs
    self.set_seq2(b)
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 248, in set_seq2
    self.__chain_b()
  File "/home/goodkindan/.conda/envs/ks0/lib/python3.11/difflib.py", line 288, in __chain_b
    for elt in b2j.keys():

Answer 1

Score: 1

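For debugging, you can iterate over the DataFrame with pandas' iterrows() and print any row for which diff() raises an exception, along with its index label:

for idx, row in dat.iterrows():
    try:
        diff(row)
    except Exception as exc:
        # Show the index label and the exception, then the offending row
        print(f"Row {idx} failed: {exc!r}")
        print(row)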
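
With 11 million rows, a separate iterrows() pass may be slow, so it can help to do the tracing inside the apply() call itself. The following is a minimal sketch of that idea, not part of the original answer: traced() is a hypothetical helper that wraps a row function, records the label and exception of each failing row in a list, and returns a default value so the run continues. It assumes the failure is raised as an ordinary Exception (RecursionError and MemoryError both are); the abridged traceback does not show which exception was raised.

```python
def traced(func, failures, default=None):
    """Wrap func so a failing row is recorded instead of aborting the run.

    Hypothetical helper: `failures` is a plain list that receives
    (row label, exception) pairs; `default` is returned for failed rows.
    """
    def wrapper(row):
        try:
            return func(row)
        except Exception as exc:  # includes RecursionError and MemoryError
            # With apply(axis=1), pandas passes each row as a Series whose
            # `.name` attribute is the row's index label.
            failures.append((getattr(row, "name", None), exc))
            return default
    return wrapper
```

Under these assumptions, usage would look like failures = [] followed by dat['add_ks'] = dat.progress_apply(traced(diff, failures), axis=1). Afterwards, failures holds the index labels of the problem rows, which can be inspected with dat.loc[label].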

huangapple
  • Posted on June 5, 2023, 02:34:19
  • Please keep this link when reposting: https://go.coder-hub.com/76401901.html