Pandas slows down as DataFrame size increases
Question
I want to remove all URLs from a column. The column contains strings.
My DataFrame has two columns: `str_val[str]` and `str_length[int]`.
I am using the following code:
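import time

t1 = time.time()
reg_exp_val = r"((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)"
# regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default
df_mdr_pd['str_val1'] = df_mdr_pd.str_val.str.replace(reg_exp_val, r'', regex=True)
print(time.time() - t1)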
When I run the code on `10000` rows, it finishes in `0.6` seconds. On `100000` rows the execution just gets stuck. I tried processing the data in chunks with `.loc[i, i+10000]` inside a `for` loop, but that did not help either.
Answer 1
Score: 0
The problem was caused by the regular expression I was using. The one that worked for me was:
r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
It was taken from this link.
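For anyone who wants to try the working pattern in isolation, here is a minimal sketch that uses only the standard `re` module on a couple of made-up strings. The names `URL_PATTERN`, `strip_urls` and `samples` are my own for illustration and are not from the original post. A likely reason this pattern behaves better is that the inner groups match a single character (`[^\s()<>]`) rather than a run (`[^\s()<>]+`) inside a repeated group, which avoids the nested quantifiers that can backtrack heavily on inputs such as an opening parenthesis with no matching close. I have not profiled this on the asker's data, so treat it as an explanation under that assumption.

```python
import re

# Pattern copied unchanged from the answer above and compiled once for reuse.
URL_PATTERN = re.compile(
    r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
)


def strip_urls(text: str) -> str:
    """Return `text` with every match of URL_PATTERN removed (hypothetical helper)."""
    return URL_PATTERN.sub("", text)


# Made-up sample strings, not data from the question.
samples = ["see https://example.com/page for details", "no link here"]
cleaned = [strip_urls(s) for s in samples]
print(cleaned)
```

With pandas, the same compiled pattern can be passed straight to the column, for example `df_mdr_pd.str_val.str.replace(URL_PATTERN, "", regex=True)`. Note that removing a URL leaves the surrounding whitespace in place, so a follow-up whitespace cleanup may be wanted.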