Pandas slowness as DataFrame size increases

Question

I want to remove all URLs from a column. The column contains strings.
My DataFrame has two columns: `str_val` [str] and `str_length` [int].
I am using the following code:

import time

t1 = time.time()
reg_exp_val = r"((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)"
# regex=True treats the pattern as a regular expression (no longer the default in pandas >= 2.0)
df_mdr_pd['str_val1'] = df_mdr_pd.str_val.str.replace(reg_exp_val, r'', regex=True)
print(time.time() - t1)

When I run the code on 10,000 rows, it finishes in 0.6 seconds. On 100,000 rows the execution just hangs. I tried processing the rows in chunks of 10,000 with `.loc` in a `for` loop, but that did not help either.

Answer 1

Score: 0

The problem was caused by the regular expression I was using. The one that worked for me was:

r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

It was taken from this link.
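One plausible reading of why the regex mattered: in the question's pattern, the group inside the parentheses repeats `[^\s()<>]+` under a `*`, a quantifier nested in a quantifier, which can backtrack exponentially when a URL is followed by an unmatched `(`. The answer's pattern consumes one character per repetition instead. The sketch below uses only the standard `re` module to compare the two. The sample text, the names `SLOW`, `FAST`, and `time_sub`, and the chosen lengths are illustrative assumptions, not part of the original post.

```python
import re
import time

# Pattern from the question: `[^\s()<>]+` is repeated inside a `*` group.
SLOW = re.compile(
    r"((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
    r"(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)"
)

# Pattern from the answer: the repeated groups match a single character each.
FAST = re.compile(
    r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
    r"(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+"
    r"(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
)


def time_sub(pattern, text):
    """Remove every match of `pattern` from `text`; return (result, seconds)."""
    start = time.perf_counter()
    result = pattern.sub("", text)
    return result, time.perf_counter() - start


# Hypothetical worst case: a URL followed by "(" that is never closed.
for n in (10, 14, 18):
    text = "see http://example.com/(" + "a" * n
    _, slow_seconds = time_sub(SLOW, text)
    _, fast_seconds = time_sub(FAST, text)
    print(f"n={n}: question pattern {slow_seconds:.4f}s, "
          f"answer pattern {fast_seconds:.4f}s")
```

On inputs like this, the question's pattern should slow down sharply with every few extra characters after the unmatched `(`, while the answer's pattern stays roughly flat. A single such row in a large column would be enough to make the whole `str.replace` call appear stuck. A compiled pattern such as `FAST` can also be passed directly as `df_mdr_pd.str_val.str.replace(FAST, "", regex=True)`.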

Posted by huangapple on February 6, 2023 at 19:56:55
Source: https://go.coder-hub.com/75361001.html