Question

I want to remove all URLs from a column. The column holds strings.
My DataFrame has two columns: `str_val[str]` and `str_length[int]`.
I am using the following code:

import time

t1 = time.time()
reg_exp_val = r"((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)"
# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal by default
df_mdr_pd['str_val1'] = df_mdr_pd.str_val.str.replace(reg_exp_val, r'', regex=True)
print(time.time() - t1)

When I run the code on 10,000 rows, it finishes in 0.6 seconds. With 100,000 rows, execution just gets stuck. I also tried processing the data in chunks of 10,000 rows using `.loc[i, i+10000]` inside a `for` loop, but that did not help either.
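One plausible explanation for the hang, offered here as an assumption rather than something the question confirms, is catastrophic backtracking: the pattern nests a `+` inside a `*` in `\(([^\s()<>]+|...)*\)`, so a URL followed by an opening parenthesis that is never closed can force the engine to try exponentially many splits. The sketch below times the question's pattern on a hypothetical input of that shape using only the standard `re` module; the helper name `timed_sub` and the sample strings are made up for illustration.

```python
import re
import time

# Pattern from the question, with HTML entities decoded.
QUESTION_PATTERN = re.compile(
    r"((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
    r"(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)"
)


def timed_sub(text):
    """Remove URLs from one string and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = QUESTION_PATTERN.sub("", text)
    return result, time.perf_counter() - start


# Hypothetical worst case: a URL followed by "(" that is never closed.
# Sizes are kept small on purpose; the time should grow quickly with n.
for n in (10, 14, 18):
    text = "http://example.com/(" + "a" * n
    _, seconds = timed_sub(text)
    print(n, f"{seconds:.4f}s")
```

If the timings grow by a roughly constant factor each time `n` increases by a fixed amount, a handful of such rows among 100,000 would be enough to make the whole `str.replace` call appear stuck, regardless of how the frame is chunked.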

Answer 1

Score: 0

The problem was caused by the regular expression I was using. The one that worked for me was:

r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

It was taken from this link.
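As a minimal sketch of how the working pattern might be applied, the example below compiles it once and wraps it in a helper. The name `strip_urls` and the sample strings are hypothetical, and only the standard `re` module is used so the snippet runs without pandas. Note that, compared with the pattern in the question, the inner character classes no longer carry a `+`, which is presumably what avoids the heavy backtracking.

```python
import re

# Pattern quoted in the answer, with HTML entities decoded.
# It is split across adjacent raw strings only for readability.
URL_PATTERN = re.compile(
    r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"
    r"(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+"
    r"(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
)


def strip_urls(text):
    """Return text with every match of URL_PATTERN removed."""
    return URL_PATTERN.sub("", text)


print(strip_urls("see http://example.com/page now"))
print(strip_urls("Visit www.example.com."))
```

With the DataFrame from the question, and assuming `str_val` contains only strings, the same compiled pattern could be passed to pandas as `df_mdr_pd['str_val'].str.replace(URL_PATTERN, '', regex=True)`, or the helper could be applied with `df_mdr_pd['str_val'].map(strip_urls)`.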
