May 7, 2023, 00:17:02

String manipulation: for each value in a dataframe column, extract a custom string of text between two separators

Question

I have an imported dataframe that mainly contains text values in the column 'full_name'. A typical value looks like this: 'Insulation, glass wool, water repellent, kraft paper lining, 2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)'

Now I would like to extract certain values from these strings (physical properties of building materials) by using the unit of measurement as a search string, for instance 'Km2/W' for the sake of this example. The text between the two comma separators before and after the search string should then be copied into a separate column, where the values can ultimately be converted to numeric.

I asked ChatGPT this question. It explained that its code splits the text column by commas, removes any leading or trailing whitespace, selects the second column, extracts the substring that contains the search string and any characters after it, and then splits that substring by commas and selects the first part. This is the code it returned:

# Extract the text between two comma separators
filtered_df['extracted_text'] = filtered_df['text'].str.split(',', expand=True).apply(lambda x: x.str.strip()).iloc[:, 1].str.extract(f'({search_string}.*)')[0].str.split(',').str[0]

However, the resulting column (filtered_df['extracted_text'] in this example) is full of NaN. What do you think is going wrong here?

Answer 1

Score: 1

The ChatGPT code is much more complex than is necessary. I believe you should be able to achieve the result you're after with something like:

search_string = "Km2/W"
filtered_df['extracted_text'] = filtered_df['text'].str.extract(f', ([^,]*{search_string}),')

Series.str.extract uses regular expressions, which allow you to search for patterns such as ', ([^,]*{search_string}),'. This pattern matches a comma followed by a space, then captures everything up to the next comma, provided that the captured text ends with the search term.
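
To see what this pattern captures, here is a minimal sketch that applies it to the sample value from the question using only Python's re module. The helper name extract_field is made up for illustration, and wrapping the unit in re.escape is an added precaution that the answer's code does not include.

```python
import re

# Sample value copied from the question.
text = ("Insulation, glass wool, water repellent, kraft paper lining, "
        "2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)")

def extract_field(value, search_string):
    """Hypothetical helper: return the comma-separated field that ends
    with search_string, or None when no field matches."""
    # re.escape protects against units that contain regex metacharacters.
    pattern = f", ([^,]*{re.escape(search_string)}),"
    match = re.search(pattern, value)
    return match.group(1) if match else None

r_value = extract_field(text, "Km2/W")    # field ending in the unit
thickness = extract_field(text, "mm")     # another unit from the same string
missing = extract_field(text, "W/mK")     # unit that does not occur
print(r_value, thickness, missing)
```

Because the pattern requires a comma and a space before the field and a comma right after it, the first and last fields of a string (for example 'IBR (Isover)') would not be matched by this version.
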
If you'd like to learn regex, RegexLearn can help.
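
As for why the original code produces only NaN, one likely cause is the .iloc[:, 1] step: it keeps only the second comma-separated field, which in the sample value is 'glass wool' and never contains the unit. The sketch below illustrates this and then applies the single-regex approach end to end, including the numeric conversion the question mentions. The second row, the column name r_value, and the use of re.escape and expand=False are assumptions added for illustration.

```python
import re
import pandas as pd

# Small illustrative frame; the column name 'text' follows the question's code.
# The first row is the sample from the question, the second row is made up.
filtered_df = pd.DataFrame({"text": [
    "Insulation, glass wool, water repellent, kraft paper lining, "
    "2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)",
    "Brick, clay, 100 mm, 1800 kg/m3",
]})
search_string = "Km2/W"

# What the original chain searches in: only the second comma-separated field.
second_field = filtered_df["text"].str.split(",", expand=True).iloc[:, 1].str.strip()
# second_field holds 'glass wool' and 'clay', so the unit is never found there.

# Single regex over the whole string; expand=False returns a Series.
pattern = f", ([^,]*{re.escape(search_string)}),"
filtered_df["extracted_text"] = filtered_df["text"].str.extract(pattern, expand=False)

# Strip the unit and convert to numbers; rows without a match stay NaN.
filtered_df["r_value"] = pd.to_numeric(
    filtered_df["extracted_text"].str.replace(search_string, "", regex=False).str.strip(),
    errors="coerce",
)
print(filtered_df[["extracted_text", "r_value"]])
```

Rows whose text does not contain the unit keep NaN in both new columns, which makes them easy to filter out afterwards with dropna.
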
