String manipulation: for each value in a dataframe column, extract a custom string of text between two separators

Question

I have an imported dataframe whose 'full_name' column contains mainly text values. A typical value looks like this: 'Insulation, glass wool, water repellent, kraft paper lining, 2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)'

I would like to extract certain values (physical properties of building materials) from these strings, using the unit of measurement as a search string, for instance 'Km2/W' in this example. The text between the comma separators immediately before and after the search string should then be copied into a separate column, where the values can ultimately be converted to numeric.

I put this question to ChatGPT, which returned the code below. According to its explanation, the code splits the text column by commas, removes any leading or trailing whitespace, selects the second column, extracts the substring that contains the search string and any characters after it, and then splits that substring by commas and selects the first part.

# Extract the text between two comma separators
filtered_df['extracted_text'] = filtered_df['text'].str.split(',', expand=True).apply(lambda x: x.str.strip()).iloc[:, 1].str.extract(f'({search_string}.*)')[0].str.split(',').str[0]

However, the resulting column, filtered_df['extracted_text'] in this example, is full of NaN. What is going wrong here?
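The NaN result can be reproduced on the sample value quoted above. The following is only a minimal sketch: it assumes a one-row dataframe named filtered_df with a 'text' column (as in the snippet, not the 'full_name' column mentioned earlier), and the intermediate names pieces, second_piece and extracted are introduced here purely for illustration. It suggests that .iloc[:, 1] keeps only the second comma-separated piece, which does not contain the search string, so str.extract has nothing to match.

```python
import pandas as pd

# Sample value taken from the question; the one-row dataframe is an assumption.
sample = ("Insulation, glass wool, water repellent, kraft paper lining, "
          "2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)")
filtered_df = pd.DataFrame({"text": [sample]})
search_string = "Km2/W"

# Same first steps as the ChatGPT snippet, split into named intermediates.
pieces = (filtered_df["text"]
          .str.split(",", expand=True)
          .apply(lambda col: col.str.strip()))

# .iloc[:, 1] selects only the second piece of every row ("glass wool" here).
second_piece = pieces.iloc[:, 1]

# The search string is not in that piece, so the extraction yields NaN.
extracted = second_piece.str.extract(f"({search_string}.*)")[0]

print(second_piece.iloc[0])
print(extracted.isna().all())
```

Since the position of the property differs from row to row, selecting a fixed column after the split is unlikely to work in general.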

Answer 1

Score: 1

The ChatGPT code is much more complex than is necessary. I believe you should be able to achieve the result you're after with something like:

search_string = "Km2/W"
filtered_df['extracted_text'] = filtered_df['text'].str.extract(f', ([^,]*{search_string}),')

Series.str.extract uses regular expressions, which allow you to search for patterns such as ', ([^,]*{search_string}),'. This pattern matches a comma followed by a space, then captures everything up to and including the search term, provided the term is immediately followed by the next comma. Rows that contain no such match get NaN.
If you'd like to learn regex, RegexLearn can help.
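As a rough end-to-end sketch of this approach, the example below applies the same pattern to the sample value from the question and then converts the result to a number. The two-row dataframe, the second row, and the 'value' column name are assumptions made for illustration. re.escape and expand=False are additions not present in the snippet above: the first guards against regex metacharacters in the search string, the second makes str.extract return a Series rather than a one-column DataFrame.

```python
import re
import pandas as pd

# Hypothetical data: the first row is the sample from the question,
# the second row is made up to show the no-match case.
filtered_df = pd.DataFrame({"text": [
    "Insulation, glass wool, water repellent, kraft paper lining, "
    "2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)",
    "Brick, clay, 1800 kg/m3",
]})

search_string = "Km2/W"

# Same pattern as above, with the search string escaped for safety.
pattern = f", ([^,]*{re.escape(search_string)}),"

# expand=False returns a Series, which assigns cleanly to a single column.
filtered_df["extracted_text"] = filtered_df["text"].str.extract(pattern, expand=False)

# Strip the unit and convert what is left to a number; non-matches stay NaN.
filtered_df["value"] = pd.to_numeric(
    filtered_df["extracted_text"]
    .str.replace(search_string, "", regex=False)
    .str.strip(),
    errors="coerce",
)

print(filtered_df[["extracted_text", "value"]])
```

One caveat of the pattern as written: it requires a comma after the search term, so a value sitting in the last field of the string would not be matched.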

huangapple
  • Posted on 2023-05-07 00:17:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76189924.html