String manipulation - for each value in a dataframe column, extract a custom string of text comprised between two separators
Question
I have an imported dataframe which includes mainly text values under the column 'full_name'. The values typically look like this: 'Insulation, glass wool, water repellent, kraft paper lining, 2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)'
Now I would like to extract certain values from these (physical properties of building materials) by using the measurement unit as a search string, for instance 'Km2/W' for the sake of this example. Then I would like the text between the two comma separators before and after the search string to be copied into a separate column, where the values can ultimately be converted to numeric.
I asked ChatGPT this question and it returned the code below. According to its explanation, the code splits the text column by commas, removes any leading or trailing whitespace, selects the second column, extracts the substring that contains the search string and any characters after it, and then splits that substring by commas and selects the first part.
# Extract the text between two comma separators
filtered_df['extracted_text'] = filtered_df['text'].str.split(',', expand=True).apply(lambda x: x.str.strip()).iloc[:, 1].str.extract(f'({search_string}.*)')[0].str.split(',').str[0]
However, the resulting column, filtered_df['extracted_text'] in this example, is full of NaN. What do you think is going wrong here?
Answer 1
Score: 1
The ChatGPT code is much more complex than is necessary. I believe you should be able to achieve the result you're after with something like:
search_string = "Km2/W"
filtered_df['extracted_text'] = filtered_df['text'].str.extract(f', ([^,]*{search_string}),')
Series.str.extract uses regular expressions, which allow you to search for patterns such as ', ([^,]*{search_string}),'. This pattern matches a comma followed by a space, then captures everything up to the next comma, provided that text ends with the search term.
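For anyone who wants to try this out, here is a minimal sketch built on the sample value from the question. The second row, the 'r_value' column name, the use of 'full_name' as the source column, and the slightly more permissive pattern (which also accepts a value at the very start or end of the string) are my own assumptions for illustration, not part of the original code. It also shows the likely reason for the NaN values: selecting column 1 after the split keeps only the second comma-separated field, which never contains the unit.

```python
import re

import pandas as pd

# First row is the sample value from the question; the second row is made up
# to show what happens when the unit is absent.
filtered_df = pd.DataFrame({
    "full_name": [
        "Insulation, glass wool, water repellent, kraft paper lining, "
        "2.25 Km2/W, 90 mm, 12 kg/m3, 1.08 kg/m2, IBR (Isover)",
        "Brick, clay, 1800 kg/m3",
    ]
})

search_string = "Km2/W"

# Why the original chain yields NaN: column 1 of the split is only the
# second field of each row, so the unit can never be found in it.
second_field = filtered_df["full_name"].str.split(",", expand=True)[1].str.strip()

# re.escape protects against units containing regex metacharacters.
# The pattern accepts a field at the start, middle, or end of the string.
pattern = rf"(?:^|,)\s*([^,]*{re.escape(search_string)})\s*(?:,|$)"
filtered_df["extracted_text"] = filtered_df["full_name"].str.extract(
    pattern, expand=False
)

# Pull the first number out of the extracted field and convert it;
# rows without a match stay NaN.
filtered_df["r_value"] = pd.to_numeric(
    filtered_df["extracted_text"].str.extract(r"([-+]?\d+(?:\.\d+)?)", expand=False),
    errors="coerce",
)

print(second_field.tolist())
print(filtered_df[["extracted_text", "r_value"]])
```

Passing expand=False makes str.extract return a Series rather than a one-column DataFrame, which keeps the column assignment unambiguous.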
If you'd like to learn regex, RegexLearn can help.