July 13, 2023, 15:42:19

Split a column into two columns: alphabetic text in one, and alphanumeric text, numbers, or anything else in the other

Question

I have a DataFrame column in which product names and technical details are merged. I want to split it into two columns: the actual product name in one column and the remaining technical details in the other.

I tried to solve this with a regex and managed to split off the technical details, but the product name ends up null wherever the technical details were split off. I am not sure what went wrong.

This is the DataFrame I tried:

import re
import pandas as pd

df = pd.DataFrame({'Description': ['WASHER tey DIN6340 10.5 C 35;', 'CABINET EL', 'CYLINDER SCREW', 'M12x N15']})

Code:

df['Technical Data'] = df['Description'].str.extract(r'^.*?(\s\w*\d+\w*\s.*)$')
df['Product Description'] = df['Description'].apply(lambda x: re.sub(r'^.*?(\w*\d+\w*\s.*)$', '', x))

The result I'm getting is: [output not preserved]

The output I want looks like this: [output not preserved]

Any suggestions on how to do that?

Answer 1

Score: 2

You can capture zero or more characters, as few as possible (the "Product Description" column), followed by optional whitespace and then a token that mixes letters and digits together with everything up to the end of the string (the "Technical Data" column):

df[['Product Description','Technical Data']] = df['Description'].str.extract(r'^(.*?)(?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))?$', expand=True)
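
As a quick check, the pattern can be run against the sample data from the question. This is only a minimal sketch: it assigns group 1 (the leading text) to 'Product Description' and group 2 (everything from the first letter/digit token onward) to 'Technical Data', and the `pattern` variable name is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Description': ['WASHER tey DIN6340 10.5 C 35;', 'CABINET EL',
                                   'CYLINDER SCREW', 'M12x N15']})

# Group 1: leading text. Group 2 (optional): from the first token that mixes
# letters and digits up to the end of the string.
pattern = r'^(.*?)(?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))?$'

# str.extract returns one column per capture group, in group order.
df[['Product Description', 'Technical Data']] = df['Description'].str.extract(pattern, expand=True)
print(df[['Product Description', 'Technical Data']])
```

Rows without such a token keep their whole text in 'Product Description' and get a missing value in 'Technical Data', while 'M12x N15' ends up with an empty 'Product Description' because it starts with the technical token.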

Details:

^ - start of string
(.*?) - Group 1: zero or more characters other than line break characters, as few as possible
(?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))? - an optional non-capturing group matching
- \s* - zero or more whitespace characters
- ((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*) - Group 2: either one or more letters followed by a digit, or one or more digits followed by a letter, and then zero or more characters other than line break characters, as many as possible
$ - end of string.
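
The anchors in the breakdown above also suggest why the attempt in the question blanked the product name: re.sub replaces the entire match, not only the captured group, and a pattern anchored with ^ and $ matches the whole string. A minimal sketch follows, in which `strip_technical` is a hypothetical helper and the narrower pattern is an assumption that reuses the letter/digit token from this answer:

```python
import re

descriptions = ['WASHER tey DIN6340 10.5 C 35;', 'CABINET EL', 'CYLINDER SCREW', 'M12x N15']

# Pattern from the question: anchored at both ends, so a match spans the whole
# string and re.sub replaces all of it with ''.
original = r'^.*?(\w*\d+\w*\s.*)$'
blanked = [re.sub(original, '', d) for d in descriptions]

# Hypothetical alternative: match only the technical part (plus the whitespace
# before it), so the product name survives the substitution.
TECHNICAL = re.compile(r'\s*(?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*$')


def strip_technical(text: str) -> str:
    """Remove the technical part and the whitespace before it from text."""
    return TECHNICAL.sub('', text)


names = [strip_technical(d) for d in descriptions]
print(blanked)
print(names)
```

With the question's pattern the first and last descriptions come back empty; with the narrower pattern the first one keeps 'WASHER tey'.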

If the data contains non-ASCII (Unicode) letters, replace each [a-zA-Z] with the common [^\W\d_] construct, which matches any word character that is neither a digit nor an underscore, that is, any letter.
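
To illustrate the difference, here is a small sketch comparing the ASCII-only pattern with the [^\W\d_] variant. The sample value is made up for the example, and the constant names are assumptions, not part of the original answer.

```python
import re

# ASCII-only letters, as in the answer's pattern.
ASCII_PATTERN = re.compile(r'^(.*?)(?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))?$')
# Same pattern with [a-zA-Z] replaced by [^\W\d_] (any Unicode letter).
UNICODE_PATTERN = re.compile(r'^(.*?)(?:\s*((?:[^\W\d_]+[0-9]|[0-9]+[^\W\d_]).*))?$')

sample = 'ГАЙКА М12'  # made-up value written with Cyrillic letters

ascii_groups = ASCII_PATTERN.match(sample).groups()
unicode_groups = UNICODE_PATTERN.match(sample).groups()
print(ascii_groups)    # the ASCII pattern finds no letter/digit token
print(unicode_groups)  # the Unicode-aware pattern splits off 'М12'
```

The same substitution can be made in the pattern passed to str.extract.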
