Split a column into 2 columns like alphabetic text in one column and alphanumeric or numbers or anything in 2nd column

Question

I have a dataframe column that contains product names and technical details merged together. I want to split it into two columns: the actual product name in one column and the remaining technical details in the other.

I tried to solve this with regex and managed to split off the technical details, but the product name ends up empty wherever the technical details are split off. I'm not sure what went wrong.

This is the dataframe I tried:
import re
import pandas as pd

df = pd.DataFrame({'Description': ['WASHER tey DIN6340 10.5 C 35;', 'CABINET EL', 'CYLINDER SCREW', 'M12x N15']})

Code:
df['Technical Data'] = df['Description'].str.extract(r'^.*?(\s\w*\d+\w*\s.*)$')
df['Product Description'] = df['Description'].apply(lambda x: re.sub(r'^.*?(\w*\d+\w*\s.*)$', '', x))

The result I'm getting is:

[image: screenshot of the resulting dataframe, not preserved]

The output I want looks like this:

[image: screenshot of the expected dataframe, not preserved]

Any suggestions on how to do that?

Answer 1

Score: 2

You can capture any zero or more characters, as few as possible (the "Product Description" column), then optional whitespace followed by a token that mixes letters and digits, and then everything up to the end of the string (the "Technical Data" column):

df[['Product Description','Technical Data']] = df['Description'].str.extract(r'^(.*?)(?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))?$', expand=True)
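As a quick check, the pattern can be run against the sample dataframe from the question. This is a minimal sketch: the column names `leading_text` and `alnum_tail` are placeholders chosen here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Description': ['WASHER tey DIN6340 10.5 C 35;',
                                   'CABINET EL',
                                   'CYLINDER SCREW',
                                   'M12x N15']})

pattern = r'^(.*?)(?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))?$'

# expand=True returns one column per capturing group:
# column 0 is group 1 (text before the first letter/digit token),
# column 1 is group 2 (that token and everything after it).
parts = df['Description'].str.extract(pattern, expand=True)
parts.columns = ['leading_text', 'alnum_tail']  # hypothetical names

print(df.join(parts))
```

For the first row, group 1 holds `WASHER tey` and group 2 holds `DIN6340 10.5 C 35;`. Rows without a letter/digit token keep the whole string in group 1.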


Details:

  • ^ - start of string
  • (.*?) - Group 1: any zero or more chars, other than line break chars, as few as possible
  • (?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))? - an optional non-capturing group matching
    • \s* - zero or more whitespace characters
    • ((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*) - Group 2: either one or more letters followed by a digit, or one or more digits followed by a letter, and then any zero or more chars, other than line break chars, as many as possible
  • $ - end of string.
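The ^ and $ anchors also seem to explain the symptom described in the question: the pattern passed to re.sub there is anchored at both ends, so whenever it matches, the match covers the whole string and replacing it with '' leaves nothing. The sketch below reproduces that with the standard `re` module. The `keep_prefix` pattern is a hypothetical variant written for this illustration, not something from the question or the answer.

```python
import re

samples = ['WASHER tey DIN6340 10.5 C 35;', 'CABINET EL',
           'CYLINDER SCREW', 'M12x N15']

# Pattern from the question's re.sub call: anchored with ^ and $,
# so a successful match always spans the entire string.
attempt = r'^.*?(\w*\d+\w*\s.*)$'
emptied = [re.sub(attempt, '', s) for s in samples]
print(emptied)

# Hypothetical variant: capture the leading part and substitute the
# whole match with that capture instead of with an empty string.
keep_prefix = r'^(.*?)\s*(?:\w*\d+\w*\s.*)$'
kept = [re.sub(keep_prefix, r'\1', s) for s in samples]
print(kept)
```

Strings that contain no digit are left untouched by both calls, because the pattern does not match them at all.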

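If the text contains non-ASCII (Unicode) letters, the common [^\W\d_] construct can help: replace each [a-zA-Z] in the pattern with [^\W\d_], which matches any word character that is neither a digit nor an underscore.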
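A small sketch of that substitution, using only the standard `re` module. The sample string is made up for illustration and does not come from the question's data.

```python
import re

ascii_pattern = r'^(.*?)(?:\s*((?:[a-zA-Z]+[0-9]|[0-9]+[a-zA-Z]).*))?$'

# Same pattern with [a-zA-Z] swapped for [^\W\d_]. In Python 3, \W and \d
# are Unicode-aware for str patterns, so this class matches any letter.
letter = r'[^\W\d_]'
unicode_pattern = rf'^(.*?)(?:\s*((?:{letter}+[0-9]|[0-9]+{letter}).*))?$'

sample = 'ŚRUBA Ł12 ocynk'  # hypothetical description with non-ASCII letters

ascii_groups = re.search(ascii_pattern, sample).groups()
unicode_groups = re.search(unicode_pattern, sample).groups()

print(ascii_groups)    # ASCII version finds no letter/digit token
print(unicode_groups)  # Unicode version splits at 'Ł12'
```

With the ASCII character class, `Ł12` is not recognised as a letter followed by a digit, so the whole string stays in group 1. With [^\W\d_] the split happens before `Ł12`.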

huangapple
  • Published on July 13, 2023 at 15:42:19
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76677004.html