英文:
Split string using Python when same delimiter has different meaning in different records
问题
Sure, here are the translated code-related parts of your text:
我有类似这样的数据。
| 记录编号 | 一级人员 | 二级人员 | 日期 | 工作时间
| ------------- | -------------- | -------------- | ----- | -----------------
| 1 | | Tim David,Cameron Green - (第1分区)| 2023年01月01日| 5
| 2 | Tim David - (第1分区)| Mitch,Eli Kin Marsh - (第2分区)| 2023年02月02日| 3
| 3 | | David Warner - (第2分区),Travis Head - (第3分区)| 2023年03月04日 | 1
| 4 | Cameron Green - (第1分区)| Tim David - (第1分区) | 2023年07月01日| 2
最终目标是获取每个人每个月在工作上花费的总时间,按分区分类。这与人员级别无关。结果应类似于:
| 分区 | 人员 | 月份 | 工作时间
| ------------- | -------------- | -------------- | ----- |
| 第1分区 | Tim David | 2023年01月| 7
| 第1分区 | Tim David| 2023年02月| 3
| 第1分区 | Cameron Green| 2023年01月| 7
| 第2分区 | Mitch,Eli Kin Marsh| 2023年02月| 3
| 第2分区 | David Warner| 2023年04月| 1
| 第3分区 | Travis Head| 2023年04月| 1
为了实现这一目标,首先我正在尝试清理“二级人员”列。在这一列中,记录1表示有两个人都在第1分区。一个人是Tim David,另一个是Cameron Green。在记录2中,只有一个人Mitch,Eli Kin Marsh在第2分区。在第3条记录中,有两个人在两个不同的分区。David Warner在第2分区,Travis Head在第3分区。在记录4中,只有一个人Tim David在第1分区。
1. 我试图创建一个新列,其中包含参与特定记录的所有人员。在这样做时,我在尝试通过逗号拆分“二级人员”列时遇到了问题。例如,在记录1和记录2中,我在尝试按逗号分割时遇到了问题,因为在记录2中,即使只有一个人,也有一个逗号分隔姓氏和其他名字。因此,我想要记录1的列表是['Tim David','Cameron Green'],记录2的列表是['Mitch Eli Kin Marsh']。
以下是我尝试的部分代码:
```python
def split_names(row):
string = row['二级人员']
pattern = '([\w\s,-]+)&'
names = re.split(pattern, string)
name_list = list()
for name in names:
replacements = [('-', ''), ('(', ''), (')', '')]
for char, replacement in replacements:
if char in name:
name = name.replace(char, replacement)
name_list.append(name)
while("" in name_list): # 移除空元素
name_list.remove("")
return name_list
df['人员'] = df.apply(split_names, axis=1)
- 然后,我还想为那些没有分区的人员分配分区。如果多个人在同一个分区,就会发生这种情况。例如,在记录1中。所以,我考虑创建另一个列,其中每个元素都对应于该人员所属的分区。因此,对于记录1,这个列表将是['第1分区','第1分区']。
<details>
<summary>英文:</summary>
I have data that looks like this.
| Record number | level 1 person |level 2 person| date| time spent on job
| ------------- | -------------- |--------------|-----|-----------------
| 1 | |Tim David, Cameron Green - (Division 1) |01/01/2023|5
| 2 | Tim David - (Division 1) |Mitch, Eli Kin Marsh - (Division 2)|02/02/2023|3
|3| | David Warner - (Division 2), Travis Head - (Division 3)| 03/04/2023 | 1
|4| Cameron Green - (Division 1)| Tim David - (Division 1) | 07/01/2023| 2
The final aim is to get the total time each person spends on doing jobs per month categorised by the division. This is regardless of the level of person. The result should be something similar to:
| Division | Person |Month| time spent on job
| ------------- | -------------- |--------------|-----|
|Division 1| Tim David | Jan-23 |7
|Division 1| Tim David| Feb-23| 3
|Division 1| Cameron Green| Jan-23| 7
|Division 2| Mitch, Eli Kin Marsh| Feb-23| 3
|Division 2| David Warner| Apr-23| 1
|Division 3| Travis Head| Apr-23| 1
To achieve this first I am trying to clean the ‘level 2 person’ column. In this column, record 1 means there are two people both in Division 1. One person is Tim David and the other is Cameron Green. In record 2 there is only one person Mitch, Eli Kin Marsh who is in Division 2. In the 3rd record there are two people in two separate divisions. David Warner is in Division 2 and Travis Head is in Division 3. In record 4, only one person Tim David in Division 1.
1. I am trying to create a new column that captures all the people involved in a particular record. In doing this I am having trouble splitting the names in 'level 2 person' column. For example in Record 1 and Record 2 I have trouble splitting by a comma because in Record 2 even though there is only one person there is a comma separating the last name and other names. So the list I want for Record 1 is [‘Tim David’, ‘Cameron Green’] for Record 2 [‘Mitch Eli Kin Marsh'].
This is what I did to attempt this part:
def split_names(row):
string = row['level 2 person']
pattern = '([\w\s,-]+)'
names = re.split(pattern, string)
name_list = list()
for name in names:
replacements = [('-', ''), ('(', ''), (')', '')]
for char, replacement in replacements:
if char in name:
name= name.replace(char, replacement)
name_list.append(name)
while("" in name_list): # remove empty elements
name_list.remove("")
return name_list
df['names'] = df.apply(split_names,axis=1)
2. Then I also want to assign Division for those who do not have it. This happens if multiple people are in the same division. For example, in Record 1. So, I am thinking of creating another column with a list where each element would correspond to the division that person belongs to. So for Record 1 this list would be [‘Division 1’, ‘Division 1’]
</details>
# 答案1
**得分**: 2
"1 word" is "1 个单词",规则:"2 个单词"必须在逗号之前,能解决这个问题吗?
可以使用[pypi的正则模块](https://pypi.org/project/regex)来实现,因为它支持可变长度的回顾断言。
```python
>>> import regex
>>>
>>> pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
>>> regex.split(pattern, 'I, am all one name - (Division 2)')
['I, am all one name - (Division 2)']
>>> regex.split(pattern, 'I am, not all one name - (Division 2)')
['I am', 'not all one name - (Division 2)']
(您也可以在不使用正则表达式的情况下通过仅拆分逗号并将"1 个单词"单元格与相邻单元格合并来实现这一点。)
修改您的示例:
def split_names(cols):
# 逗号前必须有2个"单词"
pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
people = {}
for names in cols:
names = regex.split(pattern, names)
if names == ['']:
continue
same_division = False
level = {}
for name in names:
if ' - ' in name:
name, division = name.split(' - ')
division = division.strip('()')
else:
division = None
same_division = True
level[name] = division
if same_division:
level = dict.fromkeys(level, division)
people.update(level)
return [
{'Division': division, 'Person': person} for person, division in people.items()
]
示例用法:
columns = ['level 1 person', 'level 2 person']
df[columns].apply(split_names, axis=1)
0 [{'Division': 'Division 1', 'Person': 'Tim Dav...
1 [{'Division': 'Division 1', 'Person': 'Tim Dav...
2 [{'Division': 'Division 2', 'Person': 'David W...
3 [{'Division': 'Division 1', 'Person': 'Cameron...
dtype: object
您可以使用.explode
和.join
将结果转换为列。
columns = ['level 1 person', 'level 2 person']
df = (
df.drop(columns=columns)
.join(df[columns].apply(split_names, axis=1).rename('People'))
.explode('People', ignore_index=True)
)
df = df.join(pd.DataFrame(df.pop('People').values.tolist()))
记录编号 日期 在工作上花费的时间 部门 人员
0 1 2023-01-01 5 Division 1 Tim David
1 1 2023-01-01 5 Division 1 Cameron Green
2 2 2023-02-02 3 Division 1 Tim David
3 2 2023-02-02 3 Division 2 Mitch, Eli Kin Marsh
4 3 2023-03-04 1 Division 2 David Warner
5 3 2023-03-04 1 Division 3 Travis Head
6 4 2023-07-01 2 Division 1 Cameron Green
7 4 2023-07-01 2 Division 1 Tim David
英文:
As Mitch
is "1 word", would the rule: "2 words" must come before the comma solve this issue?
That can be done with the regex module from pypi as it supports variable length lookbehind assertions.
>>> import regex
>>>
>>> pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
>>> regex.split(pattern, 'I, am all one name - (Division 2)')
['I, am all one name - (Division 2)']
>>> regex.split(pattern, 'I am, not all one name - (Division 2)')
['I am', 'not all one name - (Division 2)']
(You could also implement this without regex by splitting on just the comma and merging "1 word" cells with their neighbor.)
Modifying your example:
def split_names(cols):
# must be 2 "words" before comma space
pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
people = {}
for names in cols:
names = regex.split(pattern, names)
if names == ['']:
continue
same_division = False
level = {}
for name in names:
if ' - ' in name:
name, division = name.split(' - ')
division = division.strip('()')
else:
division = None
same_division = True
level[name] = division
if same_division:
level = dict.fromkeys(level, division)
people.update(level)
return [
{'Division': division, 'Person': person} for person, division in people.items()
]
Example usage:
columns = ['level 1 person', 'level 2 person']
df[columns].apply(split_names, axis=1)
0 [{'Division': 'Division 1', 'Person': 'Tim Dav...
1 [{'Division': 'Division 1', 'Person': 'Tim Dav...
2 [{'Division': 'Division 2', 'Person': 'David W...
3 [{'Division': 'Division 1', 'Person': 'Cameron...
dtype: object
You could .explode
and .join
to turn the result into columns.
columns = ['level 1 person', 'level 2 person']
df = (
df.drop(columns=columns)
.join(df[columns].apply(split_names, axis=1).rename('People'))
.explode('People', ignore_index=True)
)
df = df.join(pd.DataFrame(df.pop('People').values.tolist()))
Record number date time spent on job Division Person
0 1 2023-01-01 5 Division 1 Tim David
1 1 2023-01-01 5 Division 1 Cameron Green
2 2 2023-02-02 3 Division 1 Tim David
3 2 2023-02-02 3 Division 2 Mitch, Eli Kin Marsh
4 3 2023-03-04 1 Division 2 David Warner
5 3 2023-03-04 1 Division 3 Travis Head
6 4 2023-07-01 2 Division 1 Cameron Green
7 4 2023-07-01 2 Division 1 Tim David
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论