2023年6月15日 09:10:01go评论121阅读模式

英文:

Split string using Python when same delimiter has different meaning in different records

问题

Sure, here are the translated code-related parts of your text:

我有类似这样的数据。
| 记录编号 | 一级人员 | 二级人员 | 日期 | 工作时间
| ------------- | -------------- | -------------- | ----- | -----------------
| 1 | | Tim David，Cameron Green - （第1分区）| 2023年01月01日| 5
| 2 | Tim David - （第1分区）| Mitch，Eli Kin Marsh - （第2分区）| 2023年02月02日| 3
| 3 | | David Warner - （第2分区），Travis Head - （第3分区）| 2023年03月04日 | 1
| 4 | Cameron Green - （第1分区）| Tim David - （第1分区） | 2023年07月01日| 2
最终目标是获取每个人每个月在工作上花费的总时间，按分区分类。这与人员级别无关。结果应类似于：
| 分区 | 人员 | 月份 | 工作时间
| ------------- | -------------- | -------------- | ----- |
| 第1分区 | Tim David | 2023年01月| 7
| 第1分区 | Tim David| 2023年02月| 3
| 第1分区 | Cameron Green| 2023年01月| 7
| 第2分区 | Mitch，Eli Kin Marsh| 2023年02月| 3
| 第2分区 | David Warner| 2023年04月| 1
| 第3分区 | Travis Head| 2023年04月| 1
为了实现这一目标，首先我正在尝试清理“二级人员”列。在这一列中，记录1表示有两个人都在第1分区。一个人是Tim David，另一个是Cameron Green。在记录2中，只有一个人Mitch，Eli Kin Marsh在第2分区。在第3条记录中，有两个人在两个不同的分区。David Warner在第2分区，Travis Head在第3分区。在记录4中，只有一个人Tim David在第1分区。
1. 我试图创建一个新列，其中包含参与特定记录的所有人员。在这样做时，我在尝试通过逗号拆分“二级人员”列时遇到了问题。例如，在记录1和记录2中，我在尝试按逗号分割时遇到了问题，因为在记录2中，即使只有一个人，也有一个逗号分隔姓氏和其他名字。因此，我想要记录1的列表是['Tim David'，'Cameron Green']，记录2的列表是['Mitch Eli Kin Marsh']。
以下是我尝试的部分代码：
```python
def split_names(row):
    string = row['二级人员']
    pattern = '([\w\s,-]+)&'
    names = re.split(pattern, string) 
    name_list = list()
    for name in names:
        replacements = [('-', ''), ('(', ''), (')', '')]
        for char, replacement in replacements:
            if char in name:
                name = name.replace(char, replacement)
        name_list.append(name)
    while("" in name_list): # 移除空元素
        name_list.remove("")
    return name_list
df['人员'] = df.apply(split_names, axis=1)

然后，我还想为那些没有分区的人员分配分区。如果多个人在同一个分区，就会发生这种情况。例如，在记录1中。所以，我考虑创建另一个列，其中每个元素都对应于该人员所属的分区。因此，对于记录1，这个列表将是['第1分区'，'第1分区']。


<details>
<summary>英文:</summary>
I have data that looks like this. 
| Record number | level 1 person |level 2 person| date| time spent on job
| ------------- | -------------- |--------------|-----|-----------------
| 1         | |Tim David, Cameron Green - (Division 1) |01/01/2023|5
| 2      | Tim David - (Division 1) |Mitch, Eli Kin Marsh - (Division 2)|02/02/2023|3
|3| | David Warner - (Division 2), Travis Head - (Division 3)| 03/04/2023 | 1
|4| Cameron Green - (Division 1)| Tim David - (Division 1) | 07/01/2023| 2
The final aim is to get the total time each person spends on doing jobs per month categorised by the division. This is regardless of the level of person. The result should be something similar to:
| Division | Person |Month|  time spent on job
| ------------- | -------------- |--------------|-----|
|Division 1| Tim David | Jan-23	|7
|Division 1|	Tim David|	Feb-23|	3
|Division 1|	Cameron Green| 	Jan-23|	7
|Division 2|	Mitch, Eli Kin Marsh|	Feb-23|	3
|Division 2|	David Warner|	Apr-23|	1
|Division 3|	Travis Head|	Apr-23|	1
To achieve this first I am trying to clean the ‘level 2 person’ column. In this column, record 1 means there are two people both in Division 1. One person is Tim David and the other is Cameron Green. In record 2 there is only one person Mitch, Eli Kin Marsh who is in Division 2. In the 3rd record there are two people in two separate divisions. David Warner is in Division 2 and Travis Head is in Division 3. In record 4, only one person Tim David in Division 1. 
1.	I am trying to create a new column that captures all the people involved in a particular record. In doing this I am having trouble splitting the names in &#39;level 2 person&#39; column. For example in Record 1 and Record 2 I have trouble splitting by a comma because in Record 2 even though there is only one person there is a comma separating the last name and other names. So the list I want for Record 1 is [‘Tim David’, ‘Cameron Green’] for Record 2 [‘Mitch Eli Kin Marsh&#39;].
This is what I did to attempt this part:
def split_names(row):
string = row[&#39;level 2 person&#39;]
pattern = &#39;([\w\s,-]+)&#39;
names = re.split(pattern, string) 
name_list = list()
for name in names:
replacements = [(&#39;-&#39;, &#39;&#39;), (&#39;(&#39;, &#39;&#39;), (&#39;)&#39;, &#39;&#39;)]
for char, replacement in replacements:
if char in name:
name= name.replace(char, replacement)
name_list.append(name)        
while(&quot;&quot; in name_list): # remove empty elements
name_list.remove(&quot;&quot;)
return name_list
df[&#39;names&#39;] = df.apply(split_names,axis=1)
2.	Then I also want to assign Division for those who do not have it. This happens if multiple people are in the same division. For example, in Record 1. So, I am thinking of creating another column with a list where each element would correspond to the division that person belongs to. So for Record 1 this list would be [‘Division 1’, ‘Division 1’]
</details>
# 答案1
**得分**: 2
"1 word" is &quot;1 个单词&quot;，规则："2 个单词"必须在逗号之前，能解决这个问题吗？
可以使用[pypi的正则模块](https://pypi.org/project/regex)来实现，因为它支持可变长度的回顾断言。
```python
&gt;&gt;&gt; import regex
&gt;&gt;&gt;
&gt;&gt;&gt; pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
&gt;&gt;&gt; regex.split(pattern, 'I, am all one name - (Division 2)')
['I, am all one name - (Division 2)']
&gt;&gt;&gt; regex.split(pattern, 'I am, not all one name - (Division 2)')
['I am', 'not all one name - (Division 2)']

(您也可以在不使用正则表达式的情况下通过仅拆分逗号并将"1 个单词"单元格与相邻单元格合并来实现这一点。)

修改您的示例：

def split_names(cols):
     # 逗号前必须有2个"单词"
     pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
    
     people = {}
     
     for names in cols:
         names = regex.split(pattern, names)
         if names == ['']: 
             continue 
             
         same_division = False
         level = {}
             
         for name in names:
             if ' - ' in name:
                name, division = name.split(' - ')
                division = division.strip('()')
             else:
                division = None
                same_division = True
               
             level[name] = division
 
         if same_division:
             level = dict.fromkeys(level, division)
             
         people.update(level)
            
     return [
         {'Division': division, 'Person': person} for person, division in people.items()
     ]

示例用法：

columns = ['level 1 person', 'level 2 person']
df[columns].apply(split_names, axis=1)

0    [{'Division': 'Division 1', 'Person': 'Tim Dav...
1    [{'Division': 'Division 1', 'Person': 'Tim Dav...
2    [{'Division': 'Division 2', 'Person': 'David W...
3    [{'Division': 'Division 1', 'Person': 'Cameron...
dtype: object

您可以使用.explode和.join将结果转换为列。

columns = ['level 1 person', 'level 2 person'] 
df = (
   df.drop(columns=columns)
     .join(df[columns].apply(split_names, axis=1).rename('People'))
     .explode('People', ignore_index=True)
)
df = df.join(pd.DataFrame(df.pop('People').values.tolist()))

   记录编号        日期  在工作上花费的时间      部门                人员
0              1  2023-01-01                   5  Division 1             Tim David
1              1  2023-01-01                   5  Division 1         Cameron Green
2              2  2023-02-02                   3  Division 1             Tim David
3              2  2023-02-02                   3  Division 2  Mitch, Eli Kin Marsh
4              3  2023-03-04                   1  Division 2          David Warner
5              3  2023-03-04                   1  Division 3           Travis Head
6              4  2023-07-01                   2  Division 1         Cameron Green
7              4  2023-07-01                   2  Division 1             Tim David

英文:

As Mitch is "1 word", would the rule: "2 words" must come before the comma solve this issue?

That can be done with the regex module from pypi as it supports variable length lookbehind assertions.

&gt;&gt;&gt; import regex
&gt;&gt;&gt;
&gt;&gt;&gt; pattern = r&#39;(?&lt;=[^,\s]+\s+[^,\s]+), &#39;
&gt;&gt;&gt; regex.split(pattern, &#39;I, am all one name - (Division 2)&#39;)
[&#39;I, am all one name - (Division 2)&#39;]
&gt;&gt;&gt; regex.split(pattern, &#39;I am, not all one name - (Division 2)&#39;)
[&#39;I am&#39;, &#39;not all one name - (Division 2)&#39;]

(You could also implement this without regex by splitting on just the comma and merging "1 word" cells with their neighbor.)

Modifying your example:

def split_names(cols):
# must be 2 &quot;words&quot; before comma space 
pattern = r&#39;(?&lt;=[^,\s]+\s+[^,\s]+), &#39;
people = {}
for names in cols:
names = regex.split(pattern, names)
if names == [&#39;&#39;]: 
continue 
same_division = False
level = {}
for name in names:
if &#39; - &#39; in name:
name, division = name.split(&#39; - &#39;)
division = division.strip(&#39;()&#39;)
else:
division = None
same_division = True
level[name] = division
if same_division:
level = dict.fromkeys(level, division)
people.update(level)
return [
{&#39;Division&#39;: division, &#39;Person&#39;: person} for person, division in people.items()
]

Example usage:

columns = [&#39;level 1 person&#39;, &#39;level 2 person&#39;]
df[columns].apply(split_names, axis=1)

0    [{&#39;Division&#39;: &#39;Division 1&#39;, &#39;Person&#39;: &#39;Tim Dav...
1    [{&#39;Division&#39;: &#39;Division 1&#39;, &#39;Person&#39;: &#39;Tim Dav...
2    [{&#39;Division&#39;: &#39;Division 2&#39;, &#39;Person&#39;: &#39;David W...
3    [{&#39;Division&#39;: &#39;Division 1&#39;, &#39;Person&#39;: &#39;Cameron...
dtype: object

You could .explode and .join to turn the result into columns.

columns = [&#39;level 1 person&#39;, &#39;level 2 person&#39;] 
df = (
df.drop(columns=columns)
.join(df[columns].apply(split_names, axis=1).rename(&#39;People&#39;))
.explode(&#39;People&#39;, ignore_index=True)
)
df = df.join(pd.DataFrame(df.pop(&#39;People&#39;).values.tolist()))

   Record number       date  time spent on job    Division                Person
0              1 2023-01-01                  5  Division 1             Tim David
1              1 2023-01-01                  5  Division 1         Cameron Green
2              2 2023-02-02                  3  Division 1             Tim David
3              2 2023-02-02                  3  Division 2  Mitch, Eli Kin Marsh
4              3 2023-03-04                  1  Division 2          David Warner
5              3 2023-03-04                  1  Division 3           Travis Head
6              4 2023-07-01                  2  Division 1         Cameron Green
7              4 2023-07-01                  2  Division 1             Tim David

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

使用Python拆分字符串，当相同的分隔符在不同记录中具有不同含义时。

问题

"TypeError: must be str or None, not list" when trying to pass row from csv.reader to str.split

如何比较两个Python AST，忽略参数？

如何找到一个组合的列等于另一个数据框的列的索引/行。

如何根据它们的位置显示瓦片？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。