使用Python拆分字符串,当相同的分隔符在不同记录中具有不同含义时。

huangapple go评论95阅读模式
英文:

Split string using Python when same delimiter has different meaning in different records

问题

Sure, here are the translated code-related parts of your text:

我有类似这样的数据

| 记录编号 | 一级人员 | 二级人员 | 日期 | 工作时间
| ------------- | -------------- | -------------- | ----- | -----------------
| 1 | | Tim DavidCameron Green - 第1分区| 2023年01月01日| 5
| 2 | Tim David - 第1分区| MitchEli Kin Marsh - 第2分区| 2023年02月02日| 3
| 3 | | David Warner - 第2分区),Travis Head - 第3分区| 2023年03月04日 | 1
| 4 | Cameron Green - 第1分区| Tim David - 第1分区 | 2023年07月01日| 2

最终目标是获取每个人每个月在工作上花费的总时间按分区分类这与人员级别无关结果应类似于

| 分区 | 人员 | 月份 | 工作时间
| ------------- | -------------- | -------------- | ----- |
| 第1分区 | Tim David | 2023年01月| 7
| 第1分区 | Tim David| 2023年02月| 3
| 第1分区 | Cameron Green| 2023年01月| 7
| 第2分区 | MitchEli Kin Marsh| 2023年02月| 3
| 第2分区 | David Warner| 2023年04月| 1
| 第3分区 | Travis Head| 2023年04月| 1

为了实现这一目标首先我正在尝试清理二级人员在这一列中记录1表示有两个人都在第1分区一个人是Tim David另一个是Cameron Green在记录2中只有一个人MitchEli Kin Marsh在第2分区在第3条记录中有两个人在两个不同的分区David Warner在第2分区Travis Head在第3分区在记录4中只有一个人Tim David在第1分区

1. 我试图创建一个新列其中包含参与特定记录的所有人员在这样做时我在尝试通过逗号拆分二级人员列时遇到了问题例如在记录1和记录2中我在尝试按逗号分割时遇到了问题因为在记录2中即使只有一个人也有一个逗号分隔姓氏和其他名字因此我想要记录1的列表是['Tim David''Cameron Green']记录2的列表是['Mitch Eli Kin Marsh']

以下是我尝试的部分代码

```python
def split_names(row):
    string = row['二级人员']

    pattern = '([\w\s,-]+)&'

    names = re.split(pattern, string) 

    name_list = list()

    for name in names:
        replacements = [('-', ''), ('(', ''), (')', '')]

        for char, replacement in replacements:
            if char in name:
                name = name.replace(char, replacement)
        name_list.append(name)

    while("" in name_list): # 移除空元素
        name_list.remove("")

    return name_list

df['人员'] = df.apply(split_names, axis=1)
  1. 然后,我还想为那些没有分区的人员分配分区。如果多个人在同一个分区,就会发生这种情况。例如,在记录1中。所以,我考虑创建另一个列,其中每个元素都对应于该人员所属的分区。因此,对于记录1,这个列表将是['第1分区','第1分区']。

<details>
<summary>英文:</summary>
I have data that looks like this. 
| Record number | level 1 person |level 2 person| date| time spent on job
| ------------- | -------------- |--------------|-----|-----------------
| 1         | |Tim David, Cameron Green - (Division 1) |01/01/2023|5
| 2      | Tim David - (Division 1) |Mitch, Eli Kin Marsh - (Division 2)|02/02/2023|3
|3| | David Warner - (Division 2), Travis Head - (Division 3)| 03/04/2023 | 1
|4| Cameron Green - (Division 1)| Tim David - (Division 1) | 07/01/2023| 2
The final aim is to get the total time each person spends on doing jobs per month categorised by the division. This is regardless of the level of person. The result should be something similar to:
| Division | Person |Month|  time spent on job
| ------------- | -------------- |--------------|-----|
|Division 1| Tim David | Jan-23	|7
|Division 1|	Tim David|	Feb-23|	3
|Division 1|	Cameron Green| 	Jan-23|	7
|Division 2|	Mitch, Eli Kin Marsh|	Feb-23|	3
|Division 2|	David Warner|	Apr-23|	1
|Division 3|	Travis Head|	Apr-23|	1
To achieve this first I am trying to clean the ‘level 2 person’ column. In this column, record 1 means there are two people both in Division 1. One person is Tim David and the other is Cameron Green. In record 2 there is only one person Mitch, Eli Kin Marsh who is in Division 2. In the 3rd record there are two people in two separate divisions. David Warner is in Division 2 and Travis Head is in Division 3. In record 4, only one person Tim David in Division 1. 
1.	I am trying to create a new column that captures all the people involved in a particular record. In doing this I am having trouble splitting the names in &#39;level 2 person&#39; column. For example in Record 1 and Record 2 I have trouble splitting by a comma because in Record 2 even though there is only one person there is a comma separating the last name and other names. So the list I want for Record 1 is [‘Tim David’, ‘Cameron Green’] for Record 2 [‘Mitch Eli Kin Marsh&#39;].
This is what I did to attempt this part:
def split_names(row):
string = row[&#39;level 2 person&#39;]
pattern = &#39;([\w\s,-]+)&#39;
names = re.split(pattern, string) 
name_list = list()
for name in names:
replacements = [(&#39;-&#39;, &#39;&#39;), (&#39;(&#39;, &#39;&#39;), (&#39;)&#39;, &#39;&#39;)]
for char, replacement in replacements:
if char in name:
name= name.replace(char, replacement)
name_list.append(name)        
while(&quot;&quot; in name_list): # remove empty elements
name_list.remove(&quot;&quot;)
return name_list
df[&#39;names&#39;] = df.apply(split_names,axis=1)
2.	Then I also want to assign Division for those who do not have it. This happens if multiple people are in the same division. For example, in Record 1. So, I am thinking of creating another column with a list where each element would correspond to the division that person belongs to. So for Record 1 this list would be [‘Division 1’, ‘Division 1’]
</details>
# 答案1
**得分**: 2
"1 word" is &quot;1 个单词&quot;,规则:"2 个单词"必须在逗号之前,能解决这个问题吗?
可以使用[pypi的正则模块](https://pypi.org/project/regex)来实现,因为它支持可变长度的回顾断言。
```python
&gt;&gt;&gt; import regex
&gt;&gt;&gt;
&gt;&gt;&gt; pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
&gt;&gt;&gt; regex.split(pattern, 'I, am all one name - (Division 2)')
['I, am all one name - (Division 2)']
&gt;&gt;&gt; regex.split(pattern, 'I am, not all one name - (Division 2)')
['I am', 'not all one name - (Division 2)']

(您也可以在不使用正则表达式的情况下通过仅拆分逗号并将"1 个单词"单元格与相邻单元格合并来实现这一点。)

修改您的示例:

def split_names(cols):
     # 逗号前必须有2个"单词"
     pattern = r'(?<=[^,\s]+\s+[^,\s]+), '
    
     people = {}
     
     for names in cols:
         names = regex.split(pattern, names)

         if names == ['']: 
             continue 
             
         same_division = False
         level = {}
             
         for name in names:
             if ' - ' in name:
                name, division = name.split(' - ')
                division = division.strip('()')
             else:
                division = None
                same_division = True
               
             level[name] = division
 
         if same_division:
             level = dict.fromkeys(level, division)
             
         people.update(level)
            
     return [
         {'Division': division, 'Person': person} for person, division in people.items()
     ]

示例用法:

columns = ['level 1 person', 'level 2 person']
df[columns].apply(split_names, axis=1)
0    [{'Division': 'Division 1', 'Person': 'Tim Dav...
1    [{'Division': 'Division 1', 'Person': 'Tim Dav...
2    [{'Division': 'Division 2', 'Person': 'David W...
3    [{'Division': 'Division 1', 'Person': 'Cameron...
dtype: object

您可以使用.explode.join将结果转换为列。

columns = ['level 1 person', 'level 2 person'] 

df = (
   df.drop(columns=columns)
     .join(df[columns].apply(split_names, axis=1).rename('People'))
     .explode('People', ignore_index=True)
)

df = df.join(pd.DataFrame(df.pop('People').values.tolist())) 
   记录编号        日期  在工作上花费的时间      部门                人员
0              1  2023-01-01                   5  Division 1             Tim David
1              1  2023-01-01                   5  Division 1         Cameron Green
2              2  2023-02-02                   3  Division 1             Tim David
3              2  2023-02-02                   3  Division 2  Mitch, Eli Kin Marsh
4              3  2023-03-04                   1  Division 2          David Warner
5              3  2023-03-04                   1  Division 3           Travis Head
6              4  2023-07-01                   2  Division 1         Cameron Green
7              4  2023-07-01                   2  Division 1             Tim David
英文:

As Mitch is "1 word", would the rule: "2 words" must come before the comma solve this issue?

That can be done with the regex module from pypi as it supports variable length lookbehind assertions.

&gt;&gt;&gt; import regex
&gt;&gt;&gt;
&gt;&gt;&gt; pattern = r&#39;(?&lt;=[^,\s]+\s+[^,\s]+), &#39;
&gt;&gt;&gt; regex.split(pattern, &#39;I, am all one name - (Division 2)&#39;)
[&#39;I, am all one name - (Division 2)&#39;]
&gt;&gt;&gt; regex.split(pattern, &#39;I am, not all one name - (Division 2)&#39;)
[&#39;I am&#39;, &#39;not all one name - (Division 2)&#39;]

(You could also implement this without regex by splitting on just the comma and merging "1 word" cells with their neighbor.)

Modifying your example:

def split_names(cols):
# must be 2 &quot;words&quot; before comma space 
pattern = r&#39;(?&lt;=[^,\s]+\s+[^,\s]+), &#39;
people = {}
for names in cols:
names = regex.split(pattern, names)
if names == [&#39;&#39;]: 
continue 
same_division = False
level = {}
for name in names:
if &#39; - &#39; in name:
name, division = name.split(&#39; - &#39;)
division = division.strip(&#39;()&#39;)
else:
division = None
same_division = True
level[name] = division
if same_division:
level = dict.fromkeys(level, division)
people.update(level)
return [
{&#39;Division&#39;: division, &#39;Person&#39;: person} for person, division in people.items()
]

Example usage:

columns = [&#39;level 1 person&#39;, &#39;level 2 person&#39;]
df[columns].apply(split_names, axis=1)
0    [{&#39;Division&#39;: &#39;Division 1&#39;, &#39;Person&#39;: &#39;Tim Dav...
1    [{&#39;Division&#39;: &#39;Division 1&#39;, &#39;Person&#39;: &#39;Tim Dav...
2    [{&#39;Division&#39;: &#39;Division 2&#39;, &#39;Person&#39;: &#39;David W...
3    [{&#39;Division&#39;: &#39;Division 1&#39;, &#39;Person&#39;: &#39;Cameron...
dtype: object

You could .explode and .join to turn the result into columns.

columns = [&#39;level 1 person&#39;, &#39;level 2 person&#39;] 
df = (
df.drop(columns=columns)
.join(df[columns].apply(split_names, axis=1).rename(&#39;People&#39;))
.explode(&#39;People&#39;, ignore_index=True)
)
df = df.join(pd.DataFrame(df.pop(&#39;People&#39;).values.tolist())) 
   Record number       date  time spent on job    Division                Person
0              1 2023-01-01                  5  Division 1             Tim David
1              1 2023-01-01                  5  Division 1         Cameron Green
2              2 2023-02-02                  3  Division 1             Tim David
3              2 2023-02-02                  3  Division 2  Mitch, Eli Kin Marsh
4              3 2023-03-04                  1  Division 2          David Warner
5              3 2023-03-04                  1  Division 3           Travis Head
6              4 2023-07-01                  2  Division 1         Cameron Green
7              4 2023-07-01                  2  Division 1             Tim David

huangapple
  • 本文由 发表于 2023年6月15日 09:10:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76478458.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定