Python: Get rid of duplicate decimal points in a dataframe

Question

I have a dataframe of manually entered data points that should ideally contain only numbers. There are, however, plenty of data quality issues: some values contain two or more decimal points, as shown below:

             A              B
0      54.6464        46.8484
1     64.68461           65.4
2     95.79527         65.644
3  484.644.161          45.45
4     71.257.9           21.1
5         12.8           10.8
6          9.6           12.5
7        312.4  12.787.57.674

If this were just a handful of values in a small dataframe, I would fix them by hand, but for a larger dataframe that is too cumbersome. I want to drop every decimal point after the first one, so that I get the following result:

            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577

I tried limiting the character length to cut off the second decimal point, but it appears at unpredictable positions, so the following logic does not work here:

df['A'] = df['A'].str.slice(0, 4)
df['B'] = df['B'].str.slice(0, 4)
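
To make the problem concrete, here is a minimal sketch that rebuilds the sample data as strings, shows what fixed-width slicing does to it, and counts the dots per value to find the rows that need repair. The variable names `sliced`, `dot_counts` and `malformed` are illustrative and not part of the original question.

```python
import pandas as pd

# Sample data copied from the question; every value is stored as a string
df = pd.DataFrame({
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674'],
})

# Fixed-width slicing truncates every value, whether it was valid or not
sliced = df['A'].str.slice(0, 4)
print(sliced.tolist())

# Counting literal dots flags only the rows that actually need repair
dot_counts = df.apply(lambda col: col.str.count(r'\.'))
malformed = df[(dot_counts > 1).any(axis=1)]
print(malformed)
```

Slicing shortens valid values such as `54.6464` as well, which is why the answers work on the position of the dots rather than on string length.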


# Answer 1
**Score**: 1

I would do something like:

```python
def rectify_decimal(string):
    # Keep the first decimal point and drop any later ones
    parts = string.split('.')
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    else:
        return parts[0]

df['A'] = df['A'].apply(rectify_decimal)
```
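
As a rough sketch of how a per-string function like this could be applied to both columns and then converted to numbers, assuming the sample data from the question is stored as strings (the `cleaned` name is illustrative):

```python
import pandas as pd

def rectify_decimal(string):
    # Keep the first decimal point and drop any later ones
    parts = string.split('.')
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    return parts[0]

df = pd.DataFrame({
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674'],
})

# Repair every column element by element, then cast the repaired strings to float
cleaned = df.apply(lambda col: col.map(rectify_decimal)).astype(float)
print(cleaned)
```

The function itself returns strings, so the `astype(float)` step is what produces numeric columns.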

# Answer 2
**Score**: 1

With extended regex replacement:

import re
import pandas as pd

pat = re.compile(r'^(\d+\.)([\d.]+)')  # precompiled pattern: leading "digits." plus the remainder
repl = lambda m: m.group(1) + m.group(2).replace(".", "")  # strip every dot from the remainder
# regex=True is needed for a compiled pattern with a callable replacement
df.A = pd.to_numeric(df.A.str.replace(pat, repl, regex=True))
df.B = pd.to_numeric(df.B.str.replace(pat, repl, regex=True))

            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
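
To see what the pattern and the replacement callable do on individual strings, here is a small sketch using only the standard `re` module. The `samples` list and the negative value are illustrative inputs, not data from the question.

```python
import re

pat = re.compile(r'^(\d+\.)([\d.]+)')

def repl(m):
    # group(1) is the leading "digits." part; group(2) is everything after it
    return m.group(1) + m.group(2).replace('.', '')

samples = ['54.6464', '484.644.161', '12.787.57.674', '312.4']
fixed = [pat.sub(repl, s) for s in samples]
print(fixed)  # ['54.6464', '484.644161', '12.78757674', '312.4']

# The pattern is anchored on leading digits, so a value that starts with
# a sign (hypothetical input) is left untouched
print(pat.sub(repl, '-1.2.3'))  # -1.2.3
```

If the data may contain signed values, the pattern would need an optional sign in its first group.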

# Answer 3
**Score**: 1

Very similar to Leonid's answer, but without using .apply. I am unsure which is better.

import pandas as pd

data = {
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
}

df = pd.DataFrame(data=data)

for key in df:
    # keep the text before the first dot, then append the rest with its dots removed
    df[key] = [x.split('.')[0] + '.' + ''.join(x.split('.')[1:]) for x in df[key].tolist()]

print(df)
            A            B
0     54.6464      46.8484
1    64.68461         65.4
2    95.79527       65.644
3  484.644161        45.45
4     71.2579         21.1
5        12.8         10.8
6         9.6         12.5
7       312.4  12.78757674
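
One way to settle the question of `.apply` versus a list comprehension is to time both on a larger input. The sketch below is only an assumption about how such a comparison could be set up: the repeated sample column, the helper names `with_apply` and `with_listcomp`, and the repeat counts are all illustrative, and the timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

def rectify_decimal(string):
    # Keep the first decimal point and drop any later ones
    parts = string.split('.')
    return f"{parts[0]}.{''.join(parts[1:])}" if len(parts) > 1 else parts[0]

# Repeat one sample column to get a series large enough to time
s = pd.Series(['54.6464', '64.68461', '95.79527', '484.644.161',
               '71.257.9', '12.8', '9.6', '312.4'] * 1000)

def with_apply():
    return s.apply(rectify_decimal)

def with_listcomp():
    return pd.Series([x.split('.')[0] + '.' + ''.join(x.split('.')[1:]) for x in s.tolist()],
                     index=s.index)

# Both variants should produce the same strings before their speed is compared
assert with_apply().tolist() == with_listcomp().tolist()

print('apply:             ', timeit.timeit(with_apply, number=5))
print('list comprehension:', timeit.timeit(with_listcomp, number=5))
```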

# Answer 4
**Score**: 1

This solution uses a regex to remove the extra decimal points instead of the function proposed by @Leonid Astrin; it reads like a shortened version:

import pandas as pd

# create a sample dataframe
data = {
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
}
df = pd.DataFrame(data)

# use regex to remove every decimal point after the first:
# the leading "digits." is captured and written back via \1, while any later dot
# matches the second branch, where group 1 is empty, and is dropped
df = df.replace(r'^(\d*\.)|\.', r'\1', regex=True)

# convert columns to float
df['A'] = df['A'].astype(float)
df['B'] = df['B'].astype(float)

print(df)

Output will be:

            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
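
Which decimal point survives depends on how the pattern is written, so it is worth checking a candidate regex on a single value before running it over a dataframe. The sketch below uses only the standard `re` module; the names `keep_last` and `keep_first` are illustrative.

```python
import re

value = '12.787.57.674'

# Lookahead: removes each dot that still has another dot after it,
# so the LAST dot is the one that survives
keep_last = re.sub(r'\.(?=.*\.)', '', value)

# Alternation: the leading "digits." is captured and written back via \1;
# any later dot matches the second branch, where group 1 is empty
keep_first = re.sub(r'^(\d*\.)|\.', r'\1', value)

print(keep_last)   # 1278757.674
print(keep_first)  # 12.78757674
```

Only the second form matches the result requested in the question, where the integer part stays in front of the first decimal point.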

# Answer 5
**Score**: 0

If there is always a decimal point:

df['A'].str.split('.').str[0] + '.' + df['A'].str.split('.').str[1:].str.join('')
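
If some values might have no decimal point at all, a variant based on `Series.str.partition` avoids appending a stray dot. This is a sketch under that assumption; the value `'42'` is a hypothetical input that does not appear in the question.

```python
import pandas as pd

s = pd.Series(['484.644.161', '12.8', '42', '12.787.57.674'])

# partition splits at the first dot into three columns: head, separator, tail.
# When there is no dot, the separator and tail are empty strings.
parts = s.str.partition('.')
cleaned = parts[0] + parts[1] + parts[2].str.replace('.', '', regex=False)

print(cleaned.tolist())  # ['484.644161', '12.8', '42', '12.78757674']
```

Because the separator column is empty when no dot is present, whole numbers pass through unchanged.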

# Answer 6
**Score**: 0

You can use some string manipulations:

```python
def convert(sr):
    # Split once at the first dot, strip the remaining dots from the tail, then cast
    return (sr.str.split('.', n=1, expand=True)
              .pipe(lambda x: x[0] + '.' + x[1].str.replace('.', '', regex=False))
              .astype(float))

df = df.apply(convert)
print(df)

# Output
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```
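
With a single split at the first dot, a value that has no decimal point gets no second part, so the concatenation yields a missing value for that row. The following sketch shows one possible guard, assuming at least one value in each column does contain a dot (otherwise the split produces no second column at all). The name `convert_safe` and the sample frame are illustrative.

```python
import pandas as pd

def convert_safe(sr):
    parts = sr.str.split('.', n=1, expand=True)
    # Rows without a dot have no second part; treat it as an empty string
    tail = parts[1].fillna('').str.replace('.', '', regex=False)
    return (parts[0] + '.' + tail).astype(float)

# Hypothetical frame mixing whole numbers with malformed decimals
df = pd.DataFrame({
    'A': ['484.644.161', '12', '9.6'],
    'B': ['45.45', '12.787.57.674', '7'],
})

print(df.apply(convert_safe))
```

A whole number such as `'12'` becomes `'12.'` before the cast, which still parses as `12.0`.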



huangapple
  • Published on February 24, 2023 at 05:06:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75550329.html