February 24, 2023 05:06:31

Python: Get rid of duplicate decimals in a dataframe

# Question

I have a dataframe of manually entered data points that should ideally all be numeric. There are, however, plenty of data quality issues where a value contains two or more decimal points, as shown below:

```
             A              B
0      54.6464        46.8484
1     64.68461           65.4
2     95.79527         65.644
3  484.644.161          45.45
4     71.257.9           21.1
5         12.8           10.8
6          9.6           12.5
7        312.4  12.787.57.674
```

Normally, if these were just a handful of values in a small dataframe, I would fix them by hand. For a larger dataframe that would be too cumbersome. I want to get rid of every decimal point after the first one, so that I get the following result:

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```

I have tried to force a limit on the character length to get rid of the second decimal point, but it pops up in unexpected places, so the following logic does not work well here:

```python
df['A'] = df['A'].str.slice(0, 4)
df['B'] = df['B'].str.slice(0, 4)
```
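
Before cleaning anything, it can help to see how many entries are affected. The following is a minimal sketch, assuming every cell is a string as in the sample above; the names `dot_counts` and `malformed` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question; manual entry leaves the values as strings.
df = pd.DataFrame({
    "A": ["54.6464", "64.68461", "95.79527", "484.644.161", "71.257.9", "12.8", "9.6", "312.4"],
    "B": ["46.8484", "65.4", "65.644", "45.45", "21.1", "10.8", "12.5", "12.787.57.674"],
})

# Count the literal dots in every cell; more than one marks a malformed entry.
dot_counts = df.apply(lambda col: col.str.count(r"\."))
malformed = dot_counts > 1

print(malformed.sum())             # number of malformed entries per column
print(df[malformed.any(axis=1)])   # rows with at least one malformed entry
```

On the sample data this flags two entries in `A` and one in `B` (rows 3, 4 and 7), and it shows why a fixed-width slice cannot work: the extra dots sit at different positions in each value.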
# Answer 1
**Score**: 1

I would do something like:

```python
def rectify_decimal(string):
    # Split on every dot: the first piece is the integer part,
    # the remaining pieces are glued together as the fractional part.
    parts = string.split('.')
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    else:
        return parts[0]

# The result is still a string column; convert it to a numeric dtype afterwards if needed.
df['A'] = df['A'].apply(rectify_decimal)
```
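
Manually entered columns often contain missing cells or values that are already numeric, and `string.split` fails on those. Below is a hedged variant of the same idea, not the answerer's code: it uses `str.partition`, skips missing values, and converts every column with `pd.to_numeric`. The sample frame is hypothetical.

```python
import pandas as pd

def rectify_decimal(value):
    """Keep the first decimal point of `value` and drop any later ones."""
    if pd.isna(value):
        return value                          # leave missing entries untouched
    head, sep, tail = str(value).partition(".")
    return head + sep + tail.replace(".", "")

# Hypothetical sample: a missing entry, a real float, and a value without a dot.
df = pd.DataFrame({
    "A": ["484.644.161", None, "12.8"],
    "B": ["12.787.57.674", 65.4, "7"],
})

# Clean each column cell by cell, then convert the whole column to numbers.
cleaned = df.apply(lambda col: pd.to_numeric(col.map(rectify_decimal)))
print(cleaned)
```

The missing entry comes through as `NaN`, and both columns end up with a float dtype.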

# Answer 2
**Score**: 1

With extended regex replacement:

```python
import re
import pandas as pd

pat = re.compile(r'^(\d+\.)([\d.]+)')  # precompiled pattern
# Keep group 1 (the digits plus the first dot) and strip the dots from the rest.
repl = lambda m: m.group(1) + m.group(2).replace(".", "")
# regex=True is needed for a compiled pattern or a callable replacement;
# since pandas 2.0 the default is regex=False.
df.A = pd.to_numeric(df.A.str.replace(pat, repl, regex=True))
df.B = pd.to_numeric(df.B.str.replace(pat, repl, regex=True))
```

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```

# Answer 3
**Score**: 1

Very similar to Leonid's answer but without using `.apply`. Unsure which is best.

```python
import pandas as pd

data = {
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
}
df = pd.DataFrame(data=data)

for key in df:
    # Keep the part before the first dot, then join the remaining parts without dots.
    df[key] = [x.split('.')[0] + '.' + ''.join(x.split('.')[1:]) for x in df[key].tolist()]

print(df)
```

```
            A            B
0     54.6464      46.8484
1    64.68461         65.4
2    95.79527       65.644
3  484.644161        45.45
4     71.2579         21.1
5        12.8         10.8
6         9.6         12.5
7       312.4  12.78757674
```
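
To decide between `.apply` and a list comprehension, one option is simply to time both on data of a realistic size. This is a rough sketch, assuming the question's values repeated to 80,000 rows as a stand-in for a large column; the helper names are hypothetical, and the timings will vary with the machine and the pandas version.

```python
import timeit
import pandas as pd

def rectify_decimal(text):
    # Same split-and-join logic as the answers above.
    parts = text.split(".")
    return parts[0] + "." + "".join(parts[1:])

# Hypothetical benchmark column built from the question's column A.
values = ["54.6464", "64.68461", "95.79527", "484.644.161", "71.257.9", "12.8", "9.6", "312.4"]
s = pd.Series(values * 10_000)

def with_apply():
    return s.apply(rectify_decimal)

def with_comprehension():
    return pd.Series([rectify_decimal(x) for x in s.tolist()], index=s.index)

# Both variants should agree before their speed is compared.
assert with_apply().equals(with_comprehension())

for fn in (with_apply, with_comprehension):
    print(f"{fn.__name__}: {timeit.timeit(fn, number=5):.3f} s for 5 runs")
```

Both variants call a Python function once per element, so large differences are not to be expected; measuring on your own data is more reliable than guessing.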

# Answer 4
**Score**: 1

This solution uses a regex to remove the extra decimal points instead of the function proposed by @Leonid Astrin; it reads like a shortened version of it:

```python
import pandas as pd

# create a sample dataframe
data = {
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
}
df = pd.DataFrame(data)

# use a regex to remove every decimal point after the first one:
# a dot is dropped when an earlier dot plus optional digits (kept via \1) precede it
df = df.replace(r'(?<=\.)(\d*)\.', r'\1', regex=True)

# convert columns to float
df['A'] = df['A'].astype(float)
df['B'] = df['B'].astype(float)
print(df)
```

Output will be:

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```
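
Lookaround patterns for this task are easy to get backwards, so it can help to check a pattern with plain `re.sub` before running it over a dataframe. The sketch below compares a lookahead pattern, `\.(?=.*\.)`, with a lookbehind pattern, `(?<=\.)(\d*)\.`. The sample strings come from the question, and the names `keep_last` and `keep_first` are illustrative only.

```python
import re

samples = ["484.644.161", "12.787.57.674", "65.4"]

# Lookahead: matches every dot that still has another dot to its right,
# so the LAST dot is the one that survives.
keep_last = re.compile(r"\.(?=.*\.)")

# Lookbehind: matches a dot that follows an earlier dot and a run of digits,
# so the FIRST dot is the one that survives. The digits are restored via \1.
keep_first = re.compile(r"(?<=\.)(\d*)\.")

for text in samples:
    print(text, "->", keep_last.sub("", text), "|", keep_first.sub(r"\1", text))
```

For `484.644.161` the lookahead version yields `484644.161`, while the lookbehind version yields `484.644161`, which is the result the question asks for.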

# Answer 5
**Score**: 0

If there is always a decimal point:

```python
df['A'].str.split('.').str[0] + '.' + df['A'].str.split('.').str[1:].str.join('')
```
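
When a value has no decimal point at all, the expression above leaves a trailing dot (`'42'` becomes `'42.'`). A possible alternative, sketched here under the assumption that the column holds only strings, is `Series.str.partition`, which splits at the first dot and keeps the separator in its own column. The sample series is hypothetical.

```python
import pandas as pd

# Hypothetical column that mixes malformed values with an integer-like entry.
s = pd.Series(["484.644.161", "71.257.9", "12.8", "42"])

# partition splits at the FIRST dot into three columns: head (0), separator (1), tail (2).
parts = s.str.partition(".")
cleaned = parts[0] + parts[1] + parts[2].str.replace(".", "", regex=False)

print(cleaned.tolist())
# ['484.644161', '71.2579', '12.8', '42']
```

Because the separator column is empty when no dot is found, values such as `'42'` pass through unchanged.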

# Answer 6
**Score**: 0

You can use some string manipulations:

```python
def convert(sr):
    # Split once at the first dot, strip the remaining dots from the
    # right-hand part, then cast the reassembled strings to float.
    return (sr.str.split('.', n=1, expand=True)
              .pipe(lambda x: x[0] + '.' + x[1].str.replace('.', '', regex=False))
              .astype(float))

df = df.apply(convert)
print(df)
```

Output:

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```
