Python: Get rid of duplicate decimals in a dataframe

# Question

I have a dataframe of manually entered data points that should ideally contain only numbers. However, there are plenty of data quality issues where a value contains more than one decimal point, as shown below:

```
             A              B
0      54.6464        46.8484
1     64.68461           65.4
2     95.79527         65.644
3  484.644.161          45.45
4     71.257.9           21.1
5         12.8           10.8
6          9.6           12.5
7        312.4  12.787.57.674
```

Normally, if these were just a handful of values in a small dataframe, I'd change them by hand. For a larger dataframe that would be too cumbersome. I want to get rid of the second decimal point (and any after it) so that I get the following result:

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```

I've tried to limit the character length to cut off the second decimal point, but it shows up at unpredictable positions, so the following logic does not work here:

```python
df['A'] = df['A'].str.slice(0, 4)
df['B'] = df['B'].str.slice(0, 4)
```
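Before cleaning, it can help to see which rows are actually affected. The sketch below is not part of the original question; it rebuilds the sample data as string columns (an assumption about how the data is stored) and flags every cell that contains more than one dot.

```python
import pandas as pd

# Sample data from the question, assumed to be stored as strings
df = pd.DataFrame({
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161',
          '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45',
          '21.1', '10.8', '12.5', '12.787.57.674'],
})

# str.count takes a regex, so the dot is escaped to count literal dots
dot_counts = df.apply(lambda col: col.str.count(r'\.'))

# True wherever a cell has more than one decimal point
malformed = dot_counts > 1

# Show only the rows with at least one malformed cell
bad_rows = df[malformed.any(axis=1)]
print(bad_rows)
```

Only the flagged rows need repair, and the same mask can be reused afterwards to confirm that no cell has more than one dot left.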
# Answer 1

**Score**: 1

I would do something like:

```python
def rectify_decimal(string):
    # Split on every dot: the first part is the integer part
    parts = string.split('.')
    if len(parts) > 1:
        # Keep the first dot and glue the remaining parts together
        return f"{parts[0]}.{''.join(parts[1:])}"
    else:
        # No dot at all: return the value unchanged
        return parts[0]

df['A'] = df['A'].apply(rectify_decimal)
```
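The snippet above cleans a single column and leaves the values as strings. As a minimal sketch (the three-row sample frame and the name `cleaned` are assumptions for illustration), the same helper can be mapped over every column and the result converted to floats:

```python
import pandas as pd

def rectify_decimal(string):
    # Keep the first dot, drop all later ones
    parts = string.split('.')
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    return parts[0]

# Small subset of the question's data, stored as strings
df = pd.DataFrame({
    'A': ['54.6464', '484.644.161', '312.4'],
    'B': ['46.8484', '45.45', '12.787.57.674'],
})

# Apply the helper element-wise to each column, then convert to float
cleaned = df.apply(lambda col: col.map(rectify_decimal)).astype(float)
print(cleaned)
```

`astype(float)` raises if any other kind of bad entry remains, which is a useful check once the extra dots are gone.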
# Answer 2

**Score**: 1

With a regex replacement that uses a callable:

```python
import re

import pandas as pd

# Group 1: integer part plus the first dot; group 2: the rest (digits and stray dots)
pat = re.compile(r'^(\d+\.)([\d.]+)')  # precompiled pattern
# Keep group 1 as it is and drop every dot from group 2
repl = lambda m: m.group(1) + m.group(2).replace(".", "")

# A compiled pattern or callable replacement requires regex=True
df.A = pd.to_numeric(df.A.str.replace(pat, repl, regex=True))
df.B = pd.to_numeric(df.B.str.replace(pat, repl, regex=True))
```

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```
# Answer 3

**Score**: 1

Very similar to Leonid's answer, but without using `.apply`. I'm not sure which is better.

```python
import pandas as pd

data = {
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
}
df = pd.DataFrame(data=data)

for key in df:
    # Integer part + '.' + all remaining parts joined without dots
    df[key] = [x.split('.')[0] + '.' + ''.join(x.split('.')[1:]) for x in df[key].tolist()]

print(df)
```

```
            A            B
0     54.6464      46.8484
1    64.68461         65.4
2    95.79527       65.644
3  484.644161        45.45
4     71.2579         21.1
5        12.8         10.8
6         9.6         12.5
7       312.4  12.78757674
```
# Answer 4

**Score**: 1

This solution uses a regex to remove the extra decimal points. It is a shorter alternative to the function proposed by @Leonid Astrin:

```python
import pandas as pd

# create a sample dataframe
data = {
    'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
    'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
}
df = pd.DataFrame(data)

# use regex to keep the first decimal point and remove every later one:
# the leading "digits + dot" is captured and written back via \1,
# any other dot matches the second alternative and is replaced with nothing
df = df.replace(r'^(\d*\.)|\.', r'\1', regex=True)

# convert columns to float
df['A'] = df['A'].astype(float)
df['B'] = df['B'].astype(float)

print(df)
```

Output will be:

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```
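When choosing a pattern for this task, it is worth checking on plain strings which dot actually survives. The sketch below uses only the standard `re` module and compares two candidates: a lookahead pattern, which removes every dot that is followed by another one and therefore keeps the last dot, and an alternation pattern, which keeps the first. The names `keep_last` and `keep_first` are made up for this illustration.

```python
import re

samples = ['54.6464', '484.644.161', '12.787.57.674', '312']

# Removes each dot that has another dot somewhere after it: the LAST dot survives
keep_last = re.compile(r'\.(?=.*\.)')

# Captures the leading "digits + dot" and writes it back; all other dots are dropped,
# so the FIRST dot survives (an unmatched group substitutes as '' in Python 3.5+)
keep_first = re.compile(r'^(\d*\.)|\.')

for s in samples:
    last = keep_last.sub('', s)
    first = keep_first.sub(r'\1', s)
    print(f"{s!r}: keep_last -> {last!r}, keep_first -> {first!r}")
```

For the result requested in the question (`484.644.161` becoming `484.644161`), the first dot has to be the one that is kept.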
# Answer 5

**Score**: 0

If there is always a decimal point:

```python
df['A'].str.split('.').str[0] + '.' + df['A'].str.split('.').str[1:].str.join('')
```
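To see what the "always a decimal point" condition means in practice, here is a small sketch that runs the same expression on a two-value series (hypothetical sample values, one of them without a dot). A value without a decimal point ends up with a trailing dot, which `pd.to_numeric` should still parse as a float.

```python
import pandas as pd

# Hypothetical sample: one malformed value and one value without any dot
s = pd.Series(['484.644.161', '312'])

# Split once and reuse the result instead of splitting twice
parts = s.str.split('.')

# First part + '.' + remaining parts joined without dots
cleaned = parts.str[0] + '.' + parts.str[1:].str.join('')
print(cleaned.tolist())

# Convert to numbers; the trailing dot in '312.' is accepted
numeric = pd.to_numeric(cleaned)
print(numeric.tolist())
```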
# Answer 6

**Score**: 0

You can use some string manipulation:

```python
def convert(sr):
    # Split each value once, at the first dot, into two columns (0 and 1)
    return (sr.str.split('.', n=1, expand=True)
              # Re-join: integer part + '.' + the rest with all remaining dots removed
              .pipe(lambda x: x[0] + '.' + x[1].str.replace('.', '', regex=False))
              .astype(float))

df = df.apply(convert)
print(df)
```

Output:

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```
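Manually entered data often has other problems besides extra dots. As a hedged variant of the string-manipulation idea (the function name `clean_column` and the sample values `'n/a'` and `''` are assumptions, not from the question), `Series.str.partition` keeps values without a dot intact, and `pd.to_numeric(..., errors='coerce')` turns anything that still is not a number into `NaN` instead of raising:

```python
import pandas as pd

def clean_column(col: pd.Series) -> pd.Series:
    # partition splits at the first dot into three columns:
    # 0 = text before the dot, 1 = the dot itself (or ''), 2 = text after it
    pieces = col.str.partition('.')
    tail = pieces[2].str.replace('.', '', regex=False)  # drop any later dots
    cleaned = pieces[0] + pieces[1] + tail
    # Anything that is still not numeric becomes NaN
    return pd.to_numeric(cleaned, errors='coerce')

# Hypothetical messy input: extra dots, no dot, text, and an empty string
df = pd.DataFrame({
    'A': ['71.257.9', '312', 'n/a'],
    'B': ['12.787.57.674', '10.8', ''],
})

result = df.apply(clean_column)
print(result)
```

The `NaN` cells can then be located with `result.isna()` and reviewed by hand.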