Python: Get rid of duplicate decimal points in a dataframe

Question

I have a dataframe of manually entered data points that should ideally all be numbers. There are, however, plenty of data quality issues, including values that contain more than one decimal point, as shown below:

                 A              B
    0      54.6464        46.8484
    1     64.68461           65.4
    2     95.79527         65.644
    3  484.644.161          45.45
    4     71.257.9           21.1
    5         12.8           10.8
    6          9.6           12.5
    7        312.4  12.787.57.674

Normally, if this were just a handful of values in a small dataframe, I'd fix them by hand, but for a larger dataframe that is too cumbersome. I want to get rid of every decimal point after the first, so that I get the following result:

                A          B
    0   54.646400  46.848400
    1   64.684610  65.400000
    2   95.795270  65.644000
    3  484.644161  45.450000
    4   71.257900  21.100000
    5   12.800000  10.800000
    6    9.600000  12.500000
    7  312.400000  12.787577

I've tried forcing a limit on the character length to get rid of the second decimal point, but it shows up at unpredictable positions, so the following logic does not work well here:

    df['A'] = df['A'].str.slice(0, 4)
    df['B'] = df['B'].str.slice(0, 4)
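
Before cleaning, it can help to see which cells are actually affected. The following is a minimal sketch, assuming the columns hold strings (values such as `484.644.161` cannot be stored as numbers) and reusing the sample values from the question; the names `counts`, `extra_points`, and `bad_rows` are illustrative only.

```python
import pandas as pd

# Sample data from the question, kept as strings.
df = pd.DataFrame({
    "A": ["54.6464", "64.68461", "95.79527", "484.644.161",
          "71.257.9", "12.8", "9.6", "312.4"],
    "B": ["46.8484", "65.4", "65.644", "45.45",
          "21.1", "10.8", "12.5", "12.787.57.674"],
})

# Count literal "." characters per cell. str.count takes a regex,
# so the dot has to be escaped.
counts = df.apply(lambda col: col.str.count(r"\."))

# True wherever a cell has more than one decimal point.
extra_points = counts > 1

# Rows with at least one malformed cell.
bad_rows = df[extra_points.any(axis=1)]
print(bad_rows)
```

With the sample data this flags rows 3, 4, and 7, which are the rows that need repair.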
Answer 1

Score: 1

I would do something like:

```python
def rectify_decimal(string):
    # Keep the first "." and join everything after it without separators.
    parts = string.split('.')
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    return parts[0]

df['A'] = df['A'].apply(rectify_decimal)
```
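
The snippet above cleans only column `A` and leaves the values as strings. As a hedged sketch of one way to extend it, the same function can be mapped over every column and the result converted to floats. The two-row dataframe here is a shortened version of the question's data, and `cleaned` is an illustrative name.

```python
import pandas as pd

def rectify_decimal(string):
    # Keep the first "." and join everything after it without separators.
    parts = string.split(".")
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    return parts[0]

# Shortened sample taken from the question's data.
df = pd.DataFrame({
    "A": ["484.644.161", "12.8"],
    "B": ["45.45", "12.787.57.674"],
})

# Map the cleaner over each column's cells, then convert to float.
cleaned = df.apply(lambda col: col.map(rectify_decimal)).astype(float)
print(cleaned)
```

Note that `astype(float)` raises if any value is still not numeric after cleaning, so this assumes the extra decimal points are the only problem in the data.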

Answer 2

Score: 1

With extended regex replacement:

    import re
    import pandas as pd
    # group 1: integer part and first "."; group 2: the rest, stray "." included
    pat = re.compile(r'^(\d+\.)([\d.]+)')
    # keep group 1 and strip every "." from group 2
    repl = lambda m: m.group(1) + m.group(2).replace(".", "")
    # regex=True is needed for a compiled pattern or callable in pandas >= 2.0
    df['A'] = pd.to_numeric(df['A'].str.replace(pat, repl, regex=True))
    df['B'] = pd.to_numeric(df['B'].str.replace(pat, repl, regex=True))

Output:

                A          B
    0   54.646400  46.848400
    1   64.684610  65.400000
    2   95.795270  65.644000
    3  484.644161  45.450000
    4   71.257900  21.100000
    5   12.800000  10.800000
    6    9.600000  12.500000
    7  312.400000  12.787577
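
To see what the pattern and the replacement function do without involving pandas, they can be tried on a few strings with the standard `re` module. This is only an illustrative sketch; `samples` and `fixed` are made-up names, and the values come from the question.

```python
import re

# Group 1: leading digits plus the first "."; group 2: everything after it.
pat = re.compile(r'^(\d+\.)([\d.]+)')

def repl(m):
    # Keep group 1 unchanged and strip every "." from group 2.
    return m.group(1) + m.group(2).replace(".", "")

samples = ["54.6464", "484.644.161", "12.787.57.674", "312.4"]
fixed = [pat.sub(repl, s) for s in samples]
print(fixed)
# ['54.6464', '484.644161', '12.78757674', '312.4']
```

Because the pattern requires at least one digit before the first point, a string that starts with a point (or contains none) does not match and is left as it is.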

Answer 3

Score: 1

Very similar to Leonid's answer, but without using .apply. I'm not sure which is better.

    import pandas as pd

    data = {
        'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
        'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
    }
    df = pd.DataFrame(data=data)

    for key in df:
        # Rebuild each value as "<before first .>.<rest with dots removed>".
        # This assumes every value contains at least one "."; a value
        # without one would get a trailing "." appended.
        df[key] = [x.split('.')[0] + '.' + ''.join(x.split('.')[1:]) for x in df[key].tolist()]

    print(df)

Output (the values are still strings):

                A            B
    0     54.6464      46.8484
    1    64.68461         65.4
    2    95.79527       65.644
    3  484.644161        45.45
    4     71.2579         21.1
    5        12.8         10.8
    6         9.6         12.5
    7       312.4  12.78757674
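
On the open question of which variant is better, one way to find out for a particular dataset is to time both. The sketch below is only a rough benchmark under stated assumptions: the 100,000-element series is synthetic (the question's values repeated), `with_apply` and `with_listcomp` are illustrative names, and the timings will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

# Synthetic test data: a few values from the question, repeated.
values = ["54.6464", "484.644.161", "12.787.57.674", "9.6"] * 25_000
s = pd.Series(values)

def rectify_decimal(string):
    # Keep the first "." and drop any later ones.
    parts = string.split(".")
    if len(parts) > 1:
        return f"{parts[0]}.{''.join(parts[1:])}"
    return parts[0]

def with_apply():
    # Variant 1: Series.apply calls the function once per element.
    return s.apply(rectify_decimal)

def with_listcomp():
    # Variant 2: plain list comprehension over the underlying values.
    return pd.Series([rectify_decimal(x) for x in s.tolist()], index=s.index)

for fn in (with_apply, with_listcomp):
    seconds = timeit.timeit(fn, number=5)
    print(f"{fn.__name__}: {seconds:.3f} s for 5 runs")
```

Both variants call the same Python function once per element, so any difference comes from the overhead around that call rather than from the string handling itself.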

Answer 4

Score: 1

This solution uses a regex to remove the extra decimal points instead of the function proposed by @Leonid Astrin, which makes it a shorter version:

    import pandas as pd

    # create a sample dataframe
    data = {
        'A': ['54.6464', '64.68461', '95.79527', '484.644.161', '71.257.9', '12.8', '9.6', '312.4'],
        'B': ['46.8484', '65.4', '65.644', '45.45', '21.1', '10.8', '12.5', '12.787.57.674']
    }
    df = pd.DataFrame(data)

    # use a regex to remove every decimal point after the first:
    # the first alternative captures the leading digits plus the first "."
    # and writes them back via \1; any later "." matches the second
    # alternative, where \1 is empty, so it is dropped
    df = df.replace(r'^(\d*\.)|\.', r'\1', regex=True)

    # convert columns to float
    df['A'] = df['A'].astype(float)
    df['B'] = df['B'].astype(float)

    print(df)

Output will be:

                A          B
    0   54.646400  46.848400
    1   64.684610  65.400000
    2   95.795270  65.644000
    3  484.644161  45.450000
    4   71.257900  21.100000
    5   12.800000  10.800000
    6    9.600000  12.500000
    7  312.400000  12.787577

Answer 5

Score: 0

If there is always a decimal point:

    df['A'].str.split('.').str[0] + '.' + df['A'].str.split('.').str[1:].str.join('')
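
The one-liner above relies on every value containing a decimal point. If that cannot be guaranteed, one possible variation is `Series.str.partition`, which splits at the first separator and returns an empty separator column when none is found. This is a sketch under that assumption; the value `"42"` is a made-up example that is not in the question's data.

```python
import pandas as pd

# "42" is a hypothetical value without any decimal point.
s = pd.Series(["484.644.161", "12.8", "42", "12.787.57.674"])

# Three columns: 0 = text before the first ".", 1 = the "." itself
# (empty string if there was none), 2 = everything after it.
parts = s.str.partition(".")

# Reassemble, stripping any remaining "." from the tail.
cleaned = parts[0] + parts[1] + parts[2].str.replace(".", "", regex=False)
print(cleaned.tolist())
# ['484.644161', '12.8', '42', '12.78757674']
```

Because the separator column is empty for `"42"`, no stray trailing point is added to values that had none.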

Answer 6

Score: 0

You can use some string manipulations:

```python
def convert(sr):
    # Split once at the first ".", strip any remaining "." from the
    # fractional part, join the halves again and convert to float.
    return (sr.str.split('.', n=1, expand=True)
              .pipe(lambda x: x[0] + '.' + x[1].str.replace('.', '', regex=False))
              .astype(float))

# Apply the conversion to every column.
df = df.apply(convert)
print(df)
```

Output:

```
            A          B
0   54.646400  46.848400
1   64.684610  65.400000
2   95.795270  65.644000
3  484.644161  45.450000
4   71.257900  21.100000
5   12.800000  10.800000
6    9.600000  12.500000
7  312.400000  12.787577
```
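
Since the question mentions several kinds of data quality issues, a conversion with `astype(float)` may still fail on values that are broken in other ways. A hedged sketch of one alternative is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN` so those rows can be inspected afterwards. The values `"n/a"` and `"12,5"` are hypothetical and do not appear in the question, and `drop_extra_points` is an illustrative helper name.

```python
import pandas as pd

# "n/a" and "12,5" are made-up examples of other entry errors.
raw = pd.Series(["71.257.9", "10.8", "n/a", "12,5"])

def drop_extra_points(text):
    # Keep the first "." and remove any later ones.
    head, sep, tail = text.partition(".")
    return head + sep + tail.replace(".", "")

cleaned = raw.map(drop_extra_points)

# Unparseable values become NaN instead of raising an error.
numbers = pd.to_numeric(cleaned, errors="coerce")

# Original entries that still could not be converted.
unparsed = raw[numbers.isna()]
print(numbers)
print(unparsed)
```

Here the first two values convert cleanly, while the last two end up in `unparsed` for manual review.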

huangapple
  • Posted on 2023-02-24 05:06:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75550329.html