转换具有各种类型数字单位的列。

huangapple go评论95阅读模式
英文:

Converting column with various types of numerical units

问题

I will provide translations for the code-related portions:

  1. # 我有一个包含各种单位类型的数据框
  2. # rawNo 列是数据的原始形式。我想要将它转换成类似 ConvNo 列的形式
  3. datasample = pd.DataFrame(columns=['rawNo','ConvNo'])
  4. datasample = datasample.append({'rawNo': '-4.35%','ConvNo': -.0435},ignore_index = True)
  5. datasample = datasample.append({'rawNo': '246.6K','ConvNo': 246600},ignore_index = True)
  6. datasample = datasample.append({'rawNo': np.nan,'ConvNo': np.nan},ignore_index = True)
  7. datasample = datasample.append({'rawNo': '$12.76B','ConvNo': 12760000000},ignore_index = True)
  8. datasample = datasample.append({'rawNo': '4.68%','ConvNo': .0468},ignore_index = True)
  9. datasample = datasample.append({'rawNo': '¥-459.5B','ConvNo': -459500000000},ignore_index = True)
  10. datasample = datasample.append({'rawNo': '€-6.8B','ConvNo': -6800000000},ignore_index = True)
  11. datasample = datasample.append({'rawNo': '£-15.623B','ConvNo': -15623000000},ignore_index = True)
  12. datasample = datasample.append({'rawNo': '$-1,400B','ConvNo': -1400000000},ignore_index = True)
  1. # 我认为我需要使用某种条件应用。尝试删除百分比的应用失败了
  2. def rPercent(value):
  3. value = str(value)
  4. count = value.count('%')
  5. print(count)
  6. if (count != 0):
  7. return value.rstrip('% ').astype('float') / 100.0
  8. else:
  9. return value
  10. datasample["ConvNo"] = datasample['rawNo'].apply(rPercent)
  1. # 我得到的错误是:
  2. # AttributeError: 'str' object has no attribute 'str'

关于数据文件和其他列的转换,你可以参考提供的链接下载文件并继续转换你的数据。

英文:

I have a dataframe with a column that has various unit types

The rawNo column is how the data comes in. I would like to change it to look like the ConvNo column

  1. datasample = pd.DataFrame(columns=['rawNo','ConvNo'])
  2. datasample = datasample.append({'rawNo': '-4.35%','ConvNo': -.0435},ignore_index = True)
  3. datasample = datasample.append({'rawNo': '246.6K','ConvNo': 246600},ignore_index = True)
  4. datasample = datasample.append({'rawNo': np.nan,'ConvNo': np.nan},ignore_index = True)
  5. datasample = datasample.append({'rawNo': '$12.76B','ConvNo': 12760000000},ignore_index = True)
  6. datasample = datasample.append({'rawNo': '4.68%','ConvNo': .0468},ignore_index = True)
  7. datasample = datasample.append({'rawNo': '¥-459.5B','ConvNo': -459500000000},ignore_index = True)
  8. datasample = datasample.append({'rawNo': '€-6.8B','ConvNo': -6800000000},ignore_index = True)
  9. datasample = datasample.append({'rawNo': '£-15.623B','ConvNo': -15623000000},ignore_index = True)
  10. datasample = datasample.append({'rawNo': '$-1,400B','ConvNo': -15623000000},ignore_index = True)

I figure I will have to use some type of conditional apply. This apply to remove the percent is failing

  1. def rPercent(value):
  2. value = str(value)
  3. count = value.count('%')
  4. print(count)
  5. if (count != 0):
  6. return value.str.rstrip('% ').astype('float') / 100.0
  7. else:
  8. return value
  9. datasample["ConvNo"] = datasample['actual'].apply(rPercent)

Error I get:

  1. > AttributeError: 'str' object has no attribute 'str'

Data File. you can download file from this link
https://projectcodesamples.s3.amazonaws.com/ForEx.csv

The columns I am trying to convert is "actual" the result is in the "CNactual" column

答案1

得分: 1

你所使用的函数对每个单独的值执行操作,但为了提高效率,最好执行矢量化操作。一个具体的技巧是将特定字符映射到它们在科学计数法中的等效形式(例如,参见这个答案来处理%)。

  1. from numpy import nan
  2. from pandas import DataFrame
  3. data = [
  4. {"rawNo": "-4.35%", "ConvNo": -0.0435},
  5. {"rawNo": "246.6K", "ConvNo": 246600},
  6. {"rawNo": nan, "ConvNo": nan},
  7. {"rawNo": "$12.76B", "ConvNo": 12760000000},
  8. {"rawNo": "4.68%", "ConvNo": 0.0468},
  9. {"rawNo": "¥-459.5B", "ConvNo": -459500000000},
  10. {"rawNo": "€-6.8B", "ConvNo": -6800000000},
  11. {"rawNo": "£-15.623B", "ConvNo": -15623000000},
  12. ]
  13. df = DataFrame(data)
  14. # 请注意,货币符号映射为空字符串,以便于转换为浮点数
  15. character_mapping = {
  16. "%": "e-2",
  17. "K": "e3",
  18. "M": "e6",
  19. "B": "e9",
  20. "£": "",
  21. "$": "",
  22. "¥": "",
  23. "€": "",
  24. }
  25. df["Converted"] = df["rawNo"].replace(character_mapping, regex=True).astype(float, errors='ignore')
  26. print(df)
  27. # rawNo ConvNo Converted
  28. # 0 -4.35% -4.350000e-02 -4.350000e-02
  29. # 1 246.6K 2.466000e+05 2.466000e+05
  30. # 2 NaN NaN NaN
  31. # 3 $12.76B 1.276000e+10 1.276000e+10
  32. # 4 4.68% 4.680000e-02 4.680000e-02
  33. # 5 ¥-459.5B -4.595000e+11 -4.595000e+11
  34. # 6 €-6.8B -6.800000e+09 -6.800000e+09
  35. # 7 £-15.623B -1.562300e+10 -1.562300e+10

请注意,美元符号必须转义,因为它是一个特殊的正则表达式字符。还要注意,.astypeerrors="ignore" 的关键字参数,这将在转换为浮点数失败时保留原始值(但将允许管道运行而不引发错误)。

英文:

The function you use performs operations on every individual value, but for efficiency it's better to perform vectorized operations. One specific trick is to map the specific characters to their equivalent in scientific notation (e.g. see this answer for handling just the %).

  1. from numpy import nan
  2. from pandas import DataFrame
  3. data = [
  4. {"rawNo": "-4.35%", "ConvNo": -0.0435},
  5. {"rawNo": "246.6K", "ConvNo": 246600},
  6. {"rawNo": nan, "ConvNo": nan},
  7. {"rawNo": "$12.76B", "ConvNo": 12760000000},
  8. {"rawNo": "4.68%", "ConvNo": 0.0468},
  9. {"rawNo": "¥-459.5B", "ConvNo": -459500000000},
  10. {"rawNo": "-6.8B", "ConvNo": -6800000000},
  11. {"rawNo": "£-15.623B", "ConvNo": -15623000000},
  12. ]
  13. df = DataFrame(data)
  14. # note that the currency symbols are mapped to blank string
  15. # to facilitate the conversion to float
  16. character_mapping = {
  17. "%": "e-2",
  18. "K": "e3",
  19. "M": "e6",
  20. "B": "e9",
  21. "£": "",
  22. "\$": "",
  23. "¥": "",
  24. "": "",
  25. }
  26. df["Converted"] = df["rawNo"].replace(character_mapping, regex=True).astype(float, errors='ignore')
  27. print(df)
  28. # rawNo ConvNo Converted
  29. # 0 -4.35% -4.350000e-02 -4.350000e-02
  30. # 1 246.6K 2.466000e+05 2.466000e+05
  31. # 2 NaN NaN NaN
  32. # 3 $12.76B 1.276000e+10 1.276000e+10
  33. # 4 4.68% 4.680000e-02 4.680000e-02
  34. # 5 ¥-459.5B -4.595000e+11 -4.595000e+11
  35. # 6 €-6.8B -6.800000e+09 -6.800000e+09
  36. # 7 £-15.623B -1.562300e+10 -1.562300e+10

Note that dollar sign had to be escaped because it's a special regex character. Also note that .astype has kwarg errors="ignore", this will retain the original values if conversion to float failed (but will allow the pipeline to run without raising errors).

答案2

得分: 1

在你的输出中,你失去了货币类型信息。为了进一步操作,最好在失去这些信息之前进行货币转换。就像 @SultanOrazbayev 提到的,我们可以使用字典来执行转换:

  1. factors = {
  2. '%': '0.01',
  3. 'K': '1e3',
  4. 'M': '1e6',
  5. 'B': '1e9',
  6. 'T': '1e12',
  7. np.nan: '1'
  8. }
  9. # 转换成美元
  10. currencies = {
  11. '£': '1.28',
  12. '$': '1',
  13. '¥': '0.007',
  14. '€': '1.09',
  15. np.nan: '1'
  16. }
  17. values = {
  18. ",": "",
  19. "'": ""
  20. }
  21. # 分割前缀(货币)、值和后缀(因子)
  22. dmap = {'currency': currencies, 'value': values, 'factor': factors}
  23. pattern = r'(?P<currency>[^\d\-\.]+)?(?P<value>[^%KMBT]+)(?P<factor>.+)?'
  24. # 替换并计算总数
  25. df1 = datasample['rawNo'].str.extract(pattern).replace(dmap, regex=True).astype(float)
  26. datasample['ConvNo'] = df1['currency'] * df1['value'] * df1['factor']

输出:

  1. >>> datasample
  2. rawNo ConvNo
  3. 0 -4.35% -4.350000e-02 # 值 * 0.01
  4. 1 246.6K 2.466000e+05 # 值 * 1000
  5. 2 NaN NaN
  6. 3 $12.76B 1.276000e+10 # 1 * 值 * 1000000000
  7. 4 4.68% 4.680000e-02 # 值 * 0.01
  8. 5 ¥-459.5B -3.216500e+09 # 0.007 * 值 * 1000000000
  9. 6 -6.8B -7.412000e+09 # 1.09 * 值 * 1000000000
  10. 7 £-15.623B -1.999744e+10 # 1.28 * 值 * 1000000000
  11. 8 $-1,400B -1.400000e+12 # 1 * 值 * 1000000000
  12. >>> df1
  13. currency value factor
  14. 0 1.000 -4.350 1.000000e-02
  15. 1 1.000 246.600 1.000000e+03
  16. 2 1.000 NaN 1.000000e+00
  17. 3 1.000 12.760 1.000000e+09
  18. 4 1.000 4.680 1.000000e-02
  19. 5 0.007 -459.500 1.000000e+09
  20. 6 1.090 -6.800 1.000000e+09
  21. 7 1.280 -15.623 1.000000e+09
  22. 8 1.000 -1400.000 1.000000e+09
英文:

In your output, you lost the currency type. To go further, It would be great to convert the currencies before losing this information. As @SultanOrazbayev, we can use dictionaries to perform the conversion:

  1. factors = {
  2. &#39;%&#39;: &#39;0.01&#39;,
  3. &#39;K&#39;: &#39;1e3&#39;,
  4. &#39;M&#39;: &#39;1e6&#39;,
  5. &#39;B&#39;: &#39;1e9&#39;,
  6. &#39;T&#39;: &#39;1e12&#39;,
  7. np.nan: &#39;1&#39;
  8. }
  9. # to US Dollars
  10. currencies = {
  11. &#39;&#163;&#39;: &#39;1.28&#39;,
  12. &#39;\$&#39;: &#39;1&#39;,
  13. &#39;&#165;&#39;: &#39;0.007&#39;,
  14. &#39;€&#39;: &#39;1.09&#39;,
  15. np.nan: &#39;1&#39;
  16. }
  17. values = {
  18. &quot;,&quot;: &quot;&quot;,
  19. &quot;&#39;&quot; : &quot;&quot;
  20. }
  21. # Split prefix (currency), value and suffix (factor)
  22. dmap = {&#39;currency&#39;: currencies, &#39;value&#39;: values, &#39;factor&#39;: factors}
  23. pattern = r&#39;(?P&lt;currency&gt;[^\d\-\.]+)?(?P&lt;value&gt;[^%KMBT]+)(?P&lt;factor&gt;.+)?&#39;
  24. # Replace and compute the total
  25. df1 = datasample[&#39;rawNo&#39;].str.extract(pattern).replace(dmap, regex=True).astype(float)
  26. datasample[&#39;ConvNo&#39;] = df1[&#39;currency&#39;] * df1[&#39;value&#39;] * df1[&#39;factor&#39;]

Output:

  1. &gt;&gt;&gt; datasample
  2. rawNo ConvNo
  3. 0 -4.35% -4.350000e-02 # val * 0.01
  4. 1 246.6K 2.466000e+05 # val * 1000
  5. 2 NaN NaN
  6. 3 $12.76B 1.276000e+10 # 1 * val * 1000000000
  7. 4 4.68% 4.680000e-02 # val * 0.01
  8. 5 &#165;-459.5B -3.216500e+09 # 0.07 * val * 1000000000
  9. 6 €-6.8B -7.412000e+09 # 1.09 * val * 1000000000
  10. 7 &#163;-15.623B -1.999744e+10 # 1.28 * val * 1000000000
  11. 8 $-1,400B -1.400000e+12 # 1 * val * 1000000000
  12. &gt;&gt;&gt; df1
  13. currency value factor
  14. 0 1.000 -4.350 1.000000e-02
  15. 1 1.000 246.600 1.000000e+03
  16. 2 1.000 NaN 1.000000e+00
  17. 3 1.000 12.760 1.000000e+09
  18. 4 1.000 4.680 1.000000e-02
  19. 5 0.007 -459.500 1.000000e+09
  20. 6 1.090 -6.800 1.000000e+09
  21. 7 1.280 -15.623 1.000000e+09
  22. 8 1.000 -1400.000 1.000000e+09

答案3

得分: 1

以下是您要翻译的内容:

这是作为单独回答发布的,因为我希望保留原始的可重现示例(以防链接的 CSV 不再可用)。

对数据进行仔细分析将显示导致问题的非数字字符是:逗号用作千位分隔符,单破折号符号(可能表示 'nan')。将这些内容合并到'character_mapping'中后,转换将在不引发任何错误的情况下完成:

  1. from numpy import nan
  2. from pandas import read_csv
  3. df = read_csv("https://projectcodesamples.s3.amazonaws.com/ForEx.csv")
  4. character_mapping = {
  5. "%" : "e-2",
  6. "K" : "e3",
  7. "M" : "e6",
  8. "B" : "e9",
  9. "T" : "e12",
  10. # 注意,货币符号映射为空字符串
  11. # 以便于转换为浮点数
  12. "£" : "",
  13. "$" : "",
  14. "¥" : "",
  15. "€" : "",
  16. # 如果一个条目只包含破折号,替换为nan
  17. "^-$" : nan,
  18. # 一些数字使用逗号作为千位分隔符
  19. "," : "",
  20. }
  21. df["Converted"] = df["consensus"].replace(character_mapping, regex=True).astype(float)
  22. mask = (df["Converted"] != df["consensus"]) & (df["consensus"].notna())
  23. print(df.loc[mask, ["consensus", "Converted"]])
英文:

This is posted as a separate answer, since I want to retain the original reproducible example (in case the linked csv is no longer available).

A careful analysis of the data will show that the non-numeric characters that cause trouble are: commas used as thousand separators, single dash symbols (presumably indicating nan). After incorporating these into the character_mapping the conversion works without raising any errors:

  1. from numpy import nan
  2. from pandas import read_csv
  3. df = read_csv(&quot;https://projectcodesamples.s3.amazonaws.com/ForEx.csv&quot;)
  4. character_mapping = {
  5. &quot;%&quot;: &quot;e-2&quot;,
  6. &quot;K&quot;: &quot;e3&quot;,
  7. &quot;M&quot;: &quot;e6&quot;,
  8. &quot;B&quot;: &quot;e9&quot;,
  9. &quot;T&quot;: &quot;e12&quot;,
  10. # note that the currency symbols are mapped to blank string
  11. # to facilitate the conversion to float
  12. &quot;&#163;&quot;: &quot;&quot;,
  13. &quot;\$&quot;: &quot;&quot;,
  14. &quot;&#165;&quot;: &quot;&quot;,
  15. &quot;&quot;: &quot;&quot;,
  16. # if an entry consists of a dash, replace with nan
  17. &quot;^-$&quot;: nan,
  18. # some numbers have a comma as a thousands separator
  19. &quot;,&quot;: &quot;&quot;,
  20. }
  21. df[&quot;Converted&quot;] = df[&quot;consensus&quot;].replace(character_mapping, regex=True).astype(float)
  22. mask = (df[&quot;Converted&quot;] != df[&quot;consensus&quot;]) &amp; (df[&quot;consensus&quot;].notna())
  23. print(df.loc[mask, [&quot;consensus&quot;, &quot;Converted&quot;]])

huangapple
  • 本文由 发表于 2023年6月12日 08:59:56
  • 转载请务必保留本文链接:https://go.coder-hub.com/76453108.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定