Converting column with various types of numerical units

Question

I have a dataframe with a column that contains values in various unit types.

The rawNo column is how the data comes in. I would like to convert it to look like the ConvNo column:

import numpy as np
import pandas as pd

# rawNo is the incoming data; ConvNo is the desired numeric result
datasample = pd.DataFrame(
    [
        {'rawNo': '-4.35%', 'ConvNo': -.0435},
        {'rawNo': '246.6K', 'ConvNo': 246600},
        {'rawNo': np.nan, 'ConvNo': np.nan},
        {'rawNo': '$12.76B', 'ConvNo': 12760000000},
        {'rawNo': '4.68%', 'ConvNo': .0468},
        {'rawNo': '¥-459.5B', 'ConvNo': -459500000000},
        {'rawNo': '€-6.8B', 'ConvNo': -6800000000},
        {'rawNo': '£-15.623B', 'ConvNo': -15623000000},
        {'rawNo': '$-1,400B', 'ConvNo': -1400000000000},
    ]
)

I figure I will have to use some kind of conditional apply. This attempt to strip the percent sign with apply is failing:

def rPercent(value):
    value = str(value)
    count = value.count('%')
    print(count)
    
    if (count != 0):
        return value.str.rstrip('% ').astype('float') / 100.0
    else:
        return value

datasample["ConvNo"] = datasample['rawNo'].apply(rPercent)

Error I get:

> AttributeError: 'str' object has no attribute 'str'

Data file: you can download it from this link:
https://projectcodesamples.s3.amazonaws.com/ForEx.csv

The column I am trying to convert is "actual"; the result is in the "CNactual" column.

Answer 1

Score: 1

The function you use performs operations on every individual value, but for efficiency it's better to perform vectorized operations. One specific trick is to map the specific characters to their equivalent in scientific notation (e.g. see this answer for handling just the %).

from numpy import nan
from pandas import DataFrame

data = [
    {"rawNo": "-4.35%", "ConvNo": -0.0435},
    {"rawNo": "246.6K", "ConvNo": 246600},
    {"rawNo": nan, "ConvNo": nan},
    {"rawNo": "$12.76B", "ConvNo": 12760000000},
    {"rawNo": "4.68%", "ConvNo": 0.0468},
    {"rawNo": "¥-459.5B", "ConvNo": -459500000000},
    {"rawNo": "€-6.8B", "ConvNo": -6800000000},
    {"rawNo": "£-15.623B", "ConvNo": -15623000000},
]

df = DataFrame(data)

# note that the currency symbols are mapped to blank string
# to facilitate the conversion to float
character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "£": "",
    r"\$": "",
    "¥": "",
    "€": "",
}

df["Converted"] = df["rawNo"].replace(character_mapping, regex=True).astype(float, errors='ignore')

print(df)
#        rawNo        ConvNo     Converted
# 0     -4.35% -4.350000e-02 -4.350000e-02
# 1     246.6K  2.466000e+05  2.466000e+05
# 2        NaN           NaN           NaN
# 3    $12.76B  1.276000e+10  1.276000e+10
# 4      4.68%  4.680000e-02  4.680000e-02
# 5   ¥-459.5B -4.595000e+11 -4.595000e+11
# 6     €-6.8B -6.800000e+09 -6.800000e+09
# 7  £-15.623B -1.562300e+10 -1.562300e+10

Note that the dollar sign has to be escaped because it is a special regex character. Also note that .astype is given the keyword argument errors="ignore": if the conversion to float fails, the original values are retained and the pipeline keeps running without raising an error.
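For reference, the AttributeError in the question arises because apply passes each cell to the function as a plain Python str, and the .str accessor exists only on a pandas Series. A minimal per-value sketch follows, assuming only the percent case needs handling; the name percent_to_fraction and the small sample Series are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd


def percent_to_fraction(value):
    """Turn a string such as '4.68%' into 0.0468; return anything else unchanged."""
    if isinstance(value, str) and "%" in value:
        # value is a plain str here, so use str methods and float(),
        # not the Series-only .str accessor or .astype
        return float(value.rstrip("% ")) / 100.0
    return value


# Illustrative sample (assumption): a few values from the question
raw = pd.Series(["-4.35%", "246.6K", np.nan, "4.68%"])
result = raw.apply(percent_to_fraction)
print(result)
```

This only converts the percent entries and leaves the other strings untouched, which is why the vectorized mapping above is the more complete route.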

Answer 2

Score: 1

In your output, you lose the currency type. To go further, it would be better to convert the currencies before losing this information. As @SultanOrazbayev suggested, we can use dictionaries to perform the conversion:

import numpy as np

factors = {
    '%': '0.01',
    'K': '1e3',
    'M': '1e6',
    'B': '1e9',
    'T': '1e12',
    np.nan: '1'
}

# to US Dollars
currencies = {
    '£': '1.28',
    r'\$': '1',
    '¥': '0.007',
    '€': '1.09',
    np.nan: '1'
}

values = {
    ",": "",
    "'": ""
}

# Split prefix (currency), value and suffix (factor)
dmap = {'currency': currencies, 'value': values, 'factor': factors}
pattern = r'(?P<currency>[^\d\-\.]+)?(?P<value>[^%KMBT]+)(?P<factor>.+)?'

# Replace and compute the total
df1 = datasample['rawNo'].str.extract(pattern).replace(dmap, regex=True).astype(float)
datasample['ConvNo'] = df1['currency'] * df1['value'] * df1['factor']

Output:

>>> datasample
       rawNo        ConvNo
0     -4.35% -4.350000e-02  # val * 0.01
1     246.6K  2.466000e+05  # val * 1000
2        NaN           NaN
3    $12.76B  1.276000e+10  # 1 * val * 1000000000
4      4.68%  4.680000e-02  # val * 0.01
5   ¥-459.5B -3.216500e+09  # 0.007 * val * 1000000000
6     €-6.8B -7.412000e+09  # 1.09 * val * 1000000000
7  £-15.623B -1.999744e+10  # 1.28 * val * 1000000000
8   $-1,400B -1.400000e+12  # 1 * val * 1000000000

>>> df1
   currency     value        factor
0     1.000    -4.350  1.000000e-02
1     1.000   246.600  1.000000e+03
2     1.000       NaN  1.000000e+00
3     1.000    12.760  1.000000e+09
4     1.000     4.680  1.000000e-02
5     0.007  -459.500  1.000000e+09
6     1.090    -6.800  1.000000e+09
7     1.280   -15.623  1.000000e+09
8     1.000 -1400.000  1.000000e+09
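The same split-then-multiply idea can also be sketched with numeric dictionaries and Series.map instead of a nested regex replace. This is a variant written under assumptions, not the answer's own code: the names rates, parts and in_usd are illustrative, the stricter pattern only accepts the four currency symbols and five suffixes seen above, and the rates are the answer's example values rather than live exchange rates.

```python
import numpy as np
import pandas as pd

# Illustrative sample (assumption): values taken from the question
raw = pd.Series(["-4.35%", "246.6K", np.nan, "$12.76B", "¥-459.5B", "$-1,400B"])

# Example conversion rates to US dollars, copied from the answer (not live rates)
rates = {"£": 1.28, "$": 1.0, "¥": 0.007, "€": 1.09}
factors = {"%": 0.01, "K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

# Split each entry into optional currency prefix, number, optional unit suffix
pattern = r"^(?P<currency>[£$¥€])?(?P<value>-?[\d,.]+)(?P<factor>[%KMBT])?$"
parts = raw.str.extract(pattern)

# A missing prefix or suffix means "multiply by 1"
currency = parts["currency"].map(rates).fillna(1.0)
factor = parts["factor"].map(factors).fillna(1.0)

# Drop thousands separators before parsing the number itself
value = pd.to_numeric(parts["value"].str.replace(",", "", regex=False))

in_usd = currency * value * factor
print(in_usd)
```

Keeping the multipliers numeric avoids the final astype(float) step, at the cost of a pattern that rejects anything outside the listed symbols.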

Answer 3

Score: 1

This is posted as a separate answer, since I want to retain the original reproducible example (in case the linked csv is no longer available).

A careful analysis of the data shows that the non-numeric characters causing trouble are commas used as thousands separators and single dash symbols (presumably indicating nan). After incorporating these into the character_mapping, the conversion works without raising any errors:

from numpy import nan
from pandas import read_csv

df = read_csv("https://projectcodesamples.s3.amazonaws.com/ForEx.csv")

character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "T": "e12",
    # note that the currency symbols are mapped to blank string
    # to facilitate the conversion to float
    "£": "",
    r"\$": "",
    "¥": "",
    "€": "",
    # if an entry consists of a dash, replace with nan
    "^-$": nan,
    # some numbers have a comma as a thousands separator
    ",": "",
}

df["Converted"] = df["consensus"].replace(character_mapping, regex=True).astype(float)

mask = (df["Converted"] != df["consensus"]) & (df["consensus"].notna())
print(df.loc[mask, ["consensus", "Converted"]])
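In case the linked CSV is unavailable, the handling of commas and lone dashes can be tried on a small inline sample. The sketch below is an assumption-laden stand-in, not the answer's code: the sample values are made up to resemble the file, and it uses chained str.replace calls rather than a single dictionary.

```python
import numpy as np
import pandas as pd

# Made-up sample (assumption) containing the troublesome cases:
# thousands separators and a lone dash standing in for a missing value
raw = pd.Series(["1,400K", "-", "$-1,400B", "4.68%", np.nan, "12.5T"])

# Strip currency symbols and thousands separators in one pass;
# inside a character class the dollar sign needs no escaping
cleaned = raw.str.replace(r"[£$¥€,]", "", regex=True)

# An entry that is only a dash becomes NaN
cleaned = cleaned.where(cleaned != "-")

# Rewrite unit suffixes as scientific-notation exponents
suffixes = {"%": "e-2", "K": "e3", "M": "e6", "B": "e9", "T": "e12"}
for suffix, exponent in suffixes.items():
    cleaned = cleaned.str.replace(suffix, exponent, regex=False)

# Raises if anything non-numeric is left, so surprises are not hidden
converted = pd.to_numeric(cleaned)
print(converted)
```

Leaving pd.to_numeric at its default errors="raise" mirrors the strict astype(float) above, so any character not covered by the mapping surfaces immediately.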

huangapple
  • Posted on June 12, 2023, 08:59:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76453108.html