June 12, 2023, 08:59:56

Converting column with various types of numerical units

Question

I have a dataframe with a column that mixes several unit types.

The rawNo column is how the data comes in. I would like to convert it so that it looks like the ConvNo column:

import numpy as np
import pandas as pd

# Sample data: rawNo is the incoming text, ConvNo is the desired numeric value
datasample = pd.DataFrame(
    [
        {'rawNo': '-4.35%', 'ConvNo': -.0435},
        {'rawNo': '246.6K', 'ConvNo': 246600},
        {'rawNo': np.nan, 'ConvNo': np.nan},
        {'rawNo': '$12.76B', 'ConvNo': 12760000000},
        {'rawNo': '4.68%', 'ConvNo': .0468},
        {'rawNo': '¥-459.5B', 'ConvNo': -459500000000},
        {'rawNo': '€-6.8B', 'ConvNo': -6800000000},
        {'rawNo': '£-15.623B', 'ConvNo': -15623000000},
        {'rawNo': '$-1,400B', 'ConvNo': -1400000000000},
    ]
)

I figure I will need some kind of conditional apply. This attempt to strip the percent sign fails:

def rPercent(value):
    value = str(value)
    count = value.count('%')
    print(count)

    if (count != 0):
        return value.str.rstrip('% ').astype('float') / 100.0
    else:
        return value
datasample["ConvNo"] = datasample['rawNo'].apply(rPercent)

Error I get:

> AttributeError: 'str' object has no attribute 'str'

Data file: you can download it from this link:
https://projectcodesamples.s3.amazonaws.com/ForEx.csv

The column I am trying to convert is "actual"; the expected result is in the "CNactual" column.
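
For context, the AttributeError comes from the fact that apply calls the function once per element, so value is a plain Python str (or a float NaN), and scalars have neither a .str accessor nor an .astype method. A minimal sketch of a per-value fix follows; the helper name strip_percent is hypothetical, and a plain list stands in for the column so the snippet runs without pandas:

```python
def strip_percent(value):
    """Hypothetical per-value replacement for rPercent."""
    # apply() hands over one element at a time: a str or a float NaN.
    # Scalars have no .str accessor or .astype method, so float() is used.
    if isinstance(value, str) and value.endswith("%"):
        return float(value.rstrip("% ")) / 100.0
    # Anything without a percent sign (other units, NaN) passes through unchanged.
    return value


# A plain list stands in for the column; with pandas the call would be
# datasample["rawNo"].apply(strip_percent).
raw_values = ["-4.35%", "246.6K", float("nan"), "4.68%"]
converted = [strip_percent(v) for v in raw_values]
print(converted)
```

This only handles the percent case; the K/M/B suffixes and currency symbols still need their own handling.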

Answer 1

Score: 1

The function you wrote operates on each value individually, but for efficiency it is better to use vectorized operations. One trick is to map each unit character to its equivalent in scientific notation (e.g. see this answer for handling just the %).

from numpy import nan
from pandas import DataFrame
data = [
    {"rawNo": "-4.35%", "ConvNo": -0.0435},
    {"rawNo": "246.6K", "ConvNo": 246600},
    {"rawNo": nan, "ConvNo": nan},
    {"rawNo": "$12.76B", "ConvNo": 12760000000},
    {"rawNo": "4.68%", "ConvNo": 0.0468},
    {"rawNo": "¥-459.5B", "ConvNo": -459500000000},
    {"rawNo": "€-6.8B", "ConvNo": -6800000000},
    {"rawNo": "£-15.623B", "ConvNo": -15623000000},
]
df = DataFrame(data)
# note that the currency symbols are mapped to blank string
# to facilitate the conversion to float
character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "£": "",
    r"\$": "",
    "¥": "",
    "€": "",
}
df["Converted"] = df["rawNo"].replace(character_mapping, regex=True).astype(float, errors='ignore')
print(df)
#        rawNo        ConvNo     Converted
# 0     -4.35% -4.350000e-02 -4.350000e-02
# 1     246.6K  2.466000e+05  2.466000e+05
# 2        NaN           NaN           NaN
# 3    $12.76B  1.276000e+10  1.276000e+10
# 4      4.68%  4.680000e-02  4.680000e-02
# 5   ¥-459.5B -4.595000e+11 -4.595000e+11
# 6     €-6.8B -6.800000e+09 -6.800000e+09
# 7  £-15.623B -1.562300e+10 -1.562300e+10

Note that the dollar sign has to be escaped because it is a special regex character. Also note that .astype is called with errors="ignore": if the conversion to float fails, the original values are retained and the pipeline runs without raising an error.
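
The scientific-notation trick works because a float parser accepts strings such as "246.6e3", so rewriting the unit suffix as an exponent makes the whole string parseable. The scalar sketch below only illustrates that mechanism and is not a replacement for the vectorized version; the names SUFFIX_TO_EXPONENT, CURRENCY_SYMBOLS and to_float are hypothetical.

```python
# Unit suffixes rewritten as exponents, mirroring the character mapping above.
SUFFIX_TO_EXPONENT = {"%": "e-2", "K": "e3", "M": "e6", "B": "e9"}
CURRENCY_SYMBOLS = "£$¥€"


def to_float(text):
    """Hypothetical scalar helper: '246.6K' -> '246.6e3' -> 246600.0."""
    for suffix, exponent in SUFFIX_TO_EXPONENT.items():
        text = text.replace(suffix, exponent)
    # Strip any leading currency symbol, then let float() parse the exponent form.
    return float(text.strip(CURRENCY_SYMBOLS))


for raw in ["-4.35%", "246.6K", "$12.76B", "¥-459.5B"]:
    print(raw, "->", to_float(raw))
```

Strings with thousands separators (such as "$-1,400B") would still fail here, since float() does not accept commas.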

Answer 2

Score: 1

In your output, you lose the currency type. To go further, it would be better to convert the currencies before losing this information. As @SultanOrazbayev suggested, we can use dictionaries to perform the conversion:

import numpy as np

factors = {
    '%': '0.01',
    'K': '1e3',
    'M': '1e6',
    'B': '1e9',
    'T': '1e12',
    np.nan: '1'
}
# to US Dollars
currencies = {
    '£': '1.28',
    r'\$': '1',
    '¥': '0.007',
    '€': '1.09',
    np.nan: '1'
}
values = {
    ",": "",
    "'": ""
}
# Split prefix (currency), value and suffix (factor)
dmap = {'currency': currencies, 'value': values, 'factor': factors}
pattern = r'(?P<currency>[^\d\-\.]+)?(?P<value>[^%KMBT]+)(?P<factor>.+)?'
# Replace and compute the total
df1 = datasample['rawNo'].str.extract(pattern).replace(dmap, regex=True).astype(float)
datasample['ConvNo'] = df1['currency'] * df1['value'] * df1['factor']

Output:

>>> datasample
       rawNo        ConvNo
0     -4.35% -4.350000e-02  # val * 0.01
1     246.6K  2.466000e+05  # val * 1000
2        NaN           NaN
3    $12.76B  1.276000e+10  # 1 * val * 1000000000
4      4.68%  4.680000e-02  # val * 0.01
5   ¥-459.5B -3.216500e+09  # 0.007 * val * 1000000000
6     €-6.8B -7.412000e+09  # 1.09 * val * 1000000000
7  £-15.623B -1.999744e+10  # 1.28 * val * 1000000000
8   $-1,400B -1.400000e+12  # 1 * val * 1000000000
>>> df1
   currency     value        factor
0     1.000    -4.350  1.000000e-02
1     1.000   246.600  1.000000e+03
2     1.000       NaN  1.000000e+00
3     1.000    12.760  1.000000e+09
4     1.000     4.680  1.000000e-02
5     0.007  -459.500  1.000000e+09
6     1.090    -6.800  1.000000e+09
7     1.280   -15.623  1.000000e+09
8     1.000 -1400.000  1.000000e+09
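
To see what the named groups capture before any replacement happens, the same pattern can be tried with the standard re module. This is only an inspection sketch: it uses re.search on individual strings rather than Series.str.extract, and the constant name PATTERN is an assumption.

```python
import re

# Same pattern as in the answer: optional currency prefix, numeric value,
# optional unit suffix.
PATTERN = re.compile(
    r"(?P<currency>[^\d\-\.]+)?(?P<value>[^%KMBT]+)(?P<factor>.+)?"
)

for raw in ["-4.35%", "246.6K", "$12.76B", "¥-459.5B", "$-1,400B"]:
    match = PATTERN.search(raw)
    # groupdict() maps each named group to its captured text (None if absent)
    print(raw, "->", match.groupdict())
```

A missing prefix comes back as None here; in the DataFrame it shows up as NaN, which is why both dictionaries map np.nan to '1'.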

Answer 3

Score: 1

This is posted as a separate answer, since I want to retain the original reproducible example (in case the linked CSV is no longer available).

A closer look at the data shows that the troublesome non-numeric characters are commas used as thousands separators and lone dash symbols (presumably indicating nan). After adding these to character_mapping, the conversion completes without raising any errors:

from numpy import nan
from pandas import read_csv
df = read_csv("https://projectcodesamples.s3.amazonaws.com/ForEx.csv")
character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "T": "e12",
    # note that the currency symbols are mapped to blank string
    # to facilitate the conversion to float
    "£": "",
    r"\$": "",
    "¥": "",
    "€": "",
    # if an entry consists of a dash, replace with nan
    "^-$": nan,
    # some numbers have a comma as a thousands separator
    ",": "",
}
df["Converted"] = df["consensus"].replace(character_mapping, regex=True).astype(float)
mask = (df["Converted"] != df["consensus"]) & (df["consensus"].notna())
print(df.loc[mask, ["consensus", "Converted"]])
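
If the linked CSV is unavailable, or the data contains troublesome entries beyond commas and dashes, one way to find them is to coerce failures to NaN and list what did not parse. The sketch below swaps in pd.to_numeric(errors="coerce") for astype(float), which is not what the answer uses. The sample values are a made-up stand-in for the "consensus" column, and the currency symbols and commas are merged into one character class for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "consensus" column of the linked CSV.
sample = pd.Series(["1,250K", "-", "$3.5B", "4.68%", np.nan, "n/a"])

character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "T": "e12",
    r"[£$¥€,]": "",  # currency symbols and thousands separators
}

cleaned = sample
for pattern, replacement in character_mapping.items():
    # .str.replace leaves missing values untouched
    cleaned = cleaned.str.replace(pattern, replacement, regex=True)

# Anything that still is not a number becomes NaN instead of raising
converted = pd.to_numeric(cleaned, errors="coerce")

# Entries that were present in the input but could not be parsed
leftover = sample[converted.isna() & sample.notna()]
print(converted)
print(leftover)
```

Here the lone dash ends up as NaN through coercion rather than through an explicit "^-$" rule, and leftover makes any other unparsed pattern visible.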
