Converting column with various types of numerical units
Question
I have a dataframe with a column that contains values in various unit types.
The rawNo column is how the data comes in. I would like to convert it so that it looks like the ConvNo column:
import numpy as np
import pandas as pd

# DataFrame.append was removed in pandas 2.0, so the sample is built from a list of rows
datasample = pd.DataFrame([
    {'rawNo': '-4.35%', 'ConvNo': -.0435},
    {'rawNo': '246.6K', 'ConvNo': 246600},
    {'rawNo': np.nan, 'ConvNo': np.nan},
    {'rawNo': '$12.76B', 'ConvNo': 12760000000},
    {'rawNo': '4.68%', 'ConvNo': .0468},
    {'rawNo': '¥-459.5B', 'ConvNo': -459500000000},
    {'rawNo': '€-6.8B', 'ConvNo': -6800000000},
    {'rawNo': '£-15.623B', 'ConvNo': -15623000000},
    {'rawNo': '$-1,400B', 'ConvNo': -1400000000000},
])
I figure I will have to use some type of conditional apply. This apply, which is meant to remove the percent sign, is failing:
def rPercent(value):
    value = str(value)
    count = value.count('%')
    print(count)
    if (count != 0):
        return value.str.rstrip('% ').astype('float') / 100.0
    else:
        return value

datasample["ConvNo"] = datasample['rawNo'].apply(rPercent)
Error I get:
> AttributeError: 'str' object has no attribute 'str'
Data file: you can download the file from this link
https://projectcodesamples.s3.amazonaws.com/ForEx.csv
In that file, the column I am trying to convert is "actual"; the expected result is in the "CNactual" column.
Answer 1
Score: 1
The function you use operates on each individual value, but for efficiency it is better to use vectorized operations. One specific trick is to map each special character to its equivalent in scientific notation (e.g. see this answer for handling just the %).
from numpy import nan
from pandas import DataFrame

data = [
    {"rawNo": "-4.35%", "ConvNo": -0.0435},
    {"rawNo": "246.6K", "ConvNo": 246600},
    {"rawNo": nan, "ConvNo": nan},
    {"rawNo": "$12.76B", "ConvNo": 12760000000},
    {"rawNo": "4.68%", "ConvNo": 0.0468},
    {"rawNo": "¥-459.5B", "ConvNo": -459500000000},
    {"rawNo": "€-6.8B", "ConvNo": -6800000000},
    {"rawNo": "£-15.623B", "ConvNo": -15623000000},
]
df = DataFrame(data)

# keys are regular expressions; unit suffixes become exponents
# note that the currency symbols are mapped to blank string
# to facilitate the conversion to float
character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "£": "",
    r"\$": "",  # raw string, so the backslash reaches the regex engine
    "¥": "",
    "€": "",
}
df["Converted"] = df["rawNo"].replace(character_mapping, regex=True).astype(float, errors='ignore')
print(df)
#        rawNo        ConvNo     Converted
# 0     -4.35% -4.350000e-02 -4.350000e-02
# 1     246.6K  2.466000e+05  2.466000e+05
# 2        NaN           NaN           NaN
# 3    $12.76B  1.276000e+10  1.276000e+10
# 4      4.68%  4.680000e-02  4.680000e-02
# 5   ¥-459.5B -4.595000e+11 -4.595000e+11
# 6     €-6.8B -6.800000e+09 -6.800000e+09
# 7  £-15.623B -1.562300e+10 -1.562300e+10
Note that the dollar sign had to be escaped because it's a special regex character. Also note that .astype has the kwarg errors="ignore", which retains the original values if the conversion to float fails (but allows the pipeline to run without raising errors).
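As for the AttributeError in the question: Series.apply passes each element to the function as a plain Python object, so inside rPercent the value is a built-in str, which has neither a .str accessor nor an .astype method (both belong to pandas objects). A minimal per-value sketch of a fix, assuming only the percent case needs handling and everything else should pass through unchanged (the name r_percent and the sample Series are illustrative, not from the question):

```python
import numpy as np
import pandas as pd


def r_percent(value):
    """Convert a string such as '-4.35%' to a fraction; pass other values through."""
    # Inside Series.apply each element is a plain Python object, so use the
    # built-in str methods and float(), not the Series-level .str / .astype.
    if isinstance(value, str) and value.endswith("%"):
        return float(value.rstrip("% ")) / 100.0
    return value


# Illustrative sample; only the percent entries are converted.
raw = pd.Series(["-4.35%", "246.6K", np.nan, "4.68%"])
result = raw.apply(r_percent)
print(result)
```

This only removes the error for the percent case; the vectorized mapping above handles all suffixes in one pass.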
Answer 2
Score: 1
In your output, you lose the currency type. To go further, it would be better to convert the currencies before losing this information. As @SultanOrazbayev suggested, we can use dictionaries to perform the conversion:
import numpy as np

# multiplier for each unit suffix (a missing suffix means 1)
factors = {
    '%': '0.01',
    'K': '1e3',
    'M': '1e6',
    'B': '1e9',
    'T': '1e12',
    np.nan: '1'
}
# to US Dollars
currencies = {
    '£': '1.28',
    r'\$': '1',  # raw string, so the backslash reaches the regex engine
    '¥': '0.007',
    '€': '1.09',
    np.nan: '1'
}
# separators to strip from the numeric part
values = {
    ",": "",
    "'": ""
}

# Split prefix (currency), value and suffix (factor)
dmap = {'currency': currencies, 'value': values, 'factor': factors}
pattern = r'(?P<currency>[^\d\-\.]+)?(?P<value>[^%KMBT]+)(?P<factor>.+)?'

# Replace and compute the total
df1 = datasample['rawNo'].str.extract(pattern).replace(dmap, regex=True).astype(float)
datasample['ConvNo'] = df1['currency'] * df1['value'] * df1['factor']
Output:
>>> datasample
       rawNo        ConvNo
0     -4.35% -4.350000e-02  # val * 0.01
1     246.6K  2.466000e+05  # val * 1000
2        NaN           NaN
3    $12.76B  1.276000e+10  # 1 * val * 1000000000
4      4.68%  4.680000e-02  # val * 0.01
5   ¥-459.5B -3.216500e+09  # 0.007 * val * 1000000000
6     €-6.8B -7.412000e+09  # 1.09 * val * 1000000000
7  £-15.623B -1.999744e+10  # 1.28 * val * 1000000000
8   $-1,400B -1.400000e+12  # 1 * val * 1000000000
>>> df1
   currency     value        factor
0     1.000    -4.350  1.000000e-02
1     1.000   246.600  1.000000e+03
2     1.000       NaN  1.000000e+00
3     1.000    12.760  1.000000e+09
4     1.000     4.680  1.000000e-02
5     0.007  -459.500  1.000000e+09
6     1.090    -6.800  1.000000e+09
7     1.280   -15.623  1.000000e+09
8     1.000 -1400.000  1.000000e+09
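The same split-and-multiply idea can also be sketched with explicit dictionary lookups instead of regex replacement, which keeps the multipliers as numbers throughout. This is an alternative formulation under stated assumptions, not the answer's code: the exchange rates are the same illustrative placeholders as above, the stricter pattern is my own, and the names rates, scales, parts, rate and scale are hypothetical.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["-4.35%", "246.6K", np.nan, "$12.76B", "¥-459.5B", "$-1,400B"])

# Illustrative placeholder rates to US dollars (same values as above).
rates = {"£": 1.28, "$": 1.0, "¥": 0.007, "€": 1.09}
scales = {"%": 0.01, "K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

# Optional currency prefix, signed number (commas allowed), optional unit suffix.
pattern = r"^(?P<currency>[£$¥€])?(?P<value>-?[\d,]*\.?\d+)(?P<factor>[%KMBT])?$"
parts = raw.str.extract(pattern)

# Strip thousands separators, then parse the numeric part.
value = pd.to_numeric(parts["value"].str.replace(",", "", regex=False), errors="coerce")
# A missing prefix or suffix means "multiply by 1".
rate = parts["currency"].map(rates).fillna(1.0).astype(float)
scale = parts["factor"].map(scales).fillna(1.0).astype(float)

converted = value * rate * scale
print(converted)
```

Because the pattern is anchored, an entry that does not fit the prefix/number/suffix shape yields NaN rather than a partial match.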
Answer 3
Score: 1
This is posted as a separate answer, since I want to retain the original reproducible example (in case the linked csv is no longer available).
A careful analysis of the data shows that the non-numeric characters that cause trouble are commas used as thousands separators and single dash symbols (presumably indicating nan). After incorporating these into the character_mapping, the conversion works without raising any errors:
from numpy import nan
from pandas import read_csv

df = read_csv("https://projectcodesamples.s3.amazonaws.com/ForEx.csv")

character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "T": "e12",
    # note that the currency symbols are mapped to blank string
    # to facilitate the conversion to float
    "£": "",
    r"\$": "",  # raw string, so the backslash reaches the regex engine
    "¥": "",
    "€": "",
    # if an entry consists of a dash, replace with nan
    "^-$": nan,
    # some numbers have a comma as a thousands separator
    ",": "",
}
df["Converted"] = df["consensus"].replace(character_mapping, regex=True).astype(float)

# show only the rows where the conversion changed a non-missing value
mask = (df["Converted"] != df["consensus"]) & (df["consensus"].notna())
print(df.loc[mask, ["consensus", "Converted"]])
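If the linked CSV is unavailable, the same mapping can be tried on a small inline sample. The values below are made up for illustration (they are not taken from ForEx.csv), and pd.to_numeric with errors="coerce" is swapped in for .astype(float), so a lone dash or any other entry the mapping does not cover becomes NaN rather than raising:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the real file.
sample = pd.Series(["1,250K", "-", "€-6.8B", "3.2T", np.nan, "unknown"])

character_mapping = {
    "%": "e-2",
    "K": "e3",
    "M": "e6",
    "B": "e9",
    "T": "e12",
    # currency symbols and thousands separators are dropped;
    # inside a character class the dollar sign needs no escaping
    r"[£$¥€,]": "",
}

cleaned = sample.replace(character_mapping, regex=True)
# A lone dash and any other leftover text cannot be parsed and become NaN.
converted = pd.to_numeric(cleaned, errors="coerce")
print(converted)
```

The trade-off is that coercion hides unexpected formats, so it may be worth checking which non-missing inputs ended up as NaN, for example with sample[converted.isna() & sample.notna()].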