Polars read_excel not equal to Pandas read_excel for columns with "mixed" types
Question
I'm trying to read some Excel data with polars.read_excel(), and the result is not identical to what pandas.read_excel() returns for columns with mixed data. Here's an example to illustrate:

import pandas as pd
import polars as pl

# Create sample data and save it to Excel.
test = pd.DataFrame(
    {
        'nums': [1, 2, 3],
        'mixed': [1, 4, '6A'],
        'factor': ['A', 'B', 'C']
    }
)
test.to_excel('test.xlsx', index=False)

# Read the data with pandas and with Polars, then convert the Polars version to pandas.
test_pd = pd.read_excel('test.xlsx', engine='openpyxl')
test_pl = pl.read_excel('test.xlsx')
test_pl = test_pl.to_pandas()

# Compare the two.
print(test_pd)
print(test_pl)
print(test_pd == test_pl)

The output of print(test_pd) and print(test_pl) suggests the data is identical. However, print(test_pd == test_pl) returns the following:

   nums  mixed  factor
0  True  False    True
1  True  False    True
2  True   True    True

Is something causing the data to differ? And is this a Polars (or Arrow) limitation when dealing with object columns? I want the pl.read_excel() plus conversion-to-pandas approach to ultimately yield a DataFrame identical to the one from pd.read_excel().

Thanks!
Answer 1
Score: 1
Polars read some of your numbers as strings. Look at this cell:

test_pl.iloc[0, 1]
'1'

pandas, by contrast, kept integers wherever that was possible. The same cell in pandas:

test_pd.iloc[0, 1]
1

If you cast both tables to string, all cells compare equal:

test_pd.astype('string') == test_pl.astype('string')

   nums  mixed  factor
0  True   True    True
1  True   True    True
2  True   True    True
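
If you would rather get the integers back than cast everything to string, one option is to convert integer-looking strings in the Polars-derived frame back to int. The following is only a sketch: `restore_mixed` is a hypothetical helper (not part of pandas or Polars), and the Polars result is simulated with `astype(str)` so that the snippet runs without an Excel file. Note the assumption that a cell holding the text "1" in the original sheet should also become an int, which may not be what you want.

```python
import pandas as pd


def restore_mixed(value):
    """Hypothetical helper: turn integer-looking strings back into int."""
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            return value  # e.g. '6A' stays a string
    return value


test_pd = pd.DataFrame(
    {
        "nums": [1, 2, 3],
        "mixed": [1, 4, "6A"],
        "factor": ["A", "B", "C"],
    }
)

# Stand-in for pl.read_excel(...).to_pandas(): the mixed column arrives as all strings.
test_pl = test_pd.assign(mixed=test_pd["mixed"].astype(str))

# Inspect the Python type of every cell to see where the frames differ.
print(test_pd["mixed"].map(type).tolist())
print(test_pl["mixed"].map(type).tolist())

# Convert integer-looking strings back to int, then compare again.
test_pl["mixed"] = test_pl["mixed"].map(restore_mixed)
print((test_pd == test_pl).all().all())
```

Checking `map(type)` first is a cheap way to confirm that the mismatch is about cell types rather than values before deciding which side to convert.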
Answer 2
Score: 1
Polars and Arrow rely on strict data types, so ultimately, yes, it's a limitation. You can never have a column that is sometimes Utf8 and sometimes Floatxx.
pandas, on the other hand, is happy to hold mixed data types in a single column, because an object column is basically just a Python list.