February 8, 2023, 16:49:30

Export a dataset having more than 1048576 rows to multiple sheets of a single Excel file in Python

Question

I have a dataset given as such:

import numpy as np
import pandas as pd
## Create an array
data = np.arange(7152123)
print(data)
## Dataframe
df = pd.DataFrame(data)
print("\n an df = \n", df)
## Load the data in excel sheet
df.to_excel('df.xlsx', index=False, header=False)

I get an error:

> f"This sheet is too large! Your sheet size is: {num_rows}, {num_cols}
> " ValueError: This sheet is too large! Your sheet size is: 7152123, 1
> Max sheet size is: 1048576, 16384

The error occurs because a single Excel sheet allows a maximum of 1048576 rows.

However, the dataset that I need to export has 7152123 rows.

Can somebody please let me know how to export such a huge dataset to multiple sheets of a single Excel file in Python?

Answer 1

Score: 2

> The data is being generated in Python and then I need to import the data in MATLAB.

In that case I'd suggest not using Excel. Excel was built to analyze data, not to exchange it. Not only is splitting the data into 1M-row sheets complex, but the MATLAB code will also have to read those sheets and recombine them in memory.

There are better formats for this. to_csv can generate a single text file that MATLAB can read.

df.to_csv('out.csv')
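
As a rough sketch of the CSV route, the snippet below writes a one-column frame without the index or header (mirroring the index=False, header=False options from the question) and reads it back to check the row count. The smaller row count and the temporary file path are assumptions chosen only to keep the example quick.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Assumption: a smaller frame than the 7152123-row original, but still above
# Excel's 1048576-row sheet limit, to keep the example fast.
n_rows = 1_200_000
df = pd.DataFrame(np.arange(n_rows))

# Assumption: write to a temporary directory rather than the working directory.
csv_path = os.path.join(tempfile.mkdtemp(), "out.csv")

# CSV has no row limit; drop the index and header so the file holds only data.
df.to_csv(csv_path, index=False, header=False)

# Read the file back to confirm that no rows were lost.
round_trip = pd.read_csv(csv_path, header=None)
print(len(round_trip))
```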

Another option, for really large files that can't fit in memory, is Parquet. In the MATLAB docs, Parquet Files have their own section under Large Files and Big Data.

The file can be generated with to_parquet and read with MATLAB's parquetread:

df.to_parquet('out.parquet')

T = parquetread('out.parquet');

Finally, rowfilter can be used to filter data before importing it.

BAD IDEA - Splitting into sheets

This can be done by splitting the dataframe into several smaller ones, then storing them in different sheets with the help of ExcelWriter. There are several relevant Stack Overflow answers already. One of them shows several options, including using NumPy's array_split:

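n = 1048576  # maximum number of rows in one Excel sheet
n_chunks = -(-len(df) // n)  # array_split takes the number of chunks, not the chunk size
list_df = np.array_split(df, n_chunks)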

The to_excel docs show how to write to different sheets in the same file:

with pd.ExcelWriter('output.xlsx') as writer:  
    df1.to_excel(writer, sheet_name='Sheet_name_1')
    df2.to_excel(writer, sheet_name='Sheet_name_2')

This can be done in a loop:

with pd.ExcelWriter('output.xlsx') as writer:  
    for i, d in enumerate(list_df):
        d.to_excel(writer, sheet_name=f'Sheet_name_{i}')
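
Putting the pieces together, a minimal sketch might look like the following. It slices with iloc instead of array_split and reserves one row per sheet for the header that to_excel writes by default. The function names split_rows and write_sheets and the sheet-name pattern are illustrative and not part of pandas.

```python
import pandas as pd

EXCEL_MAX_ROWS = 1048576  # row limit of a single .xlsx sheet


def split_rows(df, header=True, max_rows=EXCEL_MAX_ROWS):
    """Return consecutive slices of df that each fit on one sheet."""
    # A header row, if written, uses up one of the sheet's rows.
    size = max_rows - 1 if header else max_rows
    return [df.iloc[start:start + size] for start in range(0, len(df), size)]


def write_sheets(df, path, header=True):
    """Write df to path, one sheet per slice (hypothetical helper)."""
    # Requires an Excel engine such as openpyxl or xlsxwriter to be installed.
    with pd.ExcelWriter(path) as writer:
        for i, chunk in enumerate(split_rows(df, header=header)):
            chunk.to_excel(writer, sheet_name=f"Sheet_{i}", index=False, header=header)
```

Calling write_sheets(df, 'output.xlsx', header=False) on the 7152123-row frame from the question would produce seven sheets; expect it to be slow and the resulting file to be large.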

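Meaningful grouping

If the data is meant for analysis using Excel, it makes sense to split the data using a meaningful grouping, assuming no group exceeds 1M rows: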

with pd.ExcelWriter('output.xlsx') as writer:  
    for name,d in df.groupby('AcctName'):
        d.to_excel(writer, sheet_name=f'Sheet_name_{name}')
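
One caveat with naming sheets after group keys: Excel limits sheet names to 31 characters and disallows the characters [ ] : * ? / \. A small helper along these lines could sanitize names before they are passed to sheet_name. The function name safe_sheet_name and the underscore replacement are assumptions, and the helper does not guard against two groups collapsing to the same name.

```python
import re

# Characters Excel does not allow in sheet names.
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")


def safe_sheet_name(name, prefix="Sheet_name_"):
    """Build a sheet name from a group key that Excel will accept."""
    cleaned = _INVALID_SHEET_CHARS.sub("_", f"{prefix}{name}")
    # Sheet names are limited to 31 characters.
    return cleaned[:31]


print(safe_sheet_name("Acme/West: Q1"))
```

In the groupby loop above, this would be used as sheet_name=safe_sheet_name(name).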


Answer 2

Score: 2

One way is to open the pandas.ExcelWriter object in append mode.

Here is an example where I slice the dataframe in batches of 1000 rows and save each slice into a sheet.
Append mode opens an existing file, so make sure an output.xlsx workbook already exists in the same path, then try the following.

import numpy as np
import pandas as pd

data = np.arange(10000)
df = pd.DataFrame(data)
# Append mode needs an existing output.xlsx workbook
with pd.ExcelWriter(path="output.xlsx", mode="a") as writer:
    # Start at 0 so that the first row is not skipped
    for n in range(0, len(df), 1000):
        df.iloc[n:n+1000].to_excel(writer, sheet_name=f'MySheet_{n}')

However, as already suggested, you should probably just save the data in a single csv file instead.

Answer 3

Score: 1

Try splitting the dataframe in groups of max_rows:

    import math

    # Split dataframe in groups of 1048576 rows
    max_rows = 1048576
    rows = len(df)
    if rows > max_rows:
        # Round up so that the last, partial group is counted
        n_of_groups = math.ceil(rows / max_rows)
        for i in range(n_of_groups):
            start = i*max_rows
            end = (i+1)*max_rows
            if end < rows:
                df_write = df[start:end]
            else:
                df_write = df[start:]
            print(f'Write df_write with {len(df_write)} rows on Excel file')

Output:

    Write df_write with 1048576 rows on Excel file
    Write df_write with 1048576 rows on Excel file
    Write df_write with 1048576 rows on Excel file
    Write df_write with 1048576 rows on Excel file
    Write df_write with 1048576 rows on Excel file
    Write df_write with 1048576 rows on Excel file
    Write df_write with 860667 rows on Excel file

Hope it works!
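
The group sizes in the output above can be checked with plain arithmetic, without building the dataframe. The helper name chunk_sizes below is made up for illustration.

```python
import math


def chunk_sizes(rows, max_rows=1048576):
    """Return the number of rows that land in each group."""
    n_of_groups = math.ceil(rows / max_rows)
    # Every group is full except possibly the last one.
    return [min(max_rows, rows - i * max_rows) for i in range(n_of_groups)]


sizes = chunk_sizes(7152123)
print(len(sizes), sizes[-1])
```

For the 7152123 rows in the question this gives six full groups and a final group of 860667 rows.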
