2023-07-12 21:44:54

Filter and sort CSV data and store as PDF file with page breaks after specific rows

Question

I am using a Python script that imports CSV data, filters and sorts it, converts it to HTML and then PDF. I'd like to find a way to add page breaks after specific rows.

Assume the following example:

The data is sorted by column col1, whose values form "groups". I'd like to add a page break after every group (that is, whenever a new value appears in col1):

Input data (CSV table)

col1	col2	col3
A	x	a
A	y	b
B	x	a
B	x	a
B	y	b
B	x	a
C	x	a
C	y	b

Output data (table in PDF)

(page breaks added, column headings repeated every page)

col1	col2	col3
A	x	a
A	y	b
pagebreak
col1	col2	col3
B	x	a
B	x	a
B	y	b
B	x	a
pagebreak
col1	col2	col3
C	x	a
C	y	b

In brief, my workflow looks like this:

import pandas as pd
import pdfkit

# Read the CSV, then filter and sort the rows.
df = pd.read_csv(input_filename, encoding="")
filtered_df = df[some_condition]
filtered_df = filtered_df.sort_values(some_other_condition)
# Render the table as HTML and wrap it in the surrounding page markup.
html_table = filtered_df.to_html(index=False)
html_string = html_head + html_something + html_table + html_something_else + html_foot
# Convert the assembled HTML document to PDF.
pdfkit.from_string(html_string, outfile_name, options=pdfkit_options)

I see the following alternative approaches (but I don't have a clue how to implement them yet, and I don't like any of them):

1. Parse the data and add "ghost" lines that carry no data, only a magic string token that can be replaced after the HTML conversion by other HTML magic (a table row with a specific CSS style?). Feels very hacky.
2. Split the big table into smaller tables (one for every group - but how?). Convert them to HTML separately and put them back together afterwards (using some HTML/CSS magic).
3. Use some pdfkit option or pandas.DataFrame.to_html option I don't know about.
4. Use a completely different approach.

I don't know all the values col1 holds in advance, but it's probably easy to find them out once and reuse them for further processing.
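For illustration, here is a minimal sketch of collecting the group values at run time. It assumes the example data above is available as comma-separated text (the inline `csv_text` is a stand-in for the real file), and the variable names are hypothetical:

```python
import io

import pandas as pd

# Example data from the question, inlined so the sketch is self-contained.
# Assumption: comma-separated values; the real file may use another delimiter.
csv_text = """col1,col2,col3
A,x,a
A,y,b
B,x,a
B,x,a
B,y,b
B,x,a
C,x,a
C,y,b
"""
df = pd.read_csv(io.StringIO(csv_text))

# unique() returns the distinct values in order of first appearance,
# so the list of groups does not have to be known in advance.
groups = df["col1"].unique().tolist()

# Number of rows that belong to each group.
rows_per_group = df.groupby("col1", sort=False).size().to_dict()
```

The resulting `groups` list can then be reused for any later per-group processing.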

Any help is very much appreciated.

Answer 1

Score: 1

Try something like:

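# Empty block element that forces a page break after itself.
html_page_break = '<div style="page-break-after: always;"></div>'
html_string = ''
# to_html() returns a single string, so build one HTML table per col1 group.
html_tables = [group.to_html(index=False) for _, group in filtered_df.groupby('col1', sort=False)]
for i, html_table in enumerate(html_tables):
    html_string += html_head + html_something + html_table + html_something_else + html_foot
    # Add a page break after every table except the last one.
    if i < len(html_tables) - 1:
        html_string += html_page_break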

Then convert to PDF just as you already do:

pdfkit.from_string(html_string, outfile_name, options=pdfkit_options)

Answer 2

Score: 1

I use the following approach (#2 of OP):

Split the single table into smaller tables ("sub tables"), convert them to HTML one after another, and put all the HTML tables together with page breaks in between.

1. Get the unique values of the group_column (in my example that would be col1).
2. Iterate over the groups and filter the pandas.DataFrame to select only the rows that match the group.
3. Add a page break after every group/table (except after the last one).

# Empty block element that forces a page break after itself.
html_page_break = '<div style="page-break-after: always;"></div>'
# Distinct group values, in order of first appearance.
groups = filtered_df[group_column].unique()
html_string = html_head + html_note
for i, group in enumerate(groups):
    # Select only the rows that belong to the current group.
    filtered_df_subtable = filtered_df[filtered_df[group_column] == group]
    html_table = filtered_df_subtable.to_html(index=False)
    html_string += html_table
    # Add a page break after every table except the last one.
    if i < len(groups) - 1:
        html_string += html_page_break
html_string += html_foot

The conversion from HTML to PDF is untouched.

Thanks @notarealgreal for html_page_break + enumerate.
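As a possible variant of the same idea, the split can also be sketched with `DataFrame.groupby` and `str.join`, which removes the need for the index check. The function name `build_grouped_html` and the default `head`/`foot` strings are hypothetical placeholders, not part of the original script:

```python
import pandas as pd

PAGE_BREAK = '<div style="page-break-after: always;"></div>'


def build_grouped_html(df, group_column, head="<html><body>", foot="</body></html>"):
    """Return one HTML document with a table per group and page breaks in between."""
    # sort=False keeps the groups in the order they appear in the already-sorted frame.
    tables = [
        sub_df.to_html(index=False)
        for _, sub_df in df.groupby(group_column, sort=False)
    ]
    # join() places the break only between tables, never after the last one.
    return head + PAGE_BREAK.join(tables) + foot


# Example data from the question.
example = pd.DataFrame({
    "col1": ["A", "A", "B", "B", "B", "B", "C", "C"],
    "col2": ["x", "y", "x", "x", "y", "x", "x", "y"],
    "col3": ["a", "b", "a", "a", "b", "a", "a", "b"],
})
html_string = build_grouped_html(example, "col1")
```

Because every group is rendered as its own table, each page starts with its own header row. The resulting `html_string` can be passed to `pdfkit.from_string` exactly as before.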
