

Filter and sort CSV data and store as PDF file with page breaks after specific rows










I am using a Python script that imports CSV data, filters and sorts it, converts it to HTML and then PDF. I'd like to find a way to add page breaks after specific rows.

Assume the following example:

The data is sorted by column col1, whose values form "groups". I'd like to add a page break after every group (that is, whenever a new value appears in col1):

Input data (CSV table)

col1 col2 col3
A x a
A y b
B x a
B x a
B y b
B x a
C x a
C y b

Output data (table in PDF)

(page breaks added, column headings repeated every page)

col1 col2 col3
A x a
A y b
col1 col2 col3
B x a
B x a
B y b
B x a
col1 col2 col3
C x a
C y b

My workflow briefly looks as follows:

df = pd.read_csv(input_filename, encoding="")

filtered_df = df[some_condition]

filtered_df = filtered_df.sort_values(some_other_condition)

html_table = filtered_df.to_html(index=False)

html_string = html_head + html_something + html_table + html_something_else + html_foot

pdfkit.from_string(html_string, outfile_name, options=pdfkit_options)

I see the following alternative approaches (but don't have a clue how to implement them yet, and I don't like any of them):

  1. Parse the data and add "ghost" lines, carrying no data but some magic string token that can be replaced after the HTML conversion by other HTML magic (table row with specific CSS style?). Feels very hacky.
  2. Split the big table into smaller tables (one for every group - but how?). Convert them to HTML separately and put them back afterwards (using some HTML/CSS magic).
  3. Use some pdfkit option or pandas.DataFrame.to_html option I don't know about.
  4. Use a completely different approach.

I don't know all the values col1 holds in advance, but it's probably easy to find them out once and reuse them for further processing.
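Finding the groups could look roughly like the sketch below. It assumes the data is already in a pandas.DataFrame; the inline CSV text is a hypothetical stand-in for the real input file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file from the example above.
csv_text = """col1,col2,col3
A,x,a
A,y,b
B,x,a
B,x,a
B,y,b
B,x,a
C,x,a
C,y,b
"""

df = pd.read_csv(io.StringIO(csv_text))

# unique() returns the distinct values in order of first appearance.
groups = list(df["col1"].unique())

# Number of rows per group; sort=False keeps the order of first appearance.
rows_per_group = df.groupby("col1", sort=False).size().to_dict()

print(groups)
print(rows_per_group)
```

The resulting list can be reused for any later per-group processing.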

Any help is very much appreciated.


Score: 1



Try something like:

html_page_break = '<div style="page-break-after: always;"></div>'

# to_html() returns one string, so build one HTML table per col1 group first
html_tables = [group_df.to_html(index=False)
               for _, group_df in filtered_df.groupby("col1", sort=False)]

# add head and foot only once so the result stays a single HTML document
html_string = html_head + html_something
for i, html_table in enumerate(html_tables):
    html_string += html_table
    if i < len(html_tables) - 1:
        html_string += html_page_break
html_string += html_something_else + html_foot

Then just like you do:

pdfkit.from_string(html_string, outfile_name, options=pdfkit_options)
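The "page break between tables, but not after the last one" part can also be written with str.join instead of an index check. This is only a sketch: the three table strings are made-up placeholders for whatever to_html produced per group, and join_with_page_breaks is a hypothetical helper name.

```python
HTML_PAGE_BREAK = '<div style="page-break-after: always;"></div>'


def join_with_page_breaks(html_tables):
    """Concatenate HTML fragments with a page break between each pair."""
    # str.join puts the separator only *between* items,
    # so no page break trails the last table.
    return HTML_PAGE_BREAK.join(html_tables)


# Hypothetical pre-rendered tables, one per group.
tables = [
    "<table><tr><td>A</td></tr></table>",
    "<table><tr><td>B</td></tr></table>",
    "<table><tr><td>C</td></tr></table>",
]
body = join_with_page_breaks(tables)
print(body.count(HTML_PAGE_BREAK))
```

The body can then be wrapped in the usual head and foot strings before it goes to pdfkit.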


Score: 1







I use the following approach (#2 of OP):

Split the single table into smaller tables ("sub-tables"), convert them to HTML one after another, and put all the HTML tables back together with page breaks in between.

  1. Get the unique values of group_column (in my example that would be col1).

  2. Iterate over the groups and filter the pandas.DataFrame to select only the rows that match the current group.

  3. Add a page break after every group/table (except after the last one).

     html_page_break = '<div style="page-break-after: always;"></div>'
     groups = filtered_df[group_column].unique()
     html_string = html_head + html_note
     for i, group in enumerate(groups):
         filtered_df_subtable = filtered_df[filtered_df[group_column] == group]
         html_table = filtered_df_subtable.to_html(index=False)
         html_string += html_table
         if i < len(groups) - 1:
             html_string += html_page_break
     html_string += html_foot

The conversion from HTML to PDF is untouched.
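For anyone who wants to try this end to end, here is a self-contained sketch of the same idea using groupby instead of manual filtering. The sample CSV, the head/foot strings and the helper names build_html and write_pdf are my own placeholders, not the ones from my real script; the PDF step needs pdfkit plus the wkhtmltopdf binary and is left commented out.

```python
import io

import pandas as pd

HTML_PAGE_BREAK = '<div style="page-break-after: always;"></div>'


def build_html(df, group_column):
    """Render one HTML table per group, separated by page breaks."""
    # sort=False keeps the groups in the order they already appear in df.
    tables = [
        sub_df.to_html(index=False)
        for _, sub_df in df.groupby(group_column, sort=False)
    ]
    # Placeholder head and foot (assumption: the real ones carry CSS etc.).
    head = "<!DOCTYPE html><html><head><meta charset='utf-8'></head><body>"
    foot = "</body></html>"
    return head + HTML_PAGE_BREAK.join(tables) + foot


def write_pdf(html_string, outfile_name):
    """Convert the HTML to PDF; requires pdfkit and wkhtmltopdf."""
    import pdfkit  # imported lazily so build_html works without it

    pdfkit.from_string(html_string, outfile_name)


csv_text = (
    "col1,col2,col3\n"
    "A,x,a\nA,y,b\nB,x,a\nB,x,a\nB,y,b\nB,x,a\nC,x,a\nC,y,b\n"
)
df = pd.read_csv(io.StringIO(csv_text)).sort_values("col1", kind="stable")
html_string = build_html(df, "col1")
# write_pdf(html_string, "out.pdf")  # uncomment once wkhtmltopdf is installed
```

Because every group becomes its own table, each page gets its own column headings without any extra CSS.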

Thanks @notarealgreal for html_page_break + enumerate.

  • Published on July 12, 2023, 21:44:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76671271.html



