创建数据库表格从HTML表格中。

huangapple go评论97阅读模式
英文:

Create database tables from html tables

问题

我需要帮助从HTML表格创建数据库表格。目前我正在手动进行操作。

我有一个HTML文档,其中包含所有数据,但我不知道提取数据的最佳方式。

以下是HTML代码的一部分:

  1. <html xmlns="http://www.w3.org/1999/xhtml">
  2. <head>
  3. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  4. <title>ATOMS Definition for Type tom.service.soc.SocRecord</title>
  5. <style type="text/css">
  6. body
  7. {
  8. line-height: 1.6em;
  9. font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
  10. font-size: 14px;
  11. margin: 45px;
  12. }
  13. #box-table-a
  14. {
  15. font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
  16. font-size: 12px;
  17. margin: 5%;
  18. width: 90%;
  19. text-align: left;
  20. border-collapse: collapse;
  21. }
  22. #box-table-a th
  23. {
  24. font-size: 13px;
  25. font-weight: normal;
  26. padding: 8px;
  27. background: #b9c9fe;
  28. border-top: 4px solid #aabcfe;
  29. border-bottom: 1px solid #fff;
  30. color: #039;
  31. }
  32. #box-table-a td
  33. {
  34. padding: 8px;
  35. background: #e8edff;
  36. border-bottom: 1px solid #fff;
  37. color: #669;
  38. border-top: 1px solid transparent;
  39. }
  40. #box-table-a tr:hover td
  41. {
  42. background: #d0dafd;
  43. color: #339;
  44. }
  45. </style>
  46. </head>
  47. <body>
  48. <table id="box-table-a" summary="Definition for tom.service.soc.SocRecord">
  49. <thead>
  50. <tr><th colspan="2">tom.service.soc.SocRecord</th></tr>
  51. </thead>
  52. <tbody>
  53. <tr>
  54. <td>Version</td>
  55. <td>1</td>
  56. </tr>
  57. <tr>
  58. <td>Description</td>
  59. <td>[type is UNCLASSIFIED] Temporary dummy test object for SOC</td>
  60. </tr>
  61. </tbody>
  62. </table>
  63. <table id="box-table-a" summary="Fields Definition for Type tom.service.soc.SocRecord">
  64. <thead>
  65. <tr>
  66. <th scope="col">Index</th>
  67. <th scope="col">Name</th>
  68. <th scope="col">Type</th>
  69. <th scope="col">Range</th>
  70. <th scope="col">Default</th>
  71. <th scope="col" width="50%">Description</th>
  72. </tr>
  73. </thead>
  74. <tbody>
  75. <tr>
  76. <td>1</td>
  77. <td>socID</td>
  78. <td>String</td>
  79. <td>-</td>
  80. <td>""</td>
  81. <td>[ ] The UUID of the tracked object -- String for transmission purposes</td>
  82. </tr>
  83. <tr>
  84. <td>2</td>
  85. <td>satID</td>
  86. <td><a href="../../../../../tom/state/vcm/SatNumberType.html">SatNumberType</a></td>
  87. <td></td>
  88. <td></td>
  89. <td>[ ] The ID of the tracked object -- copy of the satelliteId in the VCM</td>
  90. </tr>
  91. </tbody>
  92. </table>
  93. </body></html>

我希望从HTML创建一个像这样的PostgreSQL数据库表格创建脚本,并且如果有href链接到另一个表格,则应该包含在内:

  1. CREATE TABLE soc.SocRecord(
  2. socId TEXT, --[ ] The UUID of the tracked object -- String for transmission purposes
  3. satId UUID, --[ ] The ID of the tracked object -- copy of the satId in the VCM
  4. commonName TEXT, --[ ] The name of the tracked object -- may be blank -
  5. --This field is optional in the current version of the message, check the set attribute before use.);

如果需要进一步的帮助,请告诉我。

英文:

I need help creating database tables from html tables. Right now I'm just manually doing it.
I have a html doc that has all the data but I don't know what the best way to extract the data.

  1. &lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;&lt;head&gt;
  2. &lt;meta http-equiv=&quot;Content-Type&quot; content=&quot;text/html; charset=utf-8&quot;&gt;
  3. &lt;title&gt;ATOMS Definition for Type tom.service.soc.SocRecord&lt;/title&gt;
  4. &lt;style type=&quot;text/css&quot;&gt;
  5. body
  6. {
  7. line-height: 1.6em;
  8. font-family: &quot;Lucida Sans Unicode&quot;, &quot;Lucida Grande&quot;, Sans-Serif;
  9. font-size: 14px;
  10. margin: 45px;
  11. }
  12. #box-table-a
  13. {
  14. font-family: &quot;Lucida Sans Unicode&quot;, &quot;Lucida Grande&quot;, Sans-Serif;
  15. font-size: 12px;
  16. margin: 5%;
  17. width: 90%;
  18. text-align: left;
  19. border-collapse: collapse;
  20. }
  21. #box-table-a th
  22. {
  23. font-size: 13px;
  24. font-weight: normal;
  25. padding: 8px;
  26. background: #b9c9fe;
  27. border-top: 4px solid #aabcfe;
  28. border-bottom: 1px solid #fff;
  29. color: #039;
  30. }
  31. #box-table-a td
  32. {
  33. padding: 8px;
  34. background: #e8edff;
  35. border-bottom: 1px solid #fff;
  36. color: #669;
  37. border-top: 1px solid transparent;
  38. }
  39. #box-table-a tr:hover td
  40. {
  41. background: #d0dafd;
  42. color: #339;
  43. }
  44. &lt;/style&gt;
  45. &lt;/head&gt;
  46. &lt;body&gt;
  47. &lt;table id=&quot;box-table-a&quot; summary=&quot;Definition for tom.service.soc.SocRecord&quot;&gt;
  48. &lt;thead&gt;
  49. &lt;tr&gt;&lt;th colspan=&quot;2&quot;&gt;tom.service.soc.SocRecord&lt;/th&gt;&lt;/tr&gt;
  50. &lt;/thead&gt;
  51. &lt;tbody&gt;
  52. &lt;tr&gt;
  53. &lt;td&gt;Version&lt;/td&gt;
  54. &lt;td&gt;1&lt;/td&gt;
  55. &lt;/tr&gt;
  56. &lt;tr&gt;
  57. &lt;td&gt;Description&lt;/td&gt;
  58. &lt;td&gt;[type is UNCLASSIFIED] Temporary dummy test object for SOC&lt;/td&gt;
  59. &lt;/tr&gt;
  60. &lt;/tbody&gt;
  61. &lt;/table&gt;
  62. &lt;table id=&quot;box-table-a&quot; summary=&quot;Fields Definition for Type tom.service.soc.SocRecord&quot;&gt;
  63. &lt;thead&gt;
  64. &lt;tr&gt;
  65. &lt;th scope=&quot;col&quot;&gt;Index&lt;/th&gt;
  66. &lt;th scope=&quot;col&quot;&gt;Name&lt;/th&gt;
  67. &lt;th scope=&quot;col&quot;&gt;Type&lt;/th&gt;
  68. &lt;th scope=&quot;col&quot;&gt;Range&lt;/th&gt;
  69. &lt;th scope=&quot;col&quot;&gt;Default&lt;/th&gt;
  70. &lt;th scope=&quot;col&quot; width=&quot;50%&quot;&gt;Description&lt;/th&gt;
  71. &lt;/tr&gt;
  72. &lt;/thead&gt;
  73. &lt;tbody&gt;
  74. &lt;tr&gt;
  75. &lt;td&gt;1&lt;/td&gt;
  76. &lt;td&gt;socID&lt;/td&gt;
  77. &lt;td&gt;String&lt;/td&gt;
  78. &lt;td&gt;
  79. -
  80. &lt;/td&gt;
  81. &lt;td&gt;&quot;&quot;&lt;/td&gt;
  82. &lt;td&gt;
  83. [ ] The UUID of the tracked object -- String for transmission purposes
  84. &lt;/td&gt;
  85. &lt;/tr&gt;
  86. &lt;tr&gt;
  87. &lt;td&gt;2&lt;/td&gt;
  88. &lt;td&gt;satID&lt;/td&gt;
  89. &lt;td&gt;&lt;a href=&quot;../../../../../tom/state/vcm/SatNumberType.html&quot;&gt;SatNumberType&lt;/a&gt;&lt;/td&gt;
  90. &lt;td&gt;
  91. &lt;/td&gt;
  92. &lt;td&gt;&lt;/td&gt;
  93. &lt;td&gt;
  94. [ ] The ID of the tracked object -- copy of the satelliteId in the VCM
  95. &lt;/td&gt;
  96. &lt;/tr&gt;
  97. &lt;/tbody&gt;
  98. &lt;/table&gt;
  99. &lt;/body&gt;&lt;/html&gt;

Here is the html.
I would like help making a create postgres database table script like this from the html. And if it has a href then that links to another table.

  1. CREATE TABLE soc.SocRecord(
  2. socId TEXT, --[ ] The UUID of the tracked object -- String for transmission purposes
  3. satId UUID, --[ ] The ID of the tracked object -- copy of the satId in the VCM
  4. commonName TEXT, --[ ] The name of the tracked object -- may be blank -
  5. --This field is optional in the current version of the message, check the set attribute before use.);

答案1

得分: 1

Here is the translated content:

编辑
尝试了一些不同的变体,发现使用 zip 没有错误,并返回了 SQL 脚本。

> for header, value in zip(headers, values):

我也更新了下面的代码。


所以你的方法可能是这样的:

读取 .html 文件
解析 table
解析 theader 等等

我觉得这个想法很有趣,所以我在 Python 中尝试了一下。

  1. from bs4 import BeautifulSoup
  2. # 指定 HTML 文件的路径
  3. html_file_path = 'path/to/your/file.html'

在这里,你可以编写另一个脚本来读取文件夹中的所有 .html 文件,并执行相应的操作。

  1. # 读取 HTML 文件的内容
  2. with open(html_file_path, 'r') as file:
  3. html = file.read()
  4. # 查找 HTML 中的所有表格
  5. tables = soup.find_all('table')
  6. # 遍历表格
  7. for table in tables:
  8. # 查找表格的 ID 属性
  9. table_id = table.get('id')
  10. # 提取表格的表头
  11. headers = [th.get_text() for th in table.find('thead').find_all('th')]
  12. # 创建一个字典来存储表格数据
  13. table_data = {}
  14. # 遍历表格的行
  15. for row in table.find('tbody').find_all('tr'):
  16. # 提取行中的单元格
  17. cells = row.find_all('td')
  18. # 提取单元格的值
  19. values = [cell.get_text().strip() for cell in cells]
  20. # 使用表头与值来存储数据到字典中
  21. for header, value in zip(headers, values):
  22. if header not in table_data:
  23. table_data[header] = []
  24. table_data[header].append(value)
  25. # 生成 PostgreSQL 表格脚本
  26. create_table_script = f"CREATE TABLE {table_id} (\n"
  27. for header, values in table_data.items():
  28. # 处理带有空格或特殊字符的列名
  29. column_name = header.lower().replace(' ', '_').replace('.', '_')
  30. # 将列值组合成逗号分隔的字符串
  31. column_values = ', '.join([f"'{value}'" if isinstance(value, str) else str(value) for value in values])
  32. # 将列定义附加到脚本中
  33. create_table_script += f" {column_name} {column_values},\n"
  34. create_table_script = create_table_script.rstrip(',\n') + "\n);\n"
  35. # 打印表格脚本
  36. print(create_table_script)

上述操作的结果可能不会完全符合你的要求,但可以帮助你入门。

英文:

Edit
Tried out some more variants and found that using zip resulted in no errors and returned the sql script.

> for header, value in zip(headers, values):

I've updated the code bellow as well.


So your approach could constitute something like:

read .html file
parse for table
parse for theader and so on

I find idea pretty interesting, so I tried it out in python.

  1. from bs4 import BeautifulSoup
  2. # Specify the path to your HTML file
  3. html_file_path = &#39;path/to/your/file.html&#39;

here you could write another script to read all .html files in a folder and execute against that instead.

  1. # Read the contents of the HTML file
  2. with open(html_file_path, &#39;r&#39;) as file:
  3. html = file.read()
  4. # Find all the tables in the HTML
  5. tables = soup.find_all(&#39;table&#39;)
  6. # Iterate over the tables
  7. for table in tables:
  8. # Find the table&#39;s ID attribute
  9. table_id = table.get(&#39;id&#39;)
  10. # Extract the table headers
  11. headers = [th.get_text() for th in table.find(&#39;thead&#39;).find_all(&#39;th&#39;)]
  12. # Create a dictionary to store the table data
  13. table_data = {}
  14. # Iterate over the table rows
  15. for row in table.find(&#39;tbody&#39;).find_all(&#39;tr&#39;):
  16. # Extract the row cells
  17. cells = row.find_all(&#39;td&#39;)
  18. # Extract the cell values
  19. values = [cell.get_text().strip() for cell in cells]
  20. # Store the values with their corresponding headers in the dictionary
  21. for header, value in zip(headers, values):
  22. if header not in table_data:
  23. table_data[header] = []
  24. table_data[header].append(value)
  25. # Generate the PostgreSQL table script
  26. create_table_script = f&quot;CREATE TABLE {table_id} (\n&quot;
  27. for header, values in table_data.items():
  28. # Handle column names with spaces or special characters
  29. column_name = header.lower().replace(&#39; &#39;, &#39;_&#39;).replace(&#39;.&#39;, &#39;_&#39;)
  30. # Combine the column values into a comma-separated string
  31. column_values = &#39;, &#39;.join([f&quot;&#39;{value}&#39;&quot; if isinstance(value, str) else str(value) for value in values])
  32. # Append the column definition to the script
  33. create_table_script += f&quot; {column_name} {column_values},\n&quot;
  34. create_table_script = create_table_script.rstrip(&#39;,\n&#39;) + &quot;\n);\n&quot;
  35. # Print the table script
  36. print(create_table_script)

The result of the above is probably not going to be exactly what you want, but it gets you started.

huangapple
  • 本文由 发表于 2023年7月13日 12:23:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76675914.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定