Non UTF-8 compliant character "\x{0D}" at the end of output csv rows

Question

I have a very frustrating problem that I don't know how to solve. The following Python script processes a bunch of XML documents from a directory and extracts information from them; with that information, it creates a CSV file.

    import re
    import time
    import csv
    from lxml import etree as et
    from pathlib import Path
    from joblib import Parallel, delayed
    from tqdm import tqdm
    import ftfy

    st = time.time()

    XMLDIR = Path('/Users/josepm.fontana/Downloads/CICA_CORPUS_XML_CLEAN')
    files = [e for e in XMLDIR.iterdir() if e.is_file()]
    xml_doc = [f for f in files if f.with_suffix(".xml")]

    myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/TestDataSet19-6-23_YZh.csv"
    time_log = Path(
        '/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/log_time_pathlib.txt')
    results = Path(
        '/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/TestDataSet19-6-23_YZh.txt')

    tok_path = et.XPath('//tok | //dtok')


    def xml_extract(xml_doc):
        root_element = et.parse(xml_doc).getroot()
        autor = None
        data = None
        tipus = None
        dialecte = None
        header = root_element.find("header")
        if header is not None:
            for el in header:
                if el.get("type") == "autor":
                    autor = el.text
                    autor = ftfy.fix_text(autor)
                elif el.get("type") == "data":
                    data = el.text
                    data = ftfy.fix_text(data)
                elif el.get("type") == "tipologia":
                    tipus = el.text
                    tipus = ftfy.fix_text(tipus)
                elif el.get("type") == "dialecte":
                    dialecte = el.text
                    dialecte = ftfy.fix_text(dialecte)
        all_toks = tok_path(root_element)
        matching_toks = filter(lambda tok: tok.get('xpos') is not None and tok.get(
            'xpos').startswith('A') and not (tok.get('xpos').startswith('AX')), all_toks)
        for el in matching_toks:
            preceding_tok = el.xpath(
                "./preceding-sibling::tok[1][@lemma and @xpos]")
            preceding_tok_with_dtoks = el.xpath(
                "./preceding-sibling::tok[1][not(@lemma) and not(@xpos)]"
            )
            following_dtok_of_dtok = el.xpath("./preceding-sibling::dtok[1]")
            if el.tag == 'tok':
                tok_dtok = 'tok'
                Adj = "".join(el.itertext())
                Adj_lemma = el.get('lemma')
                Adj_xpos = el.get('xpos')
                Adj = ftfy.fix_text(Adj)
            elif el.tag == 'dtok':
                tok_dtok = 'dtok'
                Adj = el.get('form')
                Adj_lemma = el.get('lemma')
                Adj_xpos = el.get('xpos')
                Adj = ftfy.fix_text(Adj)
            pos = all_toks.index(el)
            RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
            RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
            if RelevantPrecedingElements:
                prec1 = RelevantPrecedingElements[-1]
            else:
                prec1 = None
            if RelevantFollowingElements:
                foll1 = RelevantFollowingElements[0]
            else:
                foll1 = None
            ElementsContext = all_toks[max(pos - 6, 0):pos + 1]
            context_list = []
            if ElementsContext:
                for elem in ElementsContext:
                    elem_text = "".join(elem.itertext())
                    assert elem_text is not None
                    context_list.append(elem_text)
            Adj = f"<{Adj}>"
            for elem in RelevantFollowingElements:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
            fol_lem = foll1.get('lemma') if foll1 is not None else None
            prec_lem = prec1.get('lemma') if prec1 is not None else None
            fol_xpos = foll1.get('xpos') if foll1 is not None else None
            prec_xpos = prec1.get('xpos') if prec1 is not None else None
            fol_form = None
            if foll1 is not None:
                if foll1.tag == "tok":
                    fol_form = foll1.text
                elif foll1.tag == "dtok":
                    fol_form = foll1.get("form")
            prec_form = None
            if prec1 is not None:
                if prec1.tag == "tok":
                    prec_form = prec1.text
                elif prec1.tag == "dtok":
                    prec_form = prec1.get("form")
            context = " ".join(context_list).replace(
                " ,", ",").replace(" .", ".").replace("  ", " ").replace("  ", " ")
            llista = [
                context,
                prec_form,
                Adj,
                fol_form,
                prec_lem,
                Adj_lemma,
                fol_lem,
                prec_xpos,
                Adj_xpos,
                fol_xpos,
                tok_dtok,
                xml_doc.name,
                autor,
                data,
                tipus,
                dialecte,
            ]
            writer = csv.writer(csv_file, delimiter=";")
            writer.writerow(llista)
            with open(results, "a") as Results:
                Results.write(f"@@@ {context} @@@\n\n")
                Results.write(f"Source: {xml_doc.name}\n\n\n")


    with open(myCSV_FILE, "a+", encoding="UTF8", newline='') as csv_file:
        #Parallel(n_jobs=-1, prefer="threads")(delayed(xml_extract)(xml_doc) for xml_doc in tqdm(files))
        Parallel(n_jobs=-1, prefer="threads")(delayed(xml_extract)(xml_doc) for xml_doc in tqdm(files) if not xml_doc.name.startswith("."))

    elapsed_time = time.time() - st
    with open(time_log, "a") as Myfile:
        Myfile.write(f"\n \n The end: The whole process took {elapsed_time} \n")

The text file that is created is perfect UTF-8. All of the XML documents have been double- and triple-checked to make sure they are also properly formatted as UTF-8.

At the end of every row of the CSV file that is created, however, there is a "\x{0D}" (carriage return) character.
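
For what it's worth, "\x{0D}" is just a carriage return ("\r"), which is a valid UTF-8 byte on its own; some tools merely flag it. A minimal diagnostic sketch (the path is hypothetical, point it at the actual output) to confirm which rows carry stray CR bytes:

    from pathlib import Path

    # Hypothetical path; adjust to the real CSV output file.
    csv_bytes = Path("/PathTo/My_Output.csv").read_bytes()

    # Split on LF and report any line that still contains CR bytes.
    for i, line in enumerate(csv_bytes.split(b"\n"), start=1):
        cr_count = line.count(b"\r")
        if cr_count:
            print(f"line {i}: {cr_count} carriage return(s)")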

I do not understand this at all. This script was based on the following script, which creates properly formatted CSV files where this problem does not occur. The main difference is that in the problematic code I introduced parallelization via the joblib library, because otherwise it took forever to process all those files.
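
One thing worth ruling out before blaming parallelism: Python's csv module's default dialect terminates every row with "\r\n" (per RFC 4180), and opening the file with newline='' deliberately disables newline translation, so a bare "\r" before each "\n" can show up even in single-threaded runs. A hedged workaround, if a Unix-style terminator is wanted, is to set it explicitly:

    import csv

    # Sketch: force "\n" row endings instead of the csv module's default "\r\n".
    with open("my_output.csv", "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, delimiter=";", lineterminator="\n")
        writer.writerow(["a", "b", "c"])  # this row now ends with "\n" only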

    import re
    import time
    import csv
    from lxml import etree as et
    from pathlib import Path

    st = time.time()

    #XMLDIR = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/CICA_WORKING_NEW')
    XMLDIR = Path('/Users/josepm.fontana/Downloads/CICA_CORPUS_XML_CLEAN')
    files = [e for e in XMLDIR.iterdir() if e.is_file()]
    xml_doc = [f for f in files if f.with_suffix(".xml")]

    myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/clitic_context_testTEST2.csv"
    time_log = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/log_time_pathlib.txt')
    results = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/resultsTEST2.txt')

    tok_path = et.XPath('//tok')


    def xml_extract(root_element):
        all_toks = tok_path(root_element)
        matching_toks = filter(lambda tok: re.match(r'^[EeLl][LlOoAa][Ss]*$', "".join(tok.itertext())) is not None and not(tok.get('xpos').startswith('D')), all_toks)
        for el in matching_toks:
            fake_clitic = "".join(el.itertext())
            pos = all_toks.index(el)
            RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
            print(RelevantPrecedingElements)
            prec1 = RelevantPrecedingElements[-1]
            #foll1 = all_toks[pos + 1]
            RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
            #prec1 = RelevantFollowingElements[]
            #foll1 = all_toks[pos + 1]
            print(RelevantFollowingElements)
            foll1 = RelevantFollowingElements[0]
            context_list = []
            context_clean = []
            for elem in RelevantPrecedingElements:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
                context_clean.append(elem_text)
            # adjective = '<' + str(el.text) + '>'
            fake_clitic = f"<{fake_clitic}>"
            fake_clitic_clean = f"{el.text}"
            print(fake_clitic)
            context_list.append(fake_clitic)
            context_clean.append(fake_clitic_clean)
            for elem in RelevantFollowingElements:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
                context_clean.append(elem_text)
            lema_fol = foll1.get('lemma') if foll1 is not None else None
            lema_prec = prec1.get('lemma') if prec1 is not None else None
            xpos_fol = foll1.get('xpos') if foll1 is not None else None
            xpos_prec = prec1.get('xpos') if prec1 is not None else None
            form_fol = foll1.text if foll1 is not None else None
            form_prec = prec1.text if prec1 is not None else None
            context = " ".join(context_list)
            clean_context = " ".join(context_clean).replace(" ,", ",").replace(" .", ".")
            print(f"Context is: {context}")
            llista = [
                context,
                lema_prec,
                xpos_prec,
                form_prec,
                fake_clitic,
                lema_fol,
                xpos_fol,
                form_fol,
            ]
            writer = csv.writer(csv_file, delimiter=";")
            writer.writerow(llista)
            with open(results, "a") as Results:
                Results.write(f"@@@ {context} @@@\n\n")
                Results.write(f"{clean_context}\n\n")
                Results.write(f"Source: {xml_doc.name}\n\n\n")


    with open(myCSV_FILE, "a+", encoding="UTF8", newline="") as csv_file:
        for xml_doc in files:
            if xml_doc.name.startswith("."):
                continue
            doc = xml_doc.stem  # this was
            print(doc)
            start_file_time_beforeParse = time.time()
            print(start_file_time_beforeParse)
            print(
                f"{time.time() - st} seconds after the beginning of the process I'm starting to get the root of {xml_doc.name}"
            )
            file_root = et.parse(xml_doc).getroot()
            xml_extract(file_root)
            print(
                f"I ran through {xml_doc.name} in {time.time() - start_file_time_beforeParse} seconds!"
            )
            with open(time_log, "a") as Myfile:
                Myfile.write("Time it took to getroot and parse ")
                Myfile.write(xml_doc.name)
                Myfile.write("\n")
                Myfile.write("Time it took to loop through the entire ")
                Myfile.write(xml_doc.name)
                Myfile.write(" is: ")
                Myfile.write(f"{time.time() - start_file_time_beforeParse} seconds!")
                Myfile.write("\n")
                Myfile.write("\n")

    elapsed_time = time.time() - st
    with open(time_log, "a") as Myfile:
        Myfile.write(f"\n \n The end: The whole process took {elapsed_time} \n")
    print("Execution time:", elapsed_time, "seconds")

I would greatly appreciate any help you can offer. This is really frustrating.

Here is a link to some sample XML files like the ones I'm trying to process:

Sample XML files

EDIT:

Adaptation of Zach Young's script for the problematic task (with /PathTo/ standing in for the actual file paths):

    import csv
    import re
    import time
    from pathlib import Path
    from lxml import etree as et

    beg_main = time.time()

    #xmls_dir = Path("./xmls")
    xmls_dir = Path('/PathTo/CLEAN_COMP_TEST2')
    files = [e for e in xmls_dir.iterdir() if e.is_file()]
    xml_files = [f for f in files if f.with_suffix(".xml")]

    csv_path = Path("/PathTo/My_Output.csv")
    csv_file = open(csv_path, "w", newline="", encoding="utf-8")
    writer = csv.writer(csv_file, delimiter=";")

    results_path = Path("/PathTo/my_results.txt")
    results = open(results_path, "w", encoding="utf-8")

    times_path = Path("/PathTo/my_times.txt")
    times = open(times_path, "w", encoding="utf-8")

    tok_path = et.XPath('//tok | //dtok')


    def xml_extract(doc_root, fname: str):
        all_toks = tok_path(doc_root)
        matching_toks = filter(
            lambda tok:
                tok.get('xpos') is not None
                and tok.get('xpos').startswith('A')
                and not tok.get('xpos').startswith('AX'),
            all_toks
        )
        for el in matching_toks:
            preceding_tok = el.xpath(
                "./preceding-sibling::tok[1][@lemma and @xpos]")
            preceding_tok_with_dtoks = el.xpath(
                "./preceding-sibling::tok[1][not(@lemma) and not(@xpos)]"
            )
            following_dtok_of_dtok = el.xpath("./preceding-sibling::dtok[1]")
            if el.tag == 'tok':
                tok_dtok = 'tok'
                Adj = "".join(el.itertext())
                Adj_lemma = el.get('lemma')
                Adj_xpos = el.get('xpos')
            elif el.tag == 'dtok':
                tok_dtok = 'dtok'
                Adj = el.get('form')
                Adj_lemma = el.get('lemma')
                Adj_xpos = el.get('xpos')
            pos = all_toks.index(el)
            RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
            RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
            if RelevantPrecedingElements:
                prec1 = RelevantPrecedingElements[-1]
            else:
                prec1 = None
            if RelevantFollowingElements:
                foll1 = RelevantFollowingElements[0]
            else:
                foll1 = None
            ElementsContext = all_toks[max(pos - 6, 0):pos + 1]
            context_list = []
            if ElementsContext:
                for elem in ElementsContext:
                    elem_text = "".join(elem.itertext())
                    assert elem_text is not None
                    context_list.append(elem_text)
            Adj = f"<{Adj}>"
            for elem in RelevantFollowingElements:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
            fol_lem = foll1.get('lemma') if foll1 is not None else None
            prec_lem = prec1.get('lemma') if prec1 is not None else None
            fol_xpos = foll1.get('xpos') if foll1 is not None else None
            prec_xpos = prec1.get('xpos') if prec1 is not None else None
            fol_form = None
            if foll1 is not None:
                if foll1.tag == "tok":
                    fol_form = foll1.text
                elif foll1.tag == "dtok":
                    fol_form = foll1.get("form")
            prec_form = None
            if prec1 is not None:
                if prec1.tag == "tok":
                    prec_form = prec1.text
                elif prec1.tag == "dtok":
                    prec_form = prec1.get("form")
            context = " ".join(context_list).replace(
                " ,", ",").replace(" .", ".").replace("  ", " ").replace("  ", " ")
            #print(f"Context is: {context}")
            llista = [
                context,
                prec_form,
                Adj,
                fol_form,
                prec_lem,
                Adj_lemma,
                fol_lem,
                prec_xpos,
                Adj_xpos,
                fol_xpos,
                tok_dtok,
                xml_file.name,
                autor,
                data,
                tipus,
                dialecte,
            ]
            writer.writerow(llista)
            results.write(f"@@@ {context} @@@\n\n")
            results.write(f"Source: {fname}\n\n\n")


    for xml_file in xml_files:
        if xml_file.name.startswith("."):
            continue
        beg_extract = time.time()
        doc_root = et.parse(xml_file, parser=None).getroot()
        obra = None
        autor = None
        data = None
        tipus = None
        dialecte = None
        header = doc_root.find("header")
        if header is not None:
            for el in header:
                if el.get("type") == "obra":
                    obra = el.text
                elif el.get("type") == "autor":
                    autor = el.text
                elif el.get("type") == "data":
                    data = el.text
                elif el.get("type") == "tipologia":
                    tipus = el.text
                elif el.get("type") == "dialecte":
                    dialecte = el.text
        xml_extract(doc_root, xml_file.name)
        times.write(f"Time to extract {xml_file.name}: {time.time() - beg_extract}s\n")

    elapsed = time.time() - beg_main
    times.write(f"\n \n The end: The whole process took {elapsed}s\n")
    print("Execution time:", elapsed, "seconds")

Answer 1

Score: 1

Based on our little discussion in the comments, I recommend starting with something like the following. You can open all the files once for writing at the very top, then reference them wherever you need to write (not in parallel, though, just synchronously):

    import csv
    import re
    import time
    from pathlib import Path
    from lxml import etree as et

    beg_main = time.time()

    xmls_dir = Path("./xmls")
    files = [e for e in xmls_dir.iterdir() if e.is_file()]
    # Compare f.suffix, not f.with_suffix(...): the latter returns a Path,
    # which is always truthy, so it would not filter anything.
    xml_files = [f for f in files if f.suffix == ".xml"]

    # Open every output file once, up front; write to them wherever needed.
    csv_path = Path("./my_output.csv")
    csv_file = open(csv_path, "w", newline="", encoding="utf-8")
    writer = csv.writer(csv_file, delimiter=";")

    results_path = Path("./my_results.txt")
    results = open(results_path, "w", encoding="utf-8")

    times_path = Path("./my_times.txt")
    times = open(times_path, "w", encoding="utf-8")

    tok_path = et.XPath("//tok")


    def xml_extract(doc_root, fname: str):
        all_toks = tok_path(doc_root)
        matching_toks = filter(
            lambda tok: (
                re.match(r"^[EeLl][LlOoAa][Ss]*$", "".join(tok.itertext())) is not None
                and not (tok.get("xpos").startswith("D"))
            ),
            all_toks,
        )
        for el in matching_toks:
            fake_clitic = "".join(el.itertext())
            pos = all_toks.index(el)
            RelevantPrecedingElements = all_toks[max(pos - 6, 0) : pos]
            prec1 = RelevantPrecedingElements[-1]
            RelevantFollowingElements = all_toks[pos + 1 : max(pos + 6, 1)]
            foll1 = RelevantFollowingElements[0]
            context_list = []
            context_clean = []
            for elem in RelevantPrecedingElements:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
                context_clean.append(elem_text)
            fake_clitic = f"<{fake_clitic}>"
            fake_clitic_clean = f"{el.text}"
            context_list.append(fake_clitic)
            context_clean.append(fake_clitic_clean)
            for elem in RelevantFollowingElements:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
                context_clean.append(elem_text)
            lema_fol = foll1.get("lemma") if foll1 is not None else None
            lema_prec = prec1.get("lemma") if prec1 is not None else None
            xpos_fol = foll1.get("xpos") if foll1 is not None else None
            xpos_prec = prec1.get("xpos") if prec1 is not None else None
            form_fol = foll1.text if foll1 is not None else None
            form_prec = prec1.text if prec1 is not None else None
            context = " ".join(context_list)
            clean_context = " ".join(context_clean).replace(" ,", ",").replace(" .", ".")
            llista = [
                context,
                lema_prec,
                xpos_prec,
                form_prec,
                fake_clitic,
                lema_fol,
                xpos_fol,
                form_fol,
            ]
            writer.writerow(llista)
            results.write(f"@@@ {context} @@@\n\n")
            results.write(f"{clean_context}\n\n")
            results.write(f"Source: {fname}\n\n\n")


    for xml_file in xml_files:
        if xml_file.name.startswith("."):
            continue
        beg_extract = time.time()
        doc_root = et.parse(xml_file, parser=None).getroot()
        xml_extract(doc_root, xml_file.name)
        times.write(f"Time to extract {xml_file.name}: {time.time() - beg_extract}s\n")

    elapsed = time.time() - beg_main
    times.write(f"\n \n The end: The whole process took {elapsed}s\n")
    print("Execution time:", elapsed, "seconds")

When the program exits, Python will close the files for you, so you don't need all the with open(...) blocks and the extra indentation.
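
If deterministic cleanup is still preferred without nesting everything in with-blocks, contextlib.ExitStack gives the same open-everything-once setup while guaranteeing the files are closed together on exit; a minimal sketch, with hypothetical file names:

    import csv
    from contextlib import ExitStack

    with ExitStack() as stack:
        # Each enter_context() registers the file for closing when the block exits.
        csv_file = stack.enter_context(
            open("my_output.csv", "w", newline="", encoding="utf-8"))
        results = stack.enter_context(open("my_results.txt", "w", encoding="utf-8"))
        writer = csv.writer(csv_file, delimiter=";")
        # ... all extraction and writing happens here; both files close together.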

I ran this version and your version on the 16 XML files you shared.

On my machine that makes some difference over opening the files inside xml_extract. Mine runs in about 80% of the time of yours (20% faster, give or take). I have dual-channel SSDs though, so my reads/writes are fast. If you don't have that kind of hardware, opening/writing/closing will take longer; I don't know if it's enough to account for the slowdown you experienced, though. To process all 16 files in the ZIP you shared, mine ran in 0.0055 seconds and yours in 0.0066 seconds. Also, in my trials I found that just commenting out your print/debug statements saved time too.

Try out my code on the sample XMLs you shared and see how its runtime compares with yours.

As for the weird write error, you'll always get that with multiple agents trying to write at once. If you really want/need to pursue parallelism, you'll need to figure out how to synchronize the writes so that only one process at a time can write to any one file... which might defeat the whole reason you wanted to parallelize in the first place.
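
One pattern that sidesteps write synchronization entirely is to have each worker return its rows and let the parent process do all the writing. A minimal sketch, assuming a hypothetical extract_rows function that parses one file and returns a list of CSV rows:

    import csv
    from pathlib import Path
    from joblib import Parallel, delayed

    def extract_rows(xml_path):
        rows = []
        # ... parse xml_path here and append one list per matching token ...
        return rows

    xml_files = [p for p in Path("./xmls").iterdir() if p.suffix == ".xml"]

    # Workers only compute; the parent does every write, so nothing interleaves.
    all_rows = Parallel(n_jobs=-1)(delayed(extract_rows)(p) for p in xml_files)

    with open("my_output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=";")
        for rows in all_rows:
            writer.writerows(rows)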

Lemme know how this turns out for you. Good luck!
