Non UTF-8 compliant character "\x{0D}" at the end of output csv rows

Question

I have a very frustrating problem that I don't know how to solve. The following Python script processes a bunch of XML documents from a directory and it extracts information from them. With that information, it creates a csv file.

import re
import time
import csv
from lxml import etree as et
from pathlib import Path
from joblib import Parallel, delayed
from tqdm import tqdm
import ftfy

st = time.time()
XMLDIR = Path('/Users/josepm.fontana/Downloads/CICA_CORPUS_XML_CLEAN')
files = [e for e in XMLDIR.iterdir() if e.is_file()]
xml_doc = [f for f in files if f.with_suffix(".xml")]
myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/TestDataSet19-6-23_YZh.csv"
time_log = Path(
    '/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/log_time_pathlib.txt')
results = Path(
    '/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/TestDataSet19-6-23_YZh.txt')
tok_path = et.XPath('//tok | //dtok')


def xml_extract(xml_doc):
    root_element = et.parse(xml_doc).getroot()
    autor = None
    data = None
    tipus = None
    dialecte = None
    header = root_element.find("header")
    if header is not None:
        for el in header:
            if el.get("type") == "autor":
                autor = el.text
                autor = ftfy.fix_text(autor)
            elif el.get("type") == "data":
                data = el.text
                data = ftfy.fix_text(data)
            elif el.get("type") == "tipologia":
                tipus = el.text
                tipus = ftfy.fix_text(tipus)
            elif el.get("type") == "dialecte":
                dialecte = el.text
                dialecte = ftfy.fix_text(dialecte)
    all_toks = tok_path(root_element)
    matching_toks = filter(lambda tok: tok.get('xpos') is not None and tok.get(
        'xpos').startswith('A') and not (tok.get('xpos').startswith('AX')), all_toks)
    for el in matching_toks:
        preceding_tok = el.xpath(
            "./preceding-sibling::tok[1][@lemma and @xpos]")
        preceding_tok_with_dtoks = el.xpath(
            "./preceding-sibling::tok[1][not(@lemma) and not(@xpos)]"
        )
        following_dtok_of_dtok = el.xpath("./preceding-sibling::dtok[1]")
        if el.tag == 'tok':
            tok_dtok = 'tok'
            Adj = "".join(el.itertext())
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
            Adj = ftfy.fix_text(Adj)
        elif el.tag == 'dtok':
            tok_dtok = 'dtok'
            Adj = el.get('form')
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
            Adj = ftfy.fix_text(Adj)
        pos = all_toks.index(el)
        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
        if RelevantPrecedingElements:
            prec1 = RelevantPrecedingElements[-1]
        else:
            prec1 = None
        if RelevantFollowingElements:
            foll1 = RelevantFollowingElements[0]
        else:
            foll1 = None
        ElementsContext = all_toks[max(pos - 6, 0):pos + 1]
        context_list = []
        if ElementsContext:
            for elem in ElementsContext:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
        Adj = f"<{Adj}>"
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
        fol_lem = foll1.get('lemma') if foll1 is not None else None
        prec_lem = prec1.get('lemma') if prec1 is not None else None
        fol_xpos = foll1.get('xpos') if foll1 is not None else None
        prec_xpos = prec1.get('xpos') if prec1 is not None else None
        fol_form = None
        if foll1 is not None:
            if foll1.tag == "tok":
                fol_form = foll1.text
            elif foll1.tag == "dtok":
                fol_form = foll1.get("form")
        prec_form = None
        if prec1 is not None:
            if prec1.tag == "tok":
                prec_form = prec1.text
            elif prec1.tag == "dtok":
                prec_form = prec1.get("form")
        context = " ".join(context_list).replace(
            " ,", ",").replace(" .", ".").replace("   ", " ").replace("  ", " ")
        llista = [
            context,
            prec_form,
            Adj,
            fol_form,
            prec_lem,
            Adj_lemma,
            fol_lem,
            prec_xpos,
            Adj_xpos,
            fol_xpos,
            tok_dtok,
            xml_doc.name,
            autor,
            data,
            tipus,
            dialecte,
        ]
        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)
        with open(results, "a") as Results:
            Results.write(f"@@@ {context} @@@\n\n")
            Results.write(f"Source: {xml_doc.name}\n\n\n")


with open(myCSV_FILE, "a+", encoding="UTF8", newline='') as csv_file:
    #Parallel(n_jobs=-1,  prefer="threads")(delayed(xml_extract)(xml_doc) for xml_doc in tqdm(files))
    Parallel(n_jobs=-1, prefer="threads")(delayed(xml_extract)(xml_doc) for xml_doc in tqdm(files) if not xml_doc.name.startswith("."))

elapsed_time = time.time() - st
with open(
    time_log, "a"
) as Myfile:
    Myfile.write(f"\n \n The end: The whole process took {elapsed_time} \n")

The text file that is created is perfect UTF-8. All of the XML documents have been double-checked and triple-checked to make sure they are all properly formatted as UTF-8 as well.

At the end of every row of the csv file that is created, however, there is the "\x{0D}" character.
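For reference, `\x0D` is a carriage return, and Python's `csv.writer` ends every row with `\r\n` by default (the `lineterminator` of its default `excel` dialect). When the output file is opened with `newline=""`, that terminator is written unchanged. The sketch below uses an in-memory buffer instead of the real output file, and the helper name `row_text` is made up for illustration. Whether this accounts for the character seen here is an assumption; nothing in this thread confirms it.

```python
import csv
import io


def row_text(**writer_kwargs):
    """Write one row to an in-memory buffer and return the exact text produced."""
    # newline="" mirrors open(..., newline=""): no newline translation happens.
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=";", **writer_kwargs)
    writer.writerow(["context", "lemma", "xpos"])
    return buf.getvalue()


default_row = row_text()                    # default dialect settings
unix_row = row_text(lineterminator="\n")    # explicit line-feed-only terminator

print(repr(default_row))
print(repr(unix_row))
```

Comparing the two `repr` outputs shows whether the trailing carriage return comes from the writer itself rather than from the data.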

I do not understand this at all. This script was based on the following script that creates properly formatted csv files where this problem does not occur. The main difference is that in the problematic code I introduced parallelization via the 'joblib' library because otherwise it took forever to process all those files.

import re
import time
import csv
from lxml import etree as et
from pathlib import Path

st = time.time()
#XMLDIR = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/CICA_WORKING_NEW')
XMLDIR = Path('/Users/josepm.fontana/Downloads/CICA_CORPUS_XML_CLEAN')
files = [e for e in XMLDIR.iterdir() if e.is_file()]
xml_doc = [f for f in files if f.with_suffix(".xml")]
myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/clitic_context_testTEST2.csv"
time_log = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/log_time_pathlib.txt')
results = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/resultsTEST2.txt')
tok_path = et.XPath('//tok')


def xml_extract(root_element):
    all_toks = tok_path(root_element)
    matching_toks = filter(lambda tok: re.match(r'^[EeLl][LlOoAa][Ss]*$', "".join(tok.itertext())) is not None and not(tok.get('xpos').startswith('D')), all_toks)
    for el in matching_toks:
        fake_clitic = "".join(el.itertext())
        pos = all_toks.index(el)
        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
        print(RelevantPrecedingElements)
        prec1 = RelevantPrecedingElements[-1]
        #foll1 = all_toks[pos + 1]
        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
        #prec1 = RelevantFollowingElements[]
        #foll1 = all_toks[pos + 1]
        print(RelevantFollowingElements)
        foll1 = RelevantFollowingElements[0]
        context_list = []
        context_clean = []
        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)
        # adjective = '<' + str(el.text) + '>'
        fake_clitic = f"<{fake_clitic}>"
        fake_clitic_clean = f"{el.text}"
        print(fake_clitic)
        context_list.append(fake_clitic)
        context_clean.append(fake_clitic_clean)
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)
        lema_fol = foll1.get('lemma') if foll1 is not None else None
        lema_prec = prec1.get('lemma') if prec1 is not None else None
        xpos_fol = foll1.get('xpos') if foll1 is not None else None
        xpos_prec = prec1.get('xpos') if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None
        context = " ".join(context_list)
        clean_context = " ".join(context_clean).replace(" ,", ",").replace(" .", ".")
        print(f"Context is: {context}")
        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            fake_clitic,
            lema_fol,
            xpos_fol,
            form_fol,
        ]
        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)
        with open(
            results, "a"
        ) as Results:
            Results.write(f"@@@ {context} @@@\n\n")
            Results.write(f"{clean_context}\n\n")
            Results.write(f"Source: {xml_doc.name}\n\n\n")


with open(myCSV_FILE, "a+", encoding="UTF8", newline="") as csv_file:
    for xml_doc in files:
        if xml_doc.name.startswith("."):
            continue
        doc = xml_doc.stem # this was 
        print(doc)
        start_file_time_beforeParse = time.time()
        print(start_file_time_beforeParse)
        print(
            f"{time.time() - st} seconds after the beginning of the process I'm starting to get the root of {xml_doc.name}"
        )
        file_root = et.parse(xml_doc).getroot()
        xml_extract(file_root)
        print(
            f"I ran through {xml_doc.name} in {time.time() - start_file_time_beforeParse} seconds!"
        )
        with open(
            time_log, "a"
        ) as Myfile:
            Myfile.write("Time it took to getroot and parse ")
            Myfile.write(xml_doc.name)
            Myfile.write("\n")
            Myfile.write("Time it took to loop through the entire ")
            Myfile.write(xml_doc.name)
            Myfile.write(" is: ")
            Myfile.write(f"{time.time() - start_file_time_beforeParse} seconds!")
            Myfile.write("\n")
            Myfile.write("\n")

elapsed_time = time.time() - st
with open(
    time_log, "a"
) as Myfile:
    Myfile.write(f"\n \n The end: The whole process took {elapsed_time} \n")
print("Execution time:", elapsed_time, "seconds")

I would greatly appreciate any help you can offer. This is really frustrating.

Here is a link to some sample XML files like the ones I'm trying to process:

Sample XML files

EDIT:

Adaptation of Zach Young's script for problematic task:

import csv
import re
import time
from pathlib import Path
from lxml import etree as et

beg_main = time.time()

#xmls_dir = Path("./xmls")
xmls_dir = Path('/PathTo/CLEAN_COMP_TEST2')
files = [e for e in xmls_dir.iterdir() if e.is_file()]
xml_files = [f for f in files if f.with_suffix(".xml")]

csv_path = Path("/PathTo/My_Output.csv")
csv_file = open(csv_path, "w", newline="", encoding="utf-8")
writer = csv.writer(csv_file, delimiter=";")

results_path = Path("/PathTo/my_results.txt")
results = open(results_path, "w", encoding="utf-8")

times_path = Path("/PathTo/my_times.txt")
times = open(times_path, "w", encoding="utf-8")

tok_path = et.XPath('//tok | //dtok')


def xml_extract(doc_root, fname: str):
    all_toks = tok_path(doc_root)
    matching_toks = filter(
        lambda tok:
            tok.get('xpos') is not None and tok.get('xpos').startswith('A') and not (tok.get('xpos').startswith('AX')),
        all_toks
    )
    for el in matching_toks:
        preceding_tok = el.xpath(
            "./preceding-sibling::tok[1][@lemma and @xpos]")
        preceding_tok_with_dtoks = el.xpath(
            "./preceding-sibling::tok[1][not(@lemma) and not(@xpos)]"
        )
        following_dtok_of_dtok = el.xpath("./preceding-sibling::dtok[1]")
        if el.tag == 'tok':
            tok_dtok = 'tok'
            Adj = "".join(el.itertext())
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
        elif el.tag == 'dtok':
            tok_dtok = 'dtok'
            Adj = el.get('form')
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
        pos = all_toks.index(el)
        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
        if RelevantPrecedingElements:
            prec1 = RelevantPrecedingElements[-1]
        else:
            prec1 = None
        if RelevantFollowingElements:
            foll1 = RelevantFollowingElements[0]
        else:
            foll1 = None
        ElementsContext = all_toks[max(pos - 6, 0):pos + 1]
        context_list = []
        if ElementsContext:
            for elem in ElementsContext:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
        Adj = f"<{Adj}>"
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
        fol_lem = foll1.get('lemma') if foll1 is not None else None
        prec_lem = prec1.get('lemma') if prec1 is not None else None
        fol_xpos = foll1.get('xpos') if foll1 is not None else None
        prec_xpos = prec1.get('xpos') if prec1 is not None else None
        fol_form = None
        if foll1 is not None:
            if foll1.tag == "tok":
                fol_form = foll1.text
            elif foll1.tag == "dtok":
                fol_form = foll1.get("form")
        prec_form = None
        if prec1 is not None:
            if prec1.tag == "tok":
                prec_form = prec1.text
            elif prec1.tag == "dtok":
                prec_form = prec1.get("form")
        context = " ".join(context_list).replace(
            " ,", ",").replace(" .", ".").replace("   ", " ").replace("  ", " ")
        #print(f"Context is: {context}")
        llista = [
            context,
            prec_form,
            Adj,
            fol_form,
            prec_lem,
            Adj_lemma,
            fol_lem,
            prec_xpos,
            Adj_xpos,
            fol_xpos,
            tok_dtok,
            xml_file.name,
            autor,
            data,
            tipus,
            dialecte,
        ]
        writer.writerow(llista)
        results.write(f"@@@ {context} @@@\n\n")
        results.write(f"Source: {fname}\n\n\n")


for xml_file in xml_files:
    if xml_file.name.startswith("."):
        continue
    beg_extract = time.time()
    doc_root = et.parse(xml_file, parser=None).getroot()
    obra = None
    autor = None
    data = None
    tipus = None
    dialecte = None
    header = doc_root.find("header")
    if header is not None:
        for el in header:
            if el.get("type") == "obra":
                obra = el.text
            elif el.get("type") == "autor":
                autor = el.text
            elif el.get("type") == "data":
                data = el.text
            elif el.get("type") == "tipologia":
                tipus = el.text
            elif el.get("type") == "dialecte":
                dialecte = el.text
    xml_extract(doc_root, xml_file.name)
    times.write(f"Time to extract {xml_file.name}: {time.time() - beg_extract}s\n")

elapsed = time.time() - beg_main
times.write(f"\n \n The end: The whole process took {elapsed}s\n")
print("Execution time:", elapsed, "seconds")

Answer 1

Score: 1

Based on our little discussion in the comments, I recommend starting with something like the following. You can open all the files once for writing at the very top, then reference them wherever you need to write (not in parallel, though, just synchronously):

import csv
import re
import time

from pathlib import Path

from lxml import etree as et

beg_main = time.time()

xmls_dir = Path("./xmls")
files = [e for e in xmls_dir.iterdir() if e.is_file()]
# Compare the suffix: f.with_suffix(".xml") returns a Path, which is always truthy
xml_files = [f for f in files if f.suffix == ".xml"]

csv_path = Path("./my_output.csv")
csv_file = open(csv_path, "w", newline="", encoding="utf-8")
writer = csv.writer(csv_file, delimiter=";")

results_path = Path("./my_results.txt")
results = open(results_path, "w", encoding="utf-8")

times_path = Path("./my_times.txt")
times = open(times_path, "w", encoding="utf-8")

tok_path = et.XPath("//tok")

def xml_extract(doc_root, fname: str):
    all_toks = tok_path(doc_root)

    matching_toks = filter(
        lambda tok: (
            re.match(r"^[EeLl][LlOoAa][Ss]*$", "".join(tok.itertext())) is not None
            and not (tok.get("xpos").startswith("D"))
        ),
        all_toks,
    )

    for el in matching_toks:
        fake_clitic = "".join(el.itertext())
        pos = all_toks.index(el)

        RelevantPrecedingElements = all_toks[max(pos - 6, 0) : pos]

        prec1 = RelevantPrecedingElements[-1]

        RelevantFollowingElements = all_toks[pos + 1 : max(pos + 6, 1)]

        foll1 = RelevantFollowingElements[0]

        context_list = []
        context_clean = []

        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)

        fake_clitic = f"<{fake_clitic}>"
        fake_clitic_clean = f"{el.text}"

        context_list.append(fake_clitic)
        context_clean.append(fake_clitic_clean)

        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)

        lema_fol = foll1.get("lemma") if foll1 is not None else None
        lema_prec = prec1.get("lemma") if prec1 is not None else None
        xpos_fol = foll1.get("xpos") if foll1 is not None else None
        xpos_prec = prec1.get("xpos") if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None

        context = " ".join(context_list)
        clean_context = " ".join(context_clean).replace(" ,", ",").replace(" .", ".")

        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            fake_clitic,
            lema_fol,
            xpos_fol,
            form_fol,
        ]

        writer.writerow(llista)

        results.write(f"@@@ {context} @@@\n\n")
        results.write(f"{clean_context}\n\n")
        results.write(f"Source: {fname}\n\n\n")

for xml_file in xml_files:
    if xml_file.name.startswith("."):
        continue

    beg_extract = time.time()
    doc_root = et.parse(xml_file, parser=None).getroot()
    xml_extract(doc_root, xml_file.name)

    times.write(f"Time to extract {xml_file.name}: {time.time() - beg_extract}s\n")

elapsed = time.time() - beg_main
times.write(f"\n \n The end: The whole process took {elapsed}s\n")

print("Execution time:", elapsed, "seconds")
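A side note on the `.xml` filter that appears in the question's scripts: `Path.with_suffix(".xml")` returns a new path with the suffix replaced, and a `Path` object is always truthy, so `[f for f in files if f.with_suffix(".xml")]` keeps every file. Comparing `Path.suffix` is one way to filter by extension. The file names below are made up for illustration.

```python
from pathlib import Path

# Hypothetical file names, used only to illustrate the two expressions.
files = [Path("doc1.xml"), Path("notes.txt"), Path(".DS_Store")]

# with_suffix() builds a new Path; the result is truthy for every entry here,
# so nothing is filtered out.
kept_by_with_suffix = [f for f in files if f.with_suffix(".xml")]

# Comparing the actual suffix keeps only the XML file.
kept_by_suffix = [f for f in files if f.suffix == ".xml"]

print(kept_by_with_suffix)
print(kept_by_suffix)
```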

When the program exits, Python will close the files for you, so you don't need all the with open(...) and the indentation.

I ran this version and your version on the 16 XML files you shared.

On my machine that makes some difference compared with opening the files inside xml_extract: mine runs in about 80% of the time yours takes (roughly 20% faster). I have dual-channel SSDs though, so my reads/writes are fast. If you don't have that kind of hardware, opening/writing/closing will take longer. I don't know if it's enough to explain the slowdown you experienced, though. To process all 16 files in the ZIP you shared, mine ran in 0.0055 seconds, and yours ran in 0.0066 seconds. Also, in my trials I found that just commenting out your print/debug statements saved time too.
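Timings of a few milliseconds are easily swamped by noise, so a comparison like this is more convincing when each version is run several times. A minimal sketch, assuming the work to be timed can be wrapped in a function; `best_of` and `sample_workload` are hypothetical names, and the workload is a stand-in rather than the real XML extraction:

```python
import time


def best_of(func, repeats=5):
    """Run func() several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def sample_workload():
    # Stand-in for "parse and extract all XML files"; purely illustrative.
    return sum(i * i for i in range(10_000))


fastest = best_of(sample_workload)
print(f"Fastest of 5 runs: {fastest:.6f}s")
```

Taking the minimum of several runs is one common convention; it reduces the influence of one-off delays such as disk caching on the first run.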

Try out my code on the sample XMLs you shared and see what it runs in compared to yours.

As for the weird write error, you'll always get that with multiple agents trying to write at once. If you really want/need to pursue parallelism, you'll need to figure out how to sync the writes so only one process at a time tries/can write to any one file... which might defeat the whole reason you wanted to parallelize in the first place.
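One way to sync the writes is to let the workers only compute rows and have a single thread do all the writing. The sketch below uses the standard library's `concurrent.futures` rather than joblib, writes to an in-memory buffer instead of a real file, and replaces the extraction with a hypothetical `extract_rows` function, so it illustrates the pattern and is not a drop-in replacement. Whether it recovers any speed-up on the real corpus would need to be measured.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def extract_rows(doc_name):
    """Hypothetical stand-in for xml_extract: returns rows instead of writing them."""
    return [[f"context from {doc_name}", doc_name]]


def write_all(doc_names, out_file):
    """Extract in worker threads, but write every row from the calling thread only."""
    writer = csv.writer(out_file, delimiter=";")
    with ThreadPoolExecutor() as pool:
        # map() yields results in input order, so the output order is deterministic.
        for rows in pool.map(extract_rows, doc_names):
            writer.writerows(rows)


buffer = io.StringIO(newline="")  # in-memory stand-in for the CSV file
write_all(["a.xml", "b.xml", "c.xml"], buffer)
print(buffer.getvalue())
```

Because only the calling thread touches the writer, rows from different documents cannot interleave, and no lock is needed.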

Lemme know how this turns out for you. Good luck!

huangapple
  • Posted on June 19, 2023, 23:20:41
  • Link to this post: https://go.coder-hub.com/76508000.html