# Question

Looking for a way to run a Python script multiple times while converting txt files to csv

I am trying to convert multiple txt files to csv.

Here's my code:
```python
import pandas as pd
import pathlib

path = pathlib.Path(folderpathtotxtfiles)

def create_csv(filename):
    try:
        df = pd.read_csv(filename, sep='\s+', on_bad_lines='skip')
        df.to_csv(f'{filename}.csv', header=None, index=False)
    except:
        print(filename, 'is empty')

for filename in path.glob('*.txt'):
    create_csv(filename)
```
This gives me the required output, but is there any way to run this code multiple times so that each run only takes 10 files from my path and converts them to csv? For example, if I run it multiple times simultaneously, it should take the first 10 files, then the next 10 files, and so on.
# Answer 1

**Score**: 2
"You need some way to ensure that files processed in one run of the program are not processed in second and subsequent invocations."
"One way to do this is to create a directory to where you move the files that have already been processed. In that way they won't be 'seen' on subsequent runs."
"For optimum performance you should consider either multiprocessing or multithreading. In this case the former is probably most appropriate."
"Something like this:
import pandas
import os
import sys
import shutil
import glob
import re
from concurrent.futures import ProcessPoolExecutor
SOURCE_DIR = '/Volumes/G-Drive/src' # location of source files
SAVE_DIR = os.path.join(SOURCE_DIR, 'processed') # location of saved/processed files
BATCH_SIZE = 10
SUFFIX = 'txt'
def move(filename):
try:
os.makedirs(SAVE_DIR, exist_ok=True)
target = os.path.join(SAVE_DIR, os.path.basename(filename))
shutil.move(filename, target)
except Exception as e:
print(e, file=sys.stderr)
def create_csf(filename):
try:
df = pandas.read_csv(filename, sep=' ')
csv = re.sub(fr'{SUFFIX}$', 'csv', filename)
df.to_csv(csv, index=False)
move(filename)
except Exception as e:
print(e, file=sys.stderr)
def main():
with ProcessPoolExecutor() as ppe:
files = glob.glob(os.path.join(SOURCE_DIR, f'*.{SUFFIX}'))
ppe.map(create_csf, list(files)[:BATCH_SIZE])
if __name__ == '__main__':
main()"
# Answer 2

**Score**: 0

Yes, you can modify your code to process the files in batches of 10 at a time.
```python
import pandas as pd
import pathlib

folder_path = pathlib.Path(folder_path_to_txt_files)

def create_csv(filename):
    try:
        df = pd.read_csv(filename, sep='\s+', on_bad_lines='skip')
        df.to_csv(f'{filename}.csv', header=None, index=False)
    except:
        print(filename, 'is empty')

# Get the list of all txt files in the folder
file_list = list(folder_path.glob('*.txt'))

# Number of files to process at a time
batch_size = 10

while file_list:
    batch = file_list[:batch_size]
    file_list = file_list[batch_size:]
    for filename in batch:
        create_csv(filename)
```
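Note that the loop above converts everything in a single run. If the goal is for each separate invocation of the script to pick up only the next 10 unconverted files, one possible sketch (an assumption on my part, not part of the original answer, and using a hypothetical `txt_files` folder) is to skip any txt file that already has a matching `.csv` next to it:

```python
import pandas as pd
import pathlib

# Hypothetical folder path for illustration; substitute your own.
folder_path = pathlib.Path('txt_files')
batch_size = 10

def create_csv(filename):
    try:
        df = pd.read_csv(filename, sep=r'\s+', on_bad_lines='skip')
        df.to_csv(f'{filename}.csv', header=None, index=False)
    except Exception:
        print(filename, 'is empty')

# Only consider txt files that do not yet have a matching .csv,
# so each run of the script converts the next batch of 10.
pending = [f for f in sorted(folder_path.glob('*.txt'))
           if not pathlib.Path(f'{f}.csv').exists()]

for filename in pending[:batch_size]:
    create_csv(filename)
```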
I hope this helps.