Looking for a way to run a Python script multiple times while converting txt files to csv

# Question

I am trying to convert multiple txt files to csv.

Here's my code:

```python
import pandas as pd
import pathlib

path = pathlib.Path(folderpathtotxtfiles)

def create_csv(filename):
    try:
        df = pd.read_csv(filename, sep=r'\s+', on_bad_lines='skip')
        df.to_csv(f'{filename}.csv', header=None, index=False)
    except:
        print(filename, 'is empty')

for filename in path.glob('*.txt'):
    create_csv(filename)
```

This gives me the required output, but is there any way to run this code multiple times so that each run takes only 10 files from my path and converts them to csv? For example, if I run it multiple times simultaneously, it should take the first 10 files, then the next 10 files, and so on.




# Answer 1
**Score**: 2

You need some way to ensure that files processed in one run of the program are not processed again by second and subsequent invocations.

One way to do this is to create a directory into which you move the files that have already been processed. That way they won't be "seen" on subsequent runs.

For optimum performance you should consider either multiprocessing or multithreading; in this case the former is probably more appropriate, since the parsing work is largely CPU-bound.

Something like this:

```python
import pandas
import os
import sys
import shutil
import glob
import re
from concurrent.futures import ProcessPoolExecutor

SOURCE_DIR = '/Volumes/G-Drive/src'               # location of source files
SAVE_DIR = os.path.join(SOURCE_DIR, 'processed')  # location of saved/processed files
BATCH_SIZE = 10
SUFFIX = 'txt'

def move(filename):
    """Move a processed file into SAVE_DIR so later runs won't see it."""
    try:
        os.makedirs(SAVE_DIR, exist_ok=True)
        target = os.path.join(SAVE_DIR, os.path.basename(filename))
        shutil.move(filename, target)
    except Exception as e:
        print(e, file=sys.stderr)

def create_csv(filename):
    """Convert one txt file to csv, then move the original out of the way."""
    try:
        df = pandas.read_csv(filename, sep=' ')
        csv = re.sub(fr'{SUFFIX}$', 'csv', filename)
        df.to_csv(csv, index=False)
        move(filename)
    except Exception as e:
        print(e, file=sys.stderr)

def main():
    # Take at most BATCH_SIZE unprocessed files and convert them in parallel
    with ProcessPoolExecutor() as ppe:
        files = glob.glob(os.path.join(SOURCE_DIR, f'*.{SUFFIX}'))
        ppe.map(create_csv, files[:BATCH_SIZE])

if __name__ == '__main__':
    main()
```




# Answer 2
**Score**: 0

Yes, you can modify your code so that it processes the files in batches of 10 at a time.

```python
import pandas as pd
import pathlib

folder_path = pathlib.Path(folder_path_to_txt_files)

def create_csv(filename):
    try:
        df = pd.read_csv(filename, sep=r'\s+', on_bad_lines='skip')
        df.to_csv(f'{filename}.csv', header=None, index=False)
    except:
        print(filename, 'is empty')

# Get the list of all txt files in the folder
file_list = list(folder_path.glob('*.txt'))

# Number of files to process at a time
batch_size = 10

while file_list:
    batch = file_list[:batch_size]
    file_list = file_list[batch_size:]

    for filename in batch:
        create_csv(filename)
```

I hope this helps.
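
Note that this loop converts every batch within a single run, rather than ten files per invocation. If each separate run of the script should pick up where the previous one left off, one option is to persist a simple offset between runs. A minimal sketch extending the snippet above, assuming a hypothetical `progress.txt` state file next to the script (`folder_path_to_txt_files` is the same placeholder as above, and `create_csv` is the function defined above):

```python
import pathlib

STATE_FILE = pathlib.Path('progress.txt')  # hypothetical file remembering how far we got

def next_batch(folder_path, batch_size=10):
    """Return the next batch of txt files, advancing a persisted offset."""
    file_list = sorted(folder_path.glob('*.txt'))  # stable order across invocations
    offset = int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0
    batch = file_list[offset:offset + batch_size]
    STATE_FILE.write_text(str(offset + len(batch)))
    return batch

for filename in next_batch(pathlib.Path(folder_path_to_txt_files)):
    create_csv(filename)
```

This handles sequential invocations; simultaneous runs would still race on the state file, in which case the atomic-rename approach sketched under Answer 1 is the safer route.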

