Looking for a way to run a Python script multiple times while converting txt files to csv

# Question

I am trying to convert multiple txt files to csv.

Here's my code:

```python
import pandas as pd
import pathlib

path = pathlib.Path(folderpathtotxtfiles)

def create_csv(filename):
    try:
        df = pd.read_csv(filename, sep=r'\s+', on_bad_lines='skip')
        df.to_csv(f'{filename}.csv', header=None, index=False)
    except:
        print(filename, 'is empty')

for filename in path.glob('*.txt'):
    create_csv(filename)
```

This gives me the required output, but is there any way to run this code multiple times so that each run takes only 10 files from my path and converts them to csv? For example, if I run it multiple times simultaneously, it should take the first 10 files, then the next 10 files, and so on.




# Answer 1
**Score**: 2

You need some way to ensure that files processed in one run of the program are not processed again by second and subsequent invocations.

One way to do this is to create a directory into which you move the files that have already been processed. That way they won't be "seen" on subsequent runs.

For optimum performance you should consider either multiprocessing or multithreading; in this case the former is probably more appropriate, since the parsing work is largely CPU-bound.

Something like this:

```python
import pandas
import os
import sys
import shutil
import glob
import re
from concurrent.futures import ProcessPoolExecutor

SOURCE_DIR = '/Volumes/G-Drive/src'               # location of source files
SAVE_DIR = os.path.join(SOURCE_DIR, 'processed')  # location of saved/processed files
BATCH_SIZE = 10
SUFFIX = 'txt'

def move(filename):
    """Move a processed file into SAVE_DIR so later runs won't see it."""
    try:
        os.makedirs(SAVE_DIR, exist_ok=True)
        target = os.path.join(SAVE_DIR, os.path.basename(filename))
        shutil.move(filename, target)
    except Exception as e:
        print(e, file=sys.stderr)

def create_csv(filename):
    """Convert one txt file to csv, then move the original out of the way."""
    try:
        df = pandas.read_csv(filename, sep=' ')
        csv = re.sub(fr'{SUFFIX}$', 'csv', filename)
        df.to_csv(csv, index=False)
        move(filename)
    except Exception as e:
        print(e, file=sys.stderr)

def main():
    # Take at most BATCH_SIZE unprocessed files and convert them in parallel
    with ProcessPoolExecutor() as ppe:
        files = glob.glob(os.path.join(SOURCE_DIR, f'*.{SUFFIX}'))
        ppe.map(create_csv, files[:BATCH_SIZE])

if __name__ == '__main__':
    main()
```




# Answer 2
**Score**: 0

Yes, you can modify your code so that it processes the files in batches of 10 at a time.

```python
import pandas as pd
import pathlib

folder_path = pathlib.Path(folder_path_to_txt_files)

def create_csv(filename):
    try:
        df = pd.read_csv(filename, sep=r'\s+', on_bad_lines='skip')
        df.to_csv(f'{filename}.csv', header=None, index=False)
    except:
        print(filename, 'is empty')

# Get the list of all txt files in the folder
file_list = list(folder_path.glob('*.txt'))

# Number of files to process at a time
batch_size = 10

while file_list:
    batch = file_list[:batch_size]
    file_list = file_list[batch_size:]

    for filename in batch:
        create_csv(filename)
```

I hope this helps.
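
Note that this loop converts every batch within a single run, rather than ten files per invocation. If each separate run of the script should pick up where the previous one left off, one option is to persist a simple offset between runs. A minimal sketch extending the snippet above, assuming a hypothetical `progress.txt` state file next to the script (`folder_path_to_txt_files` is the same placeholder as above, and `create_csv` is the function defined above):

```python
import pathlib

STATE_FILE = pathlib.Path('progress.txt')  # hypothetical file remembering how far we got

def next_batch(folder_path, batch_size=10):
    """Return the next batch of txt files, advancing a persisted offset."""
    file_list = sorted(folder_path.glob('*.txt'))  # stable order across invocations
    offset = int(STATE_FILE.read_text()) if STATE_FILE.exists() else 0
    batch = file_list[offset:offset + batch_size]
    STATE_FILE.write_text(str(offset + len(batch)))
    return batch

for filename in next_batch(pathlib.Path(folder_path_to_txt_files)):
    create_csv(filename)
```

This handles sequential invocations; simultaneous runs would still race on the state file, in which case the atomic-rename approach sketched under Answer 1 is the safer route.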

