June 1, 2023, 00:18:15

Python script to combine RTF files in a folder to CSV with filename as a separate column

Question

I've tried a plethora of solutions, none of which work. I've figured out how to get the converted data into a CSV column, but all the text ends up in one cell, and I haven't figured out how to add the file names as a separate column.

import pandas as pd
import os
from striprtf.striprtf import rtf_to_text
dir_path = 'C:\\Users\\mairi\\Desktop\\testing txt to excel\\'
def getFiles():
    list = []
    FileNames = []
    for path in os.listdir(dir_path):
        if os.path.isfile(os.path.join(dir_path, path)):
            with open(os.path.join(dir_path, path)) as file:
                text = file.read()
                rtfText = rtf_to_text(text, encoding='utf-8')
                for text in rtfText:
                    list.append(rtfText)
                    FileNames.append(os.path.basename(rtfText))
    return list, FileNames
list, FileNames = getFiles()
Data = pd.DataFrame({'FileNames': FileNames, 'Data': list})
NewPath = 'C:\\Users\\mairi\\Desktop\\testing txt to excel\\NEW\\'
Data.to_csv(os.path.join(NewPath, r'Data.csv'), index=False, header=False)

I've spent a few days searching Stack Overflow for a solution, and now I seem to be getting duplicate file data in each row.

I may need to create an empty DataFrame before the function, but I haven't got that working yet. I think my main issues are:

Separating the text so each new line is a new cell
Keeping file content distinct so there aren't duplicate rows
Adding the file names

Ideally, the output would look like this:

Filename	Data
Filename_1	Data line 1
Filename_1	Data line 2
Filename_2	Data line 1
Filename_2	Data line 2
Filename_2	Data line 3

Thank you for any help

Answer 1

Score: 0

A friend helped me get out of this pickle.

I was converting and saving the whole text of each document instead of splitting it line by line.
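
As a rough illustration of that difference, here is a minimal sketch using only built-in string operations. The `sample` string is made up and merely stands in for whatever `rtf_to_text` returns; the variable names are hypothetical as well.

```python
# Hypothetical sample standing in for the plain text returned by rtf_to_text().
sample = "first line\nsecond line\n\nthird line"

# Iterating over a str yields single characters, so appending the whole
# text inside such a loop stores one full copy of the text per character.
whole_text_copies = [sample for _ in sample]

# splitlines() yields one item per line; the filter drops empty lines.
lines = [line for line in sample.splitlines() if line]

print(len(whole_text_copies))  # as many copies as there are characters
print(lines)
```

This would explain the duplicate rows in the question: each row held the complete text, repeated once for every character in the file.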

import pandas as pd
import os
from striprtf.striprtf import rtf_to_text

# Data structure: one line of text and the name of the file it came from
class fileLine:
    def __init__(self, fileName, textLine):
        self.fileName = fileName
        self.textLine = textLine

# Data file directories
data_dir_path = 'C:\\Users\\mairi\\Desktop\\testing txt to excel\\'
output_dir_path = 'C:\\Users\\mairi\\Desktop\\testing txt to excel\\NEW\\'

# Get all lines of all RTF files in a given directory
def getRTFLines():
    lines = []
    # For each entry in the given directory
    for filename in os.listdir(data_dir_path):
        filePath = os.path.join(data_dir_path, filename)
        # Check that it's a file with an .rtf extension
        if os.path.isfile(filePath) and filePath.endswith('.rtf'):
            # Open the RTF file, ignoring bytes that are not valid UTF-8
            with open(filePath, encoding='utf-8', errors='ignore') as file:
                # Read all of the RTF file's content
                text = file.read()
                # Convert the RTF markup to plain text
                rtfText = rtf_to_text(text, encoding='utf-8')
                # For each line of the plain text
                for line in rtfText.splitlines():
                    # Skip empty lines
                    if line:
                        # Store the line together with its file name
                        lines.append(fileLine(filename, line))
    return lines

# Read data
lines = getRTFLines()
# Convert data to a CSV file
data = [[x.fileName, x.textLine] for x in lines]
df = pd.DataFrame(data, columns=['File Name', 'Text Line'])
# index=False leaves the row index out of the CSV
df.to_csv(os.path.join(output_dir_path, 'rtfData-3.csv'), index=False)
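
The core step, pairing every non-empty line with its file name, can also be sketched without pandas or striprtf. The version below is only an illustration under stated assumptions: `text_to_rows` and the `converted` dictionary are hypothetical names, the sample text is made up and stands in for already-converted RTF content, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io


def text_to_rows(filename, text):
    """Pair every non-blank line of `text` with `filename` (hypothetical helper)."""
    return [(filename, line) for line in text.splitlines() if line.strip()]


# Made-up sample data: file name -> plain text already converted from RTF.
converted = {
    "Filename_1.rtf": "Data line 1\nData line 2\n",
    "Filename_2.rtf": "Data line 1\n\nData line 2\nData line 3",
}

# Collect one (file name, line) row per non-blank line across all files.
rows = []
for name, text in converted.items():
    rows.extend(text_to_rows(name, text))

# Write to an in-memory buffer; a real script would instead open the
# output file with newline="" and pass it to csv.writer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Filename", "Data"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Unlike the `if line:` check in the script above, this sketch uses `line.strip()`, so lines containing only whitespace are skipped too; whether that is desirable depends on the data.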
