Split long sentences of a text file around the middle on comma (multiple commas)

Question

I have a .srt file that I'd like to split so I can watch it in mpv. It's a whole book converted to .srt for language learning, with an audiobook to go along with it.

My problem is that it's in Japanese, which has no spaces between words, so mpv doesn't wrap long sentences. Instead, it shrinks them until they fit on one line.

I tried Subtitle Edit, but it doesn't work for Japanese.

So I'm trying to write my own script, although I don't know much about scripting.
I'm stuck on how to break a sentence that has multiple commas: how do I choose the one closest to the middle?

Here's what I've got so far:

```python
with open("test.txt", encoding="utf8") as file:
    for line in file:
        # print(line)
        size = len(line)
        if size > 45:
            # break the sentence in half at a Japanese comma (、)
            pass  # not implemented yet
```

Here's the text file I'm using for testing:

10
00:00:55,640 --> 00:01:09,580
クラスで一番、明るくて、優しくて、運動神経がよくて、しかも、頭もよくて、みんなその子と友達になりたがる。

11
00:01:11,090 --> 00:01:24,500
だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。

12
00:01:24,730 --> 00:01:32,250
私に近づき、「こころちゃん、ひさしぶり!」

13
00:01:32,910 --> 00:01:35,180
と挨拶をする。

14
00:01:37,450 --> 00:01:41,730
周りの子がみんな息を吞む中、「前から知ってるの。

15
00:01:42,000 --> 00:01:42,820
ね?」

16
00:01:43,820 --> 00:01:46,550
と私に目配せをする。

Answer 1

Score: 0

I had trouble when I tried to read and write the file with a single open() call, so my solution does the following: read all the lines into a list, go through the list and find every line longer than 45 characters, find the comma nearest the middle, and replace the line with the text before the comma followed by the text after it. Once done, write the list back to the file.

```python
def findCommaNearMiddle(line):
    length = len(line)
    middle = length // 2
    # check positions on either side of the middle until a comma is found
    for distance in range(middle + 1):
        for index in (middle + distance, middle - distance):
            if index < length and line[index] == '、':
                return index
    return -1  # no comma in this line


with open("test.txt", "r", encoding="utf8") as file:
    fileLines = file.read().split('\n')

i = 0
while i < len(fileLines):  # the list grows as lines are split
    line = fileLines[i]
    if len(line) > 45:
        middleComma = findCommaNearMiddle(line)
        # skip lines with no comma, or with the comma at either end
        if 0 < middleComma < len(line) - 1:
            fileLines[i] = line[:middleComma]
            fileLines.insert(i + 1, line[middleComma + 1:])  # +1 to get rid of the comma
    i += 1

with open("test.txt", "w", encoding="utf8") as file:
    file.write('\n'.join(fileLines))
```

If you want to split on characters other than '、', extend the comparison in findCommaNearMiddle, for example to line[index] in '、。'.
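
As a rough sketch of the multi-delimiter idea above (not part of the original answer): the helper names split_near_middle and wrap_line and the delimiter set '、。' are assumptions made for illustration. Unlike the code above, this version keeps the delimiter at the end of the first piece, and it keeps splitting until every piece fits or no delimiter is left.

```python
def split_near_middle(text, delimiters="、。"):
    """Split text after the delimiter closest to its middle.

    Returns (first, second); second is "" when no usable delimiter exists.
    """
    middle = len(text) // 2
    # Candidate split points sit just after each delimiter. A delimiter in
    # the last position is ignored because it would leave an empty second half.
    candidates = [i + 1 for i, ch in enumerate(text[:-1]) if ch in delimiters]
    if not candidates:
        return text, ""
    split_index = min(candidates, key=lambda i: abs(i - middle))
    return text[:split_index], text[split_index:]


def wrap_line(text, max_len=45, delimiters="、。"):
    """Recursively split text until every piece fits within max_len."""
    if len(text) <= max_len:
        return [text]
    first, second = split_near_middle(text, delimiters)
    if not second:  # nothing to split on; leave the line as it is
        return [text]
    return wrap_line(first, max_len, delimiters) + wrap_line(second, max_len, delimiters)


# Subtitle 11 from the question's test file.
sample = "だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。"
for piece in wrap_line(sample):
    print(piece)
```

On that sample line the split happens once, after 気づいて、, which is the delimiter nearest the middle. A line with no delimiter at all is returned unchanged, so it would still need a different fallback.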

Answer 2

Score: 0

You can locate the comma closest to the middle of the sentence and split the sentence there. The code below looks at the last comma before the middle and the first comma at or after it, and keeps whichever is nearer. If the line has no comma, it splits at the middle.

```python
with open("test.txt", encoding="utf8") as file:
    for line in file:
        line = line.strip()
        size = len(line)
        if size > 45:
            middle = size // 2
            # Last comma before the middle and first comma at or after it
            before = line.rfind("、", 0, middle)
            after = line.find("、", middle)
            # Keep whichever candidate is closer to the middle
            candidates = [i for i in (before, after) if i != -1]
            if candidates:
                comma_index = min(candidates, key=lambda i: abs(i - middle))
                split_index = comma_index + 1  # Split after the comma
            else:
                split_index = middle  # No comma at all: split at the middle

            # Split the line at split_index
            first_line = line[:split_index].strip()
            second_line = line[split_index:].strip()
            print(first_line)
            if second_line:  # Avoid a blank line, which would end the subtitle block
                print(second_line)
        else:
            print(line)
```
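
To get a file that mpv can load, the split lines have to be written out rather than printed. The following is a minimal sketch under stated assumptions: wrap_srt, split_line, and the file names are hypothetical, and the timing-line pattern assumes the HH:MM:SS,mmm --> HH:MM:SS,mmm form used in the sample file. It writes to a separate output file so the original stays untouched.

```python
import re
from pathlib import Path

# Matches an SRT timing line such as "00:00:55,640 --> 00:01:09,580".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def split_line(text, comma="、"):
    """Split text after the comma closest to the middle, else at the middle."""
    middle = len(text) // 2
    before = text.rfind(comma, 0, middle)
    after = text.find(comma, middle, len(text) - 1)  # ignore a trailing comma
    candidates = [i + 1 for i in (before, after) if i != -1]
    index = min(candidates, key=lambda i: abs(i - middle)) if candidates else middle
    return [text[:index], text[index:]]


def wrap_srt(src, dst, max_len=45):
    """Copy src to dst, splitting over-long subtitle text lines in two."""
    lines = Path(src).read_text(encoding="utf8").split("\n")
    out = []
    for line in lines:
        # Timing lines are never split, whatever the limit is.
        if len(line) > max_len and not TIMING.match(line):
            out.extend(split_line(line))
        else:
            out.append(line)
    Path(dst).write_text("\n".join(out), encoding="utf8")


# Example call (hypothetical file names):
# wrap_srt("book.srt", "book_wrapped.srt")
```

The timing-line check only matters when max_len is set below 29 characters, the length of a timing line, because the fallback would otherwise cut such a line at its middle. Each line is split at most once here, so a very long line can still exceed the limit after splitting.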

huangapple
  • Posted on February 26, 2023, 22:24:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75572640.html