February 26, 2023, 22:24:46

Split long sentences of a text file around the middle on comma (multiple commas)

Question

I have a .srt file that I'd like to split so I can watch it with mpv. It's a whole book turned into a .srt file for language learning, with an audiobook to go along with it.
My problem is that it's in Japanese, which doesn't put spaces between words, so mpv doesn't break long sentences; instead it shrinks them until they fit on one line.

I tried Subtitle Edit, but it doesn't work for Japanese.

So I'm trying to write my own script, although I don't know much about scripting.
I'm stuck on how to break a sentence that has multiple commas: how do I choose one near the middle?

Here's what I've got so far:

```python
with open("test.txt", encoding="utf8") as file:
    for line in file:
        # print(line)
        size = len(line)
        if size > 45:
            # break sentence in half, using Japanese comma 、
            pass  # placeholder: this is the part I'm stuck on
```

Here's the text file I'm using for testing:

```
10
00:00:55,640 --> 00:01:09,580
クラスで一番、明るくて、優しくて、運動神経がよくて、しかも、頭もよくて、みんなその子と友達になりたがる。
11
00:01:11,090 --> 00:01:24,500
だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。
12
00:01:24,730 --> 00:01:32,250
私に近づき、「こころちゃん、ひさしぶり！」
13
00:01:32,910 --> 00:01:35,180
と挨拶をする。
14
00:01:37,450 --> 00:01:41,730
周りの子がみんな息を吞む中、「前から知ってるの。
15
00:01:42,000 --> 00:01:42,820
ね？」
16
00:01:43,820 --> 00:01:46,550
と私に目配せをする。
```

Answer 1

Score: 0

My interpreter was acting up when I tried to open the file only once for both reading and writing, so my solution opens it twice. It reads every line into a list, goes through the list to find the lines longer than 45 characters, finds the comma nearest the middle of each one, and replaces the line with the two halves on either side of that comma. Once that is done, it writes the list back to the file.

```python
def findCommaNearMiddle(line):
    length = len(line)
    middle = length // 2
    # check positions on either side of the middle until a comma is found;
    # the loop bound keeps middle+distance inside the line
    distance = 0
    while distance < length - middle:
        if line[middle+distance] == '、':
            return middle+distance
        elif line[middle-distance] == '、':
            return middle-distance
        distance += 1
    return -1  # no comma found


with open("test.txt", "r", encoding="utf8") as file:
    fileLines = file.read().split('\n')

i = 0
while i < len(fileLines):  # while loop, because the list grows as lines are split
    line = fileLines[i]
    if len(line) > 45:
        middleComma = findCommaNearMiddle(line)
        # skip lines with no comma, or with a comma only at the very start or end
        if 0 < middleComma < len(line) - 1:
            fileLines[i] = line[:middleComma]
            fileLines.insert(i+1, line[middleComma+1:])  # +1 to get rid of comma
    i += 1

with open("test.txt", "w", encoding="utf8") as file:
    file.write('\n'.join(fileLines))
```

If you want to be able to split on characters other than '、', just add another condition to each of the two if statements, something like `or line[middle+distance] == '。'` (with `middle-distance` in the second one).
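
As a rough sketch of that extension, the search can take the split characters as a parameter instead of hard-coding '、'. The function name `find_break_near_middle` and the `break_chars` parameter are made up for illustration and are not part of the answer's code. Note that this sketch keeps the punctuation mark at the end of the first half rather than dropping it.

```python
def find_break_near_middle(line, break_chars="、。"):
    """Return the index of the break character nearest the middle, or -1."""
    middle = len(line) // 2
    best = -1
    for index, char in enumerate(line):
        if char in break_chars:
            # keep the candidate with the smallest distance to the middle;
            # on a tie the earlier one wins
            if best == -1 or abs(index - middle) < abs(best - middle):
                best = index
    return best


sample = "私に近づき、「こころちゃん、ひさしぶり！」"
index = find_break_near_middle(sample)
if index != -1:
    print(sample[:index + 1])  # first half, including the punctuation mark
    print(sample[index + 1:])  # second half
```

Scanning the whole line is slightly more work than searching outwards from the middle, but it avoids any index arithmetic at the ends of the line.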

Answer 2

Score: 0

You can locate the comma that is closest to the middle of the sentence and split the sentence at that comma.

```python
with open("test.txt", encoding="utf8") as file:
    for line in file:
        line = line.strip()
        size = len(line)
        if size > 45:
            # Find the comma closest to the middle of the line
            middle = size // 2
            before = line.rfind("、", 0, middle)  # last comma before the middle
            after = line.find("、", middle)  # first comma at or after the middle
            candidates = [i for i in (before, after) if i != -1]
            if not candidates:  # If there is no comma at all, split at the middle
                split_index = middle
            else:
                # Split after whichever comma is nearer to the middle
                split_index = min(candidates, key=lambda i: abs(i - middle)) + 1
            # Split the line at the split_index
            first_line = line[:split_index].strip()
            second_line = line[split_index:].strip()
            print(first_line)
            if second_line:  # empty when the only comma ends the line
                print(second_line)
        else:
            print(line)
```
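
The loop above splits each long line only once, so one of the halves may still be longer than the limit. A minimal sketch of one way to handle that is to keep splitting recursively at the comma nearest the middle until every piece fits. The name `split_line` and the `max_len` and `comma` parameters are assumptions for illustration (45 is the threshold from the question). Commas are kept at the end of each piece, and a line with no usable comma is returned unchanged.

```python
def split_line(line, max_len=45, comma="、"):
    """Split line at the comma nearest its middle until every piece fits."""
    if len(line) <= max_len:
        return [line]
    middle = len(line) // 2
    # candidate cut points sit just after each comma; a comma that ends
    # the line is skipped so that neither piece can be empty
    cuts = [i + 1 for i, ch in enumerate(line) if ch == comma and i + 1 < len(line)]
    if not cuts:
        return [line]  # nothing to split on; leave the line as it is
    cut = min(cuts, key=lambda c: abs(c - middle))
    return split_line(line[:cut], max_len, comma) + split_line(line[cut:], max_len, comma)


text = "だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。"
for piece in split_line(text, max_len=30):
    print(piece)
```

In a real .srt file this would be applied only to the subtitle text lines, not to the cue numbers or timestamps.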
