Split long sentences in a text file at a comma near the middle (multiple commas)

Question

I have a .srt file that I'd like to split so I can watch it with mpv. It's a whole book turned into .srt for language learning, with an audiobook to go along with it.
My problem is that it's in Japanese, which doesn't put spaces between words, so mpv doesn't break long sentences. Instead, it shrinks them to a very small size so that they fit on one line.

I tried Subtitle Edit, but it doesn't work for Japanese.

So I'm trying to write my own script, although I don't know much about scripting.
I'm stuck on how to break a sentence that has multiple commas: how would I choose the one nearest the middle?

Here's what I have so far:

    with open("test.txt", encoding="utf8") as file:
        for line in file:
            # print(line)
            size = len(line)
            if size > 45:
                # break sentence in half, using Japanese comma 、
                pass  # not implemented yet

Here's the text file I'm using for testing:

    10
    00:00:55,640 --> 00:01:09,580
    クラスで一番、明るくて、優しくて、運動神経がよくて、しかも、頭もよくて、みんなその子と友達になりたがる。
    11
    00:01:11,090 --> 00:01:24,500
    だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。
    12
    00:01:24,730 --> 00:01:32,250
    私に近づき、「こころちゃん、ひさしぶり!」
    13
    00:01:32,910 --> 00:01:35,180
    と挨拶をする。
    14
    00:01:37,450 --> 00:01:41,730
    周りの子がみんな息を吞む中、「前から知ってるの。
    15
    00:01:42,000 --> 00:01:42,820
    ね?」
    16
    00:01:43,820 --> 00:01:46,550
    と私に目配せをする。

Answer 1

Score: 0

My Python setup misbehaved when I tried to open the file only once, so my solution opens it twice and does the following: read every line into a list, go through the list and, for each line longer than 45 characters, find the comma nearest the middle and split the line there into two lines. Once done, write the lines back to the file.

    def findCommaNearMiddle(line):
        length = len(line)
        middle = length // 2
        # check positions on either side of the middle until a comma is found
        distance = 0
        while distance <= middle:
            # the bounds check avoids an IndexError past the end of the line
            if middle+distance < length and line[middle+distance] == '、':
                return middle+distance
            elif line[middle-distance] == '、':
                return middle-distance
            distance += 1
        return -1  # no comma in the line

    with open("test.txt", "r", encoding="utf8") as file:
        fileLines = file.read().split('\n')

    newLines = []
    for line in fileLines:
        middleComma = findCommaNearMiddle(line) if len(line) > 45 else -1
        if middleComma == -1:
            newLines.append(line)  # short enough, or no comma to split on
        else:
            newLines.append(line[:middleComma])
            newLines.append(line[middleComma+1:])  # +1 to get rid of the comma

    # the with statement closes the file, so no explicit close() is needed
    with open("test.txt", "w", encoding="utf8") as file:
        file.write('\n'.join(newLines))

If you want to split on characters other than '、', extend the two comparisons to accept them too, for example by testing line[middle+distance] in '、。' instead of line[middle+distance] == '、' (and likewise for middle-distance).
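
As a rough illustration of splitting on more than one punctuation mark, the search can also be written by listing every delimiter position and picking the one closest to the middle. This is only a sketch: the name nearest_split_index and its delimiters parameter are made up for this example and are not part of the answer's code. Unlike the code above, it keeps the punctuation mark at the end of the first half.

```python
def nearest_split_index(line, delimiters="、。"):
    """Return the index of the delimiter closest to the middle of line, or -1."""
    middle = len(line) // 2
    # Collect every position that holds one of the delimiters. The last
    # position is skipped because splitting there would leave nothing after it.
    positions = [
        i for i, ch in enumerate(line)
        if ch in delimiters and i < len(line) - 1
    ]
    if not positions:
        return -1
    # On a tie, min() keeps the first minimum, which is the earlier position.
    return min(positions, key=lambda i: abs(i - middle))


# Sample sentence taken from the question's test file.
sentence = (
    "だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、"
    "その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。"
)
index = nearest_split_index(sentence)
# Cut just after the delimiter so that it stays on the first half.
first, second = sentence[:index + 1], sentence[index + 1:]
print(first)
print(second)
```

Listing all positions costs a full pass over the line, which is negligible for subtitle-length text and avoids any index arithmetic at the ends of the string.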

Answer 2

Score: 0

You can locate the comma that is closest to the middle of the sentence and split the sentence at that comma.

    with open("test.txt", encoding="utf8") as file:
        for line in file:
            line = line.strip()  # drop the newline so it does not count toward the length
            size = len(line)
            if size > 45:
                # Find the comma closest to the middle of the line
                middle = size // 2
                before = line.rfind("、", 0, middle)  # last comma before the middle
                after = line.find("、", middle, size - 1)  # first comma from the middle on, ignoring the final character
                candidates = [i for i in (before, after) if i != -1]
                if candidates:
                    comma_index = min(candidates, key=lambda i: abs(i - middle))
                    split_index = comma_index + 1  # Split after the comma
                else:
                    split_index = middle  # No comma at all, so split at the middle
                # Split the line at the split_index
                print(line[:split_index])
                print(line[split_index:])
            else:
                print(line)
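
If one split still leaves a piece longer than the limit, the same idea can be applied again to each piece. The following is a minimal sketch under that assumption; split_to_limit and its limit and comma parameters are hypothetical names chosen for this example, not something from the answers above.

```python
def split_to_limit(text, limit=45, comma="、"):
    """Split text at the comma nearest its middle until each piece fits limit."""
    if len(text) <= limit:
        return [text]
    middle = len(text) // 2
    # Candidate commas, excluding one in the final position because
    # nothing would follow it.
    positions = [i for i, ch in enumerate(text[:-1]) if ch == comma]
    if not positions:
        return [text]  # no usable comma: leave the text as it is
    # Cut just after the chosen comma so it stays on the first piece.
    cut = min(positions, key=lambda i: abs(i - middle)) + 1
    # Both pieces are strictly shorter than text, so the recursion ends.
    return split_to_limit(text[:cut], limit, comma) + split_to_limit(text[cut:], limit, comma)


# Sample sentence taken from the question's test file.
sample = (
    "クラスで一番、明るくて、優しくて、運動神経がよくて、"
    "しかも、頭もよくて、みんなその子と友達になりたがる。"
)
for piece in split_to_limit(sample, limit=20):
    print(piece)
```

A piece with no comma is returned unchanged even if it is over the limit, so the result is a best effort rather than a guarantee that every line fits.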

Posted by huangapple on 2023-02-26 22:24:46
When reposting, please keep the link to this article: https://go.coder-hub.com/75572640.html