Python - Highlight differences between two strings in Jupyter notebook

Question

I have a list of two strings, and I want to highlight and print the differences between them (specifically in a Jupyter notebook). By differences, I mean the insertions, deletions and replacements needed to change one string into the other.

I found this question, which is similar but doesn't mention a way to present the changes.

Answer 1

Score: 2

I figured out an effective way to display such highlighting and want to share it with others.

The difflib module gives you the tools to effectively find the differences, specifically the SequenceMatcher class, while the IPython.display module helps you highlight the differences in a notebook setting.
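To make the role of SequenceMatcher more concrete, here is a minimal sketch of what get_opcodes() yields. Each opcode is a tuple (tag, i1, i2, j1, j2), where the tag is one of 'equal', 'delete', 'insert' or 'replace' and the indices are slices into the two strings. The helper name apply_opcodes is an assumption made for this illustration; it is not part of difflib.

```python
from difflib import SequenceMatcher

def apply_opcodes(a, b):
    """Rebuild b from a by following the opcodes, to show what each tag means."""
    result = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == 'equal':
            result.append(a[i1:i2])      # unchanged text: copy it from a
        elif tag in ('insert', 'replace'):
            result.append(b[j1:j2])      # new text comes from b
        # 'delete': a[i1:i2] is dropped, so nothing is appended
    return ''.join(result)

a, b = 'abcdefg', 'xac'
for opcode in SequenceMatcher(None, a, b).get_opcodes():
    print(opcode)                        # one (tag, i1, i2, j1, j2) tuple per block
print(apply_opcodes(a, b) == b)
```

Because the opcodes describe how to turn the first string into the second, replaying them always reproduces the second string, which is the property the highlighting relies on.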

Demonstration

First, let's assume the data is in the following format:

cases = [
    ('afrykanerskojęzyczny', 'afrykanerskojęzycznym'),
    ('afrykanerskojęzyczni', 'nieafrykanerskojęzyczni'),
    ('afrykanerskojęzycznym', 'afrykanerskojęzyczny'),
    ('nieafrykanerskojęzyczni', 'afrykanerskojęzyczni'),
    ('nieafrynerskojęzyczni', 'afrykanerskojzyczni'),
    ('abcdefg', 'xac')
]

You can create a function that gives you the HTML string which highlights the insertions, deletions and replacements, using the following code:

from difflib import SequenceMatcher

# highlight colors
# you may change these values according to your preferences
color_delete = '#811612'  # highlight color for deletions
color_insert = '#28862D'  # highlight color for insertions
color_replace = '#BABA26' # highlight color for replacements

# the common format string used for highlighted segments
f_str = '<span style="background: {};">{}</span>'

# given two strings (a, b), getFormattedDiff returns the HTML formatted strings (formatted_a, formatted_b)
def getFormattedDiff(a, b):
    # initialize the sequence matcher
    s = SequenceMatcher(None, a, b)

    # string builders for the formatted strings
    formatted_a = []
    formatted_b = []

    # iterate through all char blocks
    for tag, i1, i2, j1, j2 in s.get_opcodes():
        if tag == 'equal':
            # if the blocks are the same, append the block to both strings without any formatting
            formatted_a.append(a[i1:i2])
            formatted_b.append(b[j1:j2])
        elif tag == 'delete':
            # if this is a deletion block, append the block to the first string with the delete highlight
            formatted_a.append(f_str.format(color_delete, a[i1:i2]))
        elif tag == 'insert':
            # if this is an insertion block, append the block to the second string with the insert highlight
            formatted_b.append(f_str.format(color_insert, b[j1:j2]))
        elif tag == 'replace':
            # if this is a replacement block, append the block to both strings with the replace highlight
            formatted_a.append(f_str.format(color_replace, a[i1:i2]))
            formatted_b.append(f_str.format(color_replace, b[j1:j2]))

    # return the formatted strings
    return ''.join(formatted_a), ''.join(formatted_b)
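One caveat: getFormattedDiff inserts the string segments into HTML as-is, so inputs containing characters such as <, > or & may not render as intended. Below is a minimal sketch of a variant that escapes each segment with the standard library's html.escape. The names get_escaped_diff, COLORS and SPAN are assumptions introduced for this example; the colors and span template are copied from the code above.

```python
from difflib import SequenceMatcher
from html import escape

# Same highlight colors and span template as in the answer above.
COLORS = {'delete': '#811612', 'insert': '#28862D', 'replace': '#BABA26'}
SPAN = '<span style="background: {};">{}</span>'

def get_escaped_diff(a, b):
    """Hypothetical variant of getFormattedDiff that HTML-escapes every segment."""
    out_a, out_b = [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        seg_a, seg_b = escape(a[i1:i2]), escape(b[j1:j2])
        if tag == 'equal':
            out_a.append(seg_a)
            out_b.append(seg_b)
        else:
            # 'delete' has an empty b segment and 'insert' an empty a segment,
            # so only non-empty segments get a highlighted span.
            if seg_a:
                out_a.append(SPAN.format(COLORS[tag], seg_a))
            if seg_b:
                out_b.append(SPAN.format(COLORS[tag], seg_b))
    return ''.join(out_a), ''.join(out_b)

formatted_a, formatted_b = get_escaped_diff('a<b', 'a<c')
print(formatted_a)
print(formatted_b)
```

For strings made only of letters, as in the cases list, this should produce the same markup as the original function; the difference only shows up when the input contains HTML-special characters.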

Now we run the function defined above in a loop over all the pairs in cases:

from IPython.display import HTML, display

# iterate through all the cases and display both strings with the highlights
for a, b in cases:
    formatted_a, formatted_b = getFormattedDiff(a, b)
    display(HTML(formatted_a))
    display(HTML(formatted_b))
    print()
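As a possible refinement of the loop above, both strings of a pair can be rendered in a single monospaced block, so matching characters line up vertically and no extra print() is needed for spacing. This is only a sketch: pair_to_html is a hypothetical helper name, and the margin value is an arbitrary choice.

```python
def pair_to_html(formatted_a, formatted_b):
    """Hypothetical helper: wrap two formatted strings in one monospaced block."""
    return (
        '<pre style="margin: 0 0 1em 0;">'
        f'{formatted_a}\n{formatted_b}'
        '</pre>'
    )

# Example with hand-written markup in the same style as getFormattedDiff's output.
html_block = pair_to_html('abc', 'ab<span style="background: #28862D;">x</span>c')
print(html_block)
```

In a notebook, display(HTML(pair_to_html(*getFormattedDiff(a, b)))) could then replace the two display calls and the print() inside the loop.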

and we get each pair displayed on two lines, with the deletions, insertions and replacements highlighted in their respective colors.

huangapple
  • Posted on June 22, 2023, 06:45:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76527591.html