在字符串中检查唯一元素与将字符串转换为集合相比。

huangapple go评论54阅读模式
英文:

Checking for unique elements in a string vs. converting string to a set

问题

code1 =

string = "这是一个示例字符串"
char_list = {}
for char in string:
    if char in char_list:
        char_list[char] += 1
    else:
        char_list[char] = 1

max_frequency = max(char_list.values())
chars_with_max_frequency = [key for key, value in char_list.items() if value == max_frequency]

code2 =

string = "这是一个示例字符串"
char_list = {}
for char in set(string):
    char_list[char] = string.count(char)

max_frequency = max(char_list.values())
chars_with_max_frequency = [key for key, value in char_list.items() if value == max_frequency]

print("方法 #1:检查字符串中的唯一字符:%.4f" % timeit.timeit(code1,number=1000000))

print("方法 #2:将字符串转换为集合:%.4f" % timeit.timeit(code2,number=1000000))


方法 #1:检查字符串中的唯一字符:2.6955

方法 #2:将字符串转换为集合:3.4223

英文:

I've seen the below "code1" used for checking for most frequently used characters in a string where each character is first checked for uniqueness rather than simply converting the string to a set (as in "code2"). Is the "code1" method preferred due to the slight performance improvement? And why is converting a string to a set more expensive than checking each element of the string if it already exists in a dictionary?

code1=

string = "This is an example string"
char_list = {}
for char in string:
    if char in char_list:
        char_list[char] += 1
    else:
        char_list[char] = 1

max_frequency = max(char_list.values())
chars_with_max_frequency = [key for key, value in char_list.items() if value == max_frequency]

code2=

string = "This is an example string"
char_list = {}
for char in set(string):
    char_list[char] = string.count(char)

max_frequency = max(char_list.values())
chars_with_max_frequency = [key for key, value in char_list.items() if value == max_frequency]

print("Method #1: Checking string for unique chars: %.4f" % timeit.timeit(code1,number=1000000))

print("Method #2: Converting string to set: %.4f" % timeit.timeit(code2,number=1000000))


Method #1: Checking string for unique chars: 2.6955

Method #2: Converting string to set: 3.4223

答案1

得分: 4

在第一种方法(code1)中,您正在遍历字符串中的每个字符,并在字典中为该字符递增一个计数器。这个操作是O(n),因为每个字符只被访问一次,其中n是字符串的长度。

在第二种方法(code2)中,在将字符串转换为集合之后(这会消除重复项,并且是一个O(n)的操作),您调用了string.count(char)方法。这个方法本身是一个O(n)的操作,因为它需要遍历整个字符串来计算字符的出现次数。由于这是针对每个唯一字符都执行的操作,总操作变为O(n^2)。在最坏的情况下,这可能与字符串本身的长度一样大。

因此,总之,将字符串转换为集合在这里不是昂贵的操作。相反,第二种方法中昂贵的操作是重复使用string.count()函数。这就是为什么第一种方法比第二种方法更快的原因。

英文:

In the first method (code1), you are iterating through each character in the string and incrementing a counter for that character in a dictionary. The operation is O(n) because each character is being visited exactly once, where n is the length of the string.

In the second method (code2), after converting the string to a set (which eliminates duplicates and is an O(n) operation), you're invoking the string.count(char) method. This method itself is an O(n) operation because it needs to traverse the entire string to count occurrences of the character. As this is done for each unique character, the total operation becomes O(n^2). In the worst-case scenario, this could be as large as the length of the string itself.

So, to conclude, the conversion of the string to a set is not the expensive operation here. Instead, the expensive operation in the second method is the repeated use of the string.count() function. This is why the first method is faster than the second one.

答案2

得分: 1

In code 1, you are just checking the uniqueness of the elements and then incrementing a count. This requires traversing through the string only once, which has a time complexity of O(n), where n is the number of elements in the string.

But in code 2, you have to first convert the string to a set and then count the number of elements. This will have a time complexity of O(n) * O(n) since we need to traverse across the string once to convert it and then again traverse through it to count. So it will give a time complexity of O(n^2).

时间复杂度为O(n)远好于O(n^2),所以第一种方法更快。

英文:

in code 1 you are just checking the uniqueness of the elements and then incrementing a count. This require traversing through the string only once which has a time complexity of O(n) where n is the number of elements in the string.

but in code 2 you have to have to first convert string to a set and then count the number of elements. This will have a time complexity of O(n) * O(n) since we need to travel across the string once to convert it and then again traverse through it to count so it will give a time complexity of O(n^2).

Time complexity of O(n) is much better than O(n^2) so the first method is faster.

答案3

得分: 1

你关于你的两种技术的问题已经得到了回答,所以我会添加一个第三个选项利用stdlib来计算字符串中的字符数:

from collections import Counter

counts = Counter('AABBCCDDAAADD')
sorted_keys = sorted(counts.items(), 
                     key=lambda x:x[1], 
                     reverse=True)

print(sorted_keys) 
# 打印结果 [ ('A', 5), ('D', 4), ('B', 2), ('C', 2) ]

这将涉及两个操作:

  • 在你的初始字符串上的O(n),其中n是字符串的长度
  • 和O(nlogn),其中n是字符串中不同字符的计数,这应该是相对较小的(除非你处理一些非常多样化的UTF-8数据)
英文:

Your questions about both of your techniques have already been answered, so I'll add a third option leveraging the stdlib to count characters in a string :


from collections import Counter

counts = Counter('AABBCCDDAAADD')
sorted_keys = sorted(counts.items(), 
                     key=lambda x:x[1], 
                     reverse=True)

print(sorted_keys) 
# prints [('A', 5), ('D', 4), ('B', 2), ('C', 2)]

This will be two operations :

  • O(n) on your inital strings, where n=strlen

  • and O(nlogn) where n is the count of distinct characters in your string, which should be reasonnably small (unless you're working on some very diverse UTF-8 data)

huangapple
  • 本文由 发表于 2023年6月8日 22:28:16
  • 转载请务必保留本文链接:https://go.coder-hub.com/76432895.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定