June 27, 2023, 17:30:34

Emoji count and analysis using python pandas

Question

I am working on a sentiment analysis topic and there are a lot of comments with emojis.

I would like to know whether my code is correct and whether it can be optimized.

Code to do smiley count

import pandas as pd
import regex as re
import emoji
# Assuming your DataFrame is called 'df' and the column with comments is 'Document'
comments = df['Document']
# Initialize an empty dictionary to store smiley counts and types
smiley_data = {'Smiley': [], 'Count': [], 'Type': []}
# Define a regular expression pattern to match smileys
pattern = r'([\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF])'
# Iterate over the comments
for comment in comments:
    # Extract smileys and their types from the comment
    smileys = re.findall(pattern, comment)

    # Increment the count and store the smileys and their types
    for smiley in smileys:
        if smiley in smiley_data['Smiley']:
            index = smiley_data['Smiley'].index(smiley)
            smiley_data['Count'][index] += 1
        else:
            smiley_data['Smiley'].append(smiley)
            smiley_data['Count'].append(1)
            smiley_data['Type'].append(emoji.demojize(smiley))

# Create a DataFrame from the smiley data
smiley_df = pd.DataFrame(smiley_data)
# Sort the DataFrame by count in descending order
smiley_df = smiley_df.sort_values(by='Count', ascending=False)
# Print the smiley data
smiley_df

I am mainly unsure whether the code block below captures all the smileys:

# Define a regular expression pattern to match smileys
pattern = r'([\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF])'

I would also like to know what else I can do with this analysis, for example some charts.

I am also sharing a test dataset that generates smiley counts similar to those in my real data. Note that the test dataset contains only known smileys; anything else that may occur in the real dataset will not be there.

Test Dataset

import random
import pandas as pd
smileys = ['👍', '👌', '😍', '🏻', '😊', '🙂', '👎', '😃', '🏼', '💩']
# Additional smileys to complete the required count
additional_smileys = ['😄', '😎', '🤩', '😘', '🤗', '😆', '😉', '😋', '😇', '🥳', '🙌', '🎉', '🔥', '🥰', '🤪', '😜', '🤓',
                      '😚', '🤭', '🤫', '😌', '🥱', '🥶', '🤮', '🤡', '😑', '😴', '🙄', '😮', '🤥', '😢', '🤐', '🙈', '🙊',
                      '👽', '🤖', '🦄', '🐼', '🐵', '🦁', '🐸', '🦉']
# Combine the required smileys and additional smileys
all_smileys = smileys + additional_smileys
# Set a random seed for reproducibility
random.seed(42)
# Generate a single review
def generate_review(with_smiley=False):
    review = "This movie"
    if with_smiley:
        review += " " + random.choice(all_smileys)
    review += " is "
    review += random.choice(["amazing", "excellent", "fantastic", "brilliant", "great", "good", "okay", "average",
                             "mediocre", "disappointing", "terrible", "awful", "horrible"])
    review += random.choice(["!", "!!", "!!!", ".", "..", "..."]) + " "
    review += random.choice(["Highly recommended", "Definitely worth watching", "A must-see", "I loved it",
                             "Not worth your time", "Skip it"]) + random.choice(["!", "!!", "!!!"])
    return review
# Generate the random dataset
def generate_dataset():
    dataset = []
    review_count = 5000
    # Generate reviews with top smileys
    for smiley, count, _ in top_smileys:
        while count > 0:
            review = generate_review(with_smiley=True)
            if smiley in review:
                dataset.append(review)
                count -= 1
    # Generate reviews with additional smileys
    additional_smileys_count = len(additional_smileys)
    additional_smileys_per_review = review_count - len(dataset)
    additional_smileys_per_review = min(additional_smileys_per_review, additional_smileys_count)
    for _ in range(additional_smileys_per_review):
        review = generate_review(with_smiley=True)
        dataset.append(review)
    # Generate reviews without smileys
    while len(dataset) < review_count:
        review = generate_review()
        dataset.append(review)
    # Shuffle the dataset
    random.shuffle(dataset)
    return dataset
# List of top smileys and their counts
top_smileys = [
    ('👍', 331, ':thumbs_up:'),
    ('👌', 50, ':OK_hand:'),
    ('😍', 41, ':smiling_face_with_heart-eyes:'),
    ('🏻', 38, ':light_skin_tone:'),
    ('😊', 35, ':smiling_face_with_smiling_eyes:'),
    ('🙂', 14, ':slightly_smiling_face:'),
    ('👎', 12, ':thumbs_down:'),
    ('😃', 12, ':grinning_face_with_big_eyes:'),
    ('🏼', 10, ':medium-light_skin_tone:'),
    ('💩', 10, ':pile_of_poo:')
]
# Generate the dataset
dataset = generate_dataset()
# Create a data frame with 'Document' column
df = pd.DataFrame({'Document': dataset})
# Display the DataFrame
df

Thank you in advance!

Answer 1

Score: 3

Update

If you prefer to use the emoji package, you can do:

import emoji
text = df['Document'].str.cat(sep='\n')
out = (pd.DataFrame(emoji.emoji_list(text)).value_counts('emoji')
         .rename_axis('Smiley').rename('Count').reset_index()
         .assign(Type=lambda x: x['Smiley'].apply(emoji.demojize)))

Output:

>>> out
   Smiley  Count                              Type
0       👍    331                       :thumbs_up:
1       👌     50                         :OK_hand:
2       🏻     41                 :light_skin_tone:
3       😍     41    :smiling_face_with_heart-eyes:
4       😊     35  :smiling_face_with_smiling_eyes:
5       🙂     15           :slightly_smiling_face:
6       👎     14                     :thumbs_down:
7       😃     13     :grinning_face_with_big_eyes:
8       🏼     10          :medium-light_skin_tone:
9       💩     10                     :pile_of_poo:
10      😜      3        :winking_face_with_tongue:
11      🦉      3                             :owl:
12      🤖      2                           :robot:
13      😑      2             :expressionless_face:
14      👽      2                           :alien:
15      🤫      2                   :shushing_face:
16      😢      2                     :crying_face:
17      🤪      2                       :zany_face:
18      🙈      2              :see-no-evil_monkey:
19      🙊      2            :speak-no-evil_monkey:
20      😇      1          :smiling_face_with_halo:
21      🤮      1                   :face_vomiting:
22      🤭      1       :face_with_hand_over_mouth:
23      🤡      1                      :clown_face:
24      🤗      1    :smiling_face_with_open_hands:
25      🙄      1          :face_with_rolling_eyes:
26      😆      1         :grinning_squinting_face:
27      🐸      1                            :frog:
28      😮      1            :face_with_open_mouth:
29      🐼      1                           :panda:
30      😚      1   :kissing_face_with_closed_eyes:
31      😎      1    :smiling_face_with_sunglasses:
32      😘      1             :face_blowing_a_kiss:

You can use str.extractall to avoid a loop, then use value_counts to count the number of occurrences. Finally, "demojize" each smiley (the slowest part):

out = (df['Document'].str.extractall(pattern).value_counts()
                     .rename_axis('Smiley').rename('Count').reset_index()
                     .assign(Type=lambda x: x['Smiley'].apply(emoji.demojize)))

Output:

>>> out
   Smiley  Count                              Type
0       👍    331                       :thumbs_up:
1       👌     50                         :OK_hand:
2       🏻     41                 :light_skin_tone:
3       😍     41    :smiling_face_with_heart-eyes:
4       😊     35  :smiling_face_with_smiling_eyes:
5       🙂     15           :slightly_smiling_face:
6       👎     14                     :thumbs_down:
7       😃     13     :grinning_face_with_big_eyes:
8       💩     10                     :pile_of_poo:
9       🏼     10          :medium-light_skin_tone:
10      😜      3        :winking_face_with_tongue:
11      😑      2             :expressionless_face:
12      🙈      2              :see-no-evil_monkey:
13      😢      2                     :crying_face:
14      🙊      2            :speak-no-evil_monkey:
15      👽      2                           :alien:
16      😎      1    :smiling_face_with_sunglasses:
17      😘      1             :face_blowing_a_kiss:
18      😚      1   :kissing_face_with_closed_eyes:
19      🐸      1                            :frog:
20      😇      1          :smiling_face_with_halo:
21      😮      1            :face_with_open_mouth:
22      😆      1         :grinning_squinting_face:
23      🙄      1          :face_with_rolling_eyes:
24      🐼      1                           :panda:
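
The extract-then-count idea can also be sketched without pandas. The following is a minimal illustration using only the standard library; the sample comments are made up for the example, and the character class is the one from the question, so it has the same coverage gaps.

```python
import re
from collections import Counter

# Character class copied from the question (it does not cover every emoji block).
pattern = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]"
)

# Hypothetical sample comments, standing in for df['Document'].
comments = [
    "This movie 👍 is great! I loved it!",
    "This movie 👍 is good. A must-see!!",
    "This movie 💩 is awful... Skip it!",
    "This movie is okay. Skip it!",
]

# Counter does the bookkeeping that the original loop did with list lookups.
counts = Counter(ch for text in comments for ch in pattern.findall(text))

# Print the smileys from most to least frequent.
for smiley, count in counts.most_common():
    print(smiley, count)
```

A Counter lookup is a dictionary access, so this avoids the repeated list searches (`in` and `.index`) of the original loop.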

> The pattern part is correct? I am not missing out on any emoticons?

Your pattern is incomplete. I don't know the full list you want to extract, but the code below helps you debug it:

#     add latin1 codes --v
pattern2 = '([\\U00000000-\\U000000FF\\U0001F600-\\U0001F64F\\U0001F300-\\U0001F5FF\\U0001F680-\\U0001F6FF\\U0001F1E0-\\U0001F1FF])'
other = df['Document'].str.replace(pattern2, '', regex=True)
print(other[other != ''])
# Output / Missed emojis
1149    🤗
1238    🦉
1305    🤫
1424    🤫
1978    🤭
2611    🤮
2623    🦉
2959    🤡
3717    🤪
4045    🦉
4067    🤖
4699    🤖
4975    🤪
Name: Document, dtype: object
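
The missed characters listed above all have code points between U+1F900 and U+1F9FF (for example, 🤗 is U+1F917), a range the original pattern does not include. As a hedged sketch, adding that range to the character class picks them up. This covers only one more block and is not a complete definition of "emoji"; the sample string is made up for illustration.

```python
import re

# Character class from the question.
original = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]"
)

# Same class plus U+1F900-U+1F9FF (an assumption about what else to include).
extended = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF"
    "\U0001F900-\U0001F9FF]"
)

# Hypothetical comment containing two of the missed emojis and one covered emoji.
sample = "This movie 🤗 is great! 🦉 A must-see 👍"

found_original = original.findall(sample)
found_extended = extended.findall(sample)

print(found_original)
print(found_extended)
```

Hand-maintained ranges like these tend to fall behind as new emojis are added, which is one reason to prefer the emoji package for the matching step.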

Answer 2

Score: 2

Thanks to @corralien and @cuzi, I was able to get my final result using the code below. It doesn't use a pattern; instead it uses the emoji.analyze(text, join_emoji=True) function:

import emoji
out = (df['Document'].apply(lambda text: [token.chars for token in emoji.analyze(text, join_emoji=True)
                       if isinstance(token.value, emoji.EmojiMatch)]).explode().value_counts()
                      .rename_axis('Smiley').rename('Count').reset_index())
out

> Output

index	Smiley	Count
0	👍	331
1	👌	50
2	😍	41
3	🏻	41
4	😊	35
5	🙂	15
6	👎	14
7	😃	13
8	🏼	10
9	💩	10
10	😜	3
11	🦉	3
12	🤖	2
13	🤪	2
14	😢	2
15	🙈	2
16	😑	2
17	🤫	2
18	🙊	2
19	👽	2
20	🤭	1
21	🤗	1
22	😇	1
23	🐸	1
24	🤮	1
25	🤡	1
26	😚	1
27	😎	1
28	😘	1
29	🐼	1
30	😆	1
31	😮	1
32	🙄	1
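
For the charting part of the question, a quick first look does not need a plotting library. The sketch below prints a plain-text bar chart for a few of the most frequent smileys. The counts are copied from the output above; the helper name text_bar_chart and the bar width are arbitrary choices made for this illustration, and the helper assumes a non-empty dictionary.

```python
def text_bar_chart(counts, width=40):
    """Return one line per item, with bars scaled to the largest count."""
    largest = max(counts.values())
    lines = []
    # Sort from most to least frequent before drawing.
    for label, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        # Every item gets at least one mark so small counts stay visible.
        bar = "#" * max(1, round(count / largest * width))
        lines.append(f"{label} {bar} {count}")
    return lines

# A few counts taken from the output table.
top_counts = {"👍": 331, "👌": 50, "😍": 41, "😊": 35, "🙂": 15}

chart = text_bar_chart(top_counts)
print("\n".join(chart))
```

For an actual figure, the same Smiley and Count columns could be handed to a plotting library of your choice.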
