Merge two DataFrames based on containing string without iterator

Question

I have two CSV files imported as DataFrames A and C. I want to match each string in A's content column with the entry in C's data.data column that contains that string.

A  time_a content    C  time_c data.data
   100    f00           400    otherf00other
   101    ba7           402    onlyrandom
   102    4242          407    otherba7other
                        409    other4242other

Should become:
time_a time_c content
100    400    f00
101    407    ba7
102    409    4242
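For reference, a minimal construction of the two sample frames shown above (variable names A and C follow the question text):

import pandas as pd

# Sample input: A holds timestamped content, C holds the network capture.
A = pd.DataFrame({
    'time_a': [100, 101, 102],
    'content': ['f00', 'ba7', '4242'],
})
C = pd.DataFrame({
    'time_c': [400, 402, 407, 409],
    'data.data': ['otherf00other', 'onlyrandom', 'otherba7other', 'other4242other'],
})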

My solution below uses iterators, but it is too slow. This answer explains why and shows how to improve, but I struggle to implement any of the suggested methods.

How can I do this with the optimized methods from pandas?

import pandas as pd

# reset_index() was called on both DataFrames beforehand.
df_alsa_copy = df_alsa.copy()  # never modify the DataFrame you are iterating over
df_alsa_copy['cap_fno'] = -1

# O(len(A) * len(C)) nested scan: for every row of A, search all of C and
# record the frame.number of any data.data entry that contains the content
# string (if several match, the last one wins).
for aIndex, aRow in df_alsa.iterrows():
    for cIndex, cRow in df_c.iterrows():
        if str(aRow['content']) in str(cRow['data.data']):
            df_alsa_copy.loc[aIndex, 'cap_fno'] = df_c.loc[cIndex, 'frame.number']

# https://stackoverflow.com/questions/31528819/using-merge-on-a-column-and-index-in-pandas
# Merge on the frame.number column (it was kept as a column in df_alsa_copy).
df_ltnc = pd.merge(df_alsa_copy, df_c, left_on='cap_fno', right_on='frame.number')

Also tried:

  • If there were exact matches, this would work: https://stackoverflow.com/questions/44080248/pandas-join-dataframe-with-condition.
  • I also managed to match my second DataFrame against a known string using series.str.contains (see the sketch after this list).
  • The problem: I cannot pass a DataFrame column to match against in merge's on=; I can only pass a known string.
  • The same problem occurs when I use apply.
  • I had no success with isin or similar methods.
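For illustration, a minimal sketch of the partial attempt described above (it only shows the limitation; it is not a solution):

# Works: str.contains can filter C against one known, hard-coded string.
hits = C[C['data.data'].str.contains('f00')]

# Does not work: merge has no notion of a "contains" condition, so there is
# no way to hand it a whole column of substrings to match against.
# pd.merge(A, C, left_on='content', right_on='data.data')  # exact match only, finds nothing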

More info:

A holds timestamped content that I fed into the program; C is a network capture. I want to know the time between feeding and capture (a one-line example follows the list below). I assume:

  • The strings occur in the same order in A and C.

  • But C may contain unrelated rows in between.

  • The strings represent hex values.

  • data.data contains other characters besides the string I am looking for.

  • Maybe I just lack the pandas vocabulary to search for the correct method.
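As a side note on that goal: once one of the merges below produces the combined frame (called out in both answers), the feed-to-capture delay is just a column difference. A minimal sketch (the latency column name is an assumption):

# Given the merged result `out` with time_a and time_c side by side,
# the feed-to-capture delay per row is simply:
out['latency'] = out['time_c'] - out['time_a']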

Answer 1

Score: 2

Try this approach using pandas.unique(), pandas.Series.str.contains, and pandas.DataFrame.merge:

import pandas as pd

# Build one regex alternation from all unique content values, e.g. 'f00|ba7|4242'.
unique_str = A['content'].unique()

# Keep only the rows of C whose data.data contains one of those values.
matching_rows = C[C['data.data'].str.contains('|'.join(unique_str))]

# Extract the matched substring as the join key, then merge with A on content.
out = pd.merge(matching_rows, A,
               left_on=matching_rows['data.data'].str.extract(f'({"|".join(unique_str)})')[0],
               right_on='content')[['time_a', 'time_c', 'content']]
print(out)

   time_a  time_c content
0     100     400     f00
1     101     407     ba7
2     102     409    4242
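One caveat (an editor's addition, not part of the original answer): str.contains and str.extract interpret the joined values as a regular expression. The hex strings here are harmless, but if content could ever contain regex metacharacters, escape the values first:

import re

import pandas as pd

# Hardening step (assumption: added here, not in the original answer):
# escape each value so metacharacters are matched literally.
pattern = '|'.join(map(re.escape, A['content'].unique()))
matching_rows = C[C['data.data'].str.contains(pattern)]
key = matching_rows['data.data'].str.extract(f'({pattern})')[0]
out = pd.merge(matching_rows, A, left_on=key,
               right_on='content')[['time_a', 'time_c', 'content']]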

Answer 2

Score: 1

If you want to improve speed, a different option to consider is Polars (https://www.pola.rs/). You can install it with pip install polars.

The solution is the same as the one @Jamiu proposes; I think his approach is the right one. The only difference is using Polars instead of pandas.

I tested the two solutions after multiplying the number of rows by 1000: the pandas solution takes 400 ms, while the Polars one takes 92 ms.

import polars as pl

# Convert the pandas DataFrames to Polars.
a, c = pl.from_pandas(A), pl.from_pandas(C)

# Build one regex alternation from the unique content values, e.g. '(f00|ba7|4242)'.
unique_values = f"({a['content'].unique().str.concat('|').item()})"

# Filter C to the matching rows, extract the matched substring, and join on it.
out = a.join(
    c.filter(pl.col('data.data').str.contains(unique_values)),
    left_on='content',
    right_on=pl.col('data.data').str.extract(unique_values),
)

# Convert back to pandas if needed.
out_pandas = out.to_pandas()
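A self-contained run of the above on the question's sample data (a sketch under the question's assumptions; note that newer Polars releases rename Series.str.concat to str.join):

import pandas as pd
import polars as pl

# Sample data as in the question.
A = pd.DataFrame({'time_a': [100, 101, 102],
                  'content': ['f00', 'ba7', '4242']})
C = pd.DataFrame({'time_c': [400, 402, 407, 409],
                  'data.data': ['otherf00other', 'onlyrandom',
                                'otherba7other', 'other4242other']})

a, c = pl.from_pandas(A), pl.from_pandas(C)
unique_values = f"({a['content'].unique().str.concat('|').item()})"
out = a.join(
    c.filter(pl.col('data.data').str.contains(unique_values)),
    left_on='content',
    right_on=pl.col('data.data').str.extract(unique_values),
)
print(out.select(['time_a', 'time_c', 'content']))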
