Merge two DataFrames based on containing string without iterator
Question
I have two CSV files imported as dataframes `A` and `C`. I want to match each string in column `content` of `A` with the entry in `data.data` of `C` that contains that string.
A:
 time_a  content
 100     f00
 101     ba7
 102     4242

C:
 time_c  data.data
 400     otherf00other
 402     onlyrandom
 407     otherba7other
 409     other4242other
Should become:
 time_a  time_c  content
 100     400     f00
 101     407     ba7
 102     409     4242
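For reference, the example frames can be rebuilt like this (a minimal sketch of my own, not from the original post; treating `content` as strings is an assumption, since the values are hex and would otherwise be parsed as numbers):

```python
import pandas as pd

A = pd.DataFrame({'time_a': [100, 101, 102],
                  'content': ['f00', 'ba7', '4242']})
C = pd.DataFrame({'time_c': [400, 402, 407, 409],
                  'data.data': ['otherf00other', 'onlyrandom',
                                'otherba7other', 'other4242other']})
```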
My solution below uses iterators, but it is too slow. This answer explains why and suggests ways to improve it, but I am struggling to implement any of them.
How can I do this with the optimized methods from pandas?
```python
import pandas as pd

# reset_index() was called on both dataframes beforehand
df_alsa_copy = df_alsa.copy()  # never modify the frame you are iterating over
df_alsa_copy['cap_fno'] = -1
for aIndex, aRow in df_alsa.iterrows():
    for cIndex, cRow in df_c.iterrows():
        if str(aRow['content']) in str(cRow['data.data']):
            df_alsa_copy.loc[aIndex, 'cap_fno'] = df_c.loc[cIndex, 'frame.number']

# https://stackoverflow.com/questions/31528819/using-merge-on-a-column-and-index-in-pandas
# Merge on the frame.number column (because I chose to include it in df_alsa_copy)
df_ltnc = pd.merge(df_alsa_copy, df_c, left_on='cap_fno', right_on='frame.number')
```
Also tried:
- This would work if there were an exact match: https://stackoverflow.com/questions/44080248/pandas-join-dataframe-with-condition.
- I also managed to match my second frame against a known string with `series.str.contains` (see the sketch after this list).
- The problem is that I cannot pass a dataframe column as the value to match in `merge(on=...)`; I can only pass a known string.
- The same problem arose when I used `apply`.
- I did not succeed with `isin` or similar.
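For illustration, here is a minimal sketch of the gap described above (my own example, using the frames built earlier): matching against one known string works, but an equality-based merge cannot express "contains":

```python
# works: a boolean mask for ONE known string
mask = C['data.data'].str.contains('f00')

# does not work for this task: merge() matches on equality only,
# so 'f00' never equals 'otherf00other' and the result is empty
empty = pd.merge(A, C, left_on='content', right_on='data.data')
```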
More info: `A` holds timestamped content I fed into the program. `C` is a network capture. I want to know the time between feeding and capture.

I assume:
- The strings occur in the same order in `A` and `C`.
- But in `C` there may be other lines in between.
- The strings represent hex values.
- `data.data` contains other characters besides the string I am looking for.

Maybe I lack the pandas vocabulary to look for the correct method.
Answer 1
Score: 2
Try this approach using `pandas.unique()`, `pandas.Series.str.contains`, and `pandas.DataFrame.merge`:
```python
# collect the unique search strings and build an alternation pattern
unique_str = A['content'].unique()

# keep only the rows of C whose data.data contains one of the strings
matching_rows = C[C['data.data'].str.contains('|'.join(unique_str))]

# extract the matching substring and use it as the join key against content
out = pd.merge(matching_rows, A,
               left_on=matching_rows['data.data']
                   .str.extract(f'({"|".join(unique_str)})')[0],
               right_on='content')[['time_a', 'time_c', 'content']]
print(out)
```
   time_a  time_c content
0     100     400     f00
1     101     407     ba7
2     102     409    4242
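One caveat worth adding (my note, not part of the original answer): `str.contains` and `str.extract` treat the joined values as a regular expression, so if `content` could ever hold regex metacharacters, escaping them first is safer:

```python
import re

# escape each value so it is matched literally, not as a regex
pattern = '|'.join(map(re.escape, A['content'].astype(str).unique()))
matching_rows = C[C['data.data'].str.contains(pattern)]
key = matching_rows['data.data'].str.extract(f'({pattern})')[0]
out = pd.merge(matching_rows, A, left_on=key,
               right_on='content')[['time_a', 'time_c', 'content']]
```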
Answer 2
Score: 1
If you want to improve speed, a different option you might consider is Polars (https://www.pola.rs/). You can install it with `pip install polars`.
The solution is the same as what @Jamiu proposes, and I think his approach is the right one; the only difference is Polars instead of pandas.
I tested the two solutions after multiplying the number of rows by 1000: the pandas solution takes 400 ms, while the Polars one takes 92 ms.
```python
import polars as pl

# convert the pandas data to Polars dataframes
a, c = pl.from_pandas(A), pl.from_pandas(C)

# build an alternation pattern from the unique content values
unique_values = f"({a['content'].unique().str.concat('|').item()})"

# keep the rows of c containing one of the values, extract the
# matching substring, and join it against content
out = (
    a.join(c.filter(pl.col('data.data').str.contains(unique_values)),
           left_on='content',
           right_on=pl.col('data.data').str.extract(unique_values))
)

# convert back to pandas if needed
out_pandas = out.to_pandas()
```
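Since the original goal was the time between feeding and capture, the delta can be added directly in Polars before converting back (my addition; the column name `latency` is hypothetical):

```python
# time between feeding (time_a) and capture (time_c)
out = out.with_columns((pl.col('time_c') - pl.col('time_a')).alias('latency'))
```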