Merge two DataFrames based on containing string without iterator
Question
I have two CSV files imported as dataframes `A` and `C`. I want to match each string in column `content` of `A` with the entry in `data.data` of `C` that contains that string.
A:
 time_a  content
 100     f00
 101     ba7
 102     4242

C:
 time_c  data.data
 400     otherf00other
 402     onlyrandom
 407     otherba7other
 409     other4242other
Should become:
 time_a  time_c  content
 100     400     f00
 101     407     ba7
 102     409     4242
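For reference, the example frames can be rebuilt like this (a minimal sketch of my own, not from the original post; treating `content` as strings is an assumption, since the values are hex and would otherwise be parsed as numbers):

```python
import pandas as pd

A = pd.DataFrame({'time_a': [100, 101, 102],
                  'content': ['f00', 'ba7', '4242']})
C = pd.DataFrame({'time_c': [400, 402, 407, 409],
                  'data.data': ['otherf00other', 'onlyrandom',
                                'otherba7other', 'other4242other']})
```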
My solution below uses iterators, but it is too slow. This answer explains why and suggests ways to improve it, but I am struggling to implement any of them.
How can I do this with the optimized methods from pandas?
```python
import pandas as pd

# reset_index() was called on both dataframes beforehand
df_alsa_copy = df_alsa.copy()  # never modify the frame you are iterating over
df_alsa_copy['cap_fno'] = -1
for aIndex, aRow in df_alsa.iterrows():
    for cIndex, cRow in df_c.iterrows():
        if str(aRow['content']) in str(cRow['data.data']):
            df_alsa_copy.loc[aIndex, 'cap_fno'] = df_c.loc[cIndex, 'frame.number']

# https://stackoverflow.com/questions/31528819/using-merge-on-a-column-and-index-in-pandas
# Merge on the frame.number column (because I chose to include it in df_alsa_copy)
df_ltnc = pd.merge(df_alsa_copy, df_c, left_on='cap_fno', right_on='frame.number')
```
Also tried:
- This would work if there were an exact match: https://stackoverflow.com/questions/44080248/pandas-join-dataframe-with-condition.
- I also managed to match my second frame against a known string with `series.str.contains` (see the sketch after this list).
- The problem is that I cannot pass a dataframe column as the value to match in `merge(on=...)`; I can only pass a known string.
- The same problem arose when I used `apply`.
- I did not succeed with `isin` or similar.
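For illustration, here is a minimal sketch of the gap described above (my own example, using the frames built earlier): matching against one known string works, but an equality-based merge cannot express "contains":

```python
# works: a boolean mask for ONE known string
mask = C['data.data'].str.contains('f00')

# does not work for this task: merge() matches on equality only,
# so 'f00' never equals 'otherf00other' and the result is empty
empty = pd.merge(A, C, left_on='content', right_on='data.data')
```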
More info: `A` holds timestamped content I fed into the program. `C` is a network capture. I want to know the time between feeding and capture.

I assume:
- The strings occur in the same order in `A` and `C`.
- But in `C` there may be other lines in between.
- The strings represent hex values.
- `data.data` contains other characters besides the string I am looking for.

Maybe I lack the pandas vocabulary to look for the correct method.
Answer 1
Score: 2
Try this approach using `pandas.unique()`, `pandas.Series.str.contains`, and `pandas.DataFrame.merge`:
```python
# collect the unique search strings and build an alternation pattern
unique_str = A['content'].unique()

# keep only the rows of C whose data.data contains one of the strings
matching_rows = C[C['data.data'].str.contains('|'.join(unique_str))]

# extract the matching substring and use it as the join key against content
out = pd.merge(matching_rows, A,
               left_on=matching_rows['data.data']
                   .str.extract(f'({"|".join(unique_str)})')[0],
               right_on='content')[['time_a', 'time_c', 'content']]
print(out)
```
   time_a  time_c content
0     100     400     f00
1     101     407     ba7
2     102     409    4242
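One caveat worth adding (my note, not part of the original answer): `str.contains` and `str.extract` treat the joined values as a regular expression, so if `content` could ever hold regex metacharacters, escaping them first is safer:

```python
import re

# escape each value so it is matched literally, not as a regex
pattern = '|'.join(map(re.escape, A['content'].astype(str).unique()))
matching_rows = C[C['data.data'].str.contains(pattern)]
key = matching_rows['data.data'].str.extract(f'({pattern})')[0]
out = pd.merge(matching_rows, A, left_on=key,
               right_on='content')[['time_a', 'time_c', 'content']]
```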
Answer 2
Score: 1
If you want to improve speed, a different option you might consider is Polars (https://www.pola.rs/). You can install it with `pip install polars`.
The solution is the same as what @Jamiu proposes, and I think his approach is the right one; the only difference is Polars instead of pandas.
I tested the two solutions after multiplying the number of rows by 1000: the pandas solution takes 400 ms, while the Polars one takes 92 ms.
```python
import polars as pl

# convert the pandas data to Polars dataframes
a, c = pl.from_pandas(A), pl.from_pandas(C)

# build an alternation pattern from the unique content values
unique_values = f"({a['content'].unique().str.concat('|').item()})"

# keep the rows of c containing one of the values, extract the
# matching substring, and join it against content
out = (
    a.join(c.filter(pl.col('data.data').str.contains(unique_values)),
           left_on='content',
           right_on=pl.col('data.data').str.extract(unique_values))
)

# convert back to pandas if needed
out_pandas = out.to_pandas()
```
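Since the original goal was the time between feeding and capture, the delta can be added directly in Polars before converting back (my addition; the column name `latency` is hypothetical):

```python
# time between feeding (time_a) and capture (time_c)
out = out.with_columns((pl.col('time_c') - pl.col('time_a')).alias('latency'))
```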