2023年4月4日 17:12:07go评论86阅读模式

英文:

Find rows in Julia dataframe with substrings in several columns

问题

I'm new to Julia (and StackOverflow).

I want to extract rows, originally from an Excel file, that contain a substring in more than one column.

I found the following answer which solves the first of three aspects of my issue: https://stackoverflow.com/questions/66376876/finding-rows-in-a-julia-dataframe-where-substrings-are-not-found

Example extended with more columns:

using DataFrames
df1 = DataFrame(col1=[1,62,13,24], col2=["ab","bc","cd","de"], col3=["de","bc","cd","de"], col4=["de","bc","cd","de"])

What I additionally like to do is:
1.) How do I specify a (consecutive) range of columns to look through ("second column":"last used column")?
I can list the columns explicitly, but I would like to mention them relatively. I can't rely on the names to be the same with the new version of the Excel file.

df2 = filter(r -> any(occursin.(["e"], [r.col3, r.col4])), df1)

2.) Select only rows that contain more than one occurrence of the substring.
That's what I tried:

df2 = filter(r -> (sum(occursin.(["e"], r.col2)) > 1), df1)

which returns only an empty DataFrame.
empty DataFrame

Is a DataFrame the "Julian" way to represent my Excel file at all? In my mind, coming from the "old" programming language world, using column names contained in the Excel file as designators in the script source code feels odd.

英文:

I'm new to Julia (and StackOverflow).
I want to extract rows, originally from an Excel file, that contain a substring in more than one column.

I found the following answer which solves the first of three aspects of my issue: https://stackoverflow.com/questions/66376876/finding-rows-in-a-julia-dataframe-where-substrings-are-not-found

Example extended with more columns:

using DataFrames
df1 = DataFrame(col1=[1,62,13,24], col2=[&quot;ab&quot;,&quot;bc&quot;,&quot;cd&quot;,&quot;de&quot;], col3=[&quot;de&quot;,&quot;bc&quot;,&quot;cd&quot;,&quot;de&quot;], col4=[&quot;de&quot;,&quot;bc&quot;,&quot;cd&quot;,&quot;de&quot;])

df2 = filter(r -&gt; any(occursin.([&quot;e&quot;], [r.col3,r.col4])), df1)

2.) Select only rows that contain more than one occurance of the substring.
That's what I tried:

df2 = filter(r -&gt; (sum(occursin.([&quot;e&quot;], r.col2)) &gt; 1), df1)

which returns only an empty DataFrame.
empty DataFrame(https://i.stack.imgur.com/MSxBp.png)

Is a DataFrame the "Julian" way to represent my Excel file at all? In my mind coming from the "old" programming language world using column names contained in the Excel file as designators in the script source code feels odd.

答案1

得分: 1

这是你可能想要的内容：

julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col4)]) > 1, df1)
2×4 DataFrame
 Row │ col1   col2    col3    col4
     │ Int64  String  String  String
─────┼───────────────────────────────
   1 │     1  ab      de      de
   2 │    24  de      de      de
julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col3)]) > 1, df1)
1×4 DataFrame
 Row │ col1   col2    col3    col4
     │ Int64  String  String  String
─────┼───────────────────────────────
   1 │    24  de      de      de

解释：

r[Between(:col2, :col4)] 保留了位于列 :col2 和 :col4 之间的所有列（包括它们自己）；在第二个示例中，我仅包括了列直到 :col3，这会删除一行，符合预期。
contains("e") 返回一个函数，接受一个字符串，并返回 true，如果该字符串包含 "e"，例如 contains("e")("abc") 为 false，contains("e")("abec") 为 true。
count 计算条件函数（通常称为谓词）作为第一个参数返回 true 的次数。

英文:

You most likely want this:

julia&gt; df2 = filter(r -&gt; count(contains(&quot;e&quot;), r[Between(:col2, :col4)]) &gt; 1, df1)
2&#215;4 DataFrame
 Row │ col1   col2    col3    col4
     │ Int64  String  String  String
─────┼───────────────────────────────
   1 │     1  ab      de      de
   2 │    24  de      de      de
julia&gt; df2 = filter(r -&gt; count(contains(&quot;e&quot;), r[Between(:col2, :col3)]) &gt; 1, df1)
1&#215;4 DataFrame
 Row │ col1   col2    col3    col4
     │ Int64  String  String  String
─────┼───────────────────────────────
   1 │    24  de      de      de

Explanation:

r[Between(:col2, :col4)] keeps all columns between columns :col2 and :col4 (including them); in the second example I included columns only up to :col3 which drops one row as expected
contains("e") returns a function taking a string and returning true if this string contains "e", example contains("e")("abc") is false and contains("e")("abec") is true
count counts number of time the condition function (typically called predicate) passed as a first argument returns true

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在Julia数据框中查找多列中包含子字符串的行。

问题

答案1

在Pandas系列中对列执行的函数。

未报告的整数溢出在pandas数学运算中，使用np.int16，np.int32时的pandas.eval()。

在单个图中绘制多个数据集

如何使用Julia中的JuMP找到在一些约束条件下的所有解决方案

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。