Join two pyspark dataframes on calculated value in condition

Question


I have two dataframes, df1 and df2. For each key (k1) I need to find the minimum date, where that minimum cannot be 01-01-2020 or later, and for the rows where it is found, replace the dates (start and end) with the values from df2. An example:

    from pyspark.sql.types import StructType, StructField, StringType

    data = [('A', 'a', '03-05-2010', '02-02-2019'),
            ('B', 'a', '02-12-2010', '01-02-2011'),
            ('B', 'b', '02-12-2010', '01-02-2011'),
            ('B', 'c', '02-12-2010', '01-02-2011'),
            ('B', 'd', '03-01-2013', '01-03-2015'),
            ('B', 'e', '04-01-2014', '01-01-2020'),
            ('C', 'a', '01-01-2020', '01-01-2020')]
    schema = StructType([
        StructField("k1", StringType(), True),
        StructField("k2", StringType(), True),
        StructField("start", StringType(), True),
        StructField("end", StringType(), True),
    ])
    df1 = spark.createDataFrame(data=data, schema=schema)
    df1.show()

    +---+---+----------+----------+
    | k1| k2|     start|       end|
    +---+---+----------+----------+
    |  A|  a|03-05-2010|02-02-2019|
    |  B|  a|02-12-2010|01-02-2011|
    |  B|  b|02-12-2010|01-02-2011|
    |  B|  c|02-12-2010|01-02-2011|
    |  B|  d|03-01-2013|01-03-2015|
    |  B|  e|04-01-2014|01-01-2020|
    |  C|  a|01-01-2020|01-01-2020|
    +---+---+----------+----------+

    # df2
    +---+---+----------+----------+
    | k1| k2|     start|       end|
    +---+---+----------+----------+
    |  A|  a|03-05-2010|02-02-2019|
    |  B|  a|01-01-2008|01-02-2008|
    |  B|  b|01-11-2009|01-12-2009|
    |  B|  c|02-01-2010|01-02-2010|
    |  B|  e|04-01-2014|01-01-2020|
    |  D|  a|01-01-2000|01-01-2001|
    +---+---+----------+----------+

and the result should look like this:

    +---+---+----------+----------+
    | k1| k2|     start|       end|
    +---+---+----------+----------+
    |  A|  a|03-05-2010|02-02-2019|  # no change, as it matches by key
    |  B|  a|01-01-2008|01-02-2008|  # replaced with df2's start and end, because this row's date in df1 was the minimal value for k1 (01-02-2011)
    |  B|  b|01-11-2009|01-12-2009|  # replaced with df2's start and end, because this row's date in df1 was the minimal value for k1 (01-02-2011)
    |  B|  c|02-01-2010|01-02-2010|  # replaced with df2's start and end, because this row's date in df1 was the minimal value for k1 (01-02-2011)
    |  B|  d|03-01-2013|01-03-2015|  # no change, as it is not the minimal value for k1
    |  B|  e|04-01-2014|01-01-2020|  # no change, as it is not the minimal value for k1, and it is over the limit
    |  C|  a|01-01-2020|01-01-2020|  # no change, as no matching value was found in df2
    +---+---+----------+----------+
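
One detail worth pinning down: the date strings are ambiguous between MM-dd-yyyy and dd-MM-yyyy. On this sample data the per-k1 minimum comes out the same under either reading, but the choice matters for real data. A quick way to inspect the per-key minimum under one assumption (MM-dd-yyyy, the format the answer below uses), given the df1 created above:

    from pyspark.sql import functions as F

    # Parsing with MM-dd-yyyy, the per-k1 minimum end for B comes out as
    # 2011-01-02, matching the rows marked as replaceable above.
    df1.select("k1", F.to_date("end", "MM-dd-yyyy").alias("end_d")) \
       .groupBy("k1").agg(F.min("end_d").alias("min_end")).show()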

thanks!!!

Answer 1

Score: 1


Maybe this will be correct:

    from pyspark.sql.types import *

    data = [('A', 'a', '03-05-2010', '02-02-2019'),
            ('B', 'a', '02-12-2010', '01-02-2011'),
            ('B', 'b', '02-12-2010', '01-02-2011'),
            ('B', 'c', '02-12-2010', '01-02-2011'),
            ('B', 'd', '03-01-2013', '01-03-2015'),
            ('B', 'e', '04-01-2014', '01-01-2020'),
            ('C', 'a', '01-01-2020', '01-01-2020')]
    schema = StructType([
        StructField("k1", StringType(), True),
        StructField("k2", StringType(), True),
        StructField("start", StringType(), True),
        StructField("end", StringType(), True),
    ])
    df1 = spark.createDataFrame(data=data, schema=schema).alias("df1")
    df1.show()

    data2 = [('A', 'a', '03-05-2010', '02-02-2019'),
             ('B', 'a', '01-01-2008', '01-02-2008'),
             ('B', 'b', '01-11-2009', '01-12-2009'),
             ('B', 'c', '02-01-2010', '01-02-2010'),
             ('B', 'e', '04-01-2014', '01-01-2020'),
             ('D', 'a', '01-01-2000', '01-01-2001')]
    df2 = spark.createDataFrame(data=data2, schema=schema).alias("df2")
    df2.show()

Join:

    import pyspark.sql.functions as F

    df1.join(df2, (df1.k1 == df2.k1) & (df1.k2 == df2.k2), how='left') \
       .select(df1.k1,
               df1.k2,
               F.expr("""Coalesce(CASE WHEN TO_DATE(CAST(UNIX_TIMESTAMP(df2.start, 'MM-dd-yyyy') AS TIMESTAMP)) >
                                            TO_DATE(CAST(UNIX_TIMESTAMP(df1.start, 'MM-dd-yyyy') AS TIMESTAMP))
                                       THEN df1.start ELSE df2.start END,
                                  df1.start)""").alias("start"),
               F.expr("""Coalesce(CASE WHEN TO_DATE(CAST(UNIX_TIMESTAMP(df2.end, 'MM-dd-yyyy') AS TIMESTAMP)) >
                                            TO_DATE(CAST(UNIX_TIMESTAMP(df1.end, 'MM-dd-yyyy') AS TIMESTAMP))
                                       THEN df1.end ELSE df2.end END,
                                  df1.end)""").alias("end")) \
       .sort(df1.k1.asc(), df1.k2.asc()).show()
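
Each of the two expressions takes the earlier of the matched pair of dates and falls back to df1's value when the left join finds no match (the NULL comparison falls into the ELSE branch, and Coalesce then restores df1's value). For readers who prefer the DataFrame API over SQL strings, a roughly equivalent sketch, not part of the original answer:

    from pyspark.sql import functions as F

    # Parse both sides; to_date yields NULL when the left join found no df2 row.
    s1 = F.to_date(df1["start"], "MM-dd-yyyy")
    s2 = F.to_date(df2["start"], "MM-dd-yyyy")
    e1 = F.to_date(df1["end"], "MM-dd-yyyy")
    e2 = F.to_date(df2["end"], "MM-dd-yyyy")

    # If df2's date is later keep df1's, otherwise take df2's; a NULL comparison
    # falls through to otherwise(), and coalesce() restores df1's original value.
    df1.join(df2, (df1.k1 == df2.k1) & (df1.k2 == df2.k2), how='left') \
       .select(df1.k1, df1.k2,
               F.coalesce(F.when(s2 > s1, df1["start"]).otherwise(df2["start"]),
                          df1["start"]).alias("start"),
               F.coalesce(F.when(e2 > e1, df1["end"]).otherwise(df2["end"]),
                          df1["end"]).alias("end")) \
       .sort(df1.k1.asc(), df1.k2.asc()).show()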

Result:

    +---+---+----------+----------+
    | k1| k2|     start|       end|
    +---+---+----------+----------+
    |  A|  a|03-05-2010|02-02-2019|
    |  B|  a|01-01-2008|01-02-2008|
    |  B|  b|01-11-2009|01-12-2009|
    |  B|  c|02-01-2010|01-02-2010|
    |  B|  d|03-01-2013|01-03-2015|
    |  B|  e|04-01-2014|01-01-2020|
    |  C|  a|01-01-2020|01-01-2020|
    +---+---+----------+----------+
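
For completeness: the answer takes the pairwise earlier date for every matched row, which reproduces the expected output on this data. The rule exactly as worded in the question (replace only the rows that hold the per-k1 minimum end date, and only when that minimum is before 01-01-2020) could also be spelled out with a window function. A sketch under the same MM-dd-yyyy assumption, not part of the original answer:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Parse the end dates so comparisons are chronological, not lexicographic.
    d1 = df1.withColumn("end_d", F.to_date("end", "MM-dd-yyyy"))

    # Per-k1 minimum end date, and the 01-01-2020 cutoff from the question.
    d1 = d1.withColumn("min_end", F.min("end_d").over(Window.partitionBy("k1")))
    cutoff = F.to_date(F.lit("01-01-2020"), "MM-dd-yyyy")

    # A row is replaceable when it holds its key's minimum end date and that
    # minimum falls before the cutoff.
    d1 = d1.withColumn("replaceable",
                       (F.col("end_d") == F.col("min_end")) & (F.col("min_end") < cutoff))

    # Pull df2's dates in under distinct names, then substitute only where the
    # row is replaceable and a df2 match actually exists.
    result = (d1.join(df2.select("k1", "k2",
                                 F.col("start").alias("start2"),
                                 F.col("end").alias("end2")),
                      on=["k1", "k2"], how="left")
                .withColumn("start", F.when(F.col("replaceable") & F.col("start2").isNotNull(),
                                            F.col("start2")).otherwise(F.col("start")))
                .withColumn("end", F.when(F.col("replaceable") & F.col("end2").isNotNull(),
                                          F.col("end2")).otherwise(F.col("end")))
                .select("k1", "k2", "start", "end")
                .sort("k1", "k2"))
    result.show()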
