PySpark – How to output CSV/Parquet files with sequential records?


Question


I plan to read data from a very large BigQuery table and then output it in files of 61,000 sequential records each. I've tried the code below:

from pyspark.sql import SparkSession

TMP_BUCKET = "stg-gcs-bucket"
MAX_PARTITION_BYTES = str(512 * 1024 * 1024)
# 1k Account per file
# MAX_ROW_NUM_PER_FILE = "18300"
MAX_ROW_NUM_PER_FILE = "61000"

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('crs-bq-export-csv') \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.2.jar') \
    .config("spark.sql.broadcastTimeout", "36000") \
    .config("spark.sql.files.maxRecordsPerFile", MAX_ROW_NUM_PER_FILE) \
    .config("spark.sql.files.maxPartitionBytes", MAX_PARTITION_BYTES) \
    .config("spark.files.maxPartitionBytes", MAX_PARTITION_BYTES) \
    .config("spark.driver.maxResultSize", "24g") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()


#Try to read full data from BQ
df = spark.read.format('bigquery') \
    .option('table', TABLE_NAME) \
    .load()

df.sort('colA').sort('colB').write.mode('overwrite').csv(OUTPUT_PATH, header=True)

but the final results are not sorted by colA and colB; they are all out of order:
Expected CSV:

colA colB
1. 1
2. 2
3. 3
....
60001 60001

But got:

colA colB
2. 1
3. 3
2. 2
1. 3

I checked the Spark docs: Spark shuffles the DataFrame to get better performance, but I need the final CSV in a specific order. How can I achieve this?

What can I do in this case? Any help would be greatly appreciated!

Answer 1

Score: 1


I create the dataframe like this:

data = [("2.", "1"),
        ("3.", "3"),
        ("2.", "2"),
        ("1.", "3")]

columns = ["colA", "colB"]

df = spark.createDataFrame(data, columns)
df.show()

+----+----+
|colA|colB|
+----+----+
|2.  |1   |
|3.  |3   |
|2.  |2   |
|1.  |3   |
+----+----+

If I run your code I get:

df.sort('colA').sort('colB').show()

+----+----+
|colA|colB|
+----+----+
|  2.|   1|
|  2.|   2|
|  1.|   3|
|  3.|   3|
+----+----+

Looking at the execution plan, we can see it sorts by colB only:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [colB#1 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(colB#1 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=94]
      +- Scan ExistingRDD[colA#0,colB#1]

And that is in line with how the sort function is implemented: it sorts the whole DataFrame based on the columns you pass to it. So the net effect of chaining sort calls is that the resulting DataFrame is sorted only by the columns of the last sort call.
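
To make this concrete (a quick check of my own, not part of the original answer), you can compare the plan of the chained call with a plain sort by colB alone; both should contain a single Sort on colB, because the optimizer drops the earlier sort('colA'):

# Run against the same toy df as above.
# Both plans should show a single Sort on colB: the earlier sort('colA')
# is optimized away, so chaining the calls adds nothing.
df.sort('colA').sort('colB').explain()
df.sort('colB').explain()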

Here is the correct approach for your use case:

df.sort('colA', 'colB').show()
df.sort('colA', 'colB').explain()

+----+----+
|colA|colB|
+----+----+
|  1.|   3|
|  2.|   1|
|  2.|   2|
|  3.|   3|
+----+----+

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [colA#0 ASC NULLS FIRST, colB#1 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(colA#0 ASC NULLS FIRST, colB#1 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=148]
      +- Scan ExistingRDD[colA#0,colB#1]

As you can see in the output dataframe and in the execution plan, it sorts by both columns because I am passing both columns to the sort function, first by colA and then by colB.
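
Applied back to the write in the question (a sketch using the OUTPUT_PATH and the BigQuery df defined there), the single sort call would look like this:

# Sort by both columns in one call, then write; the
# spark.sql.files.maxRecordsPerFile setting still caps each output file
# at 61,000 rows.
df.sort('colA', 'colB') \
    .write.mode('overwrite') \
    .csv(OUTPUT_PATH, header=True)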

huangapple
  • Posted on 2023-02-19 18:09:29
  • Original link: https://go.coder-hub.com/75499358.html