2023-05-13 20:10:20

If Spark isn't a storage system, how do tables work?

Question

I am reading "Learning Spark, 2nd Edition" and I find two parts confusing:
"Unlike Hadoop, which included both storage and compute, Spark decouples the two. That means you can (...) read data from sources, (...) process it in memory"
Also, on Reddit and in courses, I found claims that Spark doesn't store data; it only processes it.

Now I am reading the Spark SQL part, and it says that I can create tables that will be managed by Spark and will hold data (in contrast to views, which do not hold data). I have trouble understanding this.

"To enable you to query structured data (...) Spark manages (...) creating and managing views and tables, both in memory and on disk"

I tried googling and reading, but what I found only mentions that the data can be held in HDFS etc., and the examples in the book don't mention that. So I would like to clarify: let's say I install Spark only, without a separate HDFS or Hive installation. Can I create tables and hold data in them? What is the difference between those tables and DataFrames? How are the tables stored? Metadata is stored in the catalog, but what about the actual data? If I take a CSV file, create a table with data from it, then delete the CSV file, can I still read the data? If so, where is the data? And if not, then maybe tables do not hold data and are therefore no different from views?

Answer 1

Score: 1

Spark is a tool specifically designed for processing data, not storing it. In the context of Hadoop, it's more akin to MapReduce, the processing engine, rather than HDFS, the storage layer. In fact, Spark can operate on data stored in HDFS, effectively serving as a replacement for MapReduce in the Hadoop ecosystem.

Regarding tables, Spark creates a (temporary) view on top of the storage system, which makes that data easier to process. When you use DataFrames or SQL in your code, Spark translates them into its own internal format for processing. This is somewhat comparable to how pandas reads data from a file system and lets you query the resulting DataFrame. Similarly, Spark reads data from a storage layer (like HDFS or S3), then builds an internal table/view based on your instructions.

Spark has two kinds of temporary views, neither of which holds its own copy of the data:

Global temporary view: accessible from every session of the same Spark application (through the global_temp database); it disappears when the application ends.
Local temporary view: available only in the session that created it; it disappears when that session ends.
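The difference in scope between the two kinds of view can be sketched in PySpark as follows. This is only an illustration under stated assumptions: a local PySpark installation is available, and the helper names (build_local_session, temp_view_visibility) and the view names are made up for the example.

```python
def build_local_session(app_name="temp-view-demo"):
    """Start (or reuse) a local Spark session; hypothetical helper for this sketch."""
    # Imported lazily so the helpers can be loaded without starting Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[1]")
        .appName(app_name)
        .getOrCreate()
    )


def temp_view_visibility(spark):
    """Register both kinds of temporary view and check them from a second session."""
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Session-scoped: only the session that registered it can see it.
    df.createOrReplaceTempView("local_people")
    # Application-scoped: registered under the reserved global_temp database.
    df.createOrReplaceGlobalTempView("shared_people")

    # A second session inside the same application (same SparkContext).
    other = spark.newSession()
    names_in_other = {t.name for t in other.catalog.listTables()}

    result = {
        "local_visible_in_new_session": "local_people" in names_in_other,
        "global_rows_in_new_session": other.table("global_temp.shared_people").count(),
    }

    # Clean up so repeated runs start from the same state.
    spark.catalog.dropTempView("local_people")
    spark.catalog.dropGlobalTempView("shared_people")
    return result
```

Calling temp_view_visibility(build_local_session()) should report that the local view is not visible from the new session, while the global one can still be queried there. Neither view writes anything to disk; both are just names for a query plan.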

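You can also create tables with Spark SQL that are akin to tables in a standard database, which gives you a familiar structure for querying and manipulating the data. Unlike a view, a managed table does persist its data: Spark writes the rows as files to whatever storage backs its warehouse directory (the local file system if nothing else is configured) and records the table's metadata in the catalog.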
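The question's CSV experiment can be tried directly. The sketch below assumes a plain local PySpark installation with no HDFS or Hive; the helper names, the table name people_managed, and the sample rows are invented for illustration. It points spark.sql.warehouse.dir at a throwaway directory so nothing is left behind in the working directory.

```python
import os


def build_local_session(warehouse_dir):
    """Start a local session whose managed tables live under warehouse_dir."""
    # Imported lazily so the helpers can be loaded without starting Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[1]")
        .appName("managed-table-demo")
        # Static setting: only takes effect if no session is running yet.
        .config("spark.sql.warehouse.dir", warehouse_dir)
        .getOrCreate()
    )


def table_survives_source_deletion(spark, work_dir):
    """Load a CSV into a managed table, delete the CSV, then read the table."""
    csv_path = os.path.join(work_dir, "people.csv")
    with open(csv_path, "w") as f:
        f.write("id,name\n1,Ada\n2,Linus\n")

    df = spark.read.option("header", True).csv(csv_path)
    # saveAsTable copies the rows into files under the warehouse directory.
    df.write.mode("overwrite").saveAsTable("people_managed")

    os.remove(csv_path)  # the original source is gone now

    # The table is read from Spark's own copy, not from the deleted CSV.
    rows = spark.table("people_managed").count()
    spark.sql("DROP TABLE IF EXISTS people_managed")  # also deletes the data files
    return rows
```

With two temporary directories (one as the warehouse, one as work_dir), the function should still count both rows after the CSV has been removed, because the table owns a separate copy of the data on the local file system. One caveat, as far as I know: without Hive support the catalog itself is kept in memory, so the table definition does not outlive the application even though the files would remain on disk if the table were not dropped.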

Answer 2

Score: 1

In simple words: Spark is just a data processing engine. It is used to read files from the local file system, HDFS, S3 or a database table, and write out an output to another or to the same destination.

The sentence you found, "creating and managing views and tables, both in memory and on disk", most likely refers to the fact that you can cache data in memory and/or on disk while a job runs. This is usually done to reuse data that you need in multiple steps of your job, without creating temporary output that you have to clean up manually later. Besides, using cached data is much faster than reading a file or a table twice.
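As a rough illustration of the caching described here, the following PySpark sketch persists a DataFrame and reuses it in two actions. It assumes a local PySpark installation; the helper names and the generated data are made up for the example.

```python
def build_local_session(app_name="cache-demo"):
    """Start (or reuse) a local Spark session; hypothetical helper for this sketch."""
    # Imported lazily so the helpers can be loaded without starting Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[1]")
        .appName(app_name)
        .getOrCreate()
    )


def cache_and_reuse(spark):
    """Persist a DataFrame, run two actions against it, then release it."""
    from pyspark import StorageLevel

    df = spark.range(1000).withColumnRenamed("id", "n")

    # Keep partitions in memory and spill to local disk if they do not fit.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    total = df.count()                        # first action fills the cache
    evens = df.filter("n % 2 = 0").count()    # second action can reuse it
    was_cached = df.is_cached

    df.unpersist()  # cached blocks are temporary and are released here
    return total, evens, was_cached
```

The cached blocks belong to the running application and vanish when it ends or when unpersist is called, which is why this kind of "in memory and on disk" storage is not a substitute for a table or an output file.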
