2023-03-03 18:52:57

PySpark DataFrame is longer than the pandas DataFrame for the same CSV

Question

I am experimenting with different ways to load data into DataFrames.
One of the frameworks I am looking into is PySpark, but when I load a CSV with 14149 rows and check the length of the DataFrame, it returns 14153, while pandas returns 14149.

  import pandas as pd
  from pyspark.sql import SparkSession

  # pandas
  df = pd.read_csv("data_file.csv")
  print(df.shape)

  # spark
  spark = SparkSession \
    .builder \
    .appName("Task_1") \
    .getOrCreate()
  df = spark.read.csv("data_file.csv")
  print(df.toPandas().shape)

The result is (14149, 5) for pandas and (14153, 5) for PySpark.

When I inspect the Spark DataFrame, the head looks fine and the tail has the correct information, but the id number is off. Where are the four extra rows coming from, and how can I prevent PySpark from adding rows to a DataFrame that are not in the source file?

Link to the file:
training.csv (https://anonymfile.com/OEgRK/training.csv)

Answer 1

Score: 1

Your file has 3 bad lines where the separator is a tab (\t) instead of a comma (lines 12657, 12658, 12659):

nt-12657	which location has the east-west game been played at the least?	csv/203-csv/636.csv	Oakland, CA|San Antonio, TX
nt-12658	the most matches were in what year?	csv/204-csv/962.csv	2011-12
nt-12659	what was the last single released?	csv/203-csv/696.csv	Je me souviens de tout"""

pandas drops these rows while PySpark loads them (3 rows). pandas treats the first row as the header while PySpark, by default, does not (1 row). Together that accounts for the difference of 4 rows.
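
If you want to locate such lines yourself before loading, here is a minimal sketch using only the standard library. The helper name `find_bad_lines`, the column names, and the sample text are made up for illustration; in practice you would read the contents of `training.csv` instead. It flags every row whose field count differs from the header's when split on commas.

```python
import csv
import io

def find_bad_lines(text, delimiter=","):
    """Return (line_number, field_count) for rows whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of the last physical line read
            bad.append((reader.line_num, len(row)))
    return bad

# Hypothetical miniature of the file: a header, two good rows,
# and one row that uses tabs instead of commas.
sample = (
    ",id,utterance,context,targetValue\n"
    "0,nt-0,first question?,csv/a.csv,1\n"
    "1\tnt-1\tsecond question?\tcsv/b.csv\t2\n"
    "2,nt-2,third question?,csv/c.csv,3\n"
)

bad_lines = find_bad_lines(sample)
print(bad_lines)
```

Once you know which lines are affected, you can either fix the separators in the source file or filter those rows out after loading.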

You can use:

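# pandas equivalent: df = pd.read_csv('training.csv', index_col=0)
# header=True makes Spark treat the first line as column names;
# dropna() removes rows with missing values, which filters out the malformed rows here.
df = spark.read.csv('training.csv', header=True).toPandas().dropna().set_index('_c0')
print(df.shape)
# Output
# (14149, 5)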
