PySpark DataFrame has more rows than the pandas DataFrame for the same CSV

Question

I am experimenting with different ways to load data into DataFrames.
One of the frameworks I am looking into is PySpark, but when I load a CSV file with 14149 rows and check the length of the DataFrame, it returns 14153, while pandas returns 14149.

  # pandas
  import pandas as pd
  df = pd.read_csv("data_file.csv")
  print(df.shape)

  # spark
  from pyspark.sql import SparkSession
  spark = SparkSession \
    .builder \
    .appName("Task_1") \
    .getOrCreate()
  df = spark.read.csv("data_file.csv")
  print(df.toPandas().shape)

The result is (14149, 5) for pandas and (14153, 5) for PySpark.

When I inspect the Spark DataFrame, the head looks fine and the tail has the correct information, but the id numbers are off. Where do the four extra rows come from, and how can I prevent PySpark from adding rows to a DataFrame that are not in the source file?

Link to the file:
training.csv (https://anonymfile.com/OEgRK/training.csv)

Answer 1

Score: 1

You have 3 bad lines in your file where the separator is not ',' but a tab (\t), at lines 12657, 12658, and 12659:

nt-12657	which location has the east-west game been played at the least?	csv/203-csv/636.csv	Oakland, CA|San Antonio, TX
nt-12658	the most matches were in what year?	csv/204-csv/962.csv	2011-12
nt-12659	what was the last single released?	csv/203-csv/696.csv	Je me souviens de tout"""

pandas drops these rows while PySpark loads them (3 rows). pandas treats the first row as the header, while PySpark by default does not (1 row). Together that accounts for the difference of 4 rows.

You can use:

# df = pd.read_csv('training.csv', index_col=0)
df = spark.read.csv('training.csv', header=True).toPandas().dropna().set_index('_c0')
print(df.shape)

# Output
(14149, 5)
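
If you want to locate such lines yourself before loading the file, a minimal sketch using only the Python standard library is shown below. It flags every row whose field count differs from the header's. The helper name find_bad_lines and the miniature sample data are hypothetical and only illustrate the idea; they are not taken from training.csv.

```python
import csv
import io


def find_bad_lines(text, delimiter=","):
    """Return (line_number, field_count) for rows whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of physical lines read so far
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical miniature file: line 3 uses tabs instead of commas.
sample = (
    ",id,question,table,answer\n"
    "0,nt-0,first question?,csv/a.csv,1999\n"
    "1\tnt-1\tsecond question?\tcsv/b.csv\t2000\n"
    "2,nt-2,third question?,csv/c.csv,2001\n"
)

bad_lines = find_bad_lines(sample)
print(bad_lines)  # [(3, 1)]
```

For a real file you would pass its contents instead of the sample, for example open('training.csv', newline='').read(). A row containing an unbalanced quote may be read across several physical lines, so the reported line number is then that of the last line consumed.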

huangapple
  • Published on March 3, 2023 at 18:52:57
  • Link to this article: https://go.coder-hub.com/75626130.html