Length of PySpark DataFrame is bigger than when using pandas
Question
I am experimenting with different ways to load data into dataframes.
One of the frameworks I am looking into is PySpark, but when I load a CSV with 14149 rows and return the length of the DataFrame, it returns 14153, while pandas returns 14149.
import pandas as pd
from pyspark.sql import SparkSession

# pandas
df = pd.read_csv("data_file.csv")
print(df.shape)

# Spark
spark = SparkSession \
    .builder \
    .appName("Task_1") \
    .getOrCreate()
df = spark.read.csv("data_file.csv")
print(df.toPandas().shape)
The results are (14149, 5) for pandas and (14153, 5) for PySpark.
When I inspect the Spark DataFrame, the head looks fine and the tail has the correct information, but the id number is off. Where are the extra four rows coming from, and how can I prevent PySpark from adding rows that are not in the source file?
Link to the file:
training.csv (https://anonymfile.com/OEgRK/training.csv)
Answer 1
Score: 1
Your file has 3 bad lines where the separator is a tab (\t) rather than ',' (lines 12657, 12658, 12659):
nt-12657 which location has the east-west game been played at the least? csv/203-csv/636.csv Oakland, CA|San Antonio, TX
nt-12658 the most matches were in what year? csv/204-csv/962.csv 2011-12
nt-12659 what was the last single released? csv/203-csv/696.csv Je me souviens de tout"""
Pandas drops these rows while PySpark loads them (3 rows). Pandas treats the first row as the header while PySpark, by default, does not (1 row). That accounts for the difference of 4 rows.
You can use:
# df = pd.read_csv('training.csv', index_col=0)
df = spark.read.csv('training.csv', header=True).toPandas().dropna().set_index('_c0')
print(df.shape)
# Output
(14149, 5)
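If you want to locate rows like these before loading the file, one option is to scan it with the standard library's csv module and report every row whose field count differs from the header's. This is only a minimal sketch: the function name find_irregular_rows, the sample text, and its column names are made up for illustration and are not taken from training.csv.

```python
import csv
import io


def find_irregular_rows(text, delimiter=","):
    """Return (line_number, field_count) for rows whose field count
    differs from the header's field count."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    irregular = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of source lines read so far
            irregular.append((reader.line_num, len(row)))
    return irregular


# Hypothetical sample: the third line uses tabs instead of commas
sample = (
    ",id,question,table,answer\n"
    "0,nt-0,first question?,csv/a.csv,42\n"
    "1\tnt-1\tsecond question?\tcsv/b.csv\t2011-12\n"
    "2,nt-2,third question?,csv/c.csv,yes\n"
)
print(find_irregular_rows(sample))
```

For a real file you would pass the file's contents (or adapt the function to take an open file handle). Note that line numbers can drift from row numbers when quoted fields span several lines.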