How to build a Pandas Dataframe with a Numpy Array from Imported CSV data with multiple numbers


Question

I'm a little stumped on this one. I've created a proof of concept where I built a Pandas Dataframe with a static Numpy Array of numbers. I got this working fine, but now I'm taking it a step further and importing a CSV file to build this same Dataframe and Numpy Array. Here is the snippet of the file and what I've written. I want to take the second column of 'numbers' and build an array of 6 numbers per line. For example, [[11],[21],[27],[36],[62],[24]], [[14],[18],[36],[49],[67],[18]], etc.

CSV:

date,numbers,multiplier
09/26/2020,11 21 27 36 62 24,3
09/30/2020,14 18 36 49 67 18,2
10/03/2020,18 31 36 43 47 20,2

CODE:

import pandas as pd

data = pd.read_csv('pbhistory.csv')
data['date'] = pd.to_datetime(data.date, infer_datetime_format=True)
data.sort_values(by='date', ascending=True, inplace=True)
df = pd.DataFrame(data.numbers).to_numpy()
df2 = pd.DataFrame(df, columns=['1', '2', '3', '4', '5', '6'])
print(df2.head())

ERROR:
I'm expecting 6 columns of data from df2 as I thought it was converted to an array properly after importing the 'numbers' column from the CSV, but I get the following:

ValueError: Shape of passed values is (1414, 1), indices imply (1414, 6)

So, I change the code to df2 = pd.DataFrame(df, columns=['1']) and get the following output. The problem is, I need it to be in 6 columns, not 1.

                   1
0  11 21 27 36 62 24
1  14 18 36 49 67 18
2  18 31 36 43 47 20

So, as you can see, I'm only getting one column with all numbers, instead of an array of numbers with 6 columns.
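The shape mismatch can be reproduced on the three sample rows alone. The sketch below is only an illustration: it inlines the CSV text with io.StringIO instead of reading pbhistory.csv (the variable name csv_text is made up for this example), and it shows that each 'numbers' cell arrives as one string, so the converted array has a single column.

```python
import io

import pandas as pd

# Three sample rows from the question, inlined so the sketch runs
# without pbhistory.csv (csv_text is a hypothetical name).
csv_text = """date,numbers,multiplier
09/26/2020,11 21 27 36 62 24,3
09/30/2020,14 18 36 49 67 18,2
10/03/2020,18 31 36 43 47 20,2
"""

data = pd.read_csv(io.StringIO(csv_text))

# Each 'numbers' cell is one string such as "11 21 27 36 62 24",
# so converting the column to NumPy gives a single-column array.
arr = pd.DataFrame(data.numbers).to_numpy()
print(arr.shape)  # (3, 1)

# Asking for six column labels on a one-column array raises ValueError.
raised = False
try:
    pd.DataFrame(arr, columns=['1', '2', '3', '4', '5', '6'])
except ValueError as exc:
    raised = True
    print(exc)
```

Converting to NumPy does not split the strings; it only changes the container, which is why the six column labels do not fit.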

Answer 1

Score: 1

data = pd.read_csv('pbhistory.csv')
data['date'] = pd.to_datetime(data.date, infer_datetime_format=True)
data.sort_values(by='date', ascending=True, inplace=True)

Then split the 'numbers' column first, while it is still a pandas Series (the .str accessor is not available on a NumPy array):

df2 = data['numbers'].str.split(' ', expand=True)
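A self-contained sketch of the split-first approach, run on the three sample rows from the question. Several choices here are assumptions rather than part of the answer: the data is inlined through io.StringIO, the date format is passed explicitly (infer_datetime_format is deprecated in recent pandas releases), the tokens are cast to int, and the column labels '1' to '6' are taken from the question.

```python
import io

import pandas as pd

# Sample rows from the question, inlined instead of reading pbhistory.csv.
csv_text = """date,numbers,multiplier
09/26/2020,11 21 27 36 62 24,3
09/30/2020,14 18 36 49 67 18,2
10/03/2020,18 31 36 43 47 20,2
"""

data = pd.read_csv(io.StringIO(csv_text))
# Explicit format chosen for this sketch; the sample dates are MM/DD/YYYY.
data['date'] = pd.to_datetime(data['date'], format='%m/%d/%Y')
data = data.sort_values(by='date', ascending=True)

# expand=True returns a DataFrame with one column per token.
df2 = data['numbers'].str.split(' ', expand=True).astype(int)
df2.columns = ['1', '2', '3', '4', '5', '6']
print(df2.head())

# Convert to NumPy only after the split.
numbers = df2.to_numpy()
print(numbers.shape)  # (3, 6)
```

If every row is known to hold exactly six numbers, the labels can be assigned directly as above; a row with a different count would produce missing values or extra columns and make the cast or the label assignment fail.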

Answer 2

Score: 0

Remember that CSV stands for Comma-Separated Values, i.e. the parser treats everything between two commas as a single field. If you want the numbers in separate columns, you have to put commas between them in the file; otherwise you'll have to parse the longer text of six space-separated values yourself and rebuild the DataFrame.
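The "parse and rebuild" route could look roughly like the sketch below. It is one possible reading of this answer, not the poster's own code: the names rows, rebuilt and nested are made up for the example, the sample rows are inlined through io.StringIO, and the final reshape is added only to match the nested [[11],[21],...] layout described in the question.

```python
import io

import numpy as np
import pandas as pd

# Sample rows from the question, inlined instead of reading pbhistory.csv.
csv_text = """date,numbers,multiplier
09/26/2020,11 21 27 36 62 24,3
09/30/2020,14 18 36 49 67 18,2
10/03/2020,18 31 36 43 47 20,2
"""

data = pd.read_csv(io.StringIO(csv_text))

# Parse each space-separated string into a list of six ints.
rows = [[int(token) for token in text.split()] for text in data['numbers']]
arr = np.array(rows)
print(arr.shape)  # (3, 6)

# Rebuild the frame: original date and multiplier plus six number columns.
number_cols = pd.DataFrame(arr, columns=['1', '2', '3', '4', '5', '6'])
rebuilt = pd.concat([data[['date', 'multiplier']], number_cols], axis=1)
print(rebuilt.head())

# Optional: the nested layout from the question, one single-item list per number.
nested = arr.reshape(len(arr), 6, 1)
print(nested[0].tolist())  # [[11], [21], [27], [36], [62], [24]]
```

The concat step relies on both frames sharing the default 0..n-1 index, so it should be done before any sorting or filtering that changes the index of data.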

huangapple
  • Posted on January 6, 2023 at 10:38:28
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75026410.html