17 May 2023, 22:15:58

Convert awkwardly formatted Excel data into tabular format using Python

Question

I have an Excel spreadsheet containing records for each day of the month. Unfortunately, the dataset has been formatted in an awkward way, making it difficult to analyse. I would like to restructure the data into a tabular format with columns for the date, venues, and the corresponding quantity under each heading.

Current Format

	0	1	2	3	4	5	6	7	8	9
0	01/01/2023	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Venue1	QTY	Venue2	QTY	Venue3	QTY	Venue4	QTY	Venue5	QTY
2	A	0	A	0	A	1	A	0	A	0
3	B	17	B	3	B	11	B	3	B	0
4	C	0	C	0	C	1	C	0	C	0
5	D	0	D	0	D	29	D	0	D	0
6	E	0	E	0	E	0	E	0	E	0
7	F	0	F	0	F	0	F	0	F	0
8	G	0	G	0	G	0	G	0	G	0
9	H	0	H	0	H	0	H	0	H	0
10	02/01/2023	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	Venue1	QTY	Venue2	QTY	Venue3	QTY	Venue4	QTY	Venue5	QTY
12	A	0	A	0	A	1	A	0	A	0
13	B	11	B	3	B	0	B	6	B	2
14	C	0	C	0	C	0	C	0	C	0
15	D	20	D	0	D	28	D	0	D	24
16	E	0	E	0	E	0	E	0	E	0
17	F	0	F	0	F	0	F	0	F	0
18	G	0	G	0	G	0	G	0	G	0
19	H	0	H	0	H	0	H	0	H	0

Required Format

I've tried manipulating in pandas, but I'm not sure how to go about it exactly to get the desired result. Any suggestions or sample code would be greatly appreciated. Thank You!

Answer 1

Score: 1

Here is one way to do it with pandas reshaping:

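import pandas as pd

# Read the sheet without a header row so every cell is kept as data.
tmp = pd.read_excel("file.xlsx", header=None)
# True on rows whose first cell parses as a date (the start of each block).
m = pd.to_datetime(tmp[0], errors="coerce").notnull()
# One block per date; the "Venue1, QTY, ..." row becomes the column labels.
blocks = {n: g.set_axis(g.iloc[0], axis=1).iloc[1:]
          for n, g in tmp.loc[~m].groupby(tmp[0].where(m).ffill())}
df = (
    pd.concat(blocks, names=["Date"])
    # Copy the item labels (A..H) into a helper column named "Venues".
    .assign(Venues=lambda x: x["Venue1"])
    # Rename each "QTY" column to "QTY_<venue>" using the column to its left.
    .pipe(lambda x: x.set_axis(
        [f"{col}" if i % 2 == 0 else f"QTY_{x.columns[i-1]}"
         for i, col in enumerate(x.columns)], axis=1))
    # Keep only the quantity columns and index the rows by item label.
    .filter(regex="QTY.+|Venues").set_index("Venues", append=True)
    # Strip the "QTY_" prefix, drop the original row number, spread items into columns.
    .rename(lambda x: x.split("_")[1], axis=1).droplevel(1).unstack(1)
    # Move the venue level back to the rows (reset_index(names=...) needs pandas 1.5+).
    .stack(0).reset_index(names=["Date", "Venues"]).rename_axis(columns=None)
)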

Output:

print(df)
         Date  Venues  A   B  C   D  E  F  G  H
0  01/01/2023  Venue1  0  17  0   0  0  0  0  0
1  01/01/2023  Venue2  0   3  0   0  0  0  0  0
2  01/01/2023  Venue3  1  11  1  29  0  0  0  0
3  01/01/2023  Venue4  0   3  0   0  0  0  0  0
4  01/01/2023  Venue5  0   0  0   0  0  0  0  0
5  02/01/2023  Venue1  0  11  0  20  0  0  0  0
6  02/01/2023  Venue2  0   3  0   0  0  0  0  0
7  02/01/2023  Venue3  1   0  0  28  0  0  0  0
8  02/01/2023  Venue4  0   6  0   0  0  0  0  0
9  02/01/2023  Venue5  0   2  0  24  0  0  0  0
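
For readers who want to try the reshape without the Excel file, here is a minimal step-by-step sketch that is not part of the original answer. It rebuilds the question's sample layout in memory and swaps the single method chain for a loop over venue/QTY column pairs followed by a pivot. The helper make_block and the names raw, long and wide are hypothetical, introduced only for illustration. It assumes the dates are stored as dd/mm/yyyy strings and that every block repeats the same header row; a real sheet may hold true datetime cells instead, in which case the format argument would need adjusting.

```python
import numpy as np
import pandas as pd

VENUES = [f"Venue{i}" for i in range(1, 6)]
ITEMS = list("ABCDEFGH")


def make_block(date, nonzero):
    """Lay out one day like the sheet: a date row, a header row, one row per item."""
    rows = [[date] + [np.nan] * 9]
    rows.append([cell for v in VENUES for cell in (v, "QTY")])
    for item in ITEMS:
        qtys = nonzero.get(item, [0] * 5)  # items not listed are all zero
        rows.append([cell for q in qtys for cell in (item, q)])
    return rows


# Sample data copied from the question (hypothetical stand-in for read_excel).
raw = pd.DataFrame(
    make_block("01/01/2023", {"A": [0, 0, 1, 0, 0], "B": [17, 3, 11, 3, 0],
                              "C": [0, 0, 1, 0, 0], "D": [0, 0, 29, 0, 0]})
    + make_block("02/01/2023", {"A": [0, 0, 1, 0, 0], "B": [11, 3, 0, 6, 2],
                                "D": [20, 0, 28, 0, 24]})
)

# Rows whose first cell parses as a date start a new block.
parsed = pd.to_datetime(raw[0], format="%d/%m/%Y", errors="coerce")
is_date = parsed.notna()
date = parsed.ffill()  # carry each date down over its block

# Header rows are the ones with "QTY" in the second column.
is_header = raw[1].eq("QTY")
body = raw.loc[~is_date & ~is_header]

# Assumption: every block has the same header, so read venue names once.
venues = raw.loc[is_header].iloc[0].iloc[0::2].tolist()

# Stack each (item, quantity) column pair into a long table.
pieces = []
for k, venue in enumerate(venues):
    pieces.append(pd.DataFrame({
        "Date": date.loc[body.index],
        "Venues": venue,
        "Item": body[2 * k],
        "QTY": pd.to_numeric(body[2 * k + 1]),
    }))
long = pd.concat(pieces, ignore_index=True)

# One row per (Date, Venues), one column per item.
wide = (
    long.pivot(index=["Date", "Venues"], columns="Item", values="QTY")
        .reset_index()
        .rename_axis(columns=None)
)
print(wide)
```

Unlike the chain above, this variant stores Date as a real datetime, so later sorting and filtering work chronologically rather than alphabetically. pivot also raises an error if a (Date, Venues, Item) combination appears twice, which doubles as a sanity check on the sheet.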
