英文:
Convert awkwardly formatted excel data into tabular format using python
问题
I have an Excel spreadsheet containing records for each day of the month. Unfortunately, the dataset has been formatted in an awkward way, making it difficult to analyse. I would like to restructure the data into a tabular format with columns for the date, venues, and the corresponding quantity under each heading.
Current Format
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 01/01/2023 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Venue1 | QTY | Venue2 | QTY | Venue3 | QTY | Venue4 | QTY | Venue5 | QTY |
2 | A | 0 | A | 0 | A | 1 | A | 0 | A | 0 |
3 | B | 17 | B | 3 | B | 11 | B | 3 | B | 0 |
4 | C | 0 | C | 0 | C | 1 | C | 0 | C | 0 |
5 | D | 0 | D | 0 | D | 29 | D | 0 | D | 0 |
6 | E | 0 | E | 0 | E | 0 | E | 0 | E | 0 |
7 | F | 0 | F | 0 | F | 0 | F | 0 | F | 0 |
8 | G | 0 | G | 0 | G | 0 | G | 0 | G | 0 |
9 | H | 0 | H | 0 | H | 0 | H | 0 | H | 0 |
10 | 02/01/2023 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | Venue1 | QTY | Venue2 | QTY | Venue3 | QTY | Venue4 | QTY | Venue5 | QTY |
12 | A | 0 | A | 0 | A | 1 | A | 0 | A | 0 |
13 | B | 11 | B | 3 | B | 0 | B | 6 | B | 2 |
14 | C | 0 | C | 0 | C | 0 | C | 0 | C | 0 |
15 | D | 20 | D | 0 | D | 28 | D | 0 | D | 24 |
16 | E | 0 | E | 0 | E | 0 | E | 0 | E | 0 |
17 | F | 0 | F | 0 | F | 0 | F | 0 | F | 0 |
18 | G | 0 | G | 0 | G | 0 | G | 0 | G | 0 |
19 | H | 0 | H | 0 | H | 0 | H | 0 | H | 0 |
Required Format
I've tried manipulating in pandas, but I'm not sure how to go about it exactly to get the desired result. Any suggestions or sample code would be greatly appreciated. Thank You!
英文:
I have an Excel spreadsheet containing records for each day of the month. Unfortunately, the dataset has been formatted in an awkward way, making it difficult to analyse. I would like to restructure the data into a tabular format with columns for the date, venues, and the corresponding quantity under each heading. I've attached images of the current format as well as the desired output.
Current Format
0 1 2 3 4 5 6 7 8 9
0 01/01/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Venue1 QTY Venue2 QTY Venue3 QTY Venue4 QTY Venue5 QTY
2 A 0 A 0 A 1 A 0 A 0
3 B 17 B 3 B 11 B 3 B 0
4 C 0 C 0 C 1 C 0 C 0
5 D 0 D 0 D 29 D 0 D 0
6 E 0 E 0 E 0 E 0 E 0
7 F 0 F 0 F 0 F 0 F 0
8 G 0 G 0 G 0 G 0 G 0
9 H 0 H 0 H 0 H 0 H 0
10 02/01/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 Venue1 QTY Venue2 QTY Venue3 QTY Venue4 QTY Venue5 QTY
12 A 0 A 0 A 1 A 0 A 0
13 B 11 B 3 B 0 B 6 B 2
14 C 0 C 0 C 0 C 0 C 0
15 D 20 D 0 D 28 D 0 D 24
16 E 0 E 0 E 0 E 0 E 0
17 F 0 F 0 F 0 F 0 F 0
18 G 0 G 0 G 0 G 0 G 0
19 H 0 H 0 H 0 H 0 H 0
<br>
Required Format
I've tried manipulating in pandas, but i'm not sure how to go about it exactly to get the the desired result. Any suggestions or sample code would be greatly appreciated. Thank You!
答案1
得分: 1
以下是使用 pandas 重塑数据的一种方法:
tmp = pd.read_excel("file.xlsx", header=None)
m = pd.to_datetime(tmp[0], errors="coerce").notnull()
blocks = {n: g.set_axis(g.iloc[0], axis=1).iloc[1:]
for n,g in tmp.loc[~m].groupby(tmp[0].where(m).ffill())}
df = (
pd.concat(blocks, names=["Date"])
.assign(Venues= lambda x: x["Venue1"])
.pipe(lambda x: x.set_axis(
[f"{col}" if i%2 == 0 else f"QTY_{x.columns[i-1]}"
for i, col in enumerate(x.columns)], axis=1))
.filter(regex="QTY.+|Venues").set_index("Venues", append=True)
.rename(lambda x: x.split("_")[1], axis=1).droplevel(1).unstack(1)
.stack(0).reset_index(names=["Date", "Venues"]).rename_axis(columns=None)
)
输出:
print(df)
Date Venues A B C D E F G H
0 01/01/2023 Venue1 0 17 0 0 0 0 0 0
1 01/01/2023 Venue2 0 3 0 0 0 0 0 0
2 01/01/2023 Venue3 1 11 1 29 0 0 0 0
3 01/01/2023 Venue4 0 3 0 0 0 0 0 0
4 01/01/2023 Venue5 0 0 0 0 0 0 0 0
5 02/01/2023 Venue1 0 11 0 20 0 0 0 0
6 02/01/2023 Venue2 0 3 0 0 0 0 0 0
7 02/01/2023 Venue3 1 0 0 28 0 0 0 0
8 02/01/2023 Venue4 0 6 0 0 0 0 0 0
9 02/01/2023 Venue5 0 2 0 24 0 0 0 0
英文:
Here is one way to do it with pandas reshaping :
tmp = pd.read_excel("file.xlsx", header=None)
m = pd.to_datetime(tmp[0], errors="coerce").notnull()
blocks = {n: g.set_axis(g.iloc[0], axis=1).iloc[1:]
for n,g in tmp.loc[~m].groupby(tmp[0].where(m).ffill())}
df = (
pd.concat(blocks, names=["Date"])
.assign(Venues= lambda x: x["Venue1"])
.pipe(lambda x: x.set_axis(
[f"{col}" if i%2 == 0 else f"QTY_{x.columns[i-1]}"
for i, col in enumerate(x.columns)], axis=1))
.filter(regex="QTY.+|Venues").set_index("Venues", append=True)
.rename(lambda x: x.split("_")[1], axis=1).droplevel(1).unstack(1)
.stack(0).reset_index(names=["Date", "Venues"]).rename_axis(columns=None)
)
Output :
print(df)
Date Venues A B C D E F G H
0 01/01/2023 Venue1 0 17 0 0 0 0 0 0
1 01/01/2023 Venue2 0 3 0 0 0 0 0 0
2 01/01/2023 Venue3 1 11 1 29 0 0 0 0
3 01/01/2023 Venue4 0 3 0 0 0 0 0 0
4 01/01/2023 Venue5 0 0 0 0 0 0 0 0
5 02/01/2023 Venue1 0 11 0 20 0 0 0 0
6 02/01/2023 Venue2 0 3 0 0 0 0 0 0
7 02/01/2023 Venue3 1 0 0 28 0 0 0 0
8 02/01/2023 Venue4 0 6 0 0 0 0 0 0
9 02/01/2023 Venue5 0 2 0 24 0 0 0 0
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论