将笨拙格式的Excel数据使用Python转换成表格格式。

huangapple go评论78阅读模式
英文:

Convert awkwardly formatted excel data into tabular format using python

问题

I have an Excel spreadsheet containing records for each day of the month. Unfortunately, the dataset has been formatted in an awkward way, making it difficult to analyse. I would like to restructure the data into a tabular format with columns for the date, venues, and the corresponding quantity under each heading.

Current Format

0 1 2 3 4 5 6 7 8 9
0 01/01/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Venue1 QTY Venue2 QTY Venue3 QTY Venue4 QTY Venue5 QTY
2 A 0 A 0 A 1 A 0 A 0
3 B 17 B 3 B 11 B 3 B 0
4 C 0 C 0 C 1 C 0 C 0
5 D 0 D 0 D 29 D 0 D 0
6 E 0 E 0 E 0 E 0 E 0
7 F 0 F 0 F 0 F 0 F 0
8 G 0 G 0 G 0 G 0 G 0
9 H 0 H 0 H 0 H 0 H 0
10 02/01/2023 NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 Venue1 QTY Venue2 QTY Venue3 QTY Venue4 QTY Venue5 QTY
12 A 0 A 0 A 1 A 0 A 0
13 B 11 B 3 B 0 B 6 B 2
14 C 0 C 0 C 0 C 0 C 0
15 D 20 D 0 D 28 D 0 D 24
16 E 0 E 0 E 0 E 0 E 0
17 F 0 F 0 F 0 F 0 F 0
18 G 0 G 0 G 0 G 0 G 0
19 H 0 H 0 H 0 H 0 H 0

Required Format

将笨拙格式的Excel数据使用Python转换成表格格式。

I've tried manipulating in pandas, but I'm not sure how to go about it exactly to get the desired result. Any suggestions or sample code would be greatly appreciated. Thank You!

英文:

I have an Excel spreadsheet containing records for each day of the month. Unfortunately, the dataset has been formatted in an awkward way, making it difficult to analyse. I would like to restructure the data into a tabular format with columns for the date, venues, and the corresponding quantity under each heading. I've attached images of the current format as well as the desired output.

Current Format

             0    1       2    3       4    5       6    7       8    9
0   01/01/2023  NaN     NaN  NaN     NaN  NaN     NaN  NaN     NaN  NaN
1       Venue1  QTY  Venue2  QTY  Venue3  QTY  Venue4  QTY  Venue5  QTY
2            A    0       A    0       A    1       A    0       A    0
3            B   17       B    3       B   11       B    3       B    0
4            C    0       C    0       C    1       C    0       C    0
5            D    0       D    0       D   29       D    0       D    0
6            E    0       E    0       E    0       E    0       E    0
7            F    0       F    0       F    0       F    0       F    0
8            G    0       G    0       G    0       G    0       G    0
9            H    0       H    0       H    0       H    0       H    0
10  02/01/2023  NaN     NaN  NaN     NaN  NaN     NaN  NaN     NaN  NaN
11      Venue1  QTY  Venue2  QTY  Venue3  QTY  Venue4  QTY  Venue5  QTY
12           A    0       A    0       A    1       A    0       A    0
13           B   11       B    3       B    0       B    6       B    2
14           C    0       C    0       C    0       C    0       C    0
15           D   20       D    0       D   28       D    0       D   24
16           E    0       E    0       E    0       E    0       E    0
17           F    0       F    0       F    0       F    0       F    0
18           G    0       G    0       G    0       G    0       G    0
19           H    0       H    0       H    0       H    0       H    0

将笨拙格式的Excel数据使用Python转换成表格格式。

<br>

Required Format

将笨拙格式的Excel数据使用Python转换成表格格式。

I've tried manipulating in pandas, but i'm not sure how to go about it exactly to get the the desired result. Any suggestions or sample code would be greatly appreciated. Thank You!

答案1

得分: 1

以下是使用 pandas 重塑数据的一种方法:

tmp = pd.read_excel("file.xlsx", header=None)

m = pd.to_datetime(tmp[0], errors="coerce").notnull()

blocks = {n: g.set_axis(g.iloc[0], axis=1).iloc[1:]
          for n,g in tmp.loc[~m].groupby(tmp[0].where(m).ffill())}
df = (
    pd.concat(blocks, names=["Date"])
        .assign(Venues= lambda x: x["Venue1"])
        .pipe(lambda x: x.set_axis(
            [f"{col}" if i%2 == 0 else f"QTY_{x.columns[i-1]}"
             for i, col in enumerate(x.columns)], axis=1))
        .filter(regex="QTY.+|Venues").set_index("Venues", append=True)
     .rename(lambda x: x.split("_")[1], axis=1).droplevel(1).unstack(1)
     .stack(0).reset_index(names=["Date", "Venues"]).rename_axis(columns=None)
)

输出:

print(df)

         Date  Venues  A   B  C   D  E  F  G  H
0  01/01/2023  Venue1  0  17  0   0  0  0  0  0
1  01/01/2023  Venue2  0   3  0   0  0  0  0  0
2  01/01/2023  Venue3  1  11  1  29  0  0  0  0
3  01/01/2023  Venue4  0   3  0   0  0  0  0  0
4  01/01/2023  Venue5  0   0  0   0  0  0  0  0
5  02/01/2023  Venue1  0  11  0  20  0  0  0  0
6  02/01/2023  Venue2  0   3  0   0  0  0  0  0
7  02/01/2023  Venue3  1   0  0  28  0  0  0  0
8  02/01/2023  Venue4  0   6  0   0  0  0  0  0
9  02/01/2023  Venue5  0   2  0  24  0  0  0  0

参考链接

英文:

Here is one way to do it with pandas reshaping :

tmp = pd.read_excel(&quot;file.xlsx&quot;, header=None)

m = pd.to_datetime(tmp[0], errors=&quot;coerce&quot;).notnull()

blocks = {n: g.set_axis(g.iloc[0], axis=1).iloc[1:]
          for n,g in tmp.loc[~m].groupby(tmp[0].where(m).ffill())}
df = (
    pd.concat(blocks, names=[&quot;Date&quot;])
        .assign(Venues= lambda x: x[&quot;Venue1&quot;])
        .pipe(lambda x: x.set_axis(
            [f&quot;{col}&quot; if i%2 == 0 else f&quot;QTY_{x.columns[i-1]}&quot;
             for i, col in enumerate(x.columns)], axis=1))
        .filter(regex=&quot;QTY.+|Venues&quot;).set_index(&quot;Venues&quot;, append=True)
     .rename(lambda x: x.split(&quot;_&quot;)[1], axis=1).droplevel(1).unstack(1)
     .stack(0).reset_index(names=[&quot;Date&quot;, &quot;Venues&quot;]).rename_axis(columns=None)
)

Output :

print(df)

         Date  Venues  A   B  C   D  E  F  G  H
0  01/01/2023  Venue1  0  17  0   0  0  0  0  0
1  01/01/2023  Venue2  0   3  0   0  0  0  0  0
2  01/01/2023  Venue3  1  11  1  29  0  0  0  0
3  01/01/2023  Venue4  0   3  0   0  0  0  0  0
4  01/01/2023  Venue5  0   0  0   0  0  0  0  0
5  02/01/2023  Venue1  0  11  0  20  0  0  0  0
6  02/01/2023  Venue2  0   3  0   0  0  0  0  0
7  02/01/2023  Venue3  1   0  0  28  0  0  0  0
8  02/01/2023  Venue4  0   6  0   0  0  0  0  0
9  02/01/2023  Venue5  0   2  0  24  0  0  0  0

huangapple
  • 本文由 发表于 2023年5月17日 22:15:58
  • 转载请务必保留本文链接:https://go.coder-hub.com/76273090.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定