Power BI调查数据架构

huangapple go评论56阅读模式
英文:

Power BI Survey data schema

问题

我正在尝试确定将调查数据转换为适用于Power BI的最佳数据模型。

我相当确定星型模式(Star Schema)是我要寻找的,但是我的调查数据并不是所有问题都具有相同的回答(即并非每个问题都需要从1到5进行评分) - 每个20多个问题都有其自己预定义的答案列表。

大多数问题是这样的:“你最喜欢的冰淇淋是什么”,答案可以是香草、开心果、草莓等等。

有几个问题是以这种方式格式化的:
请对以下口味的冰淇淋进行评分,从1到5:
开心果(1-5)
香草(1-5)
草莓(1-5)
等等。
因此,它们更像是一种问题组。

还有几个自由文本答案字段,他们希望将其转化为词云,例如:问题:“你对未来有什么担忧” 答案:“在后启示录社会中找不到开心果冰淇淋”。

大约有7000名受访者。

我的数据结构为每个受访者一行,其中包含一堆人口统计数据列(年龄、种族、年薪等),以及约200多列的答案数据,格式如下:

受访者ID 年龄范围 宗教 Q1 香草 Q1 开心果 Q1 草莓 Q2 自由文本 Q3 评分 香草 Q3 评分 草莓 Q3 评分 开心果
1 25-30 绝地武士 1 0 0 blah blah 5 4 1
2 35-40 0 1 0 6 3 2

所以对于我上面给出的示例问题,Q1将是“你最喜欢的冰淇淋是什么?”选择答案会在适当的列中给出1或0。

Q2是一个自由文本答案字段

Q3将有一个总体问题,如:请对以下口味的冰淇淋进行评分,从1到5,每个“口味”都会获得与该答案相对应的值(而不是来自Q1的1/0)。

那么我的问题是什么?

我可以使用Power Query将源数据转换为我需要的任何格式,但是我不知道应该是什么格式,因为我可能有20多个问题的200多个不同答案。

我倾向于使用一个事实表,其中包含问题ID和答案ID,然后引用答案和问题维度表,但是在答案的问题上进行筛选是否会有问题?

还是我可以将其制作成一种雪花模式,其中答案由问题表分组,这样我只会获得适用问题的答案?

还是我将问题和答案表展平,然后使用组合键将其引用回事实表?

欢迎任何想法。

谢谢

Yonabout

英文:

I'm trying to determine the best data model to transform survey data into for Power BI.

I'm pretty sure a star schema is what I'm after, but my survey data doesn't have the same responses for all questions (i.e. it's not rate this from 1-5 for every question) - each of the 20+ questions has it's own pre-defined list of answers.

Most of the questions are along the lines of: "what's your favourite ice cream" and the answer can be one of vanilla, pistachio, strawberry etc.

A couple of the questions are in this kind of format:
Rate each of the following flavours of ice cream from one to 5:
Pistachio (1-5)
Vanilla (1-5)
Strawberry (1-5)
etc.
So they're more like a kind of question group.

There's also a couple of free text answer fields they want turning into word clouds, like: q:"What are your fears for the future" a: "not being able to find pistachio ice cream in a post apocalyptic society"

There are roughly 7000 respondents.

My data is structured with one row per respondent with a bunch of demographic data columns in it (age, ethnicity, annual salary etc.), and about 200+ columns of answer data like this:

ResponseID Age Range Religion Q1 Vanilla Q1 Pistachio Q1 Strawberry Q2 Free Text Q3 Rate Vanilla Q3 Rate Strawberry Q3 Rate Pistachio
1 25-30 Jedi 1 0 0 blah blah 5 4 1
2 35-40 None 0 1 0 6 3 2

So for the example questions I gave above, Q1 would be "What's your favourite ice cream?" selecting an answer gives a 1 or 0 in the appropriate column.

Q2 is a free text answer field

Q3 would have an overarching question like: Rate the following ice cream flavours from 1-5, and each 'flavour' gets a value corresponding to that answer (rather then the 1 / 0 from Q1)

So what's my problem?

I'm OK using power query to get the source data into whatever format I need it to be, but I don't know what that format should be, because I've got potentially 200+ different answers to 20+ questions.

I'm leaning towards one fact table with question ids and answer ids that then reference answer and question dimension tables, but then will I have problems filtering on answers for questions???

Or do I make it a kind of snowflake where the answers are grouped by a question table, so I only get answers for the appropriate questions?

Or do I flatten a "question & answer" table together with a combined key that I can reference back to the fact table?

Any thoughts appreciated.

Cheers

Yonabout

答案1

得分: 0

调查答案模型是相当常见的情景,您绝对可以将其建模为星型模式。问题维度和答案维度提供了相当多的灵活性。某些答案可以选择性地具有“分数”。

您可以通过从答案维度中选择答案来轻松筛选答案。

由于您的答案仅适用于一个问题,您绝对可以将问题和答案放在同一个维度中。

重要的是决定事实的粒度。如果每个调查中的答案是一行记录,那么事实将会更加灵活,而不是像现在这样每个人的答案集合为一行记录。

英文:

Survey answer models are pretty common scenarios and you absolutely can model it as a star schema. A question dimension and an answer dimension gives quite a lot of flexibility. Some answers can optionally have a ‘score’.

You’d be able to filter by answer easily by choosing an answer from the answer dimension.

Since your answers only apply to a question, you absolutely can have questions and answers in the same dimension.

The important thing is to decide the grain of the fact. The fact would be most flexible if it is one row per answer within the survey, not as you have it now with one row per person’s set of answers.

huangapple
  • 本文由 发表于 2023年7月11日 00:28:49
  • 转载请务必保留本文链接:https://go.coder-hub.com/76655668.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定