英文:
Power BI Survey data schema
问题
我正在尝试确定将调查数据转换为适用于Power BI的最佳数据模型。
我相当确定星型模式(Star Schema)是我要寻找的,但是我的调查数据并不是所有问题都具有相同的回答(即并非每个问题都需要从1到5进行评分) - 每个20多个问题都有其自己预定义的答案列表。
大多数问题是这样的:“你最喜欢的冰淇淋是什么”,答案可以是香草、开心果、草莓等等。
有几个问题是以这种方式格式化的:
请对以下口味的冰淇淋进行评分,从1到5:
开心果(1-5)
香草(1-5)
草莓(1-5)
等等。
因此,它们更像是一种问题组。
还有几个自由文本答案字段,他们希望将其转化为词云,例如:问题:“你对未来有什么担忧” 答案:“在后启示录社会中找不到开心果冰淇淋”。
大约有7000名受访者。
我的数据结构为每个受访者一行,其中包含一堆人口统计数据列(年龄、种族、年薪等),以及约200多列的答案数据,格式如下:
受访者ID | 年龄范围 | 宗教 | Q1 香草 | Q1 开心果 | Q1 草莓 | Q2 自由文本 | Q3 评分 香草 | Q3 评分 草莓 | Q3 评分 开心果 |
---|---|---|---|---|---|---|---|---|---|
1 | 25-30 | 绝地武士 | 1 | 0 | 0 | blah blah | 5 | 4 | 1 |
2 | 35-40 | 无 | 0 | 1 | 0 | 6 | 3 | 2 |
所以对于我上面给出的示例问题,Q1将是“你最喜欢的冰淇淋是什么?”选择答案会在适当的列中给出1或0。
Q2是一个自由文本答案字段
Q3将有一个总体问题,如:请对以下口味的冰淇淋进行评分,从1到5,每个“口味”都会获得与该答案相对应的值(而不是来自Q1的1/0)。
那么我的问题是什么?
我可以使用Power Query将源数据转换为我需要的任何格式,但是我不知道应该是什么格式,因为我可能有20多个问题的200多个不同答案。
我倾向于使用一个事实表,其中包含问题ID和答案ID,然后引用答案和问题维度表,但是在答案的问题上进行筛选是否会有问题?
还是我可以将其制作成一种雪花模式,其中答案由问题表分组,这样我只会获得适用问题的答案?
还是我将问题和答案表展平,然后使用组合键将其引用回事实表?
欢迎任何想法。
谢谢
Yonabout
英文:
I'm trying to determine the best data model to transform survey data into for Power BI.
I'm pretty sure a star schema is what I'm after, but my survey data doesn't have the same responses for all questions (i.e. it's not rate this from 1-5 for every question) - each of the 20+ questions has it's own pre-defined list of answers.
Most of the questions are along the lines of: "what's your favourite ice cream" and the answer can be one of vanilla, pistachio, strawberry etc.
A couple of the questions are in this kind of format:
Rate each of the following flavours of ice cream from one to 5:
Pistachio (1-5)
Vanilla (1-5)
Strawberry (1-5)
etc.
So they're more like a kind of question group.
There's also a couple of free text answer fields they want turning into word clouds, like: q:"What are your fears for the future" a: "not being able to find pistachio ice cream in a post apocalyptic society"
There are roughly 7000 respondents.
My data is structured with one row per respondent with a bunch of demographic data columns in it (age, ethnicity, annual salary etc.), and about 200+ columns of answer data like this:
ResponseID | Age Range | Religion | Q1 Vanilla | Q1 Pistachio | Q1 Strawberry | Q2 Free Text | Q3 Rate Vanilla | Q3 Rate Strawberry | Q3 Rate Pistachio |
---|---|---|---|---|---|---|---|---|---|
1 | 25-30 | Jedi | 1 | 0 | 0 | blah blah | 5 | 4 | 1 |
2 | 35-40 | None | 0 | 1 | 0 | 6 | 3 | 2 |
So for the example questions I gave above, Q1 would be "What's your favourite ice cream?" selecting an answer gives a 1 or 0 in the appropriate column.
Q2 is a free text answer field
Q3 would have an overarching question like: Rate the following ice cream flavours from 1-5, and each 'flavour' gets a value corresponding to that answer (rather then the 1 / 0 from Q1)
So what's my problem?
I'm OK using power query to get the source data into whatever format I need it to be, but I don't know what that format should be, because I've got potentially 200+ different answers to 20+ questions.
I'm leaning towards one fact table with question ids and answer ids that then reference answer and question dimension tables, but then will I have problems filtering on answers for questions???
Or do I make it a kind of snowflake where the answers are grouped by a question table, so I only get answers for the appropriate questions?
Or do I flatten a "question & answer" table together with a combined key that I can reference back to the fact table?
Any thoughts appreciated.
Cheers
Yonabout
答案1
得分: 0
调查答案模型是相当常见的情景,您绝对可以将其建模为星型模式。问题维度和答案维度提供了相当多的灵活性。某些答案可以选择性地具有“分数”。
您可以通过从答案维度中选择答案来轻松筛选答案。
由于您的答案仅适用于一个问题,您绝对可以将问题和答案放在同一个维度中。
重要的是决定事实的粒度。如果每个调查中的答案是一行记录,那么事实将会更加灵活,而不是像现在这样每个人的答案集合为一行记录。
英文:
Survey answer models are pretty common scenarios and you absolutely can model it as a star schema. A question dimension and an answer dimension gives quite a lot of flexibility. Some answers can optionally have a ‘score’.
You’d be able to filter by answer easily by choosing an answer from the answer dimension.
Since your answers only apply to a question, you absolutely can have questions and answers in the same dimension.
The important thing is to decide the grain of the fact. The fact would be most flexible if it is one row per answer within the survey, not as you have it now with one row per person’s set of answers.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论