How does one fix an interleaved data set from only sampling one data set?
Question
Cross-posted on:
- Hugging Face Discord: https://discord.com/channels/879548962464493619/1138632039197835354
- Hugging Face Discuss: https://discuss.huggingface.co/t/how-does-one-fix-an-interleaved-data-set-from-only-sampling-one-data-set/50041
Why does the following code only sample from one dataset?
from datasets import load_dataset
from datasets import interleave_datasets

# Load each dataset in streaming mode
c4 = load_dataset("c4", "en", split="train", streaming=True)
wikitext = load_dataset("wikitext", "wikitext-103-v1", split="train", streaming=True)

# Interleave the two datasets with equal sampling probability
datasets = [c4, wikitext]
for dataset in datasets:
    print(dataset.description)
interleaved = interleave_datasets(datasets, probabilities=[0.5, 0.5])
print(interleaved)
Every example printed while iterating over interleaved has the same keys:
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
...
counts=100
colab: https://colab.research.google.com/drive/1VIR66U1d7qk3Q1vU_URoo5tHEEheORpN?usp=sharing
Answer 1
Score: 1
The interleave_datasets function works correctly here; it is your conclusion that is incorrect. When two datasets are interleaved, their features are combined.
These are the features of c4 and wikitext:
print(c4.column_names)        # ['text', 'timestamp', 'url']
print(wikitext.column_names)  # ['text']
When you combine the datasets, every example in the new dataset has the features ['text', 'timestamp', 'url'], even if it comes from the wikitext dataset. Because wikitext has no timestamp or url features, those values are None for its examples. Printing example.keys() therefore shows the same keys for every example, whichever dataset it came from.
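Since the keys are the same for every row, the values are what tell the two sources apart. The following is a minimal sketch of that check, assuming rows shaped like the interleaved c4/wikitext output. The helper name guess_origin and the sample rows are hypothetical and made up for illustration; they are not taken from the real datasets.

```python
from collections import Counter

# Columns that exist only in c4, according to the column_names output above.
C4_ONLY_COLUMNS = ("timestamp", "url")


def guess_origin(example):
    """Guess the source dataset from whether the c4-only columns are populated."""
    if all(example.get(column) is None for column in C4_ONLY_COLUMNS):
        return "wikitext"
    return "c4"


# Hypothetical rows with the merged schema (all values are invented).
rows = [
    {"text": "some web page text", "timestamp": "2019-04-25T12:57:54Z", "url": "https://example.com/a"},
    {"text": " = Some heading = \n", "timestamp": None, "url": None},
    {"text": "more web page text", "timestamp": "2019-04-21T10:07:13Z", "url": "https://example.com/b"},
]

key_based_counts = Counter(guess_origin(row) for row in rows)
print(key_based_counts)
```

Applied to the first 100 rows of the interleaved stream instead of the invented rows, a tally like this should show both sources if the sampling works as described.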
Dummy example:
from datasets import Dataset, interleave_datasets

# Two in-memory datasets with no columns in common
d1 = Dataset.from_dict({
    'feature_1': ['A', 'B', 'C']
})
d2 = Dataset.from_dict({
    'feature_2': [1, 2, 3]
})

# Interleave with equal probability; the seed makes the sampling reproducible
dataset = interleave_datasets([d1, d2], probabilities=[0.5, 0.5], seed=42)
print('Features:', dataset.column_names)
for e in dataset:
    print(e)
Output:
Features: ['feature_1', 'feature_2']
{'feature_1': None, 'feature_2': 1}
{'feature_1': 'A', 'feature_2': None}
{'feature_1': None, 'feature_2': 2}
{'feature_1': None, 'feature_2': 3}