How does one fix an interleaved dataset that only samples from one dataset?

Question

The following code

from datasets import load_dataset
from datasets import interleave_datasets

# Load each dataset in streaming mode
c4 = load_dataset("c4", "en", split="train", streaming=True) 
wikitext = load_dataset("wikitext", "wikitext-103-v1", split="train", streaming=True)

# Interleave the two streaming datasets
datasets = [c4, wikitext]
for dataset in datasets:
  print(dataset.description)
interleaved = interleave_datasets(datasets, probabilities=[0.5, 0.5])
print(interleaved)

only samples from one dataset. Why?

example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
counts=100

colab: https://colab.research.google.com/drive/1VIR66U1d7qk3Q1vU_URoo5tHEEheORpN?usp=sharing


Answer 1

Score: 1

The interleave_datasets function works correctly here; it's your conclusion that is incorrect. What happens is that when two datasets are interleaved, their features are combined.

These are the features of c4 and wikitext:

print(c4.column_names)

>>> ['text', 'timestamp', 'url']

print(wikitext.column_names)

>>> ['text']

When you combine the datasets, all examples in the new dataset will have the features ['text', 'timestamp', 'url'], even if they come from the wikitext dataset. Since the wikitext dataset does not have the timestamp and url features, those fields will be None.

Dummy example:

from datasets import Dataset, interleave_datasets

# Two tiny in-memory datasets with disjoint column sets
d1 = Dataset.from_dict({
  'feature_1': ['A', 'B', 'C']
})
d2 = Dataset.from_dict({
  'feature_2': [1, 2, 3]
})

# The interleaved dataset carries the union of both feature sets
dataset = interleave_datasets([d1, d2], probabilities=[0.5, 0.5], seed=42)
print('Features:', dataset.column_names)

for e in dataset:
  print(e)

Output:

Features: ['feature_1', 'feature_2']
{'feature_1': None, 'feature_2': 1}
{'feature_1': 'A', 'feature_2': None}
{'feature_1': None, 'feature_2': 2}
{'feature_1': None, 'feature_2': 3}
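
To verify this in the original streaming setup, one can count how many interleaved examples actually come from each source: only c4 provides a timestamp, so a None value marks a wikitext row. Below is a minimal sketch, reusing the loading calls from the question and assuming the None-filling behaviour described above:

from itertools import islice

from datasets import load_dataset, interleave_datasets

# Stream both corpora, exactly as in the question
c4 = load_dataset("c4", "en", split="train", streaming=True)
wikitext = load_dataset("wikitext", "wikitext-103-v1", split="train", streaming=True)

interleaved = interleave_datasets([c4, wikitext], probabilities=[0.5, 0.5], seed=42)

# 'timestamp' exists only in c4, so None identifies a wikitext example
counts = {"c4": 0, "wikitext": 0}
for example in islice(interleaved, 100):
    source = "wikitext" if example["timestamp"] is None else "c4"
    counts[source] += 1

print(counts)  # both counts should be non-zero, roughly 50/50 in expectation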
