August 9, 2023, 08:37:43

How does one fix an interleaved data set from only sampling one data set?

Question

Colab link: https://colab.research.google.com/drive/1VIR66U1d7qk3Q1vU_URoo5tHEEheORpN?usp=sharing

Cross-posted:

Hugging Face Discord: https://discord.com/channels/879548962464493619/1138632039197835354
Hugging Face Discuss: https://discuss.huggingface.co/t/how-does-one-fix-an-interleaved-data-set-from-only-sampling-one-data-set/50041

The following code

from datasets import interleave_datasets, load_dataset

# Load each dataset in streaming mode
c4 = load_dataset("c4", "en", split="train", streaming=True)
wikitext = load_dataset("wikitext", "wikitext-103-v1", split="train", streaming=True)

# Interleave the two datasets with equal sampling probability
datasets = [c4, wikitext]
for dataset in datasets:
    print(dataset.description)
interleaved = interleave_datasets(datasets, probabilities=[0.5, 0.5])
print(interleaved)

appears to sample from only one dataset. Why? Every example printed has the same keys:

example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
example.keys()=dict_keys(['text', 'timestamp', 'url'])
counts=100

Answer 1

Score: 1

The interleave_datasets function works correctly here; it is your conclusion that is incorrect. When two datasets are interleaved, their features are combined.

These are the features of c4 and wikitext:

print(c4.column_names)
# -> ['text', 'timestamp', 'url']
print(wikitext.column_names)
# -> ['text']

When you combine the datasets, every example in the new dataset has the features ['text', 'timestamp', 'url'], even if it comes from the wikitext dataset. Since the wikitext dataset has no timestamp or url features, those fields are None for its examples.
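If that explanation holds, printing the keys cannot reveal where an example came from, but the values can. Below is a minimal sketch that counts examples by likely origin, assuming that rows from wikitext have timestamp set to None while c4 rows carry a real value. The helper name count_by_origin and the sample rows are hypothetical and only illustrate the merged schema; they are not real data.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def count_by_origin(examples: Iterable[Mapping[str, Any]]) -> Counter:
    """Count examples by likely origin, using a c4-only column as the signal."""
    counts: Counter = Counter()
    for example in examples:
        # Assumption: c4 rows have a real timestamp, while rows that came
        # from wikitext have the merged-in 'timestamp' column set to None.
        origin = "wikitext" if example.get("timestamp") is None else "c4"
        counts[origin] += 1
    return counts


# Hand-written rows that mimic the merged schema (not real data).
sample_rows = [
    {"text": "a c4 document", "timestamp": "2019-04-25T12:57:54Z", "url": "https://example.com"},
    {"text": "a wikitext line", "timestamp": None, "url": None},
    {"text": "another wikitext line", "timestamp": None, "url": None},
]

print(count_by_origin(sample_rows))
```

With the streaming dataset from the question, calling count_by_origin(interleaved.take(100)) should report both origins if the interleaving is working as described.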

Dummy example:

from datasets import Dataset, interleave_datasets

# Two toy datasets with no columns in common
d1 = Dataset.from_dict({
    'feature_1': ['A', 'B', 'C']
})
d2 = Dataset.from_dict({
    'feature_2': [1, 2, 3]
})

# Interleave with equal probability; the seed makes the sampling reproducible
dataset = interleave_datasets([d1, d2], probabilities=[0.5, 0.5], seed=42)
print('Features:', dataset.column_names)
for e in dataset:
    print(e)

Output:

Features: ['feature_1', 'feature_2']
{'feature_1': None, 'feature_2': 1}
{'feature_1': 'A', 'feature_2': None}
{'feature_1': None, 'feature_2': 2}
{'feature_1': None, 'feature_2': 3}
