KeyError: "marketplace" while downloading "amazon_us_reviews" dataset – huggingface datasets

Question

I am trying to download the amazon_us_reviews dataset using the following code:

    from datasets import load_dataset
    dataset = load_dataset("amazon_us_reviews", "Toys_v1_00")

I am getting the following error:

    KeyError Traceback (most recent call last)
    /usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
    1692 )
    -> 1693 self._beam_writers[split_name] = beam_writer
    1694
    11 frames
    /usr/local/lib/python3.10/dist-packages/datasets/features/features.py in encode_example(self, example)
    1850 ```
    -> 1851 """
    1852 return copy.deepcopy(self)
    /usr/local/lib/python3.10/dist-packages/datasets/features/features.py in encode_nested_example(schema, obj, level)
    1228 """Decode a nested example.
    -> 1229 This is used since some features (in particular Audio and Image) have some logic during decoding.
    1230
    /usr/local/lib/python3.10/dist-packages/datasets/features/features.py in <dictcomp>(.0)
    1228 """Decode a nested example.
    -> 1229 This is used since some features (in particular Audio and Image) have some logic during decoding.
    1230
    /usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py in zip_dict(*dicts)
    321 def __get__(self, obj, objtype=None):
    --> 322 return self.fget.__get__(None, objtype)()
    323
    /usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py in <genexpr>(.0)
    321 def __get__(self, obj, objtype=None):
    --> 322 return self.fget.__get__(None, objtype)()
    323
    KeyError: 'marketplace'
    The above exception was the direct cause of the following exception:
    DatasetGenerationError Traceback (most recent call last)
    <ipython-input-25-341913a5da6a> in <cell line: 1>()
    ----> 1 dataset = load_dataset("amazon_us_reviews", "Toys_v1_00")
    /usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
    /usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    952 with FileLock(lock_path) if is_local else contextlib.nullcontext():
    953 self.info.write_to_directory(self._output_dir, fs=self._fs)
    --> 954
    955 def _save_infos(self):
    956 is_local = not is_remote_filesystem(self._fs)
    /usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
    /usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    1047 in_memory=in_memory,
    1048 )
    -> 1049 if run_post_process:
    1050 for resource_file_name in self._post_processing_resources(split).values():
    1051 if os.sep in resource_file_name:
    /usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, check_duplicate_keys, file_format, num_proc, max_shard_size)
    1553
    1554 class BeamBasedBuilder(DatasetBuilder):
    -> 1555 """Beam based Builder."""
    1556
    1557 # BeamBasedBuilder does not have dummy data for tests yet
    /usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
    DatasetGenerationError: An error occurred while generating the dataset
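
The traceback shows a chained exception: the DatasetGenerationError was raised from the KeyError. As a minimal sketch, the underlying error can be pulled out programmatically by following the `__cause__` chain. The helper name `root_cause` is hypothetical and not part of the datasets library; it relies only on standard Python exception chaining.

```python
def root_cause(exc):
    """Follow the __cause__ chain (set by `raise ... from ...`) back to the first exception."""
    seen = set()
    while exc.__cause__ is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against a (pathological) cyclic chain
        exc = exc.__cause__
    return exc


# Possible usage around the failing call (left commented out because it
# needs the datasets package and network access):
#
# from datasets import load_dataset
# try:
#     dataset = load_dataset("amazon_us_reviews", "Toys_v1_00")
# except Exception as err:
#     cause = root_cause(err)
#     print(type(cause).__name__, cause)
```

For the traceback above, this would be expected to surface the KeyError rather than the generic DatasetGenerationError wrapper.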

I also tried load_dataset_builder, which reports the following features:

    from datasets import load_dataset_builder
    ds_builder = load_dataset_builder("amazon_us_reviews", "Toys_v1_00")
    print(ds_builder.info.features)

Output:

    {'marketplace': Value(dtype='string', id=None),
     'customer_id': Value(dtype='string', id=None),
     'review_id': Value(dtype='string', id=None),
     'product_id': Value(dtype='string', id=None),
     'product_parent': Value(dtype='string', id=None),
     'product_title': Value(dtype='string', id=None),
     'product_category': Value(dtype='string', id=None),
     'star_rating': Value(dtype='int32', id=None),
     'helpful_votes': Value(dtype='int32', id=None),
     'total_votes': Value(dtype='int32', id=None),
     'vine': ClassLabel(names=['N', 'Y'], id=None),
     'verified_purchase': ClassLabel(names=['N', 'Y'], id=None),
     'review_headline': Value(dtype='string', id=None),
     'review_body': Value(dtype='string', id=None),
     'review_date': Value(dtype='string', id=None)}

The datasets version used is 2.14.4.

Is this the correct way to download the dataset? Kindly advise.

Answer 1

Score: 1

Your syntax is correct. I think the issue is on Amazon's side: even the public root page, which should contain a README, now returns an access-denied error: https://s3.amazonaws.com/amazon-reviews-pds/readme.html
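
One plausible reading of the KeyError, sketched here under assumptions: if the download returns an access-denied page instead of the review file, and the loading script parses whatever it receives as tab-separated values, the parsed rows carry none of the expected column names, so the first schema key looked up ('marketplace') is missing. The error body, the column subset, and the helper name `first_missing_column` are illustrative and not taken from the actual loading script.

```python
import csv
import io

# Hypothetical body of an "access denied" response (illustrative only).
ERROR_BODY = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Error><Code>AccessDenied</Code><Message>Access Denied</Message></Error>\n"
)

# First few column names from the features printed in the question.
EXPECTED_COLUMNS = ["marketplace", "customer_id", "review_id"]


def first_missing_column(text, expected=EXPECTED_COLUMNS):
    """Parse text as TSV and return the first expected column absent from the first row.

    Returns None when every expected column is present.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    row = next(reader, None)
    if row is None:
        # No data rows at all: every expected column is missing.
        return expected[0]
    for name in expected:
        if name not in row:
            return name
    return None


# A well-formed file has the columns; an error page parsed as TSV does not.
good = "marketplace\tcustomer_id\treview_id\nUS\t1\tR1\n"
print(first_missing_column(good))
print(first_missing_column(ERROR_BODY))
```

If this reading is right, the schema reported by load_dataset_builder is fine and the failure comes entirely from the content of the downloaded file.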

Perhaps use a different dataset for now; there is also amazon_reviews_multi.

Update:

> Amazon has decided to stop distributing this dataset.

Refer here
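
To check whether the source is still unreachable, a small probe along these lines could report the HTTP status of the README URL mentioned above. This is a sketch using only the standard library; the helper name `http_status` and its `opener` parameter are assumptions introduced here so the function can be exercised without network access.

```python
import urllib.error
import urllib.request

README_URL = "https://s3.amazonaws.com/amazon-reviews-pds/readme.html"


def http_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, including error codes such as 403."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code


# Example call (needs network access, so it is left commented out):
# print(http_status(README_URL))
```

A 403 status would be consistent with the access-denied behaviour described in the answer.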

huangapple
  • Posted on 2023-08-10 20:24:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76875718.html