KeyError: "marketplace" while downloading "amazon_us_reviews" dataset - huggingface datasets
Question
I am trying to download the `amazon_us_reviews` dataset using the following code:

```python
from datasets import load_dataset
dataset = load_dataset("amazon_us_reviews", "Toys_v1_00")
```
I am getting the following error:
```
KeyError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1692 )
-> 1693 self._beam_writers[split_name] = beam_writer
1694
11 frames
/usr/local/lib/python3.10/dist-packages/datasets/features/features.py in encode_example(self, example)
1850
-> 1851 """
1852 return copy.deepcopy(self)
/usr/local/lib/python3.10/dist-packages/datasets/features/features.py in encode_nested_example(schema, obj, level)
1228 """Decode a nested example.
-> 1229 This is used since some features (in particular Audio and Image) have some logic during decoding.
1230
/usr/local/lib/python3.10/dist-packages/datasets/features/features.py in <dictcomp>(.0)
1228 """Decode a nested example.
-> 1229 This is used since some features (in particular Audio and Image) have some logic during decoding.
1230
/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py in zip_dict(*dicts)
321 def __get__(self, obj, objtype=None):
--> 322 return self.fget.__get__(None, objtype)()
323
/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py in <genexpr>(.0)
321 def __get__(self, obj, objtype=None):
--> 322 return self.fget.__get__(None, objtype)()
323
KeyError: 'marketplace'

The above exception was the direct cause of the following exception:

DatasetGenerationError Traceback (most recent call last)
<ipython-input-25-341913a5da6a> in <cell line: 1>()
----> 1 dataset = load_dataset("amazon_us_reviews", "Toys_v1_00")
/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
952 with FileLock(lock_path) if is_local else contextlib.nullcontext():
953 self.info.write_to_directory(self._output_dir, fs=self._fs)
--> 954
955 def _save_infos(self):
956 is_local = not is_remote_filesystem(self._fs)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
1047 in_memory=in_memory,
1048 )
-> 1049 if run_post_process:
1050 for resource_file_name in self._post_processing_resources(split).values():
1051 if os.sep in resource_file_name:
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, check_duplicate_keys, file_format, num_proc, max_shard_size)
1553
1554 class BeamBasedBuilder(DatasetBuilder):
-> 1555 """Beam based Builder."""
1556
1557 # BeamBasedBuilder does not have dummy data for tests yet
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
DatasetGenerationError: An error occurred while generating the dataset
```
I also tried `load_dataset_builder`, and it shows the following features:
```python
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("amazon_us_reviews", "Toys_v1_00")
print(ds_builder.info.features)
```
Output:
```
{'marketplace': Value(dtype='string', id=None),
 'customer_id': Value(dtype='string', id=None),
 'review_id': Value(dtype='string', id=None),
 'product_id': Value(dtype='string', id=None),
 'product_parent': Value(dtype='string', id=None),
 'product_title': Value(dtype='string', id=None),
 'product_category': Value(dtype='string', id=None),
 'star_rating': Value(dtype='int32', id=None),
 'helpful_votes': Value(dtype='int32', id=None),
 'total_votes': Value(dtype='int32', id=None),
 'vine': ClassLabel(names=['N', 'Y'], id=None),
 'verified_purchase': ClassLabel(names=['N', 'Y'], id=None),
 'review_headline': Value(dtype='string', id=None),
 'review_body': Value(dtype='string', id=None),
 'review_date': Value(dtype='string', id=None)}
```
The `datasets` version used is 2.14.4. Is this the correct way to download the dataset? Kindly advise.
Answer 1
Score: 1
Your syntax is correct; I think the issue is on Amazon's side - even the publicly facing root page, which should contain a README, is now throwing Access Denied: https://s3.amazonaws.com/amazon-reviews-pds/readme.html
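You can confirm this yourself with a quick check (a minimal sketch, assuming the `requests` library is installed; the URL is the same public README page linked above):

```python
import requests

# If the amazon-reviews-pds bucket has been made private, requesting the
# README page returns 403 (Access Denied) instead of the page contents.
resp = requests.get("https://s3.amazonaws.com/amazon-reviews-pds/readme.html")
print(resp.status_code)  # 403 means the bucket is no longer publicly readable
```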
Perhaps use a different dataset for now; there is also `amazon_reviews_multi`.
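For example, it can be loaded with the same `load_dataset` call (a minimal sketch, assuming `amazon_reviews_multi` and its `en` configuration are still hosted on the Hub):

```python
from datasets import load_dataset

# Alternative multilingual Amazon reviews dataset; the "en" config
# restricts the download to English-language reviews (assumed config name).
dataset = load_dataset("amazon_reviews_multi", "en")
print(dataset["train"][0])
```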
Update:
> Amazon has decided to stop distributing this dataset.