如何在 SpaCy 的 config.cfg 文件中注册自定义组件?

huangapple go评论62阅读模式
英文:

How to register custom components in a SpaCy config.cfg file?

问题

标题中提到:
我似乎按照文档中的描述进行了操作,我已经在网上寻找了有用的答案,但到目前为止还没有找到太多。非常感谢任何帮助!谢谢!

我正在运行以下命令:
python -m spacy debug config config.cfg --code 'matcher.py' --code 'sentence.py'

python -m spacy train 'config.cfg' --output 'config\' --code 'sentence.py' --code 'matcher.py'

两者都出现了相同的错误:

```ValueError: [E002] 找不到语言英语(en)的“sentence_splitter”的工厂。这通常发生在spaCy调用nlp.create_pipe时使用了未在当前语言类上注册的自定义组件名称时。如果您使用Transformer,请确保安装了'spacy-transformers'。如果您使用自定义组件,请确保您已经添加了装饰器@Language.component(对于函数组件)或@Language.factory(对于类组件)。

可用的工厂: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, entity_ruler, tagger, morphologizer, ner, beam_ner, senter, sentencizer, spancat, spancat_singlelabel, span_finder, future_entity_ruler, span_ruler, textcat, textcat_multilabel, matcher, en.lemmatizer```

以下是我的配置文件:

train = "output_data.spacy"
dev = "output_data.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","tagger","sentence_splitter", "parser","senter","attribute_ruler","matcher","lemmatizer","spacytextblob"]
disabled = ["senter", "tagger", "attribute_ruler","spacytextblob"]
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.sentence_splitter]
factory = "sentence_splitter"



[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

[components.senter]
factory = "senter"
overwrite = false
scorer = {"@scorers":"spacy.senter_scorer.v1"}

[components.senter.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.senter.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.senter.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 16
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [1000,500,500,500,50]
include_static_vectors = true

[components.senter.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 16
depth = 2
window_size = 1
maxout_pieces = 2

[components.spacytextblob]
factory = "spacytextblob"
blob_only = false
custom_blob = null

[components.tagger]
factory = "tagger"
label_smoothing = 0.0
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

[components.matcher]
factory = "matcher"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY","IS_SPACE"]
rows = [5000,1000,2500,2500,50,50]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures =

<details>
<summary>英文:</summary>

As the title states: 
I seem to have followed the documentation as described and I have looked all over the web for a useful answer but have so far have not found much. Any help is much appreciated! Thank you!

I am running the command: 

python -m spacy debug config config.cfg --code 'matcher.py' --code 'sentence.py'

and 

python -m spacy train 'config.cfg' --output 'config&#39; --code 'sentence.py' --code 'matcher.py'


Both get the same error: 

ValueError: [E002] Can't find factory for 'sentence_splitter' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a
custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. I
f you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, le
mmatizer, trainable_lemmatizer, entity_linker, entity_ruler, tagger, morphologizer, ner, beam_ner, senter, sentencizer, spancat, spancat_singlelabel, span_finder, future_entity_ruler, span_ruler, textcat, textcat_multilabel, matcher, en.lemmatizer


Here is my config file: 

[paths]
train = "output_data.spacy"
dev = "output_data.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","tagger","sentence_splitter", "parser", "senter","attribute_ruler","matcher","lemmatizer","spacytextblob"]
disabled = ["senter", "tagger", "attribute_ruler","spacytextblob"]
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.sentence_splitter]
factory = "sentence_splitter"

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

[components.senter]
factory = "senter"
overwrite = false
scorer = {"@scorers":"spacy.senter_scorer.v1"}

[components.senter.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.senter.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.senter.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 16
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [1000,500,500,500,50]
include_static_vectors = true

[components.senter.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 16
depth = 2
window_size = 1
maxout_pieces = 2

[components.spacytextblob]
factory = "spacytextblob"
blob_only = false
custom_blob = null

[components.tagger]
factory = "tagger"
label_smoothing = 0.0
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

[components.matcher]
factory = "matcher"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY","IS_SPACE"]
rows = [5000,1000,2500,2500,50,50]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 1000
frozen_components = []
before_to_disk = null
annotating_components = []
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = 0.16
dep_uas = 0.0
dep_las = 0.16
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.02
lemma_acc = 0.5
ents_f = 0.16
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
speed = 0.0

[pretraining]

[initialize]
vocab_data = null
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
before_init = null
after_init = null

[initialize.components]

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
require = false

[initialize.components.parser]

[initialize.components.parser.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/parser.json"
require = false

[initialize.components.tagger]

[initialize.components.tagger.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/tagger.json"
require = false

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[initialize.tokenizer]



**sentence.py file**

import spacy
from spacy.language import Language
import re

@Language.component("sentence_splitter") # stateless
def sentence_splitter(doc):
start = 0
i = 0
# print("Processing custom_sentence_splitter_improved")
#delimiter_pattern = re.compile(r"(\r?\n)+|(\n)+") # This is the magic regex
delimiter_pattern = re.compile(r"(\r?\n\s*)+|(\n\s*)+")
while i < len(doc):
if delimiter_pattern.fullmatch(doc[i].text):
# print(f"Found delimiter '{doc[i].text}' at position {i}")
for token in doc[start:i]:
token.sent_start = False
doc[i].sent_start = True
start = i + 1

        # Skip consecutive occurrences of &#39;\r&#39; and &#39;\n&#39;
        while i + 1 &lt; len(doc) and delimiter_pattern.fullmatch(doc[i + 1].text):
            doc[i + 1].sent_start = False
            i += 1
    else:
        doc[i].sent_start = False
    i += 1

for token in doc[start:]:
    token.sent_start = False

return doc

Used to add the custom component to the pipeline

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("sentence_splitter", name="sentence_splitter", after='ner')


**matcher.py file**

import spacy
from spacy.language import Language
import re
from spacy.matcher import Matcher
from spacy.tokens import Token

@Language.factory("matcher")# stateful
def create_template_matcher(nlp, name):
return TemplateMatcher(nlp.vocab)

class TemplateMatcher:
def init(self, vocab):
# Define multiple patterns
patterns1 = [blar blar blar ]
patterns2 = [blar blar blar ]
patterns3 = [blar blar blar ]
patterns4 = [blar blar blar ]

    Token.set_extension(&quot;templates&quot;, default=False, force=True)  # Register a new token extension to flag matched patterns
    self.matcher = Matcher(vocab)
    self.matcher.add(&quot;patterns1&quot;, patterns1)
    self.matcher.add(&quot;patterns2&quot;, patterns2)
    self.matcher.add(&quot;patterns3&quot;, patterns3)
    self.matcher.add(&quot;patterns4&quot;, patterns4)

def __call__(self, doc):
    matches = self.matcher(doc)
    for match_id, start, end in matches:
        for token in doc[start:end]:
            token._.templates = True
    return doc

Used to add the custom component to the pipeline

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("matcher", name="matcher", after ='parser')



</details>


# 答案1
**得分**: 1

根据[文档][1]:
&gt; `--code` 参数可用于提供一个在训练过程开始之前导入的 Python 文件。

很容易忽略,但它确实说“file”(单数),而不是“files”。我认为您不能多次使用不同的 Python 文件提供 `--code` 参数&gt;1次。

但是,如果您将您的两个自定义组件都添加到同一个模块中,例如 *custom_components.py*,然后运行 `python -m spacy init fill-config config.cfg config.cfg --code custom_components.py`,这应该会告诉您配置是否有任何问题。如果没有问题,那么您可以继续使用 `python -m spacy train config.cfg --output config/ --code custom_componets.py` 进行训练。

```python
&quot;&quot;&quot;custom_components.py 的内容

注意:
    - 我不得不导入 `SpacyTextBlob` 才能让配置知道要使用哪个工厂
    - 我不得不修改您的 `TemplateMatcher` 以使 `fill-config` 命令正常工作。
&quot;&quot;&quot;

import re

from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token
# 注意:必须导入 `SpacyTextBlob` 才能使配置文件正常工作
from spacytextblob.spacytextblob import SpacyTextBlob


@Language.component(&quot;sentence_splitter&quot;)  # 无状态
def sentence_splitter(doc):
    start = 0
    i = 0
    # print(&quot;Processing custom_sentence_splitter_improved&quot;)
    #delimiter_pattern = re.compile(r&quot;(\r?\n)+|(\n)+&quot;)  # 这是魔术正则表达式
    delimiter_pattern = re.compile(r&quot;(\r?\n\s*)+|(\n\s*)+&quot;)
    while i &lt; len(doc):
        if delimiter_pattern.fullmatch(doc[i].text):
            # print(f&quot;Found delimiter &#39;{doc[i].text}&#39; at position {i}&quot;)
            for token in doc[start:i]:
                token.sent_start = False
            doc[i].sent_start = True
            start = i + 1

            # 跳过连续的 '\r' 和 '\n' 的出现
            while i + 1 &lt; len(doc) and delimiter_pattern.fullmatch(doc[i + 1].text):
                doc[i + 1].sent_start = False
                i += 1
        else:
            doc[i].sent_start = False
        i += 1

    for token in doc[start:]:
        token.sent_start = False

    return doc


@Language.factory(&quot;matcher&quot;)  # 有状态
def create_template_matcher(nlp, name):
    return TemplateMatcher(nlp.vocab)


class TemplateMatcher:
    def __init__(self, vocab):
        # 定义多个模式
        # 注意*** 在这里进行了修改 ***
        blar = {&quot;ORTH&quot;: &quot;blar&quot;}
        patterns1 = [blar]
        patterns2 = [blar]
        patterns3 = [blar]
        patterns4 = [blar]

        Token.set_extension(&quot;templates&quot;, default=False, force=True)  # 注册一个新的令牌扩展来标记匹配的模式
        self.matcher = Matcher(vocab)
        self.matcher.add(&quot;patterns1&quot;, [patterns1])
        self.matcher.add(&quot;patterns2&quot;, [patterns2])
        self.matcher.add(&quot;patterns3&quot;, [patterns3])
        self.matcher.add(&quot;patterns4&quot;, [patterns4])

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            for token in doc[start:end]:
                token._.templates = True
        return doc

在运行 python -m spacy init fill-config 命令之后...

python -m spacy init fill-config config.cfg config.cfg --code custom_components.py

我们得到了绿色的勾号。

✔ Auto-filled config with all values
✔ Saved config                                    
config.cfg                                        
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy

现在您可以运行 python -m spacy train config.cfg --output config/ --code custom_componets.py

英文:

Per the docs:
> The --code argument can be used to provide a Python file that’s imported before the training process starts.

Easy to miss, but it does say "file" (singular) rather than "files". I don't think you can supply the --code argument >1 times with a different Python file for each.

However, if you add both of your custom components to the same module, e.g. custom_components.py, and run python -m spacy init fill-config config.cfg config.cfg --code custom_components.py this should tell you if anything is wrong with your config. And if nothing is wrong, then you can proceed with training using python -m spacy train config.cfg --output config/ --code custom_componets.py

&quot;&quot;&quot;Contents of custom_components.py

Notes:
    - I had to import `SpacyTextBlob` for the config to know what 
    factory to use
    - I had to modify your `TemplateMatcher` to get the `fill-config` 
    command to work.
&quot;&quot;&quot;

import re

from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token
# NOTE: have to import `SpacyTextBlob` for config file to work
from spacytextblob.spacytextblob import SpacyTextBlob


@Language.component(&quot;sentence_splitter&quot;)  # stateless
def sentence_splitter(doc):
    start = 0
    i = 0
    # print(&quot;Processing custom_sentence_splitter_improved&quot;)
    #delimiter_pattern = re.compile(r&quot;(\r?\n)+|(\n)+&quot;)  # This is the magic regex
    delimiter_pattern = re.compile(r&quot;(\r?\n\s*)+|(\n\s*)+&quot;)
    while i &lt; len(doc):
        if delimiter_pattern.fullmatch(doc[i].text):
            # print(f&quot;Found delimiter &#39;{doc[i].text}&#39; at position {i}&quot;)
            for token in doc[start:i]:
                token.sent_start = False
            doc[i].sent_start = True
            start = i + 1

            # Skip consecutive occurrences of &#39;\r&#39; and &#39;\n&#39;
            while i + 1 &lt; len(doc) and delimiter_pattern.fullmatch(doc[i + 1].text):
                doc[i + 1].sent_start = False
                i += 1
        else:
            doc[i].sent_start = False
        i += 1

    for token in doc[start:]:
        token.sent_start = False

    return doc


@Language.factory(&quot;matcher&quot;)  # stateful
def create_template_matcher(nlp, name):
    return TemplateMatcher(nlp.vocab)


class TemplateMatcher:
    def __init__(self, vocab):
        # Define multiple patterns
        # NOTE *** modifications made here ***
        blar = {&quot;ORTH&quot;: &quot;blar&quot;}
        patterns1 = [blar]
        patterns2 = [blar]
        patterns3 = [blar]
        patterns4 = [blar]

        Token.set_extension(&quot;templates&quot;, default=False, force=True)  # Register a new token extension to flag matched patterns
        self.matcher = Matcher(vocab)
        self.matcher.add(&quot;patterns1&quot;, [patterns1])
        self.matcher.add(&quot;patterns2&quot;, [patterns2])
        self.matcher.add(&quot;patterns3&quot;, [patterns3])
        self.matcher.add(&quot;patterns4&quot;, [patterns4])

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            for token in doc[start:end]:
                token._.templates = True
        return doc

After running the python -m spacy init fill-config command...

python -m spacy init fill-config config.cfg config.cfg --code custom_components.py

We get the green checkmark.

✔ Auto-filled config with all values
✔ Saved config                                    
config.cfg                                        
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy

You should be good to run python -m spacy train config.cfg --output config/ --code custom_componets.py now.

References

huangapple
  • 本文由 发表于 2023年7月18日 04:41:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/76707941.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定