Can someone explain how to create a PTB Dataset And/Or Train my own model using StanfordNLP?

Question

I'm learning about sentiment analysis and can't find anything online that explains how to create a PTB dataset. I'm using StanfordNLP with Java. I've downloaded the train, dev and test data that they used, but I can't work out how the files are structured:

test.txt:

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

I assume the numbers correspond to sentiment values, but I'm still not sure how it works.

TL;DR: I'm trying to develop my own model for news analysis. The StanfordNLP model was trained on movie reviews, which leads to poor sentiment analysis on my data, so I want to train my own. However, I can't find anything online that explains what each element is or how to do this.

The closest I've found is this page: https://nlp.stanford.edu/sentiment/code.html

It says that the dataset and the training code are available:

Models can be retrained using the following command using the PTB format dataset:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

I already have the data that I need to parse.

Answer 1

Score: 0

Okay, so I've done some more digging and have finally started to understand (somewhat) how to create a dataset tree. I'll try to break it down for anyone who stumbles upon this post with the same troubles I've been having.

Step 1:

  • Find your data. (In my case it's news articles about the UK housing market.)

UK renters: are you living with someone you’ve fallen out with?
UK property asking prices stagnating, lifting hopes of softer landing for housing market

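Step 2:

  • Annotate your data. The full sentence comes first with its overall label, followed by each labeled sub-phrase on its own line. Leave a blank line between sentences.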
2 UK renters: are you living with someone you’ve fallen out with?
1 fallen out with
1 fallen out
2 UK renters
2 living with someone
3 fallen
2 :
2 ?
2 living with
2 someone

3 UK property asking prices stagnating, lifting hopes of softer landing for housing market
2 UK property
3 asking prices stagnating
2 asking prices
4 lifting hopes
2 hopes
4 lifting hopes of softer landing
3 softer landing for housing market
2 housing market
2 lifting
2 landing
2 , 

Annotation Meanings

Very Positive = 4
Positive = 3
Neutral = 2
Negative = 1
Very Negative = 0
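
To make the label scale concrete, here is a minimal sketch in plain Java (no CoreNLP dependency). The class SentimentLabels and its methods are hypothetical names made up for illustration; the only thing taken from this post is the 0 to 4 mapping listed above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: maps the 0-4 sentiment labels listed above to readable names. */
final class SentimentLabels {
    private static final List<String> NAMES = Arrays.asList(
            "Very Negative", "Negative", "Neutral", "Positive", "Very Positive");

    /** Returns the readable name for a label between 0 and 4. */
    static String name(int label) {
        if (label < 0 || label >= NAMES.size()) {
            throw new IllegalArgumentException("Label must be between 0 and 4: " + label);
        }
        return NAMES.get(label);
    }

    /** Splits an annotated line such as "1 fallen out with" into label and phrase. */
    static String describe(String annotatedLine) {
        String trimmed = annotatedLine.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<label> <phrase>': " + annotatedLine);
        }
        int label = Integer.parseInt(trimmed.substring(0, space));
        String phrase = trimmed.substring(space + 1).trim();
        return name(label) + ": " + phrase;
    }

    public static void main(String[] args) {
        System.out.println(describe("1 fallen out with")); // prints "Negative: fallen out with"
        System.out.println(describe("4 lifting hopes"));   // prints "Very Positive: lifting hopes"
    }
}
```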

Structure

2 UK renters: are you living with someone you’ve fallen out with?
   //Overall sentiment

1 fallen out with
   // Negative

1 fallen out
   // Negative

2 UK renters
   // Neutral

...etc.

  • Save the annotated data to a .txt file (sample.txt)
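
If you would rather generate the annotated file than type it by hand, something along these lines might work. It is only a sketch: AnnotatedSampleWriter and Labeled are hypothetical names, and the UTF-8 encoding and the sample.txt file name are assumptions on my part. It writes the same layout as in Step 2: the full sentence first, then its sub-phrases, with a blank line between sentences.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that writes annotations in the layout shown in Step 2. */
final class AnnotatedSampleWriter {

    /** A label (0-4) plus the sentence or sub-phrase it applies to. */
    static final class Labeled {
        final int label;
        final String text;

        Labeled(int label, String text) {
            if (label < 0 || label > 4) {
                throw new IllegalArgumentException("Label must be between 0 and 4: " + label);
            }
            this.label = label;
            this.text = text;
        }
    }

    /** The first entry of each inner list is the full sentence, the rest are its sub-phrases. */
    static List<String> toLines(List<List<Labeled>> sentences) {
        List<String> lines = new ArrayList<>();
        for (List<Labeled> sentence : sentences) {
            if (!lines.isEmpty()) {
                lines.add(""); // blank line between sentences
            }
            for (Labeled item : sentence) {
                lines.add(item.label + " " + item.text);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<List<Labeled>> sentences = Arrays.asList(
                Arrays.asList(
                        new Labeled(2, "UK renters: are you living with someone you've fallen out with?"),
                        new Labeled(1, "fallen out with"),
                        new Labeled(2, "UK renters")),
                Arrays.asList(
                        new Labeled(3, "UK property asking prices stagnating, lifting hopes of softer landing for housing market"),
                        new Labeled(4, "lifting hopes")));
        // The file name and the UTF-8 encoding are assumptions; adjust them to your setup.
        Files.write(Paths.get("sample.txt"), toLines(sentences), StandardCharsets.UTF_8);
    }
}
```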

Step 3:

  • Locate your stanford-corenlp-4.5.2.jar

    • for example, in ~/.m2/repository/edu/stanford/nlp/stanford-corenlp/4.5.2

Step 4:

  • Open Bash in the folder that contains the CoreNLP jars and run
    • java -cp "*" -mx5g edu.stanford.nlp.sentiment.BuildBinarizedDataset -input /c/Users/rusku/Desktop/StanfordNPL/rusSample/sample.txt
    • Replace the -input path above with the location of your own sample.txt

Step 5:

  • Result
(2 (2 (2 (2 UK) (2 renters)) (2 :)) (2 (2 (2 (2 are) (2 you)) (2 (2 living) (2 (2 with) (2 (2 someone) (2 (2 you) (2 (2 ’ve) (1 (1 (3 fallen) (2 out)) (2 with)))))))) (2 ?)))
(3 (3 (2 (3 UK) (3 property)) (2 (3 asking) (3 prices))) (3 (3 (3 stagnating) (3 (2 ,) (4 (2 lifting) (2 hopes)))) (3 (3 of) (3 (3 (3 softer) (2 landing)) (3 (3 for) (2 (3 housing) (3 market)))))))
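
To check what the tool produced, it helps to flatten a tree back into "label phrase" lines and compare them with the annotations. Below is a minimal stand-alone sketch that does this with a small recursive parser. SentimentTreeReader is a hypothetical name and it deliberately does not use the CoreNLP tree classes. It assumes every node looks like "(label child child ...)" or "(label word)", as in the trees above, and it joins tokens with single spaces. Going by the output above, words and phrases that were not annotated appear to take the label of the whole sentence (for example "(3 UK)" in the second tree).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone reader for the bracketed trees shown in Step 5. */
final class SentimentTreeReader {
    private final String text;
    private final List<String> out = new ArrayList<>();
    private int pos = 0;

    private SentimentTreeReader(String text) {
        this.text = text;
    }

    /** Returns one "label phrase" entry per node, with each parent listed before its children. */
    static List<String> labeledPhrases(String tree) {
        SentimentTreeReader reader = new SentimentTreeReader(tree.trim());
        reader.parseNode();
        return reader.out;
    }

    /** Parses one "(label ...)" node and returns the words it covers. */
    private String parseNode() {
        expect('(');
        int start = pos;
        while (pos < text.length() && Character.isDigit(text.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a numeric label at position " + pos);
        }
        String label = text.substring(start, pos);
        int slot = out.size();
        out.add(null); // reserve the parent's place so it is listed before its children
        StringBuilder words = new StringBuilder();
        skipSpaces();
        if (peek() == '(') {
            // Inner node: collect the words of every child subtree.
            while (peek() == '(') {
                if (words.length() > 0) {
                    words.append(' ');
                }
                words.append(parseNode());
                skipSpaces();
            }
        } else {
            // Leaf node: everything up to the closing bracket is the token.
            int wordStart = pos;
            while (pos < text.length() && text.charAt(pos) != ')') {
                pos++;
            }
            words.append(text.substring(wordStart, pos).trim());
        }
        expect(')');
        out.set(slot, label + " " + words);
        return words.toString();
    }

    private char peek() {
        return pos < text.length() ? text.charAt(pos) : '\0';
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        // A subtree taken from the first result above.
        for (String line : labeledPhrases("(1 (1 (3 fallen) (2 out)) (2 with))")) {
            System.out.println(line);
        }
    }
}
```

For that subtree the program prints "1 fallen out with", "1 fallen out", "3 fallen", "2 out" and "2 with", one per line, which matches the annotations from Step 2.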

Resource: Train Stanford CoreNLP about the sentiment of domain-specific phrases

This is as far as I've currently gotten.

Hope this helps.

Posted on March 15, 2023 at 19:58:45
Original link: https://go.coder-hub.com/75744401.html