March 15, 2023, 19:58:45

Can someone explain how to create a PTB dataset and/or train my own model using StanfordNLP?

Question

I'm learning about sentiment analysis and can't find anything online that explains how to create a PTB dataset. I'm using StanfordNLP with Java. I've downloaded the test, dev and validation data that they used, but I can't get my head around how the files are structured:

test.txt:

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

I assume the numbers correspond to sentiment values, but I'm still not sure how it works.

TL;DR: I'm trying to develop my own model for news analysis. The StanfordNLP model was trained on movie reviews, which leads to poor sentiment analysis on my data, so I want to train my own model, but I can't find anything online that explains what each element is or how to do this.

The best I've found is this page: https://nlp.stanford.edu/sentiment/code.html

It makes the dataset and the training code available, and says that models can be retrained on a PTB-format dataset with the following command:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

I have the data that I need to parse ready.

Answer 1

Score: 0

Okay, so I've done some more digging and have finally started to understand (somewhat) how to create a dataset tree. I'll try to break it down for anyone who stumbles upon this post with the same trouble I've been having.

Step 1:

Find your data. (In my case, it's news articles about the UK housing market.)

UK renters: are you living with someone you’ve fallen out with?
UK property asking prices stagnating, lifting hopes of softer landing for housing market

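Step 2:

Annotate your data. Each block starts with the full sentence and its overall label; the lines below it label individual sub-phrases of that sentence.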

2 UK renters: are you living with someone you’ve fallen out with?
1 fallen out with
1 fallen out
2 UK renters
2 living with someone
3 fallen
2 :
2 ?
2 living with
2 someone

3 UK property asking prices stagnating, lifting hopes of softer landing for housing market
2 UK property
3 asking prices stagnating
2 asking prices
4 lifting hopes
2 hopes
4 lifting hopes of softer landing
3 softer landing for housing market
2 housing market
2 lifting
2 landing
2 ,

Annotation Meanings

Very Positive = 4
Positive = 3
Neutral = 2
Negative = 1
Very Negative = 0
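
As a small illustration of the scale above, here is a minimal Java sketch that reads one annotation line (a label followed by text) and reports the label's name. The class and method names are hypothetical and not part of CoreNLP; it only assumes the "label, space, text" layout used in this answer.

```java
/** Hypothetical helper (not part of CoreNLP): interprets one annotation line such as "1 fallen out with". */
final class SentimentLabels {
    // Index in this array is the numeric label from the scale above.
    private static final String[] NAMES = {
        "Very Negative", "Negative", "Neutral", "Positive", "Very Positive"
    };

    /** Returns the readable name for a 0-4 label. */
    static String name(int label) {
        if (label < 0 || label >= NAMES.length) {
            throw new IllegalArgumentException("label must be 0-4, got " + label);
        }
        return NAMES[label];
    }

    /** Turns "1 fallen out with" into "Negative: fallen out with". */
    static String describe(String annotationLine) {
        String trimmed = annotationLine.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected '<label> <text>', got: " + annotationLine);
        }
        int label = Integer.parseInt(trimmed.substring(0, space));
        return name(label) + ": " + trimmed.substring(space + 1).trim();
    }

    public static void main(String[] args) {
        System.out.println(describe("1 fallen out with"));
        System.out.println(describe("4 lifting hopes"));
    }
}
```

A check like this can be run over every line of the annotation file before Step 4 to catch labels outside the 0-4 range.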

Structure

2 UK renters: are you living with someone you’ve fallen out with?
   // Overall sentiment

1 fallen out with
   // Negative

1 fallen out
   // Negative

2 UK renters
   // Neutral

...etc.

Save the annotated data to a .txt file (for example, sample.txt).
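
If the annotations live in code rather than in a hand-edited file, the layout from Step 2 can be generated. The following is a rough Java sketch under that assumption: one block per sentence, the full sentence first, then its sub-phrases, with a blank line between blocks. AnnotationFileSketch and its methods are hypothetical names for illustration, and the sketch prints to standard output rather than writing sample.txt directly.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of CoreNLP): builds the annotation text described in Step 2. */
final class AnnotationFileSketch {
    /** Formats a single "<label> <text>" line, rejecting labels outside 0-4. */
    static String line(int label, String text) {
        if (label < 0 || label > 4) {
            throw new IllegalArgumentException("label must be 0-4, got " + label);
        }
        return label + " " + text.trim();
    }

    /** One sentence block: the sentence line first, then each sub-phrase line. */
    static String block(String sentenceLine, List<String> phraseLines) {
        StringBuilder sb = new StringBuilder(sentenceLine).append('\n');
        for (String phrase : phraseLines) {
            sb.append(phrase).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String first = block(
            line(2, "UK renters: are you living with someone you\u2019ve fallen out with?"),
            Arrays.asList(line(1, "fallen out with"), line(1, "fallen out"), line(2, "UK renters")));
        String second = block(
            line(3, "UK property asking prices stagnating, lifting hopes of softer landing for housing market"),
            Arrays.asList(line(4, "lifting hopes"), line(2, "housing market")));
        // Each block already ends with a newline, so joining on "\n" leaves one blank line between sentences.
        System.out.print(String.join("\n", first, second));
    }
}
```

Redirecting the output to a file gives something that can be passed to -input in Step 4.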

Step 3:

Locate your stanford-corenlp-4.5.2.jar
- Example: ~/.m2/repository/edu/stanford/nlp/stanford-corenlp/4.5.2

Step 4:

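Open Bash in the directory that contains the jar (the "*" classpath picks up every jar in the current directory) and run
- java -cp "*" -mx5g edu.stanford.nlp.sentiment.BuildBinarizedDataset -input /c/Users/rusku/Desktop/StanfordNPL/rusSample/sample.txt
- Replace the -input path above with the location of your own file.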

Step 5:

Result

(2 (2 (2 (2 UK) (2 renters)) (2 :)) (2 (2 (2 (2 are) (2 you)) (2 (2 living) (2 (2 with) (2 (2 someone) (2 (2 you) (2 (2 ’ve) (1 (1 (3 fallen) (2 out)) (2 with)))))))) (2 ?)))
(3 (3 (2 (3 UK) (3 property)) (2 (3 asking) (3 prices))) (3 (3 (3 stagnating) (3 (2 ,) (4 (2 lifting) (2 hopes)))) (3 (3 of) (3 (3 (3 softer) (2 landing)) (3 (3 for) (2 (3 housing) (3 market)))))))
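
To see how the numbers in these trees line up with phrases, here is a minimal Java sketch that reads one tree and lists every node's label next to the words it covers. It is a hand-written reader for illustration only and does not use CoreNLP's own tree classes; SentimentTreeReader is a hypothetical name, and the sketch assumes well-formed input in which every node is "(label word)" or "(label child child ...)".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader (not part of CoreNLP) for one sentiment tree in PTB bracket format. */
final class SentimentTreeReader {
    private final String s;
    private int pos = 0;
    private final List<String> spans = new ArrayList<>();

    private SentimentTreeReader(String s) { this.s = s; }

    /** Returns one "label<TAB>phrase" entry per node, children before their parent. */
    static List<String> labeledSpans(String tree) {
        SentimentTreeReader reader = new SentimentTreeReader(tree.trim());
        reader.node();
        return reader.spans;
    }

    // node := '(' label ( word | node+ ) ')'
    private String node() {
        expect('(');
        int start = pos;
        while (pos < s.length() && Character.isDigit(s.charAt(pos))) pos++;
        int label = Integer.parseInt(s.substring(start, pos));
        skipSpaces();
        StringBuilder phrase = new StringBuilder();
        if (s.charAt(pos) == '(') {
            // Internal node: the phrase is the children's words joined by spaces.
            while (s.charAt(pos) == '(') {
                if (phrase.length() > 0) phrase.append(' ');
                phrase.append(node());
                skipSpaces();
            }
        } else {
            // Leaf node: everything up to the closing parenthesis is the word.
            start = pos;
            while (s.charAt(pos) != ')') pos++;
            phrase.append(s, start, pos);
        }
        expect(')');
        spans.add(label + "\t" + phrase);
        return phrase.toString();
    }

    private void skipSpaces() {
        while (pos < s.length() && s.charAt(pos) == ' ') pos++;
    }

    private void expect(char c) {
        if (pos >= s.length() || s.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        String tree = "(3 (3 (2 (3 UK) (3 property)) (2 (3 asking) (3 prices))) (3 (3 (3 stagnating) (3 (2 ,) (4 (2 lifting) (2 hopes)))) (3 (3 of) (3 (3 (3 softer) (2 landing)) (3 (3 for) (2 (3 housing) (3 market)))))))";
        for (String span : labeledSpans(tree)) {
            System.out.println(span);
        }
    }
}
```

Running it on the second tree shows which spans kept the labels from Step 2 (for example "lifting hopes" with 4) and which ones were not annotated and ended up with the sentence's overall label of 3.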

Resource: Train Stanford CoreNLP about the sentiment of domain-specific phrases

This is as far as I've currently gotten.

Hope this helps.
