Can someone explain how to create a PTB dataset and/or train my own model using StanfordNLP?
Question
I'm learning about sentiment analysis and can't find anything online that explains how to create a PTB dataset. I'm using StanfordNLP with Java. I've downloaded the train, dev and test data that they used, but I can't get my head around how the files are laid out:
test.txt:
(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))
I figure that the numbers correspond to sentiment values, but I'm still not sure how it works.
TL;DR: I'm trying to develop my own model for news analysis. The StanfordNLP model was trained on movie reviews, which leads to poor sentiment analysis on my data, so I thought I'd try to train my own. However, I can't find anything online that explains what each element is or how to go about it.
The best I've found is this page: https://nlp.stanford.edu/sentiment/code.html
It offers the dataset and the training code, and says:
Models can be retrained using the following command using the PTB format dataset:
java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz
I have the data that I need to parse ready.
Answer 1
Score: 0
Okay, so I've done some more digging and have finally started to understand (somewhat) how to create a dataset of trees. I'll try to break it down for anyone who stumbles upon this post with the same troubles I've been having.
Step 1.
- Find your data. (In my case it's news articles about the UK housing market.)
UK renters: are you living with someone you’ve fallen out with?
UK property asking prices stagnating, lifting hopes of softer landing for housing market
Step 2.
- Annotate your data
2 UK renters: are you living with someone you’ve fallen out with?
1 fallen out with
1 fallen out
2 UK renters
2 living with someone
3 fallen
2 :
2 ?
2 living with
2 someone
3 UK property asking prices stagnating, lifting hopes of softer landing for housing market
2 UK property
3 asking prices stagnating
2 asking prices
4 lifting hopes
2 hopes
4 lifting hopes of softer landing
3 softer landing for housing market
2 housing market
2 lifting
2 landing
2 ,
Annotation Meanings
Very Positive = 4
Positive = 3
Neutral = 2
Negative = 1
Very Negative = 0
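Structure (the // lines are explanations only and do not go in the file)
2 UK renters: are you living with someone you’ve fallen out with?
// Overall sentiment of the whole sentence
1 fallen out with
// Negative
1 fallen out
// Negative
2 UK renters
// Neutral
...etc.
- Save the annotated data to a .txt file (sample.txt), leaving a blank line between one sentence's block and the next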
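As a rough sketch of the file layout from Step 2, the Java program below writes part of the two example annotations to sample.txt. It assumes the input layout described in the BuildBinarizedDataset documentation: the labelled sentence first, its labelled sub-phrases on the following lines, and a blank line between sentences. The class name AnnotatedSampleWriter and its helper methods are made up for illustration and are not part of CoreNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of CoreNLP: writes annotations in the
// "label text" layout used in Step 2.
public class AnnotatedSampleWriter {

    // Rejects anything outside the 0 (very negative) .. 4 (very positive) scale.
    static int checkLabel(int label) {
        if (label < 0 || label > 4) {
            throw new IllegalArgumentException("Label must be 0-4, got " + label);
        }
        return label;
    }

    // Builds one block: the labelled sentence, then each labelled sub-phrase.
    static List<String> formatBlock(int sentenceLabel, String sentence,
                                    Map<String, Integer> phrases) {
        List<String> lines = new ArrayList<>();
        lines.add(checkLabel(sentenceLabel) + " " + sentence);
        for (Map.Entry<String, Integer> e : phrases.entrySet()) {
            lines.add(checkLabel(e.getValue()) + " " + e.getKey());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // LinkedHashMap keeps the phrases in the order they were annotated.
        Map<String, Integer> renters = new LinkedHashMap<>();
        renters.put("fallen out with", 1);
        renters.put("fallen out", 1);
        renters.put("UK renters", 2);

        Map<String, Integer> property = new LinkedHashMap<>();
        property.put("asking prices stagnating", 3);
        property.put("lifting hopes", 4);
        property.put("housing market", 2);

        List<String> out = new ArrayList<>();
        // A plain ASCII apostrophe is used here to keep the file encoding-safe.
        out.addAll(formatBlock(2,
                "UK renters: are you living with someone you've fallen out with?", renters));
        out.add(""); // blank line separates one sentence block from the next
        out.addAll(formatBlock(3,
                "UK property asking prices stagnating, lifting hopes of softer landing for housing market",
                property));

        Files.write(Paths.get("sample.txt"), out, StandardCharsets.UTF_8);
    }
}
```

Any phrase you leave unlabelled is, as far as I understand the documentation, given the label of its sentence, so it pays to label the phrases whose sentiment differs from the overall one.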
Step 3.
- Locate your stanford-corenlp-4.5.2.jar
- Example: ~/.m2/repository/edu/stanford/nlp/stanford-corenlp/4.5.2
Step 4.
- Open Bash in that directory (so that -cp "*" picks up the CoreNLP jars) and run
java -cp "*" -mx5g edu.stanford.nlp.sentiment.BuildBinarizedDataset -input /c/Users/rusku/Desktop/StanfordNPL/rusSample/sample.txt
- Replace the path after -input with the location of your own sample.txt
Step 5.
- Result (printed to the console)
(2 (2 (2 (2 UK) (2 renters)) (2 :)) (2 (2 (2 (2 are) (2 you)) (2 (2 living) (2 (2 with) (2 (2 someone) (2 (2 you) (2 (2 ’ve) (1 (1 (3 fallen) (2 out)) (2 with)))))))) (2 ?)))
(3 (3 (2 (3 UK) (3 property)) (2 (3 asking) (3 prices))) (3 (3 (3 stagnating) (3 (2 ,) (4 (2 lifting) (2 hopes)))) (3 (3 of) (3 (3 (3 softer) (2 landing)) (3 (3 for) (2 (3 housing) (3 market)))))))
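To make the tree format easier to read, here is a minimal Java sketch that walks one of these bracketed trees and lists every node with its label and the words it covers. It is a hand-rolled reader written only for illustration (the class SentimentTreeReader and its methods are hypothetical, and it is not how CoreNLP itself loads trees). It assumes each node looks like either (label word) or (label child child ...).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a tiny reader for "(label ...)" sentiment trees.
public class SentimentTreeReader {
    static final String[] NAMES =
            {"very negative", "negative", "neutral", "positive", "very positive"};

    private final String text;
    private int pos = 0;
    private final List<String> rows = new ArrayList<>();

    private SentimentTreeReader(String text) {
        this.text = text;
    }

    // Returns one "label<TAB>phrase" row per node, with each parent before its children.
    static List<String> labelledPhrases(String tree) {
        SentimentTreeReader reader = new SentimentTreeReader(tree.trim());
        reader.parseNode();
        return reader.rows;
    }

    // Parses one node and returns the words it spans.
    private String parseNode() {
        expect('(');
        int start = pos;
        while (Character.isDigit(text.charAt(pos))) pos++;
        int label = Integer.parseInt(text.substring(start, pos));
        int rowIndex = rows.size();
        rows.add(null); // reserve the slot so the parent is listed before its children
        skipSpaces();
        StringBuilder phrase = new StringBuilder();
        if (text.charAt(pos) == '(') {
            // Inner node: the phrase is the children's words joined by spaces.
            while (text.charAt(pos) == '(') {
                if (phrase.length() > 0) phrase.append(' ');
                phrase.append(parseNode());
                skipSpaces();
            }
        } else {
            // Leaf node: everything up to the closing bracket is the word.
            start = pos;
            while (text.charAt(pos) != ')') pos++;
            phrase.append(text, start, pos);
        }
        expect(')');
        rows.set(rowIndex, label + "\t" + phrase);
        return phrase.toString();
    }

    private void expect(char c) {
        if (text.charAt(pos) != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        // A sub-tree taken from the first result above.
        for (String row : labelledPhrases("(1 (1 (3 fallen) (2 out)) (2 with))")) {
            int label = Integer.parseInt(row.substring(0, row.indexOf('\t')));
            System.out.println(row + "\t" + NAMES[label]);
        }
    }
}
```

Read this way, every bracket is a phrase with its own sentiment score: the outermost number is the score of the whole sentence, and each nested bracket scores a smaller piece of it, down to single words.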
Resource: Train Stanford CoreNLP about the sentiment of domain-specific phrases
This is as far as I've currently gotten.
Hope this helps.