February 27, 2023 09:33:32

How to polish return and save to individual JSON/csv file?

Question


The code below is from https://stackoverflow.com/questions/47666699/using-word2vec-to-classify-words-in-categories and I need some help with reading the input and saving the return. Any help would be greatly appreciated.

import numpy as np

# Category -> words
data = {
  'Names': ['john','jay','dan','nathan','bob'],
  'Colors': ['yellow', 'red','green', 'oragne', 'purple'],
  'Places': ['tokyo','bejing','washington','mumbai'],
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}
# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories}
# Processing the query
def process(query):
  query_embed = embeddings_index[query]
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    dist = query_embed.dot(embed)
    dist /= len(data[category])
    scores[category] = scores.get(category, 0) + dist
  return scores
# Testing
print(process('jonny'))
print(process('green'))
print(process('park'))

And the return looks like:

Loaded 400000 word vectors.
{'Names': 7.965438079833984, 'Places': -0.3282392770051956, 'Colors': 1.803783965110779}
{'Names': 11.360316085815429, 'Places': 3.536876901984215, 'Colors': 21.82199630737305}
{'Names': 10.234728145599364, 'Places': 8.739515662193298, 'Colors': 10.761297225952148}

Below are the changes I want to make to this script, but I keep failing. Please help.

Question 1: The order of the categories in data is Names, Colors, and Places. Why does the return come back in Names, Places, Colors order instead? This is not important, but I was wondering why.

Question 2: Instead of using print(process('jonny')), how can I input a list of words from a text file?

Question 3: Let's suppose the name of the input text file is TEST.txt. How can I save the return in a TEST.JSON or TEST.csv file? Basically, the input and output files should have the same name.

Thank you so much!

Answer 1

Score: 0


> Question 1: The order of the categories in data is Names, Colors, and Places. Why does the return come back in Names, Places, Colors order instead? This is not important, but I was wondering why.

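It's probably because of how the contents of 'glove.6B.100d.txt' are ordered. data_embeddings is built by iterating over embeddings_index, so its words follow the order of the GloVe file rather than the order of data, and process adds each category to scores the first time it meets a word from that category. Since dictionaries keep insertion order (Python 3.7+), the returned order reflects which category's word shows up first in the file.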
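
The ordering effect can be reproduced in isolation. The following is a minimal sketch, assuming Python 3.7+ (where dictionaries preserve insertion order); `file_order` is a made-up stand-in for the order in which the words happen to appear in the GloVe file, and the scores are dummy values. It also shows one way to put the result back into the order of `data`.

```python
# Category -> words (shortened copy of the question's data)
data = {
    'Names': ['john', 'bob'],
    'Colors': ['red', 'green'],
    'Places': ['tokyo'],
}
categories = {word: key for key, words in data.items() for word in words}

# Hypothetical order in which the words appear in the embedding file
file_order = ['john', 'tokyo', 'red', 'green', 'bob']

scores = {}
for word in file_order:
    category = categories[word]
    # Dummy increment; only the insertion order matters here
    scores[category] = scores.get(category, 0) + 1.0

# Categories appear in the order in which they were first encountered
first_seen = list(scores)

# Rebuild the dict so that it follows the order of `data` instead
ordered = {key: scores[key] for key in data if key in scores}

print(first_seen)
print(list(ordered))
```

The same one-line comprehension could be applied to the return value of process if the order of data matters for the output.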

> Question 2: Instead of using print(process('jonny')), how can I input a list of words from a text file? [Let's suppose the name of the input text file is TEST.txt.]

Assuming 'TEST.txt' has one input per line, like

> jonny
> green
> park
> [input#4]
> [input#5]

then you could read the lines into a list of strings, loop through it, and apply process to each one:

with open('TEST.txt') as f: 
    inputList = f.read().splitlines()
# for inp in inputList: print(process(inp)) ## OR
outputList = [process(inp) for inp in inputList] 
for op in outputList: print(op)
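
Note that process looks the query up with embeddings_index[query], so a blank line or a word missing from the embeddings raises a KeyError. A slightly more defensive variant might look like the sketch below. The helper names `read_queries` and `process_all` are hypothetical, and lowercasing the input is an assumption based on the glove.6B vocabulary being lowercase.

```python
def read_queries(path):
    """Return the non-empty lines of a text file, stripped and lowercased."""
    with open(path, encoding='utf-8') as f:
        return [line.strip().lower() for line in f if line.strip()]


def process_all(queries, process, vocabulary):
    """Apply `process` to known words; collect unknown words separately."""
    results, skipped = {}, []
    for query in queries:
        if query in vocabulary:
            results[query] = process(query)
        else:
            # Word has no embedding, so process() would raise KeyError
            skipped.append(query)
    return results, skipped
```

With the question's objects this would be called as `results, skipped = process_all(read_queries('TEST.txt'), process, embeddings_index)`, and `skipped` lists the inputs that could not be scored.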

> Question 3: [...] How can I save the return in TEST.JSON or TEST.csv file? Basically input and output as same name.

To save as CSV, you could use the pandas .to_csv method:

import pandas as pd
# pd.DataFrame(outputList, index=inputList).to_csv('TEST.csv') ## same as:
# pd.DataFrame([process(i) for i in inputList], index=inputList).to_csv('TEST.csv')
 
pd.DataFrame(
    [{'input': inp, **process(inp)} for inp in inputList]
).set_index('input').to_csv('TEST.csv')
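
If pandas is not available, roughly the same file can be produced with the standard-library csv module. This is only a sketch: `save_scores_csv` is a hypothetical helper, and it assumes the results are held in a dict of the form {input: {category: score}}, as built for the JSON output.

```python
import csv


def save_scores_csv(results, path):
    """Write {input: {category: score}} to a CSV file, one row per input."""
    # Collect every category that occurs, keeping first-seen order
    fieldnames = ['input']
    for scores in results.values():
        for category in scores:
            if category not in fieldnames:
                fieldnames.append(category)

    with open(path, 'w', newline='', encoding='utf-8') as f:
        # restval fills in categories that are missing for a given input
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        writer.writeheader()
        for query, scores in results.items():
            writer.writerow({'input': query, **scores})
```

For example, `save_scores_csv({inp: process(inp) for inp in inputList}, 'TEST.csv')` writes a header row of input plus the category names, followed by one row per query.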

and to save as JSON, you can use json.dump (the two variants are marked op1 and op2 in the comments):

import json
with open('TEST.json', 'w') as f: 
    # json.dump([{'input':inp, 'output': process(inp)} for inp in inputList], f) ## op1
    json.dump({inp: process(inp) for inp in inputList}, f) #, indent=4) ## op2
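
One caveat: depending on the NumPy version, the scores returned by process may be NumPy scalars such as numpy.float32 rather than plain Python floats, and json.dump raises a TypeError for those. The sketch below converts every score with float() before writing. `save_scores_json` is a hypothetical helper name, and the input is assumed to be a dict of the form {input: {category: score}}.

```python
import json


def save_scores_json(results, path):
    """Write {input: {category: score}} to a JSON file."""
    # float() turns NumPy scalars into plain Python floats and leaves
    # ordinary floats unchanged
    plain = {
        query: {category: float(score) for category, score in scores.items()}
        for query, scores in results.items()
    }
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(plain, f, indent=4)
```

It would be called as `save_scores_json({inp: process(inp) for inp in inputList}, 'TEST.json')`.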

Added EDIT:

> Let's suppose I have a list of text files for this. Then how would I be able to process all the text files at once and save the return in the same file name? For example, if I use text1.txt, text2.txt, and text3.txt.....return will be text1.json, text2.json, and text3.json.

inpFiles = ['text1.txt', 'text2.txt', 'text3.txt']
# ifLen = len(inpFiles)
for inpf in inpFiles: # for ifi, inpf in enumerate(inpFiles, 1):
    # print('', end=f'\r[{ifi} of {ifLen}] processing "{inpf}"...')
    with open(inpf) as f: inputList = f.read().splitlines()
    with open(f'{inpf[:-4]}.json', 'w') as f:   
        json.dump({inp: process(inp) for inp in inputList}, f, indent=4)

[Using f'{inpf[:-4]}.json' assumes all file names in inpFiles end with '.txt']
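
To drop the assumption about the '.txt' ending, the output name could be derived with pathlib instead of slicing. The following is a sketch only; `process_files` is a hypothetical helper, and it takes the scoring function as a parameter so that it stays self-contained.

```python
import json
from pathlib import Path


def process_files(paths, process):
    """Run `process` on every line of each file and save <name>.json next to it."""
    written = []
    for path in map(Path, paths):
        lines = path.read_text(encoding='utf-8').splitlines()
        queries = [line.strip() for line in lines if line.strip()]
        # with_suffix swaps whatever extension the file has for .json
        out_path = path.with_suffix('.json')
        with open(out_path, 'w', encoding='utf-8') as f:
            json.dump({q: process(q) for q in queries}, f, indent=4)
        written.append(out_path)
    return written
```

For instance, `process_files(['text1.txt', 'text2.txt', 'text3.txt'], process)` would write text1.json, text2.json, and text3.json, and passing `Path('.').glob('*.txt')` as the first argument would pick up every text file in the current folder.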

Answer 2

Score: 0

Thanks a lot, @Driftr95
The code below accepts multiple text files as input and saves the return for each one in an individual JSON file.
```python
import json

# process() is the function defined in the question
inpFiles = ['text1.txt', 'text2.txt', 'text3.txt']
# ifLen = len(inpFiles)
for inpf in inpFiles: # for ifi, inpf in enumerate(inpFiles, 1):
    # print('', end=f'\r[{ifi} of {ifLen}] processing "{inpf}"...')
    with open(inpf) as f: inputList = f.read().splitlines()
    with open(f'{inpf[:-4]}.json', 'w') as f:
        json.dump({inp: process(inp) for inp in inputList}, f, indent=4)
```
