Trim Freebase data dump to only English entities
Question
I have a compressed Freebase data dump that contains all the entities. How can I use grep or something else to trim the dump so that it only contains English entities?
Here is what I am trying to get the RDF dump to look like: http://play.golang.org/p/-WwSysL3y3
<card>
<title></title>
<image></image>
<text></text>
<facts>
<fact></fact>
<fact></fact>
<fact></fact>
</facts>
</card>
Here card is each entity, with content in all of its child elements. Title is the /type/object/name. Image is the image for the topic's mid, built with "https://usercontent.googleapis.com/freebase/v1/image%s\n", id. Text is the /common/document/text for the entity. Facts and its fact children hold facts such as age, birth date, and height: the facts that show up in the knowledge panels in search.
Here is my attempt to parse the RDF into XML like this in Go (Golang). I'd appreciate it if someone could help me get the RDF into this form.
Here is the algorithm or logic of what I am trying to do:
For every entity written in English:
Parse the `type/object/name` property and write its value to the `<title></title>` element of the XML file.
Parse the mid, append it to `https://usercontent.googleapis.com/freebase/v1/image`, and write the result to the `<image></image>` element of the XML file.
Parse the `common/document/text` property and write its value to the `<text></text>` element.
Lastly, write each fact about the entity to its own `<fact></fact>` element in the XML file; these are all children of the `<facts></facts>` element.
Answer 1
Score: 0
I agree with Joshua Taylor that the question is difficult to decipher, because entity is usually a synonym for Freebase object, which may have labels in multiple languages (or no labels/text at all).
If we recast the question as something along the lines of "How do I filter all non-English text from the compressed Freebase dump?," it becomes something that we can actually answer.
In RDF, all strings are labeled with their language, so if we see something like
ns:award.award_winner rdfs:label "Lauréat"@fr.
we can tell that Lauréat is the French name for the Freebase type called Award Winner in English.
To filter out non-English labels, use zgrep to drop the lines that carry a language tag (a closing quote followed by "@" and a language code) other than "@en".
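Since the question is working in Go, the same line filter can also be sketched as a small Go program instead of a zgrep pipeline. This is only a rough sketch under some assumptions: each triple sits on one line, the language tag follows the closing quote at the end of the line (as in the `"Lauréat"@fr.` example), and only the exact tag `en` counts as English (so `en-GB` would be dropped). The names `langTag`, `keepLine`, and `filterEnglish` are made up for this example.

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"regexp"
)

// langTag matches a language-tagged literal at the end of a triple,
// e.g. `"Lauréat"@fr.` (assumed layout; adjust if your dump differs).
var langTag = regexp.MustCompile(`"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)\s*\.\s*$`)

// keepLine reports whether a line has no language tag at all
// or is tagged exactly "en".
func keepLine(line string) bool {
	m := langTag.FindStringSubmatch(line)
	return m == nil || m[1] == "en"
}

// filterEnglish copies every line accepted by keepLine from r to w.
func filterEnglish(r io.Reader, w io.Writer) error {
	sc := bufio.NewScanner(r)
	// Allow long lines; the default scanner limit is 64 KiB per line.
	sc.Buffer(make([]byte, 0, 1<<20), 1<<26)
	bw := bufio.NewWriter(w)
	for sc.Scan() {
		if keepLine(sc.Text()) {
			fmt.Fprintln(bw, sc.Text())
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	return bw.Flush()
}

// run decompresses the gzip file at path and writes the filtered lines to stdout.
func run(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer zr.Close()
	return filterEnglish(zr, os.Stdout)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: filter DUMP.gz > filtered.rdf")
		return
	}
	if err := run(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Lines without any language tag (types, properties, numbers) pass through untouched, which matches what the grep approach would give you.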
This will give you all the types, properties, numbers, and English labels/descriptions, but won't exclude those objects which don't have at least one English label (another possible interpretation of your question). To do that level of filtering, you'll probably need something more powerful than grep.