2014-10-16 03:36:56 · go

Parsing RDF triples with Go: some items falsely passing a regex

Question

When parsing the Freebase RDF data dump, I'm trying to keep only certain entities, based on their titles and text. I'm using regexps to match the titles and text, but even when they do not match and my checks should return false, the content still gets through.

I decide what to turn into XML as follows: properties["/type/object/name"] must be non-empty and contain @en, and properties["/common/document/text"] must be non-empty.

What counts as empty? Printing all the names (properties["/type/object/name"]) and texts (properties["/common/document/text"]) showed that some of them are just "[]". I don't want those. I want the entities whose name is not "[]" and contains @en. The text won't have @en, so if the text is not "[]" and its corresponding name has @en, the entity should be converted to XML.

When I run my code, the regexp checks seem to be ignored, and those "empty entities" are still converted to XML.

Here is some output I grabbed from the terminal:

<card>
<title>"[]"</title>
<image>"https://usercontent.googleapis.com/freebase/v1/image"</image>
%!(EXTRA string=/american_football/football_player/footballdb_id)<text>"[]"</text>
<facts>
    <fact property="/type/object/type">/type/property</fact>
    <fact property="/type/property/schema">/american_football/football_player</fact>
    <fact property="/type/property/unique">true</fact>
    <fact property="http://www/w3/org/2000/01/rdf-schema#label">"footballdb ID"@en</fact>
    <fact property="/type/property/expected_type">/type/enumeration</fact>
    <fact property="http://www/w3/org/1999/02/22-rdf-syntax-ns#type">http://www/w3/org/2002/07/owl#FunctionalProperty</fact>
    <fact property="http://www/w3/org/2000/01/rdf-schema#domain">/american_football/football_player</fact>
    <fact property="http://www/w3/org/2000/01/rdf-schema#range">/type/enumeration</fact>
 </facts>
 </card>

Here is my code. What am I doing wrong? Shouldn't it match the regexps and then skip the entities it wrote above?

func validTitle(content []string) bool {
	for _, v := range content {
		emptyTitle, _ := regexp.MatchString("\"[]\"", v)
		validTitle, _ := regexp.MatchString("^[A-Za-z0-9][A-Za-z0-9_-]*$", v)
		englishTitle, _ := regexp.MatchString("@en", v)
		if (!validTitle || !englishTitle) && !emptyTitle {
			return false
		}
	}
	return true
}

func validText(content []string) bool {
	for _, v := range content {
		emptyTitle, _ := regexp.MatchString("\"[]\"", v)
		validText, _ := regexp.MatchString("^[A-Za-z0-9][A-Za-z0-9_-]*$", v)
		if !validText && !emptyTitle {
			return false
		}
	}
	return true
}

func processTopic(id string, properties map[string][]string, file io.Writer) {
	if validTitle(properties["/type/object/name"]) && validText(properties["/common/document/text"]) {
		fmt.Fprintf(file, "<card>\n")
		fmt.Fprintf(file, "<title>\"%s\"</title>\n", properties["/type/object/name"])
		fmt.Fprintf(file, "<image>\"%s\"</image>\n", "https://usercontent.googleapis.com/freebase/v1/image", id)
		fmt.Fprintf(file, "<text>\"%s\"</text>\n", properties["/common/document/text"])
		fmt.Fprintf(file, "<facts>\n")
		for k, v := range properties {
			for _, value := range v {
				fmt.Fprintf(file, "<fact property=\"%s\">%s</fact>\n", k, value)
			}
		}
		fmt.Fprintf(file, "</facts>\n")
		fmt.Fprintf(file, "</card>\n")
	}
}

Answer 1

Score: 1

Your regexp is invalid. If you check the error instead of discarding it, it tells you exactly why:

error parsing regexp: missing closing ]: `[]"`
regexp.MatchString("\"[]\"", v)
// should be
regexp.MatchString(`"\[\]"`, v)
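To see the difference between the two patterns, here is a minimal sketch that keeps the error instead of assigning it to `_`. The helper name `matchEmpty` is made up for this illustration and is not part of the original code.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchEmpty is a hypothetical helper: it runs pattern against v and
// returns the compile error instead of discarding it.
func matchEmpty(pattern, v string) (bool, error) {
	return regexp.MatchString(pattern, v)
}

func main() {
	// Unescaped brackets: the pattern does not compile, so the result
	// is false and err is non-nil.
	matched, err := matchEmpty(`"[]"`, `"[]"`)
	fmt.Println(matched, err)

	// Escaped brackets: the pattern compiles and matches the literal "[]".
	matched, err = matchEmpty(`"\[\]"`, `"[]"`)
	fmt.Println(matched, err)
}
```

Because `regexp.MatchString` reports false whenever the pattern fails to compile, `emptyTitle` in the question's code can never be true.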

Also, since you use each pattern many times, you should compile it once outside the function and reuse it, for example:

var (
	emptyRe   = regexp.MustCompile(`"\[\]"`)
	titleRe   = regexp.MustCompile("^[A-Za-z0-9][A-Za-z0-9_-]*$")
	englishRe = regexp.MustCompile("@en")
)

func validTitle(content []string) bool {
	for _, v := range content {
		if emptyRe.MatchString(v) || !(englishRe.MatchString(v) || titleRe.MatchString(v)) {
			return false
		}
	}
	return true
}
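One further point, offered as a likely explanation rather than a confirmed diagnosis: the `"[]"` in the posted output looks like what `fmt` prints for an empty or nil `[]string` with `%s`, and the facts listed for that card contain no `/type/object/name` property. If the slice is empty, the `for` loop body never runs and a validator that ends in `return true` accepts it, whatever the regexps are. The sketch below assumes that reading; `hasEnglishName` and `render` are hypothetical helpers, not part of the original code.

```go
package main

import (
	"fmt"
	"strings"
)

// hasEnglishName is a hypothetical helper. It returns true only if at
// least one value ends with the @en language tag, so a missing or
// empty property is rejected instead of silently accepted.
func hasEnglishName(values []string) bool {
	for _, v := range values {
		if strings.HasSuffix(v, "@en") {
			return true
		}
	}
	return false
}

// render shows how fmt formats a string slice with the %s verb.
func render(values []string) string {
	return fmt.Sprintf("%s", values)
}

func main() {
	properties := map[string][]string{
		"/type/object/type": {"/type/property"},
	}
	// The key is absent, so the map lookup yields a nil slice.
	name := properties["/type/object/name"]

	fmt.Println(render(name))         // prints []
	fmt.Println(hasEnglishName(name)) // prints false
	fmt.Println(hasEnglishName([]string{`"footballdb ID"@en`})) // prints true
}
```

Starting from `false` and returning `true` on the first acceptable value inverts the default, so an entity with no name at all is skipped.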

This line expects one value for its single %s verb, but you're passing two:

fmt.Fprintf(file, "<image>\"%s\"</image>\n",
	"https://usercontent.googleapis.com/freebase/v1/image", // this matches the %s
	id, // this doesn't
)

It should be:

fmt.Fprintf(file, "<image>\"%s/%s\"</image>\n", "https://usercontent.googleapis.com/freebase/v1/image", id)
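Note that the id in the posted output (`/american_football/football_player/footballdb_id`) already starts with a slash, so a `%s/%s` format would produce `image//american_football/...`. A minimal sketch that joins the two parts with exactly one slash, assuming ids may or may not carry the leading slash; `imageTag` and `imageBase` are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// imageBase is the URL prefix used in the question.
const imageBase = "https://usercontent.googleapis.com/freebase/v1/image"

// imageTag is a hypothetical helper. It builds the <image> element with
// one %s verb per argument and strips a leading slash from id so the
// joined URL contains a single separator.
func imageTag(id string) string {
	return fmt.Sprintf("<image>\"%s/%s\"</image>\n", imageBase, strings.TrimPrefix(id, "/"))
}

func main() {
	fmt.Print(imageTag("/american_football/football_player/footballdb_id"))
}
```

With two verbs and two arguments, the `%!(EXTRA string=...)` marker no longer appears in the output.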
