Parsing RDF triples with Go: some items falsely passing the regex

Question

When parsing the Freebase RDF data dump, I'm trying to keep only certain entities based on their titles and text. I'm using regexps to match the titles and text, but even when they don't match and my checks return false, the content still gets through.

I decide what to turn into XML like this: properties["/type/object/name"] must be non-empty and contain @en, and properties["/common/document/text"] must be non-empty.

What defines empty? By printing all the names (properties["/type/object/name"]) and texts (properties["/common/document/text"]), I noticed that some of them are just "[]". I don't want those. I want the ones that are not "[]" and that contain @en in the name (properties["/type/object/name"]). The text (properties["/common/document/text"]) won't have @en, so if it is not "[]" and its corresponding name has @en, the entity should be converted to XML.

When I run my code, which uses regexps to test for these conditions, the checks seem to be ignored and those "empty entities" are still converted to XML.

Here is some output I grabbed from the terminal:

<card>
<title>"[]"</title>
<image>"https://usercontent.googleapis.com/freebase/v1/image"</image>
%!(EXTRA string=/american_football/football_player/footballdb_id)<text>"[]"</text>
<facts>
    <fact property="/type/object/type">/type/property</fact>
    <fact property="/type/property/schema">/american_football/football_player</fact>
    <fact property="/type/property/unique">true</fact>
    <fact property="http://www/w3/org/2000/01/rdf-schema#label">"footballdb ID"@en</fact>
    <fact property="/type/property/expected_type">/type/enumeration</fact>
    <fact property="http://www/w3/org/1999/02/22-rdf-syntax-ns#type">http://www/w3/org/2002/07/owl#FunctionalProperty</fact>

    <fact property="http://www/w3/org/2000/01/rdf-schema#domain">/american_football/football_player</fact>

    <fact property="http://www/w3/org/2000/01/rdf-schema#range">/type/enumeration</fact>
</facts>
</card>

Here is my code. What am I doing wrong? Shouldn't the regexps match, so that these entities are not written?

func validTitle(content []string) bool {
    for _, v := range content {
        emptyTitle, _ := regexp.MatchString("\"[]\"", v)
        validTitle, _ := regexp.MatchString("^[A-Za-z0-9][A-Za-z0-9_-]*$", v)
        englishTitle, _ := regexp.MatchString("@en", v)
        if (!validTitle || !englishTitle) && !emptyTitle {
            return false
        }
    }
    return true
}

func validText(content []string) bool {
    for _, v := range content {
        emptyTitle, _ := regexp.MatchString("\"[]\"", v)
        validText, _ := regexp.MatchString("^[A-Za-z0-9][A-Za-z0-9_-]*$", v)
        if !validText && !emptyTitle {
            return false
        }
    }
    return true
}

func processTopic(id string, properties map[string][]string, file io.Writer) {
    if validTitle(properties["/type/object/name"]) && validText(properties["/common/document/text"]) {
        fmt.Fprintf(file, "<card>\n")
        fmt.Fprintf(file, "<title>\"%s\"</title>\n", properties["/type/object/name"])
        fmt.Fprintf(file, "<image>\"%s\"</image>\n", "https://usercontent.googleapis.com/freebase/v1/image", id)
        fmt.Fprintf(file, "<text>\"%s\"</text>\n", properties["/common/document/text"])
        fmt.Fprintf(file, "<facts>\n")
        for k, v := range properties {
            for _, value := range v {
                fmt.Fprintf(file, "<fact property=\"%s\">%s</fact>\n", k, value)
            }
        }
        fmt.Fprintf(file, "</facts>\n")
        fmt.Fprintf(file, "</card>\n")
    }
}

Answer 1

Score: 1

Your regexp is invalid. If you check the error instead of discarding it, it tells you exactly why:

error parsing regexp: missing closing ]: `[]"`

regexp.MatchString("\"[]\"", v)
// should be
regexp.MatchString(`"\[\]"`, v)
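
To see the error for yourself, here is a minimal sketch that returns the error from regexp.MatchString rather than discarding it with `_`. The helper name matchEmpty is only for illustration and is not part of the question's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchEmpty is a hypothetical helper: it runs the pattern against v and
// passes the compile error back to the caller instead of dropping it.
func matchEmpty(pattern, v string) (bool, error) {
	return regexp.MatchString(pattern, v)
}

func main() {
	// Unescaped brackets: the pattern fails to compile, so the result is
	// always false and err is non-nil.
	matched, err := matchEmpty(`"[]"`, `"[]"`)
	fmt.Println(matched, err)

	// Escaped brackets: the pattern compiles and matches the literal text.
	matched, err = matchEmpty(`"\[\]"`, `"[]"`)
	fmt.Println(matched, err)
}
```

With the error ignored, the first call is indistinguishable from "did not match", which is why emptyTitle was always false in the question's code.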

Also, since you use each pattern many times, you should compile it once outside the function and reuse it, for example:

var (
	emptyRe   = regexp.MustCompile(`"\[\]"`)
	titleRe   = regexp.MustCompile("^[A-Za-z0-9][A-Za-z0-9_-]*$")
	englishRe = regexp.MustCompile("@en")
)

func validTitle(content []string) bool {
	for _, v := range content {
		if emptyRe.MatchString(v) || !(englishRe.MatchString(v) || titleRe.MatchString(v)) {
			return false
		}
	}
	return true
}
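
One more thing may be worth checking. This is an assumption about the data, not something stated above: a validator that only loops over the slice returns true when the slice is empty or nil, because the loop body never runs. An empty []string printed with %s comes out as [], which would be consistent with the "[]" title in the sample output. A minimal sketch with a hypothetical helper, hasEnglishName, that rejects an empty slice up front:

```go
package main

import (
	"fmt"
	"regexp"
)

var englishRe = regexp.MustCompile("@en")

// hasEnglishName is a hypothetical helper, not from the original post.
// It rejects an empty or nil slice first, then requires at least one
// value tagged with @en.
func hasEnglishName(names []string) bool {
	if len(names) == 0 {
		return false // a loop-only check would fall through to "return true"
	}
	for _, v := range names {
		if englishRe.MatchString(v) {
			return true
		}
	}
	return false
}

func main() {
	var missing []string // what a map lookup yields for an absent key
	fmt.Printf("%s %v\n", missing, hasEnglishName(missing)) // prints: [] false
	fmt.Println(hasEnglishName([]string{`"footballdb ID"@en`})) // prints: true
}
```

Whether "at least one @en value" or "every value has @en" is the right rule depends on what you want from the dump; the length check is the part that matters here.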

This line has a single %s verb but you pass it two values, which is where the %!(EXTRA string=...) in your output comes from:

fmt.Fprintf(file, "<image>\"%s\"</image>\n",
            "https://usercontent.googleapis.com/freebase/v1/image", // this matches the %s
            id, // this doesn't
)

It should be

fmt.Fprintf(file, "<image>\"%s/%s\"</image>\n", "https://usercontent.googleapis.com/freebase/v1/image", id)
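
As a small self-contained sketch of the corrected call, with one verb per argument. The names imageBase and imageTag are only for illustration, and the sample id is made up:

```go
package main

import "fmt"

// imageBase is the URL prefix used in the question's code.
const imageBase = "https://usercontent.googleapis.com/freebase/v1/image"

// imageTag is a hypothetical helper that builds the <image> line.
// The format string has two %s verbs, matching its two arguments.
func imageTag(id string) string {
	return fmt.Sprintf("<image>\"%s/%s\"</image>\n", imageBase, id)
}

func main() {
	// "m/abc" is a made-up id used only for this example.
	fmt.Print(imageTag("m/abc"))
}
```

If your ids already begin with a slash, as the one in the sample output does, drop the slash from the format string so the URL does not end up with "//" in the middle.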

huangapple
  • Posted on October 16, 2014, 03:36:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/26390686.html