How can I get all the strings that have the same pattern

Question

I have an XML file with this structure:

<expression>[Customer ].[Sales ].[L_MOIS]</expression><expression>cast_varchar([Customer ].[Sales ].[L_MOIS_ANNEE])
+ ' ' + 
cast_varchar([Customer ].[Sales ].[C_ANNEE])</expression></dataItem></selection><detailFilters><detailFilter><filterExpression>[Customer ].[Sales ].[DT_JOUR] <= getdate()</filterExpression></detailFilter></detailFilters></query><query name="RSmag"><source><model /></source><selection><dataItem aggregate="none" name="Code magasin"><expression>[Customer statistics].[Stores].[C_MAGASIN]</expression></dataItem><dataItem aggregate="none" name="Libellé magasin" sort="ascending"><expression>[Customer statistics].[Stores].[L_MAGASIN]</expression></dataItem></selection><detailFilters><detailFilter><filterExpression>[Customer statistics].[Stores].[C_DEPOT] <>'500'</filterExpression></detailFilter><detailFilter><filterExpression>[Customer statistics].[Stores].[C_MAGASIN] not in ('005120';'005130';'005140')</filterExpression></detailFilter></detailFilters>
</query><query name="CAdept_avec_metier_cumul"><source><model /></source><selection><dataItem aggregate="none" name="Cod Metier" rollupAggregate="none"><expression>[Customer ].[Articles].[COD_DPTG]</expression></dataItem><dataItem name="Nombre de tickets" rollupAggregate="total">
<expression>count(distinct [Customer ].[Sales ].[ID_TICKET])</expression></dataItem><dataItem name="Nombre de tickets non affecté" rollupAggregate="total"><expression>count(distinct 
(case 
when [Customer ].[Sales ].[C_AFFECTATION] <> 1  
then [Customer ].[Sales ].[ID_TICKET]
else null 
end)
)</expression>

I want to extract all the table names in full. The result should include:

[Customer ].[Sales ].[C_ANNEE]
[Customer ].[Sales ].[DT_JOUR]

But what I currently get is:

Customer
Sales
C_ANNEE

This is my code:

File f = new File("");
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(f), "UTF-8"));
String str;
while ((str = in.readLine()) != null) {
    Matcher m = Pattern.compile("\\[(.*?)\\]").matcher(str);
    while (m.find()) {
        listres.add(m.group(1));
    }
}

Answer 1

Score: 1

Break the problem into two separate parts:

  1. Parse the XML data using a suitable XML parser, to extract the text we want.

  2. For the extracted text fields, use a regular expression to extract the required sub-strings.

The following example uses a SAX parser (I am using Java 13, by the way).

Assume we have a file containing the following XML:

<root>
<query name="RSmag">
  <source>
    <model />
  </source>
  <selection>
    <dataItem aggregate="none" name="Code magasin">
      <expression>
        [Customer ].[Sales ].[L_MOIS]
      </expression>
      <expression>
        cast_varchar([Customer ].[Sales ].[L_MOIS_ANNEE]) + ' ' + cast_varchar([Customer ].[Sales ].[C_ANNEE])
      </expression>
    </dataItem>
  </selection>
  <detailFilters>
    <detailFilter>
      <filterExpression>
        [Customer ].[Sales ].[DT_JOUR] &lt;= getdate()
      </filterExpression>
    </detailFilter>
  </detailFilters>
</query>
<query name="RSmag">
  <source>
    <model />
  </source>
  <selection>
    <dataItem aggregate="none" name="Code magasin">
      <expression>
        [Customer statistics].[Stores].[C_MAGASIN]
      </expression>
    </dataItem>
    <dataItem aggregate="none" name="Libellé magasin" sort="ascending">
      <expression>
        [Customer statistics].[Stores].[L_MAGASIN]
      </expression>
    </dataItem>
  </selection>
  <detailFilters>
    <detailFilter>
      <filterExpression>
        [Customer statistics].[Stores].[C_DEPOT] &lt;&gt; '500'
      </filterExpression>
    </detailFilter>
    <detailFilter>
      <filterExpression>
        [Customer statistics].[Stores].[C_MAGASIN]
        not in ('005120';'005130';'005140')
      </filterExpression>
    </detailFilter>
  </detailFilters>
</query>
<query name="CAdept_avec_metier_cumul">
  <source>
    <model />
  </source>
  <selection>
    <dataItem aggregate="none" name="Cod Metier" rollupAggregate="none">
      <expression>
        [Customer ].[Articles].[COD_DPTG]
      </expression>
    </dataItem>
    <dataItem name="Nombre de tickets" rollupAggregate="total">
      <expression>
        count(distinct [Customer ].[Sales ].[ID_TICKET])
      </expression>
    </dataItem>
    <dataItem name="Nombre de tickets non affecté" rollupAggregate="total">
      <expression>count(distinct
                  (case
                   when [Customer ].[Sales ].[C_AFFECTATION] &lt;&gt; 1
                   then [Customer ].[Sales ].[ID_TICKET]
                   else null
                   end)
                  )
      </expression>
    </dataItem>
  </selection>
</query>
</root>

Note the following:

a) I made an educated guess to create a valid XML document, based on the question's sample data.

b) I escaped the < and > symbols in the text by using &lt; and &gt;.
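
To illustrate point (b): the escaping only matters inside the file, because the parser hands the text back with the original symbols restored. The following is a minimal sketch using the JDK's DOM parser; the class name EscapeDemo and the helper rootText are made up for this illustration and are not part of the solution further down.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EscapeDemo {

    // Hypothetical helper: parses a small XML string and returns
    // the text content of its root element.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // "&lt;=" in the XML source is returned as "<=" after parsing.
        System.out.println(rootText(
                "<filterExpression>[Customer ].[Sales ].[DT_JOUR] &lt;= getdate()</filterExpression>"));
        // prints: [Customer ].[Sales ].[DT_JOUR] <= getdate()
    }
}
```

A raw < inside element text is not well-formed XML, so the unescaped sample from the question would be rejected by the parser.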

Step 1 - Parsing the Data

This solution uses SAX for parsing - there are plenty of alternatives.

The code in the full solution below parses the input file and ignores every element other than <expression> and <filterExpression>. This set of element names (watchedElements) can be adjusted as needed.

The code collects the text inside each of these elements and cleans it up by removing newlines and extra whitespace.

This gives us a set of 10 text strings, like this:

[Customer ].[Sales ].[L_MOIS]
cast_varchar([Customer ].[Sales ].[L_MOIS_ANNEE]) + ' ' + cast_varchar([Customer ].[Sales ].[C_ANNEE])
[Customer ].[Sales ].[DT_JOUR] <= getdate()
[Customer statistics].[Stores].[C_MAGASIN]
[Customer statistics].[Stores].[L_MAGASIN]
[Customer statistics].[Stores].[C_DEPOT] <> '500'
[Customer statistics].[Stores].[C_MAGASIN] not in ('005120';'005130';'005140')
[Customer ].[Articles].[COD_DPTG]
count(distinct [Customer ].[Sales ].[ID_TICKET])
count(distinct (case when [Customer ].[Sales ].[C_AFFECTATION] <> 1 then [Customer ].[Sales ].[ID_TICKET] else null end) )
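
The collection step can be tried without a file on disk. Below is a rough, self-contained sketch that parses an XML string and returns the cleaned text of each watched element as a list. The class name ExpressionTextCollector and the method collect are invented for this example, and it collapses whitespace with a single \s+ replacement rather than the two replaceAll calls used in the full solution.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ExpressionTextCollector {

    // Hypothetical helper: returns the cleaned-up text of every watched element.
    static List<String> collect(String xml, Set<String> watchedElements) throws Exception {
        List<String> texts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while we are inside a watched element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (watchedElements.contains(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] buffer, int start, int length) {
                // The parser may deliver one element's text in several chunks.
                if (current != null) {
                    current.append(buffer, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && watchedElements.contains(qName)) {
                    // Collapse newlines and runs of whitespace into single spaces.
                    texts.add(current.toString().replaceAll("\\s+", " ").trim());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><dataItem name=\"x\"><expression>count(distinct\n"
                + "   [Customer ].[Sales ].[ID_TICKET])</expression></dataItem></root>";
        System.out.println(collect(xml, Set.of("expression", "filterExpression")));
        // prints: [count(distinct [Customer ].[Sales ].[ID_TICKET])]
    }
}
```

Returning a list instead of printing keeps the parsing step separate from the regex step, which makes each half easier to check on its own.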

Step 2 - Applying the Regex

For each of these strings we use a regular expression to find the data we want:

\[.*?\](\.\[.*?\])*

This matches an opening "[" through to the next "]", followed by zero or more further "[...]" segments, each preceded by a period.

The parenthesised part is a capturing group that holds only the last repeated segment, so we ignore it and keep group zero, which is the whole match:

Matcher m = pattern.matcher(text);
while (m.find()) {
    System.out.println("*** Matches found  : " + m.group(0));
}
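
To make the difference from the question's code concrete, here is a small comparison of the two patterns on one of the extracted strings. It is only a sketch: the class name GroupDemo, the constant names and the findAll helper are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {

    // The question's pattern: one bracket pair, capturing only its contents.
    static final Pattern SINGLE = Pattern.compile("\\[(.*?)\\]");
    // The answer's pattern: one bracket pair plus any ".[...]" continuations.
    static final Pattern CHAINED = Pattern.compile("\\[.*?\\](\\.\\[.*?\\])*");

    // Hypothetical helper: returns the given group of every match in the text.
    static List<String> findAll(Pattern pattern, String text, int group) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            results.add(m.group(group));
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "[Customer ].[Sales ].[DT_JOUR] <= getdate()";
        System.out.println(findAll(SINGLE, text, 1));  // [Customer , Sales , DT_JOUR]
        System.out.println(findAll(CHAINED, text, 0)); // [[Customer ].[Sales ].[DT_JOUR]]
        System.out.println(findAll(CHAINED, text, 1)); // [.[DT_JOUR]]
    }
}
```

The first line reproduces what the question was getting, the second is the desired output, and the third shows why group 1 of the chained pattern is not useful here.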

This gives us the following 12 results:

[Customer ].[Sales ].[L_MOIS]
[Customer ].[Sales ].[L_MOIS_ANNEE]
[Customer ].[Sales ].[C_ANNEE]
[Customer ].[Sales ].[DT_JOUR]
[Customer statistics].[Stores].[C_MAGASIN]
[Customer statistics].[Stores].[L_MAGASIN]
[Customer statistics].[Stores].[C_DEPOT]
[Customer statistics].[Stores].[C_MAGASIN]
[Customer ].[Articles].[COD_DPTG]
[Customer ].[Sales ].[ID_TICKET]
[Customer ].[Sales ].[C_AFFECTATION]
[Customer ].[Sales ].[ID_TICKET]
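
Two of the names above appear twice. If each name is wanted only once, the matches could be collected into a LinkedHashSet, which drops repeats while keeping first-seen order. This goes slightly beyond what the question asked for, so treat it as an optional sketch; the class name DistinctNames and the method distinctNames are assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DistinctNames {

    // Same pattern as in the answer.
    static final Pattern NAME = Pattern.compile("\\[.*?\\](\\.\\[.*?\\])*");

    // Hypothetical helper: every distinct bracketed name, in first-seen order.
    static Set<String> distinctNames(List<String> texts) {
        Set<String> names = new LinkedHashSet<>();
        for (String text : texts) {
            Matcher m = NAME.matcher(text);
            while (m.find()) {
                names.add(m.group()); // group() is the same as group(0)
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> texts = List.of(
                "count(distinct [Customer ].[Sales ].[ID_TICKET])",
                "count(distinct (case when [Customer ].[Sales ].[C_AFFECTATION] <> 1 "
                        + "then [Customer ].[Sales ].[ID_TICKET] else null end) )");
        // Prints ID_TICKET once, then C_AFFECTATION.
        distinctNames(texts).forEach(System.out::println);
    }
}
```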

Full Solution

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.util.Set;
import java.util.HashSet;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class ParseFromFileUsingSax {

    // Looks for an opening "[" followed by a closing "]" with an
    // optional "." to string items together into one group.
    private final Pattern pattern = Pattern.compile("\\[.*?\\](\\.\\[.*?\\])*");

    // Entry point so the class can be run directly.
    public static void main(String[] args) {
        new ParseFromFileUsingSax().parseUsingSax();
    }

    public void parseUsingSax() {
        try {

            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            // the tags we will inspect (all others will be skipped):
            Set<String> watchedElements = new HashSet<>();
            watchedElements.add("expression");
            watchedElements.add("filterExpression");

            DefaultHandler handler = new DefaultHandler() {

                private boolean inElement = false;
                private StringBuilder stringBuilder;

                @Override
                public void startElement(String uri, String localName, String name,
                        Attributes attributes) throws SAXException {
                    if (watchedElements.contains(name)) {
                        inElement = true;
                        stringBuilder = new StringBuilder();
                    }
                }

                @Override
                public void characters(char[] buffer, int start, int length) throws SAXException {
                    // text may arrive in several chunks, so append rather than assign
                    if (inElement) {
                        stringBuilder.append(buffer, start, length);
                    }
                }

                @Override
                public void endElement(String uri, String localName,
                        String name) throws SAXException {
                    if (watchedElements.contains(name)) {
                        inElement = false;
                        String extractedText = formatString(stringBuilder.toString());
                        System.out.println();
                        System.out.println("Extracted XML text : " + extractedText);
                        printMatches(extractedText);
                    }
                }

            };

            saxParser.parse("C:/tmp/query_data.xml", handler);

        } catch (Exception e) {
            System.err.print(e);
        }

    }

    private String formatString(String text) {
        text = text.replaceAll("\\r\\n|\\r|\\n", " "); // replace newlines with spaces
        text = text.replaceAll("  *", " "); // collapse multiple spaces
        return text.trim(); // remove leading/trailing whitespace
    }

    private void printMatches(String text) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            System.out.println("*** Matches found  : " + m.group(0));
        }
    }

}
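
Step 1 mentions that there are alternatives to SAX. One of them is the DOM parser that also ships with the JDK. The sketch below is a possible variant, not part of the original solution: the class name DomAlternative and the method textsOf are hypothetical, and it loads the whole document into memory, which is usually fine for a report definition of this size.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomAlternative {

    // Hypothetical helper: cleaned text of every element with one of the given
    // tag names. Results are grouped by tag name, not by document order.
    static List<String> textsOf(String xml, String... tagNames) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> texts = new ArrayList<>();
        for (String tag : tagNames) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                // Collapse newlines and runs of whitespace into single spaces.
                texts.add(nodes.item(i).getTextContent().replaceAll("\\s+", " ").trim());
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<expression>[Customer ].[Articles].[COD_DPTG]</expression>"
                + "<filterExpression>[Customer ].[Sales ].[DT_JOUR] &lt;= getdate()</filterExpression>"
                + "</root>";
        textsOf(xml, "expression", "filterExpression").forEach(System.out::println);
    }
}
```

The same regular expression from Step 2 can then be applied to each returned string.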

huangapple
  • Posted on 2020-03-17 00:59:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/60710140.html