2020-01-06 17:48:45

Finding volume and item count using regular expressions

Question

I am currently building a JavaScript web scraper for a grocery store that processes the title of a product and returns the product's item count, volume and price per litre. Most of the product titles look something like this:

Coca cola (vanilla flavour) 12 x 330 mL

In order to obtain metadata about this product, I have written a regular expression. It looks for a word boundary followed by a 1- or 2-digit number, a whitespace character, the string 'x', another whitespace character and finally a 1-, 2- or 3-digit number:

const filter = new RegExp(/\b\d{1,2}\sx\s\d{1,3}/);

I then test each result for a match with the Regular Expression and then calculate the item count, item volume, volume in litres and then the price per litre.

  if (result.title.match(filter)) {
     result.itemCount = parseInt(result.title.match(/\d{1}\s/));
     result.itemVolume = parseInt(result.title.match(/\d{2,3}\s/));
     result.litreVolume = (result.itemCount * result.itemVolume) / 1000;
     result.pricePerLitre = +(result.price / result.litreVolume).toFixed(2);
  } else {
     result.itemCount = 1;
     result.itemVolume = parseInt(result.title.match(/\d{2,3}\s/));
     result.litreVolume = result.itemVolume / 1000;
     result.pricePerLitre = +(result.price / result.litreVolume).toFixed(2);
  }

90% of the results look good, but sometimes I get unexpected results. For example:

an item count of NaN, which may have to do with the fact that some titles contain several more numbers (Coca Cola (4-Way) 12 x 330 mL)
a volume of Infinity
a price per litre that is way too high

Clearly I am doing something wrong in my approach to calculating the desired metadata. What would be a better way of doing these calculations with RegEx? Am I missing something that would make my calculations less prone to errors?

Answer 1

Score: 1

If I understand correctly, the filter \b\d{1,2}\sx\s\d{1,3} works, but your sub-filters (such as \d{1}\s) do not.
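To see why the sub-filters misbehave, here is a small diagnostic sketch in JavaScript, the language of the question. It runs the question's two sub-patterns against the example title and against two made-up titles ("Cola 330mL" and "Water 1000 mL" are assumptions for illustration, not titles from the store). The variable names are also mine.

```javascript
// Diagnostic sketch: apply the question's two sub-patterns to sample titles.
const countPattern = /\d{1}\s/;    // one digit followed by whitespace
const volumePattern = /\d{2,3}\s/; // two or three digits followed by whitespace

const title = "Coca cola (vanilla flavour) 12 x 330 mL";

// Both patterns match inside "12 ", not at the positions the filter found.
const rawCount = title.match(countPattern)[0];   // "2 "
const rawVolume = title.match(volumePattern)[0]; // "12 "
const itemCount = parseInt(rawCount, 10);        // 2 instead of 12
const itemVolume = parseInt(rawVolume, 10);      // 12 instead of 330

// Hypothetical title with no digits directly before whitespace:
// match() returns null, and parseInt(null, 10) is NaN.
const missing = parseInt("Cola 330mL".match(volumePattern), 10);

// Hypothetical four-digit volume: only "000 " matches, which parses to 0,
// so a later division by the litre volume yields Infinity.
const truncated = parseInt("Water 1000 mL".match(volumePattern), 10);
```

So the sub-patterns search the whole title from the start again, and nothing ties them to the "12 x 330" part that the filter matched. That would explain a price per litre that is far too high as well as the NaN and Infinity cases.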

I have mostly used regex in C#, but I saw that you can use groups in Java as well.
Change your pattern to (\b\d{1,2})\sx\s(\d{1,3}). When you put parentheses in your regex, the enclosed part becomes a group that you can access afterwards.
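Since the question is about JavaScript, here is a minimal sketch of the same grouping idea there. It assumes the title contains the "N x M" part, and the variable names are mine. String.prototype.match with a non-global regex returns an array in which index 0 is the whole match and the following indices are the groups.

```javascript
// Sketch: the grouped pattern from this answer, used from JavaScript.
const packPattern = /(\b\d{1,2})\sx\s(\d{1,3})/;

const match = "Coca Cola (4-Way) 12 x 330 mL".match(packPattern);

// match[0] is the whole match; match[1] and match[2] are the two groups.
const whole = match[0];                    // "12 x 330"
const itemCount = parseInt(match[1], 10);  // 12
const itemVolume = parseInt(match[2], 10); // 330
```

The "4" in "(4-Way)" is skipped because it is not followed by whitespace and an "x", so both numbers come from the same place in the title.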

As I said, I haven't used Java in a few years, but I picked this code snippet from the web. It shows how to use groups in Java. As the pattern, you should use (\b\d{1,2})\sx\s(\d{1,3}). If it works the same as in C#, group(0) is the whole match, group(1) is your first actual group, and group(2) is the second.

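import java.util.regex.Matcher;
import java.util.regex.Pattern;

String pattern = "(\\b\\d{1,2})\\sx\\s(\\d{1,3})";
String line = "Coca cola (vanilla flavour) 12 x 330 mL";

// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create the Matcher object
Matcher m = r.matcher(line);

if (m.find()) {
    System.out.println("Found value: " + m.group(0)); // whole match
    System.out.println("Found value: " + m.group(1)); // first group
    System.out.println("Found value: " + m.group(2)); // second group
}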

I think you can write it with less code than shown above, but you get the picture.
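For the JavaScript scraper in the question, a compact version could look roughly like the sketch below. Treat it as a rough sketch under stated assumptions: the helper name addVolumeData, the SINGLE pattern for titles without a multipack part, the choice to store null when nothing can be parsed, and the sample prices are all my own. Only the MULTIPACK pattern and the field names come from the question and this answer.

```javascript
// MULTIPACK is the grouped pattern from the answer, e.g. "12 x 330".
const MULTIPACK = /(\b\d{1,2})\sx\s(\d{1,3})/;
// SINGLE is an assumed format for single items, e.g. "500 mL" or "330mL".
const SINGLE = /\b(\d{1,4})\s?mL\b/i;

// Hypothetical helper: fills in the fields used in the question.
function addVolumeData(result) {
  const pack = result.title.match(MULTIPACK);
  const single = pack ? null : result.title.match(SINGLE);

  if (pack) {
    result.itemCount = parseInt(pack[1], 10);
    result.itemVolume = parseInt(pack[2], 10);
  } else if (single) {
    result.itemCount = 1;
    result.itemVolume = parseInt(single[1], 10);
  } else {
    // Nothing recognisable: store null rather than NaN or Infinity.
    result.itemCount = null;
    result.itemVolume = null;
    result.litreVolume = null;
    result.pricePerLitre = null;
    return result;
  }

  result.litreVolume = (result.itemCount * result.itemVolume) / 1000;
  // Guard against a zero volume before dividing.
  result.pricePerLitre = result.litreVolume > 0
    ? +(result.price / result.litreVolume).toFixed(2)
    : null;
  return result;
}

// Usage with made-up prices:
const cola = addVolumeData({ title: "Coca Cola (4-Way) 12 x 330 mL", price: 7.92 });
const fanta = addVolumeData({ title: "Fanta 500 mL", price: 1.5 });
const other = addVolumeData({ title: "Bananas", price: 2 });
```

The point is that both numbers are read from the groups of one match, and a title that matches neither pattern is flagged explicitly and never passed through parseInt.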
