Finding volume and item count using regular expressions

Question

I am currently building a JavaScript web scraper for a grocery store. It processes the title of each product and returns the item count, the volume and the price per litre. Most of the product titles look something like this:

Coca cola (vanilla flavour) 12 x 330 mL

To obtain metadata about this product, I have written a regular expression. It looks for a word boundary followed by a 1 or 2 digit number, a whitespace character, the string 'x', another whitespace character and finally a 1, 2 or 3 digit number:

const filter = new RegExp(/\b\d{1,2}\sx\s\d{1,3}/);

I then test each result against the regular expression and calculate the item count, the item volume, the volume in litres and the price per litre.

  if (result.title.match(filter)) {
     result.itemCount = parseInt(result.title.match(/\d{1}\s/));
     result.itemVolume = parseInt(result.title.match(/\d{2,3}\s/));
     result.litreVolume = (result.itemCount * result.itemVolume) / 1000;
     result.pricePerLitre = +(result.price / result.litreVolume).toFixed(2);
  } else {
     result.itemCount = 1;
     result.itemVolume = parseInt(result.title.match(/\d{2,3}\s/));
     result.litreVolume = result.itemVolume / 1000;
     result.pricePerLitre = +(result.price / result.litreVolume).toFixed(2);
  }

90% of the results look good, but sometimes I get unexpected results. For example:

  • an item count of NaN, which may be because some titles contain additional numbers (for example, Coca Cola (4-Way) 12 x 330 mL)
  • a volume of Infinity
  • a price per litre that is way too high

Clearly I am doing something wrong in my approach to calculating the desired metadata. What would be a better way of doing these calculations with regular expressions? Am I missing something that would make my calculations less prone to errors?

Answer 1

Score: 1

If I understand correctly, the filter \b\d{1,2}\sx\s\d{1,3} works, but your sub-filters (such as \d{1}\s) do not.
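
To see why the sub-filters misbehave, it may help to run them on the sample title from the question. This is only a quick check using the patterns already shown; the variable names and the "Cola 1.5 L" title are made up for illustration:

```javascript
const title = "Coca cola (vanilla flavour) 12 x 330 mL";

// The first single digit followed by whitespace is the "2" in "12",
// so the count loses its leading digit.
const countMatch = title.match(/\d{1}\s/);
const itemCount = parseInt(countMatch, 10);

// The first 2-3 digit run followed by whitespace is "12", not "330",
// so the volume picks up the item count instead.
const volumeMatch = title.match(/\d{2,3}\s/);
const itemVolume = parseInt(volumeMatch, 10);

// match() returns null when nothing matches, and parseInt(null, 10) is NaN.
const missing = parseInt("Cola 1.5 L".match(/\d{2,3}\s/), 10);

console.log(itemCount, itemVolume, missing); // 2 12 NaN
```

A count of 2 and a volume of 12 mL give 0.024 litres, which would explain a price per litre that is far too high.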

I have mostly used regex in C#, but I saw that you can use groups in Java as well.
Change your pattern to (\b\d{1,2})\sx\s(\d{1,3}). When you put parentheses in your regex, that part becomes a group that you can access afterwards. The same group syntax works in JavaScript regular expressions.
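
Since the question is about JavaScript, here is a minimal sketch of the grouped pattern there. It assumes String.prototype.match without the g flag, which returns the whole match at index 0 and the groups at the following indexes; the variable names are only illustrative:

```javascript
const title = "Coca Cola (4-Way) 12 x 330 mL";

// Same pattern as the original filter, with parentheses around both numbers.
const pattern = /(\b\d{1,2})\sx\s(\d{1,3})/;
const match = title.match(pattern);

if (match) {
  // match[0] is the whole match, match[1] and match[2] are the two groups.
  const itemCount = parseInt(match[1], 10);
  const itemVolume = parseInt(match[2], 10);
  console.log(match[0], itemCount, itemVolume); // 12 x 330 12 330
}
```

Because both numbers come from the same match, the extra "4" in "(4-Way)" is never picked up.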

I haven't used Java in a few years, but I picked this code snippet from the web. It shows how to use groups in Java. As the pattern you should use (\b\d{1,2})\sx\s(\d{1,3}). If it works the same as in C#, group(0) is the whole result, group(1) is your first actual group and group(2) is the second.

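import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    public static void main(String[] args) {
        String line = "Coca cola (vanilla flavour) 12 x 330 mL";

        // Create a Pattern object (backslashes are doubled in a Java string literal)
        Pattern r = Pattern.compile("(\\b\\d{1,2})\\sx\\s(\\d{1,3})");

        // Now create the matcher object
        Matcher m = r.matcher(line);

        if (m.find()) {
            System.out.println("Found value: " + m.group(0)); // whole match
            System.out.println("Found value: " + m.group(1)); // item count
            System.out.println("Found value: " + m.group(2)); // item volume
        }
    }
}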

I think you can write it with less code than shown above, but you get the picture.
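
Putting the pieces together in JavaScript, a rough sketch could look like the following. The helpers parseTitle and pricePerLitre are hypothetical names, and the single-item pattern (a number directly before "mL") is an assumption about how the titles are written, not something taken from the question:

```javascript
// Hypothetical helper: returns count, volume (mL) and litres for a title,
// or null when no volume can be read from it.
function parseTitle(title) {
  // Multipack form, e.g. "12 x 330 mL" (pattern from the answer above).
  const pack = title.match(/(\b\d{1,2})\sx\s(\d{1,3})/);
  // Assumed single-item form: a 1-4 digit number directly before "mL".
  const single = title.match(/\b(\d{1,4})\s?mL\b/i);

  let itemCount;
  let itemVolume;
  if (pack) {
    itemCount = parseInt(pack[1], 10);
    itemVolume = parseInt(pack[2], 10);
  } else if (single) {
    itemCount = 1;
    itemVolume = parseInt(single[1], 10);
  } else {
    return null; // nothing usable, so no NaN leaks into the results
  }

  const litreVolume = (itemCount * itemVolume) / 1000;
  // A zero volume would make the price per litre Infinity.
  if (litreVolume === 0) return null;
  return { itemCount, itemVolume, litreVolume };
}

// Hypothetical helper: price per litre rounded to two decimals.
function pricePerLitre(price, parsed) {
  return +(price / parsed.litreVolume).toFixed(2);
}

const parsed = parseTitle("Coca Cola (4-Way) 12 x 330 mL");
console.log(parsed, pricePerLitre(7.92, parsed));
```

Returning null for titles that cannot be parsed lets the caller skip or log them instead of storing NaN or Infinity.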
