Apache POI doesn't correctly read seconds from Excel cells

Question

I am currently debugging someone else's code base. The aim is to import data from Excel files into a database. Each row in the Excel file contains a timestamp in column 0 and some label values in the remaining columns.

The timestamps contain year, month, day of month, hour, minute, and second. To parse the Excel files and read individual cells, the following API and code are used:

import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellType;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;

...

// some loop

   LocalDateTime dateCellValue = cell.getLocalDateTimeCellValue();

The parsing of dates works properly in most cases, but I found that this does not hold for all of them. The problem is that Java sometimes attaches a fractional (nanosecond) part to the seconds of a timestamp. For example, the timestamp "12.09.2018 12:39:11" is interpreted by Java as "2018-09-12T12:39:10.995". The problem carries over directly into the db: second 11 from the file is saved as second 10 in the db (the fractional part is simply cut off).

To understand the cause of the problem, I read the following documentation: https://github.com/apache/poi/blob/trunk/src/java/org/apache/poi/hssf/usermodel/HSSFCell.java

From this I concluded that the cause of my problem is a rounding error: the API interprets each date value as a double. In Excel you can see this double value by converting a date time value into text or a number; e.g. "20.03.2019 08:36:39" becomes "43544,3587847222" (written with a decimal comma). Hence, for certain values the double does not hit the intended second exactly, and Java therefore does not parse the value as expected.
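
To illustrate the double representation described above, here is a minimal sketch in plain Java (no POI) that converts an Excel serial number by hand. It assumes Excel's default 1900 date system, in which day 0 corresponds to 1899-12-30 for dates after February 1900. The class and method names are made up for this illustration, and this is not POI's actual conversion code.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

class ExcelSerialSketch {

    // Assumption: 1900 date system, serial values after 1900-02-28.
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    /** Converts the serial and drops everything below one full second. */
    static LocalDateTime truncating(double serial) {
        long days = (long) Math.floor(serial);
        long seconds = (long) ((serial - days) * 86_400);
        return EPOCH.plusDays(days).atStartOfDay().plusSeconds(seconds);
    }

    /** Converts the serial and rounds the time of day to the nearest millisecond. */
    static LocalDateTime roundingToMillis(double serial) {
        long days = (long) Math.floor(serial);
        long millis = Math.round((serial - days) * 86_400_000d);
        return EPOCH.plusDays(days).atStartOfDay().plusNanos(millis * 1_000_000L);
    }

    public static void main(String[] args) {
        double serial = 43544.3587847222; // "20.03.2019 08:36:39" from the question
        System.out.println(truncating(serial));       // 2019-03-20T08:36:38
        System.out.println(roundingToMillis(serial)); // 2019-03-20T08:36:39
    }
}
```

The fraction 0.3587847222 of a day is about 30998.999998 seconds, so cutting it off loses a whole second while rounding recovers 08:36:39. This only covers values that are a tiny bit off; a cell that really stores 10.995 seconds still yields 10.995 after millisecond rounding.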

My question is what the quickest accurate fix for such a problem could look like. Basically, I thought about two possibilities:

  1. Instead of using Apache POI, I could parse the dates through a DateFormatter. This, however, has the disadvantage that the user-defined dates would always need to have the same format, which they currently do not.
  2. In the code, I could try to round to the correct second and cut off the nanosecond artifacts. However, I am not sure whether I always need to round up to the next second (as in the example above), or whether there are cases (e.g. fractions below 0.5 seconds) in which I need to round down instead. My problem here is that I don't fully understand the exact system behind the program's behavior with regard to the rounding error.

Does someone have a suggestion? Help is very much appreciated!

Edit:
The cause of the problem was that the fractional seconds were already in the Excel sheets but not visible to the naked eye, because the cell format in use did not display them.
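
One way to check a sheet for such hidden fractions is to test the raw numeric value of each date cell. The following is only a sketch in plain Java under the same 1900 date system assumption; `toSerial` and `hasSubSecondPart` are hypothetical helpers written for this illustration, not POI methods.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;

class HiddenFractionCheck {

    // Assumption: 1900 date system, serial values after 1900-02-28.
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    /** Builds an Excel-style serial number from a date time (demo helper). */
    static double toSerial(LocalDateTime value) {
        long millis = Duration.between(EPOCH.atStartOfDay(), value).toMillis();
        return millis / 86_400_000d;
    }

    /** True if the serial, rounded to milliseconds, is not a whole number of seconds. */
    static boolean hasSubSecondPart(double serial) {
        long days = (long) Math.floor(serial);
        long millisOfDay = Math.round((serial - days) * 86_400_000d);
        return millisOfDay % 1000 != 0;
    }

    public static void main(String[] args) {
        LocalDateTime shown = LocalDateTime.of(2018, 9, 12, 12, 39, 11);
        LocalDateTime stored = shown.minusNanos(5_000_000L); // 12:39:10.995
        System.out.println(hasSubSecondPart(toSerial(shown)));  // false
        System.out.println(hasSubSecondPart(toSerial(stored))); // true
    }
}
```

With POI, the value to pass in would be the result of `cell.getNumericCellValue()` for a date-formatted cell.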

Answer 1

Score: 3

I cannot reproduce the issue. If Apache POI gets LocalDateTime 2018-09-12T12:39:10.995, then the Excel cell has stored exactly that date time. Of course Excel might not show it exactly, because the cell's date format already rounds. For example, the date format DD.MM.YYYY hh:mm:ss will show 12.09.2018 12:39:11 for 2018-09-12T12:39:10.995. But what is stored is the exact date time.

But if the need is to get the LocalDateTime only to an accuracy of seconds, one could add 0.5 seconds (500 thousandths of a second, i.e. 500,000,000 nanoseconds) and then truncate to seconds. That approach rounds the LocalDateTime to the nearest second.

LocalDateTime dateCellValue = cell.getLocalDateTimeCellValue(); // got directly from Excel
dateCellValue = dateCellValue.plusNanos(500000000).truncatedTo(ChronoUnit.SECONDS); // round to seconds
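
The rounding step can also be tried without any Excel file. The following small sketch wraps it in a helper (the name `roundToSeconds` is chosen only for this illustration) and applies it to a few of the fractions from the question:

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

class RoundToSecondsDemo {

    /** Rounds half up to the nearest full second. */
    static LocalDateTime roundToSeconds(LocalDateTime value) {
        return value.plusNanos(500_000_000L).truncatedTo(ChronoUnit.SECONDS);
    }

    public static void main(String[] args) {
        LocalDateTime base = LocalDateTime.of(2018, 9, 12, 12, 39, 10);
        for (int millis : new int[] {0, 495, 500, 995}) {
            LocalDateTime raw = base.plusNanos(millis * 1_000_000L);
            // 0 and 495 ms stay at second 10; 500 and 995 ms become second 11
            System.out.println(raw + " -> " + roundToSeconds(raw));
        }
    }
}
```

Because half a second is added before truncating, fractions below 0.5 are rounded down and fractions of 0.5 or more are rounded up, so no separate case distinction is needed. A value such as 12:39:59.700 rolls over to 12:40:00 correctly.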

Complete example:

Excel sheet looks like:

[Screenshot of the Excel sheet]

Cell values in column B here are date time values. The cell number format is TT.MM.YYYY hh:mm:ss.000 (TT being the day placeholder in German-language Excel).

Code:

import org.apache.poi.ss.usermodel.*;

import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

import java.io.FileInputStream;

class ExcelReadLocalDateTime {

 public static void main(String[] args) throws Exception {
  // WorkbookFactory detects the file type, so "Workbook.xls" works here as well
  Workbook workbook = WorkbookFactory.create(new FileInputStream("Workbook.xlsx"));

  Sheet sheet = workbook.getSheetAt(0);

  for (Row row : sheet) {
   for (Cell cell : row) {
    switch (cell.getCellType()) {
     case STRING:
      System.out.println(cell.getRichStringCellValue().getString());
      break;
     case NUMERIC:
      // dates are stored as numbers; only the cell format marks them as dates
      if (DateUtil.isCellDateFormatted(cell)) {
       LocalDateTime dateCellValue = cell.getLocalDateTimeCellValue(); // got directly from Excel
       System.out.println(dateCellValue);
       dateCellValue = dateCellValue.plusNanos(500000000).truncatedTo(ChronoUnit.SECONDS); // round to seconds
       System.out.println(dateCellValue);
      } else {
       System.out.println(cell.getNumericCellValue());
      }
      break;
     default:
      System.out.println();
    }
   }
  }

  workbook.close();
 }
}

Result:

Text
DateTime
DT 1
2018-09-12T12:39:10
2018-09-12T12:39:10
DT 2
2018-09-12T12:39:10.123
2018-09-12T12:39:10
DT 3
2018-09-12T12:39:10.245
2018-09-12T12:39:10
DT 4
2018-09-12T12:39:10.370
2018-09-12T12:39:10
DT 5
2018-09-12T12:39:10.495
2018-09-12T12:39:10
DT 6
2018-09-12T12:39:10.500
2018-09-12T12:39:11
DT 7
2018-09-12T12:39:10.620
2018-09-12T12:39:11
DT 8
2018-09-12T12:39:10.745
2018-09-12T12:39:11
DT 9
2018-09-12T12:39:10.870
2018-09-12T12:39:11
DT 10
2018-09-12T12:39:10.995
2018-09-12T12:39:11

huangapple
  • Posted on August 27, 2020, 18:17:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63613873.html