Apache Pig filtering null values or literals in tuple

Question

I wrote the Pig UDF below to test whether a chararray column holds a valid 'yyyy-MM-dd' date. When I test it with the script below, I get the following error. Is there a problem with the data? I thought I was handling null tuples, to cover both the NULL values in the data and the missing values. Also, should I remove the empty line from the data file?

Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2114: Expected input to be chararray, but got NULL
        at IsValidDateTime.exec(IsValidDateTime.java:41)
        at IsValidDateTime.exec(IsValidDateTime.java:18) 

> dates.txt

2019-12-27,2020-08-20
2017-05-09,2018-10-04
2016-09-25,2020-01-19
,2020-08-20
NULL,2017-09-28
2016-11-15,NULL
2018-04-17,Thu Aug-20 2020
2017-05-09,2020-08-20
Mon Jan-20 2020,2020-08-20
<empty line>
------------------------------------------

> dates_valid (expected all valid 'yyyy-MM-dd' start_dt and end_dt)

2019-12-27,2020-08-20
2017-05-09,2018-10-04
2016-09-25,2020-01-19
2017-05-09,2020-08-20

> Pig script

REGISTER 'IsValidDateTime.jar'

DEFINE IsValidDateTime IsValidDateTime();

dates = LOAD 'dates.txt' USING PigStorage(',') AS (start_dt:chararray, end_dt:chararray);
DUMP dates;
dates_valid = FILTER dates BY (IsValidDateTime(start_dt) AND IsValidDateTime(end_dt));
DUMP dates_valid;

> IsValidDateTime Filter UDF

import org.apache.pig.FilterFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
public class IsValidDateTime extends FilterFunc {
    private static String datePattern = "yyyy-MM-dd";
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;
        try {
            Object date = input.get(0);
            if(DataType.findType(date) == DataType.CHARARRAY){
                String dateStr = String.valueOf(date);
                if(dateStr != null && dateStr.length() != 0) {
                    try {
                        SimpleDateFormat format = new SimpleDateFormat(datePattern);
                        format.setLenient(false);
                        format.parse(dateStr);
                    } catch (ParseException | IllegalArgumentException e) {
                        return false; //date string does not match 'yyyy-MM-dd' format
                    }
                    return true; //date string is of valid format 'yyyy-MM-dd'
                }
                return false; //empty or null date string
            } else {
                int errCode = 2114;
                String msg = "Expected input to be chararray, but got " +  DataType.findTypeName(date) ;
                throw new ExecException(msg, errCode, PigException.BUG);
            }
        } catch(ExecException ee) {
            throw ee;
        }
    }
}

Answer 1

Score: 1

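I removed the outer if-condition that checked for DataType.CHARARRAY. That check is what raised ERROR 2114: a null field is reported as type NULL rather than chararray, so the UDF threw an exception instead of returning false. The UDF now reads the value from the input Tuple as a String and checks whether it is null or empty. That is the only condition needed. Below is the final code.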

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class IsValidDateTime extends FilterFunc {
    private static final String DATE_PATTERN = "yyyy-MM-dd";

    @Override
    public Boolean exec(Tuple input) throws IOException {
        // A missing or empty tuple cannot hold a valid date.
        if (input == null || input.size() == 0) {
            return false;
        }
        // PigStorage loads an empty field as null, so date may be null here.
        String date = (String) input.get(0);
        if (date == null || date.isEmpty()) {
            return false; // empty or null date string
        }
        try {
            SimpleDateFormat format = new SimpleDateFormat(DATE_PATTERN);
            format.setLenient(false);
            format.parse(date);
        } catch (ParseException | IllegalArgumentException e) {
            return false; // date string does not match 'yyyy-MM-dd' format
        }
        return true; // date string is in valid 'yyyy-MM-dd' format
    }
}
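
One caveat with the SimpleDateFormat check in the UDF above: SimpleDateFormat.parse(String) may stop before the end of the string, so a value with trailing text can still pass. If that matters for your data, a stricter check is possible with java.time. This is a different technique from the one in the answer, shown only as a minimal sketch in plain Java without the Pig classes. The class and method names (StrictDateCheck, isStrictDate, isLegacyDate) are made up for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

public class StrictDateCheck {
    // 'uuuu' is used instead of 'yyyy' because STRICT resolution
    // needs an era to resolve a year-of-era ('y') field.
    private static final DateTimeFormatter STRICT_FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd")
                    .withResolverStyle(ResolverStyle.STRICT);

    // Strict check: the whole string must be a real calendar date.
    public static boolean isStrictDate(String value) {
        if (value == null || value.isEmpty()) {
            return false;
        }
        try {
            LocalDate.parse(value, STRICT_FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Same logic as the UDF: non-lenient SimpleDateFormat.
    public static boolean isLegacyDate(String value) {
        if (value == null || value.isEmpty()) {
            return false;
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setLenient(false);
        try {
            format.parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"2020-08-20", "2020-08-20xyz", "2020-02-30", "NULL"};
        for (String s : samples) {
            System.out.println(s + " legacy=" + isLegacyDate(s)
                    + " strict=" + isStrictDate(s));
        }
    }
}
```

For the sample dates.txt both checks keep the same rows. The difference only shows up for values such as "2020-08-20xyz".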

> Getting the expected output

(2019-12-27,2020-08-20)
(2017-05-09,2018-10-04)
(2016-09-25,2020-01-19)
(2017-05-09,2020-08-20)
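
To see why only these four rows survive, the same null-or-empty check plus date parse can be tried outside Pig. The sketch below is plain Java and does not use the Pig classes. It assumes that PigStorage hands an empty field to the UDF as null, that the trailing empty line becomes a row of nulls under the declared schema, and that the literal text NULL arrives as an ordinary string. DateRowFilter, sampleRows and keepValid are hypothetical names used only here.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DateRowFilter {
    private static final String DATE_PATTERN = "yyyy-MM-dd";

    // Mirrors the UDF: null or empty is rejected before parsing.
    public static boolean isValidDate(String value) {
        if (value == null || value.isEmpty()) {
            return false;
        }
        SimpleDateFormat format = new SimpleDateFormat(DATE_PATTERN);
        format.setLenient(false);
        try {
            format.parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    // Rows of dates.txt; null stands in for an empty field (assumption).
    public static List<String[]> sampleRows() {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"2019-12-27", "2020-08-20"});
        rows.add(new String[] {"2017-05-09", "2018-10-04"});
        rows.add(new String[] {"2016-09-25", "2020-01-19"});
        rows.add(new String[] {null, "2020-08-20"});
        rows.add(new String[] {"NULL", "2017-09-28"});
        rows.add(new String[] {"2016-11-15", "NULL"});
        rows.add(new String[] {"2018-04-17", "Thu Aug-20 2020"});
        rows.add(new String[] {"2017-05-09", "2020-08-20"});
        rows.add(new String[] {"Mon Jan-20 2020", "2020-08-20"});
        rows.add(new String[] {null, null}); // the trailing empty line
        return rows;
    }

    // Equivalent of: FILTER dates BY IsValidDateTime(start_dt) AND IsValidDateTime(end_dt)
    public static List<String[]> keepValid(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (isValidDate(row[0]) && isValidDate(row[1])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        for (String[] row : keepValid(sampleRows())) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Under those assumptions the empty line is filtered out like any other row with a null field, so it does not need to be removed from the data file for this filter to work.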

huangapple
  • Posted on August 30, 2020 at 00:53:47
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63649567.html