Parse Kinesis data stream in AWS Lambda Java
Question
I am creating an AWS Lambda function in Java to process a Kinesis Data Stream.
My current setup of parsing involves:
- Stringify using UTF-8 as suggested in AWS Documentation
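for (KinesisEvent.KinesisEventRecord rec : event.getRecords()) {
    // java.nio.charset.StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException of the "UTF-8" overload
    String stringRecords = new String(rec.getKinesis().getData().array(), StandardCharsets.UTF_8);
    // pageEvent is not defined in this snippet; presumably it is derived from stringRecords
    pageEventList.add(pageEvent);
}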
- Clean up characters using Regex Patterns
a. non-ascii: "[^\\x00-\\x7F]";
b. ascii-control-characters: "[\\p{Cntrl}&&[^\r\n\t]]";
c. non-printable-characters: "\\p{C}";
- Wrap the concatenated JSON objects, which arrive without square brackets or separating commas, into a JSON array string
// Drop anything before the first '{' and open the array.
int firstBeginningCurlyBracketIndex = cleanString.indexOf("{");
if (firstBeginningCurlyBracketIndex != -1) {
    cleanString = "[{" + cleanString.substring(firstBeginningCurlyBracketIndex + 1);
}
// Drop anything after the last '}' and close the array.
int lastIndexOfCurlyBracketIndex = cleanString.lastIndexOf("}");
if (lastIndexOfCurlyBracketIndex != -1) {
    cleanString = cleanString.substring(0, lastIndexOfCurlyBracketIndex) + "}]";
}
// Insert the missing commas between adjacent objects.
cleanString = cleanString.replace("}{", "},{");
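A minimal sketch of step 1 using only the JDK, under the assumption (not confirmed above) that some of the stray characters come from the decoding step. `ByteBuffer.array()` returns the entire backing array and ignores the buffer's position and limit, so decoding only the readable region with a strict `CharsetDecoder` avoids reading extra bytes and flags payloads that are not valid UTF-8 text, rather than stripping characters afterwards. `RecordDecoder` and `decodeUtf8` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class RecordDecoder {

    /**
     * Decodes the readable region (position..limit) of the buffer as UTF-8.
     * Returns an empty Optional when the bytes are not valid UTF-8.
     */
    public static Optional<String> decodeUtf8(ByteBuffer data) {
        try {
            CharBuffer chars = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    // duplicate() leaves the caller's position and limit untouched
                    .decode(data.duplicate());
            return Optional.of(chars.toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        byte[] backing = "xx{\"a\":1}yy".getBytes(StandardCharsets.UTF_8);
        // Only the 7 bytes starting at offset 2 belong to this (made-up) record.
        ByteBuffer slice = ByteBuffer.wrap(backing, 2, 7);
        System.out.println(decodeUtf8(slice).orElse("<not UTF-8>"));
        System.out.println(new String(slice.array(), StandardCharsets.UTF_8));
    }
}
```

In the handler this would be called as `RecordDecoder.decodeUtf8(rec.getKinesis().getData())`. An empty result suggests the record is not plain UTF-8 text and should be inspected as bytes before any regex cleanup.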
At this point I use a regex to split the string into individual JSON objects and then parse each one. Reference: https://stackoverflow.com/questions/17759004/how-to-match-string-within-parentheses-nested-in-java/17759264#17759264
// Matches a {...} block that contains at most two further levels of nested braces.
String REGEX_BRACKET_PATTERN_TWO_LAYERS = "(\\{(?:[^}{]+|\\{(?:[^}{]+|\\{[^}{]*\\})*\\})*\\})";
Pattern splitDelRegex = Pattern.compile(REGEX_BRACKET_PATTERN_TWO_LAYERS);
Matcher regexMatcher = splitDelRegex.matcher(nonAsciiRemovedString);
List<String> matcherList = new ArrayList<>();
while (regexMatcher.find()) {
    matcherList.add(regexMatcher.group(1));
}
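As an alternative to the fixed-depth regex, the splitting can be sketched as a small brace-depth scanner. This is an illustration, not the approach used above. It handles arbitrary nesting and ignores braces inside string literals, which the regex cannot do. `JsonObjectSplitter` and `splitTopLevelObjects` are hypothetical names. The results are only candidates: garbage that happens to contain `{` can still produce a fragment that a real JSON parser must reject.

```java
import java.util.ArrayList;
import java.util.List;

public class JsonObjectSplitter {

    /** Returns every balanced top-level {...} block found in the input, in order. */
    public static List<String> splitTopLevelObjects(String input) {
        List<String> objects = new ArrayList<>();
        int depth = 0;            // current brace nesting level
        int start = -1;           // index of the '{' that opened the current block
        boolean inString = false; // inside a "..." literal of the current block
        boolean escaped = false;  // previous character was a backslash inside a literal

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (depth == 0) {
                // Outside any object: skip garbage until the next '{'.
                if (c == '{') {
                    depth = 1;
                    start = i;
                }
                continue;
            }
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    objects.add(input.substring(start, i + 1));
                }
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        String raw = "{\"name\":\"banana\",\"description\":\"a {brace}\"}GD~{}FDSE-}";
        for (String candidate : splitTopLevelObjects(raw)) {
            System.out.println(candidate);
        }
    }
}
```

Because the scanner emits one string per object, the surrounding `[` and `]` and the inserted commas are not needed for this path. Each candidate can be handed to Gson or Jackson on its own.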
I have attempted to use Gson and Jackson to parse the JSON array string produced by step 3 (ref: https://stackoverflow.com/questions/2591098/how-to-parse-json-in-java). Parsing works until an invalid fragment appears in the data stream, at which point an exception is thrown: java.lang.Exception: com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_ARRAY but was STRING at line 2 column 1 path $
The invalid JSON that causes this exception looks something like this:
[
  ...
  {
    "name": "banana",
    "description": "description"
  },
  {
    "name": "orange",
    "description": "description"
  }
  GD~
  {}
  FDSE-}
]
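One way to keep a single bad fragment from failing the whole batch is to parse each candidate object on its own and set aside the ones that fail. The following is only a sketch of that idea: `TolerantBatchParser`, `Result`, and `parseEach` are hypothetical names, and `Integer::parseInt` stands in for a real JSON parser so the example runs with the JDK alone. With Gson, the parser argument could be something like `s -> gson.fromJson(s, PageEvent.class)`, where `PageEvent` is an assumed target class. Gson's `JsonSyntaxException` is unchecked, so the `catch` below would see it. Jackson's `readValue` throws checked exceptions and would need wrapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class TolerantBatchParser {

    /** Successfully parsed values plus the raw fragments that were rejected. */
    public static final class Result<T> {
        public final List<T> parsed = new ArrayList<>();
        public final List<String> rejected = new ArrayList<>();
    }

    /** Applies the parser to each candidate; a failure affects only that candidate. */
    public static <T> Result<T> parseEach(List<String> candidates, Function<String, T> parser) {
        Result<T> result = new Result<>();
        for (String candidate : candidates) {
            try {
                result.parsed.add(parser.apply(candidate));
            } catch (RuntimeException e) {
                // Keep the raw text so it can be logged or sent to a dead-letter destination.
                result.rejected.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Integer::parseInt is a stand-in for a JSON parser in this sketch.
        Result<Integer> result = parseEach(Arrays.asList("1", "GD~", "2"), Integer::parseInt);
        System.out.println("parsed=" + result.parsed + " rejected=" + result.rejected);
    }
}
```

Collecting the rejected fragments instead of discarding them also makes it easier to see whether the garbage follows a pattern, which bears on the data-quality question.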
My questions are:
- Because the trailing garbage is unpredictable, I am having difficulty turning the whole string into a valid JSON array string. Does anybody have a good idea for making sure this JSON array string is always valid?
- The steps described above do work with the regex, although I still see the random string at the end. If anybody has experience with parsing a Kinesis Data Stream into JSON, please share it with the community. I feel the AWS documentation on Lambda with Kinesis is not detailed enough to cover the whole parsing process.
I am also aware that all of this may simply come down to the quality of the data in the stream. It would be good to hear how other people handle their data in this situation.
Answer 1
Score: 0
I tried this with the Gson library:
String jsonString = "{\"username\": \"apple2\", \"description\": \"this is an example{where problem is2}\" }";
GsonBuilder builder = new GsonBuilder();
Map<String,String> o = (Map<String, String>) builder.create().fromJson(jsonString, Object.class);
System.out.println("Map object : " + o);
System.out.println("UserName : " + o.get("username"));
System.out.println("Description : " + o.get("description"));
Output:
Map object : {username=apple2, description=this is an example{where problem is2}}
UserName : apple2
Description : this is an example{where problem is2}