MongoDB v5… ingest of public business data. I need to do a titleCase() of the $name field. Works great to about 400,000 records, then errors


Question

I end up with anywhere from 1.5 million to 3 million documents in the collection. The data is ingested from public government CSV files and aggregated into the gov_data.businesses collection. Everything is in ALL CAPS. I aggregated the data into a new collection with the address.city and name fields converted with $toLower. Now I need to title-case those fields. Using address.city instead of name in the following code takes a while (28 minutes) but succeeds. With name, however, it fails with TypeError: Cannot read properties of undefined (reading 'toUpperCase') after some 400,000 documents, at about 8 minutes. It feels like a data size issue, but I have no idea. I'm relatively new to aggregations and coding in mongo/mongosh.

I borrowed the script from here: https://stackoverflow.com/questions/63113037/how-to-update-field-value-to-tittlecase-in-mongodb

use gov_data
function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).map(function(word) {
        return word.replace(word[0], word[0].toUpperCase());
    }).join(' ');
}

console.log(titleCase(undefined));

console.log(titleCase(""));

console.log(titleCase(null));

console.log(titleCase("NAMAR"));

db.businesses.aggregate().forEach(function(doc){
    db.businesses.bulkWrite(
        { "_id": doc._id },
        { "$set": { "name": titleCase(doc.name) } }
    );
});


Answer 1

Score: 1


You might have a document whose name starts with a space or contains consecutive spaces. In that case split returns an array such as [ '', 'toto' ]. For the empty string, word[0] is undefined, so word[0].toUpperCase() throws the TypeError you are seeing.
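
The failure can be reproduced in plain JavaScript without MongoDB. This is a minimal sketch using the titleCase function as posted in the question; the sample strings are made up for illustration and are not taken from the dataset.

```javascript
// titleCase as posted in the question.
function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).map(function (word) {
        return word.replace(word[0], word[0].toUpperCase());
    }).join(' ');
}

// A leading or doubled whitespace character produces an empty-string element.
const leading = " toto".split(/\s/);       // ['', 'toto']
const doubled = "acme  corp".split(/\s/);  // ['acme', '', 'corp']

// ''[0] is undefined, so calling toUpperCase() on it throws a TypeError.
let caught = null;
try {
    titleCase(" toto");
} catch (err) {
    caught = err;
}
console.log(leading, doubled, caught instanceof TypeError);
```

This would also explain why address.city succeeds while name fails: a single name with stray whitespace is enough to abort the loop, regardless of collection size.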

You should update your titleCase function to handle this.
You could do it as follows (note that this also removes the extra spaces):

function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).reduce(function(element, word) {
        if (word.length > 0) {
            element.push(word.replace(word[0], word[0].toUpperCase()));
        }
        return element;
    }, []).join(' ');
}

This should fix your issue.
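
Before running it against millions of documents, it may be worth checking the revised function on a handful of edge cases in the shell. The sketch below repeats the revised function so it is self-contained; the sample inputs are hypothetical.

```javascript
// Revised titleCase: empty words produced by split are skipped.
function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).reduce(function (words, word) {
        if (word.length > 0) {
            words.push(word.replace(word[0], word[0].toUpperCase()));
        }
        return words;
    }, []).join(' ');
}

// Hypothetical inputs covering leading spaces, doubled spaces, and empty values.
const samples = [" toto", "acme  corp", "NAMAR", "", null, undefined];
const results = samples.map(titleCase);
console.log(results);
```

Falsy inputs (empty string, null, undefined) are returned unchanged because of the leading `str &&` guard, so documents without a usable name would not raise an error.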

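On top of that, you should consider using updateMany to update all documents in one command instead of iterating with forEach. Note that calling titleCase("$name") directly would run once in the shell on the literal string "$name", not on each document's value, so the function has to be handed to the server. One way to do that is the $function operator (MongoDB 4.4+, with server-side JavaScript enabled):

db.businesses.updateMany(
    { "name": { "$type": "string" } },
    [{ "$set": { "name": { "$function": { "body": titleCase, "args": ["$name"], "lang": "js" } } } }]
);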
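
If you would rather keep the conversion in the shell, the loop can still be made cheaper by sending the updates in batches instead of one call per document. The following is only a sketch under assumptions: it relies on bulkWrite accepting an array of updateOne operations, and toUpdateOp, titleCaseNames, and the batch size are hypothetical names and values chosen for illustration.

```javascript
// Revised titleCase: empty words produced by split are skipped.
function titleCase(str) {
    return str && str.toLowerCase().split(/\s/).reduce(function (words, word) {
        if (word.length > 0) {
            words.push(word.replace(word[0], word[0].toUpperCase()));
        }
        return words;
    }, []).join(' ');
}

// Hypothetical helper: build one updateOne operation for a document.
function toUpdateOp(doc) {
    return {
        updateOne: {
            filter: { _id: doc._id },
            update: { $set: { name: titleCase(doc.name) } }
        }
    };
}

// Hypothetical driver: intended to be called in mongosh as titleCaseNames(db, 1000).
function titleCaseNames(db, batchSize) {
    let ops = [];
    function flush() {
        // bulkWrite rejects an empty operations array, so only send non-empty batches.
        if (ops.length > 0) {
            db.businesses.bulkWrite(ops, { ordered: false });
            ops = [];
        }
    }
    // Only fetch documents whose name is a string, and only the name field.
    db.businesses.find({ name: { $type: 'string' } }, { name: 1 }).forEach(function (doc) {
        ops.push(toUpdateOp(doc));
        if (ops.length >= batchSize) {
            flush();
        }
    });
    flush(); // send whatever is left over
}
```

The block only defines the functions; nothing is written until titleCaseNames is called. A batch size around 1000 is a guess to tune, not a recommendation from the MongoDB documentation.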

  • Posted by huangapple on July 18, 2023, 03:21:17
  • When reposting, please keep this article's link: https://go.coder-hub.com/76707521.html