PySpark DataFrame Phone Number Format
Question
I have a table with approximately 1,000 rows and two columns. The first column is emp_id and the second is tel_num. The tel_num values are not formatted consistently: some examples are (555) 555-9876, +18763334455, and 505-999-8888x222, and some rows have no value. The goal is to format them all as the same 10 digits, without the leading 1 or any extension.
The table looks like this:
| emp_id | tel_num | 
|---|---|
| Jon Doe | +18763334455 | 
| Cal Foe | 505-999-8888x222 | 
| Ho Moe | nan | 
| GI Joe | 676.909.4321 | 
I am trying to turn it into this:
| emp_id | tel_format | 
|---|---|
| Jon Doe | (876) 333-4455 | 
| Cal Foe | (505) 999-8888 | 
| Ho Moe | nan | 
| GI Joe | (676) 909-4321 | 
I tried this format...
I'm using Databricks.
The process I tried looks roughly like this:
def formatphone(ph_var):
    ...some process
    return formatted_ph
df = df.withColumn('tel_format', formatphone(df.tel_num))
I can't get it to work.
Answer 1
Score: 0
You can use the following function, assuming that all possible formats are shown in your sample data.
To use this function in withColumn(), you need to create a UDF from it.
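from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# StringType is imported from pyspark.sql.types rather than accessed as F.StringType
@F.udf(returnType=StringType())
def format_telephone_number(phone_number):
    # Missing values: SQL NULL, an empty string, or the literal string 'nan'
    if not phone_number or phone_number == 'nan':
        return None
    # '+1XXXXXXXXXX': skip the '+' and the country code
    if phone_number[0] == '+':
        return '(' + phone_number[2:5] + ') ' + phone_number[5:8] + '-' + phone_number[8:12]
    # 'XXX-XXX-XXXX' or 'XXX.XXX.XXXX': slicing stops at index 12, which drops any extension
    if '-' in phone_number or '.' in phone_number:
        return '(' + phone_number[0:3] + ') ' + phone_number[4:7] + '-' + phone_number[8:12]
    # Any other format is not handled
    return None

# The decorated function can be passed a column directly
df = df.withColumn('tel_format', format_telephone_number(df.tel_num))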
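The function above depends on fixed character positions, so it only covers the formats in the sample table; a value such as (555) 555-9876 from the question text would be sliced incorrectly. A position-independent variant might look like the sketch below. It is only an illustration: the name normalize_phone is hypothetical, and it assumes that an extension always starts with "x" or "ext" and that a leading 1 on an 11-digit number is the US country code.

```python
import re

def normalize_phone(raw):
    """Return '(XXX) XXX-XXXX', or None when no 10-digit number can be recovered."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Treat empty strings and the literal 'nan' as missing values
    if not text or text.lower() == 'nan':
        return None
    # Assumption: an extension starts at the first 'x' or 'ext'; keep only the part before it
    text = re.split(r'\s*(?:ext\.?|x)', text, maxsplit=1, flags=re.IGNORECASE)[0]
    # Keep digits only, discarding '+', '(', ')', '-', '.', and spaces
    digits = re.sub(r'\D', '', text)
    # Assumption: an 11-digit number starting with 1 carries the US country code
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f'({digits[:3]}) {digits[3:6]}-{digits[6:]}'
```

Because this is a plain Python function, it can be checked locally without a cluster. On Databricks it could then be wrapped with `F.udf(normalize_phone, StringType())` and applied through `withColumn()` in the same way as the function above.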