Pyspark Dataframe Phone Number Format

Question

I have a table with approximately 1K rows and two columns. The first column is emp_id and the second is tel_num. The tel_num values are not formatted consistently: some examples are (555) 555-9876, +18763334455, and 505-999-8888x222, and some rows have no value. The goal is to format them all as the same 10 digits, without the leading 1 or any extension.

The table looks like this:

emp_id tel_num
Jon Doe +18763334455
Cal Foe 505-999-8888x222
Ho Moe nan
GI Joe 676.909.4321

I'm trying to turn it into this:

emp_id tel_format
Jon Doe (876) 333-4455
Cal Foe (505) 999-8888
Ho Moe nan
GI Joe (676) 909-4321

I'm using Databricks. What I have tried so far looks roughly like this:

def formatphone(ph_var):
    ...some process
    return formatted_ph

df = df.withColumn('tel_format', formatphone(df.tel_num))

I can't get it to work.

Answer 1

Score: 0

You can use the following function, assuming that your sample data shows all possible formats. A plain Python function cannot be passed a column inside withColumn(); it has to be turned into a UDF first, which is what the @F.udf decorator does.

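from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())  # StringType lives in pyspark.sql.types, not in functions
def format_telephone_number(phone_number):
    # Missing values and the literal string 'nan' have nothing to format.
    if phone_number is None or phone_number == 'nan':
        return None
    # '+18763334455': skip the '+' and the country code 1.
    if phone_number[0] == '+':
        return '(' + phone_number[2:5] + ') ' + phone_number[5:8] + '-' + phone_number[8:12]
    # '505-999-8888x222' or '676.909.4321': the separators sit at index 3 and 7,
    # and slicing up to index 12 drops any extension.
    # A value like '(555) 555-9876' is not covered and would be sliced incorrectly.
    if '-' in phone_number or '.' in phone_number:
        return '(' + phone_number[0:3] + ') ' + phone_number[4:7] + '-' + phone_number[8:12]
    # Any other layout is not handled.
    return None

# Apply the UDF to the column from the question.
df = df.withColumn('tel_format', format_telephone_number(df.tel_num))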
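
If the real data contains layouts beyond the samples (for example the (555) 555-9876 form mentioned in the question), fixed slice positions stop working. A minimal sketch of a more tolerant approach is to strip everything except digits and then format whatever 10 digits remain. The function name normalize_phone is hypothetical, and the rules it applies (an "x" or "ext" starts an extension, a leading 1 on an 11-digit number is a country code, anything else is rejected) are assumptions about the data, not something the question confirms. It is plain Python, so it can be checked without a cluster:

```python
import re
from typing import Optional


def normalize_phone(raw: Optional[str]) -> Optional[str]:
    """Return '(AAA) BBB-CCCC', or None when 10 digits cannot be recovered."""
    if raw is None:
        return None
    text = raw.strip().lower()
    # Treat empty strings and the literal 'nan' as missing values.
    if text in ("", "nan"):
        return None
    # Assumption: an extension starts at the first 'ext' or 'x'; drop it.
    text = re.split(r"ext|x", text, maxsplit=1)[0]
    # Keep digits only, discarding '+', spaces, dots, dashes and parentheses.
    digits = re.sub(r"\D", "", text)
    # Assumption: an 11-digit number starting with 1 carries a country code.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    # Anything that is not exactly 10 digits is rejected rather than guessed at.
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


if __name__ == "__main__":
    for sample in ["+18763334455", "505-999-8888x222", "nan", "676.909.4321"]:
        print(sample, "->", normalize_phone(sample))
```

On Databricks this could be wrapped with F.udf(normalize_phone, StringType()) and used in withColumn() the same way as the function above. Note that it returns None for rows without a usable number, so those rows end up as null rather than the string nan shown in the desired output.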

huangapple
  • Posted on January 6, 2023, 12:03:49
  • Please keep the link to this post when reposting: https://go.coder-hub.com/75026829.html