PySpark DataFrame Phone Number Format
Question
I have a table with approximately 1K rows and two columns. The first column is emp_id and the second is tel_num. The tel_num column is not formatted consistently: some examples are (555) 555-9876, +18763334455, and 505-999-8888x222, and some rows have no value. The goal is to format them all as the same 10 digits, without the leading 1 or any extension.
The table looks like the following
emp_id | tel_num |
---|---|
Jon Doe | +18763334455 |
Cal Foe | 505-999-8888x222 |
Ho Moe | nan |
GI Joe | 676.909.4321 |
I am trying to turn it into this:
emp_id | tel_format |
---|---|
Jon Doe | (876) 333-4455 |
Cal Foe | (505) 999-8888 |
Ho Moe | nan |
GI Joe | (676) 909-4321 |
I'm using Databricks. The process I tried looks roughly like this:
def formatphone(ph_var):
    ...some process
    return formatted_ph

df = df.withColumn('tel_format', formatphone(df.tel_num))
I can't get it to work.
Answer 1
Score: 0
You can use the following function, assuming that all possible formats are shown in your sample data. To use this function in withColumn(), you need to create a UDF from it.
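from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def format_telephone_number(phone_number):
    # Missing values (null, empty, or the literal string 'nan') become null
    if not phone_number or phone_number == 'nan':
        return None
    # '+1XXXXXXXXXX': skip the '+' and the leading country code
    if phone_number[0] == '+':
        return '(' + phone_number[2:5] + ') ' + phone_number[5:8] + '-' + phone_number[8:12]
    # 'XXX-XXX-XXXX' or 'XXX.XXX.XXXX'; slicing to index 12 drops any extension
    if '-' in phone_number or '.' in phone_number:
        return '(' + phone_number[0:3] + ') ' + phone_number[4:7] + '-' + phone_number[8:12]
    # Any other format is not handled
    return None

df = df.withColumn('tel_format', format_telephone_number(df.tel_num))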
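The slicing above depends on each number matching one of the sample formats exactly. For instance, (555) 555-9876 from the question contains a '-' and would be sliced at the wrong positions. A more tolerant variant is sketched below, assuming all numbers are US-style: 10 digits, an optional leading 1, and an optional trailing extension marked with x or ext. The name normalize_phone is a hypothetical helper, not part of PySpark, and the logic is plain Python so it can be checked without a cluster.

```python
import re

def normalize_phone(raw):
    """Return '(XXX) XXX-XXXX' for a US-style number, or None if it cannot be parsed."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Treat empty strings and 'nan' (string or float NaN) as missing
    if not text or text.lower() == 'nan':
        return None
    # Drop a trailing extension such as 'x222' or 'ext. 222' before collecting digits
    main = re.split(r'\s*(?:ext\.?|x)\s*\d+\s*$', text, flags=re.IGNORECASE)[0]
    digits = re.sub(r'\D', '', main)
    # Drop a leading country code 1 from an 11-digit number
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    # Anything that is not exactly 10 digits is considered unparseable
    if len(digits) != 10:
        return None
    return f'({digits[:3]}) {digits[3:6]}-{digits[6:]}'
```

It could then be registered the same way as the function above, for example with F.udf(normalize_phone, StringType()), and applied through df.withColumn('tel_format', ...). Numbers that do not reduce to exactly 10 digits come back as null rather than as a partially formatted string, which makes bad rows easy to filter and inspect.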