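I have a program that extracts email addresses and phone numbers.
I've checked it, and the phone numbers come out fine. The emails, however, come out wrong:
For example: 3465usjohnson@astate.eduprovost instead of sjohnson@astate.edu
The surrounding text it was extracted from:
870-972-3465usjohnson@astate.eduprovost and vice chancellor lynita cooksey870-972-2 030 870-972-2036email:ulcooksey@astate.edu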
In the actual PDF there is whitespace between these entries, but when the text is copied and pasted the spacing disappears, which is how I end up with these run-together emails.
#! python3
import re, pyperclip
# Regex for phone numbers
phoneRegex = re.compile(r'''
    # 860-555-3951, 555-3951, (860) 555-3951, 555-3951 ext 12345, ext. 12345, x12345
    (
    ((\d\d\d)|(\(\d\d\d\)))?   # area code (optional)
    (\s|-)                     # first separator
    \d\d\d                     # first 3 digits
    -                          # second separator
    \d\d\d\d                   # last 4 digits
    (((ext(\.)?\s)|x)          # extension word (optional)
    (\d{2,5}))?                # extension number (optional)
    )
''', re.VERBOSE)
# Regex for emails
emailRegex = re.compile(r'''
    # e.g. some._+thing@example.com
    [a-zA-Z0-9_.+]+   # name part
    @                 # @ symbol
    [a-zA-Z0-9_.+]+   # domain
''', re.VERBOSE)
# Get the text off the clipboard
text = pyperclip.paste()
# Extract
extractedPhone = phoneRegex.findall(text)
extractedEmail = emailRegex.findall(text)
allPhoneNumbers = []
for phoneNumber in extractedPhone:
    # findall returns a tuple of groups; the first group is the whole number
    allPhoneNumbers.append(phoneNumber[0])
# Copy the results to the clipboard, one item per line
results = '\n'.join(allPhoneNumbers + extractedEmail)
pyperclip.copy(results)
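
To reproduce the problem without the clipboard, here is a minimal sketch with part of the sample text hard-coded. The names `email_regex`, `sample`, and `found` are only for this illustration; the pattern is the same as the email regex in the script above.

```python
import re

# Same pattern as the email regex in the script
email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+   # name part
    @                 # @ symbol
    [a-zA-Z0-9_.+]+   # domain
''', re.VERBOSE)

# Part of the pasted text, with the spacing already lost
sample = "870-972-3465usjohnson@astate.eduprovost and vice chancellor"

found = email_regex.findall(sample)
print(found)  # ['3465usjohnson@astate.eduprovost']
```

Both character classes keep matching letters and digits until they hit a character outside the class, so the trailing digits of the phone number and the word after `.edu` end up inside the match.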