
Email extraction begins and ends with unwanted characters (Python)

Miraj Keshwala • 4 years ago • 676 views

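So I have a program that extracts emails and phone numbers. I've checked it, and the phone numbers come out fine. The emails, however, come out like this, e.g. 3465Usjohnson@astate.eduUProvost instead of sjohnson@astate.edu. The surrounding text they are extracted from: 870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu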

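In the actual PDF there is whitespace between these fields, but when the text is copied and pasted the spacing disappears, which is why the extracted emails come out like this. (The original post included a screenshot of the PDF here.)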

#! python3

import re, pyperclip

# Regex for phone numbers
phoneRegex = re.compile(r'''
# 860-555-3951, 555-3951, (860) 555-3951, 555-3951 ext 12345, ext. 12345, x12345
(
((\d\d\d)|(\(\d\d\d\)))?    #area code (optional)
(\s|-)              #first separator
\d\d\d              #first 3 digits
-                   #second separator
\d\d\d\d            #last 4 digits
(((ext(\.)?\s)|x)   #Extension-words (optional)
(\d{2,5}))?         #Extension - numbers (optional)
)
''', re.VERBOSE)


#Regex for Emails
emailRegex = re.compile(r'''
#some._+thing@something.com

[a-zA-Z0-9_.+]+   #Name part 
@    #@ symbol
[a-zA-Z0-9_.+]+ #domain


''', re.VERBOSE)


#get the text from the clipboard with pyperclip
text = pyperclip.paste()



#extract
extractedPhone = phoneRegex.findall(text)
extractedEmail = emailRegex.findall(text)

allPhoneNumbers = []
for phoneNumber in extractedPhone:
    allPhoneNumbers.append(phoneNumber[0])


#copy to clipboard
# join both lists in one call so the last phone number and the first email do not end up on the same line
results = '\n'.join(allPhoneNumbers + extractedEmail)
pyperclip.copy(results)
Post URL: http://www.python88.com/topic/38207
2 replies  |  latest reply 4 years ago
SKD • Reply #1 • 5 years ago

I'm new to Python. If the text comes from the 'astate.edu' website, I think you can use this regex:

import re

text = '70-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'
# [a-z]+ accepts lower-case letters only, so the leading digits and the 'U' are left out
email = re.findall(r'[a-z]+@\w+\.edu', text)
print(email)
# Output: ['sjohnson@astate.edu', 'lcooksey@astate.edu']

Good luck!
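
To make the difference concrete, here is a minimal sketch comparing the character class from the question with the lower-case-only pattern from this reply. The sample string is taken from the question; the variable names `sample`, `greedy` and `strict` are illustrative. It assumes that local parts are lower-case letters only and that every address ends in .edu, which holds for the sample but not for email addresses in general.

```python
import re

# Sample text from the question, where the PDF spacing was lost on copy/paste.
sample = '870-972-3465Usjohnson@astate.eduUProvost and Vice Chancellor'

# Character class from the question: digits and upper-case letters are
# allowed on both sides of '@', so the glued-on neighbours are swallowed.
greedy = re.findall(r'[a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+', sample)

# Assumption: lower-case local part and a fixed '.edu' suffix.
strict = re.findall(r'[a-z]+@\w+\.edu', sample)

print(greedy)  # ['3465Usjohnson@astate.eduUProvost']
print(strict)  # ['sjohnson@astate.edu']
```

The stricter pattern would miss addresses such as j.smith2@astate.edu, so it is only a fit when the data really looks like the sample.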

FailSafe • Reply #2 • 5 years ago

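Since I don't have your original text, I'll use the string from your example.

See whether either of the two regexes below works for you. I've also included a third, more precise one.

r'(?<=\dU)[\w]+@[\w\.]+?(?=U|\s|$)'

r'(?<=\dU)[\w]+@[\w]+\.[\w]+?(?=U|\s|$)'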

Sample test

>>> import re
>>> string = '''3465Usjohnson@astate.eduUProvost instead of sjohnson@astate.edu The surround text that it is being extracted from: 870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'''

>>> re.findall(r'(?<=\dU)[\w]+@[\w\.]+?(?=U|\s|$)', string)
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']

>>> re.findall(r'(?<=\dU)[\w]+@[\w]+\.[\w]+?(?=U|\s|$)', string)
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']
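
One thing worth noticing in the output above: the test string contains a cleanly spaced sjohnson@astate.edu as well, and it is skipped, because the lookbehind `(?<=\dU)` only accepts an address that directly follows a digit and a 'U'. A small sketch of that behaviour, with a hypothetical variant that also accepts an address preceded by whitespace (the sample string and the names `glued_only` and `either` are made up for illustration):

```python
import re

# Made-up sample: one cleanly spaced address and one glued to a phone number.
sample = 'Contact sjohnson@astate.edu or 870-972-2036Ulcooksey@astate.edu'

# Lookbehind as in the reply: the address must follow a digit plus 'U'.
glued_only = re.findall(r'(?<=\dU)\w+@\w+\.\w+?(?=U|\s|$)', sample)

# Hypothetical variant: an address may also follow a whitespace character.
either = re.findall(r'(?:(?<=\dU)|(?<=\s))\w+@\w+\.\w+?(?=U|\s|$)', sample)

print(glued_only)  # ['lcooksey@astate.edu']
print(either)      # ['sjohnson@astate.edu', 'lcooksey@astate.edu']
```

The variant still would not pick up an address at the very start of the text, and it is only a reasonable choice if the pasted data mixes both layouts.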


A bit more precise, since the emails all end in .edu:

r'(?<=\dU)[\w]+@[\w]*\.edu(?=U|\s|$)'

Sample test

>>> string = '''3465Usjohnson@astate.eduUProvost instead of sjohnson@astate.edu The surround text that it is being extracted from: 870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'''

>>> re.findall(r'(?<=\dU)[\w]+@[\w]*\.edu(?=U|\s|$)', string)
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']
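
For use in the question's script, the .edu pattern could be written in the same `re.VERBOSE` style as the original regexes. This is only a sketch under the same assumption as above: each address is preceded by a digit plus a stray 'U' and followed by 'U', whitespace, or the end of the text. The names `email_regex` and `extract_emails` are illustrative and not from the original post, and the clipboard step is replaced by a plain string argument so the example runs without pyperclip.

```python
import re

# Assumption: addresses look like '...3465Usjohnson@astate.eduUProvost...',
# i.e. digit + 'U' before the address and 'U', whitespace or end after it.
email_regex = re.compile(r'''
    (?<=\dU)      # preceded by a digit and the stray 'U' (not part of the match)
    \w+           # local part
    @             # @ symbol
    \w*\.edu      # domain ending in .edu
    (?=U|\s|$)    # followed by 'U', whitespace, or end of text
''', re.VERBOSE)


def extract_emails(text):
    """Return every address in text that fits the assumed layout."""
    return email_regex.findall(text)


pasted = ('870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. '
          'Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu')
print(extract_emails(pasted))
```

Because the lookarounds are zero-width, the digit, the 'U' and whatever follows the address are checked but never returned, so `findall` yields just the addresses.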