The first task is to fetch the text of the Web page. To do this, use the urlopen function from Python's urllib module.
from urllib import urlopen  # Python 2; in Python 3, urlopen lives in urllib.request
text = urlopen('http://the.web.url').read()
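The import above is the Python 2 form. As a rough sketch for Python 3, where urlopen lives in urllib.request and read() returns bytes rather than text, something like the following could be used. The helper name fetch_text and the UTF-8 fallback are assumptions made for illustration; they are not part of the original code.

```python
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Download a page and return its body as a str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset declared by the server, if any; otherwise fall
        # back to the assumed default.
        charset = response.headers.get_content_charset() or default_charset
    # Replace undecodable bytes instead of failing on a badly encoded page.
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# text = fetch_text('http://the.web.url')
```

Decoding with errors="replace" is a deliberate choice here: a few mangled characters matter less than being able to search the rest of the page.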
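Second, define the regular expression that identifies an email address. Compile it into a variable named pattern.
import re
pattern = re.compile(r"[\w!#$%&'*+/=?^_`{|}~-]+"    # first part of the local name
    + r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"             # further dot-separated parts
    + r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+"             # domain labels, each followed by a dot
    + r"(?:[a-zA-Z]{2}|"                              # two-letter country code, or
    + r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b")  # a generic top-level domain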
Lastly, use the compiled pattern to find every email address in the text extracted earlier.
pattern.findall(text)
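To see what findall returns, here is a small self-contained sketch that runs the pattern on a made-up string instead of a downloaded page. The sample text is invented for illustration, and the pattern is written with the backslashes (\w, \. and \b) that the expression needs, with a two-letter country code assumed for the first top-level-domain alternative.

```python
import re

# Same structure as the pattern described above. Every group is
# non-capturing, so findall returns whole matches rather than tuples.
pattern = re.compile(
    r"[\w!#$%&'*+/=?^_`{|}~-]+"
    r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+"
    r"(?:[a-zA-Z]{2}|"
    r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b"
)

# Invented sample text: two valid addresses and one without a domain suffix.
sample = ("Contact alice@example.com or bob.smith@mail.example.org. "
          "Not an address: foo@bar")

found = pattern.findall(sample)
print(found)  # ['alice@example.com', 'bob.smith@mail.example.org']
```

The trailing period after the second address is not included in the match, and foo@bar is skipped because the pattern requires at least one dot-terminated domain label before the top-level domain.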
This returns a Python list of all email addresses found in that Web page.
This is the full code listing:
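import re
from urllib import urlopen  # Python 2; in Python 3, urlopen lives in urllib.request

def extractEmail(theUrl):
    # Download the raw text of the page
    text = urlopen(theUrl).read()
    pattern = re.compile(r"[\w!#$%&'*+/=?^_`{|}~-]+"    # first part of the local name
        + r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"             # further dot-separated parts
        + r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+"             # domain labels, each followed by a dot
        + r"(?:[a-zA-Z]{2}|"                              # two-letter country code, or
        + r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b")  # a generic top-level domain
    # Return every address found, in the order it appears on the page
    return pattern.findall(text)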
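To remove duplicates from the list, wrap the result in set by changing the return statement of the function. Note that a set does not keep the addresses in their original order.
    return set(pattern.findall(text))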
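If the order in which the addresses appear on the page matters, a set is not ideal because it is unordered. One possible alternative, sketched here under the assumption of Python 3.7 or later (where dict keeps insertion order), is to deduplicate through dict.fromkeys. The function name unique_in_order and the sample list are made up for illustration.

```python
def unique_in_order(emails):
    """Drop repeated addresses, keeping the first occurrence of each."""
    # dict keys are unique and, since Python 3.7, keep insertion order.
    return list(dict.fromkeys(emails))


# Invented sample input with one repeated address.
found = ['a@example.com', 'b@example.org', 'a@example.com']
print(unique_in_order(found))  # ['a@example.com', 'b@example.org']
```

This treats addresses that differ only in letter case as distinct, exactly as set does.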
Please use this code wisely. Thank you.
Last update: Monday, May 31, 2010.