
Thursday, December 11, 2008

How to extract email from a Web page using Python

Building an email address extractor for a Web page is easy with Python and regular expressions (regex).

The first task is to fetch the text of the Web page. To do that, use the urlopen function from Python's urllib module, then call read() on the result.

from urllib import urlopen  # Python 2; in Python 3: from urllib.request import urlopen
text = urlopen('http://the.web.url').read()
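
The import above is the Python 2 form. As a rough sketch of the same step on Python 3, the function lives in urllib.request and read() returns bytes, so the text has to be decoded before a string pattern can search it. The helper name fetch_text and the UTF-8 fallback are assumptions made for illustration, not part of the original post.

```python
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body as a str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes on Python 3
        # Use the charset declared in the Content-Type header, if any;
        # falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" keeps going if a byte cannot be decoded.
    return raw.decode(charset, errors="replace")
```

Calling fetch_text('http://the.web.url') would then play the role of the text variable used in the following steps.
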
Second, define the regular expression that identifies an email address, and compile it into a variable named pattern.
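import re

pattern = re.compile(r"[\w!#$%&'*+/=?^_`{|}~-]+"   # local part
    + r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"           # more dot-separated local parts
    + r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+"           # domain labels, each followed by a dot
    + r"(?:[a-zA-Z]{2}|"                           # two-letter country code, or
    + r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b")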

Lastly, call the compiled pattern's findall method on the text fetched earlier.
pattern.findall(text)
This returns a Python list of all the email addresses found in that Web page.
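
To see what findall gives back without touching the network, the pattern can be tried on a small in-memory string. This is only a sketch: the sample text and the name EMAIL_PATTERN are made up for illustration, the regex is written with its backslashes spelled out, and the two-letter country-code alternative is assumed to be [a-zA-Z]{2}.

```python
import re

# Assumption: the post's pattern with backslashes spelled out and the
# country-code alternative written as [a-zA-Z]{2}.
EMAIL_PATTERN = re.compile(
    r"[\w!#$%&'*+/=?^_`{|}~-]+"
    r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+"
    r"(?:[a-zA-Z]{2}|"
    r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b"
)

# Made-up sample text standing in for a downloaded page.
sample = "Contact alice@example.com or bob.smith@mail.example.org for details."

# Every group in the pattern is non-capturing, so findall returns
# the whole matched address rather than a tuple of groups.
emails = EMAIL_PATTERN.findall(sample)
print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
```

Because the domain part only accepts a lowercase letter or digit at the start of each label, an address such as user@Example.com would not be matched in full. Passing re.IGNORECASE to re.compile is one way to relax that.
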

This is the full code listing:
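import re
from urllib import urlopen  # Python 2; in Python 3: from urllib.request import urlopen

def extractEmail(theUrl):
    # Fetch the raw text of the page.
    text = urlopen(theUrl).read()
    # Local part, "@", dot-separated domain labels, then a two-letter
    # country code or one of the listed top-level domains.
    pattern = re.compile(r"[\w!#$%&'*+/=?^_`{|}~-]+"
                + r"(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*"
                + r"@(?:[a-z0-9](?:[\w-]*[\w])?\.)+"
                + r"(?:[a-zA-Z]{2}|"
                + r"com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b")
    # All groups are non-capturing, so findall returns whole matches.
    return pattern.findall(text)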

To remove duplicates, change the return statement to wrap the result in set. The function then returns a set rather than a list.
return set(pattern.findall(text))
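
A set does not keep the addresses in the order they appear on the page. If that order matters, one possible alternative is sketched below. It relies on dict preserving insertion order (guaranteed from Python 3.7), and unique_emails is a hypothetical helper name, not part of the original listing.

```python
def unique_emails(emails):
    """Drop exact duplicates while keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(emails))


found = ["a@example.com", "b@example.org", "a@example.com"]
print(unique_emails(found))  # ['a@example.com', 'b@example.org']
```

This compares addresses exactly, so A@example.com and a@example.com are kept as two separate entries.
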
Please use this code wisely. Thank you.

Last update: Monday, May 31, 2010.