28 May 2012

An Exercise: Contact Scraping or How to Get Hired with ZeroMail

I’m trying out ZeroMail and saw their job listing at zeromail.

Though I’m happily and gainfully employed, it looked like an interesting exercise. The challenge is (verbatim from website): (1) Download http://zeromail.com/static/download/emails.txt (2) Write a python program that extracts contact information from signatures in the emails (3) Send your solution to bart@zeromail.com or skype it to bartjellema

After an hour of tinkering with it on my own terms (my Ruby > my Python), I’m realizing it’s more complex than I initially expected.

I see it as a multipart process: (1) Strip junk characters and lines out of the data (2) Determine and locate signature blocks (3) Extract pertinent data from signature blocks (might resort to wordlists of given names)

I tried using easily identifiable information through regexes (ie looking for Australian and US format phone numbers). This yielded decent results on the numbers themselves and I’m considering using phone number’s (or email addresses’) line numbers as ‘hotspots’ that I can extract and further process.

If I spend more time on the process, I’ll post my results here.