I recently read an article showing that Google was the email provider for the majority of the top startups in the US. This inspired me to find out, on a larger scale, which companies are the top providers of email across the whole Internet. What I ended up doing, however, was gathering some stats on the market share of Mail Exchanger (MX) providers. The difference is subtle but important: an email provider offers a complete package in which a customer is given an email address, an inbox where incoming emails are stored, and a facility to send out emails. An MX provider, on the other hand, only provides servers to receive emails. For example, it doesn’t necessarily provide storage (e.g. a forward-only service).
The rest of this post documents how I went about it, the results and some other notes.
- I used the .com zone file as the starting point for a list of domains. This zone file is available upon request.
- A custom written script was used to resolve the MX records for each domain by doing recursive lookups.
- Multiple copies of the script ran in parallel to collect the data.
- The collected MX domains were counted by their 2nd level domain and reverse sorted by count.
- The top n records from the previous list were looked up in the WHOIS database for their registering organization.
- All domains were then matched against the previous list by their MX records and grouped by organization name, and the top m organizations were chosen as the final result.
- The alternative to using the zone file is brute-forcing domain names. With that approach you will never get a complete list, or anything close to it, and you will soon be banned for sending excessive requests to the TLD servers.
- I wrote a custom script for making the DNS requests because an hour of playing around with unbound didn’t get me the speed I wanted. The script, on the other hand, sends and receives packets sequentially on a single UDP port, and parallelizing it is as simple as running it multiple times on different ports.
- The stats are only for .com domains. Note that .com makes up nearly half of all domains.
- The zone file is not guaranteed to be an accurate record of registered domains. However, given the purpose we’re using it for, it’s not that important if a few hundred or even a few thousand domains are off.
- Only those domains whose nameservers (NS) responded to MX queries at the time of this experiment were considered (zone file retrieved on 4 November 2015). 34% of NSes either did not respond or did not list MX records.
- Results show the percentage of domains, and not users.
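The counting step from the method above (group the collected MX hosts by their 2nd level domain, then sort by count) can be sketched roughly as follows. This is not the original script: the function names and the input shape (a dict mapping each domain to its list of MX hostnames) are assumptions made for illustration. Two simplifications are also assumed: a provider is counted at most once per domain even if the domain lists several of its MX hosts, and the "2nd level domain" is naively taken as the last two labels, which would misgroup hosts under suffixes such as .co.uk.

```python
from collections import Counter


def second_level_domain(host: str) -> str:
    """Return the last two labels of a hostname, lowercased.

    Naive on purpose: 'alt1.aspmx.l.google.com.' -> 'google.com'.
    """
    labels = host.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def count_mx_providers(mx_by_domain: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Count how many domains point at each MX 2nd level domain.

    Returns (provider, domain_count) pairs, highest count first.
    """
    counts: Counter = Counter()
    for mx_hosts in mx_by_domain.values():
        # Assumption: a set, so each provider counts once per domain.
        counts.update({second_level_domain(h) for h in mx_hosts})
    return counts.most_common()
```

Because the result counts domains rather than MX records, it matches the note above that the stats are percentages of domains, not users.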
Below is the breakdown of the top 16 MX providers for 79.8 million domains that listed MXes, out of 120 million .com domains in total:
- I had previously written a DNS packet parser library in some 500 lines of Python, which made writing the resolver script a 10-minute task.
- The custom resolver script was rather dumb. For instance, it couldn’t deal with a CNAME as the response to an MX request. So for about 8% of domains, I had to use the recursive resolver provided by my cloud provider. It seems like I could have gotten away with using that to begin with, as there didn’t seem to be limits of any kind.
- Splitting data into separate files and running one instance of the script for each file well outperformed other options I tried, GNU parallel being the main one. The way I used GNU parallel was to split the input stream using the --block option and pass the data in round-robin fashion to persistent processes using:
  cat domains | parallel --pipe --block 16K --round resolver.py
  Unfortunately, parallel died long before all the source data was consumed and I couldn’t figure out why.
- I learnt about the threaded options of various unix tools I used. For instance, pigz and the --parallel option of sort.
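To give an idea of what the packet handling in such a resolver involves, here is a minimal sketch of building an MX query and parsing the answer section, based on the DNS wire format in RFC 1035. It is not the author's library or script: every function name here is hypothetical, and the parser only collects MX and CNAME answers (so a caller could at least detect the CNAME case mentioned above). The `query_mx` helper assumes a single UDP exchange with no retries and no handling of truncated responses.

```python
import socket
import struct

TYPE_CNAME, TYPE_MX, CLASS_IN = 5, 15, 1


def encode_name(name: str) -> bytes:
    """Encode 'example.com' as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_mx_query(domain: str, query_id: int, recursion_desired: bool = True) -> bytes:
    """Build a DNS query message with one question: <domain> IN MX."""
    flags = 0x0100 if recursion_desired else 0  # RD bit
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    return header + encode_name(domain) + struct.pack("!HH", TYPE_MX, CLASS_IN)


def read_name(msg: bytes, offset: int) -> tuple[str, int]:
    """Decode a (possibly compressed) name; return it and the offset after it."""
    labels, end, hops = [], None, 0
    while True:
        length = msg[offset]
        if length & 0xC0 == 0xC0:  # compression pointer to an earlier name
            if end is None:
                end = offset + 2  # parsing resumes after the first pointer
            offset = ((length & 0x3F) << 8) | msg[offset + 1]
            hops += 1
            if hops > 32:
                raise ValueError("too many compression pointers")
            continue
        offset += 1
        if length == 0:
            break
        labels.append(msg[offset:offset + length].decode("ascii"))
        offset += length
    return ".".join(labels), (end if end is not None else offset)


def parse_mx_response(msg: bytes) -> tuple[list[tuple[int, str]], list[str]]:
    """Return ([(preference, mx_host), ...], [cname_target, ...]) from the answers."""
    _id, _flags, qdcount, ancount, _ns, _ar = struct.unpack("!HHHHHH", msg[:12])
    offset = 12
    for _ in range(qdcount):  # skip the echoed question section
        _, offset = read_name(msg, offset)
        offset += 4  # QTYPE + QCLASS
    mx, cnames = [], []
    for _ in range(ancount):
        _, offset = read_name(msg, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack("!HHIH", msg[offset:offset + 10])
        offset += 10
        if rtype == TYPE_MX:
            (preference,) = struct.unpack("!H", msg[offset:offset + 2])
            host, _ = read_name(msg, offset + 2)
            mx.append((preference, host))
        elif rtype == TYPE_CNAME:
            target, _ = read_name(msg, offset)
            cnames.append(target)
        offset += rdlength
    return mx, cnames


def query_mx(domain: str, server: str, timeout: float = 3.0):
    """Send one query over UDP and parse the reply (no retries, no TCP fallback)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_mx_query(domain, query_id=1), (server, 53))
        data, _ = sock.recvfrom(4096)
    return parse_mx_response(data)
```

A real resolver would also check that the response ID matches the query, handle timeouts, and follow referrals or CNAMEs. The sketch leaves those out to keep the packet format visible.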
Stats about educational and not-for-profit institutions would be very interesting because companies like Google offer their services for free to these places. Government sites are another interesting area. At this scale, I believe it makes more sense to base the market share on the number of staff/students at each institution.