Suppose you are launching a new web application. You have a list of users with personal information, and you need to mail them instructions for how to sign on to their new accounts. For security reasons, you cannot mail send both their username and password by mail. Instead, you must mail each users either their username or their password, along with instructions for how to construct the other field from personal information. It is important that the user can easily construct the unknown field, but that another person who happens upon the letter cannot. Each username must be unique.
In this case it is best to mail each user their password. This allows us to create strong, secure passwords, and to have our usernames be constructible from personal information.
A potential solution
One solution to this problem is to to assign each user a username consisting of the first initial of their first name, their last name, and the last four digits of their social security number. A person named John Smith with a social security number of 123-45-6789 would have the username JSmith6789. Assuming a user keeps their SSN secret, this satisfies the security criteria for the usernames, and at first glance also seems to satisfy the uniqueness criteria. After all, if you have the username JSmith6789, the odds are very small that another user has the same first initial, last name, and last four of their SSN.
The birthday problem
However, it’s not enough that there is a low probability that JSmit6789 will encounter another user with the same username. We need to assure that no two of our users will have the same username. To illustrate how different these probabilities can be, consider the Birthday Problem. Among a group of 25 people there is a fairly small chance (1-(364/365)^24 = 6.7%) that another person in the group will share your birthday. However, there is a very high change (1-(364/365)(363/365) … (341/365) = 56.8%) that two people in the room will share the same birthday.
Our potential solution might work for a small set of users, but if we have a large amount, say 50,000, of users we may run into problems. Using census data, we find that approximately 1.006% of the population has “Smith” as a last name and 11.9% of the population have a first name beginning with the letter “J”. If we assume the distribution of names in our group of users is similar to that of the US, we are likely to have around 59 people with the name J. Smith.
Each J. Smith has 10^4 = 10,000 possibilities for the last 4 digits of their social security number. Assuming the last 4 digits are distributed evenly, if we pick two J. Smiths at random there is only a 1/10,000 = .01% chance that they have the same social security number. However, using the techniques we used to solve the birthday problem, we can calculate that there is a 15.75% chance that there will be two J. Smiths in our group of users with the same last 4 digits of their SSN.
What does this mean
We made several assumptions about the name and SSN distribution of our users above, and we stopped short of full statistical analysis, but our results are alarming. If there is a 15.75% chance that there will be two people named J. Smith in our group of users with colliding usernames, we cannot be confident of avoiding username collision.
What to do
Since our first proposed username scheme looks flawed, where can we go from here? Luckily, due to the nature of the birthday problem, adding additional information to our usernames will greatly reduce our chances of username collision. For example, if we add the user’s birthday to the end of the username (JSmith67890131) the chance of two J. Smiths having the same username drops to 0.04%. If we are still unsatisfied we could add additional data such as the full first name or birth year to further decrease the chance of a collision. Check out the code used to calculate these probabilities here.