The good news is: we have an estimated 200 million unique visitors per month. The bad news is: we don’t always know very much about these visitors.
There are two simple reasons for this. First, not all of our visitors have user accounts. Second, not every user logs in when visiting our sites. We know very little about these “anonymous” visitors, so let’s call them Strangers.
Some visitors we do know quite a bit about though: those who have logged in with a user account containing information about themselves, e.g., age, gender, and home location. Let’s call these visitors Users.
Our User Modeling team uses the data we collect about Users to build models that can predict things like age and gender for Strangers. Ideally, all of our Strangers will eventually become Users. Until that happens, these models help us predict the information we’re missing, effectively creating an inferred User Profile.
Ultimately, this inferred information helps us:
- Better understand our visitors (“what’s the demographic breakdown of readers of this article?”)
- Serve more relevant content (“what content should we show people browsing our sites?”)
Age and Classifieds
Now, let’s dive right into the data. What is immediately visible in the data we collect today, and how can this be used to predict age?
By analyzing the millions of page views generated by users of Finn (our classified site in Norway), we can see how interest in classified categories changes with age:
The proportion of pages visited in each category is represented by the height of the line (the y-axis). This is indexed at 1 so we can compare across categories. Think of the lines as measures of user interest in these different categories, sorted by age.
For instance, the graph shows that people in their early 20s have a high interest in house or flat rentals. It also shows that people in their 50s are more interested in holiday houses.
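To make the idea concrete, here is a minimal sketch of how interest curves like these could be computed from raw page views. It is not our production pipeline: the function name `interest_curves`, the `(age, category)` input format, and the toy data are all made up for illustration, and we assume "indexed at 1" means each curve is scaled so that its peak equals 1.

```python
from collections import Counter, defaultdict

def interest_curves(page_views):
    """Build one interest curve per category.

    page_views: iterable of (age, category) pairs, one per page view
    (a hypothetical input format). Returns {category: {age: value}},
    where value is the category's share of page views at that age,
    scaled so the curve's peak is 1 (our assumed meaning of "indexed at 1").
    """
    views_per_age = Counter()
    views_per_age_and_category = Counter()
    for age, category in page_views:
        views_per_age[age] += 1
        views_per_age_and_category[(age, category)] += 1

    # Share of each age's page views that fall in each category.
    shares = defaultdict(dict)
    for (age, category), n in views_per_age_and_category.items():
        shares[category][age] = n / views_per_age[age]

    # Scale every curve by its own peak so categories are comparable.
    curves = {}
    for category, by_age in shares.items():
        peak = max(by_age.values())
        curves[category] = {age: share / peak for age, share in sorted(by_age.items())}
    return curves

# Toy data, invented for this example.
sample = [
    (17, "part-time jobs"), (17, "part-time jobs"), (17, "rentals"),
    (22, "rentals"), (22, "rentals"), (22, "rentals"), (22, "part-time jobs"),
    (55, "holiday houses"), (55, "holiday houses"), (55, "rentals"),
]
curves = interest_curves(sample)
```

In this toy data, `curves["rentals"]` peaks at age 22 and `curves["part-time jobs"]` peaks at age 17. Dividing by the number of page views at each age matters: without it, the curves would mostly reflect how many visitors of each age we have, not what they are interested in.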
Intuitively, the graph makes sense, and is explained by these underlying facts:
- Part-time jobs are mainly relevant during your teenage years.
- Your focus shifts towards finding a home to rent when you move out of your parents’ house. This typically happens around age 20 (Scandinavians leave the nest early!).
- Norwegians generally aim to own a home – home ownership rates in Norway are higher than in the rest of Western Europe – so interest in buying a home peaks fairly early, at age 30.
- Holiday houses become more interesting almost linearly as you accumulate wealth and approach retirement.
The cool thing is that the data gives us all this information. A machine doesn’t need to know about Norwegian home ownership rates or their linearly increasing need to migrate to warmer temperatures – it can just look at our data!
And this is basically how we build models. Just as in the graph above, we start from cases where we know the ground truth (in this case, the age of the users). Machine-learning algorithms then look at rough representations of the data to find relationships between interests and age, resulting in a model that can predict age.
If you give the algorithm a visitor’s page view history (a proxy for their interests), it can predict their age with some degree of error.
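As a rough illustration of that idea, the sketch below trains a tiny naive Bayes classifier on the page view histories of Users with known age groups, then predicts an age group for a Stranger. Naive Bayes is only a stand-in chosen for brevity, not a description of the models we actually run. The age groups, the category name "homes for sale", the function names, and the training data are all invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(users):
    """users: list of (age_group, [categories viewed]) for logged-in Users."""
    group_counts = Counter()          # number of Users per age group
    cat_counts = defaultdict(Counter) # page views per category, per age group
    vocab = set()                     # every category seen in training
    for group, views in users:
        group_counts[group] += 1
        cat_counts[group].update(views)
        vocab.update(views)
    return group_counts, cat_counts, vocab

def predict(model, views):
    """Return the most likely age group for a page view history."""
    group_counts, cat_counts, vocab = model
    total_users = sum(group_counts.values())
    best_group, best_score = None, -math.inf
    for group, n_users in group_counts.items():
        total_views = sum(cat_counts[group].values())
        score = math.log(n_users / total_users)  # log prior
        for category in views:
            # Laplace smoothing so unseen categories never give log(0).
            p = (cat_counts[group][category] + 1) / (total_views + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_group, best_score = group, score
    return best_group

# Toy ground truth, invented for this example.
known_users = [
    ("under 20", ["part-time jobs", "part-time jobs", "video games"]),
    ("20-29", ["rentals", "rentals", "part-time jobs"]),
    ("30-49", ["homes for sale", "homes for sale", "rentals"]),
    ("50+", ["holiday houses", "holiday houses", "homes for sale"]),
]
model = train(known_users)
guess = predict(model, ["rentals", "rentals"])  # a Stranger's history
```

Here `guess` comes out as `"20-29"`, because rentals dominate that group's history in the toy data. A real model would see far more categories and far noisier histories, which is where the "some degree of error" comes from.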
One of the things I love about Schibsted is that we have classified sites all over the world. This means we can compare users in different countries doing the same things: browsing, buying, and selling stuff online.
The curves are fascinatingly similar. Both peak in your early twenties, stay high for about three years, then rapidly decline. The French home-renting interest starts just a little bit later: one year, to be exact. This is more or less consistent with the aforementioned statistics, although those stats indicate the interest should start two years later.
The graph for Leboncoin is created from data for just one day*.
Think about that for a moment. By inspecting data points generated by our users on a single day, we can discover their interests over their entire lives.
This is your life in classifieds.
We can create these curves for any category on a classified site. Below is the interest curve for “Video Games” from Leboncoin.
What other categories might reveal a person’s age? Tell us in the comment field below and we might test some of your hypotheses.
* This is the reason the Leboncoin graph jumps around more: towards the right we have less data. We built the Finn graph on 30 days’ data. This cancels out more noise and creates a smoother curve.
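A back-of-the-envelope way to see why more days give a smoother curve: the standard error of an estimated proportion shrinks with the square root of the number of observations. The sketch below assumes independent page views and equal traffic every day. The 5% share and the 1,000 page views per day are hypothetical numbers, not figures from our data.

```python
import math

def standard_error(share, n_views):
    """Standard error of a proportion estimated from n_views independent page views."""
    return math.sqrt(share * (1 - share) / n_views)

# Hypothetical: a category with a true 5% share among visitors of one age,
# observed over 1,000 page views per day.
one_day = standard_error(0.05, 1_000)
thirty_days = standard_error(0.05, 30_000)

# The one-day estimate is sqrt(30) times noisier than the 30-day estimate.
ratio = one_day / thirty_days
```

Under these assumptions, 30 days of data cuts the noise by a factor of about 5.5 (the square root of 30), whatever the daily volume is. The same formula explains the jumpiness towards the right of the graph: fewer page views at those ages means a larger standard error.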