# Introduction

Millions of cars are sold every year across Schibsted’s 42 worldwide marketplaces, on sites such as Coches.net in Spain, Finn.no in Norway or Blocket.se in Sweden. What more can we do to help people sell their cars – or, indeed, anything?

Setting an asking price is probably the most important action for any seller in a marketplace. The asking price *directly* affects the final price, and (because every seller wants the best price) also directly impacts the user experience on Schibsted classified sites.

However, the price has less obvious effects, as well. Too low and the seller may be flooded with offers and regret the price. Too high and they might receive no interest in the item – or many haggling requests, and be forced to lower the price. By supporting users in setting realistic prices, we believe we can improve their marketplace experience.

But *how* exactly can we support them? Classified advertising sites typically require only a photo, title, description, category and price, making price recommendation difficult, because we are forced to find similarities between classified ad features before applying pricing logic.

However, ads for some product types have more structured information – cars are a good example. Here collecting information on the car make, model and release year is very common. This makes it much simpler to recommend prices, because we can apply conventional regression techniques without thinking about complex similarity measures to unveil the dynamics of pricing.

Let’s shine a light on what drives car prices and how well we can predict them…

# Data

One privilege of working as Data Scientists at Schibsted is the incredible amount of data available for studies like this one. The dataset used in this post comes from one of Schibsted’s classified sites. It comprises a few million car ads with the following nine features, in addition to the asking price, which is our target variable.

*Table1: Data Model*

The dataset consists of 39 car brands, with the following distribution:

*Figure 1: Proportion of Car Brands*

The dataset also consists of some 520 distinct car models. We limited our study to the 300 car models that have more than 50 observations in the dataset.

### Depreciation

Depreciation is the loss in a car’s value between the time you buy it and the time you sell it. In this study, we will be looking at how age and mileage impact your car’s value.

### Age

We obtain the age of a car as the difference between the year of the *published_date* field and the *model_year* field. Below is a plot of the relative counts for each age group of size five.

*Figure 2: Histogram of car age.*

As we can see, 25% of cars are younger than five years, whereas about 60% of cars are less than ten years old.
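The age derivation and bucketing described above can be sketched in a few lines of pandas. The column names follow Table 1, but the rows here are made up purely for illustration.

```python
import pandas as pd

# Illustrative ads; column names follow Table 1, values are made up.
ads = pd.DataFrame({
    "published_date": pd.to_datetime(["2016-06-01", "2016-07-15"]),
    "model_year": [2012, 2006],
})

# Age = year of publication minus model year.
ads["age"] = ads["published_date"].dt.year - ads["model_year"]

# Bucket ages into groups of size 5, as in the histogram of Figure 2.
ads["age_group"] = (ads["age"] // 5) * 5
```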

### How does **age** impact your car’s value?

Car owners can easily estimate the cost of insurance, fuel and service per year, but they may have a harder time figuring out how much their car will be worth in the future.

What will be the value of your car in, say, four years? And will the depreciation change if your car is a BMW or a Ford? Without data, these questions are hard to answer. But with the amounts of historical data we have at Schibsted, the depreciation of a car is more easily determined.

We decided to test this by plotting the depreciation curves for the six most popular car brands, in Figure 3 below. For each of the six brands, we calculated the median price of the cars at each age level and scaled the values so that the price at age zero is 100%. Hence, in Figure 3 we can think of the vertical axis as the relative price of a car with respect to its initial value.

*Figure 3: Depreciation with time*

As it turns out, we can say that *the price of your car has most likely halved after four years*, no matter if it’s a BMW or a Ford. That seems like a good rule of thumb. More precisely, the price decreases by approximately 15% every year in an exponential fashion: at year four it is 50% of the initial price, at year eight it is 25%, and so on. If you don’t believe us, do the maths: 85% × 85% × 85% × 85% ≈ 52%. This is what finance calls “the magic of compounding”.
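The compounding arithmetic is easy to verify directly:

```python
# Depreciation compounds multiplicatively: retaining ~85% of the value
# each year means the price roughly halves every four years.
retained_per_year = 0.85

after_4_years = retained_per_year ** 4  # ~52% of the initial price
after_8_years = retained_per_year ** 8  # ~27%, roughly a quarter
```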

### Mileage

*Mileage* in our dataset is grouped in buckets of size 5,000 km. From Figure 4, we can see that the largest group is in fact cars that have less than 5,000 km, which comprises about 5.5% of the cars in the dataset. Note that in the graph below, the rightmost bar is the group of all cars driven 300,000 km or more.

*Figure 4: Histogram of car mileages*

It’s a known phenomenon that as the mileage of a car grows, its value falls. It’s common sense. However, can we derive a rule of thumb for the depreciation effect of mileage, like we did for the age? Let’s have a look.

### How does **mileage** impact your car’s value?

Just as we did with the age of the car, we plot the evolution of its price over mileage, but now directly in relative terms, i.e. as a percentage of the price at zero mileage:

*Figure 5: Depreciation of cars with mileage*

As expected, we can see that all car brands lose value quickly. However, it’s also apparent that Volkswagens and Fords depreciate faster in the first 25,000 km, and in Ford’s case the price never catches up with the other brands. Alongside the data from the six brands, we have plotted the exponential decay curve y = c·e^(ax), fitted on the medians of car prices from all six brands.
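As a sketch of how such a curve can be fitted, a log-linear least-squares fit recovers the parameters of y = c·e^(ax), since log y = log c + a·x. The data points below are illustrative values read off a curve with an ~85,000 km half-life, not the actual brand medians.

```python
import numpy as np

# Fit y = c * e^(a x) via linear least squares on log y = log c + a x.
# The points below are illustrative, not the real median prices.
mileage = np.array([0, 25000, 50000, 85000, 150000])   # km
rel_price = np.array([1.00, 0.81, 0.66, 0.50, 0.29])   # fraction of initial price

a, log_c = np.polyfit(mileage, np.log(rel_price), 1)
c = np.exp(log_c)

# Mileage at which the fitted curve halves: solve c*e^(a x) = c/2.
halving_mileage = np.log(0.5) / a
```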

It’s apparent that the exponential fits the brand curves quite nicely, except in the case of Ford. While the price of a car roughly halves after 85,000 km, some cars halve quicker: the Fords lose 50% of their value after just 65,000 km.

Figure 5 also allows us to establish some equivalences. We can see that the first 15% of the price loss, which is roughly equivalent to one year as per Figure 3, happens at around 16,000 km. The next 15% is at approximately 32,000 km. As a rule of thumb, a mileage of 16,000 km (which is about 10,000 miles) has the same effect as one year.

There is of course a flaw in the charts above. We’ve plotted the median price for each car brand evolving with the age or mileage, but including all the different models. It is likely that some car models depreciate differently from the median. Figure 6 shows the variation in price for Volvos, by plotting the interpercentile ranges of prices for Volvos for different mileages.

*Figure 6: Interpercentile ranges of prices of Volvos grouped by mileage*

It is clear that there is a lot of variance in the price that mileage alone doesn’t explain, and that the depreciation will quite possibly change for different car models. For example, if we look at the area between the 25th and 75th percentiles for cars with 50,000 km, the prices range from 50% to 80% of the initial price. The median, however, lies at around 65%. This shows how much prices can vary within the same brand, likely due to many factors unrelated to mileage.

## Price Predictions

In this section, we will discuss some basic approaches to predicting prices for vehicles. For this purpose, we divide the dataset into a training set, which we use to train our models, and a test set, on which we test how accurately they can predict a price. The split is based on time: we use the newest 20% of the observations for testing.
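A time-based split like this can be sketched as follows; the frame here is illustrative, and the key point is sorting by publication date before holding out the newest 20%.

```python
import pandas as pd

# Illustrative ads; in practice this would be the full dataset.
ads = pd.DataFrame({
    "published_date": pd.to_datetime(
        ["2016-05-01", "2016-01-01", "2016-03-01", "2016-04-01", "2016-02-01"]),
    "price": [11000, 10000, 9000, 15000, 12000],
}).sort_values("published_date")

# Oldest 80% for training, newest 20% for testing.
cutoff = int(len(ads) * 0.8)
train, test = ads.iloc[:cutoff], ads.iloc[cutoff:]
```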

In order to measure the accuracy of our model, we use the mean and the median of the absolute percentage error (APE). Although it sounds complicated, it is not. Bear with us one moment.

The absolute percentage error is given by |(truth – prediction) / truth| for each observation. For example:

We have two cars A and B with their “true prices” €10,000 and €20,000 respectively. For car A, we predict a price of €12,000, and for car B we predict €15,000. The absolute percentage error for car A is |10000 – 12000| / 10000 = 20%. For car B, the absolute percentage error is |20000 – 15000| / 20000 = 25%. The mean absolute percentage error is therefore (20% + 25%) / 2 = 22.5%.
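The metric and the worked example above translate directly into code:

```python
def mean_ape(truth, predictions):
    """Mean absolute percentage error over paired observations."""
    apes = [abs((t - p) / t) for t, p in zip(truth, predictions)]
    return sum(apes) / len(apes)

# The worked example from the text: cars A and B.
mean_ape([10000, 20000], [12000, 15000])  # 0.225, i.e. 22.5%
```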

Of course the lower the Mean APE of the model, the better it is. Now that we have a way to measure how good a model is, we can build a very simple one by grouping the training set by different fields and calculating the mean (or some other number) of the price based on the observations in that group. Then, for any car that matches on those fields, we will use the computed number, i.e. the mean in this example, as the prediction for this car.

For example, if we group only by “make_name”, the prediction of this simple model for an Audi A4 Avant from 2010 will be just the average price of all cars of the make Audi. We dub this benchmark approach the GroupByRegressor. In Table 2 we report the Median and Mean APE on the test set:

*Table 2: Results obtained by using group by and mean calculation*

Interestingly, with this simple model we are on average about 15% away from the observed asking price – which actually isn’t *that* bad! Note that although the GroupByRegressor seems to work here, it will do a very bad job whenever an observation in the test set doesn’t match any observation in the training set on the grouped features – hence, this model won’t generalize to completely new observations.
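A minimal sketch of the GroupByRegressor idea, including its failure mode on unseen groups, might look like this (the data is illustrative):

```python
import pandas as pd

# Group the training set by the chosen fields and predict the group mean.
train = pd.DataFrame({
    "make_name": ["Audi", "Audi", "Ford"],
    "price": [20000, 30000, 10000],
})
group_means = train.groupby("make_name")["price"].mean()

def predict(make_name):
    # Raises KeyError for makes unseen in training: the generalization
    # weakness of this benchmark.
    return group_means[make_name]
```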

We can do better.

A slightly more sophisticated approach is to train a K Nearest Neighbors (KNN) model. Given a distance metric and a number K, a KNN model identifies for every data point in the test set the K closest data points in the training set. We then report the average price of those K neighbors as the prediction. This methodology mimics the behavior of users in our marketplaces: they search for cars similar to the one they want to sell and use the prices in the search results as the basis for setting the price of their own ad.
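A toy version of this with sklearn, on just two numeric features (age and mileage) and made-up prices, shows the mechanics; the real pipeline also one-hot encodes the categorical fields and scales the features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative training cars: (age in years, mileage in km) -> price.
X_train = np.array([[1, 10000], [2, 30000], [8, 120000], [9, 150000]])
y_train = np.array([18000, 15000, 5000, 4000])

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train)

# Prediction = average price of the 2 nearest training cars.
knn.predict([[1, 15000]])
```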

Using the nine features in Table 1, we define a simple pipeline where we transform the discrete features *make_name*, *model_name*, *model_version_name*, *gearbox* and *chassis* into multiple binary vectors in a procedure known as *one-hot encoding*. The *model_year* and *published_date* fields are transformed into the *age* of a car by computing year(*published_date*) – *model_year*; *age* is then, along with *engine_size* and *mileage*, scaled to zero mean and unit variance.
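One plausible sklearn translation of that pipeline is sketched below. The column names follow Table 1, *age* is assumed to be derived beforehand, and the exact transformer choices are our assumption, not a description of the production code.

```python
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["make_name", "model_name", "model_version_name",
               "gearbox", "chassis"]
numeric = ["age", "engine_size", "mileage"]  # age derived beforehand

preprocess = ColumnTransformer([
    # One-hot encode the discrete features; ignore unseen categories.
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Scale numeric features to zero mean and unit variance.
    ("scale", StandardScaler(), numeric),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
])
```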

We present the results of the KNN model in Table 3. We see an improvement of the Mean APE from 15.0% of the GroupByRegressor to 11.9% for the KNN. Not a bad improvement.

We also trained a gradient boosting machines model, using the XGBoost implementation. For this case we tried something different: we trained one model per unique car model (denoted OPMM below). We do this to avoid having to model interaction effects, i.e. the fact that mileage affects the price of a Ford differently than the price of an Audi, as we observed before. Hence each per-model regressor uses only the following five features: *gearbox*, *chassis*, *age*, *engine_size* and *mileage*. Also included in Table 3 are the scores of an XGBoost regression trained on the entire training set, using all features, with all parameters set to the defaults of the xgboost sklearn interface, but without interaction effects.

*Table 3: Regression model results*

It’s immediately clear from Table 3 that both KNN and XGBoost were able to beat the results of the benchmark model, the GroupByRegressor. However, the method performing the best on this test set, XGBoost, only manages to beat it by 3.5 percentage points in terms of Mean APE. Interestingly, recommending the mean price of the 10 nearest neighbors works surprisingly well, being only 0.5 and 0.06 percentage points behind XGBoost in median and mean absolute percentage error, respectively.
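The OPMM training loop can be sketched as follows. As a stand-in for XGBoost we use sklearn’s GradientBoostingRegressor, and *gearbox* and *chassis* are assumed to be numerically encoded already; the data in the test is illustrative.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# The five features used by each per-car-model regressor; gearbox and
# chassis are assumed to be numerically encoded beforehand.
FEATURES = ["gearbox", "chassis", "age", "engine_size", "mileage"]

def fit_opmm(train):
    """Train one gradient boosting regressor per unique car model."""
    return {
        name: GradientBoostingRegressor().fit(group[FEATURES], group["price"])
        for name, group in train.groupby("model_name")
    }
```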

## Some final words

We now have a rule of thumb for the fall in a car’s price by age and mileage, namely that it halves every four years *or* after every 85,000 km driven. We also built a model that can, with reasonable accuracy, estimate a car’s price based on that of similar cars.

This price prediction helps users choose the right price for their item, but it also helps them figure out how much they should pay for something second-hand. The best news is, we’re only scratching the surface in doing this with cars – there’s much more work to be done!

## Disclosure

The data in this article is taken from one of Schibsted’s marketplaces and was extracted in summer 2016. Prices in this article are listing prices, not the actual sales prices.