Written by Carl Svard
Published 2017-06-08

Translating words into vectors

Matching a job description with a candidate’s experience is not an easy task, even for humans.

An HR professional’s workload includes lots of data-heavy tasks (like sifting through tons of candidate experiences), which can be very time-consuming. With the great AI awakening, can we expect machines to help HR with these repetitive and complex tasks?

At Schibsted, we operate classified ad services around the world, including several leading job search sites, and we are continuously working to improve the efficiency of our marketplaces and provide a good match between supply and demand. We see AI and deep learning as enablers to further improve our user experience.

In the future, we expect AI in the job classified ads business to both help HR professionals source good candidates and enable candidates to find their next job. This post explores how the latest progress in NLP (Natural Language Processing) could reshape the job / candidate matching problem by using good representations (embeddings) of the content. 

Challenges to overcome

The simplest way to match a candidate’s skills with a job description is to use keywords in a term-based search. In the IT field, for instance, if a Java specialist is looking for a new position, then “Java” would obviously be a good keyword to match. A term-based search will then match all the job offers where the term “Java” appears. However, listings with other relevant keywords (like, say, J2EE) would not appear, because they do not contain the keyword “Java”. The candidate may therefore miss some relevant job offers.

This is why not only precision but also recall (the fraction of relevant listings actually matched) matters.

Among the challenges to good matching, some are particularly crucial:

  • Synonyms: Depending on the corporate culture, similar job positions and skills might be described using different words. A good model has to be able to catch synonyms and related words.
  • Polysemy: On the other hand, one specific keyword can have very different meanings in different contexts. For example, in French, “chef” can refer either to a kitchen chef or to a team leader.
  • Typos: Classified ads content can sometimes contain typos and spelling mistakes.

Embeddings trained on classified ads content

NLP (Natural Language Processing) has seen huge progress lately. One of the most popular recent techniques is the use of word embeddings (a.k.a. distributional semantic models), which has shown that we can represent a word with a vector of real numbers that “encodes” the word from a semantic point of view.

The most famous word embeddings model, word2vec, was introduced in 2013 (Mikolov et al., 2013). Word2vec has been analysed by many researchers, for instance Goldberg and Levy. The concept of this model is that a word’s representation can be inferred from its context: if two words appear frequently in the same contexts, they should be represented by two nearby vectors. Internally, given a word (its representation, in fact), the model tries to predict the representations of the surrounding words. In practice, it means that word2vec will assign high similarities to words that can be used in the same manner.
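For reference, the standard skip-gram formulation of this idea (this is the objective from Mikolov et al., quoted here as background, not something specific to our setup) maximizes the average log-probability of the words surrounding each word in the corpus, over a context window of size c:

\[ \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t) \]

where p(w_{t+j} | w_t) is a softmax over the word vectors, approximated in practice (for instance with the negative sampling trick mentioned later in this post).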

Here is the example developed in the word2vec article:

Say we represent each word of a vocabulary with a vector (e.g., in dimension 100). Each word is then represented by a distribution of numbers across 100 dimensions. The word “king” is represented like this:

W(“king”) = (0.92, 0.81, -0.2, … )


This representation is a projection into a 100-dimensional space, and the projection can be seen as a compression. The list of 100 numbers does not mean anything in itself, just as a zip file is unreadable on its own but still compresses the information.

Other words of the vocabulary, like “man” or “woman”, have their own representations:

W(“man”) = (0.11, 0.85, -0.18, … )

W(“woman”) = (0.12, -0.25, 0.91, … )

We can then compose these vectors using addition or subtraction, like:

W(“king”) – W(“man”) + W(“woman”) = (0.93, -0.29, 0.89, …)  

The resulting vector is a point in our 100-dimensional space. What if we look for the word whose representation in that space is closest to (0.93, -0.29, 0.89, …)? The answer is the word “queen”. So, to recap:

W(“king”) – W(“man”) + W(“woman”) = (0.93, -0.29, 0.89, …) ≈  W(“queen”)

i.e., king – man + woman ≈  queen

This kind of result seems very powerful. The input of the model is only a corpus of text, and absolutely no notion of gender or royalty has been explicitly given to the model (through a dictionary or a relationship network).
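To make this concrete, here is a minimal sketch of such an analogy query using the gensim library mentioned below (the file name is a placeholder; any pretrained word2vec vectors in the standard binary format would do):

```python
from gensim.models import KeyedVectors

# Placeholder path: any pretrained word2vec vectors in the usual
# word2vec binary format can be loaded this way.
vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

# king - man + woman: gensim sums the "positive" vectors, subtracts the
# "negative" ones, and returns the nearest words to the resulting point.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is expected to show up at or near the top of this list.
```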

Applying this model, if we feed an open-source Python NLP library (namely gensim) with job offer content from Leboncoin, Schibsted’s French marketplace, we can compute which vectors are most similar to a given input.
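Here is a rough sketch of that pipeline. The two example ads, the naive tokenisation and the parameter values are illustrative only, not our production preprocessing:

```python
import re
from gensim.models import Word2Vec

# job_ads stands in for an iterable of raw ad texts (title + description).
job_ads = [
    "Développeur Java / J2EE confirmé pour une mission longue ...",
    "Restaurant recherche un cuisinier expérimenté ...",
]

# Very naive tokenisation, for illustration only; real preprocessing
# (lowercasing, accents, stop words) matters a lot in practice.
sentences = [re.findall(r"\w+", ad.lower()) for ad in job_ads]

# A 100-dimensional skip-gram model, as in the examples above.
# (In gensim >= 4.0 the `size` parameter is called `vector_size`.)
model = Word2Vec(sentences, size=100, sg=1, window=5, min_count=1)

# Which words end up closest to "java"?
print(model.wv.most_similar("java", topn=10))
```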

Example 1 (IT techno)

Even though Leboncoin is not a niche specialist in the IT field, these similarities look pretty good.

Example 2 (car make)

From the king – man + woman = queen example, which can be re-formulated as

“man” is to “woman” what “king” is to ?

we can also apply the same logic to typos:

“cuisinier” (the French word for ‘cook’) is to “cuisnier” (a typo of it) as “chauffeur” is to what?

(i.e., if “cuisnier” is a typo of “cuisinier”, what would a typo of “chauffeur” look like?)

Example 3 (typos)

As we see with these outcomes, the most similar vectors (in the 100-dimensional space) are also quite close in terms of semantics. That is why the similarity between two vectors is sometimes presented as a ‘semantic distance’. However, while this seems to perform well for positive similarity, it does not really handle negative similarity (antonyms).

This type of vector analogy is not specific to embeddings, but embeddings do a remarkable job at preserving the semantic relationships in low dimensions. On top of this, word2vec comes with computational optimizations that make it possible to train on very large corpora very efficiently (in particular, a negative sampling technique).
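For the curious, the negative sampling objective (again from the original word2vec paper, quoted here only as background) replaces the full softmax by scoring each observed (word, context) pair against k randomly drawn “negative” words, maximizing

\[ \log \sigma\left( {v'_c}^{\top} v_w \right) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n} \left[ \log \sigma\left( -{v'_{c_i}}^{\top} v_w \right) \right] \]

where v_w and v'_c are the input and output vectors of the word and its context, and P_n is a noise distribution over the vocabulary.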

Applications

Assuming that we have a good representation of words, we will have a base on which to build more advanced applications.

  1. Query expansion: In an experiment conducted in Schibsted’s Norwegian marketplace, Finn, we used the word vectors for under-the-hood query expansion, with good customer feedback. From a small search query of typically 5 words, 4-5 times as many similar words were retrieved using the word2vec model, forming an expanded search query. This expanded query was used to broaden the search and retrieve more candidates to pool from (a sketch of this idea follows after this list).
  2. Ad2Vec: from word embeddings to classified ads embeddings
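To illustrate point 1 above, here is a minimal, hypothetical sketch of such a query expansion, reusing the word2vec model sketched earlier (this is not Finn’s production code; the helper name and thresholds are made up for the example):

```python
def expand_query(query, model, per_word=4, min_similarity=0.6):
    """Expand a search query with words the embedding model considers similar.

    Illustrative helper only: it adds up to `per_word` neighbours per query
    term, filtered by a similarity cutoff, and returns the expanded term list.
    """
    expanded = []
    for word in query.lower().split():
        expanded.append(word)
        if word in model.wv:
            for neighbour, score in model.wv.most_similar(word, topn=per_word):
                if score >= min_similarity:
                    expanded.append(neighbour)
    return expanded

# e.g. expand_query("java developer paris", model) might add terms such as
# "j2ee" next to "java", broadening the search and the candidate pool.
```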

Getting a good representation of a classified ad job offer would enable us to estimate its similarity to other offers based on their descriptions (to improve recommendations for instance).

Generally speaking, a classified ad can be seen as three elements: a title, a description and some pictures (when relevant). For job offers, we can focus on the first two elements.

Our first approach was to simply average all the word embeddings of the ad content to get a representation of the whole ad. However, this did not appear to be the best aggregation method. Instead, we experimented with a couple of alternatives, such as a tf-idf weighted sum of the underlying embeddings or projecting the ad into a higher-dimensional space, to get better representations.
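As an illustration of the tf-idf weighted variant, here is a rough sketch reusing the model and job_ads from the earlier snippet; the use of scikit-learn’s TfidfVectorizer is an assumption made for the example, the post does not prescribe any particular tooling:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def ad_embedding(ad_text, model, vectorizer):
    """Represent a whole ad as the tf-idf weighted average of its word vectors."""
    tfidf = vectorizer.transform([ad_text])
    vector = np.zeros(model.wv.vector_size)
    total_weight = 0.0
    for word, idx in vectorizer.vocabulary_.items():
        weight = tfidf[0, idx]
        if weight > 0 and word in model.wv:
            vector += weight * model.wv[word]
            total_weight += weight
    return vector / total_weight if total_weight > 0 else vector

# Fit the idf weights on the whole ad corpus, then embed each ad:
vectorizer = TfidfVectorizer().fit(job_ads)
ad_vector = ad_embedding(job_ads[0], model, vectorizer)
```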

From that, not only can one-to-one similarity be computed, but we can also try to get a more global view of the embeddings. We used Google’s TensorBoard Embeddings Visualization tool, with a 3D t-SNE projection of a sub-sample of the ads. Each point is a job offer classified ad, colored according to its field. The plot below shows a sub-sample of 7 fields. The ad embeddings coupled with the t-SNE projection split the different fields very clearly.

T-SNE projection
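For readers who want to reproduce this kind of view without TensorBoard, a rough equivalent (an assumption made for illustration, not the tool used to produce the plot above) is scikit-learn’s t-SNE plus matplotlib, applied to a sample of ad embeddings:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in data: replace with real ad embeddings and their field labels.
ad_vectors = np.random.rand(500, 100)       # (n_ads, 100) ad embeddings
fields = np.random.randint(0, 7, size=500)  # one field label per ad

# Project the embeddings down to 2 dimensions (TensorBoard used 3).
projection = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(ad_vectors)

plt.scatter(projection[:, 0], projection[:, 1], c=fields, cmap="tab10", s=5)
plt.title("t-SNE projection of ad embeddings, coloured by field")
plt.show()
```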

Zooming in on the IT position ads, we also get a very homogeneous word cloud (in French):

Word cloud from IT position cluster

Conclusion

The word embeddings model and its derivatives are very promising techniques to qualify and enrich classified ads content. The Natural Language Processing (NLP) techniques used here bring us a step closer to automated Natural Language Understanding (NLU).

We focused here on job descriptions, but, for more general purposes, a global content embedding will have to synthesize the text, the pictures and all the metadata attached to an ad.

What about deep learning? Word2vec does not actually use a deep network (it is better described as a shallow model). Deep learning applied to text is still a huge research area. If you want to help us dig deeper into this topic, we are hiring!
