Many data scientists are familiar with word embedding models such as word2vec, which capture semantic similarity among words and phrases in a corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models need either large amounts of data or tuning through transfer learning on the domain-specific vocabulary that most commercial applications have.

In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used for advanced topic modeling on medium-sized datasets that are specialized enough to require significant modification of a word2vec model and that contain more general data types (including categorical, count and continuous data).
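To make the idea concrete, here is a minimal sketch of how a Bernoulli embedding scores one item given its context. It assumes the standard formulation in which each item has an embedding vector and a context vector, and the probability that the item is present is the logistic function of the inner product between its embedding and the sum of the context vectors around it. The function names, vectors and values below are hypothetical and purely illustrative; this is not the model or code presented in the talk.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bernoulli_embedding_prob(rho_i, context_alphas):
    """P(x_i = 1 | context) under a Bernoulli embedding (illustrative sketch).

    rho_i          -- embedding vector of the target item (list of floats)
    context_alphas -- context vectors of the items surrounding the target
    """
    dim = len(rho_i)
    # Sum the context vectors of the surrounding items, dimension by dimension.
    context_sum = [sum(alpha[d] for alpha in context_alphas) for d in range(dim)]
    # Natural parameter: inner product of the embedding and the summed context.
    eta = sum(r * c for r, c in zip(rho_i, context_sum))
    return sigmoid(eta)


# Hypothetical 2-dimensional vectors, chosen only for illustration.
rho_skill = [0.5, -0.2]
context = [[1.0, 0.0], [0.5, 1.0]]
p = bernoulli_embedding_prob(rho_skill, context)
print(round(p, 3))
```

Other members of the exponential family (for example Poisson for counts or Gaussian for continuous values) would, under the same assumptions, replace the logistic link with the one appropriate to that distribution.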


I will discuss how we implemented a dynamic embedding model using scikit-learn and TensorFlow on our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last 3 years. My results focus on how data science and quantitative skill sets have developed, grown and pollinated other types of jobs over time.

I will specifically discuss the following:

  • Introduction to word embedding models (word2vec, GloVe), focussing on barriers to real-world/industrial implementation
  • Background on exponential family embeddings (with reference to Rudolph and Blei), focussing on applications of multivariate and Bernoulli models
  • Description of data used to train the model (size, types of data and the processing steps that we optimized)
  • Description of results from model
    • What 'fringe' data science skills have become 'core' data science skills?
    • What data science skills have pollinated other types of roles?
      • What are the primary functions/roles where there is pollination?
    • What are 'new' data science skills that are emerging?

Maryam Jahanshahi

Research Scientist | TapRecruit

Maryam runs research at TapRecruit, a startup that is building software tools to implement evidence-based talent management. TapRecruit's research program integrates recent advances in NLP, data science and decision science to identify robust methods to reduce bias in talent decision-making and attract more qualified and diverse candidate pools. In a past life, Maryam was a cancer biologist and a data journalist. She holds a PhD from the Icahn School of Medicine at Mount Sinai.
