Many data scientists are familiar with word embedding models such as word2vec, which capture semantic similarity among words and phrases in a corpus. However, standard word embeddings offer limited means of analyzing a corpus alongside other covariates or over time. Moreover, word embedding models require either large amounts of training data or tuning via transfer learning to handle the domain-specific vocabularies typical of commercial applications.
In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used for advanced topic modeling on medium-sized datasets that are specialized enough to require significant modifications of a word2vec model and that contain more general data types (categorical, count, and continuous).
I will discuss how we implemented a dynamic embedding model using scikit-learn and TensorFlow on our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last three years. I will focus in particular on how data science and quantitative skill sets have developed, grown, and cross-pollinated other types of jobs over time.
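To make the idea of a dynamic embedding concrete, here is a minimal NumPy sketch (not the talk's actual implementation, which uses scikit-learn and TensorFlow on a proprietary corpus): each year gets its own word embedding matrix, trained with a skip-gram-style objective plus a drift penalty that ties consecutive years together. The toy corpus, vocabulary, and hyperparameters are all invented for illustration.

```python
import numpy as np

# Hypothetical toy corpus: tokenized job-description snippets keyed by year.
# (Stand-in for a real corpus -- purely illustrative.)
corpus = {
    2016: [["python", "statistics", "reporting"], ["sql", "reporting", "excel"]],
    2017: [["python", "machine", "learning"], ["sql", "python", "statistics"]],
    2018: [["python", "deep", "learning"], ["machine", "learning", "statistics"]],
}

vocab = sorted({w for docs in corpus.values() for doc in docs for w in doc})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8
years = sorted(corpus)
rng = np.random.default_rng(0)

# One target-embedding matrix per year; context vectors shared across years.
emb = {t: rng.normal(scale=0.1, size=(V, D)) for t in years}
ctx = rng.normal(scale=0.1, size=(V, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr, lam, window = 0.05, 0.5, 2
for epoch in range(200):
    for t in years:
        E = emb[t]
        for doc in corpus[t]:
            for i, w in enumerate(doc):
                lo, hi = max(0, i - window), min(len(doc), i + window + 1)
                for j in range(lo, hi):
                    if i == j:
                        continue
                    wi, cj = idx[w], idx[doc[j]]
                    # Positive pair: push the dot product up.
                    g = 1.0 - sigmoid(E[wi] @ ctx[cj])
                    E[wi] += lr * g * ctx[cj]
                    ctx[cj] += lr * g * E[wi]
                    # One negative sample: push a random word's score down.
                    neg = rng.integers(V)
                    gn = -sigmoid(E[wi] @ ctx[neg])
                    E[wi] += lr * gn * ctx[neg]
                    ctx[neg] += lr * gn * E[wi]
        # Drift penalty: keep this year's embeddings near last year's.
        prev = years.index(t) - 1
        if prev >= 0:
            E -= lr * lam * (E - emb[years[prev]])
```

After training, comparing a skill term's year-specific vector against other terms (e.g. by cosine similarity) shows how its neighborhood shifts across years, which is the basic signal behind charting skill-set development over time.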
I will specifically discuss the following:
Maryam runs research at TapRecruit, a startup that is building software tools to implement evidence-based talent management. TapRecruit's research program integrates recent advances in NLP, data science, and decision science to identify robust methods for reducing bias in talent decision-making and attracting larger pools of qualified, diverse candidates. In a past life, Maryam was a cancer biologist and a data journalist. She holds a PhD from the Icahn School of Medicine at Mount Sinai.