What is AQM?

Applied Quantitative Methods (AQM) is a rigorous 10-month program designed for highly quantitative graduates, postgraduates and experienced professionals interested in making the transition to the lucrative field of Data Science.

Team members are immediately immersed in handling complex, real-world data while training in all aspects of Data Science, from data storage, data processing, and cloud computing to computational statistics, machine learning, and visualization. After 4 months of intensive training, the AQM team embarks on value-added projects with some of Canada’s largest companies.

Seminar Summary

The seminar will offer an inside look at some cutting-edge Natural Language Processing (NLP) methods currently being employed by AQM, the details of which are outlined below:

1) Word2vec and Word Embedding

Word vector embedding is an unsupervised technique that maps words to finite-dimensional, real-valued vectors. Each vector expresses a word's meaning as a linear combination of 'semantic basis vectors', so words used in similar contexts end up close together in the vector space. We will show how to load a pretrained Word2vec embedder in Python and use it as a preprocessing step in more advanced Natural Language Processing (NLP) tasks, such as classification and the Word Mover's Distance calculation.
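To make the preprocessing idea concrete, here is a minimal sketch, not the seminar's actual code. It assumes a tiny hand-made embedding table (the words and 3-dimensional vectors are invented for illustration) in place of a real pretrained model, and the helper names embed_document and cosine_similarity are hypothetical. A document is turned into a single feature vector by averaging the vectors of its words.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# A real pretrained Word2vec model would be loaded from disk instead,
# for example with gensim's KeyedVectors.load_word2vec_format(path, binary=True).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "fruit": [0.2, 0.1, 0.8],
}


def embed_document(tokens, embeddings):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# "the" has no entry in the toy table, so it is ignored.
royal = embed_document(["king", "queen", "the"], EMBEDDINGS)
food = embed_document(["apple", "fruit"], EMBEDDINGS)

print(cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"]))
print(cosine_similarity(royal, food))
```

The averaged vectors could then be passed to a classifier as features. Averaging is only one simple way to pool word vectors; it discards word order.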

2) Word Mover's Distance (WMD)

WMD generalizes the Euclidean distance between individual word vectors to a distance between whole sentences or documents, each treated as a collection of embedded words. Unlike cosine distance on word-count vectors, which only registers exact word overlap, WMD takes the semantic similarity of the words into account. It works by finding the minimum total cost, measured as mass times distance, of moving the embedded words of one document onto those of the other. We will use a pretrained word vectorizer and WMD to write a simple document similarity tool to help us search for similar documents.
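The mass-times-distance idea can be sketched without an optimization solver by using the relaxed WMD, a known lower bound on the exact distance in which each word simply sends all of its mass to the nearest word in the other document. This is a simplification swapped in for illustration, not the exact WMD and not necessarily what the seminar uses. The 2-dimensional embeddings, the sample documents, and the function names below are all invented assumptions.

```python
import math
from collections import Counter

# Toy 2-dimensional embeddings, invented for illustration only.
EMBEDDINGS = {
    "president": [1.0, 0.0],
    "leader":    [0.9, 0.1],
    "speaks":    [0.0, 1.0],
    "talks":     [0.1, 0.9],
    "banana":    [-1.0, -1.0],
    "ripens":    [-0.9, -1.1],
}


def normalized_bow(tokens):
    """Map each known word to its share of the document's total mass."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no words with an embedding")
    return {word: count / total for word, count in counts.items()}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(source, target):
    """Each source word moves all of its mass to its nearest target word."""
    return sum(
        mass * min(euclidean(EMBEDDINGS[word], EMBEDDINGS[other]) for other in target)
        for word, mass in source.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed WMD: the larger of the two one-sided costs (a lower bound on WMD)."""
    bow_a, bow_b = normalized_bow(tokens_a), normalized_bow(tokens_b)
    return max(one_sided_cost(bow_a, bow_b), one_sided_cost(bow_b, bow_a))


# A tiny similarity search: rank documents by distance to the query.
query = ["president", "speaks"]
corpus = {
    "doc_a": ["leader", "talks"],
    "doc_b": ["banana", "ripens"],
}
ranked = sorted(corpus, key=lambda name: relaxed_wmd(query, corpus[name]))
print(ranked)
```

Note that the query and doc_a share no words at all, so a word-count comparison would see them as unrelated, while the embedding-based distance ranks doc_a as the closer match. With a real pretrained model, gensim's KeyedVectors objects offer a wmdistance method that computes the exact distance.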

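3) TF-IDF Clustering

Term Frequency-Inverse Document Frequency (TF-IDF) is a term-weighting scheme applied to a 'bag of words' (BoW) model: each document becomes a vector whose entries grow with how often a word appears in that document and shrink with how many documents contain it. Documents can then be compared and clustered by the similarity of these vectors. TF-IDF has the advantage of being simple and fast to use. We will take a look at how we can combine K-Means and TF-IDF to cluster a collection of documents based on the similarity of their TF-IDF vectors.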
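The combination of the two steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the seminar's implementation: the four sample documents are invented, the IDF uses one common smoothed variant, tokenization is a plain whitespace split, and the initial centroids are chosen by hand (init_indices is a hypothetical parameter) so that the run is deterministic. In practice a library such as scikit-learn, with its TfidfVectorizer and KMeans classes, would normally be used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector per (non-empty) document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed IDF (an assumed variant) so that no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = [tf[t] / len(tokens) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, init_indices, max_iter=20):
    """Lloyd's algorithm; the initial centroids are the vectors at init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "the cat sat on the mat",
    "a cat and a kitten play",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
labels = kmeans(tfidf_vectors(docs), init_indices=[0, 2])
print(labels)
```

Because the vectors are L2-normalized, ranking by Euclidean distance agrees with ranking by cosine similarity, which is why plain K-Means works reasonably on TF-IDF vectors. Real K-Means implementations usually pick initial centroids randomly or with k-means++ rather than by hand.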