Topic Modeling Workshop

Tuesday, June 11, 2019. 8:30am to 4:30pm.

Join Vector Analytics for a day of in-depth training on topic modeling right here in Charlotte, NC.

Workshop Overview:

Learn how to build better topic models for analysis of customer complaints, consumer reviews, web logs, and other text / unstructured data your organization can leverage for business insights and trend analysis. 

The quality of a topic model depends on hyperparameter optimization, ensuring convergence, understanding model diagnostic statistics, careful consideration and treatment of incoming text, structured human review of topics, and the topic model (Latent Dirichlet Allocation) package you decide to employ.


This one-day workshop is packed full of ideas to optimize your topic modeling so your efforts generate a model that yields coherent, understandable, high-quality topics which provide real insight into your organization's data.

ICYDK (In Case You Didn't Know): Topic models are statistical models traditionally used for uncovering the latent or implicit structure in document collections. They simultaneously soft-cluster words and documents into a fixed number of topics. (Definition from Boyd-Graber, Mimno, and Newman, 2014.)


This workshop builds upon knowledge that most data scientists learn during initial data mining classes. During this workshop, attendees will be exposed to 50% of the material in Cornell University's Advanced Topic Modeling graduate-level class taught by Dr. David Mimno, an established expert/scholar on topic modeling and maintainer of the MAchine Learning for LanguagE Toolkit (MALLET).

Click here to register for the workshop!

What attendees will learn:


  • Under the hood: understand how Latent Dirichlet Allocation (LDA) actually works. 
  • Python Gensim LDA versus MALLET* LDA.
  • Optimization of text parsing, cleaning, and tokenization.
  • A topic model development workflow.
  • Understanding diagnostic statistics.
  • Evaluating model quality.
  • Determining the right number of topics.
  • Visualization of topic models.
  • Click here to see the detailed workshop syllabus.

*MALLET is a topic model package developed by the University of Massachusetts Amherst; we will run it from the command line.

Workshop Prerequisites:

(i.e., recommended background of attendees):

This intermediate workshop is targeted at attendees who seek to expand their knowledge and toolkit of topic modeling techniques (such as model evaluation, optimization, and visualization). It is designed for individuals who have some previous exposure to topic modeling and/or text clustering.

This is a Python-based workshop, and attendees should understand how to read/write files, clean text strings, and use regular expressions. Code will be provided in Python scripts and we will "walk through" interesting code, but attendees should be familiar with how to manipulate files and text in Python.

Attendees are also expected to be comfortable running programs from the command line and using shell commands (a cheat sheet of shell commands will be provided). A list of packages/modules to download and install prior to the workshop will be distributed. Attendees will use their own laptops/powerbooks.

Workshop Logistics:

Instructor: Marcia Price, CEO and Chief Data Scientist of Vector Analytics. Read her bio here.

Date/time: Tuesday, June 11, 2019. 8:30am to 4:30pm.

Location: Advent Coworking, 933 Louise Avenue, Charlotte, NC 28204. (map here).

Cost: $360 per attendee registering by May 31, 2019 (includes a 20% early-bird discount). $450 per attendee registering after May 31, 2019.

Max class size: 15 attendees (allowing lots of one-on-one support during exercises). 

Click here to register (via Eventbrite). Payment is due upon registration. You may cancel your registration (you will be refunded less the Eventbrite fee and credit card processing fees), and your registration is fully transferable to another individual from your organization.

Lunch and coffee/snacks: Will be provided. 

Marcia Price is an expert in classical LDA topic modeling. She is the architect of a data fusion platform based on topic modeling that provides a first-of-its-kind analysis of the US Department of Defense budget.

Find out more about Vector's DIA Platform

Want a Custom Workshop?

Interested in a workshop customized for your organization's specific needs?

Want to optimize a topic model that ingests your company's data during the workshop?

Want to conduct the workshop in-house to foster proprietary discussion amongst your staff? 

No problem, we are happy to customize the workshop to meet your organization's needs. And we are happy to sign an NDA to work with your specific data during the workshop. Contact us to talk details and set it up!

Detailed workshop syllabus:

==>  We aren't fooling around. This is an intense and interactive learning experience. You will be drinking from a firehose. Attendees will receive a copy of all lecture slides and code used in the workshop. <==

Review of Latent Dirichlet Allocation (LDA) based topic modeling:

  • Review/intro (will be adjusted based on attendees' prior knowledge, though some prior exposure to topic modeling is assumed).
  • Most importantly: when it is appropriate to use topic modeling and where it works best.
  • Under the hood: let's understand how Latent Dirichlet Allocation works in more detail (so later, we can better tune hyperparameters).
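
For a taste of the "under the hood" material, here is a minimal sketch of a collapsed Gibbs sampler for LDA, written in plain Python. It is a teaching toy and not the workshop's code, nor the implementation inside Gensim or MALLET. The function name gibbs_lda, the four toy documents, and the default values for alpha and beta are all assumptions chosen for illustration. It shows where the two hyperparameters enter: alpha smooths the document-topic counts and beta smooths the topic-word counts.

```python
import random


def gibbs_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over pre-tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]            # tokens in doc d assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word w assigned to topic k
    topic_total = [0] * num_topics                           # all tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional: (doc likes topic) * (topic likes word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Hypothetical toy corpus, loosely in the spirit of consumer complaints.
docs = [
    "loan payment late fee".split(),
    "card fraud charge dispute".split(),
    "loan mortgage payment escrow".split(),
    "card charge fraud alert".split(),
]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2)
print(doc_topic)
```

Raising alpha in this sketch pushes each document toward a more even mix of topics, and raising beta pushes each topic toward a more even mix of words, which is the intuition behind tuning these "dials" later in the day.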

Python Gensim LDA versus MALLET LDA:

  • The differences.
  • The pros/cons of each.
  • Why you should try both.
  • MALLET from the command line or through the Python wrapper: which is best?

Exercise: run a simple topic model in Gensim and/or MALLET, explore options. (We'll be using a publicly available complaint dataset from the Consumer Financial Protection Bureau during workshop exercises.)

Optimization of text parsing, cleaning, and tokenization:

  • To stem or lemmatize?
  • "Super" lemmatization trick.
  • Stopword lists (why it's important to keep your domain in mind).
  • Deletion phrases versus stopwords.
  • Text hints from David Mimno and David Blei (the fathers of topic modeling).
  • Uni-, bi-, or tri-grams (how to implement, how to evaluate).
  • To prune or not to prune (the dictionary).

Exercise: text processing tricks for topic modeling in Python.
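
To give a flavor of the text-processing decisions listed above, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the workshop's code: the phrase list, both stopword sets, and the preprocess function are hypothetical, and "deletion phrases" is read here as multi-word boilerplate removed before tokenization, which may differ from how the term is used in class.

```python
import re

# All three lists are hypothetical examples; real lists depend on your domain.
DELETION_PHRASES = ["to whom it may concern", "thank you for your time"]
STOPWORDS = {"the", "a", "i", "my", "to", "and", "was", "on", "for", "of"}
DOMAIN_STOPWORDS = {"xxxx", "account"}  # e.g. redaction masks, ubiquitous domain terms


def preprocess(text, join_bigrams=frozenset()):
    """Lowercase, drop deletion phrases, tokenize, drop stopwords, merge bigrams."""
    text = text.lower()

    # Deletion phrases are removed as whole strings, before tokenization.
    for phrase in DELETION_PHRASES:
        text = text.replace(phrase, " ")

    # Keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in STOPWORDS and t not in DOMAIN_STOPWORDS]

    # Merge adjacent token pairs that appear in the supplied bigram set.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in join_bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sample = ("To whom it may concern, I was charged a late fee "
          "on my credit card account XXXX.")
result = preprocess(sample, join_bigrams={("late", "fee"), ("credit", "card")})
print(result)  # ['charged', 'late_fee', 'credit_card']
```

In this sketch the bigram set is supplied by hand; in practice it would come from a phrase-detection step, and the order of the steps (stopword removal before or after bigram merging) is itself one of the choices worth evaluating.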

A topic model development workflow:

  • Let's review a generic workflow or pipeline for development of a high quality topic model.
  • Note differences between Gensim and MALLET (based on package output files).
  • Identify supplemental packages/libraries, visualization tools, and custom code (some provided by Vector) required for optimizing topic models.
  • Let's talk about when and how to engage a domain Subject-Matter-Expert (if that's not you).

Diagnostic statistics and model quality (Part I):

  • Perplexity: how it is calculated and what it is good for.
  • Ten topic model diagnostic statistics explained.
  • Model convergence: how to tell when the model is baked.

Exercise: Review diagnostic statistics for a set of model versions; note good/bad; share in workshop discussion.
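
As a small preview of the perplexity discussion, the standard definition can be sketched in a few lines: perplexity is the exponential of the negative average per-token log-likelihood on held-out text. The function name and the made-up input values are assumptions for illustration; individual packages report related quantities in their own formats, so check each package's documentation before comparing numbers.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities the model assigns to each held-out token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_log_likelihood = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-avg_log_likelihood)


# Sanity check: a model that gives every token probability 1/8 is exactly
# as "perplexed" as a uniform choice among 8 words.
uniform_eight = [math.log(1 / 8)] * 5
print(perplexity(uniform_eight))
```

Lower perplexity means the model predicts held-out tokens better, but it says nothing directly about whether the topics are coherent to a human reader, which is why the other diagnostics and human review are also on the agenda.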

Diagnostic statistics and model quality (Part II):

  • Fused, duplicate, and messy topics: how to find; how to minimize.
  • Human review: unfortunately, it's required, but let's make it easier.
  • The "dials" we have to optimize a topic model.
  • A methodology for tracking model versions, hyperparameter tuning, and topic quality.
  • How to home in on the "right" number of topics.
  • Importance of an iterative, "design of experiments" approach to model optimization.

Examining your Gensim topic model using Python's pyLDAvis package:

  • Understanding saliency, relevance.
  • Understanding the intertopic distance map.
  • In-class exploration of the pyLDAvis interactive tool.
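
To make "relevance" concrete before the in-class exploration, here is a minimal sketch of the relevance score as defined by Sievert and Shirley for LDAvis: a weighted blend of a word's log probability within a topic and its log lift over the corpus-wide probability. The function names, the default weight, and the two-word example probabilities are made up for illustration; this is not pyLDAvis's own code.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)); the default lam is illustrative."""
    log_prob = math.log(p_word_given_topic)
    log_lift = math.log(p_word_given_topic / p_word)
    return lam * log_prob + (1 - lam) * log_lift


def rank_words(topic_probs, corpus_probs, lam):
    """Order a topic's words from most to least relevant for a given lam."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )


# Hypothetical numbers: "payment" is frequent everywhere, "escrow" is rare
# in the corpus but concentrated in this topic.
topic_probs = {"payment": 0.10, "escrow": 0.05}
corpus_probs = {"payment": 0.08, "escrow": 0.005}

print(rank_words(topic_probs, corpus_probs, lam=1.0))  # ['payment', 'escrow']
print(rank_words(topic_probs, corpus_probs, lam=0.0))  # ['escrow', 'payment']
```

Sliding the weight toward 1 favors words that are simply frequent in the topic, while sliding it toward 0 favors words that are distinctive to the topic, which is the behavior of the relevance slider in the interactive tool.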

Examining your MALLET topic model using a custom-built Tableau visualization:

  • Use Vector's Tableau workbook as example.
  • Importance of reviewing high-proportion document texts during topic quality assessment.
  • Python code and instructions for creating a vis of MALLET output will be provided.

Exercise: examine topic quality using pyLDAvis, note comments, share in round-table discussion.

Visualizing a topic model (all topics) in a 2D visualization:

  • The benefits of this visualization.
  • What you should see (if your model is good).
  • Some options (and why they don't always work): pyLDAvis, NetworkX, clustering, simple organization chart.
  • Thoughts on PCA, MDS, distance selection, normalization/scaling.
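
On distance selection, one common choice for comparing two topics is the Jensen-Shannon divergence between their word distributions, since it is symmetric and always finite. The sketch below is a plain-Python illustration with hypothetical three-word topics; it is offered as one option under discussion, not as the distance the workshop recommends.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint distribution."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


# Hypothetical topics over a three-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
topic_c = [0.6, 0.3, 0.1]

print(js_divergence(topic_a, topic_b))  # far apart
print(js_divergence(topic_a, topic_c))  # close together
```

A matrix of such pairwise values is the kind of input that PCA or MDS then projects into two dimensions, so the choice of distance and any normalization applied beforehand shape what the final map looks like.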

Mention (brief, just a tease) of content to be covered during our Advanced Topic Modeling workshop offered later this summer:

  • Dynamic topic modeling (the best way to analyze topics over time).
  • Neural topic modeling, word embeddings, LDA2Vec.
  • Thoughts on GloVe, ELMo, BERT.
  • Examples of using topic models for info retrieval, data fusion, and anomaly/outlier detection (for network monitoring, insider threat detection).
  • Distributed (cloud-based) topic modeling.

It's a wrap (a review of the day and workshop evaluation).

Click here to register for the June 11, 2019 Workshop! Register by May 31, 2019 to receive the 20% early-bird discount!

Questions? Call us at 910-585-6228 or use our contact form!