Join Vector Analytics for a day of in-depth training on topic modeling. The workshop can be delivered virtually, or onsite if Covid-19 safety protocols are followed.
Learn how to build better topic models for analysis of customer complaints, consumer reviews, web logs, and other text / unstructured data your organization can leverage for business insights and trend analysis.
The quality of a topic model depends on hyperparameter optimization, ensuring convergence, understanding model diagnostic statistics, careful consideration and treatment of incoming text, structured human review of topics, and the topic model (Latent Dirichlet Allocation) package you decide to employ.
This one-day workshop is packed full of ideas to optimize your topic modeling so your efforts generate a model that yields coherent, understandable, high-quality topics which provide real insight into your organization's data.
ICYDK (In Case You Didn't Know): Topic models are statistical models traditionally used for uncovering the latent or implicit structure in document collections. They simultaneously soft-cluster words and documents into a fixed number of topics. (Definition from Boyd-Graber, Mimno, and Newman; 2014).
This workshop builds upon knowledge that most data scientists learn during initial data mining classes. During this workshop, attendees will be exposed to 50% of the material in Cornell University's Advanced Topic Modeling graduate-level class taught by Dr. David Mimno, an established expert/scholar on topic modeling and maintainer of the MAchine Learning for LanguagE Toolkit (MALLET).
*MALLET is a topic model package developed by the University of Massachusetts Amherst; we will run it from the command line.
This intermediate workshop is targeted at attendees who seek to expand their knowledge and toolkit of topic modeling techniques (such as model evaluation, optimization, and visualization). It is designed for individuals who have some previous exposure to topic modeling and/or text clustering.
This is a Python-based workshop, and attendees should understand how to read/write files, clean text strings, and use regular expressions. Code will be provided in Python scripts and we will "walk through" interesting code, but attendees should be familiar with how to manipulate files and text in Python.
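For a rough sense of the Python comfort level assumed, here is a minimal sketch of the kind of file handling, string cleaning, and regex work described above. It is not taken from the workshop materials: the function names, the tiny stopword list, and the treatment of "XX"-style redaction masks are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# a real project would use a larger, curated list.
STOPWORDS = {"the", "a", "an", "and", "i", "my", "was", "to", "of", "on", "for", "in", "it"}


def clean_and_tokenize(text):
    """Lowercase the text, strip redaction masks and non-letters, return kept tokens."""
    text = text.lower()
    # Assumption: redacted values appear as runs of "x" (e.g. "XX/XX/XXXX").
    text = re.sub(r"\bx{2,}\b", " ", text)
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Drop stopwords and tokens of one or two characters.
    return [tok for tok in text.split() if tok not in STOPWORDS and len(tok) > 2]


def process_file(in_path, out_path):
    """Read one document per line and write one line of space-separated tokens each."""
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            tokens = clean_and_tokenize(line)
            if tokens:  # skip documents that end up empty after cleaning
                dst.write(" ".join(tokens) + "\n")


if __name__ == "__main__":
    sample = "On XX/XX/2021 I called the bank about my mortgage. The bank charged a late fee!"
    print(clean_and_tokenize(sample))
```

If reading and modifying a script like this feels comfortable, the Python used during the exercises should be approachable.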
Attendees are also expected to know how to run programs from the command line and use basic shell commands (a cheat sheet of shell commands will be provided). A list of packages/modules to download and install prior to the workshop will be distributed. Attendees will use their own laptops/powerbooks.
Instructor: Marcia Price, CEO and Chief Data Scientist of Vector Analytics. Read her bio here.
Date/time: You pick the date! 8:30am to 4:30pm (or whatever fits your schedule).
Location: Virtually, at our coworking location (we have a great classroom), or at your facility.
Cost: Dependent on number of attendees.
Max class size: 15 attendees (this limit allows for adequate one-on-one support during class exercises).
Marcia Price is an expert in classical LDA topic modeling. She is the architect of a data fusion platform based on topic modeling that provides a first-of-its-kind analysis of the US Department of Defense budget.
Interested in a workshop customized for your organization's specific needs?
Want to optimize a topic model that ingests your company's data during the workshop?
Want to conduct the workshop in-house to foster proprietary discussion amongst your staff?
No problem, we are happy to customize the workshop to meet your organization's needs. And we are happy to sign an NDA to work with your specific data during the workshop. Contact us to talk details and set it up!
==> We aren't fooling around. This is an intense and interactive learning experience. You will be drinking from a firehose. Attendees will receive a copy of all lecture slides and code used in the workshop. <==
Review of Latent Dirichlet Allocation (LDA) based topic modeling:
Python Gensim LDA versus MALLET LDA:
Exercise: run a simple topic model in Gensim and/or MALLET, explore options. (We'll be using a publicly available complaint dataset from the Consumer Financial Protection Bureau during workshop exercises.)
Optimization of text parsing, cleaning, and tokenization:
Exercise: text processing tricks for topic modeling in Python.
A topic model development workflow:
Diagnostic statistics and model quality (Part I):
Exercise: Review diagnostic statistics for a set of model versions; note good/bad; share in workshop discussion.
Diagnostic statistics and model quality (Part II):
Examining your Gensim topic model using Python's pyLDAvis package:
Examining your MALLET topic model using a custom-built Tableau visualization:
Exercise: examine topic quality using pyLDAvis, note comments, share in round-table discussion.
Visualizing a topic model (all topics) in a 2D visualization:
Mention (brief, just a tease) of content to be covered during our Advanced Topic Modeling workshop offered later this summer:
It's a wrap (a review of the day and workshop evaluation).