In the first Teaching with Technology Tuesday of the fall 2011 semester, David Newman delivered a presentation on topic modeling to a full house in Bass's L01 classroom. His research concentrates on data mining and machine learning, and he has been working with Yale for the past three years on an IMLS-funded project on the applications of topic modeling in museum and library collections. In Tuesday's talk, David explained what topic modeling is and how it can be useful, and introduced a tool he designed to make the process accessible to anyone who can use a computer.
What is Topic Modeling and How is it Useful?
David introduced topic modeling as an "answer to information overload." In short, it is a way to have a computer automatically search and categorize large archives, combing them for patterns that can be used to get a better idea of what's inside. The process works best when there are thousands to millions of documents involved, and the output can be thought of as a list of subject tags, although that description is not completely accurate. As the computer sifts through the documents, it identifies words (or "tokens") that repeat and words that co-occur, and then groups related tokens into sets. The result is a list of keyword groups that link to the documents that contain those keywords - a form of automated subject classification.
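To make the idea of "words that co-occur" concrete, here is a minimal Python sketch that counts how many documents each pair of words shares. It is a deliberate simplification and not the statistical model that a real topic modeling tool fits; the function names and the three sample sentences are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_counts(documents):
    """For each pair of distinct words, count the documents containing both."""
    pair_counts = Counter()
    for text in documents:
        # Sort the unique words so each pair is always stored in one order.
        vocab = sorted(set(tokenize(text)))
        pair_counts.update(combinations(vocab, 2))
    return pair_counts


# Hypothetical sample documents, not taken from the talk.
docs = [
    "The orchestra played a symphony",
    "The symphony featured a solo violin",
    "The museum acquired a painting",
]
counts = cooccurrence_counts(docs)
print(counts.most_common(3))
```

Pairs with high counts are candidates for belonging to the same keyword group. Note that very common words such as "the" and "a" dominate raw counts like these, which is one reason stopword lists matter.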
Although the computer can never be quite as creative or accurate as a human reader, it compensates in sheer volume - making topic modeling perfect for large data sets. As books are scanned and archives digitized, topic modeling provides a fast way to help collections managers figure out what they are holding, and gives researchers better metadata to quickly find what they need.
Applications of topic modeling are diverse. The NSF uses topic modeling to figure out what subjects are most active in publications, helping to produce "field surveys" that assist in funding decisions and understanding the state of research. Historians can use topic modeling to try to identify changes in the historical record over time. Social scientists may wish to identify trending topics on social networks. Creative humanists can even model long books, although David concedes that the output, even in a long text divided by pages, can vary in quality. At Yale, topic modeling is being applied to art metadata in the Haas Art and Architecture library in an effort to make collections more accessible to researchers. With all of the applications of the technology, aspiring topic-modelers will be glad to know that Dr. Newman has helped to produce a piece of open-source software that makes the process accessible to anyone.
DIY Topic Modeling with the Topic Modeling Tool (TMT)
While topic modeling has applications in diverse disciplines, the amount of intensive computer work involved scares away many academics who could potentially benefit from the technique. For this reason, the tool presented by David focuses on keeping the process simple and automated, allowing the researcher to spend more time analyzing and less time typing.
Accessible here, David's software (called simply the "topic-modeling-tool") is a graphical user interface for an existing open-source project called MALLET, which is included in the download and does the behind-the-scenes heavy lifting. Written in Java for maximum portability, the TMT lets users import text, either as files in a folder or as a single giant text file, set a few options for how they want topics identified, and specify how many topics they want produced. A few minutes later, it outputs results in both HTML and CSV formats, listing the topics generated and the documents containing those topics.
Instructions and sample files are given on the website, and the options are intuitive enough to allow users to "learn by playing," but David gave us some tips on how to approach topic modeling projects with the TMT. Users should expect to increase the number of output topics if they want more fine-grained results. For example, if trying to identify documents that discuss music, 10 topics should be sufficient. If trying to differentiate between types of music, 20 topics may be necessary. Results can also be made more specific through the use of stopwords, which the computer ignores as it models the documents. Stopwords can be used to cut down on word "polluters," such as text that appears frequently in by-lines. Thresholds for tagging can also be set to increase the resolution of results. For example, a document might have to repeat the key text at least five times in order to be tagged.
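The stopword and threshold tips can be illustrated with a short Python sketch. This is not how the TMT implements either option; the stopword list, the function names, the sample documents, and the rule of counting total keyword hits are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list: common words plus by-line "polluters".
STOPWORDS = {"the", "a", "of", "by", "staff", "reporter"}


def content_tokens(text, stopwords=STOPWORDS):
    """Tokenize the text and drop any token found in the stopword list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


def tag_documents(documents, topic_keywords, min_count=5):
    """Return names of documents with at least min_count keyword occurrences."""
    tagged = []
    for name, text in documents.items():
        counts = Counter(content_tokens(text))
        hits = sum(counts[word] for word in topic_keywords)
        if hits >= min_count:
            tagged.append(name)
    return tagged


# Hypothetical documents, each starting with the same by-line.
docs = {
    "review.txt": "By staff reporter. The jazz trio played jazz standards, "
                  "then more jazz, jazz, and jazz.",
    "notice.txt": "By staff reporter. The jazz concert is cancelled.",
}
print(tag_documents(docs, {"jazz"}))
```

With the threshold at five, only the review is tagged; lowering `min_count` to one would tag both documents, trading precision for coverage.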
In addition to being easy to use, TMT is not limited to English, and can process any language with clearly delimited words, including languages that use Cyrillic or Arabic alphabets. Unfortunately, some East Asian languages pose a challenge as the computer has difficulty distinguishing between tokens.
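A small Python sketch shows why word boundaries matter. It uses a generic regular expression as a stand-in tokenizer, which is an assumption for illustration and not the tokenizer the TMT or MALLET actually uses; the sample phrases are invented.

```python
import re


def simple_tokens(text):
    """Split text into runs of word characters, in any script."""
    return re.findall(r"\w+", text)


# Russian marks word boundaries with spaces, so two tokens come out.
russian = simple_tokens("Привет мир")

# Chinese is written without spaces, so the whole phrase
# ("I like music") comes out as one undivided token.
chinese = simple_tokens("我喜欢音乐")

print(len(russian), len(chinese))
```

Text written without spaces generally needs a separate word segmentation step before an approach like this can find useful tokens.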
David's presentation introduced some of the uses and tools of topic modeling, and the TMT opens up this powerful system of analysis to almost anyone. As some audience members pointed out, however, the greatest difficulty of topic modeling arguably comes from getting the data one wishes to analyze into a usable form. Yale has a number of resources to help with this challenge, including an upcoming workshop on using the open source package R in conjunction with Google Documents for data mining, and next week's TwTT workshop, which will include information on how to work with large archives in the humanities.
For full coverage of this session, please click the video below (note a slight delay upon initial playback):