This lecture covers text clustering and text similarity. It also introduces two fundamental Python libraries for machine learning that are used in the exercises: NumPy and scikit-learn.
Module | Topic | Lecture material |
---|---|---|
L2-1 | Text clustering | lecture video, slides |
L2-2 | Text similarity | lecture video, slides |
L2-3 | Text clustering evaluation | Read: Zhai & Massung: Section 14.4 |
L2-4 | NumPy tutorial | external video |
L2-5 | scikit-learn | Complete the following two official scikit-learn tutorials: "An introduction to machine learning with scikit-learn" and "Working with text data" |
- Text Data Management and Analysis (Zhai & Massung)
  - Chapter 14 (except Section 14.3)
Key concepts in this lecture:
- Problem of (text) clustering
- Similarity-based clustering algorithms (Agglomerative Hierarchical Clustering and K-means)
- Measuring text similarity (Jaccard and cosine)
- Working with NumPy arrays
- Machine learning basics
- Working with text data in scikit-learn
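
As a rough illustration of two of the concepts listed above (Jaccard and cosine similarity, computed with NumPy arrays), the following is a minimal sketch rather than the course's reference implementation. It assumes whitespace tokenization, lowercasing, and raw term counts as the document representation; the function names `jaccard_similarity`, `cosine_similarity`, and `term_count_vectors` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity between the sets of lowercased whitespace tokens."""
    tokens_a = set(doc_a.lower().split())
    tokens_b = set(doc_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Assumption made for this sketch: two empty documents score 0.0.
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine of the angle between two term vectors."""
    norm_product = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm_product == 0:
        # Assumption made for this sketch: an all-zero vector scores 0.0.
        return 0.0
    return float(np.dot(vec_a, vec_b) / norm_product)


def term_count_vectors(docs):
    """Build a shared vocabulary and one raw term-count row per document."""
    vocabulary = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = np.zeros((len(docs), len(vocabulary)))
    for row, doc in enumerate(docs):
        for tok in doc.lower().split():
            vectors[row, index[tok]] += 1
    return vocabulary, vectors


docs = ["the cat sat on the mat", "the dog sat on the log"]
vocabulary, vectors = term_count_vectors(docs)

jaccard = jaccard_similarity(docs[0], docs[1])
cosine = cosine_similarity(vectors[0], vectors[1])
print(f"Jaccard: {jaccard:.3f}, cosine: {cosine:.3f}")
```

For these two toy documents the sketch prints `Jaccard: 0.429, cosine: 0.750`. The two measures differ because Jaccard only looks at which terms occur (3 shared out of 7 distinct), while cosine also takes the term counts into account (the repeated "the" contributes more to the dot product).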