This is not just a link dump. These resources are carefully curated textbook stand-ins, and you are fully expected to learn from them! There are multiple types:
- Online tutorials. Watch, practice, and learn. I pre-screened these and narrowed them down to the essential, relevant content, so you can stop wondering if you should learn the whole thing!
- Articles. Read them -- they will be referenced in lectures and used in classroom discussions.
- Books and book chapters. Python Data Science Handbook aligns neatly with our data science focus and doubles as a reference book. Parts of the NLTK Book will also be referenced.
- Software installation links. Download and install on your machine.
- Bookmark pages. These are lists of useful links compiled by someone else, which often contain pointers to data sets or resources. Explore them and use them as needed; you should become familiar with what's on them.
- References -- for looking things up.
- Linguistics Data Repositories, Guides at OSH [link]
- Linguistic Linked Open Data [link]
- Linguistic Data Consortium (LDC) [link]
- Data Management Plans for Linguistic Research, Workshop at 2017 LSA Summer Institute [link] [slides]
- Justin Kitzes. (2018) The Basic Reproducible Workflow Template. In Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) The Practice of Reproducible Research. [link]
- D-Scholarship @ Pitt: Institutional Repository at the University of Pittsburgh [link]
- Copyright and Intellectual Property Toolkit by Lauren Collister [link]
- TEI: A Gentle Introduction to XML [link]
- json.org: Introducing JSON [link], JSON example (vs. XML) [link]
- Stefan Th. Gries and John Newman. (2013) Creating and using corpora. In Robert J. Podesva and Devyani Sharma (Eds.) Research Methods in Linguistics. [PDF]
- NLTK Book Ch.2 Accessing Text Corpora and Lexical Resources [chapter]
- NLTK Book Ch.11 Managing Linguistic Data [chapter]
- NLTK Corpora Index [link] [GitHub repo]
- FSNLP Ch.4 Corpus-Based Work Links [link]
- Corpus-based Linguistics Links [link]
- Corpus Resource Database (CoRD) [link]
- NLTK Book Ch.11 Managing Linguistic Data [chapter]
- Adding Linguistic Annotation, Geoffrey Leech [link]
- Natural Language Annotation for Machine Learning [Chapter 1, full ebook]
- WordNet: a lexical database for English [link]
- Universal Dependencies [project home]
- AMR: Abstract Meaning Representation [project home]
- WebAnno annotation tool [home, GitHub]
- Python Data Science Handbook. (2016) O'Reilly Media [book]
- (DataCamp) Introduction to Python for Data Science, Ch.4 NumPy [tutorial]
- (DataCamp) Intermediate Python for Data Science. Focus on Matplotlib, NumPy & pandas. [tutorial]
- (DataCamp) pandas Foundations [tutorial]
- (DataCamp) Manipulating DataFrames with pandas [tutorial]
- Visualization: pandas 0.20.3 documentation [link]
- Chris Albon's Notes on ML & AI, "Data Wrangling" [link]
- 19 Essential Snippets in Pandas [link]
- Favio Vasquez's list of Data Science Cheatsheets to rule the world [link]
- Twitter text mining tutorials: [The Code Way], [Adil Moujahid], [Marco Bonzanini]
- Scrapy tutorial [link]
- Mapping the United Swears of America by Jack Grieve [link]
- Python Data Science Handbook. (2016) O'Reilly Media [book]
- Movie Reviews Sentiment Analysis with Scikit-Learn [link]
- Topic Modeling with Scikit Learn [link]
- (DataCamp) Supervised Learning with scikit-learn [tutorial]
- (DataCamp) Unsupervised Learning in Python [tutorial]
- (DataCamp) NLP Fundamentals in Python [tutorial]
- Why and How to Use Pandas with Large Data (but not big data...) [link]
- A Beginner's Guide to Big O Notation [link]
- Learn Big Data Analytics using Top YouTube Videos, TED Talks & other resources [link]
- spaCy: Industrial-Strength Natural Language Processing in Python [link]
- CRC: Center for Research Computing at Pitt [link] [h2p] [hub]
The resources below focus more on software tools.
- Git download & installation [link]
- Software Carpentry Lesson: Version Control with Git [tutorial]
- LSA 2019 Reproducible Research Workshop tutorials: Part 1 Intro to Git, Part 2 Linking Git with GitHub
- GitHub Help: Fork a repo [link]
- How to get started with Git and GitHub [YouTube]
- git - the simple guide [link]
- Tutorials: Become a git guru. (Uses BitBucket instead of GitHub, ignore parts on SVN) [link]
- Anaconda Python download & installation: use version 3.7. [link]
- Don't want Anaconda? If you already have another Python distribution installed, you can simply add on `jupyter` via `pip3`. Follow the directions on this page, under OPTION 2 for Python.
- Dataquest Tutorial: Jupyter Notebook for Beginners [link]
- Jupyter Notebook Tutorial: The Definitive Guide on DataCamp (more advanced) [link]
- Software Carpentry Lesson: The Unix Shell [link]
- Thirty Useful Unix Commands [PDF]
- Regular Expressions Tutorial [link]
- The best introduction ever to `grep`, by softpanorama.org [link]
- Atom [link], recommended for all systems.
- Also good: Notepad++ [link] (Windows only) and Sublime Text [link] (all platforms).
- On the command-line side, `nano` is easiest to use. It is already on Macs; on Windows, it comes with Git-Bash.
The topics below are not among the focus areas of this course, but parts of them will be relevant. They are provided for reference.
- Natural Language Toolkit (NLTK) Project Home [link]
- NLTK Book, Python3 Edition [index] [navigation panel]
- NLTK How-tos [link]
- LING 1330/2330 Intro to CL
- Python 3 Quick Reference [PDF]
- Python 3 Notes