Each student will work on a project of their own choice and design over the course of the semester, culminating in a class presentation followed by a final project delivery. The goal of this project is to make a linguistic discovery through the application of data-intensive methods.
A project consists of three main components: data, analysis, and presentation.
Start with found data. Many linguistics research projects begin with a targeted data collection effort -- field work, surveys, elicitation, human subjects, and more. But the underlying assumption of data science is that data exists in the wild, and it is up to a data scientist to harness it. True to this assumption, we will have you start with data that is found in the wild, be it published data sets, corpora, or social media streams.
Add value. You should not, however, be content with data as it is packaged and presented to you. In many cases, your data will need a lot of work -- sourcing, cleaning up, and reorganizing. In other cases, you may be dealing with published data that's more or less ready for analysis. You are, then, expected to add value: augmenting, annotating and leveraging multiple data sets are all potential avenues.
Follow best data practices. Throughout this semester, we will be learning about best data practices, both emerging and firmly established in data science circles. Make sure your own data efforts and the output are in compliance.
Linguistic analysis. You will have designed your data with a research question in mind. Your data should make a suitable empirical basis for your linguistic inquiry; your research question should be properly motivated and addressed in a theoretically and methodologically sound manner. Your interpretations of the findings should likewise be rigorously supported by your data. Even with meticulous preparation, however, your data may not prove fruitful ground for your original research question in the end. Pivoting is therefore allowed up to a certain point. Whether or not this move is ultimately successful, the reasons for pivoting and/or the failure of the original research agenda must be thoroughly probed and documented: this sort of outcome is part and parcel of research deeply grounded in real-life data, and it provides valuable insight.
Computational methods. In your linguistic analysis, you are expected to employ various computational methods including natural language processing, statistics, machine learning, topic modeling and more. Proper techniques should be used in accordance with your research question and the specifics of your data. At the same time, you should demonstrate mastery of these techniques by justifying your choice of computational methods and thoroughly evaluating the outcome, rather than blindly applying them and accepting the returned output. As with linguistic analysis, failed experimentation should not be brushed aside, but rather receive proper investigation and documentation, as this is all part of the discovery process.
This component encompasses all audience-facing aspects of your project, which include but are not limited to:
- Proper use of GitHub as a project-hosting and publication platform.
- Overall documentation.
- Structure, readability and organization of your Python code in the form of Jupyter notebooks.
- Visualization through graphs and plots.
- Your oral presentation, scheduled in the last two weeks of class.
- Your final report: language, content, clarity, precision, organization, citation, etc.
Weight distribution. Ideally, a project will have the three components in perfect balance: a total of, say, 180 points split equally between data, analysis, and presentation as 60-60-60. In reality, everyone's project will be different: some will have ambitious and challenging data curation plans, while others might wish to focus their efforts on extensive use of advanced computational methods. To accommodate this, a limited trade-off is allowed between the "data" and the "analysis" components: data-focused projects may go up to a 70-50-60 distribution, while projects heavily focused on analysis may go easier on data-related efforts, with up to a 50-70-60 split.
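As a quick illustration of the trade-off just described, the sketch below checks whether a proposed split stays within the stated limits. The function name `is_valid_split` is hypothetical, and the sketch assumes that the presentation share stays fixed at 60 and that any intermediate split between the two extremes (for example 65-55-60) is acceptable; neither assumption is spelled out beyond the "up to" wording.

```python
# Hypothetical helper, not part of any course tooling.
# Assumptions: 180 points total, presentation fixed at 60,
# data and analysis each allowed to range from 50 to 70.
def is_valid_split(data: int, analysis: int, presentation: int = 60) -> bool:
    """Return True if the data/analysis/presentation split is within limits."""
    return (
        presentation == 60
        and 50 <= data <= 70
        and 50 <= analysis <= 70
        and data + analysis + presentation == 180
    )


# Try the balanced split, both extremes, and one split that goes too far.
for split in [(60, 60, 60), (70, 50, 60), (50, 70, 60), (80, 40, 60)]:
    print(split, is_valid_split(*split))
```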
Your project should be initiated and developed in the form of a GitHub-hosted public repository. The final deliverables should include:
- A README document and a LICENSE document accompanying your GitHub repository.
- A written report containing a summary of your data and linguistic analysis. It should be between 5 and 10 pages long, of which a minimum of 3 pages must be devoted to written descriptions (not including charts, graphs, examples, tables, etc.).
- Your data.
- Python scripts, in Jupyter Notebook form, that you created and used to process, explore, and analyze the data.
- Slides or other materials you used for your in-class presentation.
The term project carries a total of 400 points, which you accrue over the course of the semester by meeting several structured milestones. Refer to the Schedule page for the dates.
| # | Milestone | Points | Distribution: Data Ⓓ; Analysis Ⓐ; Presentation ⓟ | What |
|---|---|---|---|---|
| 1 | Project ideas | 20 | ⒹⒹⒶⒶ | Send instructor 1-2 project ideas. |
| 2 | Project plan | 20 | ⒹⒶⓟⓟ | Finalize project plan, create a GitHub project repository. |
| 3 | 1st progress report | 40 | ⒹⒹⒹⒹⒹⒹⓟⓟ | Focus on data curation, report progress. |
| 4 | 2nd progress report | 40 | ⒹⒹⒹⒹⒶⒶⓟⓟ | Continue with data curation, attempt analysis. |
| 5 | 3rd progress report | 40 | ⒹⒹⒶⒶⒶⒶⓟⓟ | Data-side effort should be done; ramp up analysis. |
| 6 | Project presentation | 60 | ⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ | Oral presentation of your work in classroom. |
| 7 | Final project submission | 180 | ⒹⒹⒹⒹⒹⒹⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ ⒹⒹⒹⒹⒹⒹⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ | Turn in final project in the form of a GitHub repository. |
More detail will follow as each milestone approaches.
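For a rough sense of how the 400 points divide across the three components, the sketch below tallies the distribution symbols from the milestone table. The symbol counts are transcribed by hand from the table, and the value of 5 points per symbol is an inference (20-point milestones show 4 symbols), not something stated outright; the names `milestones` and `component_totals` are hypothetical.

```python
# Assumption: each distribution symbol is worth 5 points, inferred from
# the 20-point milestones, which show 4 symbols each.
POINTS_PER_SYMBOL = 5

# Symbol counts per milestone as (data, analysis, presentation),
# transcribed by hand from the milestone table.
milestones = {
    "Project ideas": (2, 2, 0),
    "Project plan": (1, 1, 2),
    "1st progress report": (6, 0, 2),
    "2nd progress report": (4, 2, 2),
    "3rd progress report": (2, 4, 2),
    "Project presentation": (0, 6, 6),
    "Final project submission": (12, 12, 12),
}


def component_totals(table):
    """Sum the points per component across all milestones."""
    names = ("data", "analysis", "presentation")
    totals = dict.fromkeys(names, 0)
    for counts in table.values():
        for name, count in zip(names, counts):
            totals[name] += count * POINTS_PER_SYMBOL
    return totals


totals = component_totals(milestones)
print(totals)
print("Total:", sum(totals.values()))
```

Under these assumptions the per-component sums add up to the 400-point total quoted for the term project.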
You should come up with one or two project ideas. Include these details:
- A working title.
- A brief summary.
- The DATA portion. Example points you should address: What will your data look like? What sorts of data sourcing and cleaning up effort will be involved? Do you have a sense of the overall data size you should be aiming for? Do you have an existing data source in mind that you can start with, and if so, what are the URLs or references?
- The ANALYSIS portion. Example points you should address: What is your end goal? What linguistic analysis do you have in mind? Any hypothesis you will be testing? Are you planning to do any predictive analysis (machine learning, classification, etc.), and using what methods?
- The PRESENTATION portion. We don't expect a whole lot of variability on this, but describe anything noteworthy you have in mind.
Submission: In the `project_ideas/` directory of `Class-Exercise-Repo`, create a markdown-formatted text file named `project_ideas_YOURNAME.md`. Commit, push to your fork, and create a pull request for the instructors.
Launch your project as a GitHub repository and publish a project plan.
- Create a repository within our GitHub class organization.
- Give it a descriptive name that is not too long. Good choice: "Inaugural-Address-Analysis", bad: "Jevon-Term-Project".
- Provide a description. This is a short tagline that appears under your repo title. Start with something simple. Make sure your name is in there. (See the two screenshots above.)
- The repository should be public.
- Initialize with a README.
- This is YOUR repo! No forking necessary: just clone it and get to work.
- Your repo should have the following files:
  - `README.md`: Include your name, project title, and a brief summary here. We'll keep this page minimal for now.
  - `LICENSE.md`: You will eventually need to specify a license for your project. Create it now as a placeholder.
  - `project_plan.md`: This is your project plan. Start with your project ideas document and polish it up.
  - `progress_report.md`: This is where you will log your progress. Add your first entry.
  - `.gitignore`: Include the usual files and directories to be ignored.
- Can't wait to get started? Read the following.
- You are welcome to put other directories and files in your local repo as you see fit, but do NOT commit them to Git yet. Reason: once anything is on Git and GitHub, it's always there (i.e., recoverable) as part of the commit history.
- Likewise, don't commit your data files yet. You are likely unsure at this stage whether or not you have the rights to share the data freely.
- A suggestion: create a directory called `private/` where you will keep any private notes and data files. Add this directory to your `.gitignore` file.
- Having said that, don't be afraid to publish changes to your GitHub repo on an ongoing basis. The instructors have access to your repo's state at any given point in time, so there is no need to keep your repo pristine & frozen until a grade is posted.
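The required files and the suggested `private/` directory could be set up by hand, or with a short script along these lines. This is only a sketch: the function name `create_skeleton` is hypothetical, the placeholder contents are illustrative rather than required wording, and the `.gitignore` entries other than `private/` are an assumption about what "the usual files" means for a Jupyter-based Python project.

```python
from pathlib import Path

# File names come from the list above. The contents are illustrative
# placeholders; the .gitignore entries beyond private/ are assumptions.
SKELETON = {
    "README.md": "# Project title\n\nYour name and a brief summary.\n",
    "LICENSE.md": "License to be decided.\n",
    "project_plan.md": "# Project plan\n",
    "progress_report.md": "# Progress report\n",
    ".gitignore": "private/\n.ipynb_checkpoints/\n__pycache__/\n",
}


def create_skeleton(repo_dir):
    """Create missing skeleton files and a private/ directory.

    Existing files are left untouched. Returns the names of the files
    that were newly created.
    """
    repo = Path(repo_dir)
    repo.mkdir(parents=True, exist_ok=True)
    (repo / "private").mkdir(exist_ok=True)
    created = []
    for name, content in SKELETON.items():
        path = repo / name
        if not path.exists():  # never overwrite existing work
            path.write_text(content, encoding="utf-8")
            created.append(name)
    return created
```

Calling `create_skeleton(".")` from the root of a freshly cloned repo would add whatever is missing; since GitHub already created `README.md` at initialization, that file would be skipped.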
**Submission**: Your project repo counts as your submission.
For the 1st progress report, focus on your data. This milestone consists of 30 data points and 10 presentation points. Goals:
- Attempt and mostly complete the data acquisition process.
- Start cleaning and reorganizing your data, and make headway on it.
- By now, you should have concrete ideas on the "data end game": what your data's final form will be like, the target total size, format, etc.
- Devise a couple of options regarding the "sharing plan" of your data.
Contents:
- `progress_report.md`
  - Create a section entitled "1st Progress Report", and then provide a summary of what you accomplished. Keep it short (a screenful), and provide links to related documents, including your Jupyter Notebook and data samples.
  - Include a subsection where you outline a couple of options (or a single option, if you are fairly sure) regarding the "sharing plan" for your data. You should plan out how much and what you will be sharing. Make sure to include a justification.
- A Python script in the form of a Jupyter Notebook.
  - Provide an overview of your data. Clearly document each step of your data processing pipeline.
  - Compile some basic stats on your data: the size and the makeup are the bare minimum.
  - Bullet points have their uses, but let's see some written summaries and explanations too.
  - Remember: your Jupyter Notebook file is also your presentation. Make it easy for the instructors and your classmates to understand what you are doing. Explain your goals, show your data and your processes.
- Some form of your data. If all of your data is currently stored in a git-ignored directory, make appropriately sized samples available in a directory called `data_samples/`.
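As one possible starting point for the "basic stats" item above, the sketch below computes size and makeup figures for a small collection of texts. The function name `basic_stats`, the toy `sample` data, and the naive whitespace tokenization are all assumptions made for brevity; a real notebook would load its own data and use a tokenizer suited to it.

```python
from collections import Counter


def basic_stats(docs):
    """Return size and makeup figures for a dict of {doc_id: text}.

    Tokenization is a naive whitespace split on lowercased text, an
    assumption made to keep the sketch short.
    """
    tokens_per_doc = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    all_tokens = [tok for toks in tokens_per_doc.values() for tok in toks]
    counts = Counter(all_tokens)
    return {
        "documents": len(docs),
        "tokens": len(all_tokens),
        "types": len(counts),
        "mean_tokens_per_doc": len(all_tokens) / len(docs) if docs else 0.0,
        "most_common": counts.most_common(3),
    }


# Toy stand-in for real data.
sample = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat",
}
stats = basic_stats(sample)
for key, value in stats.items():
    print(f"{key}: {value}")
```

In a notebook, figures like these read best when followed by a sentence or two of interpretation rather than being left as raw output.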
Above are the minimum requirements, but do feel free to impose additional organization as you see fit. This is your project after all! But when you do so, make sure you provide an explanation.
- Change your GitHub repository's name. That changes its URL, so you will need to update your local Git's remotes setting.
- In your `project_plan.md` file, round up the old content into a section and mark it clearly as your old plan, which is no longer current. At the top, write out your new plan.
- Update your `progress_report.md` file with an explanation of what happened and why this change of course was necessary. You'll need your 1st progress report too.
- You should edit `README.md` and any other files accordingly to fit your new project direction. They shouldn't contain references to your old plan.
Submission: Your project repo counts as your submission.