- Project 1: ESL Article Acquisition
- What I liked: The project was very well-organized and, for the most part, easy to understand. The graphs were readable, clear, and colorful. I also took a peek at the first progress report notebook, and the code was simple and straightforward, which is a good thing in my opinion; I have a habit of taking an overcomplicated approach to some projects, so the fact that this student used fairly simple code to filter the data and produce fascinating results gives me hope that this might be a bit easier than I first thought.
- What could be improved: I think it would have helped the project to include more than just the three languages. There could be other important aspects of article acquisition beyond whether a language has articles or not, such as what sort of mix of articles a given L1 has and how that speaker's L1 shows up in their use of English.
- What I learned: I learned that a "zero article" exists. I thought there was just an absence of articles in some cases, rather than a specific term for the absence of an article. Kinda reminds me of the empty set, for anyone familiar with set theory.
- Project 2: Reddit Comment Analysis
- What I liked: Both machine learning and data from social media are super interesting to me, so I was excited to see a project that combined the two. I like how diverse the subreddits chosen for the projects were; I saw r/politics and figured it'd be a very interesting project. I was also impressed with the data cleaning, considering the size of the corpus used.
- What could be improved: I wasn't entirely sure what was being analyzed here; I looked at the project plan, and "linguistic trends" seems a bit vague. I also wasn't really able to figure out what these trends were, but maybe that's because I'm not familiar with the types of graphs that were used. Also, I frequently use Reddit, but not everyone does, so including a glossary of Reddit-specific terms would have been helpful.
- What I learned: I learned that the lzma module exists and is super helpful for working with compressed files, which makes large datasets much more manageable. I don't intend on using a dataset nearly as large as this for my project in this class, but it could be useful if/when I work with larger datasets in the future.
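A minimal sketch of how Python's built-in lzma module can stream a compressed file line by line; the filename and JSON fields here are placeholders, not the project's actual setup:

```python
import json
import lzma

# Placeholder filename; comment dumps are often .xz-compressed JSON lines.
path = "comments_dump.xz"

with lzma.open(path, mode="rt", encoding="utf-8") as f:
    for line in f:
        comment = json.loads(line)
        body = comment.get("body", "")  # work with one comment at a time
```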
- Project 1: russian_rhyme
(https://github.com/Data-Science-for-Linguists-2019/russian_rhyme)
- What I liked: There is a really clear project plan that includes research goals and discussion. This makes it easier for the audience to grasp the big picture of the project with less confusion. The author also did a good job on the analysis part; some other project plans have interesting ideas but tend to be vague about the analysis overall.
- What could be improved: In the final report, the author was probably rushed towards the end. It seems to me that the analysis could be better explained, with the important pieces of information laid out more clearly.
- What I learned: We've been told to use comments to explain our thoughts and logic while coding, but not everyone keeps up this good habit (me, for instance…). The author makes sure the audience can understand the ideas behind the code by commenting almost all the time! This way, the code and analysis make more sense to the audience. Also, this is VERY important if we will be collaborating with other data analysts or programmers in the future! Explaining what you are coding and why can save a lot of time in teamwork (a lesson learned…).
- Project 2: Blog Sentiment Analysis
(https://github.com/Data-Science-for-Linguists-2019/Blog-Sentiment-Analysis)
- What I liked: I do enjoy the visualizations provided in the progress reports. Instead of dumping long, overly detailed output, the author kept things clean and organized.
- What could be improved: The project topic seems to be an interesting and trendy one. Sentiment analysis projects can be really fascinating in nature; however, they are often swamped by the huge amount of information the project is trying to analyze and explain. The author could probably narrow down the scope so that the audience (and the author) is less overwhelmed.
- What I learned: As I mentioned, I like that the author keeps the visualizations short and clean. Furthermore, I think Jupyter nbviewer is a nice way to share and view notebook files instead of loading them on GitHub, which can be slow at times. I've seen nbviewer in other projects too, and I think I will keep using it in the future.
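For my own reference, a tiny sketch of how an nbviewer link for a notebook hosted on GitHub can be put together; the user, repo, branch, and path here are placeholders, not the author's:

```python
# Placeholder values, just to show the URL pattern nbviewer expects for GitHub notebooks.
user, repo, branch, path = "someuser", "somerepo", "master", "notebooks/analysis.ipynb"
url = f"https://nbviewer.jupyter.org/github/{user}/{repo}/blob/{branch}/{path}"
print(url)
```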
- Project 1: Bigram analysis of writing from the ELI
- What I liked: Data cleaning efforts are particularly well documented. Due to the nature of the data, it was necessary to anonymize, normalize code, and deal with Null/NaN values and empty strings, among other tasks. I also like that the project found a practical, straightforward application for bigram (token and type) frequencies.
- What could be improved: I think it would be useful to account for the words included in the prompts. If there were bigrams in the prompts with a very high (or very low) MI, that could skew the results. I might have missed it, but I don't think the predictive model accounts for individual learner variation, which is also important for this kind of analysis.
- What I learned: I was not familiar with the metric of Mutual Information (Simpson-Vlach & Ellis, 2010). I also learned about the challenges (and benefits) of doing collaborative corpus research.
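Since MI was new to me, here is my understanding of collocational Mutual Information as it is commonly defined in corpus work; the project's exact formula may differ, and the counts below are made up:

```python
from math import log2

def mutual_information(f_bigram, f_w1, f_w2, n_tokens):
    # MI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
    #            = log2( f(w1, w2) * N / (f(w1) * f(w2)) )
    return log2((f_bigram * n_tokens) / (f_w1 * f_w2))

# Toy counts: a bigram seen 30 times in a 1M-token corpus,
# whose parts occur 200 and 500 times respectively.
print(mutual_information(30, 200, 500, 1_000_000))  # ≈ 8.2, a strong collocation
```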
- Project 2: ESL article acquisition
- What I liked: The theoretical foundation of the project is very clear. The choice of L1s makes sense because of the [art] feature of each language. The repository is clearly organized, and I like that there are different notebooks for different tasks.
- What could be improved: K-bands might be a good measure to add to the analyses to find further differences or similarities between L1 Arabic and L1 Spanish/L1 Korean learners. I think it would also be useful to see if there are any outliers that might be skewing the group aggregates.
- What I learned: I wasn't familiar with Jupyter nbviewer. I think it's a nice complement for the GitHub repository. I also learned of the spaCy library. On a more practical note, I learned how to make use of regular expressions for efficient and transparent data cleaning.
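A toy sketch of the kind of regex-based cleaning I have in mind; the patterns and the placeholder anonymization tag are mine, not taken from the project:

```python
import re

def clean_text(text):
    # Hypothetical cleaning steps, not the project's actual pipeline.
    text = re.sub(r"<NAME\d*>", "", text)      # strip anonymization placeholders
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    text = re.sub(r"[^\w\s'.,!?-]", "", text)  # drop stray symbols
    return text.strip()

print(clean_text("Hello   <NAME1> ,  how   are\tyou??"))
```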
- Project 1: Spell-Checker
- What I liked: I really liked how thought out this project seemed to be. A lot of issues seemed to be accounted for, and as I read, it was just "here's a potential problem, here's how to deal with it." Spell checkers are always fun to look at because spelling is just so subjective. Looking at how each word could be corrected was really interesting.
- What could be improved: After fixing a lot of issues, it seemed like there wasn't much data left to work with; in this case, finding more data seems helpful. The first suggested correction was often not the one I would have immediately made, but context, knowing what the student was writing about, and knowing whether the student was actually answering the prompt or just writing are all factors in how a correction should be made. The prompts were part of the data analyzed, but the corrections were not personalized to the prompts.
- What I learned: The spell checker was part of a Python library, which definitely caught my eye, because when we worked with spell checking in 1330, I think it may also have been part of a library, but it was definitely a different setup. Seeing how this particular student used Python was fascinating, and it constantly reminds me of how much is already out there for our use.
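For comparison, a minimal sketch using the pyspellchecker package (pip install pyspellchecker); this is probably not the same library the project used, it just shows the general shape of a dictionary-based checker:

```python
from spellchecker import SpellChecker

spell = SpellChecker()
words = ["I", "recieved", "the", "pakage", "yesterday"]

# Flag out-of-dictionary words, then ask for the best guess and the full candidate set.
for word in spell.unknown(words):
    print(word, "->", spell.correction(word), "| candidates:", spell.candidates(word))
```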
- Project 2: Gendered Interaction Online
- What I liked: I liked being able to see how women and men interact with each other. The hypotheses posed before the analysis were carefully thought out, and I think for the most part they were correct. I was so interested, but not surprised, to see that men dominated most sites, and I liked how the student accounted for the different reasons why this might be, and didn't just assume that men were just annoying.
- What could be improved: I think exploring different reasons why there might be discrepancies would be really revealing. I liked how improvements were also included in the presentation, and I agree with those as well.
- What I learned: Like with the Spell-Checker, TfidfVectorizer and MultinomialNB are things I wasn't familiar with, so seeing how they work was pretty cool.
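A bare-bones sketch of how the TfidfVectorizer + MultinomialNB combination fits together in scikit-learn; the texts and labels are invented for illustration, not the project's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: short texts with gender labels, standing in for the real corpus.
texts = ["example comment one", "another example comment", "yet another post", "one more post"]
labels = ["F", "M", "F", "M"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["a brand new comment"]))
```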
- Project 1: Reddit Comment Analysis
- What I liked: I'm just getting back into reddit, so this topic definitely interests me. I like that the author uses the niche aspects of reddit to get the best results, in this case through flairs. This is a situation where, much like with linguists in data science, specific knowledge of a topic can help yield better results.
- What could be improved: Though the findings are interesting, the report is quite confusing. The author laid out several goals in his project plan, but I only see him predicting subreddits in the actual report. I would've liked to have seen some of the other questions explored.
- What I learned: I learned that predictive modeling can be used to determine the origins of text based on words, which would be interesting to do with Twitter data if I were to change my topic, especially if used in conjunction with sentiment analysis.
- Project 2: Blog Sentiment Analysis
- What I liked: Since I'm also doing sentiment analysis, this project caught my eye. One thing I liked specifically was that the author used the astrological sign of the blogger as a way to classify them. Classifying text is one of the main tasks in this class, and this was a very creative way to do that.
- What could be improved: I felt that the subject matter here was quite vague. The title was "Blog Sentiment Analysis," yet half of the report was on other things like word frequencies. I think the project would've been much better if the author had delved into the nuance of the sentiment analysis.
- What I learned: I learned that VADER, part of the nltk library, provides objects that are very useful for sentiment analysis. I will have to determine whether or not this is better than the one I found, TextBlob, and consider using it instead.
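A quick side-by-side sketch of the two options as I understand them; the sentence is made up, and both packages need their lexicons installed first:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)

text = "This blog post is surprisingly good, but the ending was a letdown."

vader = SentimentIntensityAnalyzer()
print(vader.polarity_scores(text))  # neg/neu/pos plus a compound score in [-1, 1]
print(TextBlob(text).sentiment)     # polarity in [-1, 1] and subjectivity in [0, 1]
```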
- Project 1: Reddit Comment Analysis
- What I liked: First of all, the memes in the presentation. Big look. Also, I really enjoyed his persistence in his ML efforts. Tremendous stuff. It looks like it was primarily an ML project, with a decent amount of data organizing (and downsizing), with EDA being the lightest component. I think his project is similar to mine, but my proportions will likely be very different.
- What could be improved: I think the presentation didn't really flesh out his motivations for the methods he used. I feel like that would be very useful information for someone like me right now, and also for anyone looking at this project with a critical eye.
- What I learned: I'm definitely going to be referencing the ML techniques from this project, as he really did an impressive job upping his baseline numbers. His data also reminds me of mine in some ways, and since my biggest weakness is data cleaning, I think this project will be a great jumping-off point for me.
- Project 2: Blog Sentiment Analysis
- What I liked: Eva was my TA last semester, how the tables have turned! Just kidding, this is a really cool and well-executed project. I really loved how she lined up her project's main goals into the three/four main categories we've been talking about: what can I get (EDA/cleaning), can ML help me get more (ML), and how do I visualize it (analysis)? And also, hey, three project reports.
- What could be improved: This is pretty nit-picky, but I think she could have used color better. I actually loved her visualizations and thought she used graphs very intuitively, which is why the colors stuck out to me. They were rainbow when a solid color would've made more sense, and they were solid when they might have needed clearer distinction. Also, 'negative' was green, which is kind of contrary to synesthesia. (A tiny sketch of what I mean follows this review.)
- What I learned: Data visualization is very important. We're working on projects for three or four months here, we should make our conclusions obvious and convincing. I think she had great, achievable goals and accomplished them. Hoping I can replicate this!
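A throwaway matplotlib sketch of the color point, with made-up numbers: one categorical series usually reads better in a single solid color than in a rainbow palette.

```python
import matplotlib.pyplot as plt

# Made-up counts for one categorical series; not Eva's data.
signs = ["Aries", "Taurus", "Gemini", "Cancer"]
counts = [120, 95, 140, 110]

plt.bar(signs, counts, color="steelblue")  # one solid color for one series
plt.ylabel("Number of blog posts")
plt.show()
```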
- Project 1: Sentiment Analysis of Figures in the New York Times
- What I liked: This project has a lot of potential as an interdisciplinary foray into US history through the lens of linguistics, particularly in the 4th research question "are there any months that harbor a particular sentiment?" I also like that this repo contains a copy of the presentation that the author gave that further details their motivation behind the project and contextualizes the results.
- What could be improved: Not all terminology is well-defined, such as "compound sentiment score," and this means the accompanying visualizations are also somewhat difficult to interpret. This is further compounded by the fact that the notebook containing the data cleaning and analysis portions of this project is not well-commented, leaving readers with little clarity as to the mechanics of the project beyond what they are able to discern for themselves.
- What I learned: In short, this has reaffirmed for me the importance of having well-documented code and data for people viewing your work. Trying to parse out an understanding of the author's analysis was difficult, given that at times the visualizations seemed not to match up with the written analysis.
- Project 2: ESL Article Acquisition
- What I liked: This author does an excellent job of showing how they considered multiple ways to analyze their data, going from TTR to a more refined measure of lexical complexity (Guiraud's R) and clearly documenting the process (both measures are sketched after this review). It is easy, as a reader, to follow how this project went from idea to realization.
- What could be improved: It's somewhat unclear to me what role "level" is supposed to be playing in this work because it's not given in terms that may be comprehensible to a broad audience. What is the difference between each of the levels? Do they equate to a common proficiency framework, like the ACTFL or CEFR scales? More information would be helpful.
- What I learned: The prose element of a project is equally as important as the coding and the data that inform the final product. The narrative that accompanies this project does a great job at contextualizing the materials that accompany it and showing the process of how this work developed.
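For my own notes, the two measures as they are standardly defined; the project's implementation may differ in details like tokenization:

```python
from math import sqrt

def ttr(tokens):
    return len(set(tokens)) / len(tokens)        # type-token ratio

def guiraud(tokens):
    return len(set(tokens)) / sqrt(len(tokens))  # Guiraud's R, less sensitive to text length

sample = "the cat sat on the mat and the dog sat too".split()
print(ttr(sample), guiraud(sample))
```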
- Project 1: [Bigram analysis of writing from the ELI](https://github.com/Data-Science-for-Linguists/Bigram-analysis-of-writing-from-the-ELI/blob/master/final_report.md)
- **What I liked:**
He thoroughly explained all the decisions he made during the process, making it easy to follow along. The heatmap is a very nice way to display the accuracy of his model's predictions and the ways it made mistakes.
- **What could be improved:** Figure 11 seems pretty strange; he's using the graded level as the response variable here. He seems to be predicting level from MI score, which is a fine idea, but there are only 3 possible values for level, which makes this a pretty opaque way to display the expected values.
- **What I learned:**
This is the first time I've been exposed to MI, and it seems like a very useful tool for comparing the speech of speakers at different levels. I'll look into it further; if it checks out, it could be very useful for my project.
- Project 2: [2016 Election Project](https://github.com/Data-Science-for-Linguists/2016-Election-Project/blob/master/final_report.md)
- **What I liked:**
It's obvious that a lot of care went into how the data was presented; everything came out beautifully. The project itself was very interesting, especially the comparisons between Trump and Clinton as the debates progressed. You can clearly see Trump dropping his use of professional titles in response to Clinton's decision to only refer to him by his first name.
- **What could be improved:**
The only thing I can think of is that name-calling was displayed in the key of the graphs but didn't seem to appear at all. I'm sure there was some technical issue with the tagger but this would have been a great addition to the analysis.
- **What I learned:**
I'll need to find out which graphing library Paige is using here; I'll most likely use it in my own project.
- Project 1: Document Clustering
- What I liked: I loved the idea behind the project, because I have the same problem myself. I liked the visuals provided by the scatter plots, because I think they lent themselves well to what the research was about. I also liked the explanations of the different kinds of clustering, as they provided good background.
- What could be improved: I think the data could be labelled and explained better. The charts were only helpful because of the way the surrounding data was presented; the student did not label them clearly.
- What I learned: I learned about hierarchical and k-means clustering, and also about another application of BeautifulSoup for a project. This student used BeautifulSoup to gather keywords in order to sort bookmarks.
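A rough sketch of k-means over TF-IDF features in scikit-learn, in the spirit of what I took away from the project; the bookmark texts and the number of clusters are placeholders of my own:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder "bookmark" texts, standing in for real scraped keywords.
docs = [
    "python pandas dataframe tutorial",
    "numpy array broadcasting guide",
    "chocolate chip cookie recipe",
    "easy banana bread recipe",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g. [0 0 1 1]: programming bookmarks vs. recipes
```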
- Project 2: Native and Non-Native English
- What I liked: This student provided a lot of background about the project and the methods used. It was very helpful to read this report as someone who is just starting to manipulate data. I liked seeing how simple things I had learned in Computational Linguistics, like bigrams, could be applied to a larger project.
- What could be improved: Compared to the first project I reviewed, these graphs were not as elegant and could've been better constructed. Some of the bar graphs didn't need multiple colors, because they only seemed to be comparing one set, not subsets or anything like that.
- What I learned: I learned that contraction use was higher among the non-native groups than the native groups, which was not a result I expected. I would've thought that contractions might not come as naturally to L2 speakers, since not every word can be contracted, but it is interesting to note that they do use them often.
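A toy sketch of how a contraction rate per group could be counted; the regex and the example sentences are mine, not the project's:

```python
import re

# Hypothetical pattern for common English contractions (don't, it's, we'll, I'm, ...).
CONTRACTION = re.compile(r"\b\w+'(?:s|re|ve|ll|d|t|m)\b", re.IGNORECASE)

def contraction_rate(text):
    tokens = text.split()
    return len(CONTRACTION.findall(text)) / len(tokens)

native = "I do not think that is going to work."
non_native = "I don't think it's going to work, we'll see."
print(contraction_rate(native), contraction_rate(non_native))
```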