February 5, 2020
Added:
- README file that will contain a brief overview of the project
- .gitignore with some basics that should not be uploaded to GitHub
- LICENSE file that will eventually contain a license delineating terms of use
- project_plan that contains a general overview of the project, and will be fleshed out in the future
- and of course this progress_report file
February 25, 2020
The first thing that became necessary was changing up my whole plan for this project, since finding a corpus that fit my original idea of looking at a syntactic structure unique to spoken Arabic wasn't panning out. In the search for a corpus to use, I came across the Arabic Learner Corpus and have decided to use it for this project. Please see my Project Plan for more detailed information.
Aside from rethinking my project's scope, this progress report focused on the following:
Finding a corpus was about the only thing that went to plan at this stage. The data are available under a Creative Commons license, so I'm free to use them for non-commercial purposes as long as I give proper attribution to the original authors. The entire corpus is freely available for download from the ALC website in a number of formats (plain text, XML, etc.), so this step was straightforward.
Data organization was a process easier said than done, especially when working with XML and in a language other than English! BeautifulSoup gave me a lot of issues at first and wasn't reading the text in correctly. The hunch was that encoding was to blame, and a visit to Jevon's office hours (thanks Jevon!) helped get to the bottom of everything. At this point, I'm able to read in the entries; the next step is to get things into DataFrame objects in pandas and make something useful out of everything.
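A rough sketch of the fix: reading the files in binary mode lets the parser honor the XML encoding declaration rather than a locale default. The file name and tag name below are illustrative stand-ins, not the actual ALC schema.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

# Binary mode lets the XML parser respect the file's declared encoding --
# the likely culprit behind the garbled reads described above.
with open("alc_sample.xml", "rb") as f:   # placeholder file name
    soup = BeautifulSoup(f, "xml")

essay = soup.find("text")                 # illustrative tag name
if essay is not None:
    print(essay.get_text(strip=True)[:80])
```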
You can look at my test data here. Right now, I'm planning on leaving all of the original corpus files in their original forms and keeping them in a private folder for use. The plan is to have pandas DataFrame objects that contain the relevant tagged data from the original essays (learner L1, text, age, location, etc.) along with additional information like text length and TTR (type-token ratio) that I will compute myself.
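A minimal sketch of those computed columns, with toy records standing in for the parsed entries (the real field names may differ):

```python
import pandas as pd

# Toy records standing in for the parsed ALC entries; real fields differ.
records = [
    {"L1": "Urdu", "age": 21, "text": "the cat sat on the mat"},
    {"L1": "French", "age": 24, "text": "a dog and a cat and a bird"},
]
df = pd.DataFrame(records)

tokens = df["text"].str.split()               # naive whitespace tokenizer
df["text_length"] = tokens.str.len()          # token count per essay
df["TTR"] = tokens.apply(lambda t: len(set(t)) / len(t))  # type-token ratio
print(df[["L1", "text_length", "TTR"]])
```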
The corpus I'm using comes with a Creative Commons license already, so I plan on using the same one, similarly allowing anything that results from my work to be copied, redistributed, and adapted for non-commercial purposes, provided proper attribution is given. I'm committed to this license because 1) it's in keeping with the licensing of the original work, 2) it aligns with principles of open science and data sharing, and 3) it doesn't seem to pose any ethical issues with the identifiability of the learners who participated and had their data collected. For more specific information, visit my project's License page, which will be further fleshed out in a future update.
March 17, 2020
This progress report focused on the following:
Already done as of the last progress update; the main issue to work through this time was reading in, cleaning, and (re)organizing data.
The information that I'm interested in is now more or less in its final form in a pandas DataFrame for use, and that final form has been safely pickled and saved locally on my machine. Data were cleaned to remove any NaN values from the "Title" column of the DataFrame; otherwise, all of the columns had information present. Additionally, information like tokenized word lists and a TTR value for each sample was added to each entry.
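The cleaning and saving steps amount to something like the following sketch (the variable and pickle path are placeholders, not the notebook's actual names):

```python
import pandas as pd

# Stand-in frame; the real one holds 1,585 entries across 17 columns.
alc_df = pd.DataFrame({"Title": ["Essay A", None, "Essay C"],
                       "Text": ["...", "...", "..."]})

clean_df = alc_df.dropna(subset=["Title"])      # drop entries lacking a title
clean_df.to_pickle("alc_dataframe.pkl")         # persist locally

restored = pd.read_pickle("alc_dataframe.pkl")  # reload without re-parsing XML
print(restored.shape)
```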
Done; please see the ALC Data Organization notebook for this information. In a nutshell, I'm working with all of the 1,585 entries from the original corpus, with 17 columns of data for each entry.
4. Finalize the sharing scheme of the "found" portion of the data as well as the license for your own data and project.
I've elected to maintain the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license that the original corpus was released under. This allows the work to be shared and modified for non-commercial use, provided proper attribution is given to the original authors. It didn't feel right (or legal?) to use a different license here, since the original work did not allow for commercial usage of the data.
Unfortunately, GitHub did not allow all 1,585 XML files to be uploaded to my project's repository, but anyone interested in replicating the project can download the full corpus from the source (signup/login required). I'll remove the 'test_data' folder in favor of just keeping the 'data' folder on GitHub, since removing the 'data' one would mean changing filepaths again. (Any guidance on how to use git rm to remove a folder with 152 files?)
5. Start bringing in the analysis part into your project. In particular, your manipulation of data should be shaped by the linguistic analysis you are after.
To this end, I've created a new Notebook, ALC Data Analysis, in which I plan to continue my work with the data I've curated.
For this slew of project updates, I focused on finalizing the form of my data (adding an L1 family column to my DataFrame with some help from Ethnologue) and attempting a preliminary run of training an SVC classifier to identify the L1 family of the writer based on their text. Short story: it didn't go so hot in terms of accuracy, capping at around 65%. It seems like the imbalance of L1s represented in the data really is a huge stumbling block, and it makes me wonder if collapsing the categories further into "L1 Arabic" and "non-L1 Arabic" is the necessary conclusion to make something usable here. I'm comfortable with saying at this point that I'm just interested in seeing if these data are conducive to training a classifier, narrowing the focus of my initial broad research interest down from "can this dataset be useful for SLA?" toward "can this dataset be useful for ML?" However, I'd still like to investigate the distributions of the various subgroups in the data (e.g., TTR broken out by L1 family) and do some more work on that. I just feel a bit less comfortable with that task, since figuring out classification has dominated my work up until now. The next step is to refine the conclusions drawn from the classifier's performance and take a look around for meaningful patterns in the subgroups.
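For reference, the classification setup looks roughly like the sketch below. The column names ('Text', 'L1_family') and the pickle path are placeholders for whatever the notebook actually uses, and the last line previews the kind of subgroup comparison I have in mind:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

df = pd.read_pickle("alc_dataframe.pkl")        # placeholder path

X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["L1_family"], test_size=0.2, random_state=42)

model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2f}")

# Subgroup view: TTR distribution broken out by L1 family.
print(df.groupby("L1_family")["TTR"].describe())
```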
On a separate note, my license has been up and ready to go since the last update, and you can also now stop by and sign my Project Guestbook if it floats your boat!