Text classification of parsed, tokenized HTML. Visually displayed text from a webpage is extracted as navigable strings using Python's Beautiful Soup 4 package. Matrices containing ordered data extracted from each navigable string object were used as the classifier's inputs. Each piece of text in a webpage was represented by its HTML tag, the tags following and preceding it, its parents' tags, the length of the text, and the occurrence of regular expressions. Its corresponding class is assigned by the user when building the dataset. The input data were obtained from a website displaying COVID-19 testing centers near Montreal (QC, Canada). Prior to constructing the ordered matrix used to classify it, each navigable string was manually labeled as one of the following categories: nothing, the clinic's name, whether the clinic accepts walk-ins or is appointment based, opening hours, address, and the contact number or email.
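The extraction step described above can be sketched as follows. This is a minimal illustration using a hard-coded HTML fragment in place of the live page; the tags skipped as non-visible (`script`, `style`, `head`, `title`) are an assumption, not necessarily the filter used in the project.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded testing-centers page.
html = """
<html><head><title>Depistage</title></head><body>
  <h2>Clinique du Parc</h2>
  <p>Sans rendez-vous</p>
  <p>514-555-0100</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

def visible_strings(soup):
    """Yield the NavigableStrings that would actually be displayed."""
    for s in soup.find_all(string=True):
        # Assumed filter: skip strings inside non-visible containers.
        if s.parent.name in ("script", "style", "head", "title"):
            continue
        if s.strip():
            yield s  # keep the NavigableString so .parent, .parents, etc. remain usable

strings = list(visible_strings(soup))
for s in strings:
    print(s.parent.name, "->", s.strip())
```

Keeping the `NavigableString` objects (rather than plain `str`) is what makes the later feature extraction possible, since the parent and sibling tags are still reachable from each string.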
*Website specific: https://santemontreal.qc.ca/population/coronavirus-covid-19/depistage-covid-19-a-montreal/
machanosoup.py
Different classifiers were trained on the resulting data. A machine learning (LSTM-CNN) based approach was developed and compared with common classifiers as implemented in scikit-learn (Python package). All classifiers were trained on a pseudo-randomly selected subset of the manually labeled data. To assess each model's performance, the macro-average F1 score was computed on the remaining portion of the data not used for training (~33%).
The data representing the text being classified consisted of a matrix of numbers derived from the text's navigable string object: the number of characters it contained, the number of parent tags, the text's HTML tag, the ordered parent HTML tags, whether or not the parent tags had defined classes or ids, the tags located "next" and "previous", and whether or not certain regular expressions occurred in the text itself.
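The feature vector just described can be sketched like this. The tag vocabulary, the regular expressions, and the fixed number of parents encoded are all hypothetical choices for illustration; the real set used in machanosoup.py may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical tag vocabulary and regular expressions -- assumptions.
TAGS = ["p", "div", "h2", "a", "li", "span", "td", "body", "html"]
REGEXES = [
    re.compile(r"\d{3}[ -]\d{3}[ -]\d{4}"),  # phone number
    re.compile(r"\d{1,2}\s*h\s*\d{0,2}"),    # opening hours, e.g. "8 h 30"
    re.compile(r"[\w.]+@[\w.]+"),            # email address
]

def tag_id(name):
    """Integer encoding of a tag name; 0 for unknown tags."""
    return TAGS.index(name) + 1 if name in TAGS else 0

def featurize(s, n_parents=3):
    """Build one fixed-length numeric vector from a NavigableString."""
    text = str(s)
    parents = [p for p in s.parents if p.name != "[document]"][:n_parents]
    pad = [0] * (n_parents - len(parents))
    vec = [len(text), len(parents), tag_id(s.parent.name)]
    vec += [tag_id(p.name) for p in parents] + pad                       # ordered parent tags
    vec += [int(bool(p.get("class") or p.get("id"))) for p in parents] + pad  # class/id defined?
    prev_tag, next_tag = s.find_previous(True), s.find_next(True)        # neighbouring tags
    vec += [tag_id(prev_tag.name) if prev_tag is not None else 0,
            tag_id(next_tag.name) if next_tag is not None else 0]
    vec += [int(bool(r.search(text))) for r in REGEXES]                  # regex occurrences
    return vec

soup = BeautifulSoup("<body><h2>Clinic</h2><p>514-555-0100</p></body>", "html.parser")
v = featurize(soup.p.string)
print(v)
```

Padding the parent-tag slots to a fixed length keeps every string's vector the same size, which is what lets the vectors be stacked into the input matrix.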
Classifiers were repeatedly trained and assessed using a 0.33 validation/test split. The mean macro-average F1 scores were compared using a series of Mann-Whitney tests as implemented in SPSS.
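The repeated train/assess loop can be sketched as follows, with synthetic data standing in for the labeled matrices and an MLP standing in for whichever classifier is under test; only the 0.33 split and the macro-F1 metric come from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# Synthetic stand-ins for the labeled feature matrices (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 14))
y = rng.integers(0, 6, size=300)  # 6 classes: nothing, name, walk-in, hours, address, contact

scores = []
for seed in range(10):
    # Fresh pseudo-random 0.33 hold-out split on every repetition.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=seed)
    clf = MLPClassifier(max_iter=300, random_state=seed).fit(X_tr, y_tr)
    scores.append(f1_score(y_te, clf.predict(X_te), average="macro"))

print(f"mean macro-F1 over 10 runs: {np.mean(scores):.3f}")
```

Collecting one macro-F1 score per repetition is what produces the per-classifier samples that the statistical tests below operate on.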
All observations were independently collected, and the dependent variable (macro-average F1) is continuous. The Kruskal-Wallis test was chosen as a non-parametric alternative to an ANOVA because the dependent variable violated the assumption of normality, as determined by the Shapiro-Wilk (1965) test, and did not have homogeneous variance, as determined by Levene's test of equal variance (Olkin, 1960).
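The same assumption checks and omnibus test can be run with SciPy instead of SPSS; the per-classifier score samples below are fabricated for illustration only.

```python
import numpy as np
from scipy import stats

# Fabricated per-classifier F1 samples standing in for the repeated runs.
rng = np.random.default_rng(1)
scores = {
    "MLP": rng.beta(8, 2, size=20),
    "SVC": rng.beta(7, 3, size=20),
    "KNN": rng.beta(5, 5, size=20),
}

for name, s in scores.items():
    _, p = stats.shapiro(s)                    # Shapiro-Wilk normality test
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

_, levene_p = stats.levene(*scores.values())   # Levene's test of equal variance
_, kruskal_p = stats.kruskal(*scores.values()) # Kruskal-Wallis omnibus test
print(f"Levene p = {levene_p:.3f}, Kruskal-Wallis p = {kruskal_p:.3f}")
```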
Post hoc pairwise comparison following the Kruskal-Wallis test (a series of Mann-Whitney tests as implemented in SPSS; Bergmann et al., 2000) revealed significant differences, which are depicted in Figure 1.5.
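A comparable post hoc comparison can be sketched with SciPy: one Mann-Whitney U test per classifier pair. The Bonferroni adjustment of the significance threshold is an assumption here (SPSS applies its own adjustment to pairwise output), and the score samples are again fabricated.

```python
from itertools import combinations
import numpy as np
from scipy.stats import mannwhitneyu

# Fabricated per-classifier F1 samples (assumption).
rng = np.random.default_rng(2)
scores = {
    "MLP": rng.beta(9, 2, size=20),
    "SVC": rng.beta(8, 3, size=20),
    "LSTM-CNN": rng.beta(9, 2, size=20),
}

pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold (assumption)
results = {}
for a, b in pairs:
    u, p = mannwhitneyu(scores[a], scores[b])
    results[(a, b)] = p
    print(f"{a} vs {b}: U = {u:.1f}, p = {p:.4f}, significant = {p < alpha}")
```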
Figure 1.1: Benchmarking parsed-HTML classifiers based on macro-average F1. Classifiers were taken from the scikit-learn documentation, with the addition of a KNN. An LSTM-CNN model was developed using Tensorflow.
Figure 1.2: Assessing the relevance of the data by measuring the decrease in classifier performance after removing each input.
Figure 1.3: Assessing the relevance of the data by training classifiers on each input vector individually.
Figure 1.4: Execution speed of each classifier performing a prediction, based on 100 executions.
Figure 1.5: Pairwise comparison showing the rank of each classifier for the balanced dataset. When the imbalanced dataset (all class-0 null cases kept) was used, MLP ~ SVC ~ LSTM, with MLP performing best.
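The per-prediction timing reported in Figure 1.4 can be reproduced along these lines with `timeit`; the classifier, data shapes, and number of executions are placeholders chosen to mirror the figure's setup.

```python
import timeit
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data and classifier (assumption).
rng = np.random.default_rng(3)
X, y = rng.normal(size=(200, 14)), rng.integers(0, 6, size=200)
clf = KNeighborsClassifier().fit(X, y)
sample = X[:1]

# Total wall time for 100 single-sample predictions.
seconds = timeit.timeit(lambda: clf.predict(sample), number=100)
print(f"mean time per prediction: {seconds / 100 * 1e3:.3f} ms")
```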
The custom LSTM-CNN network investigated performed better than the other classifiers as implemented by SKLEARN, although its execution time was many orders of magnitude higher, as seen in Figure 1.4. Even when training data number in the hundreds of samples, performant classifiers can be trained using current machine learning techniques.
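A minimal sketch of an LSTM-CNN in TensorFlow/Keras is shown below. Only the general idea (a 1D convolution feeding an LSTM, ending in a softmax over the six classes) comes from the text; the layer sizes, kernel width, and exact arrangement are assumptions, not the architecture actually used.

```python
import tensorflow as tf

n_features, n_classes = 14, 6  # assumed vector length and the 6 label classes
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features, 1)),   # treat the vector as a short sequence
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training would then proceed with `model.fit(X_train, y_train, ...)` on the same split used for the scikit-learn classifiers, so that the macro-F1 comparison stays like-for-like.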
When a single input vector is used, the LSTM-CNN network performs much better on the "previous" and "next" data vectors.
For this type of task, rule-based systems are still standard practice for a variety of reasons (refer to rulebased.py). The methodology developed in machanosoup.py was implemented to demonstrate how machine learning models should be properly assessed (validation and training splits) and compared. It also lays the groundwork for creating valid, meaningful data representations of HTML parsed with beautifulsoup4. Using this approach to scrape non-domain-specific data will require concerted effort and is the long-term goal of this project.
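For contrast, a rule-based baseline in the spirit of rulebased.py can be as simple as hand-written regular expressions mapped directly to classes. The patterns below are illustrative assumptions, not the ones used in the project.

```python
import re

# Illustrative rules, checked in order; first match wins (assumption).
RULES = [
    ("contact", re.compile(r"\d{3}[ -]\d{3}[ -]\d{4}|[\w.]+@[\w.]+")),
    ("hours", re.compile(r"\d{1,2}\s*h(\s*\d{2})?\s*(à|-)\s*\d{1,2}\s*h", re.I)),
    ("walk-in/appointment", re.compile(r"sans rendez-vous|avec rendez-vous|walk-?in", re.I)),
    ("address", re.compile(r"\d+,?\s+(rue|boulevard|avenue|chemin)\b", re.I)),
]

def classify(text):
    """Return the first rule label whose pattern matches, else 'nothing'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "nothing"

print(classify("514-555-0100"))    # contact
print(classify("8 h à 16 h"))      # hours
print(classify("Sans rendez-vous"))  # walk-in/appointment
```

Such rules are transparent and fast but site-specific, which is exactly the limitation the learned representation aims to overcome.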
- Checking the effect of labeling error.
- Assessing the effect of hyperparameter tuning (tried for SVC but did not find a better model).
- Assessing the effect of early stopping.
- Comparing the most important features from various sites.
- Repeating the analysis for another website.
The MLP classifier performed best when the soup was not pruned. It remained similar to the LSTM-CNN and
Bergmann, R., J. Ludbrook, and W. P. J. M. Spooren. 2000. Different outcomes of the Wilcoxon-Mann-Whitney test from different statistics packages. The American Statistician 54(1):72-77.
Pan, W., et al. 2017. Optimizing the multiclass F-measure via biconcave programming. Proceedings - IEEE International Conference on Data Mining (ICDM), pp. 1101-1106. doi:10.1109/ICDM.2016.184.
Shapiro, S. S., and M. B. Wilk. 1965. An analysis of variance test for normality (complete samples). Biometrika 52(3-4):591-611.
Olkin, I. 1960. Contributions to probability and statistics; essays in honor of Harold Hotelling. Stanford studies in mathematics and statistics, 2. Stanford University Press, Stanford, Calif.