From 08990e0352e677a753003c706bab5a2c5047277b Mon Sep 17 00:00:00 2001 From: Maximilian Krahn Date: Wed, 22 Jul 2020 21:51:51 +0200 Subject: [PATCH 1/6] Update README.md * corrected some spelling mistakes * added tokenize to the demo code snippeds *explanation of which functions take tokenize input Co-authored-by: Henri Froese --- README.md | 111 +++++++++++++++++++++++++++++++----------------------- 1 file changed, 63 insertions(+), 48 deletions(-) diff --git a/README.md b/README.md index 46132457..d7711c85 100644 --- a/README.md +++ b/README.md @@ -44,7 +44,7 @@

From zero to hero

-Texthero is a python toolkit to work with text-based dataset quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power of Pandas and is extensively documented. Texthero is modern and conceived for programmers of the 2020 decade with little knowledge if any in linguistic. +Texthero is a python toolkit to work with text-based dataset quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power of Pandas and is extensively documented. Texthero is modern and conceived for programmers of the 2020 decade with little knowledge if any in linguistic. You can think of Texthero as a tool to help you _understand_ and work with text-based dataset. Given a tabular dataset, it's easy to _grasp the main concept_. Instead, given a text dataset, it's harder to have quick insights into the underline data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the obtained vector space takes just a couple of lines. @@ -55,15 +55,15 @@ Texthero include tools for: * Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation. * Text visualization: vector space visualization, place localization on maps (wip). -Texthero is free, open-source and [well documented](https://texthero.org/docs) (and that's what we love most by the way!). +Texthero is free, open-source and [well documented](https://texthero.org/docs) (and that's what we love most by the way!). We hope you will find pleasure working with Texthero as we had during his development. -

Hablas español? क्या आप हिंदी बोलते हैं? 日本語が話せるのか?

+

¿Hablas español? क्या आप हिंदी बोलते हैं? 日本語が話せるのか?

Texthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things. -Now, the next main milestone is to provide *multilingual support* and for this big step, we need the help of all of you. ¿Hablas español? Sie sprechen Deutsch? 你会说中文? 日本語が話せるのか? Fala português? Parli Italiano? Вы говорите по-русски? If yes or you speak another language not mentioned here, then you might help us develop multilingual support! Even if you haven't contributed before or you just started with NLP, contact us or open a Github issue, there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP-developer! +Now, the next main milestone is to provide *multilingual support* and for this big step, we need the help of all of you. ¿Hablas español? Sprechen Sie Deutsch? 你会说中文? 日本語が話せるのか? Fala português? Parli Italiano? Вы говорите по-русски? If yes or you speak another language not mentioned here, then you might help us develop multilingual support! Even if you haven't contributed before or you just started with NLP, contact us or open a Github issue, there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP-developer! For improving the python toolkit and provide an even better experience, your aid and feedback are crucial. If you have any problem or suggestion please open a Github [issue](https://github.com/jbesomi/texthero/issues), we will be glad to support you and help you. @@ -92,7 +92,7 @@ pip install texthero

Getting started

-The best way to learn Texthero is through the Getting Started docs. +The best way to learn Texthero is through the Getting Started docs. In case you are an advanced python user, then `help(texthero)` should do the work. @@ -102,20 +102,21 @@ In case you are an advanced python user, then `help(texthero)` should do the wor ```python -import texthero as hero -import pandas as pd - -df = pd.read_csv( - "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" -) - -df['pca'] = ( - df['text'] - .pipe(hero.clean) - .pipe(hero.tfidf) - .pipe(hero.pca) -) -hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news") +>>> import texthero as hero +>>> import pandas as pd +>>> +>>> df = pd.read_csv( +... "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" +... ) +>>> +>>> df['pca'] = ( +... df['text'] +... .pipe(hero.clean) +... .pipe(hero.tokenize) +... .pipe(hero.tfidf) +... .pipe(hero.pca) +... ) +>>> hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news") ```

@@ -125,28 +126,29 @@ hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

2. Text preprocessing, TF-IDF, K-means and Visualization

```python -import texthero as hero -import pandas as pd - -df = pd.read_csv( - "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" -) - -df['tfidf'] = ( - df['text'] - .pipe(hero.clean) - .pipe(hero.tfidf) -) - -df['kmeans_labels'] = ( - df['tfidf'] - .pipe(hero.kmeans, n_clusters=5) - .astype(str) -) - -df['pca'] = df['tfidf'].pipe(hero.pca) - -hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news") +>>> import texthero as hero +>>> import pandas as pd +>>> +>>> df = pd.read_csv( +... "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" +... ) + +>>> df['tfidf'] = ( +... df['text'] +... .pipe(hero.clean) +... .pipe(hero.tokenize) +... .pipe(hero.tfidf) +... ) +>>> +>>> df['kmeans_labels'] = ( +... df['tfidf'] +... .pipe(hero.kmeans, n_clusters=5) +... .astype(str) +... ) +>>> +>>> df['pca'] = df['tfidf'].pipe(hero.pca) +>>> +>>> hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news") ```

@@ -180,7 +182,7 @@ Remove all types of brackets and their content. ```python >>> s = hero.remove_brackets(s) ->>> s +>>> s 0 This sèntencé needs to be cleaned! dtype: object ``` @@ -189,7 +191,7 @@ Remove diacritics. ```python >>> s = hero.remove_diacritics(s) ->>> s +>>> s 0 This sentence needs to be cleaned! dtype: object ``` @@ -198,7 +200,7 @@ Remove punctuation. ```python >>> s = hero.remove_punctuation(s) ->>> s +>>> s 0 This sentence needs to be cleaned dtype: object ``` @@ -207,7 +209,7 @@ Remove extra white-spaces. ```python >>> s = hero.remove_whitespace(s) ->>> s +>>> s 0 This sentence needs to be cleaned dtype: object ``` @@ -217,7 +219,16 @@ Sometimes we also want to get rid of stop-words. ```python >>> s = hero.remove_stopwords(s) >>> s -0 This sentence needs cleaned +0 This sentence needs cleaned +dtype: object +``` + +There is also the option to clean the text automatically by calling the "clean"-function instead of doing it step by step. +```python +>>> text = "This sèntencé (123 /) needs to [OK!] be cleaned! " +>>> s = pd.Series(text) +>>> hero.clean(s) +0 sentence needs cleaned dtype: object ``` @@ -243,9 +254,11 @@ Full documentation: [nlp](https://texthero.org/docs/api-nlp) **Scope:** map text data into vectors and do dimensionality reduction. Supported **representation** algorithms: -1. Term frequency (`count`) +1. Term frequency (`term_frequency`) 1. Term frequency-inverse document frequency (`tfidf`) +For the "representation" functions it is strongly recommended to tokenize the input series first with the `hero.tokenize(s)` function from the texthero library. + Supported **clustering** algorithms: 1. K-means (`kmeans`) 1. Density-Based Spatial Clustering of Applications with Noise (`dbscan`) @@ -295,7 +308,7 @@ The website will be soon moved from Docusaurus to Sphinx: read the [open issue t **Are you good at writing?** -Probably this is the most important piece missing now on Texthero: more tutorials and more "Getting Started" guide. 
+Probably this is the most important piece missing now on Texthero: more tutorials and more "Getting Started" guide. If you are good at writing you can help us! Why don't you start by [Adding a FAQ page to the website](https://github.com/jbesomi/texthero/issues/41) or explain how to [create a custom pipeline](https://github.com/jbesomi/texthero/issues/38)? Need help? We are there for you. @@ -314,6 +327,8 @@ If you have just other questions or inquiry drop me a line at jonathanbesomi__AT - [bobfang1992](https://github.com/bobfang1992) - [Ishan Arora](https://github.com/ishanarora04) - [Vidya P](https://github.com/vidyap-xgboost) +- [Henri Froese](https://github.com/henrifroese) +- [Maximilian Krahn](https://github.com/mk2510)

License

From 0f5c0c7c781b1c6e333e9d3784f60e743dc9bea0 Mon Sep 17 00:00:00 2001 From: Maximilian Krahn Date: Mon, 27 Jul 2020 17:42:43 +0200 Subject: [PATCH 2/6] removed the >>> removed the >>> where no output was shown Co-authored-by: Henri Froese --- README.md | 76 +++++++++++++++++++++++++++---------------------------- 1 file changed, 38 insertions(+), 38 deletions(-) diff --git a/README.md b/README.md index 2c13a276..ba5ce48e 100644 --- a/README.md +++ b/README.md @@ -102,21 +102,21 @@ In case you are an advanced python user, then `help(texthero)` should do the wor ```python ->>> import texthero as hero ->>> import pandas as pd ->>> ->>> df = pd.read_csv( -... "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" -... ) ->>> ->>> df['pca'] = ( -... df['text'] -... .pipe(hero.clean) -... .pipe(hero.tokenize) -... .pipe(hero.tfidf) -... .pipe(hero.pca) -... ) ->>> hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news") +import texthero as hero +import pandas as pd + +df = pd.read_csv( + "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" +) + +df['pca'] = ( + df['text'] + .pipe(hero.clean) + .pipe(hero.tokenize) + .pipe(hero.tfidf) + .pipe(hero.pca) +) +hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news") ```

@@ -126,29 +126,29 @@ In case you are an advanced python user, then `help(texthero)` should do the wor

2. Text preprocessing, TF-IDF, K-means and Visualization

```python ->>> import texthero as hero ->>> import pandas as pd ->>> ->>> df = pd.read_csv( -... "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" -... ) - ->>> df['tfidf'] = ( -... df['text'] -... .pipe(hero.clean) -... .pipe(hero.tokenize) -... .pipe(hero.tfidf) -... ) ->>> ->>> df['kmeans_labels'] = ( -... df['tfidf'] -... .pipe(hero.kmeans, n_clusters=5) -... .astype(str) -... ) ->>> ->>> df['pca'] = df['tfidf'].pipe(hero.pca) ->>> ->>> hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news") +import texthero as hero +import pandas as pd + +df = pd.read_csv( + "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" +) + +df['tfidf'] = ( + df['text'] + .pipe(hero.clean) + .pipe(hero.tokenize) + .pipe(hero.tfidf) +) + +df['kmeans_labels'] = ( + df['tfidf'] + .pipe(hero.kmeans, n_clusters=5) + .astype(str) +) + +df['pca'] = df['tfidf'].pipe(hero.pca) + +hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news") ```

From 34f8b35876aec950738694fbdbc8a951f155a4c8 Mon Sep 17 00:00:00 2001 From: Maximilian Krahn Date: Fri, 7 Aug 2020 19:50:48 +0200 Subject: [PATCH 3/6] added types to README.md --- README.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/README.md b/README.md index 6bdcdb45..4b3a9ece 100644 --- a/README.md +++ b/README.md @@ -254,6 +254,7 @@ Full documentation: [nlp](https://texthero.org/docs/api-nlp) **Scope:** map text data into vectors and do dimensionality reduction. Supported **representation** algorithms: +1. Term counts (`count`) 1. Term frequency (`term_frequency`) 1. Term frequency-inverse document frequency (`tfidf`) @@ -281,6 +282,18 @@ Supported functions: Full documentation: [visualization](https://texthero.org/docs/api-visualization) +

Series Typing

+ +What is series typing? Each category of functions, which are defined above, accepts a different type of pandas series. Those are explained here. Those types will in general improve the efficiency of the different algorithms as they are especially designed for those. + +For the first step you don't need to worry too much about them, because, if you use a typical texthero pipeline, the functions will have the right input and output types. The *typical* hero pipeline will: +- first "clean" the text with any preprocessing function, e. g. `clean`, +- then perform a NLP function to categorise the text, e. g. `tokenize` and `tfidf` +- analyse or display the series with the representation function or the clustering functions, e. g. `kmeans` +- do a dimensionality reduction to display them better in 2D or 3D, e. g. `pca` + +If your pipeline will differ, you might want to check out the tutorial on series types to understand which type of series you will get returned and which types will be accepted by the functions. +

FAQ

Why Texthero
From df5b79ae655de9ffecd8ea1b7531d28e2a8db740 Mon Sep 17 00:00:00 2001 From: Henri Froese Date: Sat, 22 Aug 2020 14:35:13 +0200 Subject: [PATCH 4/6] Update README (normalization; DocumentTermDF) --- README.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 4b3a9ece..efe58aca 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ Github license - +

@@ -114,6 +114,7 @@ df['pca'] = ( .pipe(hero.clean) .pipe(hero.tokenize) .pipe(hero.tfidf) + .pipe(hero.normalize) .pipe(hero.pca) ) hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news") @@ -133,20 +134,20 @@ df = pd.read_csv( "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" ) -df['tfidf'] = ( +df['tokenized'] = ( df['text'] .pipe(hero.clean) .pipe(hero.tokenize) - .pipe(hero.tfidf) ) df['kmeans_labels'] = ( - df['tfidf'] + df['tokenized'] + .pipe(hero.tfidf) + .pipe(hero.normalize) .pipe(hero.kmeans, n_clusters=5) - .astype(str) ) -df['pca'] = df['tfidf'].pipe(hero.pca) +df['pca'] = df['tokenized'].pipe(hero.tfidf).pipe(hero.normalize).pipe(hero.pca) hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news") ``` From 8afe9b4d4a866355f17762c7f1b02d9eae6fa321 Mon Sep 17 00:00:00 2001 From: Maximilian Krahn Date: Mon, 14 Sep 2020 17:42:28 +0200 Subject: [PATCH 5/6] updated README.md to incorporate the type changes --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index efe58aca..0c99444d 100644 --- a/README.md +++ b/README.md @@ -283,9 +283,9 @@ Supported functions: Full documentation: [visualization](https://texthero.org/docs/api-visualization) -

Series Typing

+

Pandas Typing

-What is series typing? Each category of functions, which are defined above, accepts a different type of pandas series. Those are explained here. Those types will in general improve the efficiency of the different algorithms as they are especially designed for those. +What is pandas typing? Each category of functions, which are defined above, accepts a different type of pandas series or a pandas dataframe. Those are explained here. These types will in general improve the efficiency of the different algorithms as they are especially designed for those. For the first step you don't need to worry too much about them, because, if you use a typical texthero pipeline, the functions will have the right input and output types. The *typical* hero pipeline will: - first "clean" the text with any preprocessing function, e. g. `clean`, @@ -293,7 +293,7 @@ For the first step you don't need to worry too much about them, because, if you - analyse or display the series with the representation function or the clustering functions, e. g. `kmeans` - do a dimensionality reduction to display them better in 2D or 3D, e. g. `pca` -If your pipeline will differ, you might want to check out the tutorial on series types to understand which type of series you will get returned and which types will be accepted by the functions. +If your pipeline will differ, you might want to check out the tutorial on series types to understand which type of pandas objects you will get returned and which types will be accepted by the functions.

FAQ

From afde59e365c24f15536ce470b9e0a317b5113940 Mon Sep 17 00:00:00 2001 From: Maximilian Krahn Date: Mon, 14 Sep 2020 18:55:42 +0200 Subject: [PATCH 6/6] incorp suggested changes --- README.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index 0c99444d..9741a37d 100644 --- a/README.md +++ b/README.md @@ -48,12 +48,12 @@ Texthero is a python toolkit to work with text-based dataset quickly and effortl You can think of Texthero as a tool to help you _understand_ and work with text-based dataset. Given a tabular dataset, it's easy to _grasp the main concept_. Instead, given a text dataset, it's harder to have quick insights into the underline data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the obtained vector space takes just a couple of lines. -Texthero include tools for: -* Preprocess text data: it offers both out-of-the-box solutions but it's also flexible for custom-solutions. +Texthero includes tools for: + +* Preprocessing text data: clean and tokenize texts with support for both out-of-the-box and custom solutions. * Natural Language Processing: keyphrases and keywords extraction, and named entity recognition. -* Text representation: TF-IDF, term frequency, and custom word-embeddings (wip) -* Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation. -* Text visualization: vector space visualization, place localization on maps (wip). +* Text Representation: vectorization with TF-IDF / Term Frequency / Count, and custom word embeddings (wip) +* Vector space analysis: dimensionality reduction (PCA, NMF, t-SNE) and clustering Texthero is free, open-source and [well documented](https://texthero.org/docs) (and that's what we love most by the way!). @@ -287,13 +287,13 @@ Full documentation: [visualization](https://texthero.org/docs/api-visualization) What is pandas typing? 
Each category of functions, which are defined above, accepts a different type of pandas series or a pandas dataframe. Those are explained here. These types will in general improve the efficiency of the different algorithms as they are especially designed for those. -For the first step you don't need to worry too much about them, because, if you use a typical texthero pipeline, the functions will have the right input and output types. The *typical* hero pipeline will: -- first "clean" the text with any preprocessing function, e. g. `clean`, -- then perform a NLP function to categorise the text, e. g. `tokenize` and `tfidf` -- analyse or display the series with the representation function or the clustering functions, e. g. `kmeans` -- do a dimensionality reduction to display them better in 2D or 3D, e. g. `pca` +You usually don't need to worry about them, because, if you use a typical texthero pipeline, the different types are seamlessly integrated across all modules. The typical hero pipeline will: +* first clean the text with preprocessing functions, e.g. `clean`, and tokenize it with `tokenize`, +* then use `tfidf`, `term_frequency`, `count`, or other embeddings (wip) to vectorize the text, +* then do dimensionality reduction with `pca`, `tsne` or `nmf` to reduce noise and showcase the differences between the vectors, +* and finally use clustering to find topics, and visualize the dataset. -If your pipeline will differ, you might want to check out the tutorial on series types to understand which type of pandas objects you will get returned and which types will be accepted by the functions. +If your pipeline differs, you might want to check out the tutorial on series types to understand which type of pandas objects you will get returned and which types will be accepted by the functions.
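To make the order of the typical pipeline more concrete, here is a minimal sketch of the first stages (clean, tokenize, TF-IDF, normalize) in plain Python. It does not use Texthero and is not how Texthero implements these steps: the helper functions are hypothetical stand-ins that only borrow the names from the README, and the TF-IDF formula is the simple unsmoothed `tf * log(N / df)`, which is an assumption made for illustration.

```python
import math
import re
from collections import Counter


def clean(text):
    """Lowercase and replace everything except letters, digits and spaces."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower())


def tokenize(text):
    """Split on whitespace; the vectorizer below expects lists of tokens."""
    return text.split()


def tfidf(docs):
    """Map each token list to a {term: tf * log(N / df)} dict (unsmoothed)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def normalize(vectors):
    """Scale each vector to unit L2 length; all-zero vectors stay zero."""
    result = []
    for vec in vectors:
        norm = math.sqrt(sum(w * w for w in vec.values()))
        result.append({t: (w / norm if norm else 0.0) for t, w in vec.items()})
    return result


# Illustrative input, not taken from the BBC Sport dataset.
texts = ["Football: late goal wins the cup!", "Cricket: rain stops the match."]
tokens = [tokenize(clean(t)) for t in texts]
vectors = normalize(tfidf(tokens))
```

Each stage consumes exactly what the previous one returns (strings, then token lists, then weighted vectors), which is the idea behind the typing described above. In the README examples the same order is written as chained `.pipe(...)` calls on a pandas Series.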

FAQ