Implement hero.describe(s), Closes #166 #168

Open
wants to merge 26 commits into
base: master

henrifroese
Collaborator

@henrifroese henrifroese commented Aug 26, 2020

Straightforward implementation of the description in #166.

Example:

>>> import texthero as hero
>>> import pandas as pd
>>> df = pd.read_csv("https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/bbcsport.csv")
>>> hero.describe(df["text"], df["topic"])
                                                                                              Value
number of documents                                                                             737
number of unique documents                                                                      727
number of missing documents                                                                       0
most common words                                          [the, to, a, in, and, of, for, ", I, is]
most common words excluding stopwords             [said, first, england, game, one, year, two, w...
average document length                                                                     387.803
length of shortest document                                                                     119
length of longest document                                                                     1855
standard deviation of document lengths                                                      210.728
25th percentile document lengths                                                                241
50th percentile document lengths                                                                340
75th percentile document lengths                                                                494
label distribution                     football                                            0.359566
                                       rugby                                               0.199457
                                       cricket                                              0.16825
                                       athletics                                           0.137042
                                       tennis                                              0.135685

Screenshot of pretty-printed output from Google Colab:

Screenshot from 2020-08-26 15-41-24

Note: the diff touches this many lines only because this branch builds upon #157.
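
For readers who want a feel for what the statistics in the example output involve, here is a minimal, dependency-free sketch. It is not the code in this PR: `describe_sketch` is a hypothetical helper, it works on a plain list rather than a pandas Series, and it assumes that document length means the number of whitespace-separated tokens, that missing documents are `None`, and that missing documents still count toward the total. Stopword filtering and percentiles are left out.

```python
from collections import Counter
from statistics import mean, pstdev


def describe_sketch(documents, labels=None, top_n=3):
    """Hypothetical helper: summary statistics for a list of text documents.

    Not texthero's implementation. Assumes missing documents are None and
    that length is the number of whitespace-separated tokens.
    """
    # Drop missing documents before computing text statistics.
    present = [doc for doc in documents if doc is not None]
    # Assumption: document length = token count after whitespace splitting.
    lengths = [len(doc.split()) for doc in present]
    word_counts = Counter(word for doc in present for word in doc.split())

    summary = {
        # Assumption: the total includes missing documents.
        "number of documents": len(documents),
        "number of unique documents": len(set(present)),
        "number of missing documents": len(documents) - len(present),
        "most common words": [w for w, _ in word_counts.most_common(top_n)],
        "average document length": mean(lengths),
        "length of shortest document": min(lengths),
        "length of longest document": max(lengths),
        # Population standard deviation; a sample estimate would use stdev().
        "standard deviation of document lengths": pstdev(lengths),
    }
    if labels is not None:
        # Relative frequency of each label, analogous to "label distribution".
        label_counts = Counter(labels)
        summary["label distribution"] = {
            label: count / len(labels) for label, count in label_counts.items()
        }
    return summary


docs = ["the cat sat", "the dog ran fast", None, "the cat sat"]
stats = describe_sketch(docs, labels=["a", "b", "a", "a"])
for name, value in stats.items():
    print(f"{name}: {value}")
```

The sketch raises on an input with no non-missing documents, since `mean`, `min`, and `max` need at least one value; a real implementation would have to decide what to report in that case.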

mk2510 and others added 16 commits August 18, 2020 22:06
support MultiIndex as function parameter

returns MultiIndex, where Representation was returned

* missing: correct test


Co-authored-by: Henri Froese <[email protected]>
* missing: test adapting for new types


Co-authored-by: Henri Froese <[email protected]>
- add functionality for decorator @InputSeries to handle several allowed input types
- Add typing decorator/hints to representation.py
- add tests for _types DocumentTermDF

Co-authored-by: Maximilian Krahn <[email protected]>
Co-authored-by: Maximilian Krahm <[email protected]>
Co-authored-by: Henri Froese <[email protected]>
@mk2510
Collaborator

mk2510 commented Sep 22, 2020

This branch is now based on master and is ready for review and merging 🦸 🦸‍♂️ 🦸‍♀️

@mk2510 mk2510 marked this pull request as ready for review September 22, 2020 12:57
@jbesomi
Owner

jbesomi commented Apr 8, 2021

Amazing, will check soon and let you know! 🎉 🎉

Labels
enhancement New feature or request
3 participants