Implement hero.describe(s) #166

Closed
mk2510 opened this issue Aug 26, 2020 · 4 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@mk2510
Collaborator

mk2510 commented Aug 26, 2020

In pandas, there is a s.describe() function that provides some information / statistics about a series or dataframe. Example:

>>> import pandas as pd
>>> s = pd.read_csv("https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/bbcsport.csv")["text"]
>>> s.describe()
count                                                   737
unique                                                  727
top       India's top six secure - Ganguly\n\nCaptain So...
freq                                                      2
Name: text, dtype: object

We can see that it's not all that useful for text data.

TODO

Implement a function hero.describe(s), where s is a TextSeries (a Series in which every cell is a string, like s in the example above). The output might include:

  • number of documents (= count from above = len(s))
  • number of unique documents (= unique from above = len(s.unique()))
  • number of empty / missing (NaN) documents (= (~hero.has_content(s)).values.sum())
  • most common single words (= s.pipe(hero.tokenize).pipe(hero.top_words)[:10] for the 10 most common words)
  • average length of documents, in tokens (= s.pipe(hero.tokenize).map(lambda x: len(x)).mean())
  • length of the shortest document (= s.pipe(hero.tokenize).map(lambda x: len(x)).min())
  • length of the longest document (= s.pipe(hero.tokenize).map(lambda x: len(x)).max())

and should be a nice-looking Series like above.
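
To make the proposed statistics concrete, here is a minimal, dependency-free sketch of how they could be computed. It is not texthero's implementation: `describe_texts` is a hypothetical helper, a plain whitespace split stands in for `hero.tokenize`, and, unlike `len(s.unique())`, missing entries are not counted as a unique document. The field names in the returned dict are illustrative as well.

```python
from collections import Counter


def describe_texts(docs, top_n=10):
    """Summarize a list of documents (hypothetical helper, not texthero's API)."""

    # Assumption: None, NaN and whitespace-only strings count as "no content".
    def has_content(doc):
        return isinstance(doc, str) and doc.strip() != ""

    present = [d for d in docs if has_content(d)]

    # Assumption: whitespace splitting as a stand-in for hero.tokenize.
    tokenized = [d.split() for d in present]
    lengths = [len(tokens) for tokens in tokenized]
    word_counts = Counter(word for tokens in tokenized for word in tokens)

    return {
        "number of documents": len(docs),
        "number of unique documents": len(set(present)),
        "number of missing documents": len(docs) - len(present),
        "most common words": [w for w, _ in word_counts.most_common(top_n)],
        "average document length": (
            sum(lengths) / len(lengths) if lengths else float("nan")
        ),
        "length of shortest document": min(lengths, default=0),
        "length of longest document": max(lengths, default=0),
    }


docs = ["the cat sat", "the dog ran off", "the cat sat", "", None]
summary = describe_texts(docs)
for name, value in summary.items():
    print(f"{name:<30}{value}")
```

Passing the returned dict to `pd.Series` would give output shaped like the `s.describe()` example above.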

We believe this is a great way for our users to get first insights into their text datasets.

@mk2510 added the enhancement, help wanted, and good first issue labels Aug 26, 2020
@mk2510 changed the title from "Implement hero.info(s: TextSeries)" to "Implement hero.info(s)" Aug 26, 2020
@mk2510 changed the title from "Implement hero.info(s)" to "Implement hero.describe(s)" Aug 26, 2020
@henrifroese
Collaborator

henrifroese commented Aug 26, 2020

Will start work on this now 🍺

@mk2510
Collaborator Author

mk2510 commented Aug 26, 2020

Great 🍻

@k0pernicus

Solved in #168, if I'm not mistaken. I think we can close this issue :)

@jbesomi
Owner

jbesomi commented Dec 4, 2020

Thanks, Antoin! Keep up the good work!! 👍

@jbesomi jbesomi closed this as completed Dec 4, 2020