Visualize hero.describe with plots #178

Status: Open
Wants to merge 25 commits into base: master
Showing changes from 23 of the 25 commits

Commits
fa342a9
added MultiIndex DF support
mk2510 Aug 18, 2020
59a9f8c
beginning with tests
henrifroese Aug 19, 2020
19c52de
implemented correct sparse support
mk2510 Aug 19, 2020
66e566c
Merge branch 'master_upstream' into change_representation_to_multicolumn
mk2510 Aug 21, 2020
41f55a8
added back list() and rm .tolist()
mk2510 Aug 21, 2020
217611a
rm .tolist() and added list()
mk2510 Aug 21, 2020
6a3b56d
Adopted the test to the new dataframes
mk2510 Aug 21, 2020
b8ff561
wrong format
mk2510 Aug 21, 2020
e3af2f9
Address most review comments.
henrifroese Aug 21, 2020
77ad80e
Add more unittests for representation
henrifroese Aug 21, 2020
f7eb7c3
- Update _types.py with DocumentTermDF
henrifroese Aug 22, 2020
4937a4f
Fix DocumentTermDF example DataFrame column names
henrifroese Aug 22, 2020
5fc720c
Implement hero.describe
henrifroese Aug 26, 2020
55dcd7f
Change hero.describe to return DataFrame for pretty-printing in Noteb…
henrifroese Aug 26, 2020
f3bbc08
Auto stash before merge of "hero_describe_function" and "origin/hero_…
mk2510 Aug 26, 2020
9e72c85
Add tests for hero.describe
mk2510 Aug 26, 2020
5aaa579
added right black version
mk2510 Sep 6, 2020
4d398a0
added test and formatting
mk2510 Sep 6, 2020
aa3aa56
added correct order
mk2510 Sep 6, 2020
ea5c640
added test and formatting:
mk2510 Sep 6, 2020
d72128f
added correct order
mk2510 Sep 6, 2020
f6b2fbf
Merge remote-tracking branch 'origin/visualize_describe_with_plots' i…
mk2510 Sep 6, 2020
8cd4a1b
added format
mk2510 Sep 6, 2020
4cb1058
Incorporate suggested changes.
henrifroese Sep 9, 2020
5c774d1
Merge branch 'master_upstream' into visualize_describe_with_plots
mk2510 Sep 22, 2020
2 changes: 1 addition & 1 deletion .travis.yml
@@ -20,7 +20,7 @@ jobs:
       env: PATH=/c/Python38:/c/Python38/Scripts:$PATH
 install:
   - pip3 install --upgrade pip  # all three OSes agree about 'pip3'
-  - pip3 install black
+  - pip3 install black==19.10b0
   - pip3 install ".[dev]" .
 # 'python' points to Python 2.7 on macOS but points to Python 3.8 on Linux and Windows
 # 'python3' is a 'command not found' error on Windows but 'py' works on Windows only
2 changes: 1 addition & 1 deletion setup.cfg
@@ -41,7 +41,7 @@ install_requires =
 # TODO pick the correct version.
 [options.extras_require]
 dev =
-    black>=19.10b0
+    black==19.10b0
     pytest>=4.0.0
     Sphinx>=3.0.3
     sphinx-markdown-builder>=0.5.4
18 changes: 3 additions & 15 deletions tests/test_indexes.py
@@ -56,21 +56,9 @@
 ]

 test_cases_representation = [
-    [
-        "count",
-        lambda x: representation.flatten(representation.count(x)),
-        (s_tokenized_lists,),
-    ],
-    [
-        "term_frequency",
-        lambda x: representation.flatten(representation.term_frequency(x)),
-        (s_tokenized_lists,),
-    ],
-    [
-        "tfidf",
-        lambda x: representation.flatten(representation.tfidf(x)),
-        (s_tokenized_lists,),
-    ],
+    ["count", representation.count, (s_tokenized_lists,),],
+    ["term_frequency", representation.term_frequency, (s_tokenized_lists,),],
+    ["tfidf", representation.tfidf, (s_tokenized_lists,),],
     ["pca", representation.pca, (s_numeric_lists, 0)],
     ["nmf", representation.nmf, (s_numeric_lists,)],
     ["tsne", representation.tsne, (s_numeric_lists,)],
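Each entry in `test_cases_representation` has the shape `[name, function, args]`. The sketch below shows one way such a table could drive an index-preservation check. It is only an illustration under stated assumptions: `to_upper`, `token_count`, `s_text`, and the test class are hypothetical stand-ins, not texthero code, and the real `tests/test_indexes.py` may consume the table differently.

```python
import unittest

import pandas as pd


# Hypothetical stand-ins for library functions (not part of texthero).
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()


def token_count(s: pd.Series) -> pd.Series:
    return s.str.split().map(len)


# A Series with a non-default index, so index loss would be visible.
s_text = pd.Series(["a b", "c d e"], index=[5, 7])

# Same [name, function, args] layout as the table in the diff.
test_cases = [
    ["to_upper", to_upper, (s_text,)],
    ["token_count", token_count, (s_text,)],
]


class TestIndexPreserved(unittest.TestCase):
    def test_index_preserved(self):
        for name, func, args in test_cases:
            # subTest reports each table entry separately on failure.
            with self.subTest(name=name):
                result = func(*args)
                # The first positional argument is the input Series.
                pd.testing.assert_index_equal(result.index, args[0].index)
```

With this layout, passing the function object directly (as the new lines do) instead of wrapping it in a lambda keeps each entry on one line, and extra positional arguments such as the `0` in the `pca` entry are forwarded through `*args`.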
57 changes: 57 additions & 0 deletions tests/test_preprocessing.py
@@ -381,3 +381,60 @@ def test_remove_hashtags(self):
         s_true = pd.Series("Hi , we will remove you")

         self.assertEqual(preprocessing.remove_hashtags(s), s_true)
+
+    """
+    Test describe DataFrame
+    """
+
+    def test_describe(self):
+        df = pd.DataFrame(
+            [
+                ["here here here here go", "sport"],
+                ["There There There", "sport"],
+                ["Test, Test, Test, Test, Test, Test, Test, Test", "sport"],
+                [np.nan, "music"],
+                ["super super", pd.NA],
+                [pd.NA, pd.NA],
+                ["great great great great great", "music"],
+            ],
+            columns=["text", "topics"],
+        )
+        df_description = preprocessing.describe(df["text"], df["topics"])
+        df_true = pd.DataFrame(
+            [
+                7,
+                7,
+                2,
+                ["Test", "great", "here", "There", "super", "go"],
+                ["test", "great", "super", "go"],
+                6.0,
+                2.0,
+                15.0,
+                5.196152422706632,
+                3.0,
+                5.0,
+                5.0,
+                0.6,
+                0.4,
+            ],
+            columns=["Value"],
+            index=pd.MultiIndex.from_tuples(
+                [
+                    ("number of documents", ""),
+                    ("number of unique documents", ""),
+                    ("number of missing documents", ""),
+                    ("most common words", ""),
+                    ("most common words excluding stopwords", ""),
+                    ("average document length", ""),
+                    ("length of shortest document", ""),
+                    ("length of longest document", ""),
+                    ("standard deviation of document lengths", ""),
+                    ("25th percentile document lengths", ""),
+                    ("50th percentile document lengths", ""),
+                    ("75th percentile document lengths", ""),
+                    ("label distribution", "sport"),
+                    ("label distribution", "music"),
+                ]
+            ),
+        )
+        pd.testing.assert_frame_equal(df_description, df_true, check_less_precise=True)
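The numeric expectations in `df_true` can be cross-checked by hand. The sketch below reproduces the length statistics and the label distribution with plain pandas. It does not call `preprocessing.describe`; `simple_tokenize` is a hypothetical stand-in that assumes punctuation marks count as separate tokens, which is what the expected longest length of 15 (8 words plus 7 commas) suggests.

```python
import re

import pandas as pd

docs = pd.Series(
    [
        "here here here here go",
        "There There There",
        "Test, Test, Test, Test, Test, Test, Test, Test",
        None,
        "super super",
        None,
        "great great great great great",
    ]
)
topics = pd.Series(["sport", "sport", "sport", "music", None, None, "music"])


def simple_tokenize(text):
    # Hypothetical tokenizer (an assumption, not texthero's own):
    # runs of word characters and single punctuation marks are tokens.
    return re.findall(r"\w+|[^\w\s]", text)


# Token counts of the non-missing documents: 5, 3, 15, 2, 5.
lengths = docs.dropna().map(simple_tokenize).map(len)

# describe() yields mean, sample standard deviation (ddof=1),
# min, max, and the 25/50/75 percent quantiles in one call.
stats = lengths.describe()

# value_counts drops missing labels by default, so the shares are
# computed over the five labelled rows only.
label_dist = topics.value_counts(normalize=True)

print(stats)
print(label_dist)
```

Under that tokenization assumption the mean is 30 / 5 = 6.0 and the sample standard deviation is sqrt(108 / 4), about 5.196, matching the values in the test. The label shares are 3/5 and 2/5 because the two rows with a missing topic are excluded.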