GitHub - codebuzzer01/Real-Time-Abuse-Detection

Tokenizer for Hindi

This package implements a tokenizer and a stemmer for the Hindi language.

To import the package:

from HindiTokenizer import Tokenizer

This package implements the following functions:

read_from_file
generate_sentences
tokenize
generate_freq_dict
generate_stem_word
generate_stem_dict
remove_stopwords
clean_text
print_sentences
print_tokens
print_freq_dict
print_stem_dict
len_text
sentence_count
tokens_count
concordance

The Tokenizer can be created in two ways, either directly from a string:

t=Tokenizer("यह वाक्य हिन्दी में है।")

Or from a file:

t=Tokenizer()
t.read_from_file('filename_here')

A brief description of each function

read_from_file

This function takes the name of a file in the current directory and reads its contents.

t.read_from_file('hindi_file.txt')
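Hindi text files are commonly stored as UTF-8, so the encoding matters when loading them. The following is a standalone sketch, not the package's own code: `read_hindi_file` is a hypothetical helper that shows one way to read such a file with the standard library.

```python
from pathlib import Path


def read_hindi_file(filename):
    """Hypothetical helper: read a UTF-8 encoded text file into a string."""
    # Passing the encoding explicitly avoids relying on the platform
    # default, which is not always UTF-8.
    return Path(filename).read_text(encoding="utf-8")
```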

generate_sentences

Given a text, this will generate a list of sentences.

t.generate_sentences()
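How the package splits sentences is not described here. As a rough illustration, Hindi sentences usually end with the danda (।), so a minimal splitter could look like the sketch below. `split_sentences` is a hypothetical name and this is not the package's implementation.

```python
import re


def split_sentences(text):
    """Hypothetical helper: split Hindi text into a list of sentences."""
    # Split on the danda (।), question mark, or exclamation mark,
    # then strip whitespace and drop empty pieces.
    parts = re.split(r"[।?!]", text)
    return [p.strip() for p in parts if p.strip()]


sentences = split_sentences("यह वाक्य हिन्दी में है। यह दूसरा वाक्य है।")
```

With this input, `sentences` holds two entries, one per danda-terminated sentence.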

print_sentences

This will print the sentences generated by generate_sentences.

t.generate_sentences()
t.print_sentences()

tokenize

This will generate a list of tokens from the given text.

t.tokenize()
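For readers curious what tokenizing Hindi text involves, here is a minimal standalone sketch. It is an assumption about the general technique, not the package's code, and `simple_tokenize` is a hypothetical name.

```python
import re


def simple_tokenize(text):
    """Hypothetical helper: split text into word tokens."""
    # Treat whitespace and common punctuation (including the danda) as
    # separators, so vowel signs stay attached to their words.
    return [tok for tok in re.split(r"[\s।,.?!;:\-]+", text) if tok]


tokens = simple_tokenize("यह वाक्य हिन्दी में है।")
```

Splitting on separators rather than matching `\w+` is deliberate: in Python's `re` module, `\w` does not match Devanagari vowel signs such as `ा`, so it would break words apart.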

print_tokens

This will print the tokens generated by tokenize.

t.tokenize()
t.print_tokens()

generate_freq_dict

This will generate a dictionary of word frequencies and return it.

freq_dict=t.generate_freq_dict()
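The exact shape of the returned dictionary is not documented here. Assuming it maps each word to its number of occurrences, an equivalent standalone computation can be sketched with `collections.Counter`. `word_frequencies` is a hypothetical name.

```python
from collections import Counter


def word_frequencies(tokens):
    """Hypothetical helper: map each token to the number of times it occurs."""
    return dict(Counter(tokens))


freq = word_frequencies(["यह", "वाक्य", "है", "यह", "भी", "है"])
```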

print_freq_dict

This will print the word-frequency dictionary generated by generate_freq_dict.

freq_dict=t.generate_freq_dict()
t.print_freq_dict(freq_dict)

generate_stem_word

Given a word, this will generate its stem word.

word=t.generate_stem_word("भारतीय")
print(word)
# Output: भारत
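The stemming rules the package uses are not listed here. One common lightweight approach is suffix stripping, sketched below with a deliberately tiny, illustrative suffix list. Both `SUFFIXES` and `strip_suffix` are hypothetical and do not reflect the package's actual rules.

```python
# Illustrative suffixes only; a real Hindi stemmer needs a much larger,
# linguistically motivated list. Longer suffixes come first so they win.
SUFFIXES = ["ियों", "ीय", "ों", "ें", "ी", "े", "ा"]


def strip_suffix(word, min_stem_len=2):
    """Hypothetical helper: remove the first matching suffix from a word."""
    for suffix in SUFFIXES:
        # Keep at least min_stem_len code points so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


stem = strip_suffix("भारतीय")
```

Here the suffix `ीय` is removed, which reproduces the example above. Words with no listed suffix are returned unchanged.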

generate_stem_dict

This will return the dictionary of stemmed words.

stem_dict=t.generate_stem_dict()

print_stem_dict

This will print the dictionary of stemmed words generated by generate_stem_dict.

stem_dict=t.generate_stem_dict()
t.print_stem_dict(stem_dict)

remove_stopwords

This will remove all the stopwords from the given text.

t.remove_stopwords()
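Stopword removal generally means filtering tokens against a fixed word list. The sketch below uses a small made-up list for illustration. The package's actual stopword list is not shown here, and `STOPWORDS` and `drop_stopwords` are hypothetical names.

```python
# Tiny illustrative stopword set, not the package's list.
STOPWORDS = {"यह", "में", "है", "और", "का", "की", "के"}


def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Hypothetical helper: keep only the tokens that are not stopwords."""
    return [tok for tok in tokens if tok not in stopwords]


kept = drop_stopwords(["यह", "वाक्य", "हिन्दी", "में", "है"])
```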

clean_text

This will remove all the punctuation symbols occurring in the given text.

t.clean_text()
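Which symbols the package treats as punctuation is not specified here. A minimal standalone sketch, assuming ASCII punctuation plus the danda and double danda, could use `str.translate`. `PUNCTUATION` and `strip_punctuation` are hypothetical names.

```python
import string

# ASCII punctuation plus the danda (।) and double danda (॥).
PUNCTUATION = string.punctuation + "।॥"
_REMOVE_TABLE = str.maketrans("", "", PUNCTUATION)


def strip_punctuation(text):
    """Hypothetical helper: delete every punctuation character from text."""
    return text.translate(_REMOVE_TABLE)


cleaned = strip_punctuation("यह वाक्य, हिन्दी में है।")
```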

len_text

Given a text, this will return the length of it.

print(t.len_text())
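The unit of this length is not stated here. If it is measured in characters, keep in mind that Python's built-in `len` counts Unicode code points, so consonants, vowel signs, and the virama each count separately. The standalone snippet below illustrates this and makes no claim about what `len_text` itself returns.

```python
word = "हिन्दी"

# len() counts code points: consonants, vowel signs, and the virama
# are each counted on their own.
code_points = len(word)

# Each Devanagari code point takes 3 bytes in UTF-8.
utf8_bytes = len(word.encode("utf-8"))
```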

sentence_count

Given a text, this will return the number of sentences in it.

print(t.sentence_count())

tokens_count

Given a text, this will return the number of tokens in it.

print(t.tokens_count())

concordance

Given a text and a word, this will find all the sentences in which that word occurs.

sentences=t.concordance("हिन्दी")
t.print_sentences(sentences)
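Conceptually, a concordance lookup filters the sentence list for a given word. The sketch below shows that idea with whole-token matching on whitespace. `find_sentences` is a hypothetical name, and the package may match words differently.

```python
def find_sentences(sentences, word):
    """Hypothetical helper: return the sentences containing word as a token."""
    # Compare against whitespace-separated tokens so that a word does not
    # match merely because it is a substring of a longer word.
    return [s for s in sentences if word in s.split()]


sentences = ["यह वाक्य हिन्दी में है", "यह दूसरा वाक्य है"]
matches = find_sentences(sentences, "हिन्दी")
```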

