Java heap space Error : while importing large data #129

Open
karthikasathishkumar opened this issue Mar 29, 2018 · 1 comment

Comments

@karthikasathishkumar

I changed MEMORY=1g to MEMORY=10g in the bin/mallet shell script and ran the import command on 5 GB of input, on 64-bit Ubuntu 14 with 16 GB of RAM.
MALLET still fails with the error below. How can I overcome it?
Please suggest a better way to import the data (total input size: 5 GB).

java.lang.OutOfMemoryError: Java heap space
	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
	at java.lang.StringBuffer.<init>(StringBuffer.java:128)
	at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:94)
	at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:83)
	at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:47)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:295)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
	at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
	at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
	at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:322) 
@mimno
Owner

mimno commented Apr 25, 2018

You might be able to use the "bulk load" feature. It has fewer options, but may be more efficient.

$ bin/mallet bulk-load --help
Efficient tool for importing large amounts of text into Mallet format
--help TRUE|FALSE
Print this command line option usage information. Give argument of TRUE for longer documentation
Default is false
--prefix-code 'JAVA CODE'
Java code you want run before any other interpreted code. Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's when creating objects.
Default is null
--config FILE
Read command option values from a file
Default is null
--input FILE
The file containing data, one instance per line
Default is null
--output FILE
Write the instance list to this file
Default is mallet.data
--preserve-case [TRUE|FALSE]
If true, do not force all strings to lowercase.
Default is false
--remove-stopwords [TRUE|FALSE]
If true, remove common "stop words" from the text.
This option invokes a minimal English stoplist.
Default is false
--stoplist FILE
Read newline-separated words from this file,
and remove them from text. This option overrides
the default English stoplist triggered by --remove-stopwords.
Default is null
--keep-sequence [TRUE|FALSE]
If true, final data will be a FeatureSequence rather than a FeatureVector.
Default is false
--line-regex REGEX
Regular expression containing regex-groups for label, name and data.
Default is ^([^\t]*)\t([^\t]*)\t(.*)
--name INTEGER
The index of the group containing the instance name.
Use 0 to indicate that this field is not used.
Default is 1
--label INTEGER
The index of the group containing the label string.
Use 0 to indicate that this field is not used.
Default is 2
--data INTEGER
The index of the group containing the data.
Default is 3
--prune-count N
Reduce features to those that occur more than N times.
Default is 0
--prune-doc-frequency N
Remove features that occur in more than (X*100)% of documents. 0.05 is equivalent to IDF of 3.0.
Default is 1.0
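
Since bulk-load reads one instance per line (see --input above), and the stack trace goes through Text2Vectors, which suggests the data was imported as a directory of files, the files would first need flattening into a single tab-separated file. A rough shell sketch, assuming the documents are .txt files in one directory. flatten_dir and check_tsv are hypothetical helper names, not part of MALLET, and the "none" label is only a placeholder:

```shell
# Hypothetical helper: write one "name<TAB>label<TAB>text" line per .txt file.
# Usage: flatten_dir INPUT_DIR OUTPUT_FILE [LABEL]
flatten_dir() {
  in_dir=$1
  out_file=$2
  label=${3:-none}   # placeholder label; an assumption, not a MALLET default
  : > "$out_file"    # start with an empty output file
  for f in "$in_dir"/*.txt; do
    [ -f "$f" ] || continue
    # Instance name = file name without the .txt extension.
    printf '%s\t%s\t' "$(basename "$f" .txt)" "$label" >> "$out_file"
    # Stream the file, turning tabs and line breaks into spaces so the
    # whole document stays on a single line.
    tr '\t\n\r' '   ' < "$f" >> "$out_file"
    printf '\n' >> "$out_file"
  done
}

# Hypothetical helper: print the numbers of lines that have fewer than
# three tab-separated fields (name, label, data).
check_tsv() {
  awk -F'\t' 'NF < 3 { print NR }' "$1"
}
```

After something like flatten_dir docs corpus.tsv, an import along these lines should work with the options listed above (not tested on a 5 GB corpus):

$ bin/mallet bulk-load --input corpus.tsv --output corpus.mallet --keep-sequence

If the labels are not needed, --label 0 marks that field as unused, per the help text.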
