Word segmentation is very slow in 1.3 #57

joyJZhang · 2018-02-09T10:47:41Z

With 1.2 it was fast, but 1.3 prints a lot of log output and is quite slow.
18:23:38.131 [main] INFO o.a.w.s.SegmentationFactory - 构造分词实现类：org.apdplat.word.segmentation.impl.MaxNgramScore
18:23:38.159 [main] INFO org.apdplat.word.util.WordConfTools - 开始加载配置文件
18:23:38.160 [main] INFO org.apdplat.word.util.WordConfTools - 加载配置文件：word.conf
18:23:38.164 [main] INFO org.apdplat.word.util.WordConfTools - 未找到配置文件：word.local.conf
18:23:38.165 [main] INFO org.apdplat.word.util.WordConfTools - 配置文件加载完毕，耗时5 毫秒，配置项数目：33
18:23:38.165 [main] INFO org.apdplat.word.util.WordConfTools - 配置信息：
18:23:38.263 [main] INFO org.apdplat.word.util.WordConfTools - 1、auto.detect=true
How should I deal with this?

YLongo · 2018-04-27T09:07:41Z

It is only slow the first time it loads. Alternatively, you can change the logging level.

sosojustdo · 2018-04-28T03:33:39Z

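@joyJZhang At project startup, loading the configuration files, running segmentation, and building the model are indeed slow, but once the project has started successfully, direct calls are fast.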

jankeyfu · 2018-05-13T14:25:58Z

How do I do the preloading?

sosojustdo · 2018-05-14T06:23:25Z

@fujiangkun Define a plain class, declare it as a bean in spring.xml, and specify an init-method. Do the preloading inside that init-method.
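To illustrate the idea, here is a minimal JDK-only sketch of a bean whose `init()` method pays the one-time loading cost up front. The class name `SegmenterWarmUp`, its tiny in-memory dictionary, and the whitespace-based `seg` are all hypothetical stand-ins, not the word library's API. In a real project Spring would call `init()` through `init-method`, and `seg` would delegate to the actual segmenter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical bean: resources load lazily on first use, and init() forces that first use. */
public class SegmenterWarmUp {

    private final AtomicInteger loadCount = new AtomicInteger();
    private volatile Set<String> dictionary; // stand-in for the expensive, lazily loaded resources

    /** Loads the dictionary exactly once (double-checked locking on a volatile field). */
    private Set<String> dictionary() {
        Set<String> d = dictionary;
        if (d == null) {
            synchronized (this) {
                if (dictionary == null) {
                    loadCount.incrementAndGet();
                    dictionary = new HashSet<>(Arrays.asList("word", "segmenter"));
                }
                d = dictionary;
            }
        }
        return d;
    }

    /** What the container would call once at startup: one throwaway call triggers the loading. */
    public void init() {
        seg("warm-up");
    }

    /** Stand-in segmentation: keeps whitespace-separated tokens found in the dictionary. */
    public List<String> seg(String text) {
        Set<String> dict = dictionary();
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (dict.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public int getLoadCount() {
        return loadCount.get();
    }

    public static void main(String[] args) {
        SegmenterWarmUp bean = new SegmenterWarmUp();
        bean.init(); // the loading happens here, not on the first real request
        System.out.println(bean.seg("word x segmenter"));
        System.out.println("loads: " + bean.getLoadCount());
    }
}
```

The point of the design is that application code keeps calling the same bean instance, so later `seg` calls reuse what `init()` already loaded.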

lixuanli · 2018-05-28T09:36:07Z

@sosojustdo Could you explain the exact steps? I call DictionaryFactory.reload() at project startup, but every later call to WordSegmenter.seg() still initializes the configuration.

jenopob · 2018-05-28T14:05:00Z

I am running into the same problem as the poster above. How did you solve it?

sosojustdo · 2018-05-30T09:28:34Z

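@lixuanli @jenopob Write a process (handler) class whose init method runs when the Spring container starts and initializes seg. Then, in your program, use the bean for that process class to call the seg-related methods.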
```java
package com.llb.cloud.nlp.process;

import java.io.IOException;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections.MapUtils;
import org.apache.commons.lang.StringUtils;
import org.apdplat.word.dictionary.DictionaryFactory;
import org.apdplat.word.recognition.StopWord;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;
import org.apdplat.word.segmentation.Word;
import org.apdplat.word.util.WordConfTools;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.llb.cloud.util.WordUtil;

/**
 * Description: word-frequency statistics processing
 * @Version 1.0 2017-03-31 12:17:24 PM, created by 代鹏 ([email protected])
 */
public class WordFrequencyStatisticsProcess {

    private static final Logger logger = LoggerFactory.getLogger(WordFrequencyStatisticsProcess.class);

    // Segmenter
    private final Segmentation segmentation = SegmentationFactory.getSegmentation(SegmentationAlgorithm.MaxNgramScore);
    // Word -> count
    private ConcurrentHashMap<String, AtomicInteger> countMap = new ConcurrentHashMap<String, AtomicInteger>();
    // Whether to exclude stop words from the statistics
    private boolean removeStopWord = false;
    // Words from the custom dictionary; only these are counted
    private static Set<String> myDicSet = new HashSet<String>();

    public void init() {
        try {
            // Load the custom dictionary
            myDicSet = WordUtil.readFileToSet("/nlpconfig/my_dic.txt");

            // Force these settings
            WordConfTools.set("dic.path", "classpath:nlpconfig/my_dic.txt,classpath:dic.txt");
            WordConfTools.set("stopwords.path", "classpath:nlpconfig/my_stopwords.txt,classpath:stopwords.txt");
            WordConfTools.set("ngram", "no");
            WordConfTools.set("person.name.recognize", "false");
            WordConfTools.set("recognition.tool.enabled", "false");
            DictionaryFactory.reload();

            // Run one segmentation at startup so the segmenter is initialized before real requests
            WordFrequencyStatisticsProcess wordProcess = new WordFrequencyStatisticsProcess();
            wordProcess.seg("初始化...");
        } catch (IOException e) {
            logger.error("init load my dic file data error:", e);
        }
    }

    public boolean isRemoveStopWord() {
        return removeStopWord;
    }

    public void setRemoveStopWord(boolean removeStopWord) {
        this.removeStopWord = removeStopWord;
    }

    public Map<String, AtomicInteger> getCountMap() {
        return countMap;
    }

    public void reSet() {
        countMap.clear();
    }

    /**
     * Description: segment the text and update the statistics
     * @Version 1.0 2017-03-31 1:34:58 PM, created by 代鹏 ([email protected])
     * @param text
     */
    public void seg(String text) {
        List<Word> words = segmentation.seg(text);
        if (CollectionUtils.isNotEmpty(words)) {
            for (Word word : words) {
                // Skip stop words and continue with the next word
                if (isRemoveStopWord() && StopWord.is(word.getText())) {
                    continue;
                }
                // Only count words that appear in the custom dictionary
                if (StringUtils.isNotBlank(word.getText()) && myDicSet.contains(word.getText())) {
                    statistics(word, 1, countMap);
                }
            }
        }
    }

    private void statistics(Word word, int times, ConcurrentHashMap<String, AtomicInteger> container) {
        statistics(word.getText(), times, container);
    }

    private void statistics(String word, int times, ConcurrentHashMap<String, AtomicInteger> container) {
        container.putIfAbsent(word, new AtomicInteger());
        container.get(word).addAndGet(times);
    }

    /**
     * Description: get the statistics for all words, ordered by descending count
     * @Version 1.0 2017-03-31 1:34:49 PM, created by 代鹏 ([email protected])
     * @return
     */
    public TreeMap<String, AtomicInteger> getAllStatisticMap() {
        if (MapUtils.isNotEmpty(countMap)) {
            ValueComparator valueComparator = new ValueComparator(countMap);
            TreeMap<String, AtomicInteger> statisticMap = new TreeMap<String, AtomicInteger>(valueComparator);
            statisticMap.putAll(countMap);
            return statisticMap;
        }
        return new TreeMap<String, AtomicInteger>();
    }

    /**
     * Description: get the statistics for the top words
     * @Version 1.0 2017-03-31 1:44:03 PM, created by 代鹏 ([email protected])
     * @param top
     * @return
     */
    public TreeMap<String, AtomicInteger> topStatisticMap(int top) {
        TreeMap<String, AtomicInteger> totalStatisticMap = this.getAllStatisticMap();
        int size = totalStatisticMap.size();
        if (size <= top) {
            return totalStatisticMap;
        } else {
            // Note: this TreeMap uses natural key ordering, so the result is sorted by word, not by count
            TreeMap<String, AtomicInteger> subMap = new TreeMap<String, AtomicInteger>();
            int loop = 0;
            for (Map.Entry<String, AtomicInteger> entry : totalStatisticMap.entrySet()) {
                if (loop >= top) {
                    break;
                }
                subMap.put(entry.getKey(), entry.getValue());
                loop++;
            }
            return subMap;
        }
    }

    /**
     * Description: Map Value Comparator
     * All Rights Reserved.
     * @Version 1.0 2017-03-31 1:25:30 PM, created by 代鹏 ([email protected])
     */
    class ValueComparator implements Comparator<String> {

        Map<String, AtomicInteger> base;

        public ValueComparator(Map<String, AtomicInteger> base) {
            this.base = base;
        }

        // Orders keys by descending count; never returns 0, so use it for ordering only, not lookups
        @Override
        public int compare(String a, String b) {
            if (base.get(a).get() >= base.get(b).get()) {
                return -1;
            } else {
                return 1;
            }
        }
    }

}
```

spring.xml configuration:

<bean id="wordFrequencyStatisticsProcess" class="com.llb.cloud.nlp.process.WordFrequencyStatisticsProcess" init-method="init"/>
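As a side note on the counting part: going by the posted code, `topStatisticMap` copies the top entries into a `TreeMap` without the comparator, so they come back ordered by word rather than by count. A minimal JDK-only sketch that keeps count order is shown below. The class `TopWordCounter` and its methods are hypothetical names for illustration and are independent of the word library.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical counter: thread-safe increments, top-N returned in descending count order. */
public class TopWordCounter {

    private final ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    /** Adds one occurrence for every word in the list. */
    public void add(List<String> words) {
        for (String word : words) {
            counts.computeIfAbsent(word, k -> new AtomicInteger()).incrementAndGet();
        }
    }

    /** Returns up to n entries, highest count first; ties are broken alphabetically. */
    public LinkedHashMap<String, Integer> top(int n) {
        LinkedHashMap<String, Integer> result = new LinkedHashMap<>();
        counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Integer.compare(b.getValue().get(), a.getValue().get());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .forEach(e -> result.put(e.getKey(), e.getValue().get()));
        return result;
    }

    public static void main(String[] args) {
        TopWordCounter counter = new TopWordCounter();
        counter.add(Arrays.asList("b", "a", "b", "c", "a", "b"));
        System.out.println(counter.top(2)); // prints {b=3, a=2}
    }
}
```

A `LinkedHashMap` preserves insertion order, which is why the sorted entries stay sorted in the result. The sketch assumes no other thread is updating the counts while `top` runs.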

jenopob · 2018-06-02T03:38:56Z

Solved, this

lixuanli · 2018-06-06T06:36:52Z

@sosojustdo Thanks.
