Word segmentation is very slow in 1.3 #57

Open
joyJZhang opened this issue Feb 9, 2018 · 9 comments

@joyJZhang

Version 1.2 was fast, but version 1.3 prints a lot of log output and is quite slow.
18:23:38.131 [main] INFO o.a.w.s.SegmentationFactory - 构造分词实现类:org.apdplat.word.segmentation.impl.MaxNgramScore
18:23:38.159 [main] INFO org.apdplat.word.util.WordConfTools - 开始加载配置文件
18:23:38.160 [main] INFO org.apdplat.word.util.WordConfTools - 加载配置文件:word.conf
18:23:38.164 [main] INFO org.apdplat.word.util.WordConfTools - 未找到配置文件:word.local.conf
18:23:38.165 [main] INFO org.apdplat.word.util.WordConfTools - 配置文件加载完毕,耗时5 毫秒,配置项数目:33
18:23:38.165 [main] INFO org.apdplat.word.util.WordConfTools - 配置信息:
18:23:38.263 [main] INFO org.apdplat.word.util.WordConfTools - 1、auto.detect=true
How should I deal with this?

@YLongo

YLongo commented Apr 27, 2018

It is only slow the first time it loads. Alternatively, you can change the log level.

@sosojustdo

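@joyJZhang Loading the configuration files, segmenting, and building the model are indeed slow at project startup, but once the project has started successfully, direct calls are fast.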

@jankeyfu

How do I preload it?

@sosojustdo

@fujiangkun Define a plain class, register it as a bean in spring.xml, and specify an init-method; then do the preloading inside that init-method.
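
The preloading idea can be sketched without Spring. The following is a minimal, self-contained illustration under stated assumptions, not the word library's API: `SlowSegmenter` is a hypothetical stand-in whose constructor represents the expensive dictionary and model load, and `SegmenterHolder` is an assumed name for the class that owns the single shared instance. The only point it makes is that the costly construction happens once, at startup, and every later call reuses the same object.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a real segmenter; NOT part of the word library.
class SlowSegmenter {
    // Counts how many times the expensive initialization has run.
    static final AtomicInteger LOADS = new AtomicInteger();

    SlowSegmenter() {
        // A real segmenter would load dictionaries and models here.
        LOADS.incrementAndGet();
    }

    List<String> seg(String text) {
        // Placeholder logic: split on whitespace instead of real segmentation.
        return Arrays.asList(text.trim().split("\\s+"));
    }
}

// Assumed helper that owns the single shared segmenter instance.
class SegmenterHolder {
    private static volatile SlowSegmenter instance;

    // Call once at application startup (the role init-method plays in Spring).
    static synchronized void init() {
        if (instance == null) {
            instance = new SlowSegmenter();
            instance.seg("warm up"); // trigger any lazy work up front
        }
    }

    static List<String> seg(String text) {
        if (instance == null) {
            init(); // fall back to lazy initialization if init() was skipped
        }
        return instance.seg(text);
    }

    public static void main(String[] args) {
        init();
        System.out.println(seg("hello word segmentation"));
        System.out.println(seg("second call reuses the instance"));
        System.out.println("loads = " + SlowSegmenter.LOADS.get());
    }
}
```

In a Spring application, the call to `init()` would typically be made by the container through the bean's init-method rather than from `main`.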

@lixuanli

@sosojustdo Could you explain the exact steps? I call DictionaryFactory.reload() at project startup, but every later call to WordSegmenter.seg() still initializes the configuration.

@jenopob

jenopob commented May 28, 2018

I'm running into the same problem as the commenter above. How did you solve it?

@sosojustdo

sosojustdo commented May 30, 2018

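@lixuanli @jenopob Write a process class whose init method runs when the Spring container starts and initializes the segmenter. Your code can then call the segmentation methods directly through that bean.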
`package com.llb.cloud.nlp.process;

import java.io.IOException;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections.MapUtils;
import org.apache.commons.lang.StringUtils;
import org.apdplat.word.dictionary.DictionaryFactory;
import org.apdplat.word.recognition.StopWord;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;
import org.apdplat.word.segmentation.Word;
import org.apdplat.word.util.WordConfTools;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.llb.cloud.util.WordUtil;

/**
 * Description: word-frequency statistics processing.
 * All Rights Reserved.
 *
 * @version 1.0 March 31, 2017, 12:17:24 PM, created by 代鹏 ([email protected])
 */
public class WordFrequencyStatisticsProcess {

    private static final Logger logger = LoggerFactory.getLogger(WordFrequencyStatisticsProcess.class);

    // Segmenter
    private final Segmentation segmentation = SegmentationFactory.getSegmentation(SegmentationAlgorithm.MaxNgramScore);
    // Word -> count map
    private ConcurrentHashMap<String, AtomicInteger> countMap = new ConcurrentHashMap<String, AtomicInteger>();
    // Whether to exclude stop words from the statistics
    private boolean removeStopWord = false;
    // Words from the custom dictionary; only these are counted
    private static Set<String> myDicSet = new HashSet<String>();

    public void init() {
        try {
            // Load the custom dictionary
            myDicSet = WordUtil.readFileToSet("/nlpconfig/my_dic.txt");

            // Force these settings
            WordConfTools.set("dic.path", "classpath:nlpconfig/my_dic.txt,classpath:dic.txt");
            WordConfTools.set("stopwords.path", "classpath:nlpconfig/my_stopwords.txt,classpath:stopwords.txt");
            WordConfTools.set("ngram", "no");
            WordConfTools.set("person.name.recognize", "false");
            WordConfTools.set("recognition.tool.enabled", "false");
            DictionaryFactory.reload();

            // Segment a dummy string once so the slow first-time initialization happens at startup
            WordFrequencyStatisticsProcess wordProcess = new WordFrequencyStatisticsProcess();
            wordProcess.seg("初始化...");
        } catch (IOException e) {
            logger.error("init load my dic file data error:", e);
        }
    }

    public boolean isRemoveStopWord() {
        return removeStopWord;
    }

    public void setRemoveStopWord(boolean removeStopWord) {
        this.removeStopWord = removeStopWord;
    }

    public Map<String, AtomicInteger> getCountMap() {
        return countMap;
    }

    public void reSet() {
        countMap.clear();
    }

    /**
     * Description: segments the text and updates the word counts.
     *
     * @version 1.0 March 31, 2017, 1:34:58 PM, created by 代鹏 ([email protected])
     * @param text the text to segment
     */
    public void seg(String text) {
        List<Word> words = segmentation.seg(text);
        if (CollectionUtils.isNotEmpty(words)) {
            for (Word word : words) {
                if (isRemoveStopWord() && StopWord.is(word.getText())) {
                    // Skip this stop word but keep processing the remaining words
                    continue;
                }
                // Only count words that appear in the custom dictionary
                if (StringUtils.isNotBlank(word.getText()) && myDicSet.contains(word.getText())) {
                    statistics(word, 1, countMap);
                }
            }
        }
    }

    private void statistics(Word word, int times, ConcurrentHashMap<String, AtomicInteger> container) {
        statistics(word.getText(), times, container);
    }

    private void statistics(String word, int times, ConcurrentHashMap<String, AtomicInteger> container) {
        container.putIfAbsent(word, new AtomicInteger());
        container.get(word).addAndGet(times);
    }

    /**
     * Description: gets all word-frequency statistics.
     *
     * @version 1.0 March 31, 2017, 1:34:49 PM, created by 代鹏 ([email protected])
     * @return a map of all counted words, ordered by ValueComparator
     */
    public TreeMap<String, AtomicInteger> getAllStatisticMap() {
        if (MapUtils.isNotEmpty(countMap)) {
            ValueComparator valueComparator = new ValueComparator(countMap);
            TreeMap<String, AtomicInteger> statisticMap = new TreeMap<String, AtomicInteger>(valueComparator);
            statisticMap.putAll(countMap);
            return statisticMap;
        }
        return new TreeMap<String, AtomicInteger>();
    }

    /**
     * Description: gets the top word-frequency statistics.
     *
     * @version 1.0 March 31, 2017, 1:44:03 PM, created by 代鹏 ([email protected])
     * @param top the maximum number of entries to return
     * @return a map holding at most {@code top} entries
     */
    public TreeMap<String, AtomicInteger> topStatisticMap(int top) {
        TreeMap<String, AtomicInteger> totalStatisticMap = this.getAllStatisticMap();
        int size = totalStatisticMap.size();
        if (size <= top) {
            return totalStatisticMap;
        } else {
            // Reuse the source map's comparator so the result keeps the same ordering
            // instead of falling back to alphabetical key order
            TreeMap<String, AtomicInteger> subMap = new TreeMap<String, AtomicInteger>(totalStatisticMap.comparator());
            int loop = 0;
            for (Map.Entry<String, AtomicInteger> entry : totalStatisticMap.entrySet()) {
                if (loop >= top) {
                    break;
                }
                subMap.put(entry.getKey(), entry.getValue());
                loop++;
            }
            return subMap;
        }
    }

    /**
     * Description: Map Value Comparator
     * All Rights Reserved.
     *
     * @version 1.0 March 31, 2017, 1:25:30 PM, created by 代鹏 ([email protected])
     */
    class ValueComparator implements Comparator<String> {

        Map<String, AtomicInteger> base;

        public ValueComparator(Map<String, AtomicInteger> base) {
            this.base = base;
        }

        @Override
        public int compare(String a, String b) {
            // Higher counts sort first; ties fall back to the word itself so that
            // compare(a, a) == 0 and TreeMap lookups behave correctly
            int byCount = Integer.compare(base.get(b).get(), base.get(a).get());
            return byCount != 0 ? byCount : a.compareTo(b);
        }
    }

}`

spring.xml configuration:

<bean id="wordFrequencyStatisticsProcess" class="com.llb.cloud.nlp.process.WordFrequencyStatisticsProcess" init-method="init"/>
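
Ordering a `TreeMap` by its values is fragile, because the comparator must stay consistent while the counts keep changing. As an alternative sketch (an illustration, not the commenter's code and not part of the word library), the counts can live in a `ConcurrentHashMap` and be sorted only when the top entries are requested. The class name `TopWords` and its methods are assumptions made for this example; ties are broken alphabetically so the result is deterministic.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of the word library.
class TopWords {
    private final ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    // Thread-safe increment: create the counter on first sight, then add to it.
    void add(String word, int times) {
        counts.computeIfAbsent(word, k -> new AtomicInteger()).addAndGet(times);
    }

    // Returns up to n entries, highest count first; ties are ordered by word.
    Map<String, Integer> top(int n) {
        // Take a snapshot of the current values so sorting sees stable numbers.
        List<Map.Entry<String, Integer>> snapshot = new ArrayList<>();
        for (Map.Entry<String, AtomicInteger> e : counts.entrySet()) {
            snapshot.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue().get()));
        }
        snapshot.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.max(0, Math.min(n, snapshot.size()));
        // LinkedHashMap preserves the sorted insertion order.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : snapshot.subList(0, limit)) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        TopWords stats = new TopWords();
        for (String w : "b a c a b a".split(" ")) {
            stats.add(w, 1);
        }
        System.out.println(stats.top(2)); // prints {a=3, b=2}
    }
}
```

Sorting a snapshot on demand costs more per query than a pre-sorted structure, but it avoids a map whose ordering depends on values that change after insertion.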

@jenopob

jenopob commented Jun 2, 2018

Solved, this

@lixuanli

lixuanli commented Jun 6, 2018

@sosojustdo Thanks.
