Forward maximum matching algorithm: test error #13

Open
nlpjoe opened this issue Sep 20, 2017 · 3 comments

nlpjoe commented Sep 20, 2017

Hello, in the forward maximum matching word segmentation exercise, I ran into an error when running eval.py. The traceback is:

Traceback (most recent call last):
  File "eval.py", line 175, in <module>
    num_recall, num_pred, num_gold = evaluate(pred_inst, gold_inst, opt.mode)
  File "eval.py", line 34, in evaluate
    assert (pred.raw == gold.raw)
AssertionError

My maximum matching segmentation code is as follows:

def max_match_segment(line, dic):
    # write your code here
    # line = line.decode('utf-8')
    s = ""  # current window of the pattern
    s_f = ""  # window extended one character ahead
    ret = []
    tmp = set()
    for cur_word in line:  # line is a str
        s = s_f
        s_f += cur_word  # extend s_f by one character
        if len(tmp) == 0:  # candidate set is empty: rebuild it from the dictionary words that contain s_f
            tmp = set([word for word in dic if s_f in word])
        else:  # candidate set is not empty: drop the words that no longer contain s_f
            tmp = set([elem for elem in tmp if s_f in elem])

        if len(tmp) == 0:  # longest match found: add it to the result list
            ret.append(s)
            s_f = "" + cur_word  # reset the look-ahead window
    return ret

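I am on macOS. The text written to output.dat was garbled at first. After decoding and then re-encoding it as UTF-8 the text looks fine, but running python eval.py --format=segment --mode=segment --eval=output.dat --gold=eval.dat still fails with the same error. What could be the cause?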

@Oneplus Oneplus self-assigned this Sep 20, 2017
@mozillazg

The main cause is an encoding problem. You only need to make every string unicode:

  • decode line to unicode
  • decode the strings in dic to unicode as well
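
A minimal sketch of that normalization, written for Python 3 (where str is already unicode and only bytes needs decoding). The helper name to_text and the sample inputs are illustrative assumptions, not part of the exercise code:

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical inputs: a raw line and dictionary entries read as bytes.
raw_line = "立体声等功能。".encode("utf-8")
raw_dic = ["立体声".encode("utf-8"), "功能".encode("utf-8")]

line = to_text(raw_line)
dic = set(to_text(word) for word in raw_dic)

# The byte string is three times as long as the decoded text here.
print(len(raw_line), len(line))
```

With both line and dic decoded, iterating over line yields whole characters, and membership tests against dic compare like with like.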

output.dat is garbled because, when a string is not unicode, a single Chinese character does not have length 1, so slicing can cut a character in half:

>>> s = '你好'
>>> len(s)
6
>>> s[:2]
'\xe4\xbd'
>>> print s[:2]
�
>>>
>>> u = s.decode('utf-8')
>>> len(u)
2
>>> u[:2]
u'\u4f60\u597d'
>>> print u[:2]
你好
>>>
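
The session above is Python 2. As a rough Python 3 equivalent (an assumption about the reader's environment, not something stated in this thread), the same effect shows up when slicing bytes instead of str:

```python
s = "你好".encode("utf-8")  # byte string: 3 bytes per character here
u = s.decode("utf-8")       # unicode string: 1 item per character

print(len(s), len(u))

# Slicing bytes can cut through the middle of a character.
half = s[:2]
print(half.decode("utf-8", errors="replace"))

# Slicing the unicode string always keeps characters whole.
print(u[:1])
```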


nlpjoe commented Sep 26, 2017

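"After decoding and then re-encoding it as UTF-8 the text looks fine"

Hello, I had already fixed the garbled-text problem at the very beginning. I also ran this test on the Windows machine in our lab, and it fails with the same error: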

Traceback (most recent call last):
  File "eval.py", line 175, in <module>
    num_recall, num_pred, num_gold = evaluate(pred_inst, gold_inst, opt.mode)
  File "eval.py", line 34, in evaluate
    assert (pred.raw == gold.raw)
AssertionError

@mozillazg

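@nlpjoe I reproduced the problem using your code and the "decode, then re-encode as UTF-8" approach. The cause is that your max_match_segment does not handle the case where the last few characters of line are not in the dictionary. For example, 立体声等功能。 is segmented into 立体声 等功 能, which is missing the trailing 。 compared with the original string, so the check in eval.py fails.
Handle this case and eval.py will run normally.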
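
One possible way to handle it, sketched here in Python 3 as a conventional forward maximum matching loop rather than a patch to the code above. This is not the exercise's reference solution. The toy dictionary is hypothetical, and the final assertion only assumes that the check in eval.py amounts to comparing the concatenated segments with the original line:

```python
def max_match_segment(line, dic):
    """Forward maximum matching over a unicode string `line`.

    At each position take the longest dictionary word that starts there;
    if none matches, emit the single character so nothing is dropped.
    """
    max_len = max((len(word) for word in dic), default=1)
    ret = []
    i = 0
    while i < len(line):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(line) - i), 0, -1):
            candidate = line[i:i + size]
            if size == 1 or candidate in dic:
                ret.append(candidate)
                i += size
                break
    return ret


# Hypothetical toy dictionary, not the exercise's dictionary file.
dic = {"立体声", "功能", "等"}
line = "立体声等功能。"
words = max_match_segment(line, dic)
print(" ".join(words))

# Assumed to mirror the eval.py check: the segments must rebuild the input.
assert "".join(words) == line
```

Because every position emits at least one character, characters that are not in the dictionary (such as the final 。) are kept as single-character segments instead of being lost at the end of the line.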
