Python Stemming and Lemmatization
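In natural language processing, we often encounter two or more words that share a common root. For example, agreed, agreeing, and agreeable all share the same root. A search involving any of these words should treat them as the same word, namely the root word. It is therefore important to link every word to its root. The NLTK library provides methods that perform this linking and output the root word.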
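The idea of linking words to a shared root can be sketched without any library. The example below is only an illustration under stated assumptions: `toy_stem`, `group_by_root`, and the three-entry `SUFFIXES` list are hypothetical names invented here, and the suffix stripping is far cruder than the Porter algorithm that NLTK implements.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration; a real stemmer uses far more rules.
SUFFIXES = ("able", "ing", "ed")

def toy_stem(word):
    """Strip one known suffix, then any trailing 'e', to get a crude root key."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require a few leading characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")

def group_by_root(words):
    """Map each crude root to the list of words that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[toy_stem(w)].append(w)
    return dict(groups)

groups = group_by_root(["agreed", "agreeing", "agreeable", "reader"])
print(groups)
```

A search index could use the root as its lookup key, so a query for any one of the three words would reach documents containing the others. Note that the root key produced here is not necessarily a real word.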
The following program uses the Porter stemming algorithm to find word stems.
import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize needs the NLTK tokenizer data to be downloaded beforehand.
porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# First, split the sentence into word tokens.
nltk_tokens = nltk.word_tokenize(word_data)

# Next, find the stem of each token.
for w in nltk_tokens:
    print("Actual: %s Stem: %s" % (w, porter_stemmer.stem(w)))
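Running the example above produces the following result. Recent NLTK releases lowercase the stem by default, so the first line may read Stem: it.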
Actual: It Stem: It
Actual: originated Stem: origin
Actual: from Stem: from
Actual: the Stem: the
Actual: idea Stem: idea
Actual: that Stem: that
Actual: there Stem: there
Actual: are Stem: are
Actual: readers Stem: reader
Actual: who Stem: who
Actual: prefer Stem: prefer
Actual: learning Stem: learn
Actual: new Stem: new
Actual: skills Stem: skill
Actual: from Stem: from
Actual: the Stem: the
Actual: comforts Stem: comfort
Actual: of Stem: of
Actual: their Stem: their
Actual: drawing Stem: draw
Actual: rooms Stem: room
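Lemmatization is similar to stemming, but it takes the context of a word into account, such as its part of speech, and returns a valid dictionary form called the lemma. It therefore links the different inflected forms of a word to a single entry instead of simply cutting off suffixes. For example, the plural readers is reduced to reader. The program below uses the WordNet lexical database for lemmatization.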
import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the NLTK WordNet data to be downloaded beforehand.
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# Split the sentence into word tokens.
nltk_tokens = nltk.word_tokenize(word_data)

# Look up the lemma of each token (treated as a noun by default).
for w in nltk_tokens:
    print("Actual: %s Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))
Running the code above produces the following result.
Actual: It Lemma: It
Actual: originated Lemma: originated
Actual: from Lemma: from
Actual: the Lemma: the
Actual: idea Lemma: idea
Actual: that Lemma: that
Actual: there Lemma: there
Actual: are Lemma: are
Actual: readers Lemma: reader
Actual: who Lemma: who
Actual: prefer Lemma: prefer
Actual: learning Lemma: learning
Actual: new Lemma: new
Actual: skills Lemma: skill
Actual: from Lemma: from
Actual: the Lemma: the
Actual: comforts Lemma: comfort
Actual: of Lemma: of
Actual: their Lemma: their
Actual: drawing Lemma: drawing
Actual: rooms Lemma: room
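In the lemmatizer result, learning, drawing, and originated come back unchanged. A likely reason is that `lemmatize` treats every token as a noun unless a part of speech is passed through its `pos` argument. The sketch below illustrates how the part of speech changes the lemma. It does not call NLTK: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical stand-ins with a few hand-written entries, not the WordNet database.

```python
# Hypothetical hand-written lemma table keyed by (word, part of speech).
# It stands in for a real lexical database such as WordNet.
LEMMA_TABLE = {
    ("learning", "v"): "learn",
    ("learning", "n"): "learning",
    ("drawing", "v"): "draw",
    ("drawing", "n"): "drawing",
    ("originated", "v"): "originate",
    ("rooms", "n"): "room",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

for word in ("learning", "drawing", "originated"):
    as_noun = toy_lemmatize(word, "n")
    as_verb = toy_lemmatize(word, "v")
    print("%s  as noun: %s  as verb: %s" % (word, as_noun, as_verb))
```

With NLTK itself, the corresponding experiment would be to call `wordnet_lemmatizer.lemmatize(w, pos="v")` for tokens known to be verbs, for instance after running a part-of-speech tagger over the sentence.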