What is an efficient way to remove stop words with TextBlob for sentiment analysis of text?

I am trying to implement a Naive Bayes algorithm for sentiment analysis of newspaper headlines. I am using TextBlob for this, and I am having trouble removing stop words such as "a", "the", "in", etc. Below is a snippet of my Python code:

from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob

test = [
    ("11 bonded labourers saved from shoe firm", "pos"),
    ("Scientists greet Abdul Kalam after the successful launch of Agni on May 22, 1989", "pos"),
    ("Heavy Winter Snow Storm Lashes Out In Northeast US", "neg"),
    ("Apparent Strike On Gaza Tunnels Kills 2 Palestinians", "neg")
]

with open('input.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")

print(cl.classify("Oil ends year with biggest gain since 2009"))  # "pos"
print(cl.classify("25 dead in Baghdad blasts"))  # "neg"

Srinidhi Bhat 20.02.2017

Answers (2)

You can load the JSON first and then build a list of (text, label) tuples, applying the replacement to each text.

Demo:

Suppose the file input.json looks something like this:

[
    {"text": "I love this sandwich.", "label": "pos"},
    {"text": "This is an amazing place!", "label": "pos"},
    {"text": "I do not like this restaurant", "label": "neg"}
]

Then you can use:

import json
from pprint import pprint

from textblob.classifiers import NaiveBayesClassifier

train_list = []
with open('input.json', 'r') as fp:
    json_data = json.load(fp)

for line in json_data:
    # Drop the stop word "is"; chain more replace() calls to remove other stop words.
    # Note: this only matches a stop word that has a space on both sides.
    text = line['text'].replace(" is ", " ")
    label = line['label']
    train_list.append((text, label))

# Train on the cleaned (text, label) tuples
cl = NaiveBayesClassifier(train_list)

pprint(train_list)

Output:

[(u'I love this sandwich.', u'pos'),
 (u'This an amazing place!', u'pos'),
 (u'I do not like this restaurant', u'neg')]
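Chaining replace() calls misses stop words at the start or end of a headline and is case-sensitive. As a minimal sketch of the same idea, the cleanup could instead match whole words with a regular expression. The STOP_WORDS set and the remove_stop_words helper below are hypothetical names chosen for illustration, and the stop-word list is deliberately tiny:

```python
import re

# Hypothetical stop-word list for illustration; a real list would be longer.
STOP_WORDS = {"a", "an", "the", "in", "is"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Remove whole-word, case-insensitive matches of stop words from text."""
    if not stop_words:
        return text
    # \b anchors keep "is" from matching inside words such as "This"
    pattern = r"\b(?:" + "|".join(map(re.escape, sorted(stop_words))) + r")\b"
    without = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the extra whitespace left behind by the removals
    return " ".join(without.split())

# Same sample records as in input.json above
raw = [
    {"text": "I love this sandwich.", "label": "pos"},
    {"text": "This is an amazing place!", "label": "pos"},
    {"text": "I do not like this restaurant", "label": "neg"},
]
train_list = [(remove_stop_words(row["text"]), row["label"]) for row in raw]
print(train_list)
```

The resulting train_list has the same (text, label) shape as in the answer above, so it can be handed to NaiveBayesClassifier in the same way.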

saurabhsuman4797 20.02.2017

Below is code for removing stop words from text. Put all the stop words in a file named stopwords, then read the words and store them in the variable stop_words.


# Read a file and return its lines as a list, optionally lowercased
def readFileandReturnAnArray(fileName, readMode, isLower):
    myArray = []
    # The with statement closes the file, so no explicit close() is needed
    with open(fileName, readMode) as readHandle:
        for line in readHandle:
            lineRead = line.strip()
            if isLower:
                lineRead = lineRead.lower()
            myArray.append(lineRead)
    return myArray

stop_words = readFileandReturnAnArray("stopwords", "r", True)

# Remove every token of tweet_text that appears in stop_words.
# Matching is case-sensitive, so lowercase the text first if the stop words were lowercased.
def removeItemsInTweetContainedInAList(tweet_text, stop_words, splitBy):
    wordsArray = tweet_text.split(splitBy)
    stopWordSet = set(stop_words)
    keptWords = [word for word in wordsArray if word not in stopWordSet]
    return splitBy.join(keptWords).strip()


# Call the function above; the sample headline is taken from the question
tweet_text = "25 dead in Baghdad blasts"
tweet_text = removeItemsInTweetContainedInAList(tweet_text.strip(), stop_words, " ")
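For comparison, here is a compact sketch of the same file-based approach that keeps the stop words in a set and compares tokens case-insensitively, so "In" and "in" are treated alike. The helper names load_stop_words and strip_stop_words are hypothetical, and the demo writes a temporary file only so that the snippet runs without an existing stopwords file:

```python
import os
import tempfile

def load_stop_words(path):
    """Read one stop word per line into a lowercase set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def strip_stop_words(text, stop_words):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

# Demo: write a throwaway stop-word file (its contents are an assumption)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("a\nthe\nin\n\nIn\n")
    stop_path = tmp.name

stop_words = load_stop_words(stop_path)
os.remove(stop_path)

cleaned = strip_stop_words("Heavy Winter Snow Storm Lashes Out In Northeast US",
                           stop_words)
print(cleaned)
```

Using a set makes each membership check constant time on average, which matters once the stop-word list grows to a few hundred entries.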

jo_Veera 15.03.2019

Comment

The code above is my own algorithm for removing stop words, without using the built-in classifier. - jo_Veera; 26.03.2019
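If the stop words should be dropped inside the classifier rather than in a separate preprocessing pass, one option is a custom feature extractor. The sketch below is plain Python and does not import TextBlob; STOP_WORDS and stopword_free_extractor are hypothetical names, and the punctuation stripping is a simplification rather than TextBlob's own tokenization:

```python
# Hypothetical stop-word set, assumed for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "in", "on", "of"})

def stopword_free_extractor(document, train_set=None):
    """Return contains(word) features for every non-stop-word token.

    train_set is accepted but unused, so the function can be called
    with either one or two arguments.
    """
    # Lowercase each token and trim surrounding punctuation
    tokens = (tok.strip(".,!?;:\"'()").lower() for tok in document.split())
    return {
        "contains({})".format(tok): True
        for tok in tokens
        if tok and tok not in STOP_WORDS
    }

features = stopword_free_extractor("25 dead in Baghdad blasts")
print(features)
```

Assuming TextBlob's NaiveBayesClassifier accepts a feature_extractor argument, as its documentation describes, this function could be passed as NaiveBayesClassifier(train_list, feature_extractor=stopword_free_extractor), so that the same filtering applies at both training and classification time.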
