Как найти наиболее часто встречающиеся слова до и после заданного слова в данном тексте в python?

У меня есть большой текст, и я пытаюсь получить наиболее часто встречающиеся слова до и после данного слова в этом тексте.

Например:

Я хочу знать, какое слово встречается чаще всего после слова «озеро». В идеале должно получиться что-то вроде этого: (слово 1, # вхождение), (слово 2, # вхождение),...

То же самое и со словами, которые были раньше...

Я попробовал бигран NLTK, но, похоже, он находит только самые распространенные n-grans... Можно ли как-то исправить одно из слов и найти наиболее частые n-grans на основе фиксированного слова)?

Спасибо за любую помощь!!


person user2950162    schedule 16.11.2013    source источник


Ответы (2)


Вы ищете что-то подобное?

text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
""".split()

from nltk import bigrams

bgs = bigrams(text)
lake_bgs = filter(lambda item: item[0] == 'lake', bgs)

from collections import Counter
c = Counter(map(lambda item: item[1], lake_bgs))
print c.most_common()

Какой вывод:

[('is', 4), ('("lake,', 1), ('or', 1), ('comes', 1), ('are', 1)]

Обратите внимание, что вы можете использовать ifilter, imap, etc..., если у вас очень длинный текст.

Изменить: вот код для и после 'lake'.

from nltk import trigrams

tgs = trigrams(text)
lake_tgs = filter(lambda item: item[1] == 'lake', tgs)

from collections import Counter

before_lake = map(lambda item: item[0], lake_tgs)
after_lake = map(lambda item: item[2], lake_tgs)

c = Counter(before_lake + after_lake)
print c.most_common()

Обратите внимание, что это можно сделать и с помощью bigrams :)

person Ohad    schedule 16.11.2013
comment
Спасибо! Это было именно так для n+1 слов. Я новичок в программировании и не мог понять, как настроить код для N-1... - person user2950162; 16.11.2013
comment
Извините, нажмите Enter, прежде чем закончить выше... есть ли способ настроить его так, чтобы он считал слово до, а не слово после? Как вы думаете, сработает ли это и с триграммой? - person user2950162; 16.11.2013

Просто чтобы добавить к ответу @Ohad, вот реализация ngram в NLTK с некоторой масштабируемостью.

#-*- coding: utf8 -*-

import string
from nltk import ngrams
from itertools import chain
from collections import Counter

text = """
A lake is a body of relatively still water of considerable size, localized in a basin, that is surrounded by land apart from a river, stream, or other form of moving water that serves to feed or drain the lake. Lakes are inland and not part of the ocean and therefore are distinct from lagoons, and are larger and deeper than ponds.[1][2] Lakes can be contrasted with rivers or streams, which are usually flowing. However most lakes are fed and drained by rivers and streams.
Natural lakes are generally found in mountainous areas, rift zones, and areas with ongoing glaciation. Other lakes are found in endorheic basins or along the courses of mature rivers. In some parts of the world there are many lakes because of chaotic drainage patterns left over from the last Ice Age. All lakes are temporary over geologic time scales, as they will slowly fill in with sediments or spill out of the basin containing them.
Many lakes are artificial and are constructed for industrial or agricultural use, for hydro-electric power generation or domestic water supply, or for aesthetic or recreational purposes.
Etymology, meaning, and usage of "lake"[edit]
Oeschinen Lake in the Swiss Alps
Lake Tahoe on the border of California and Nevada
The Caspian Sea is either the world's largest lake or a full-fledged sea.[3]
The word lake comes from Middle English lake ("lake, pond, waterway"), from Old English lacu ("pond, pool, stream"), from Proto-Germanic *lakō ("pond, ditch, slow moving stream"), from the Proto-Indo-European root *leǵ- ("to leak, drain"). Cognates include Dutch laak ("lake, pond, ditch"), Middle Low German lāke ("water pooled in a riverbed, puddle"), German Lache ("pool, puddle"), and Icelandic lækur ("slow flowing stream"). Also related are the English words leak and leach.
There is considerable uncertainty about defining the difference between lakes and ponds, and no current internationally accepted definition of either term across scientific disciplines or political boundaries exists.[4] For example, limnologists have defined lakes as water bodies which are simply a larger version of a pond, which can have wave action on the shoreline or where wind-induced turbulence plays a major role in mixing the water column. None of these definitions completely excludes ponds and all are difficult to measure. For this reason there has been increasing use made of simple size-based definitions to separate ponds and lakes. One definition of lake is a body of water of 2 hectares (5 acres) or more in area;[5]:331[6] however, others[who?] have defined lakes as waterbodies of 5 hectares (12 acres) and above,[citation needed] or 8 hectares (20 acres) and above[citation needed] (see also the definition of "pond"). Charles Elton, one of the founders of ecology, regarded lakes as waterbodies of 40 hectares (99 acres) or more.[7] The term lake is also used to describe a feature such as Lake Eyre, which is a dry basin most of the time but may become filled under seasonal conditions of heavy rainfall. In common usage many lakes bear names ending with the word pond, and a lesser number of names ending with lake are in quasi-technical fact, ponds. One textbook illustrates this point with the following: "In Newfoundland, for example, almost every lake is called a pond, whereas in Wisconsin, almost every pond is called a lake."[8]
One hydrology book proposes to define it as a body of water with the following five chacteristics:[4]
it partially or totally fills one or several basins connected by straits[4]
has essentially the same water level in all parts (except for relatively short-lived variations caused by wind, varying ice cover, large inflows, etc.)[4]
it does not have regular intrusion of sea water[4]
a considerable portion of the sediment suspended in the water is captured by the basins (for this to happen they need to have a sufficiently small inflow-to-volume ratio)[4]
the area measured at the mean water level exceeds an arbitrarily chosen threshold (for instance, one hectare)[4]
With the exception of the sea water intrusion criterion, the other ones have been accepted or elaborated upon by other hydrology publications.[9][10]
"""

def ngrammer(txt, n):
    # Removes punctuations and numbers.
    sentences = "".join([i for i in txt if i not in string.punctuation and not i.isdigit()]).split('\n')
    return list(chain(*[ngrams(i.split(), n) for i in sentences]))

def before_after(ngs, word):
    word_grams = filter(lambda item: item[1] == word, ngs)
    before = map(lambda item: item[0], ngs)
    after = map(lambda item: item[2], ngs)
    return before, after

bgs = ngrammer(text,2) # bigrams
tgs = ngrammer(text,3) # trigrams
xgs = ngrammer(text,10) # 10grams

focus = 'lake'
bf, af = before_after(xgs, focus)
c = Counter(bf+af)

# Most common word before and after 'lake' from the 10grams.
print c.most_common()[0]
person alvas    schedule 28.12.2013