To get a quick sense of a long document, we usually summarize as we read an article or a book. In English, the first one or two sentences of an article have a very high chance of representing the whole article. Of course, the topic sentence can sometimes be the last sentence instead.

In NLP there are two approaches to text summarization. The first, the extractive approach, is the simpler one: it extracts keywords or sentences from the article. It has some limitations, and the quality of its results has been shown to be limited. The second, the abstractive approach, generates new sentences based on the given article, which requires more advanced techniques.

After reading this article, you will:

  1. Understand the PageRank algorithm
  2. Understand the TextRank algorithm
  3. Know how to use the TextRank algorithm for summarization

The PageRank algorithm was developed by Google to find the most important web pages, so that Google search results are relevant to the query.

PageRank operates on a directed graph. Initially, all nodes have the same score (1 / total number of nodes).

Algorithm

The first formula is the simplified version of PageRank, which we will use for the demonstration. The second is slightly more complex because it adds one more parameter, the damping factor "d". By default, d is 0.85.

Let's walk through the simplified version. In iteration 1, PageRank is calculated as follows:
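For reference, the two variants are commonly written as below. This is the standard textbook notation rather than something taken from this article: B_u (the set of nodes linking to u), L(v) (the number of outgoing links of v) and N (the total number of nodes) are symbols introduced here for illustration. Some presentations write the constant term of the damped version as (1 - d) instead of (1 - d) / N.

```latex
% Simplified PageRank
PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}

% PageRank with damping factor d (0.85 by default)
PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}
```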

  • A: (1/4) / 3. Only C points to A, so we take C's previous score (iteration 0) and divide it by the number of nodes that C points to (i.e. 3).
  • B: (1/4) / 2 + (1/4) / 3. Both A and C point to B, so A's previous score (iteration 0) is divided by the number of nodes that A points to (i.e. 2). For C, the contribution is the same as before: (1/4) / 3.
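The calculation above can be reproduced in a few lines of Python. This is a minimal sketch: the text only states that C points to A and B, that A points to B, and that A has two outgoing links and C has three. The remaining edges of the four-node graph below (A's second link and the targets of B and D) are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical 4-node graph: node -> nodes it links to.
# Only "C -> A", "A -> B" and "C -> B" come from the text; the rest is assumed.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}


def simplified_pagerank_step(links, scores):
    """One iteration of simplified PageRank (no damping factor)."""
    new_scores = {node: Fraction(0) for node in links}
    for source, targets in links.items():
        # A node's score is split evenly over its outgoing links.
        share = scores[source] / len(targets)
        for target in targets:
            new_scores[target] += share
    return new_scores


# Iteration 0: every node starts with 1 / total number of nodes.
iteration_0 = {node: Fraction(1, len(links)) for node in links}
iteration_1 = simplified_pagerank_step(links, iteration_0)

for node in sorted(iteration_1):
    print(node, iteration_1[node])
```

For A and B this gives 1/12 and 5/24, which matches (1/4) / 3 and (1/4) / 2 + (1/4) / 3.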

For details, you can watch the video for a full explanation.

Question: when should we stop iterating?

In theory, the calculation should continue until the scores no longer change significantly between iterations.
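One common way to turn this stopping rule into code is to compare consecutive iterations and stop once the largest change falls below a small tolerance. The sketch below uses the damped version with d = 0.85. The function name, the `tol` and `max_iter` parameters and the example graph are assumptions for illustration, and the code assumes every node has at least one outgoing link.

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=100):
    """Iterate damped PageRank until no score moves by more than `tol`."""
    n = len(links)
    scores = {node: 1.0 / n for node in links}
    for iteration in range(1, max_iter + 1):
        # Every node receives the constant (1 - d) / n term first.
        new_scores = {node: (1.0 - d) / n for node in links}
        for source, targets in links.items():
            for target in targets:
                new_scores[target] += d * scores[source] / len(targets)
        largest_change = max(abs(new_scores[k] - scores[k]) for k in links)
        scores = new_scores
        if largest_change < tol:
            break  # scores are stable enough, stop iterating
    return scores, iteration


# Hypothetical 4-node graph (node -> nodes it links to), assumed for the demo.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

scores, iterations = pagerank(links)
print("converged after", iterations, "iterations")
for node, score in sorted(scores.items()):
    print(node, round(score, 4))
```

A smaller `tol` gives more precise scores at the cost of more iterations; `max_iter` is only a safety cap.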

TextRank

Why introduce PageRank before TextRank? Because the idea of TextRank comes from PageRank, and it uses a similar graph-based algorithm to compute importance.

Differences:

  1. The TextRank graph is undirected, which means every edge is bidirectional.
  2. Edges are weighted, whereas every edge has weight 1 in PageRank. The weight can be calculated in different ways, e.g. BM25 or TF-IDF.

There are many ways to measure document similarity, such as BM25, cosine similarity, and IDF-modified cosine. You can pick whichever suits your problem best. If you are not familiar with these algorithms, let us know and we will cover them in a later post.
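To make the two differences concrete, here is a minimal sketch of TextRank over sentences. It is not gensim's implementation: instead of BM25 it uses the word-overlap similarity from the TextRank paper (shared words divided by the sum of the log sentence lengths), and the damping term is normalized by the number of sentences. The function names and sample sentences are made up for illustration.

```python
import math
import re


def similarity(words_a, words_b):
    """Word-overlap similarity, normalized by the log sentence lengths."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # log(1) == 0 would make the denominator zero
    shared = len(set(words_a) & set(words_b))
    return shared / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, d=0.85, tol=1e-6, max_iter=100):
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    # Undirected graph: the weight matrix is symmetric (w[i][j] == w[j][i]).
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = similarity(tokens[i], tokens[j])
    totals = [sum(row) for row in w]  # total edge weight per sentence
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour passes on its score in proportion to edge weight.
            incoming = sum(
                w[j][i] / totals[j] * scores[j] for j in range(n) if totals[j] > 0
            )
            new_scores.append((1.0 - d) / n + d * incoming)
        change = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


sentences = [
    "Microsoft held talks to acquire GitHub.",
    "GitHub hosts software projects for developers.",
    "Microsoft declined to comment on the talks about GitHub.",
    "It rained heavily yesterday.",
]
scores = textrank(sentences)
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(round(score, 3), sentence)
```

An extractive summary then simply keeps the highest-scoring sentences. The last sentence shares no words with the others, so it has no edges and ends up with the lowest score.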
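As an illustration of one of these options, the sketch below computes plain cosine similarity between two sentences using raw word counts. It is a simplified stand-in (no IDF weighting, no stop-word removal), and the function name and tokenization are assumptions for the example.

```python
import math
import re
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between bag-of-words vectors of two sentences."""
    counts_a = Counter(re.findall(r"\w+", sentence_a.lower()))
    counts_b = Counter(re.findall(r"\w+", sentence_b.lower()))
    # Dot product over the words that both sentences share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("Microsoft held talks to buy GitHub",
                        "Microsoft declined to comment on GitHub"))
```

The returned value can be used directly as the edge weight between two sentence nodes.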

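gensim provides a simple API to compute TextRank using BM25 (Best Match 25). Note that the gensim.summarization module was removed in gensim 4.0, so the examples below need a 3.x release.

Step 1. Set up the environment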

pip install gensim==3.4.0

Step 2. Import the library

import gensim 
print('gensim Version: %s' % (gensim.__version__))

Result:

gensim Version: 3.4.0

Step 3. Initialize the test content

# Captured from https://www.cnbc.com/2018/06/01/microsoft--github-acquisition-talks-resume.html
content = "Microsoft held talks in the past few weeks " + \
    "to acquire software developer platform GitHub, Business " + \
    "Insider reports. One person familiar with the discussions " + \
    "between the companies told CNBC that they had been " + \
    "considering a joint marketing partnership valued around " + \
    "$35 million, and that those discussions had progressed to " + \
    "a possible investment or outright acquisition. It is " + \
    "unclear whether talks are still ongoing, but this " + \
    "person said that GitHub's price for a full acquisition " + \
    "was more than Microsoft currently wanted to pay. GitHub " + \
    "was last valued at $2 billion in its last funding round " + \
    "2015, but the price tag for an acquisition could be $5 " + \
    "billion or more, based on a price that was floated " + \
    "last year. GitHub's tools have become essential to " + \
    "software developers, who use it to store code, " + \
    "keep track of updates and discuss issues. The privately " + \
    "held company has more than 23 million individual users in " + \
    "more than 1.5 million organizations. It was on track to " + \
    "book more than $200 million in subscription revenue, " + \
    "including more than $110 million from companies using its " + \
    "enterprise product, GitHub told CNBC last fall. Microsoft " + \
    "has reportedly flirted with buying GitHub in the past, " + \
    "including in 2016, although GitHub denied those " + \
    "reports. A partnership would give Microsoft another " + \
    "connection point to the developers it needs to court to " + \
    "build applications on its various platforms, including " + \
    "the Azure cloud. Microsoft could also use data from " + \
    "GitHub to improve its artificial intelligence " + \
    "products. The talks come amid GitHub's struggle to " + \
    "replace CEO and founder Chris Wanstrath, who stepped " + \
    "down 10 months ago. Business Insider reported that " + \
    "Microsoft exec Nat Friedman -- who previously " + \
    "ran Xamarin, a developer tools start-up that Microsoft " + \
    "acquired in 2016 -- may take that CEO role. Google's " + \
    "senior VP of ads and commerce, Sridhar Ramaswamy, has " + \
    "also been in discussions for the job, says the report. " + \
    "Microsoft declined to comment on the report. " + \
    "GitHub did not immediately return a request for comment."

Step 4. Try different ratios

The ratio parameter (a number between 0 and 1) determines what proportion of the original sentences is returned in the summary.
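A minimal sketch of this step, modelled on the word_count loop in Step 5. The helper name `print_summaries_by_ratio` is hypothetical, and the summarizer is passed in as an argument so the snippet runs without gensim installed. With gensim 3.x you would pass `gensim.summarization.summarize`, as shown in the trailing comment.

```python
def print_summaries_by_ratio(text, ratios, summarize):
    """Print one summary per ratio and return the summaries keyed by ratio."""
    summaries = {}
    print('Original Content:')
    print(text)
    for ratio in ratios:
        # `summarize` is expected to accept the text and a `ratio` keyword.
        summaries[ratio] = summarize(text, ratio=ratio)
        print()
        print('---> Summarized Content (Ratio is %.1f):' % ratio)
        print(summaries[ratio])
    return summaries


# Intended call, assuming gensim 3.x is installed and `content` is the
# text defined in Step 3:
#   import gensim
#   print_summaries_by_ratio(content, [0.3, 0.5, 0.7],
#                            gensim.summarization.summarize)
```

A higher ratio keeps more of the original sentences, so each summary is at least as long as the one for a smaller ratio.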


Result

Original Content:
On Friday, Business Insider reported that Microsoft has held talks to buy GitHub — a $2 billion startup that claims 24 million software developers as users. It's not immediately clear what will come of these talks. Microsoft declined to comment, but you can read the full Business Insider report here. While we wait for further word on the future of GitHub, one thing is very clear: It would make perfect sense for Microsoft to buy the startup. If the stars align, and GitHub is integrated intelligently into Microsoft's products, it could give the company a big edge against Amazon Web Services, the leading player in the fast-growing cloud market. Just to catch you up: GitHub is an online service that allows developers to host their software projects. From there, anyone from all over the world can download those projects and submit their own improvements. That functionality has made GitHub the center of the open source software. development world.

---> Summarized Content (Ratio is 0.3):
On Friday, Business Insider reported that Microsoft has held talks to buy GitHub — a $2 billion startup that claims 24 million software developers as users.
Just to catch you up: GitHub is an online service that allows developers to host their software projects.

---> Summarized Content (Ratio is 0.5):
On Friday, Business Insider reported that Microsoft has held talks to buy GitHub — a $2 billion startup that claims 24 million software developers as users.
Microsoft declined to comment, but you can read the full Business Insider report here.
While we wait for further word on the future of GitHub, one thing is very clear: It would make perfect sense for Microsoft to buy the startup.
Just to catch you up: GitHub is an online service that allows developers to host their software projects.

---> Summarized Content (Ratio is 0.7):
On Friday, Business Insider reported that Microsoft has held talks to buy GitHub — a $2 billion startup that claims 24 million software developers as users.
Microsoft declined to comment, but you can read the full Business Insider report here.
While we wait for further word on the future of GitHub, one thing is very clear: It would make perfect sense for Microsoft to buy the startup.
Just to catch you up: GitHub is an online service that allows developers to host their software projects.
From there, anyone from all over the world can download those projects and submit their own improvements.
development world.

Step 5. Try different word counts

The word_count parameter is another way to control the result. If both word_count and ratio are provided, ratio is ignored.

print('Original Content:')
print(content)
for word_count in [10, 30, 50]:
    # Summarize the same text that was printed above (`content` from Step 3).
    summarized_content = gensim.summarization.summarize(content, word_count=word_count)
    print()
    print('---> Summarized Content (Word Count is %d):' % word_count)
    print(summarized_content)

Result

Original Content:
Microsoft held talks in the past few weeks to acquire software developer platform GitHub, Business Insider reports. One person familiar with the discussions between the companies told CNBC that they had been considering a joint marketing partnership valued around $35 million, and that those discussions had progressed to a possible investment or outright acquisition. It is unclear whether talks are still ongoing, but this person said that GitHub's price for a full acquisition was more than Microsoft currently wanted to pay. GitHub was last valued at $2 billion in its last funding round 2015, but the price tag for an acquisition could be $5 billion or more, based on a price that was floated last year. GitHub's tools have become essential to software developers, who use it to store code, keep track of updates and discuss issues. The privately held company has more than 23 million individual users in more than 1.5 million organizations. It was on track to book more than $200 million in subscription revenue, including more than $110 million from companies using its enterprise product, GitHub told CNBC last fall. Microsoft has reportedly flirted with buying GitHub in the past, including in 2016, although GitHub denied those reports. A partnership would give Microsoft another connection point to the developers it needs to court to build applications on its various platforms, including the Azure cloud. Microsoft could also use data from GitHub to improve its artificial intelligence products. The talks come amid GitHub's struggle to replace CEO and founder Chris Wanstrath, who stepped down 10 months ago. Business Insider reported that Microsoft exec Nat Friedman -- who previously ran Xamarin, a developer tools start-up that Microsoft acquired in 2016 -- may take that CEO role. Google's senior VP of ads and commerce, Sridhar Ramaswamy, has also been in discussions for the job, says the report. Microsoft declined to comment on the report. 
GitHub did not immediately return a request for comment.

---> Summarized Content (Word Count is 10):


---> Summarized Content (Word Count is 30):
On Friday, Business Insider reported that Microsoft has held talks to buy GitHub — a $2 billion startup that claims 24 million software developers as users.

---> Summarized Content (Word Count is 50):
On Friday, Business Insider reported that Microsoft has held talks to buy GitHub — a $2 billion startup that claims 24 million software developers as users.
Just to catch you up: GitHub is an online service that allows developers to host their software projects.

Conclusion

You can find all the code on GitHub. Let us know if you would also like to learn about the abstractive approach; I will write an article on it later.

  • According to the gensim source code, the input should contain at least 10 sentences.
  • No training data or model building is required.
  • It works not only for English but for any other sequence input (symbols, Japanese, etc.). You can also read the TextRank research paper to understand the details.
  • In my experience, the result is not good in most cases. This may be due to the large variety of words, and because the result is only a subset of the input sentences.

About me

I am a data scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platform-related topics.

Visit my blog at http://medium.com/@makcedward/

Connect with me at https://www.linkedin.com/in/edwardma1026

Explore my code at https://github.com/makcedward

Check out my kernels at https://www.kaggle.com/makcedward