Text extraction in R with the stringi package

I have the text below, and I need to extract a certain number of words before and after specific keywords.

Example:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
stri_extract_all_fixed(sometext , c('engineering plastics', 'iso 9001','office automation'), case_insensitive=TRUE, overlap=TRUE)

Actual result:

[[1]]
[1] "engineering plastics"

[[2]]
[1] "iso 9001"

[[3]]
[1] "office automation"

Required result:

[1] globally expanding its engineering plastics centered on polycarbonate resin
[2] accordance with iso 9001 (8-4, 8-2), the regular implementation of

Basically, I need to extract the text before and after the specific words I mentioned.

r stringi stringr text-extraction

PRAVEEN R 27.12.2016

comment

Your call to stri_extract_all_fixed refers to the variable prav_1, which is not defined. Please make your example reproducible. - drammock 28.12.2016

comment

All of the text is either before or after your specific words. You seem to want 3 words before 'engineering plastics' and 4 words after; 2 words before 'iso 9001' and quite a few after... Do you have a reliable rule you can explain for how much to extract before and after? - Gregor Thomas 28.12.2016

comment

Please replace prav_1 with sometext - PRAVEEN R 29.12.2016

comment

I need 10 words before and 10 words after. - PRAVEEN R 29.12.2016

Answers (1)


Here is an idea to get you started:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
words <- c('engineering plastics', 'iso 9001','office automation')
# up to 10 "words" before each keyword, then up to 10 after it
pattern <- stri_paste("([^ ]+ ){0,10}", words, "( [^ ]+){0,10}")
stri_extract_all_regex(sometext, pattern, case_insensitive=TRUE)

Explanation: I add a simple regular expression before the words you need:

"([^ ]+ ){0,10}"

which means:

anything except a space, repeated as many times as possible
then a space
and all of that up to ten times

The expression after the words, "( [^ ]+){0,10}", is the same thing mirrored: the space comes first, because the text right after a keyword starts with a space.

This is very simple and naive (for example, every '&' or '>' counts as a word), but it works.
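
To make the idea easier to check, here is a minimal sketch on a short string. The helper name context_pattern and the sample string txt are illustrative assumptions, not part of the original post. The sketch uses \S+ and \s+ rather than literal spaces, so newlines and repeated spaces also count as separators, and it wraps the keyword in \Q...\E so that characters such as "(" or "." in a keyword are matched literally.

```r
library(stringi)

# Hypothetical helper: build a regex that captures up to `n`
# whitespace-separated tokens on each side of a literal keyword.
context_pattern <- function(keyword, n = 10) {
  stri_paste(
    "(?:\\S+\\s+){0,", n, "}",   # up to n tokens before, each followed by whitespace
    "\\Q", keyword, "\\E",       # the keyword itself, quoted so it is matched literally
    "(?:\\s+\\S+){0,", n, "}"    # up to n tokens after, each preceded by whitespace
  )
}

# Assumed sample text, shortened from the question
txt <- "one two three iso 9001 (8-4, 8-2), the regular implementation"

res <- stri_extract_first_regex(txt, context_pattern("iso 9001", n = 2),
                                case_insensitive = TRUE)
print(res)
# [1] "two three iso 9001 (8-4, 8-2),"
```

Punctuation still counts as part of a token here, so "(8-4," is one "word"; if that matters, the token definition would need to change.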

bartektartanus 17.02.2017
