How to ignore an unwanted pattern in a regular expression

I have the following Python code:

import re
from io import BytesIO

import pdfplumber, requests
test_case = {
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0514/2020051400555.pdf': 59,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0529/2020052902118.pdf': 55,
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0618/2020061800366.pdf': 47,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0630/2020063002674.pdf': 30,
}

for url, page in test_case.items():
    rq = requests.get(url)
    # pdfplumber.open also accepts a file-like object; older releases used pdfplumber.load
    pdf = pdfplumber.open(BytesIO(rq.content))
    txt = pdf.pages[page].extract_text()
    txt = re.sub("([^\x00-\x7F])+", "", txt)  # no chinese
    pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
    try:
        auditor = re.search(pattern, txt, flags=re.MULTILINE).group('auditor').strip()
        print(repr(auditor))
    except AttributeError:
        print(txt)
        print('============')
        print(url)

This produces the following output:

'ShineWing'
'ShineWing'
'Hong Kong Standards on Auditing (HKSAs) issued by the Hong Kong Institute of'
'Hong Kong Financial Reporting Standards issued by the Hong Kong Institute of'

Desired output:

'ShineWing'
'ShineWing'
'Ernst & Young'
'Elite Partners CPA Limited'

I tried:

pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)$(?!Institute)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

This pattern captures the last two cases but not the first two.

pattern = r'.*\n.*?(?P<auditor>^(?!Hong|Kong)[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

This gives the desired output, but ^(?!Hong|Kong) is potentially risky because it could reject other wanted results in the future, so it is not a good candidate.
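A minimal sketch of that concern, using made-up input: the firm name "Hong Kong Example CPA Limited" and the sample text are hypothetical and are not taken from any of the PDFs. With the ^(?!Hong|Kong) pattern above, an auditor whose name happens to start with "Hong" is not found at all.

```python
import re

# Second pattern from the question: the auditor must start at the beginning
# of a line, and that line must not begin with "Hong" or "Kong".
PATTERN = re.compile(
    r'.*\n.*?(?P<auditor>^(?!Hong|Kong)[A-Z].+?\n?)(?:LLP\s*)?\s*'
    r'((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants',
    flags=re.MULTILINE,
)

# Hypothetical text; the firm name is invented for illustration only.
sample = (
    "Independent auditor\n"
    "Hong Kong Example CPA Limited\n"
    "Certified Public Accountants\n"
)

match = PATTERN.search(sample)
print(match)  # prints None: the wanted auditor line is rejected
```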

By contrast, $(?!Institute) is more general and a better fit, but I have no idea why it fails to match in the first two cases. It would be great if I could ignore matches that contain "issued by the Hong Kong Institute of".

Any suggestion would be appreciated. Thanks.

python regex pdf-extraction

chris_2020 09.08.2020

comment

I asked the same question a few days ago, and a regex genius answered it: stackoverflow.com/a/62612757/13824946 - 09.08.2020

comment

@Cyber-Tech Can you be more specific? I tried ?<!, but it makes no difference for the 1st pattern, $(?!Institute). The last two are not captured. - chris_2020 09.08.2020

Answers (1)


pattern = r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'

This works.
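As a rough check of what the (?!.*Institute) lookahead changes, here is a small sketch that runs the question's original pattern and the pattern above on the same text. The sample text and the find_auditor helper are assumptions made up for this example; the real PDF pages may be laid out differently. The lookahead is tested at the point where the auditor group starts, and since . does not match a newline, it only rejects a start position when "Institute" appears later on that same line.

```python
import re

# Shared tail of both patterns, copied from the question.
TAIL = (r'(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered)'
        r' [Aa]ccountants')

# Original pattern from the question.
ORIGINAL = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)' + TAIL
# Pattern from this answer: the capture may not begin anywhere that still
# has "Institute" ahead of it on the same line.
WITH_LOOKAHEAD = r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)' + TAIL

# Hypothetical page text, loosely modelled on the outputs in the question.
SAMPLE = (
    "We conducted our audit in accordance with\n"
    "Hong Kong Standards on Auditing (HKSAs) issued by the Hong Kong Institute of\n"
    "Certified Public Accountants.\n"
    "\n"
    "Ernst & Young\n"
    "Certified Public Accountants\n"
    "Hong Kong\n"
)


def find_auditor(pattern, text):
    """Return the stripped 'auditor' group, or None when nothing matches."""
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group('auditor').strip() if match else None


print(repr(find_auditor(ORIGINAL, SAMPLE)))
# 'Hong Kong Standards on Auditing (HKSAs) issued by the Hong Kong Institute of'
print(repr(find_auditor(WITH_LOOKAHEAD, SAMPLE)))
# 'Ernst & Young'
```

Note that this would also skip a genuine auditor line that contains the word "Institute", so it trades one assumption about the text for another.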

Community 09.08.2020
