Trying out Scrapy + Splash

I've been experimenting with Scrapy and Splash and have run into some problems. My spiders kept failing with HTTP 502 and 504 errors, so I tried testing Splash directly in my browser. First I ran "sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash --max-timeout 3600 -v3" to start Splash, then I went to localhost:8050. The web UI opens correctly and I can enter code. Here is the main function I'm trying to run:

function main(splash, args)
  assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))
  splash.resource_timeout = 30.0
  splash.images_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    --png = splash:png(),
    --har = splash:har(),
  }
end

I'm trying to render http://boingboing.net/blog with this function and get an 'invalid_hostname' Lua error; here are the logs:

2017-08-01 18:26:28+0000 [-] Log opened.
2017-08-01 18:26:28.077457 [-] Splash version: 3.0
2017-08-01 18:26:28.077838 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2
2017-08-01 18:26:28.077900 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609]
2017-08-01 18:26:28.077984 [-] Open files limit: 65536
2017-08-01 18:26:28.078046 [-] Can't bump open files limit
2017-08-01 18:26:28.180376 [-] Xvfb is started: ['Xvfb', ':1937726875', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
2017-08-01 18:26:28.226937 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2017-08-01 18:26:28.301002 [-] verbosity=3
2017-08-01 18:26:28.301116 [-] slots=50
2017-08-01 18:26:28.301202 [-] argument_cache_max_entries=500
2017-08-01 18:26:28.301530 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
2017-08-01 18:26:28.302122 [-] Site starting on 8050
2017-08-01 18:26:28.302219 [-] Starting factory <twisted.web.server.Site object at 0x7ffa08390dd8>
2017-08-01 18:26:32.660457 [-] "172.17.0.1" - - [01/Aug/2017:18:26:32 +0000] "GET / HTTP/1.1" 200 7677 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
2017-08-01 18:27:18.860020 [-] "172.17.0.1" - - [01/Aug/2017:18:27:18 +0000] "GET /info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend HTTP/1.1" 200 5656 "http://localhost:8050/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
2017-08-01 18:27:19.038565 [pool] initializing SLOT 0
libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: known incorrect sRGB profile
process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: UUID file '/etc/machine-id' should contain a hex string of length 32, not length 0, with no other text
See the manual page for dbus-uuidgen to correct this issue.
2017-08-01 18:27:19.066765 [render] [140711856519656] viewport size is set to 1024x768
2017-08-01 18:27:19.066964 [pool] [140711856519656] SLOT 0 is starting
2017-08-01 18:27:19.067071 [render] [140711856519656] function main(splash, args)\r\n  assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))\r\n  splash.resource_timeout = 30.0\r\n  splash.images_enabled = false\r\n  assert(splash:go(args.url))\r\n  assert(splash:wait(0.5))\r\n  return {\r\n    html = splash:html(),\r\n    --png = splash:png(),\r\n    --har = splash:har(),\r\n  }\r\nend
2017-08-01 18:27:19.070107 [render] [140711856519656] [lua_runner] dispatch cmd_id=__START__
2017-08-01 18:27:19.070270 [render] [140711856519656] [lua_runner] arguments are for command __START__, waiting for result of __START__
2017-08-01 18:27:19.070352 [render] [140711856519656] [lua_runner] entering dispatch/loop body, args=()
2017-08-01 18:27:19.070424 [render] [140711856519656] [lua_runner] send None
2017-08-01 18:27:19.070496 [render] [140711856519656] [lua_runner] send (lua) None
2017-08-01 18:27:19.070657 [render] [140711856519656] [lua_runner] got AsyncBrowserCommand(id=None, name='http_get', kwargs={'url': 'https://code.jquery.com/jquery-3.1.1.min.js', 'callback': '<a callback>'})
2017-08-01 18:27:19.070755 [render] [140711856519656] [lua_runner] instructions used: 70
2017-08-01 18:27:19.070834 [render] [140711856519656] [lua_runner] executing AsyncBrowserCommand(id=0, name='http_get', kwargs={'url': 'https://code.jquery.com/jquery-3.1.1.min.js', 'callback': '<a callback>'})
2017-08-01 18:27:19.071141 [network] [140711856519656] GET https://code.jquery.com/jquery-3.1.1.min.js
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method
2017-08-01 18:27:19.082150 [pool] [140711856519656] SLOT 0 is working
2017-08-01 18:27:19.082298 [pool] [140711856519656] queued
2017-08-01 18:28:39.151814 [network-manager] Download error 3: the remote host name was not found (invalid hostname) (https://code.jquery.com/jquery-3.1.1.min.js)
2017-08-01 18:28:39.152087 [network-manager] Finished downloading https://code.jquery.com/jquery-3.1.1.min.js
2017-08-01 18:28:39.152202 [render] [140711856519656] [lua_runner] dispatch cmd_id=0
2017-08-01 18:28:39.152268 [render] [140711856519656] [lua_runner] arguments are for command 0, waiting for result of 0
2017-08-01 18:28:39.152339 [render] [140711856519656] [lua_runner] entering dispatch/loop body, args=(PyResult('return', None, 'invalid_hostname'),)
2017-08-01 18:28:39.152400 [render] [140711856519656] [lua_runner] send PyResult('return', None, 'invalid_hostname')
2017-08-01 18:28:39.152468 [render] [140711856519656] [lua_runner] send (lua) (b'return', None, b'invalid_hostname')
2017-08-01 18:28:39.152582 [render] [140711856519656] [lua_runner] instructions used: 79
2017-08-01 18:28:39.152642 [render] [140711856519656] [lua_runner] caught LuaError LuaError('[string "function main(splash, args)\\r..."]:2: invalid_hostname',)
2017-08-01 18:28:39.152816 [pool] [140711856519656] SLOT 0 finished with an error <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48>: [Failure instance: Traceback: <class 'splash.exceptions.ScriptError'>: {'error': 'invalid_hostname', 'type': 'LUA_ERROR', 'source': '[string "function main(splash, args)\r..."]', 'message': 'Lua error: [string "function main(splash, args)\r..."]:2: invalid_hostname', 'line_number': 2}
    /app/splash/browser_tab.py:1180:_return_reply
    /app/splash/qtrender_lua.py:901:callback
    /app/splash/lua_runner.py:27:return_result
    /app/splash/qtrender.py:17:stop_on_error_wrapper
    --- <exception caught here> ---
    /app/splash/qtrender.py:15:stop_on_error_wrapper
    /app/splash/qtrender_lua.py:2257:dispatch
    /app/splash/lua_runner.py:195:dispatch
    ]
2017-08-01 18:28:39.152883 [pool] [140711856519656] SLOT 0 is closing <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48>
2017-08-01 18:28:39.152944 [render] [140711856519656] [splash] clearing 0 objects
2017-08-01 18:28:39.153026 [render] [140711856519656] close is requested by a script
2017-08-01 18:28:39.153304 [render] [140711856519656] cancelling 0 remaining timers
2017-08-01 18:28:39.153374 [pool] [140711856519656] SLOT 0 done with <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48>
2017-08-01 18:28:39.153997 [events] {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0", "error": {"error": 400, "info": {"error": "invalid_hostname", "type": "LUA_ERROR", "source": "[string \"function main(splash, args)\r...\"]", "message": "Lua error: [string \"function main(splash, args)\r...\"]:2: invalid_hostname", "line_number": 2}, "type": "ScriptError", "description": "Error happened while executing Lua script"}, "active": 0, "status_code": 400, "maxrss": 107916, "qsize": 0, "path": "/execute", "timestamp": 1501612119, "fds": 18, "args": {"render_all": false, "http_method": "GET", "png": 1, "url": "http://boingboing.net/blog", "wait": 0.5, "html": 1, "response_body": false, "har": 1, "load_args": {}, "lua_source": "function main(splash, args)\r\n  assert(splash:autoload(\"https://code.jquery.com/jquery-3.1.1.min.js\"))\r\n  splash.resource_timeout = 30.0\r\n  splash.images_enabled = false\r\n  assert(splash:go(args.url))\r\n  assert(splash:wait(0.5))\r\n  return {\r\n    html = splash:html(),\r\n    --png = splash:png(),\r\n    --har = splash:har(),\r\n  }\r\nend", "resource_timeout": 0, "uid": 140711856519656, "save_args": [], "viewport": "1024x768", "timeout": 3600, "images": 1}, "client_ip": "172.17.0.1", "rendertime": 80.11527562141418, "method": "POST", "_id": 140711856519656, "load": [0.46, 0.51, 0.54]}
2017-08-01 18:28:39.154127 [-] "172.17.0.1" - - [01/Aug/2017:18:28:38 +0000] "POST /execute HTTP/1.1" 400 325 "http://localhost:8050/info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
2017-08-01 18:28:39.154237 [pool] SLOT 0 is available

If I try without loading jQuery, I get a 'network5' Lua error (which is a kind of timeout). The logs for that run are as follows:

2017-08-01 18:31:07.110255 [-] "172.17.0.1" - - [01/Aug/2017:18:31:06 +0000] "GET /info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++--assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend HTTP/1.1" 200 5658 "http://localhost:8050/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
2017-08-01 18:31:07.489653 [pool] initializing SLOT 1
2017-08-01 18:31:07.490576 [render] [140711856961016] viewport size is set to 1024x768
2017-08-01 18:31:07.490692 [pool] [140711856961016] SLOT 1 is starting
2017-08-01 18:31:07.490829 [render] [140711856961016] function main(splash, args)\r\n  --assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))\r\n  splash.resource_timeout = 30.0\r\n  splash.images_enabled = false\r\n  assert(splash:go(args.url))\r\n  assert(splash:wait(0.5))\r\n  return {\r\n    html = splash:html(),\r\n    --png = splash:png(),\r\n    --har = splash:har(),\r\n  }\r\nend
2017-08-01 18:31:07.493641 [render] [140711856961016] [lua_runner] dispatch cmd_id=__START__
2017-08-01 18:31:07.493782 [render] [140711856961016] [lua_runner] arguments are for command __START__, waiting for result of __START__
2017-08-01 18:31:07.493865 [render] [140711856961016] [lua_runner] entering dispatch/loop body, args=()
2017-08-01 18:31:07.493937 [render] [140711856961016] [lua_runner] send None
2017-08-01 18:31:07.494010 [render] [140711856961016] [lua_runner] send (lua) None
2017-08-01 18:31:07.494270 [render] [140711856961016] [lua_runner] got AsyncBrowserCommand(id=None, name='go', kwargs={'baseurl': None, 'http_method': 'GET', 'headers': None, 'body': None, 'url': 'http://boingboing.net/blog', 'errback': '<an errback>', 'callback': '<a callback>'})
2017-08-01 18:31:07.494416 [render] [140711856961016] [lua_runner] instructions used: 166
2017-08-01 18:31:07.494502 [render] [140711856961016] [lua_runner] executing AsyncBrowserCommand(id=0, name='go', kwargs={'baseurl': None, 'http_method': 'GET', 'headers': None, 'body': None, 'url': 'http://boingboing.net/blog', 'errback': '<an errback>', 'callback': '<a callback>'})
2017-08-01 18:31:07.494576 [render] [140711856961016] HAR event: _onStarted
2017-08-01 18:31:07.494697 [render] [140711856961016] callback 0 is connected to loadFinished
2017-08-01 18:31:07.495031 [network] [140711856961016] GET http://boingboing.net/blog
2017-08-01 18:31:07.495617 [pool] [140711856961016] SLOT 1 is working
2017-08-01 18:31:07.495741 [pool] [140711856961016] queued
2017-08-01 18:31:37.789845 [network-manager] timed out, aborting: http://boingboing.net/blog
2017-08-01 18:31:37.790154 [network-manager] Finished downloading http://boingboing.net/blog
2017-08-01 18:31:37.791064 [render] [140711856961016] mainFrame().urlChanged http://boingboing.net/blog
2017-08-01 18:31:37.796078 [render] [140711856961016] mainFrame().initialLayoutCompleted
2017-08-01 18:31:37.796343 [render] [140711856961016] loadFinished: RenderErrorInfo(type='Network', code=5, text='Operation canceled', url='http://boingboing.net/blog')
2017-08-01 18:31:37.796420 [render] [140711856961016] loadFinished: disconnecting callback 0
2017-08-01 18:31:37.796518 [render] [140711856961016] [lua_runner] dispatch cmd_id=0
2017-08-01 18:31:37.796576 [render] [140711856961016] [lua_runner] arguments are for command 0, waiting for result of 0
2017-08-01 18:31:37.796640 [render] [140711856961016] [lua_runner] entering dispatch/loop body, args=(PyResult('return', None, 'network5'),)
2017-08-01 18:31:37.796699 [render] [140711856961016] [lua_runner] send PyResult('return', None, 'network5')
2017-08-01 18:31:37.796765 [render] [140711856961016] [lua_runner] send (lua) (b'return', None, b'network5')
2017-08-01 18:31:37.796883 [render] [140711856961016] [lua_runner] instructions used: 175
2017-08-01 18:31:37.796943 [render] [140711856961016] [lua_runner] caught LuaError LuaError('[string "function main(splash, args)\\r..."]:5: network5',)
2017-08-01 18:31:37.797093 [pool] [140711856961016] SLOT 1 finished with an error <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828>: [Failure instance: Traceback: <class 'splash.exceptions.ScriptError'>: {'error': 'network5', 'type': 'LUA_ERROR', 'source': '[string "function main(splash, args)\r..."]', 'message': 'Lua error: [string "function main(splash, args)\r..."]:5: network5', 'line_number': 5}
    /app/splash/browser_tab.py:533:_on_content_ready
    /app/splash/qtrender_lua.py:702:error
    /app/splash/lua_runner.py:27:return_result
    /app/splash/qtrender.py:17:stop_on_error_wrapper
    --- <exception caught here> ---
    /app/splash/qtrender.py:15:stop_on_error_wrapper
    /app/splash/qtrender_lua.py:2257:dispatch
    /app/splash/lua_runner.py:195:dispatch
    ]
2017-08-01 18:31:37.797158 [pool] [140711856961016] SLOT 1 is closing <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828>
2017-08-01 18:31:37.797217 [render] [140711856961016] [splash] clearing 0 objects
2017-08-01 18:31:37.797310 [render] [140711856961016] close is requested by a script
2017-08-01 18:31:37.797430 [render] [140711856961016] cancelling 0 remaining timers
2017-08-01 18:31:37.797491 [pool] [140711856961016] SLOT 1 done with <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828>
2017-08-01 18:31:37.798067 [events] {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0", "error": {"error": 400, "info": {"error": "network5", "type": "LUA_ERROR", "source": "[string \"function main(splash, args)\r...\"]", "message": "Lua error: [string \"function main(splash, args)\r...\"]:5: network5", "line_number": 5}, "type": "ScriptError", "description": "Error happened while executing Lua script"}, "active": 0, "status_code": 400, "maxrss": 113372, "qsize": 0, "path": "/execute", "timestamp": 1501612297, "fds": 21, "args": {"render_all": false, "http_method": "GET", "png": 1, "url": "http://boingboing.net/blog", "wait": 0.5, "html": 1, "response_body": false, "har": 1, "load_args": {}, "lua_source": "function main(splash, args)\r\n  --assert(splash:autoload(\"https://code.jquery.com/jquery-3.1.1.min.js\"))\r\n  splash.resource_timeout = 30.0\r\n  splash.images_enabled = false\r\n  assert(splash:go(args.url))\r\n  assert(splash:wait(0.5))\r\n  return {\r\n    html = splash:html(),\r\n    --png = splash:png(),\r\n    --har = splash:har(),\r\n  }\r\nend", "resource_timeout": 0, "uid": 140711856961016, "save_args": [], "viewport": "1024x768", "timeout": 3600, "images": 1}, "client_ip": "172.17.0.1", "rendertime": 30.308406591415405, "method": "POST", "_id": 140711856961016, "load": [0.39, 0.42, 0.49]}
2017-08-01 18:31:37.798190 [-] "172.17.0.1" - - [01/Aug/2017:18:31:37 +0000] "POST /execute HTTP/1.1" 400 309 "http://localhost:8050/info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++--assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
2017-08-01 18:31:37.798294 [pool] SLOT 1 is available

If I additionally comment out the resource_timeout line, I get a 'network3' Lua error (again an invalid hostname, just reported differently).
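For reference, the number in errors like 'network3' and 'network5' appears to be the Qt network error code: the logs above show code=5 next to "Operation canceled" and "Download error 3" next to "the remote host name was not found". Below is a small Python sketch for decoding such strings. The helper name describe_splash_error is hypothetical, and the table covers only the first few QNetworkReply::NetworkError values.

```python
import re

# Subset of Qt's QNetworkReply::NetworkError codes (names as in the Qt docs).
QT_NETWORK_ERRORS = {
    1: "ConnectionRefusedError",
    2: "RemoteHostClosedError",
    3: "HostNotFoundError",
    4: "TimeoutError",
    5: "OperationCanceledError",
    6: "SslHandshakeFailedError",
}


def describe_splash_error(error):
    """Map a Splash error string such as 'network5' to a Qt error name.

    Hypothetical helper: strings that do not look like 'network<N>'
    (for example 'invalid_hostname') are returned unchanged.
    """
    match = re.fullmatch(r"network(\d+)", error)
    if match is None:
        return error
    code = int(match.group(1))
    return QT_NETWORK_ERRORS.get(code, "unknown Qt network error %d" % code)


for name in ("network3", "network5", "invalid_hostname"):
    print(name, "->", describe_splash_error(name))
```

Read this way, 'network5' after a 30-second resource_timeout is the request being cancelled, and 'network3' is the same DNS failure that autoload reported as 'invalid_hostname'.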

Any idea what I'm doing wrong?


asked by Craig on 01.08.2017


Answers (2)


As it turned out, this isn't a Scrapy/Splash problem at all; it's a Docker/IP routing/network administration problem. The network admins have set things up so that I can only make HTTP requests through a specific destination; adding "--net=host" to my docker run command seems to have fixed it. This web page was very helpful.
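One way to confirm that the container's networking is at fault, rather than Splash, is to check whether name resolution works from inside the container. This is a minimal Python sketch using only the standard library. The function names can_resolve and check_hosts are hypothetical, and the resolver parameter exists only so the lookup can be swapped out when testing.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address."""
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # DNS lookup failed: the situation Splash reports as
        # 'invalid_hostname' / 'network3'.
        return False


def check_hosts(hostnames, resolver=socket.getaddrinfo):
    """Return a dict mapping each hostname to whether it resolves."""
    return {host: can_resolve(host, resolver) for host in hostnames}
```

Calling check_hosts(["code.jquery.com", "boingboing.net"]) on the host and then inside the container should show whether the two environments differ. If both names resolve on the host but not in the container, the Docker network setup is the likely culprit.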

answered by Craig on 02.08.2017

Try changing

function main(splash, args)
    ...
    assert(splash:go(args.url))
    ...

to

function main(splash)
    ...
    assert(splash:go(splash.args.url))
    ...

At least that's how it reads in the default script when I open Splash on port 8050. With this change, your script works for me.
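To try a script outside the web UI, it can be POSTed to the /execute endpoint, which is what the UI itself does according to the logs in the question. This is a rough Python sketch with the standard library only. The names build_execute_request and run_script are hypothetical, and the Lua script is a trimmed version of the one above, assuming Splash listens on localhost:8050.

```python
import json
import urllib.request

# Trimmed version of the script from the question, using splash.args.url.
LUA_SOURCE = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))
  return {html = splash:html()}
end
"""


def build_execute_request(splash_base, target_url, lua_source=LUA_SOURCE):
    """Build (but do not send) a JSON POST request for Splash's /execute."""
    payload = {"lua_source": lua_source, "url": target_url}
    return urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_script(splash_base, target_url):
    """Send the request and decode the JSON table returned by main().

    urlopen raises urllib.error.HTTPError on a 400 response, which is
    what Splash returns for a Lua error such as 'network3'.
    """
    request = build_execute_request(splash_base, target_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With Splash running, run_script("http://localhost:8050", "http://boingboing.net/blog")["html"] should contain the rendered page if the script and the network are both fine.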

answered by Tomáš Linhart on 02.08.2017