I get the error:
13/12/2017
20:51:38.141
2017-12-13 20:51:38,141 ERROR An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
Traceback (most recent call last):
  File "/splunk/etc/apps/website_input/bin/web_input.py", line 349, in run
    https_only=self.is_on_cloud(input_config.session_key))
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 710, in scrape_page
    additional_fields=additional_fields, **kw)
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 446, in get_result_single
    content_decoded = content.decode(encoding=encoding, errors='replace')
LookupError: unknown encoding: 3Dutf-8=
This is a confirmed bug. I was able to reproduce this using the unit test framework which simulates a web-server providing an encoding that is invalid. See the bug report here: https://lukemurphey.net/issues/2190.
I have updated the app to now be forgiving if it sees an encoding it doesn't recognize. This is currently working. This fix will go out in version 4.5.2 (ETA: early next week).
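For anyone curious what "forgiving" could look like, here is a minimal sketch. It is not the actual code from the app: the function name decode_forgiving and the utf-8 fallback are assumptions made purely for illustration. The idea is to check the encoding name with codecs.lookup before calling decode, so an unknown name no longer raises the LookupError shown in the traceback.

```python
import codecs


def decode_forgiving(content, encoding, fallback="utf-8"):
    """Decode bytes, falling back when the declared encoding is unknown.

    Hypothetical helper for illustration; returns (text, encoding_used).
    """
    try:
        # codecs.lookup raises LookupError for names Python has no codec for
        codecs.lookup(encoding)
        used = encoding
    except (LookupError, TypeError):
        # Unknown name (or None): use the assumed fallback instead of failing
        used = fallback
    # errors="replace" keeps going even if some bytes do not fit the codec
    return content.decode(used, errors="replace"), used


text, used = decode_forgiving(b"<html>ok</html>", "3Dutf-8=")
print(used, text)
```

The trade-off is that the content may be decoded with the wrong encoding, so it is worth logging a warning whenever the fallback is used.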
@moseisleydk: thanks for the report.
Incidentally, I was unable to reproduce this on http://www.mos-eisley.dk today. Not sure if something changed.
This was still a valid bug report, though, as I was able to reproduce it by recreating the scenario based on the stack trace you provided.
Excellent - looking forward to it. I still get the error on 4.5.1:
2018-01-27 07:29:58,776 ERROR An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
Traceback (most recent call last):
  File "/splunk/etc/apps/website_input/bin/web_input.py", line 349, in run
    https_only=self.is_on_cloud(input_config.session_key))
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 710, in scrape_page
    additional_fields=additional_fields, **kw)
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 446, in get_result_single
    content_decoded = content.decode(encoding=encoding, errors='replace')
LookupError: unknown encoding: 3Dutf-8=
@moseisleydk: Would you mind testing 4.5.2? You can get the app here: https://github.com/LukeMurphey/splunk-web-input/releases/tag/4.5.2-rc.1
I want to make sure that this fixes the issue since I wasn't able to reproduce the issue on 4.5.1 with your website.
02/07/2018 21:14:00.529 INFO The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/dashboard/\"
02/07/2018 21:14:00.529 INFO The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/dashboard/\", encoding="cp1252"
02/07/2018 21:12:12.258 INFO The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/feeds/network.action?username=bnp&max=40&publicFeed=false&os_authType=basic&rssType=atom", encoding="UTF-8"
02/07/2018 21:12:08.922 ERROR An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
02/07/2018 21:12:08.922 ERROR A general exception was thrown when executing a web request
02/07/2018 21:12:08.921 ERROR A general exception was thrown when executing a web request
02/07/2018 21:11:28.858 INFO The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/plugins/inlinetasks/\"
02/07/2018 21:11:28.858 INFO The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/plugins/inlinetasks/\", encoding="cp1252"
02/07/2018 21:11:27.651 INFO The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/users/\"
02/07/2018 21:11:27.651 INFO The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/users/\", encoding="cp1252"
02/07/2018 21:11:25.590 INFO The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/spaces/\"
02/07/2018 21:11:25.590 INFO The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/spaces/\", encoding="cp1252"
02/07/2018 21:11:12.047 WARNING Detected encoding was not recognized and the content will be evaluated (possibly with the wrong encoding), encoding_detected="Shift_JIS"
02/07/2018 21:11:08.724 WARNING Detected encoding was not recognized and the content will be evaluated (possibly with the wrong encoding), encoding_detected="3Dutf-8="
02/07/2018 21:10:59.968 INFO The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/\"
02/07/2018 21:10:59.968 INFO The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/\", encoding="cp1252"
02/07/2018 21:10:59.117 INFO Running web input, url="http://www.mos-eisley.dk"
A little background on what is going on here: the encoding is not being detected properly. I set up the input to deal better with a bad encoding. However, since the input doesn't know the proper encoding, it fails to parse the output.
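As an aside, the value "3Dutf-8=" looks like it could be quoted-printable residue, since "=3D" is the quoted-printable escape for "=". That is only a guess and has not been confirmed for this site. Under that assumption, a rough sketch for recovering a usable encoding name might look like the following; clean_charset is a hypothetical helper, not something from the app.

```python
import codecs
import quopri


def clean_charset(raw):
    """Return a codec name Python knows for a detected charset, or None.

    Hypothetical helper: assumes odd values such as "3Dutf-8=" are what is
    left of a quoted-printable escaped "charset=..." declaration.
    """
    try:
        # Use the value as-is when Python already recognizes it
        return codecs.lookup(raw).name
    except LookupError:
        pass
    # Put back the "=" that preceded the value, then undo the escaping
    decoded = quopri.decodestring(("=" + raw).encode("ascii", "ignore"))
    candidate = decoded.decode("ascii", "ignore").strip(" =\"'")
    try:
        return codecs.lookup(candidate).name
    except LookupError:
        return None


print(clean_charset("3Dutf-8="))
```

If the helper returns None, the input would still need a fallback encoding, so this complements the forgiving decode rather than replacing it.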
What is really weird is that I'm not getting the same repro. I have tried several times, but it never reproduces quite the same way.
@moseisleydk: could you provide some more details? I'm sorry for the back-and-forth; I'm just struggling to get a solid repro. I tried today and get a partial repro.
Here are some questions:
Are results coming through for any of the URLs?
You can try running the following search to check:
source="web_input://www_mos_eisley_dk" | table _time url match*
In my case, I am finding that I get results for everything but "http://www.mos-eisley.dk/dashboard/\\". That URL seems to just do a redirect to "http://www.mos-eisley.dk/dashboard/" which I do get results for.
What platform and version of Splunk is this running on?
I'm wondering if I cannot get an identical repro because I'm not on the same platform.
Ok, I'll hit you up on email.
Could you share the URL that you are using, if it is a publicly available one? I would like to reproduce this myself. It looks like the website is providing an invalid encoding and the Website Inputs app doesn't handle that yet. I want to update the app to handle it more gracefully.
BTW, it's Confluence from Atlassian.