Kruize remote monitoring functional failures due to 502 from listRecommendations #1281

Open
chandrams opened this issue Sep 4, 2024 · 8 comments

@chandrams
Contributor

Describe the bug
Kruize remote monitoring functional tests are failing with several different errors on OpenShift with the latest Kruize 0.0.24_mvp image.

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_release_tests/139/ - Kruize scalelab
https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/ - kruize scalelab

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_profile_notifications_cpu_zero_test_1_True_update_metrics0_323002_CPU_usage_is_zero__No_CPU_Recommendations_can_be_generated_/

      data = response.json()

test_list_recommendations.py:2867: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3.6/site-packages/requests/models.py:897: in json
    return complexjson.loads(self.text, **kwargs)
/usr/lib64/python3.6/json/__init__.py:354: in loads
    return _default_decoder.decode(s)
/usr/lib64/python3.6/json/decoder.py:339: in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <json.decoder.JSONDecoder object at 0x7f45e9154160>
s = '<html>\r\n  <head>\r\n    <meta name="viewport" content="width=device-width, initial-scale=1">\r\n\r\n    <style type...t least one pod running.\r\n          </li>\r\n        </ul>\r\n      </div>\r\n    </div>\r\n  </body>\r\n</html>\r\n'
idx = 0

    def raw_decode(self, s, idx=0):
        """Decode a JSON document from ``s`` (a ``str`` beginning with
        a JSON document) and return a 2-tuple of the Python
        representation and the index in ``s`` where the document ended.
    
        This can be used to decode a JSON document from a string that may
        have extraneous data at the end.
    
        """
        try:
            obj, end = self.scan_once(s, idx)
        except StopIteration as err:
>           raise JSONDecodeError("Expecting value", s, err.value) from None
E           json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

/usr/lib64/python3.6/json/decoder.py:357: JSONDecodeError

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_cpu_mem_optimised/

                 response = list_recommendations(experiment_name)
>                   assert response.status_code == SUCCESS_200_STATUS_CODE
E                   assert 502 == 200
E                    +  where 502 = <Response [502]>.status_code

test_list_recommendations.py:2740: AssertionError
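
Both failures above come from the same kind of response: a non-200 status with an HTML body. As a minimal sketch (not the existing test helper code), a check like the one below could report the status code and the start of the body instead of a bare JSONDecodeError. The name `json_or_fail` is hypothetical; it assumes only that the response object exposes `status_code`, `text`, and `json()`, as `requests.Response` does.

```python
def json_or_fail(response, expected_status=200):
    """Return the decoded JSON body, or raise AssertionError with context.

    Hypothetical helper: works with any object that exposes status_code,
    text and json(), such as requests.Response.
    """
    if response.status_code != expected_status:
        raise AssertionError(
            f"Expected status {expected_status}, got {response.status_code}; "
            f"body starts with: {response.text[:200]!r}"
        )
    try:
        return response.json()
    except ValueError as err:
        # requests raises a ValueError subclass when the body is not valid JSON
        raise AssertionError(
            f"Status {response.status_code} but body is not JSON: "
            f"{response.text[:200]!r}"
        ) from err
```

With this in place, a 502 from the router would surface in the test report as the status code plus the first part of the HTML error page, which makes the gateway failure easier to tell apart from a malformed Kruize response.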
@chandrams chandrams added the bug Something isn't working label Sep 4, 2024
@chandrams chandrams added this to the Kruize 0.0.24_rm Release milestone Sep 4, 2024
@chandrams chandrams self-assigned this Sep 4, 2024
@chandrams
Contributor Author

I commented out the test_list_recommendations_cpu_mem_optimised test, which failed with a 502 error, and ran the sanity test suite manually; all the tests passed. I will run the entire test suite and check again.

@msvinaykumar
Contributor

I see an error occurring while creating the experiment. It could be related to the state, such as whether Kruize and its related pods, including the database service, are ready to handle the request.

@dinogun dinogun moved this to In Progress in Monitoring Sep 4, 2024
@chandrams
Contributor Author

Yes, the create experiment call failed with a 502 error in this job, so I commented out the test below; the other tests work fine.

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_cpu_mem_optimised/

We need to check why the 502 error occurs when we run the entire sanity bucket.

@chandrams
Contributor Author

After commenting out test_list_recommendations_cpu_mem_optimised and running the entire functional test suite manually, two new tests failed because listRecommendations returned a 502 error response:

Listing the recommendations...
URL =  http://kruize-openshift-tuning.apps.kruize-scalelab.h0b5.p1.openshiftapps.com/listRecommendations
PARAMS =  {'experiment_name': 'quarkus-resteasy-kruize-min-http-response-time-db_0'}
Response status code =  502
                    
************************************************************
<html><body><h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
</body></html>
.
.
FAILED test_list_recommendations.py::test_list_recommendations_for_diff_reco_terms_with_only_latest[long_term_test_true-15-reco_json_schema4-360.0-True-False]
FAILED test_list_recommendations.py::test_list_recommendations_for_diff_reco_terms_with_only_latest[long_term_test_false-15-reco_json_schema5-360.0-False-False]
========== 17 failed, 10 passed, 334 deselected in 2597.64s (0:43:17) ==========

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 2988 seconds
Number of tests performed 358
Number of tests passed 313
Number of tests failed 45

~~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests failed ~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed cases are :
                  negative
                  extended

@chandrams chandrams changed the title Kruize remote monitoring functional tests are failing Kruize remote monitoring functional failures due to 502 from listRecommendations Sep 4, 2024
@chandrams
Contributor Author

Executed the test suite again and the above 2 failures were not seen, so the 502 error appears to be intermittent.

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 5843 seconds
Number of tests performed 358
Number of tests passed 315
Number of tests failed 43

~~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests failed ~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed cases are :
		  negative
		  extended

Check Log Directory: /home/jenkins/test_res_alltests_0.0.24_skip_cpu_mem_optimized/kruize_test_results/kruize_20240904:07:45:37/remote_monitoring_tests for failed cases 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

************************************** done *************************************

*********************************************************************************
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Overall summary of the tests ~~~~~~~~~~~~~~~~~~~~~~~
Total time taken to perform the test 5843 seconds
Total Number of test suites performed 1
Total Number of tests performed 358
Total Number of tests passed 315
Total Number of tests failed 43

These 43 failures are due to known issues.

Executed only the sanity bucket with the previously skipped test, test_list_recommendations_cpu_mem_optimised, re-enabled. It passed, and the 502 error was not seen.

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 2051 seconds
Number of tests performed 155
Number of tests passed 155
Number of tests failed 0

~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests passed ~~~~~~~~~~~~~~~~~~~~~~~~~~

************************************** done *************************************

@chandrams
Contributor Author

Logs of another sanity run that failed with a Kruize pod restart:
test_res_sanity_functional_0.0.24.zip

@chandrams
Contributor Author

I ran one of the failing tests on its own with the builds below; here are the results:

pytest -s test_list_recommendations.py::test_list_recommendations_cpu_mem_optimised --cluster_type openshift

Executed this test 5 times:

With 0.0.22_mvp, the test did not fail in 5 runs (the failure could still be very intermittent)
With 0.0.23_mvp, the test failed in 2 of 5 runs
With 0.0.24_mvp, the test failed in 2 of 5 runs
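
For reference, the repeated runs can be scripted. The following is a minimal sketch, assuming it is launched from the directory that contains test_list_recommendations.py and that pytest's non-zero exit code is a good enough failure signal. `count_failures` and `run_pytest_once` are hypothetical names, not part of the Kruize test scripts.

```python
import subprocess

# Command taken from this comment; the wrapper functions are illustrative only.
TEST_CMD = [
    "pytest", "-s",
    "test_list_recommendations.py::test_list_recommendations_cpu_mem_optimised",
    "--cluster_type", "openshift",
]


def run_pytest_once():
    """Run the single test once and return pytest's exit code (0 means pass)."""
    return subprocess.run(TEST_CMD).returncode


def count_failures(runs, run_once):
    """Call run_once() `runs` times and count the non-zero exit codes."""
    return sum(1 for _ in range(runs) if run_once() != 0)


# Example usage:
# print(count_failures(5, run_pytest_once), "of 5 runs failed")
```

Raising the run count above 5 would give a better estimate of the failure rate for builds such as 0.0.22_mvp, where no failure showed up in 5 runs.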

Note: When the test fails, the Kruize pod is restarted.
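
To confirm the restart and see why the container was terminated, the pod status can be inspected after a failing run. This is a minimal sketch, assuming the JSON comes from `oc get pods -n <namespace> -o json` (the namespace is left as a placeholder). `restart_summary` is a hypothetical name; `restartCount` and `lastState.terminated.reason` are standard Kubernetes container status fields.

```python
import json


def restart_summary(pods_json):
    """Map 'pod/container' to (restartCount, last termination reason or None).

    pods_json is the text output of `oc get pods -n <namespace> -o json`.
    """
    summary = {}
    for pod in json.loads(pods_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # lastState is empty if the container has never been restarted
            terminated = status.get("lastState", {}).get("terminated") or {}
            key = f"{pod_name}/{status['name']}"
            summary[key] = (status["restartCount"], terminated.get("reason"))
    return summary
```

A termination reason such as OOMKilled would point to memory limits, while a reason of Error with a non-zero exit code would point to a crash inside Kruize; either way it narrows down why the router returns 502 while the pod is restarting.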

@msvinaykumar @khansaad - Can you please take a look at this issue?

@chandrams chandrams assigned khansaad and unassigned chandrams Sep 16, 2024
@chandrams
Contributor Author

The issue is seen on the 0.0.25_mvp build too.
