Kruize remote monitoring functional failures due to 502 from listRecommendations #1281

chandrams · 2024-09-04T05:21:27Z

Describe the bug
Kruize remote monitoring functional tests are failing with several different errors on OpenShift with the latest Kruize 0.0.24_mvp image.

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_release_tests/139/ - Kruize scalelab
https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/ - kruize scalelab

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_profile_notifications_cpu_zero_test_1_True_update_metrics0_323002_CPU_usage_is_zero__No_CPU_Recommendations_can_be_generated_/

      data = response.json()

test_list_recommendations.py:2867: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/lib/python3.6/site-packages/requests/models.py:897: in json
    return complexjson.loads(self.text, **kwargs)
/usr/lib64/python3.6/json/__init__.py:354: in loads
    return _default_decoder.decode(s)
/usr/lib64/python3.6/json/decoder.py:339: in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <json.decoder.JSONDecoder object at 0x7f45e9154160>
s = '<html>\r\n  <head>\r\n    <meta name="viewport" content="width=device-width, initial-scale=1">\r\n\r\n    <style type...t least one pod running.\r\n          </li>\r\n        </ul>\r\n      </div>\r\n    </div>\r\n  </body>\r\n</html>\r\n'
idx = 0

    def raw_decode(self, s, idx=0):
        """Decode a JSON document from ``s`` (a ``str`` beginning with
        a JSON document) and return a 2-tuple of the Python
        representation and the index in ``s`` where the document ended.
    
        This can be used to decode a JSON document from a string that may
        have extraneous data at the end.
    
        """
        try:
            obj, end = self.scan_once(s, idx)
        except StopIteration as err:
>           raise JSONDecodeError("Expecting value", s, err.value) from None
E           json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

/usr/lib64/python3.6/json/decoder.py:357: JSONDecodeError

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_cpu_mem_optimised/

                 response = list_recommendations(experiment_name)
>                   assert response.status_code == SUCCESS_200_STATUS_CODE
E                   assert 502 == 200
E                    +  where 502 = <Response [502]>.status_code

test_list_recommendations.py:2740: AssertionError
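
The first traceback above shows `response.json()` being called on an HTML error page, so the real cause (the 502) is hidden behind a JSONDecodeError. A minimal sketch of checking the status code before decoding follows. `parse_json_body` is a hypothetical helper for illustration, not part of the Kruize test suite; it takes the `status_code` and `text` of a response.

```python
import json


def parse_json_body(status_code, text, expected_status=200):
    """Decode a response body as JSON, failing with a readable message.

    Hypothetical helper: pass response.status_code and response.text
    from a requests.Response.
    """
    if status_code != expected_status:
        # Report the HTTP status and a short body excerpt instead of
        # letting json.loads fail on an HTML error page.
        raise AssertionError(
            "expected HTTP %d, got %d; body starts with: %r"
            % (expected_status, status_code, text[:80])
        )
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Status was as expected but the body is still not JSON.
        raise AssertionError(
            "HTTP %d but body is not JSON (%s); body starts with: %r"
            % (status_code, err, text[:80])
        ) from err
```

With a guard like this, the cpu_zero test would likely have reported the unexpected status directly rather than "Expecting value: line 1 column 1 (char 0)".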

chandrams · 2024-09-04T05:40:47Z

I commented out the test_list_recommendations_cpu_mem_optimised test, which failed with a 502 error, and ran the sanity test suite manually; all the tests pass now. I will run the entire test suite and check again.

msvinaykumar · 2024-09-04T06:21:10Z

I see an error occurring while creating the experiment. It could be related to the state, such as whether Kruize and its related pods, including the database service, are ready to handle the request.
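
One way to rule out the readiness explanation is to poll the service before the first request. This is a sketch only: `wait_until_ready` and its `probe` argument are hypothetical names, and what counts as "ready" (for example, a cheap GET against the Kruize route returning 200) is an assumption the caller has to supply.

```python
import time


def wait_until_ready(probe, timeout=120.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or `timeout` seconds elapse.

    probe is a hypothetical zero-argument callable supplied by the caller.
    Exceptions raised by probe (e.g. connection refused) count as
    "not ready yet". Returns True when ready, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat probe errors as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.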

chandrams · 2024-09-04T07:00:50Z

Yes, the create experiment call failed with a 502 error in this job, so I commented out the test below; the other tests work fine.

https://ci.app-svc-perf.corp.redhat.com/job/ExternalTeams/job/Autotune/job/kruize_functional_tests/128/testReport/junit/rest_apis/test_list_recommendations/test_list_recommendations_cpu_mem_optimised/

We need to check why 502 occurs when we run the entire sanity bucket.

chandrams · 2024-09-04T08:03:46Z

After commenting out test_list_recommendations_cpu_mem_optimised and running the entire functional test suite manually, two new tests failed because listRecommendations returned a 502 response:

Listing the recommendations...
URL =  http://kruize-openshift-tuning.apps.kruize-scalelab.h0b5.p1.openshiftapps.com/listRecommendations
PARAMS =  {'experiment_name': 'quarkus-resteasy-kruize-min-http-response-time-db_0'}
Response status code =  502
                    
************************************************************
<html><body><h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
</body></html>
.
.
FAILED test_list_recommendations.py::test_list_recommendations_for_diff_reco_terms_with_only_latest[long_term_test_true-15-reco_json_schema4-360.0-True-False]
FAILED test_list_recommendations.py::test_list_recommendations_for_diff_reco_terms_with_only_latest[long_term_test_false-15-reco_json_schema5-360.0-False-False]
========== 17 failed, 10 passed, 334 deselected in 2597.64s (0:43:17) ==========

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 2988 seconds
Number of tests performed 358
Number of tests passed 313
Number of tests failed 45

~~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests failed ~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed cases are :
                  negative
                  extended

chandrams · 2024-09-04T10:23:07Z

I executed the test suite again and the 2 failures above were not seen; the 502 error seems to be intermittent.

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 5843 seconds
Number of tests performed 358
Number of tests passed 315
Number of tests failed 43

~~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests failed ~~~~~~~~~~~~~~~~~~~~~~~~~~
Failed cases are :
		  negative
		  extended

Check Log Directory: /home/jenkins/test_res_alltests_0.0.24_skip_cpu_mem_optimized/kruize_test_results/kruize_20240904:07:45:37/remote_monitoring_tests for failed cases 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

************************************** done *************************************

*********************************************************************************
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Overall summary of the tests ~~~~~~~~~~~~~~~~~~~~~~~
Total time taken to perform the test 5843 seconds
Total Number of test suites performed 1
Total Number of tests performed 358
Total Number of tests passed 315
Total Number of tests failed 43

These 43 failures are due to known issues.

I then executed only the sanity bucket with the skipped test, test_list_recommendations_cpu_mem_optimised, re-enabled; it passed and the 502 error did not occur.

########### Results Summary of the test suite remote_monitoring_tests ##########
remote_monitoring_tests took 2051 seconds
Number of tests performed 155
Number of tests passed 155
Number of tests failed 0

~~~~~~~~~~~~~~~~~~~~~~ remote_monitoring_tests passed ~~~~~~~~~~~~~~~~~~~~~~~~~~

************************************** done *************************************
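
Since the 502 comes and goes between runs, a bounded retry that records how many attempts were needed can help separate a transient gateway error from a persistent failure. The sketch below is a diagnostic aid under that assumption, not a fix, and `call_with_retry` is a hypothetical name. It accepts any zero-argument callable that returns an object with a `status_code` attribute.

```python
import time


def call_with_retry(request, retry_statuses=(502,), attempts=3,
                    delay=2.0, sleep=time.sleep):
    """Call request() and retry while its status code is in retry_statuses.

    request is a zero-argument callable returning an object with a
    status_code attribute (such as requests.Response).
    Returns a tuple (last_response, attempts_used).
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        response = request()
        if response.status_code not in retry_statuses:
            break  # success or a non-retryable status
        if attempt < attempts:
            sleep(delay)  # back off before the next attempt
    return response, attempt
```

In the test suite this could wrap the existing call, for example `call_with_retry(lambda: list_recommendations(experiment_name))`. Logging `attempts_used` matters here, because a silent retry would hide the intermittent failure instead of exposing it.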

chandrams · 2024-09-04T11:10:39Z

Logs of another sanity run that failed with a Kruize pod restart:
test_res_sanity_functional_0.0.24.zip
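
To confirm a restart programmatically instead of reading logs, the container restart counts can be compared before and after a run. A minimal sketch, assuming the input is the parsed JSON from `oc get pods -n <namespace> -o json` (the namespace is left for the reader to fill in) and using hypothetical helper names:

```python
from typing import Dict


def restart_counts(pod_list: dict) -> Dict[str, int]:
    """Map pod name to total container restarts from a pod list document.

    pod_list is the parsed output of `oc get pods -o json` (or kubectl).
    """
    counts = {}
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "<unknown>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[name] = sum(s.get("restartCount", 0) for s in statuses)
    return counts


def restarted_pods(before: Dict[str, int],
                   after: Dict[str, int]) -> Dict[str, int]:
    """Return pods whose restart count grew between two snapshots."""
    return {
        name: count - before.get(name, 0)
        for name, count in after.items()
        if count > before.get(name, 0)
    }
```

Taking one snapshot before the test and one after, then calling `restarted_pods`, shows which pods restarted during the run. A pod that was deleted and recreated under a new name would not show up this way.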

chandrams · 2024-09-11T11:08:28Z

I ran one of the failing tests on its own against the builds below. Here are the results:

pytest -s test_list_recommendations.py::test_list_recommendations_cpu_mem_optimised --cluster_type openshift

I executed this test 5 times with each build:

With 0.0.22_mvp, the test did not fail in 5 runs (the failure could still be very intermittent)
With 0.0.23_mvp, the test failed in 2 out of 5 runs
With 0.0.24_mvp, the test failed in 2 out of 5 runs

Note: When the test fails, the Kruize pod is restarted.

@msvinaykumar @khansaad - Could you please take a look at this issue?
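
The repeated-run comparison above can be scripted so the failure rate per build is collected the same way each time. This is a sketch with hypothetical helper names; it assumes pytest is installed for the current interpreter and relies on pytest exiting with code 0 when all selected tests pass.

```python
import subprocess
import sys


def count_failures(run_once, runs=5):
    """Call run_once() `runs` times; return how many calls returned non-zero."""
    return sum(1 for _ in range(runs) if run_once() != 0)


def pytest_runner(test_id, extra_args=()):
    """Build a zero-argument callable that runs one pytest invocation.

    The callable returns the pytest process exit code
    (0 means all selected tests passed).
    """
    cmd = [sys.executable, "-m", "pytest", "-s", test_id] + list(extra_args)
    return lambda: subprocess.run(cmd).returncode
```

For the test in this comment, the call would be `count_failures(pytest_runner("test_list_recommendations.py::test_list_recommendations_cpu_mem_optimised", ["--cluster_type", "openshift"]))`. With only 5 runs per build the rate is a rough signal, so a larger `runs` value gives a more reliable comparison between builds.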

chandrams · 2024-09-18T05:27:04Z

The issue is seen on the 0.0.25_mvp build too.

chandrams added the bug label Sep 4, 2024

chandrams added this to the Kruize 0.0.24_rm Release milestone Sep 4, 2024

chandrams self-assigned this Sep 4, 2024

chandrams added this to Monitoring Sep 4, 2024

chandrams mentioned this issue Sep 4, 2024

Test plan for Kruize release 0.0.24 #1272

Merged

dinogun moved this to In Progress in Monitoring Sep 4, 2024

chandrams changed the title ~~Kruize remote monitoring functional tests are failing~~ Kruize remote monitoring functional failures due to 502 from listRecommendations Sep 4, 2024

rbadagandi1 modified the milestones: Kruize 0.0.24_rm Release, Kruize 0.0.25_rm Release Sep 5, 2024

chandrams assigned khansaad and unassigned chandrams Sep 16, 2024

rbadagandi1 modified the milestones: Kruize 0.0.25 Release, Kruize 0.0.26 Release Sep 20, 2024

khansaad modified the milestones: Kruize 0.1 Release, Kruize 0.2 Release Oct 9, 2024

dinogun modified the milestones: Kruize 0.2 Release, Kruize 0.3 Release Nov 26, 2024

rbadagandi1 modified the milestones: Kruize 0.3 Release, Kruize 0.4 Release Dec 9, 2024
