Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 4558 from Core CLI: Adding flock #358

Open
wants to merge 12 commits into
base: dev
Choose a base branch
from
Open

Issue 4558 from Core CLI: Adding flock #358

wants to merge 12 commits into from

Conversation

tonystarkjr3
Copy link
Collaborator

@tonystarkjr3 tonystarkjr3 commented Jan 13, 2023

After significant research on the most efficient, least intrusive, and maximally functionality-preserving way to increase the concurrency-safety of the CLI, we decided to use flock in addition sync.RWMutex. I thought about and trialed this on my system repeatedly before deciding that a simple addition of a file lock to the persistor's write method would suffice to address the suspected problem that overly-permissive access allowance to the file for multiple threads, in the read-only logic of the code, caused file corruption.

One piece of code that is commented about inline starting with strange (and also called out on this PR) is worth highlighting here: it seems that on first Save of the config, the code is requiring an error--that is not a permission error--to have occurred to write to the config file. This is very peculiar. My suspicion is that this doesn't matter: Save is called via an initOnce function that ensures only one write attempt to write to the config file is carried out this way. Subsequent attempts to write to the config file will occur via Load, which simply had a call to the persistor file's write function.

One helpful piece of information to keep in mind is that DiskPersistor is simply an object that houses the filepath (and that I've added to, to also contain the file lock) of the config. Additionally, the actual writing to the config file happens via the write function (which is now a callee of the lockedWrite function I've created), but the data structure (to be written to the config file) is modified, by dot-access, through functions peculiarly called cb().

For testing, I think it is sufficient to chain together multiple ibmcloud ... commands that involve read/writes to the commit with &s, e.g. ibmcloud config --http-timeout 60 & ibmcloud config --http-timeout 30 & ibmcloud config --http-timeout 45 & ibmcloud config --list & ibmcloud config --http-timeout 53. I also tested out the code with a Thread Group containing multiple OS Process Samplers in JMeter, and there was no data corruption.

PLEASE NOTE: to test this out with the core CLI's binary, it is important to run the following commands in the local copy of the core CLI repo (bluemix-cli) from where you will build a binary for testing, in the following order, to avoid issues with module pathing:

  1. go get github.ibm.com/arf/cli-dev-plugin/plugin@<commit-hash>. The <commit-hash> at the end of this command string corresponds to a version of the dev plugin code that uses the same packages as the SDK code in step 2. At the time of writing, that is 2a362c8b25d44002dcae8d8b37fcfae54dd94d17
  2. go get github.com/IBM-Cloud/ibm-cloud-cli-sdk@<commit-hash>. The <commit-hash> at the end of this command string corresponds to a version of the code in this repo that uses the same packages as the dev-plugin's code in step 1. At the time of writing, that is 1d5f40b5afd04087c3f803e24561a28cc16ebefe.
  3. go mod tidy

EDIT: Regarding testing, I think I will make a ruby script that hands off to and/or interweaves with a bash script for multithreaded modifications of the config. I think this can happen separately, as the PR is open for review, and be merged later.

@tonystarkjr3 tonystarkjr3 changed the base branch from master to dev January 13, 2023 21:28
@tonystarkjr3 tonystarkjr3 requested a review from Aerex January 13, 2023 21:33
@@ -44,14 +47,26 @@ func (dp DiskPersistor) Load(data DataInterface) error {
return err
}

if err != nil {
err = dp.write(data)
if err != nil { // strange: requiring an error (to allow write attempt to continue), as long as it is not a permission error
Copy link
Collaborator Author

@tonystarkjr3 tonystarkjr3 Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange code.
Requiring a non-permission error to actually carry out the writing to the config? Weird!
Please see my OP for my thoughts on why this is benign despite being pretty obviously counterintuitive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is strange but the docs strongly recommend using TryLock instead due to the potential blocking issue. Can we try that and see if it resolves this issue

Lock is a blocking call to try and take an exclusive file lock. It will wait until it is able to obtain the exclusive file lock. It's recommended that TryLock() be used over this function. This function may block the ability to query the current Locked() or RLocked() status due to a RW-mutex lock. (doc)

Copy link
Collaborator

@Aerex Aerex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested yet but I am thinking using the TryLock might fix the issue that you are seeing

@@ -44,14 +47,26 @@ func (dp DiskPersistor) Load(data DataInterface) error {
return err
}

if err != nil {
err = dp.write(data)
if err != nil { // strange: requiring an error (to allow write attempt to continue), as long as it is not a permission error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is strange but the docs strongly recommend using TryLock instead due to the potential blocking issue. Can we try that and see if it resolves this issue

Lock is a blocking call to try and take an exclusive file lock. It will wait until it is able to obtain the exclusive file lock. It's recommended that TryLock() be used over this function. This function may block the ability to query the current Locked() or RLocked() status due to a RW-mutex lock. (doc)

}
return err
}

func (dp DiskPersistor) lockedWrite(data DataInterface) error {
lockErr := dp.fileLock.Lock() // provide a file lock, in addition to the RW mutex (in calling functions), just while dp.write is called
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: For readability can we move the comments above line 57

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Comment on lines 157 to 160
c.onError(err) // NOTE: the error could be triggered by a file-locking issue,
// a file-unlocking issue, OR a file-writing issue; currently, the error chain
// ends in `panic("configuration error: " + err.Error()"`, which is somewhat
// generic but sufficient for all three of these error types
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: For readability can we move the comment above line 157.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

@tonystarkjr3
Copy link
Collaborator Author

tonystarkjr3 commented Jan 17, 2023

I haven't tested yet but I am thinking using the TryLock might fix the issue that you are seeing

If we use this instead, we might need to code in a loop with the following psuedocode, since it is nonblocking and therefore will not wait until the exclusive lock is acquired:

while success, err := fileLock.TryLock(file) ; !success {
   if err != nil {
      return err
   }
}

Thoughts @Aerex ?

@Aerex
Copy link
Collaborator

Aerex commented Jan 17, 2023

I haven't tested yet but I am thinking using the TryLock might fix the issue that you are seeing

If we use this instead, we might need to code in a loop with the following psuedocode, since it is nonblocking:

while success, err := fileLock.TryLock(file) ; !success {
   if err != nil {
      return err
   }
}

Hmm actually lets use TryLockContext which provides a means to retry the lock after a given interval

@tonystarkjr3
Copy link
Collaborator Author

Ok. I like this idea. Basically the same as the loop, but with a delay in each iteration.

@tonystarkjr3
Copy link
Collaborator Author

Update: I think we should delay the merge based on the recent discovery that flock has a member of type sync.RWMutex that is used (indirectly) by TryLockContext

@tonystarkjr3 tonystarkjr3 added the do-not-merge Label to remind reviewers NOT merge this PR label Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge Label to remind reviewers NOT merge this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants