Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update /generate to not split classes & functions across cells #1158

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

srdas
Copy link
Collaborator

@srdas srdas commented Dec 18, 2024

Fixes #1111

/generate will sometimes split a class or a function across multiple code cells. After tracing the code in generate.py the error arises from unanticipated LLM behavior in the curation of the resultant Jupyter notebook in the creation of the outline object. This is best handled by post-processing the notebook to merge "hanging" code cells with the preceding code cell.

The error looks like this:
calculator_hanging_code_cells

After the fix, the error is no longer present. See this example:
calculator_no_hanging_code_cells

Here is another example of the rectified notebook generation:
no_hanging_code_in_class

if necessary to check the log, add print statements in lines 236, 240, 245 to the code as shown below.
image
The before and after versions of the logged notebook can be compared for lines with ***** CELL 1 ***** as shown here:
image

@srdas srdas added the bug Something isn't working label Dec 18, 2024
@srdas srdas marked this pull request as ready for review December 18, 2024 22:30
@dlqqq dlqqq changed the title Update /generate to ensure no splitting of class or function code across cells in generated notebooks Update /generate to not split classes & functions across cells Dec 18, 2024
@srdas srdas requested a review from dlqqq December 18, 2024 22:48
Copy link
Member

@dlqqq dlqqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srdas Awesome work! Love the creativity applied here. Few minor comments below.

@@ -212,6 +212,15 @@ def create_notebook(outline):
nb["cells"].append(nbf.new_markdown_cell("## " + section["title"]))
for code_block in section["code"].split("\n\n"):
nb["cells"].append(nbf.new_code_cell(code_block))

# Post process notebook for hanging cells: merge hanging cell with the previous cell
nb_cells = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth renaming this variable to avoid confusion with nb["cells"].

Suggested change
nb_cells = []
merged_cells = []

# Post process notebook for hanging cells: merge hanging cell with the previous cell
nb_cells = []
for cell in nb["cells"]:
if (cell["cell_type"] == "code") and (cell["source"][0] == " "):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The True branch requires nb_cells[-1] to exist, otherwise it will throw an exception. The condition cell["source"][0] == " " also requires cell["source"] to have at least 1 char, which is not guaranteed.

The condition can be updated to avoid these edge cases:

Suggested change
if (cell["cell_type"] == "code") and (cell["source"][0] == " "):
follows_code_cell = nb_cells and nb_cells[-1]["cell_type"] == "code"
is_incomplete = cell["cell_type"] == "code" and cell["source"].startswith(" ")
if follows_code_cell and is_incomplete:

Note that nb_cells should be renamed to merged_cells if the previous suggestion is accepted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made all suggested changes! Thanks for the help.

@dlqqq
Copy link
Member

dlqqq commented Dec 18, 2024

@srdas One other question: with this change, are you able to run every notebook generated via /generate? It would be helpful to identify & document what issues still exist with /generate so that they can be fixed in the future.

@srdas
Copy link
Collaborator Author

srdas commented Dec 19, 2024

During testing across several different /generate use cases, one more issue seemed to occur (though somewhat rarely) -- there are a few cells that contains markdown text but the cell_type is "code" instead. Therefore, added another sweep through the cells to detect these cases and flip the cell_type to markdown.

Additional code to handle this is a new function:
image
And a second post-processing pass through the notebook to fix cells that should be markdown:
image

By adding print statements, this was checked as shown below, the cell_type before and after the rectification is shown:
image
See also the corresponding fix in the notebook:
image
The last example:
image

Notebooks usually run from end to end in most cases. The two cases when they do not are unrelated to the /generate code. These are:

  1. A missing package
  2. The function in the package being used is deprecated and the generated code is still using the old version. This is simply the fact that the LLM is not up to date.

@srdas srdas requested a review from dlqqq December 19, 2024 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

/generate sometimes splits classes and functions across multiple cells
2 participants