How can I configure my tools to ignore or prevent updates to the execution_count field in a Jupyter Notebook?
Question
I'm using the Jupyter extension (v2022.9.1303220346) in Visual Studio Code (v1.73.1).
To reproduce this issue, make any modification to the notebook and check it into git. You'll observe that you get an extra difference for execution_count. For example (as displayed by Git Gui):
- "execution_count": 7,
+ "execution_count": 9,
The execution count doesn't appear to be useful and is noise in the git history. Can Jupyter or VS Code be configured to stop updating this value or (better) ignore it altogether?
Answer 1
Score: 2
> Can Jupyter or VS Code be configured to stop updating this value or (better) ignore it altogether?

I'm not sure about VS Code. After reading some discussions in GitHub feature-request issues for Jupyter notebooks, I think the answer for VS Code config options might be no: the fact that these are still feature requests suggests to me that the answer is currently no in general. The discussions do show, however, that there are plenty of approaches to tackling the problem:
- In jupyter/notebook: Suggestion: Separate file for notebook executed cell outputs. #5677

  > I think it would be nice to have a separate file (something like .ipynb.output) that links output to their cells in the .ipynb json file. This would make it significantly easier to exclude notebook outputs in source control systems like git. - jbursey

  > It's not a bad idea. But if keeping cell output out of source control is your primary concern, the easiest solution is to just clear the outputs before committing. There are a few ways to do that:
  >
  > - Use a commit hook as outlined in Jupyter docs.
  > - Use Jupyter's shortcut to "clear all cell output"
  > - Use nbconvert to clear the notebook outputs before committing.
  > - You could also just write your own shell script to clear outputs. I wrote one using jq to do that and it is fairly easy.
  >
  > Some folks also choose to just convert the notebook to python using nbconvert and then just commit that. If you search for "How to version control jupyter notebooks" you will see a bunch of posts on the topic.
  >
  > - gitjeff05

  > Alternatively, Jupytext could be helpful for your case. It allows you to save notebooks as code. Then you only need to commit the code to git, whilst you can ignore the notebooks for version control.
  >
  > Their paired notebooks avoid the need for automatically saving and converting the notebooks.
  >
  > - IvoMerchiers

- In jupyterlab/jupyterlab: Using a notebook & git creates too many diff #9444

  > It would be much simpler if we had an option to save only the input cells, not the output ones. And to reset the cell index (execution_count) to 0 without restarting the kernel. - sylvain-bougnoux

  > I think that you can configure the underlying nbdiff to ignore outputs, see: https://nbdime.readthedocs.io/en/latest/config.html#configuring-ignores - krassowski

- In jupyterlab/jupyterlab-git: Cleaning Notebook cell outputs #392

  > Notebooks cell outputs can be a hindrance in Version Control while reviewing the diff of a commit to see what changed (either in a PR or historically)
  >
  > Some ideas on how we could enable users to deal with outputs in cell in jupyterlab-git
  >
  > 1. Enable a Command Palette option to easily install a Git filter with nbstripout
  > 2. Prompt the user to remove outputs from cells if we detect that there are cell outputs during a git push
  > 3. Use the JupyterLab settings registry to let the user specify that all Notebook outputs must be cleaned on a git push
  >
  > - jaipreet-s

  > With #700, it is now possible to add nbstripout (for example) when initializing a git repository. - fcollonval
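As a rough illustration of the "write your own script" idea from the quoted discussion, here is a minimal Python sketch that resets execution_count in a notebook that has already been loaded as JSON. It assumes the nbformat 4 layout (a top-level "cells" list, with code cells carrying "execution_count" and "outputs"). The function name strip_execution_counts and its clear_outputs parameter are made up for this example and do not come from any of the tools mentioned above; an established tool such as nbstripout likely covers more cases.

```python
import json


def strip_execution_counts(notebook: dict, clear_outputs: bool = False) -> dict:
    """Return a copy of a notebook dict with execution counts reset to None.

    Hypothetical helper for illustration only; assumes nbformat 4 structure.
    """
    # Round-trip through JSON as a simple deep copy, so the input is untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no execution count
        cell["execution_count"] = None
        if clear_outputs:
            cell["outputs"] = []
        else:
            # execute_result outputs carry their own copy of the counter.
            for output in cell.get("outputs", []):
                if "execution_count" in output:
                    output["execution_count"] = None
    return cleaned


# Small in-memory example notebook (made-up content).
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
print(json.dumps(strip_execution_counts(example), indent=1))
```

Running something like this before each commit (for instance from a pre-commit hook or a git clean filter) would keep the counter at null in the committed file, so it no longer shows up as a diff.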
For your learning purposes / reference, I found this info by googling "github issues jupyter notebook put execution_count in separate file" and looking through the top search results and linked GitHub issues in their discussion threads.
Answer 2
Score: 1
The .ipynb format contains your input code cells, output data and a variety of metadata needed to reproduce the exact form you see when running the notebook interactively.
Unfortunately, "execution_count" is only one of these fields. Many more (cell collapsed state, extension metadata and others) are stored even though they do not represent any difference in the code of the notebook. It is therefore not really possible to preserve all of that information and still generate meaningful differences in git. While there are discussions about which data to keep or throw out for version control purposes, the underlying JSON format is not ideal for this purpose anyway. For example, each line in each cell gets encoded like this:
"source": [
 "for fizzbuzz in range(101):\n",
 "    \n",
 "    if fizzbuzz % 3 == 0 and fizzbuzz % 5 == 0:\n",
 "        print(\"fizzbuzz\")\n",
 "        continue\n",
 "    \n",
 "    elif fizzbuzz % 3 == 0:\n",
 "        print(\"fizz\")\n",
 "        continue\n",
 "    \n",
 "    elif fizzbuzz % 5 == 0:\n",
 "        print(\"buzz\")\n",
 "        continue\n",
 "    \n",
 "    print(fizzbuzz)"
]
which is rather hard to read compared to the underlying code.
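To make the relationship between the encoded form and the underlying code concrete, here is a small Python sketch that decodes one cell and joins its "source" entries back into plain code. The cell content is a made-up example, and cell_source is a hypothetical helper name. The sketch relies on the notebook format allowing "source" to be either a single string or a list of strings.

```python
import json

# A made-up code cell as it would appear inside an .ipynb file.
# The raw string keeps the JSON escape sequences (\n) intact.
cell_json = r'''
{
 "cell_type": "code",
 "execution_count": 9,
 "metadata": {},
 "outputs": [],
 "source": [
  "for n in range(3):\n",
  "    print(n)"
 ]
}
'''


def cell_source(cell: dict) -> str:
    """Return a cell's source as one string (hypothetical helper)."""
    source = cell["source"]
    # "source" may be a list of line strings or a single string.
    return "".join(source) if isinstance(source, list) else source


cell = json.loads(cell_json)
code = cell_source(cell)
print(code)
```

The printed result is the two-line loop as you would type it in an editor, which is the form a line-based diff tool handles well.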
One way out of this is to use the Jupytext extension. It pairs your .ipynb file with a regular .py file while keeping some of the metadata intact. The paired .py file can be viewed and edited with any editor, works well with git, and does not depend on the complete Jupyter infrastructure.
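To give a feel for what such a paired text file looks like, the sketch below writes a notebook's cells out as a script with "# %%" cell markers, the convention used by Jupytext's percent format. This is a deliberately simplified imitation written for this answer, not Jupytext's implementation: the function name to_percent_script is hypothetical, raw cells and all metadata are ignored, and the real tool should be used for actual pairing.

```python
def to_percent_script(notebook: dict) -> str:
    """Render code and markdown cells as a '# %%' style script.

    Simplified illustration only; real Jupytext output also carries metadata.
    """
    chunks = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        text = "".join(source) if isinstance(source, list) else source
        if cell.get("cell_type") == "code":
            chunks.append("# %%\n" + text)
        elif cell.get("cell_type") == "markdown":
            # Markdown is stored as comments so the file stays valid Python.
            commented = "\n".join(
                ("# " + line) if line else "#" for line in text.split("\n")
            )
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunks) + "\n"


# Made-up two-cell notebook for demonstration.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Title"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [],
            "source": ["x = 1"],
        },
    ]
}
script = to_percent_script(example)
print(script)
```

Because neither the outputs nor execution_count appear in the text file, re-running cells does not change it, which is why committing the paired .py file produces quieter diffs.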