How can I get the PyTorch Lightning epoch progress bar to display when training on Google Cloud TPUs?


Question

When I run my code on a GPU or CPU on my local machine, or even on a Google Colab TPU, I get a progress bar showing the epoch/steps. However, when I make the minimal adjustments needed to run the code on Google Cloud TPUs, the bar no longer appears. I get the following message:

warning_cache.warn(
WARNING:root:Unsupported nprocs (8), ignoring...

Based on TPU usage, the code is working and training is happening. The TPU VM is using Python 3.8.10, torch==2.0.0, torch-xla==2.0, torchmetrics==0.11.4, torchvision==0.15.1, pytorch-lightning==2.0.2, and transformers==4.29.2.
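
As a quick way to confirm that torch-xla can actually see the TPU devices, something like the following can be run on the TPU VM (a minimal sketch, not from the original post, assuming torch-xla 2.0):

    # Sanity check: confirm the XLA/TPU devices are visible from this process.
    import torch_xla.core.xla_model as xm

    print(xm.xla_device())                 # default XLA device, e.g. xla:0
    print(xm.get_xla_supported_devices())  # all XLA devices visible to this process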

Here's the end of my code for reference:

if __name__ == '__main__':
    data_module = IsaDataModule(train_df, val_df, test_df, tokenizer, batch_size=BATCH_SIZE)
    data_module.setup()
    model = IsaModel()
    
    checkpoint_callback = ModelCheckpoint(
        dirpath='spec1_ckpt',
        filename='best_checkpoint',
        save_top_k=1,
        verbose=True,
        monitor='val_loss',
        mode='min'
    )
    
    # 8 devices per TPU
    trainer = pl.Trainer(
        callbacks=[checkpoint_callback],
        max_epochs=N_EPOCHS,
        accelerator='tpu',
        devices=8
    )

    trainer.fit(model, data_module)

I've tried some of the fixes from this thread: https://github.com/Lightning-AI/lightning/issues/1112, but in that thread the issue is with Colab and not Cloud VMs. I've also tried using the XRT runtime instead of PJRT, but in that case the training doesn't work at all. Any help would be appreciated, thanks.
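
For reference, the PJRT/XRT runtime choice mentioned above is controlled through environment variables set before torch_xla is imported (a minimal sketch, assuming torch-xla 2.0; the XRT address below is illustrative, not from the original post):

    import os

    # PJRT, the default runtime in torch-xla 2.0:
    os.environ["PJRT_DEVICE"] = "TPU"

    # Or the legacy XRT runtime instead (illustrative local address):
    # os.environ.pop("PJRT_DEVICE", None)
    # os.environ["XRT_TPU_CONFIG"] = "localservice;0;localhost:51011"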

Answer 1

Score: 1

It is not recommended to enable the progress bar on TPUs, since it triggers device-to-host communication, which causes a significant slowdown. In any case, it should work. Can you try explicitly passing enable_progress_bar=True to the Trainer and see if that helps?
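
For illustration, a minimal sketch of what that suggestion would look like, combined with a slower refresh to limit device-to-host traffic (enable_progress_bar and TQDMProgressBar are standard Lightning 2.x options; refresh_rate=50 is just an example value):

    from pytorch_lightning.callbacks import TQDMProgressBar

    trainer = pl.Trainer(
        callbacks=[checkpoint_callback, TQDMProgressBar(refresh_rate=50)],  # redraw the bar every 50 steps
        max_epochs=N_EPOCHS,
        accelerator='tpu',
        devices=8,
        enable_progress_bar=True,  # explicitly request the progress bar
    )

A higher refresh_rate makes the bar update less often, which reduces the device-to-host synchronization the answer warns about.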
