How can I get the PyTorch Lightning epoch progress bar to display when training on Google Cloud TPUs?


Question

When I run my code on a GPU or CPU on my local machine, or even on a Google Colab TPU, I get a progress bar showing the epoch/steps. However, when I make the minimal adjustments needed to run the code on Google Cloud TPUs, the bar no longer appears. I get the following message:

warning_cache.warn(
WARNING:root:Unsupported nprocs (8), ignoring...

Based on TPU usage, the code is working and training is happening. The TPU VM is using Python 3.8.10, torch==2.0.0, torch-xla==2.0, torchmetrics==0.11.4, torchvision==0.15.1, pytorch-lightning==2.0.2, and transformers==4.29.2.
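
As a quick way to confirm that torch-xla can actually see the TPU devices, something like the following can be run on the TPU VM (a minimal sketch, not from the original post, assuming torch-xla 2.0):

    # Sanity check: confirm the XLA/TPU devices are visible from this process.
    import torch_xla.core.xla_model as xm

    print(xm.xla_device())                 # default XLA device, e.g. xla:0
    print(xm.get_xla_supported_devices())  # all XLA devices visible to this process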

Here's the end of my code for reference:

if __name__ == '__main__':
    data_module = IsaDataModule(train_df, val_df, test_df, tokenizer, batch_size=BATCH_SIZE)
    data_module.setup()
    model = IsaModel()
    
    checkpoint_callback = ModelCheckpoint(
        dirpath='spec1_ckpt',
        filename='best_checkpoint',
        save_top_k=1,
        verbose=True,
        monitor='val_loss',
        mode='min'
    )
    
    # 8 devices per TPU
    trainer = pl.Trainer(
        callbacks=[checkpoint_callback],
        max_epochs=N_EPOCHS,
        accelerator='tpu',
        devices=8
    )

    trainer.fit(model, data_module)

I've tried some of the fixes from this thread: https://github.com/Lightning-AI/lightning/issues/1112, but in that thread the issue is with Colab and not Cloud VMs. I've also tried using the XRT runtime instead of PJRT, but in that case the training doesn't work at all. Any help would be appreciated, thanks.
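
For reference, the PJRT/XRT runtime choice mentioned above is controlled through environment variables set before torch_xla is imported (a minimal sketch, assuming torch-xla 2.0; the XRT address below is illustrative, not from the original post):

    import os

    # PJRT, the default runtime in torch-xla 2.0:
    os.environ["PJRT_DEVICE"] = "TPU"

    # Or the legacy XRT runtime instead (illustrative local address):
    # os.environ.pop("PJRT_DEVICE", None)
    # os.environ["XRT_TPU_CONFIG"] = "localservice;0;localhost:51011"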

Answer 1

Score: 1

It is not recommended to enable the progress bar on TPUs, since it triggers device-to-host communication, which causes a significant slowdown. In any case, it should work. Can you try explicitly passing enable_progress_bar=True to the Trainer and see if that helps?
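
For illustration, a minimal sketch of what that suggestion would look like, combined with a slower refresh to limit device-to-host traffic (enable_progress_bar and TQDMProgressBar are standard Lightning 2.x options; refresh_rate=50 is just an example value):

    from pytorch_lightning.callbacks import TQDMProgressBar

    trainer = pl.Trainer(
        callbacks=[checkpoint_callback, TQDMProgressBar(refresh_rate=50)],  # redraw the bar every 50 steps
        max_epochs=N_EPOCHS,
        accelerator='tpu',
        devices=8,
        enable_progress_bar=True,  # explicitly request the progress bar
    )

A higher refresh_rate makes the bar update less often, which reduces the device-to-host synchronization the answer warns about.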
