跳转至

22 core dumped

在 22 机器上开发遇到 core dumped 问题。更新见 22 core dumped · Issue #1282

环境

安装 Anaconda (Anaconda3-2021.11-Linux-x86_64.sh)后,在 base 环境下(Python 3.9.7,torch 1.11.0+cu102)通过如下指令完成项目编译

git clone git@github.com:Oneflow-Inc/oneflow.git
cd oneflow
mkdir build
cd build
cmake -DCUDNN_ROOT_DIR=/usr/local/cudnn -DCMAKE_BUILD_TYPE=Release -DTHIRD_PARTY_MIRROR=aliyun -DUSE_CLANG_FORMAT=ON  -C ../cmake/caches/cn/fast/cuda-75.cmake .. && ninja

错误

在 22 上直接运行 oneflow/python/oneflow/test/expensive 下的 test_compatibility.py,报如下错

Aborted (core dumped)

通过调低 batch_sizeimage_size 之后仍然报相同的错误。

削减 test_compatibility.py 中的 TestApiCompatibility,只保留如下代码,仍然报相同的错误,0 并没有 print 出,说明 test_alexnet_compatibility() 函数前也即测试类初始化即出现了错误。

@flow.unittest.skip_unless_1n1d()
@unittest.skipIf(os.getenv("ONEFLOW_TEST_CPU_ONLY"), "only test gpu cases")
class TestApiCompatibility(flow.unittest.TestCase):
    def test_alexnet_compatibility(test_case):
        print(0)
        do_test_train_loss_oneflow_pytorch(
            test_case, "pytorch_alexnet.py", "alexnet", "cuda", 1, 8
        )

通过 gdb 指令运行文件

gdb --args python ./oneflow/python/oneflow/test/expensive/test_compatibility.py

run

得到如下信息

--Type <RET> for more, q to quit, c to continue without paging--

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) up
#1  0x00007ffff7c5d859 in __GI_abort () at abort.c:79
79      abort.c: No such file or directory.
(gdb) up
#2  0x00007fffde17f6c1 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) up
#3  0x00007fffde18cbef in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) up
#4  0x00007fffde18d5aa in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) up
#5  0x00007ffec14a3424 in throw_ft_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) ()
   from /home/zhongshanshan/anaconda3/lib/python3.9/site-packages/matplotlib/ft2font.cpython-39-x86_64-linux-gnu.so
(gdb) up
#6  0x00007ffec14a5c1a in PyFT2Font_init(PyFT2Font*, _object*, _object*) ()
   from /home/zhongshanshan/anaconda3/lib/python3.9/site-packages/matplotlib/ft2font.cpython-39-x86_64-linux-gnu.so
(gdb) up
#7  0x00005555556989bb in type_call (kwds=0x0, args=0x7ffec1010dc0, type=<optimized out>)
    at /tmp/build/80754af9/python-split_1631797238431/work/Objects/typeobject.c:1026
1026    /tmp/build/80754af9/python-split_1631797238431/work/Objects/typeobject.c: No such file or directory.
(gdb) 

考虑到 OneFlow 与 PyTorch 之间的兼容性问题:

  • concat 的测试代码 ./oneflow/python/oneflow/test/modules/test_concat.py 代码正常运行。
  • 运行 pytorch_alexnet.py 中调用的 conv2d 的测试代码 ./oneflow/python/oneflow/test/modules/test_conv2d.py,报错。

自定义复现脚本,查看是否是 OneFlow 与 PyTorch 之间的兼容性问题,自定义脚本可以正常运行。自定义脚本如下

import oneflow as torch
import oneflow.nn as nn

# With square kernels and equal stride
m = nn.Conv2d(16, 33, 3, stride=2)
# non-square kernels and unequal stride and with padding
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
# non-square kernels and unequal stride and with padding and dilation
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1))
m.to("cuda")
input = torch.randn(20, 16, 50, 100).to("cuda")
output = m(input)

import torch
import torch.nn as nn

# With square kernels and equal stride
m = nn.Conv2d(16, 33, 3, stride=2)
# non-square kernels and unequal stride and with padding
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
# non-square kernels and unequal stride and with padding and dilation
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1))
m.to("cuda")
input = torch.randn(20, 16, 50, 100).to("cuda")
output = m(input)

当前解决方案

换台机子。

Back to top