22 core dumped
在 22 机器上开发遇到 core dumped 问题。更新见 22 core dumped · Issue #1282 。
环境
安装 Anaconda (Anaconda3-2021.11-Linux-x86_64.sh)后,在 base 环境下(Python 3.9.7,torch 1.11.0+cu102)通过如下指令完成项目编译
git clone git@github.com:Oneflow-Inc/oneflow.git
cd oneflow
mkdir build
cd build
cmake -DCUDNN_ROOT_DIR=/usr/local/cudnn -DCMAKE_BUILD_TYPE=Release -DTHIRD_PARTY_MIRROR=aliyun -DUSE_CLANG_FORMAT=ON -C ../cmake/caches/cn/fast/cuda-75.cmake .. && ninja
错误
在 22 上直接运行 oneflow/python/oneflow/test/expensive
下的 test_compatibility.py
,报如下错
Aborted (core dumped)
通过调低 batch_size
与 image_size
之后仍然报相同的错误。
削减 test_compatibility.py
中的 TestApiCompatibility
,只保留如下代码,仍然报相同的错误,0 并没有 print 出,说明 test_alexnet_compatibility()
函数前也即测试类初始化即出现了错误。
@flow.unittest.skip_unless_1n1d()
@unittest.skipIf(os.getenv("ONEFLOW_TEST_CPU_ONLY"), "only test gpu cases")
class TestApiCompatibility(flow.unittest.TestCase):
def test_alexnet_compatibility(test_case):
print(0)
do_test_train_loss_oneflow_pytorch(
test_case, "pytorch_alexnet.py", "alexnet", "cuda", 1, 8
)
通过 gdb
指令运行文件
gdb --args python ./oneflow/python/oneflow/test/expensive/test_compatibility.py
run
得到如下信息
--Type <RET> for more, q to quit, c to continue without paging--
Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) up
#1 0x00007ffff7c5d859 in __GI_abort () at abort.c:79
79 abort.c: No such file or directory.
(gdb) up
#2 0x00007fffde17f6c1 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) up
#3 0x00007fffde18cbef in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) up
#4 0x00007fffde18d5aa in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
(gdb) up
#5 0x00007ffec14a3424 in throw_ft_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int) ()
from /home/zhongshanshan/anaconda3/lib/python3.9/site-packages/matplotlib/ft2font.cpython-39-x86_64-linux-gnu.so
(gdb) up
#6 0x00007ffec14a5c1a in PyFT2Font_init(PyFT2Font*, _object*, _object*) ()
from /home/zhongshanshan/anaconda3/lib/python3.9/site-packages/matplotlib/ft2font.cpython-39-x86_64-linux-gnu.so
(gdb) up
#7 0x00005555556989bb in type_call (kwds=0x0, args=0x7ffec1010dc0, type=<optimized out>)
at /tmp/build/80754af9/python-split_1631797238431/work/Objects/typeobject.c:1026
1026 /tmp/build/80754af9/python-split_1631797238431/work/Objects/typeobject.c: No such file or directory.
(gdb)
考虑到 OneFlow 与 PyTorch 之间的兼容性问题:
concat
的测试代码./oneflow/python/oneflow/test/modules/test_concat.py
代码正常运行。- 运行
pytorch_alexnet.py
中调用的conv2d
的测试代码./oneflow/python/oneflow/test/modules/test_conv2d.py
,报错。
自定义复现脚本,查看是否是 OneFlow 与 PyTorch 之间的兼容性问题,自定义脚本可以正常运行。自定义脚本如下
import oneflow as torch
import oneflow.nn as nn
# With square kernels and equal stride
m = nn.Conv2d(16, 33, 3, stride=2)
# non-square kernels and unequal stride and with padding
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
# non-square kernels and unequal stride and with padding and dilation
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1))
m.to("cuda")
input = torch.randn(20, 16, 50, 100).to("cuda")
output = m(input)
import torch
import torch.nn as nn
# With square kernels and equal stride
m = nn.Conv2d(16, 33, 3, stride=2)
# non-square kernels and unequal stride and with padding
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
# non-square kernels and unequal stride and with padding and dilation
m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1))
m.to("cuda")
input = torch.randn(20, 16, 50, 100).to("cuda")
output = m(input)
当前解决方案
换台机子。