Problems with AMD RX 580 + Akitio Node (TB3) + Ubuntu 18.10  


rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 23, 2018 5:31 am  

Hello All,

Here's my system information:
System: Dell XPS 15 9575 2 in 1
Built in GPUs: Intel iGPU, Vega M
OS: Ubuntu 18.10
Kernel: 4.18
eGPU: RX 580

Unfortunately I'm struggling to get my RX 580 working correctly as an eGPU on my Ubuntu 18.10 based system.

First of all I've been able to successfully get the Akitio Node authorized as a Thunderbolt Device.  The new 4.18 kernel makes this trivial as Ubuntu will prompt you to authenticate the new TB device as soon as you plug it in.  In addition, the Akitio Node *and* the RX 580 are visible in lspci:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:13.0 Non-VGA unclassified device: Intel Corporation 100 Series/C230 Series Chipset Family Integrated Sensor Hub (rev 31)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation HM170/QM170 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation QM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Polaris 22 [Radeon RX Vega M GL] (rev c0)
02:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
04:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:00.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:01.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:02.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
05:04.0 PCI bridge: Intel Corporation JHL6540 Thunderbolt 3 Bridge (C step) [Alpine Ridge 4C 2016] (rev 02)
06:00.0 System peripheral: Intel Corporation JHL6540 Thunderbolt 3 NHI (C step) [Alpine Ridge 4C 2016] (rev 02)
07:00.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
08:01.0 PCI bridge: Intel Corporation DSL6340 Thunderbolt 3 Bridge [Alpine Ridge 2C 2015]
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7)
09:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580]

Here's the detailed information for the RX 580:

09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev e7) (prog-if 00 [VGA controller])
Subsystem: XFX Pine Group Inc. Ellesmere [Radeon RX 470/480/570/570X/580/580X]
Flags: fast devsel, IRQ 18
Memory at 2fb0000000 (64-bit, prefetchable) [size=256M]
Memory at 2fc0000000 (64-bit, prefetchable) [size=2M]
I/O ports at 2000 [size=256]
Memory at bc000000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at bc040000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [200] #15
Capabilities: [270] #19
Capabilities: [2b0] Address Translation Service (ATS)
Capabilities: [2c0] Page Request Interface (PRI)
Capabilities: [2d0] Process Address Space ID (PASID)
Capabilities: [320] Latency Tolerance Reporting
Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
Capabilities: [370] L1 PM Substates
Kernel modules: amdgpu

If I look at dmesg I see some distressing entries in the system logs:

[    8.534250] amdgpu 0000:09:00.0: enabling device (0006 -> 0007)
[    8.534756] [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1682:0xC580 0xE7).
[    8.537567] [drm] register mmio base: 0xBC000000
[    8.537568] [drm] register mmio size: 262144
[    8.537598] [drm] add ip block number 0 <vi_common>
[    8.537599] [drm] add ip block number 1 <gmc_v8_0>
[    8.537599] [drm] add ip block number 2 <tonga_ih>
[    8.537599] [drm] add ip block number 3 <powerplay>
[    8.537600] [drm] add ip block number 4 <dm>
[    8.537600] [drm] add ip block number 5 <gfx_v8_0>
[    8.537601] [drm] add ip block number 6 <sdma_v3_0>
[    8.537602] [drm] add ip block number 7 <uvd_v6_0>
[    8.537602] [drm] add ip block number 8 <vce_v3_0>
[    8.537608] kfd kfd: skipped device 1002:67df, PCI rejects atomics
[    8.537630] [drm] UVD is enabled in VM mode
[    8.537630] [drm] UVD ENC is enabled in VM mode
[    8.537636] [drm] VCE enabled in VM mode
[    8.614467] ATOM BIOS: 401815-171128-QS1
[    8.614512] [drm] GPU posting now...
[   13.621276] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[   13.621310] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing E650 (len 187, WS 0, PS 4) @ 0xE6FA
[   13.621341] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C53A (len 193, WS 4, PS 4) @ 0xC569
[   13.621359] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C410 (len 114, WS 0, PS 8) @ 0xC47C
[   13.621361] amdgpu 0000:09:00.0: gpu post error!
[   13.621363] amdgpu 0000:09:00.0: Fatal error during GPU init
[   13.621370] [drm] amdgpu: finishing device.
[   13.621792] amdgpu: probe of 0000:09:00.0 failed with error -22
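
For anyone comparing their own logs against the one above, here is a minimal sketch that pulls the failure lines out of a saved dmesg dump. The function name and the pattern list are hypothetical and based only on the lines quoted in this post; this is not an official amdgpu tool, and other kernels may word these messages differently.

```python
import re

# Failure markers taken from the log excerpt above (an assumption: other
# kernel versions may phrase these messages differently).
ERROR_PATTERN = re.compile(
    r"\*ERROR\*|gpu post error|Fatal error during GPU init|probe of .* failed"
)


def find_gpu_errors(dmesg_text):
    """Return the amdgpu/drm lines in dmesg_text that match a failure marker."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if ("amdgpu" in line or "[drm" in line) and ERROR_PATTERN.search(line)
    ]


# A few lines copied from the log above, used as sample input.
sample = """
[    8.614512] [drm] GPU posting now...
[   13.621276] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[   13.621361] amdgpu 0000:09:00.0: gpu post error!
[   13.621363] amdgpu 0000:09:00.0: Fatal error during GPU init
[   13.621792] amdgpu: probe of 0000:09:00.0 failed with error -22
"""

for line in find_gpu_errors(sample):
    print(line)
```

On a live system the input could come from a file written with `dmesg > dmesg.txt` instead of the inline sample.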

I noticed a couple other people have posted about similar problems:

https://forum.manjaro.org/t/rx-580-in-a-thunderbolt-egpu-dock/58210
https://egpu.io/forums/thunderbolt-linux-setup/egpus-under-linux-an-advanced-guide/#post-33304

I've filed an official bug report for amdgpu here:

https://bugs.freedesktop.org/show_bug.cgi?id=108521

If anybody has any suggestions they would be greatly appreciated!

On a final note, Ubuntu 18.10 ships with Kernel 4.18 but I've also tried 4.19 and I'm experiencing the same problems.

Thanks!
Rob


nu_ninja
(@nu_ninja)
Trusted Member
Joined: 7 months  ago
Posts: 53
October 23, 2018 3:07 pm  

I'd start by configuring X with just the external display active, with something like this in /etc/X11/xorg.conf.d/:

Section "Device"
     Identifier "AMD"
     Driver "amdgpu"
     BusID "PCI:10:0:0" ##ID in decimal, convert from hex if necessary
     Option "AllowEmptyInitialConfiguration"
     Option "AllowExternalGpus"
EndSection

Then, if that works, try setting up a config with the internal screen bound to the iGPU and the external display as primary on the eGPU (I posted a config file in this post). I've not had any problems with the amdgpu driver once it's properly configured in X with the bus ID and external display.
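
The hex-to-decimal conversion mentioned in the BusID comment can be sketched as follows. This is a minimal illustration, assuming the address is in the form lspci prints (bus:device.function in hex, optionally with a leading domain); the function name is hypothetical.

```python
def lspci_to_xorg_busid(address):
    """Convert an lspci address like '09:00.0' to an Xorg BusID string.

    lspci prints bus, device and function in hexadecimal, while the
    BusID option in xorg.conf expects decimal values.
    """
    parts = address.split(":")
    # The last two fields are bus and device.function; a leading PCI
    # domain such as '0000' is ignored in this sketch.
    bus_hex, dev_func = parts[-2], parts[-1]
    dev_hex, func_hex = dev_func.split(".")
    return "PCI:{}:{}:{}".format(
        int(bus_hex, 16), int(dev_hex, 16), int(func_hex, 16)
    )


print(lspci_to_xorg_busid("09:00.0"))
print(lspci_to_xorg_busid("0a:00.0"))
```

For the RX 580 listed at 09:00.0 earlier in the thread this gives PCI:9:0:0, and a card at 0a:00.0 would give the PCI:10:0:0 used in the example config.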

Mid-2012 13" Macbook Pro (MacBookPro9,2) TB1 -> RX 460 (AKiTiO Node) macOS 10.14+Win10+Linux Mint 19
https://egpu.io/forums/builds/mid-2012-13-macbook-pro-macbookpro92-tb1-rx-460-akitio-node-macos-10-13-6win10/#post-43638
https://egpu.io/forums/builds/mid-2012-13-macbook-pro-rx46010gbps-tb1-3-linux-mint-19-build-guide-benchmarks-nu_ninja/#post-47083


rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 23, 2018 5:27 pm  

Hi Ninja, thanks for the response!

I actually was following your guides pretty closely and did try a variety of different xorg config files, all with no luck unfortunately.

I think the issue is at the kernel level, as my dmesg output shows a failure trying to initialize the eGPU.  My guess (more of a hunch really) is that there is a problem because my laptop actually has 2 GPUs to start with, the Intel iGPU (i915) and the Vega M GPU (amdgpu).  The eGPU would be a third GPU that also uses the amdgpu kernel driver.  Perhaps there is a conflict with this? Right now I'm trying to see if there's some way to completely disable the Vega M discrete GPU (via kernel boot parameters) to test out this theory.

I'm wondering if you wouldn't mind sharing the dmesg output from your system after you've plugged in the eGPU via TB3.  I'd love to see if you see similar errors, or if the eGPU is initialized successfully.

Thanks!
Rob


nu_ninja
(@nu_ninja)
Trusted Member
Joined: 7 months  ago
Posts: 53
October 23, 2018 6:00 pm  

Ok, I attached the relevant parts of my dmesg output. Looks like I'm not getting the atom bios entries, maybe because I'm using an older card? That part of the code is obviously the problem, but I'm not sure if you could change that.

This might sound crazy but just for testing you could try and create a device section for the dgpu and deliberately give it the wrong driver like the nvidia nouveau driver to make sure it doesn't use the amdgpu driver.

rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 23, 2018 8:34 pm  

Thanks for the response!

I've been fooling around with a ton of different Xorg config files, and I really think it's less about configuration, and more about the lower level amdgpu kernel driver.

My hunch right now is that the Vega M (and specifically its power management) is somehow interfering with the initialization of the new eGPU.  I've been trying to figure out a way to completely disable the Vega M at boot to see if the eGPU will work, but I've been struggling to accomplish this.

I found some interesting posts on using the pci-stub kernel module to "reserve" a device so that it can't be initialized by the amdgpu module, but unfortunately this doesn't appear to work correctly.

See here:
https://superuser.com/questions/503697/prevent-radeon-driver-from-attaching-to-specific-pci-devices
https://superuser.com/questions/914810/how-to-disable-a-plugged-in-pci-e-graphic-card-on-os-level

I also tried to disable the Vega M in my BIOS but unfortunately that's not an option.

I'm gonna see what the amdgpu devs come up with.

Thanks for your response again!


nu_ninja
(@nu_ninja)
Trusted Member
Joined: 7 months  ago
Posts: 53
October 23, 2018 9:05 pm  

Yeah, looking into it I agree it's a deeper problem than X. Probably good to see what the devs say. One thing you may or may not have already tried: setting amdgpu.dc=0 as a kernel parameter, since it seems to be on by default starting with Vega cards per this article.

rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 23, 2018 9:58 pm  

Thanks for the suggestion.  I tried adding amdgpu.dc=0 to my kernel boot parameters, but unfortunately amdgpu still appeared to bind to the Vega M, and the eGPU still fails to initialize.  I think at this point I'm gonna wait and hear back from the kernel developers and see what they say.

I'm still trying to figure out a way to completely disable the Vega M GPU, but nothing I've tried seems to work.


rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 25, 2018 1:36 am  

For those of you that are interested, an amdgpu developer advised me to comment out the device IDs for Vega M in the kernel source (using 4.19) located here: /drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c

You can see the full source for this file here:
https://elixir.bootlin.com/linux/v4....u/amdgpu_drv.c

These are the lines in question:
/* VEGAM */
{0x1002, 0x694C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VEGAM},
{0x1002, 0x694E, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VEGAM},

This did indeed cause my Vega M to not be initialized, *but* the problem I'm having with the eGPU remains. So it appears my hunch that the Vega M is interfering with the eGPU initialization was incorrect, and I'm back to square one...


nu_ninja
(@nu_ninja)
Trusted Member
Joined: 7 months  ago
Posts: 53
October 25, 2018 3:23 pm  

@rstrube

See this post, particularly 1) at the bottom. Seems this may have been @karatekid430's workaround.

rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 25, 2018 5:23 pm  

Thanks for the heads up about that post!  I actually just responded to that thread and I agree that the information seems very relevant to the problems I'm experiencing!

Rob


rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 26, 2018 5:37 am  

So after many, many hours of debugging and trying different things I finally figured it out.  There is a bug with ACPI enabled which causes the Thunderbolt PCI bridges not to receive their proper resources.  If I disable ACPI via the kernel boot parameter:

acpi=off

then the eGPU gets correctly initialized! One side effect is that the Vega M GPU is completely disabled with acpi=off.

It looks like a Linux ACPI Thunderbolt bug, or at least a bug with the XPS 9575 BIOS when ACPI is enabled.


nu_ninja
(@nu_ninja)
Trusted Member
Joined: 7 months  ago
Posts: 53
October 26, 2018 5:51 am  

Awesome! That sounds like a pretty big bug to squash.

rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 26, 2018 7:01 am  

@nu_ninja
So I've confirmed that I'm using my eGPU, but any games I run stutter like crazy every couple of seconds.  Any suggestions for what I should look at to improve performance?  I used your Xorg config file to eliminate everything but the AMD GPU for Xorg.

Thanks for any suggestions!
Rob


(@timur_kristof)
Active Member
Joined: 8 months  ago
Posts: 17
October 26, 2018 9:28 am  
Posted by: rstrube

@nu_ninja
So I've confirmed that I'm using my eGPU, but any games I run stutter like crazy every couple of seconds.  Any suggestions for what I should look at to improve performance?  I used your Xorg config file to eliminate everything but the AMD GPU for Xorg.

Thanks for any suggestions!
Rob

I've got an XPS 13 9370 here, and using an RX 570 with a Zotac AMP box mini. The hardware seems to be pretty similar to what you have, the main difference of course being that I don't have the Vega GPU. However since you already confirmed that isn't causing the problem, hope you don't mind my 20 cents.

This latest finding of yours sounds to me more like a Thunderbolt-related bug that doesn't have anything to do with AMD. I know this is trivial, but can you check if you have the latest Thunderbolt firmware (also known as NVM) on both your laptop and your eGPU enclosure? I'm not sure if your device is supported by LVFS, so you may have to install Windows to update that firmware. While you are at it, also check if your laptop has the latest BIOS. If all firmware is up to date and the issue is still there, I would suggest writing to the linux-usb mailing list (that seems to be where the Thunderbolt devs are); there are a bunch of helpful Intel guys there who may be able to help you out. This may very well be a bug in the Thunderbolt driver.

With regard to the stuttering and low performance: you didn't say it here, but you did mention in your freedesktop bug report that you are booting with amdgpu.dpm=0 amdgpu.aspm=0 amdgpu.runpm=0 amdgpu.bapm=0, which disables all power management features. With amdgpu.dpm=0 the driver basically doesn't do any power management, effectively leaving your graphics card at the same frequencies it had when it booted. (That is 300 MHz on my RX 570; it should be in the same ballpark for your 580.) Did you try without that? You shouldn't need any of those other parameters either, by the way.
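
A quick way to check for this, sketched in Python: the helper below (a hypothetical name, not part of any amdgpu tooling) scans a kernel command line for the four parameters mentioned above set to 0. The sample command line is made up for illustration; on a running system the real one can be read from /proc/cmdline.

```python
# The four amdgpu power management parameters named in the post above.
PM_PARAMS = ("amdgpu.dpm", "amdgpu.aspm", "amdgpu.runpm", "amdgpu.bapm")


def disabled_pm_params(cmdline):
    """Return the amdgpu power management parameters that are set to 0."""
    disabled = []
    for token in cmdline.split():
        name, _, value = token.partition("=")
        if name in PM_PARAMS and value == "0":
            disabled.append(name)
    return disabled


# Hypothetical command line for illustration; on a live system use
# open("/proc/cmdline").read() instead.
sample = (
    "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet "
    "amdgpu.dpm=0 amdgpu.aspm=0 amdgpu.runpm=0 amdgpu.bapm=0"
)

print(disabled_pm_params(sample))
```

An empty result means none of these four switches is turned off on the command line, which is the state to test the stuttering against.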

Also, acpi=off is surely not a proper long term solution, because it has too many side effects.

While we are at it, does the eGPU setup work correctly on Windows?


(@timur_kristof)
Active Member
Joined: 8 months  ago
Posts: 17
October 26, 2018 9:56 am  
Posted by: nu_ninja

I'd start by configuring x with just the external display active, something like just this in /etc/X11/xorg.conf.d/

     Option "AllowEmptyInitialConfiguration"

As far as I understand "AllowExternalGpus" is specific to the nvidia proprietary driver, and will not have any effect on an AMD card.


rstrube
(@rstrube)
Active Member
Joined: 3 weeks  ago
Posts: 11
October 26, 2018 5:00 pm  

Hi Guys,

So unfortunately this *does* appear to be a BIOS bug with the Dell XPS 9575 that prevents the Thunderbolt 3 PCI bridge from receiving the proper PCI resources.  Adding acpi=off to the kernel boot parameters just creates a scenario where the Vega M resources are somehow made available to the eGPU, allowing it to become initialized.  Even so one of the Thunderbolt 3 PCI bridges still doesn't have the necessary PCI resources (specifically device 0000:05:02.0), which is probably causing the extreme performance problems that I'm having.

I've opened up an official bug with the ACPI BIOS kernel developers here: https://bugzilla.kernel.org/show_bug.cgi?id=201527 but this really needs to be solved at the BIOS level.  Perhaps they have a direct line of communication to the Dell engineers?

Thanks for all the suggestions. For now it appears that eGPUs on the Dell XPS 9575 are a no-go on Linux, at least until the BIOS issues are fixed!


(@timur_kristof)
Active Member
Joined: 8 months  ago
Posts: 17
October 27, 2018 5:58 am  
Posted by: rstrube

Thanks for all the suggestions, for now it appears that eGPUs on the Dell XPS 9575 are a no go on linux, at least until the BIOS issues are fixed!

Does that mean that you tested and the eGPU doesn't work on Windows either?

Even so one of the Thunderbolt 3 PCI bridges still doesn't have the necessary PCI resources (specifically device 0000:05:02.0), which is probably causing the extreme performance problems that I'm having.

I'm pretty sure that at least some of your performance problems came from those amdgpu kernel parameters where you disabled the power management entirely.

