Problems with AMD RX 580 + Akitio Node (TB3) + Ubuntu 18.10
 

rstrube
(@rstrube)
Active Member
Joined: 2 years ago
 

So after many, many hours of debugging and trying different things, I finally figured it out.  There is a bug with ACPI enabled that causes the Thunderbolt PCI bridges not to receive their proper resources.  If I disable ACPI via the kernel boot parameter:

acpi=off

Then the eGPU gets correctly initialized! One side effect is that the Vega M GPU is completely disabled with acpi=off.

It looks like a Linux ACPI Thunderbolt bug, or at least a bug with the XPS 9575 BIOS when ACPI is enabled.

nu_ninja
(@nu_ninja)
Reputable Member
Joined: 3 years ago
 

Awesome! That sounds like a pretty big bug to squash.

Mid-2012 13" Macbook Pro (MacBookPro9,2) TB1 -> RX 460/560 (AKiTiO Node/Thunder2)
+ macOS 10.15+Win10 + Linux Mint 19.1

 
rstrube
(@rstrube)
Active Member
Joined: 2 years ago
 

@nu_ninja
So I've confirmed that I'm using my eGPU, but any games I run stutter like crazy every couple of seconds.  Any suggestions for what I should look at to improve performance?  I used your Xorg.config file to eliminate everything but the AMD GPU for Xorg.

Thanks for any suggestions!
Rob

Timur Kristóf
(@timur_kristof)
Active Member
Joined: 3 years ago
 
Posted by: rstrube

@nu_ninja
So I've confirmed that I'm using my eGPU, but any games I run stutter like crazy every couple of seconds.  Any suggestions for what I should look at to improve performance?  I used your Xorg.config file to eliminate everything but the AMD GPU for Xorg.

Thanks for any suggestions!
Rob

I've got an XPS 13 9370 here, and I'm using an RX 570 with a Zotac AMP box mini. The hardware seems to be pretty similar to what you have, the main difference of course being that I don't have the Vega GPU. However, since you already confirmed that isn't causing the problem, I hope you don't mind my 20 cents.

This latest finding of yours sounds to me more like a Thunderbolt-related bug that doesn't have anything to do with AMD. I know this is trivial, but can you check whether you have the latest Thunderbolt firmware (also known as NVM) on both your laptop and your eGPU enclosure? I'm not sure if your device is supported by LVFS, so you may have to install Windows to update that firmware. While you are at it, also check whether your laptop has the latest BIOS. If all firmware is up to date and the issue is still there, I would suggest writing an email to the linux-usb mailing list (that seems to be where the Thunderbolt devs are); there are a bunch of helpful Intel guys there who may be able to help you out. This may very well be a bug in the Thunderbolt driver.

With regard to the stuttering and low performance: you didn't say it here, but you did mention in your freedesktop bug report that you are booting with amdgpu.dpm=0 amdgpu.aspm=0 amdgpu.runpm=0 amdgpu.bapm=0, which means you disable all power management features. With amdgpu.dpm=0, the driver basically doesn't do any power management, effectively just letting your graphics card sit at the same frequencies that it had when you booted it up. (That is 300 MHz on my RX 570, and should be in the same ballpark for your 580.) Did you try without that? You shouldn't need any of those other parameters either, by the way.
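
If it helps, here is a rough Python sketch for checking which of those amdgpu power management parameters are set to 0 on a given kernel command line. The function names are made up for illustration. On a running system the command line can be read from /proc/cmdline; the sample string below is only an example, not rstrube's actual boot line.

```python
from pathlib import Path

# The four parameters named in the post; a value of "0" switches the feature off.
PM_PARAMS = ("amdgpu.dpm", "amdgpu.aspm", "amdgpu.runpm", "amdgpu.bapm")


def disabled_pm_params(cmdline):
    """Return the amdgpu power management parameters that are set to 0."""
    found = []
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name in PM_PARAMS and value == "0":
            found.append(name)
    return found


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text()


# Illustrative command line (hypothetical, not taken from the bug report).
sample = "quiet splash amdgpu.dpm=0 amdgpu.runpm=0 acpi=off"
for name in disabled_pm_params(sample):
    print(f"{name}=0 is set; try booting without it")
```

To check the live system instead of the sample string, pass read_cmdline() to disabled_pm_params().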

Also, acpi=off is surely not a proper long term solution, because it has too many side effects.

While we are at it, does the eGPU setup work correctly on Windows?

Timur Kristóf
(@timur_kristof)
Active Member
Joined: 3 years ago
 
Posted by: nu_ninja

I'd start by configuring x with just the external display active, something like just this in /etc/X11/xorg.conf.d/

     Option "AllowEmptyInitialConfiguration"

As far as I understand "AllowExternalGpus" is specific to the nvidia proprietary driver, and will not have any effect on an AMD card.

rstrube
(@rstrube)
Active Member
Joined: 2 years ago
 

Hi Guys,

So unfortunately this *does* appear to be a BIOS bug with the Dell XPS 9575 that prevents the Thunderbolt 3 PCI bridge from receiving the proper PCI resources.  Adding acpi=off to the kernel boot parameters just creates a scenario where the Vega M resources are somehow made available to the eGPU, allowing it to become initialized.  Even so, one of the Thunderbolt 3 PCI bridges still doesn't have the necessary PCI resources (specifically device 0000:05:02.0), which is probably causing the extreme performance problems that I'm having.

I've opened up an official bug with the ACPI BIOS kernel developers here: https://bugzilla.kernel.org/show_bug.cgi?id=201527 but this really needs to be solved at the BIOS level.  Perhaps they have a direct line of communication to the Dell engineers?

Thanks for all the suggestions. For now it appears that eGPUs on the Dell XPS 9575 are a no go on Linux, at least until the BIOS issues are fixed!

Timur Kristóf
(@timur_kristof)
Active Member
Joined: 3 years ago
 
Posted by: rstrube

Thanks for all the suggestions, for now it appears that eGPUs on the Dell XPS 9575 are a no go on linux, at least until the BIOS issues are fixed!

Does that mean that you tested and the eGPU doesn't work on Windows either?

Even so one of the Thunderbolt 3 PCI bridges still doesn't have the necessary PCI resources (specifically device 0000:05:02.0), which is probably causing the extreme performance problems that I'm having.

I'm pretty sure that at least some of your performance problems came from those amdgpu kernel parameters where you disabled the power management entirely.

MacFreekDotKext
(@jatechnology)
Eminent Member
Joined: 4 years ago
 

Has anyone gotten Thunderbolt 3 Radeon eGPU working on Linux yet? Very interested

rstrube
(@rstrube)
Active Member
Joined: 2 years ago
 

Hi @timur_kristof,

Apologies for the late reply, I've been posting about this issue on several other forums, on reddit, on the kernel mailing lists, etc. but I neglected to check back on this thread...

To answer your questions, I did update my TB firmware to NVM 36.  The output from:

fwupdmgr get-devices

for that device is:

XPS 9575 Thunderbolt Controller
  DeviceId:             069ac71f347e92d158f2c211cca10d52a19e2d41
  Guid:                 8926f505-8219-5d6c-969a-e927534113fb
  Summary:              Unmatched performance for high-speed I/O
  Plugin:               thunderbolt
  Flags:                internal|updatable|supported|registered
  Vendor:               Dell
  VendorId:             TBT:0x00D4
  Version:              36.00
  Icon:                 computer
  Created:              2018-11-04
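
For anyone who wants to check the NVM version without reading the output by eye, here is a small Python sketch that parses text in the layout shown above (an unindented device name followed by indented "Key: value" lines). The function name is made up, and the layout is an assumption based only on this output; other fwupd versions may print devices differently.

```python
def parse_fwupd_devices(text):
    """Map each device name to a dict of its "Key: value" fields."""
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: start of a new device entry.
            current = devices.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            # Split on the first colon only, so "TBT:0x00D4" stays intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return devices


# Excerpt of the output quoted above.
sample = """XPS 9575 Thunderbolt Controller
  Plugin:               thunderbolt
  Vendor:               Dell
  VendorId:             TBT:0x00D4
  Version:              36.00
"""

info = parse_fwupd_devices(sample)["XPS 9575 Thunderbolt Controller"]
print(info["Plugin"], info["Version"])
```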

It's an excellent suggestion - and I wanted to rule out the TB firmware causing the problem.  Unfortunately, this *did not* solve the problem.

I've been keeping this thread on the Manjaro Linux forums up to date with additional information: https://forum.manjaro.org/t/rx-580-in-a-thunderbolt-egpu-dock/58210

Here's a reddit post related to this issue: https://www.reddit.com/r/Dell/comments/9u61lm/question_for_dell_i_believe_ive_discovered_dell/

I've also opened up an ACPI kernel bug (although to be honest it might be a Dell BIOS bug) here: https://bugzilla.kernel.org/show_bug.cgi?id=201527

The rationale for the kernel bug report is that sometimes kernel developers can work around buggy ACPI BIOS implementations - even though if this really is caused by a BIOS bug, it should probably be solved upstream by Dell.

The current theory among some of the kernel developers is that some of the TB to PCI bridges are not receiving the necessary BAR resources.  This causes the card to fail initialization.  Disabling ACPI is really a hack: it bypasses some of the ACPI information the BIOS provides, which apparently allows the TB to PCI bridges to get more resources.  I'm not sure whether they get all the resources they need, but it's enough for the card to get initialized.  Here are some of the relevant details from my dmesg logs that demonstrate the problem.  Note that I'm currently on kernel 4.19.4, but I saw the same problems with 4.18.x kernels.

PCI resource allocation issues:

Note: devices 0000:04:00.0, 0000:05:00.0, 0000:05:01.0, 0000:05:02.0, and 0000:05:04.0 are all Thunderbolt PCI bridges, but device 0000:05:02.0 seems to be the problematic one.

[  152.673753] pci_bus 0000:05: Allocating resources
[  152.673792] pci 0000:05:01.0: bridge window [io  0x1000-0x0fff] to [bus 07-39] add_size 1000
[  152.673802] pci 0000:05:02.0: bridge window [io  0x1000-0x0fff] to [bus 3a] add_size 1000
[  152.673803] pci 0000:05:02.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 3a] add_size 200000 add_align 100000
[  152.673813] pci 0000:05:04.0: bridge window [io  0x1000-0x0fff] to [bus 3b-6e] add_size 1000
[  152.673823] pci 0000:04:00.0: bridge window [io  0x1000-0x0fff] to [bus 05-6e] add_size 3000
[  152.673825] pci 0000:04:00.0: BAR 13: assigned [io  0x2000-0x4fff]
[  152.673829] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  152.673830] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  152.673831] pci 0000:05:01.0: BAR 13: assigned [io  0x2000-0x2fff]
[  152.673832] pci 0000:05:02.0: BAR 13: assigned [io  0x3000-0x3fff]
[  152.673832] pci 0000:05:04.0: BAR 13: assigned [io  0x4000-0x4fff]
[  152.673834] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  152.673835] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  152.673837] pci 0000:05:00.0: PCI bridge to [bus 06]
[  152.673842] pci 0000:05:00.0:   bridge window [mem 0xea000000-0xea0fffff]
[  152.673852] pci 0000:05:01.0: PCI bridge to [bus 07-39]
[  152.673854] pci 0000:05:01.0:   bridge window [io  0x2000-0x2fff]
[  152.673859] pci 0000:05:01.0:   bridge window [mem 0xbc000000-0xd3efffff]
[  152.673863] pci 0000:05:01.0:   bridge window [mem 0x2fb0000000-0x2fcfffffff 64bit pref]
[  152.673870] pci 0000:05:02.0: PCI bridge to [bus 3a]
[  152.673872] pci 0000:05:02.0:   bridge window [io  0x3000-0x3fff]
[  152.673877] pci 0000:05:02.0:   bridge window [mem 0xd3f00000-0xd3ffffff]
[  152.673887] pci 0000:05:04.0: PCI bridge to [bus 3b-6e]
[  152.673889] pci 0000:05:04.0:   bridge window [io  0x4000-0x4fff]
[  152.673894] pci 0000:05:04.0:   bridge window [mem 0xd4000000-0xe9ffffff]
[  152.673898] pci 0000:05:04.0:   bridge window [mem 0x2fd0000000-0x2ff9ffffff 64bit pref]
[  152.673904] pci 0000:04:00.0: PCI bridge to [bus 05-6e]
[  152.673906] pci 0000:04:00.0:   bridge window [io  0x2000-0x4fff]
[  152.673912] pci 0000:04:00.0:   bridge window [mem 0xbc000000-0xea0fffff]
[  152.673915] pci 0000:04:00.0:   bridge window [mem 0x2fb0000000-0x2ff9ffffff 64bit pref]
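
To pull the failures out of a longer dmesg dump, a short Python sketch like the following can group the "failed to assign" lines by device. The regular expression is written against the line format in the log above and is only an assumption about other kernels' wording; the function name is made up.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
FAILED = re.compile(
    r"(?:pci|pcieport) (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): failed to assign \[(?P<res>[^\]]+)\]"
)


def failed_bars(log):
    """Return {device: set of (BAR index, resource description)}."""
    failures = defaultdict(set)
    for m in FAILED.finditer(log):
        failures[m["dev"]].add((int(m["bar"]), m["res"]))
    return dict(failures)


# Three lines copied from the log above.
sample = """\
[  152.673829] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  152.673830] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  152.673831] pci 0000:05:01.0: BAR 13: assigned [io  0x2000-0x2fff]
"""

for dev, bars in failed_bars(sample).items():
    for bar, res in sorted(bars):
        print(f"{dev}: BAR {bar} not assigned ({res})")
```

In this log only 0000:05:02.0 shows up, and only for BAR 15, the 64-bit prefetchable memory window.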

It also appears that pcieport has PCI resource allocation issues:

[  193.946376] thunderbolt 0000:06:00.0: stopping RX ring 0
[  193.946388] thunderbolt 0000:06:00.0: disabling interrupt at register 0x38200 bit 12 (0xffffffff -> 0xffffefff)
[  193.946404] thunderbolt 0000:06:00.0: stopping TX ring 0
[  193.946413] thunderbolt 0000:06:00.0: disabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xfffffffe)
[  193.946421] thunderbolt 0000:06:00.0: control channel stopped
[  193.946516] thunderbolt 0000:06:00.0: freeing RX ring 0
[  193.946527] thunderbolt 0000:06:00.0: freeing TX ring 0
[  193.946542] thunderbolt 0000:06:00.0: shutdown
[  193.985339] pci_bus 0000:05: Allocating resources
[  193.985415] pcieport 0000:05:02.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 3a] add_size 200000 add_align 100000
[  193.985458] pcieport 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  193.985462] pcieport 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  193.985470] pcieport 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  193.985473] pcieport 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  198.333956] pcieport 0000:05:00.0: Refused to change power state, currently in D3
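
One detail worth decoding in both logs is the bridge window [mem 0x00100000-0x000fffff 64bit pref] on 0000:05:02.0: its end address is below its start address, so the window is empty, and the kernel is trying to add 0x200000 bytes (the add_size value) to it. Here is a rough Python sketch that computes window sizes from such lines; the regular expression and function name are my own and assume the line format shown above.

```python
import re

WINDOW = re.compile(
    r"bridge window \[(?P<kind>io|mem)\s+"
    r"0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
)


def window_size(line):
    """Return (kind, size in bytes) for a bridge window line, or None.

    A window whose end is below its start is empty and reported as size 0.
    """
    m = WINDOW.search(line)
    if m is None:
        return None
    start, end = int(m["start"], 16), int(m["end"], 16)
    return m["kind"], max(0, end - start + 1)


# Lines copied from the logs above.
lines = [
    "pcieport 0000:05:02.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 3a] add_size 200000 add_align 100000",
    "pci 0000:05:02.0:   bridge window [mem 0xd3f00000-0xd3ffffff]",
    "pci 0000:05:01.0:   bridge window [mem 0x2fb0000000-0x2fcfffffff 64bit pref]",
]
for line in lines:
    kind, size = window_size(line)
    print(f"{kind} window: {size // (1024 * 1024)} MiB")
```

By this arithmetic 0000:05:02.0 ends up with a 1 MiB non-prefetchable window and no prefetchable window at all, while its sibling 0000:05:01.0 has a 512 MiB prefetchable window.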

I've reached out to Dell support, and they assigned somebody to help out, but their first question is whether or not the problem exists on Windows 10.  In the meantime I've actually returned my RX 580 (I've kept the AKiTiO Node in the hopes that one day these issues will get resolved).  There's one other person (@adnans) on the Manjaro Linux forums who also has an XPS 9575 and an RX 580, so I'm hoping he can do some Windows testing and report back.

It's possible that Dell worked around some of the BIOS bugs in their Thunderbolt Windows Drivers.  It's also possible that this really is a Linux kernel Thunderbolt bug.

I'll try to do a better job keeping this thread up to date with additional information.  Thanks again for your reply, I really appreciate your suggestions!

Rob

rstrube
(@rstrube)
Active Member
Joined: 2 years ago
 

UPDATE:

An ACPI kernel developer got back to me and mentioned that the PCI resource allocation issues that are present in my dmesg are not actually a problem.  This is contradictory to what the AMD amdgpu developers thought was the root cause of the problem with using the RX 580 as an eGPU.

For those of you that are interested, here's the kernel bug report: https://bugzilla.kernel.org/show_bug.cgi?id=201527

Thanks!
Rob
