PCI Resource Error (MBP 2016 15”, 19.10)
I tried using the scripts from @hertg (eGPU-switcher) and @Sebulon (gswitch). With both of them, however, switching to the eGPU leaves me stuck on the login screen, which loops back every time I enter my credentials.
I suppose this is down to some issue with the Xorg config. I installed the 430 Nvidia drivers; the GTX 1070 is recognised by Ubuntu and the Core appears as authorised as well. Essentially, I want the 1070 to drive my external monitor as the display output, and to use it for CUDA acceleration.
Does anyone have any pointers in this situation?
Hey man, bummer to hear you're having issues!
When you say Ubuntu 18.04, I guess you're talking about standard Ubuntu, correct? If so, you could try installing Kubuntu 18.04, or switch to another login manager such as LightDM, which I've heard people have better success with than the default GDM.
@sebulon, thank you for your input!
I installed and switched to LightDM instead. It did not solve the issue, but it changed the behaviour: when using gswitch to switch to the eGPU, the internal screen turns into a black, empty console with a blinking underscore in the corner, while the external screen stays dark. No inputs of any kind are recognised. The same happens with egpu-switcher upon reboot. Switching back to the internal GPU via recovery mode fixes everything again.
Hmm, it's hard to say what the issue might be, since I lack the hardware to test with. On my laptop, a Lenovo P50, there is both an iGPU (Intel integrated) and a dGPU (Nvidia Quadro M1000M), and for some reason I need a display setting in the BIOS set to "Hybrid" and prime-select set to "nvidia" for gswitch to work. No idea what the equivalent would be for you though...
Alright, I have been banging my head against this all night and figured out where the problem lies:
For some reason Linux does not load the Nvidia driver, and as a result the system displays a blank screen upon switching to the eGPU. Ubuntu recognises the GTX 1070 just fine, but the driver never loads and never claims the GPU.
I tried installing several different versions via the GUI updater, apt, and .run files; nothing seems to work. The machine does not use Secure Boot, and the drivers are not blacklisted in modprobe. By this point I am really out of ideas, so if anyone has experience installing Nvidia drivers properly with an eGPU, I'd be happy to hear your thoughts!
It's possible for the GPU to show up as a PCIe device but not be claimed by the driver due to a lack of PCIe resources. You can look at the output of dmesg and see if you are getting the same "BAR1 is 0M" error as in this thread: https://egpu.io/forums/thunderbolt-linux-setup/tutorial-ubuntu-18-04-rtx-2080-razer-core-v1/#post-76425
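For example, a quick filter like this (a sketch; demonstrated here against a captured sample line so the pattern itself is visible, but on a live system you would pipe `dmesg` directly):

```shell
# Filter kernel messages for the NVRM/BAR resource error.
# On a live system:  dmesg | grep -E 'NVRM|BAR[0-9]'
# Sample line taken from the linked thread, for illustration only:
sample='NVRM: BAR1 is 0M @ 0x0 (PCI:0000:80:00.0)'
echo "$sample" | grep -E 'NVRM|BAR[0-9]'
```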
@nu_ninja, good call! It indeed seems to be a resource allocation issue:
[ 141.199292] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:80:00.0)
[ 141.199293] NVRM: The system BIOS may have misconfigured your GPU.
[ 141.199295] nvidia: probe of 0000:80:00.0 failed with error -1
[ 141.199315] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 141.199316] NVRM: None of the NVIDIA devices were initialized.
[ 141.199628] nvidia-nvlink: Unregistered the Nvlink Core, major device number 510
Unfortunately the suggested boot parameters in the other thread did not help much. I tried my luck with disabling PCIe devices to free up bandwidth, but did not get anywhere. Another option I considered was disabling the AMD GPU and using the iGPU instead; however, even after configuring Xorg to use the iGPU (as confirmed by a look at the system info), disabling the AMD GPU resulted in a black screen.
After a lot of experimenting I finally managed to make it work. By this point it is difficult to pinpoint the exact procedure, but it should be something like this:
- Ubuntu 19.04 (might work on other versions too)
- Add the kernel parameter "pci=nocrs,realloc" to /etc/default/grub
- Deactivate the AMD dGPU by adding "blacklist amdgpu" to /etc/modprobe.d/blacklist.conf. The machine will now boot with virtual display drivers. (PS: This step might not be necessary; I have not yet attempted this without it.)
- Install the drivers (in my case nvidia-driver-440)
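For anyone following along, the grub and modprobe edits above might look like this (a sketch; the rest of the GRUB_CMDLINE_LINUX_DEFAULT line will differ per machine):

```
# /etc/default/grub -- append to the existing line, then run:
#   sudo update-grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nocrs,realloc"

# /etc/modprobe.d/blacklist.conf -- then run:
#   sudo update-initramfs -u
blacklist amdgpu
```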
1. Plug in eGPU and boot
2. Once logged in, open a terminal and execute the following commands as root, with 5-10 seconds in between each of them:
echo 1 > /sys/bus/pci/devices/0000:00:01.2/remove
echo 1 > /sys/bus/pci/rescan
echo 1 > /sys/bus/pci/devices/0000:00:01.1/remove
echo 1 > /sys/bus/pci/devices/0000:00:01.2/remove
echo 1 > /sys/bus/pci/rescan
What this essentially does is remove the right PCI bridge (where my eGPU is connected), rescan all PCI devices, then remove both the left AND right PCI bridges and rescan again. Upon the last rescan the system frees up resources and allocates them to the eGPU. It only seems to work when executing the commands in this exact order; any deviation screws it up for me.
The caveat right now is that this process does not always work. Sometimes one of the rescans fails with a Segmentation fault or Killed error, which means rebooting and trying again. Once recognised by the driver, though, the eGPU works flawlessly and is picked up without any issues by CUDA and related programs.
There is definitely a pattern to all of this and I will investigate further to figure out a more reliable method, however for now it serves as a proof of concept that this is definitely possible.
Good to hear you found a workaround. I imagine the timing is very sensitive; are you entering the commands manually? A bash script with each command separated by `sleep X`, where X is the number of seconds, could maybe help make it more repeatable.
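Something like this sketch, for instance (the bridge addresses are the ones from the post above and will differ on other machines; DRY_RUN defaults to on here so the sequence can be previewed safely, and the real writes need root):

```shell
#!/bin/bash
# Remove/rescan sequence from the workaround above, with a pause
# between steps. DRY_RUN=1 (the default) only prints the commands;
# run with DRY_RUN=0 as root to actually perform the writes.
DELAY="${DELAY:-7}"      # seconds between commands (5-10 suggested)
DRY_RUN="${DRY_RUN:-1}"

pci_write() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "echo 1 > $1"
    else
        echo 1 > "$1"
        sleep "$DELAY"
    fi
}

pci_write /sys/bus/pci/devices/0000:00:01.2/remove   # right bridge
pci_write /sys/bus/pci/rescan
pci_write /sys/bus/pci/devices/0000:00:01.1/remove   # left bridge
pci_write /sys/bus/pci/devices/0000:00:01.2/remove   # right bridge
pci_write /sys/bus/pci/rescan
```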
I have a very similar issue on my late 2017 MBP 14,3 and an RTX 2080 Ti with the Razer Core X Chroma. I also got it working perfectly on Windows and am now trying to make it work on Ubuntu. How did you determine which PCIe bridges to remove? I have three PCIe bridges related to my GPU instead of your two, so I'm not sure exactly which one to remove and when to rescan.
I also noticed that if I set pci=realloc, then when the eGPU is connected the GPU fans go crazy and dmesg doesn't even show any lines related to the Nvidia driver loading.
Also, have you used the egpu-switcher in your setup?