External GPU: M2 vs Thunderbolt for Deep Learning and Gaming
 

Tadas Šubonis
(@tadas_subonis)
Active Member
Joined: 2 years ago
 

Hi all,

I've done some testing comparing M2 vs Thunderbolt. I've written up my results here:

 

External GPU: M2 vs Thunderbolt for Deep Learning and Gaming

2020-05-19

 

Intro

A while ago I wanted to bump up the non-existent gaming and deep learning capabilities of my workstation. Since it's a laptop, I started looking into getting an external GPU. That's quite a convenient option: you get a portable machine that can hook into a beefy GPU whenever you are working at your regular desk.

This required quite a bit of research, and the expected performance wasn't entirely clear (especially for deep learning tasks), so after going through all that trouble I decided to write up some of my experiences and the things I noticed.

Do not expect really sophisticated insights or benchmarks here, but I hope this will help you build an intuition about the eGPU performance you can expect.

External GPUs

Most commonly, eGPUs use Thunderbolt 3. It's easy to connect, and its 40Gbps bandwidth provides decent performance, so you can actually make use of that GPU.

There are lots of eGPUs available to choose from. I went with Razer Core X as it:

  • is relatively compact
  • has no external ports (so the GPU won't have to share Thunderbolt bandwidth)
  • has a beefy PSU to support RTX 2080 Ti

The only trickery I had to do here was a complete, clean uninstall of the NVIDIA Quadro drivers using DDU, followed by installing the regular GeForce drivers. The only annoying thing after connecting the Core X is that you can't have the internal dedicated GPU (M1000M) running, because that causes the fans on the card inside the eGPU enclosure to go on full blast.

Disabling the M1000M using Device Manager and restarting the system helped here.
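If you'd rather script this than click through Device Manager, Microsoft's devcon utility from the Windows Driver Kit should be able to do the same thing. A rough sketch - the placeholder is mine, and note that both the M1000M and the eGPU card share NVIDIA's PCI vendor ID 10DE, so target the specific instance ID rather than a broad pattern:

devcon find "PCI\VEN_10DE*"
devcon disable "@<instance ID of the M1000M from the find output>"

As with Device Manager, a restart may be needed for the change to fully take effect.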

My Workstation

It's a 15" HP ZBook G3 laptop that has a dedicated Quadro M1000M GPU. Back in 2017 it was a decent mobile workhorse, but these days I would really like to get one of those Ryzen CPUs.

It has an Intel Core i7-6820HQ CPU @ 2.70GHz (4C/8T) and 32GB of DDR4 RAM. Where this laptop excels is its extensibility options: you can add an additional drive or an ExpressCard, and there are two M2 slots available for another NVMe drive or something else.

Also, it has decent Thunderbolt 3 support, and its PCI Express lanes are not shared with other devices (AFAIK).

If you are interested in learning a bit more about it, you can take a look at the review here. Note that the reviewed laptop has a Quadro M2000M and scores 3820 in Firestrike.

As the star of the show, I picked a Gigabyte Windforce RTX 2080 Ti.

M2 Options

Another option is to connect your eGPU to PCI-Express “directly” using something like this.

I got this setup after I mistakenly thought that my Core X got busted. In the end, the fault was with the active Thunderbolt cable that I had bought separately. Hey, but I got to play around with a fancy M2-based setup.

You have to get an external PSU together with the riser above to make it work. Also, as you can probably guess, it is not as convenient as the Thunderbolt option, as you have to remove the bottom cover (or arrange some other kind of access) to reach the M2 connector on the motherboard. That's not something you want to plug and unplug every day, but it's not really a big deal (it's easy to access the internals of this ZBook).

However, this apparently has some nice performance benefits (or this), as there is no overhead of carrying the PCI Express data over Thunderbolt.

The whole M2 setup might look a bit strange (or cool) depending on how you judge.

Driver Issues

Connecting an eGPU via M2 is far from an ideal experience. First of all, the default driver installation won't work, and you will be greeted by error 43 after installing the drivers. You will find instructions on how to deal with that here.

Apparently, this happens because NVIDIA checks whether the M2 connector is marked as hot-pluggable or not.

Basically, if HWiNFO shows the M2 slot as not hot-pluggable instead of hot-pluggable, you will have to run a magical script to fix that for you (I keep wondering how it works :/).

Finally, you can't put the system on standby (sleep). I am not sure what exactly happens, but it would seem that there is an unexpected power issue when the system resumes from standby, and it panics when there is no GPU powered/connected (yet). There is a chance that this might not happen if the eGPU is connected to an M2 port that supports hot plugging.

Power Issues

This specific ADT-Link device had some trouble running the RTX 2080 Ti. Under certain heavy loads (e.g. Firestrike or deep learning tasks), the eGPU would just disconnect and (sometimes) crash the system. Not cool. Apparently, this is a known problem, and the suggested mitigation is as follows:

This can be either a PSU or video card stability issue with the factory settings, which may be clocked beyond what the components can handle. For the latter, downclock your video card by 15% (core/mem/target power) using MSI Afterburner. If the problem persists, then swap your PSU for a known good one and test again.

After dropping the memory and core clocks by 200MHz and the power target by 20%, I managed to run it without issues. I also connected one of the power cables directly to the PSU instead of supplying all of the power to the GPU via the ADT-Link. However, that's far from optimal, as you are basically losing the performance you were hoping to gain from the M2 connection. Later, I'll refer to this as "Downclocked" in the benchmarks.
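As an aside, the power-limit part of this can also be set without Afterburner on recent NVIDIA drivers (the 200W value below is just an illustration - check what limits your card actually reports first; core/memory clock offsets still need a tool like Afterburner):

nvidia-smi -q -d POWER
nvidia-smi -pl 200

The first command prints the current, default, and min/max enforceable power limits; the second (run from an elevated prompt) sets a lower limit in watts.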

I haven’t tried another PSU or changing the power cords entirely.

Benchmarks

I've tried collecting a set of benchmarks that would allow making some useful comparisons against other systems. However, since this system's CPU is a bit weak compared to modern desktop workhorses, it mostly makes sense to compare the results among themselves, to see what the general eGPU performance is and what the difference between the Thunderbolt and M2 connections is.

Also, I hope that deep learning practitioners will get some useful hints about what they can expect from an eGPU-based setup compared to proper desktop machines (here or here).

External vs Internal Screen

First of all, there is a difference in how you connect your screens to your eGPU. If you are using the internal laptop display, you can expect to lose some performance compared to using an external display that’s connected to the eGPU directly.

3Dmark Core X

You can see a ~15% drop in Graphics Score between the internal and external screens. Below, the difference between the TimeSpy runs is not as high - around 7%.

However, overall, it is an extremely sweet improvement over the M1000M, because I am getting ~15k Firestrike scores instead of 3820 (that's for the M2000M - unfortunately, I never ran any benchmarks on my own M1000M).

3Dmark M2

Something similar can be observed using M2:

You can also see that M2 can perform up to 20% faster than the Thunderbolt connection (comparing Graphics Score only):

4K

In case you are interested in 4K performance:

You can see that the downclocking makes M2 perform worse than the regularly clocked GPU via Thunderbolt. If not for those power issues, M2 would probably perform better here too.

Superposition and Kombustor

I've also made some runs using Superposition and Kombustor, in case somebody is looking for those:


AI Benchmark

There aren't many options to choose from when benchmarking deep learning libraries. One quite decent option that I've found is AI Benchmark. It runs on either TensorFlow 1.x or TensorFlow 2.x and basically tests the inference and training speed of the most popular neural network architectures.
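Running it is pleasantly simple. Assuming the ai-benchmark package from PyPI (pip install ai-benchmark) and a working TensorFlow installation, the whole thing is roughly:

from ai_benchmark import AIBenchmark

# runs the full suite (inference + training) and prints the resulting AI Score
benchmark = AIBenchmark()
results = benchmark.run()

There are also run_inference() and run_training() methods if you only need one half of the suite.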

I wasn't able to run the full test suite on the non-downclocked M2 setup, but the initial runs might give some insight:

Overall score comparison for the downclocked version (so that the test would complete) can be seen below:

If you compare that with the public ranking, you will see that this system gets a paltry ~20k score instead of the 32k that some people have reported. The highest scores I got were 22k using the non-downclocked GPU via the Core X and the internal screen, and 22.8k using the downclocked GPU via M2 and the internal screen.

In most of the individual AI Benchmark tests, I found the performance to be similar between Thunderbolt and M2, except for NLP/RNN related tasks:

For example, Pixel-RNN training can be ~20% faster via M2, while GNMT-Translation inference is 23% faster.

PyTorch

I also wanted to test some PyTorch code, because that's the framework I mainly use. Unfortunately, I could not find a decent benchmarking framework, so I ran pytorch-examples.

For most of the examples it was quite difficult to get the data, or I ran into other issues while testing (like forgetting to use --cuda 🙁). In the end, I managed to procure benchmarks for the MNIST and LSTM language model (5 epochs) examples. The results can be seen below:

As you can see, DL workloads really like M2 and having the whole GPU to themselves (no external monitor connected directly to the eGPU). It would seem that if you are running RNN models, you could gain an improvement of almost 28% by sticking with M2.
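If you want to collect rough timings like these yourself, below is a minimal sketch of the kind of loop I mean - the tiny model and random data are stand-ins, not the actual pytorch-examples code. The torch.cuda.synchronize() calls matter, because CUDA kernels launch asynchronously and the timer would otherwise stop before the GPU has finished:

import time
import torch
import torch.nn as nn

device = torch.device("cuda")

# stand-in model and data - substitute the MNIST or LSTM example here
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(64, 784, device=device)
y = torch.randint(0, 10, (64,), device=device)

torch.cuda.synchronize()   # don't let pending GPU work skew the timer
start = time.time()
for _ in range(1000):      # a pretend epoch of 1000 batches
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()   # wait until the GPU has actually finished
print(f"elapsed: {time.time() - start:.2f}s")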

PCI Bandwidth

Finally, I’ve made some PCI Bandwidth benchmarks using 3Dmark.

Concluding Remarks

I am really glad that I got an eGPU, as it allowed me to set up a proper 2x4K screen arrangement and substantially improved my gaming experience. For deep learning work, I could already use the M1000M to test whether the code runs locally and then run it on the server, so that wasn't as big of a deal.

Nevertheless, it's really awesome to be able to run some learning tasks really fast (~2-5min) instead of the 30-50min it took before. This helps a lot when you are trying to figure out why the net is not converging and you would like to do a lot of iterations.

If you are interested in more detailed numbers, I've included all of the original data in a spreadsheet.

The choice of M2 vs Thunderbolt really depends on your preferences. If you do not move around that much with your laptop and you really care about performance, then M2 might be a really solid choice.

However, if you are more of a practical fellow, then it's really hard to beat Thunderbolt, as the performance losses are rather tiny and the setup is a lot more straightforward. I, personally, will be sticking with the Core X.

nando4
(@nando4)
Noble Member Admin
Joined: 4 years ago
 

Thank you.

An M.2 eGPU will outperform TB3, since it carries the raw PCIe data stream prior to TB3 encoding/decoding.

TB3 encoding results in between 25% and 64% less H2D/write bandwidth than M.2, as summarized from:

https://egpu.io/forums/builds/2015-15-dell-precision-7510-q-m1000m-6th4ch-gtx-1080-ti-32gbps-m2-adt-link-r43sg-win10-1803-nando4-compared-to-tb3-performance/ :

Performance Analysis

Forza 4 shows a significant 33.3% decrease in FPS, beyond what we'd expect given TB3's 25.2% decrease in AIDA64 (H2D) bandwidth. Oddly, even a 16Gbps-M2 interface, with supposedly less bandwidth, outperforms TB3. TB3 should be just an encoding/decoding PCIe transport pair, so why are we seeing such a performance decrease over TB3?

We find clues as to why by using the bandwidthTest.exe tool included with DaVinci Resolve, running the command line below with the eGPU connected on a 32Gbps-M2, a 32Gbps-TB3, and a 16Gbps-M2 interface.

"C:\Program Files\Blackmagic Design\DaVinci Resolve\bandwidthTest.exe" --htod --mode=shmoo --csv > out.csv

We then review the gathered bandwidth-versus-block-size information, where we see:

 

  1. TB3 reduces the 32Gbps PCIe bandwidth by anywhere from 25% to 63.5%.
  2. The greatest reduction occurs at small block sizes, with 2KB seeing the greatest (63.5%) reduction.
  3. TB3 is outperformed by 16Gbps-M2 up to 8KB block sizes, only matching its performance at 16KB and outperforming it thereafter.
  4. It appears that Forza 4 is bandwidth bound and uses small block transfer sizes.
  5. The minimum 25% reduction occurs at block sizes of 200KB and greater -> TB3 is performance-optimized for large block sizes.
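If you want to reproduce this kind of comparison from your own runs, a rough sketch in Python follows. The file names and the assumption that each data row is a "block size, bandwidth" pair are illustrative - adjust the parsing to whatever bandwidthTest actually writes into out.csv:

import csv

def load(path):
    # assumes data rows of the form "block_size,bandwidth"; skips headers/blank lines
    rows = {}
    with open(path) as f:
        for r in csv.reader(f):
            if r and r[0] and r[0][0].isdigit():
                rows[int(float(r[0]))] = float(r[1])
    return rows

m2 = load("m2_32gbps.csv")     # hypothetical file names, one bandwidthTest run per interface
tb3 = load("tb3_32gbps.csv")

for size in sorted(set(m2) & set(tb3)):
    drop = (1 - tb3[size] / m2[size]) * 100
    print(f"{size:>8}: TB3 bandwidth is {drop:5.1f}% below 32Gbps-M2")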

 

Comments (e.g. how has the eGPU improved your workflow or gaming)

Is a TB3 eGPU worth it for gaming? We've shown the additional TB3 transport layer decreases x4 3.0 32Gbps bandwidth by anywhere from 25% to as much as 63.5%, with bandwidth-bound apps/games registering this performance reduction. The reference 32Gbps-M2 interface itself offers 4 times less bandwidth than an Intel desktop's x16 slot.

Is the problem simply the TB3 controller being clocked too slowly? Maybe. We do see that TB3 is optimized for the large block sizes used for data transfer on SSDs. Intel certainly has plenty of room to improve this small-block transfer performance in TB4.

 


FAQ

Q: How to maximize gaming performance on a notebook?

From this build and performance analysis we can suggest:

  • seek a notebook with a decent dGPU
  • seek a candidate system offering a factory direct PCIe eGPU interface, like an Alienware Graphics Amplifier port
  • cobble together an eGPU using the NVMe SSD's M.2 slot, also a direct PCIe eGPU interface, as shown in this build
  • obtain a TB3 eGPU, now knowing its performance limitations on bandwidth-bound games/apps

 


Q: What are the pros and cons of a M.2 eGPU over Thunderbolt 3 (TB3)?


M.2 eGPU pros over TB3:

1. gets full x4 3.0 bandwidth - TB3 has 25%-64% less H2D/write bandwidth than M.2

2. noticeably lower cost for the eGPU adapter hardware

3. the ability to utilize the eGPU in a Hackintoshed macOS system


M.2 eGPU cons over TB3:

1. requires a candidate notebook with dual storage (M.2 + M.2 or M.2 + SATA) to be viable (use the M.2 slot for the eGPU and the other storage for the boot OS)

2. requires underside eGPU cabling to the internal M.2 port

3. requires DIY tweaking/testing to confirm compatibility

eGPU Setup 1.35    •    eGPU Port Bandwidth Reference Table

 
2015 15" Dell Precision 7510 (Q M1000M) [6th,4C,H] + GTX 1080 Ti @32Gbps-M2 (ADT-Link R43SG) + Win10 1803 [build link]  


Gareth Rees
(@gareth_rees)
Eminent Member
Joined: 1 year ago
 

A Vega uses more GPU power than a 2080 Ti, so it makes no sense that you would need to drop clocks that much. Try using both 8-pin PCIe power connectors from the PSU to the GPU instead of the R43SG power output.


Dell Latitude 5491 14" BIOS 1.12.1 + Active PCH cooling | Core i7 8850H + liquid metal - https://valid.x86.fr/z6xi8n | 32GB DDR4 2400 | Samsung 500GB 850 EVO | MX130 + liquid metal | Logitech Z-2300 | Razer DeathAdder Elite | Corsair K70 Rapidfire | R43SG v1.2 + RX 570 4GB

