How much bandwidth overhead does Optimus/X-connect accelerated internal LCD mode add?
nando4
(@nando4)
Noble Member Admin
Joined: 3 years ago
 
itsage recently published an article eGPU Performance: Internal vs. External Display. I wanted to take a slightly different spin on the subject by asking the question:
 

Q: How much bandwidth overhead does Optimus/X-connect accelerated internal LCD mode add?

 
Subtracting this value from our total link bandwidth, together with a reference for how much bandwidth is needed for acceptable performance, will tell us whether our eGPU is truly a portable desktop-replacement solution: one that doesn't require an external LCD.
 
TechPowerUp has done a broad gaming test across different bandwidths, showing that you want at least a x4 2.0 16Gbps link for real desktop-replacement GPU performance (>85%). That gives us a useful reference point.
 
Below are the bandwidth requirements to send uncompressed data to the internal LCD @ 60Hz, calculated using Kramer's bandwidth calculator. NVidia Inspector's 'Frame Rate Limiter V2' can lock the frame rate to make doubly sure we do not exceed 60Hz; anything more is redundant since that is the LCD's refresh rate, yet some games still report >60FPS when benchmarked.
 
Do note* this does not take into consideration other tricks NVidia/AMD drivers may use to decrease internal-LCD PCIe link traffic, so this overhead can be considered the worst-case scenario.

Resolution              Internal LCD      Theoretic remaining bandwidth for eGPU (Gbps)
                        overhead* (Gbps)  32Gbps-TB3  16Gbps-TB2  10Gbps-TB1  4Gbps-EC2/mPCIe2
3840×2160 – 4K/UHD           14.9            17.1        1.1      unfeasible   unfeasible
2880×1800 – 15″ Retina        9.3            22.7        6.7         0.7       unfeasible
2560×1600 – 13″ Retina        7.4            24.6        8.6         2.6       unfeasible
2560×1440 – WQHD              6.6            25.4        9.4         3.4       unfeasible
1920×1080 – FHD               3.7            28.3       12.3         6.3          0.3
1680×1050 – WSXGA+            3.2            28.8       12.8         6.8          0.8
1366×768 – WXGA               1.9            30.1       14.1         8.1          2.1
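As a sanity check, the overhead column can be reproduced with simple arithmetic. The published figures match active pixels × 60 Hz × 30 bits/pixel, which presumably reflects 24-bit colour plus roughly 25% blanking/transport overhead; that constant is inferred from the numbers here, not taken from Kramer's calculator itself.

```python
# Sketch reproducing the internal-LCD overhead table above.
# 30 bits/pixel is an assumption inferred from the published figures.
BITS_PER_PIXEL = 30
REFRESH_HZ = 60

def lcd_overhead_gbps(width: int, height: int) -> float:
    """Uncompressed internal-LCD traffic at 60 Hz, in Gbps."""
    return width * height * BITS_PER_PIXEL * REFRESH_HZ / 1e9

def remaining_gbps(link_gbps: float, width: int, height: int):
    """Bandwidth left over for eGPU traffic; None means 'unfeasible'."""
    rem = link_gbps - lcd_overhead_gbps(width, height)
    return round(rem, 1) if rem > 0 else None

print(round(lcd_overhead_gbps(3840, 2160), 1))  # 14.9 (4K/UHD row)
print(remaining_gbps(32, 3840, 2160))           # 17.1 on 32Gbps-TB3
print(remaining_gbps(4, 1920, 1080))            # 0.3 on 4Gbps-EC2/mPCIe2
```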


Blue: ~85% desktop GPU performance (x4 2.0 16Gbps or more bandwidth)
Lime: ~71% desktop GPU performance (x4 1.1 8Gbps or more bandwidth)

 

From this table we can conclude:

 

  • x4 3.0 32Gbps-TB3: NVidia/AMD accelerated internal LCD traffic overhead still leaves at least our desired x4 2.0 16Gbps of bandwidth purely for GPU traffic.

 

  • 16Gbps-TB2 with an external LCD to maximise its narrower PCIe link also gives ~85% desktop GPU performance. Internal LCD mode up to FHD can still deliver more than 71% desktop GPU performance.

 

  • Slower 10Gbps-TB1, 4Gbps-EC2 and 4Gbps-mPCIe2 links will see compromised performance, particularly at higher resolutions. Those require an external LCD, frame-limiting tweaks (e.g. NVidia Inspector Frame Rate Limiter V2) and disabling of eGPU audio devices to maximise their narrow bandwidth for GPU traffic.

eGPU Setup 1.35    •    eGPU Port Bandwidth Reference Table


genium me, 3RYL, enjoy and 2 people liked
ReplyQuote
vsod99
(@vsod99)
Active Member
Joined: 3 years ago
 

Great comparison! I'm going to keep this in mind when purchasing a new machine a few years down the line. I'm okay with using an external LCD for now, though. 

Main rig: i7-6850k, EVGA 980ti Classified, HX850 PSU, 16GB DDR4-2133, X99 MSI SLI PLUS motherboard
Current (WIP) eGPU setup - HP Elitebook 8470p/16GB ram/i5-3220m/980ti
Past eGPU setup - Dell Vostro 1520 with Intel Core 2 Duo and GTX 760


nando4 liked
ReplyQuote
AquaeAtrae
(@aquaeatrae)
Eminent Member
Joined: 3 years ago
 

Thanks for all the efforts nando4... both here and on reddit. Many people are learning a lot from your research.

However, I wanted to ask why you (and many others) suggest incoming and outgoing bandwidth affect one another? As I understand it, Thunderbolt 3's bandwidth is 40Gbps upstream + 40Gbps downstream (Technical Brief). I'm also reading that PCIe 3.0 is likewise bidirectional and full duplex (here). If so, I'm not clear why downstreaming video back to the internal screen (iGPU frame buffer) would so reduce the largely upstream bandwidth feeding an eGPU. Can you clarify why that would be?

As your post points out, this distinction becomes much more important when considering x2 PCIe lanes limited to 16Gbps. But if it's truly 16Gbps up + 16Gbps down, maybe it's not so serious for eGPUs driving internal displays?

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


ReplyQuote
Plastixx
(@plastixx)
Trusted Member
Joined: 3 years ago
 

With my x1 Gen2 mPCIe setup, bus utilization while using the internal display was >90% @1080P 60Hz. Using an external display it averages 35%.

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


ReplyQuote
nando4
(@nando4)
Noble Member Admin
Joined: 3 years ago
 
Posted by: AquaeAtrae

 

If so, I’m not clear why downstreaming video back to the internal screen (iGPU frame buffer) would so reduce the largely upstream bandwidth feeding an eGPU. Can you clarify why that would be?

As your post points out, this distinction becomes much more important when considering x2 PCIe lanes limited to 16Gbps. But if it's truly 16Gbps up + 16Gbps down, maybe it's not so serious for eGPUs driving internal displays?

   

You do raise a great point about upstream vs downstream traffic. Indeed, if it's predominantly CPU->GPU traffic in games/apps, then the GPU->CPU bandwidth for the accelerated internal LCD may not have as serious an effect as posted.

How could we practically measure just how much the accelerated internal LCD impacts performance? The best way I can think of is to:

– repeat the TechPowerUp-style PCIe scaling benchmarks on an Optimus/X-Connect machine across many apps/games, to cover variation in bandwidth requirements

– have an external LCD attached to do the game/app FPS testing

– have the internal LCD available to repeat the same tests, but with an artificial 60Hz movie playback load to simulate the Optimus/X-connect load on the bus interface (GPU->CPU traffic)

– compare FPS results for the external LCD alone vs the external LCD with the artificial load

Such testing may also leverage hwinfo64's hooks into NVidia's bus-interface metrics, which can be extracted and charted, again for external LCD vs external LCD with artificial load.

eGPU Setup 1.35    •    eGPU Port Bandwidth Reference Table


ReplyQuote
AquaeAtrae
(@aquaeatrae)
Eminent Member
Joined: 3 years ago
 

Thanks! I figured this might be worth investigating. Many of us are investing money based on these understandings and predictions (myself included). So hopefully, we can develop some definitive testing soon and, if need be, update the popular understanding of Thunderbolt 3 and these reduced PCIe lanes.

I've been working to decipher the inner workings of PCIe, graphics APIs, Optimus, and Thunderbolt. I want to better understand the actual impact bandwidth has. For example, many consumers simply interpret: half the PCIe lanes = half of 40Gbps = half the FPS when gaming. But these technologies are much more complex than consumer intuition suggests, and these things aren't so directly related to one another.

Most notably, the framerates (FPS) that affect gaming most directly are generated within the GPU, based only on data transmitted over PCIe. So if we load a 3D map, textures, lighting, and a camera position into the GPU's VRAM and start generating video frames, very little data would need to be sent over PCIe to, for instance, turn 5 degrees left and update the video frame. Such a simple call (e.g. via CUDA) would hardly utilize any PCIe bandwidth. On the other hand, loading new textures or dramatically changing the map and scene might fully utilize the bandwidth, momentarily becoming the bottleneck and reducing average framerates.
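To put rough numbers on those two cases (the sizes below are illustrative assumptions for the sketch, not measurements of any particular game):

```python
# Illustrative only: rough scale of per-frame PCIe traffic for the two
# cases described above. Both sizes are assumptions.
camera_update = 16 * 4            # one 4x4 float32 view matrix = 64 bytes
texture_upload = 4096 * 4096 * 4  # one uncompressed 4K RGBA texture

print(camera_update)                   # 64 bytes: negligible on any link
print(round(texture_upload / 1e6, 1))  # ~67.1 MB: can briefly saturate a narrow link
```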

The framerates reported in that TechPowerUp article are one of the few sources I've found illustrating this, debunking the popular notion that framerates are directly tied to bandwidth. There's a hint of evidence that open-world games experience these momentary bottlenecks a bit more often, but again, it's not substantial.

I am also learning which specific games demand more of various resources. Battlefield 1, for instance, demands a fan-cooled quad core CPU whereas Overwatch is quite playable even on an ultrabook dual core like the Transformer 3 Pro 2-in-1.  

I don't yet have an eGPU enclosure, but will be happy to help with testing as soon as I have one available. I've been holding out for the ASUS XG Station 2 with its optional added bandwidth relief via USB, so I might keep my game library on the dock. Optimus and battery life are also of key interest. I've grown tired of lugging my desktop replacement around these last six years.

I believe eGPUs will become mainstream as Thunderbolt 3 is now widely and easily offered by manufacturers. It fulfills the road warrior and college student dream of a full day of mobile productivity plus powerful gaming when docked. Gamers will always need to plug in, so why not shift the heavy GPU and PSU to the other end of the cord? These capabilities just need to be presented to the market clearly, raising awareness of these exciting new possibilities.

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


nando4 liked
ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

Fellow eGPUers, I did some internal vs external monitor testing with my T430s TB1 setup, and have come to an interesting conclusion: 3DMark performance on an external and on the internal monitor is pretty much the same, with the only change appearing in high-FPS tests, due to the 60Hz lock introduced by Optimus.

Here, take a look.

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
Sky11
(@sky11)
Reputable Member
Joined: 3 years ago
 

While I won't comment on performance on a hacked MacBook, I know for a fact that if you want the best performance from your external GPU, you need to attach a display directly to the eGPU, extend the desktop (Win+P, then Extend) and set the external display as the Primary Display.

This is not AMD XConnect specific; this is how Microsoft Hybrid Graphics works, and if your laptop already has a GPU inside, then setting the external panel as primary is the only reliable way of launching the applications on the external GPU.

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

Attached are the results from Heaven and Valley runs at 1600x900, external (left) vs internal (right). Looks like pretty much the same result: It gets capped at 60-ish FPS, but is otherwise identical in performance. I now have six different benchmarks from two different vendors confirming that external and internal monitor performance is essentially identical on this setup.

 

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
AquaeAtrae
(@aquaeatrae)
Eminent Member
Joined: 3 years ago
 
Posted by: Yukikaze

 

Fellow eGPUers, I did some internal vs external monitor testing with my T430s TB1 setup, and have come to an interesting conclusion: 3DMark performance on an external and on the internal monitors is pretty much the same, with the only change appearing in high-FPS tests, due to the 60hz lock introduced by Optimus.

Here, take a look.

   

Thanks Yukikaze. Looks like about a 6% difference. Can you share more about your specific setup? Is that a Lenovo T430s with some Thunderbolt 1 enclosure with 10Gbit/s bandwidth? It looks like a 1050 Ti driving the benchmarks, right? Was the left the external and the right your internal display?

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 
Posted by: Yukikaze

 

external (left) vs internal (right)

 

  

Note that the 6% difference includes the fact that the internal display limits anything rendered on it to 60 FPS. This reduces the maximum FPS in the test, and thus the average. It does not change the rest of the behavior, however; if you look at my linked thread you will see that in tests where the FPS never exceeds 60, there is no performance difference between the internal and external monitors. In other words, the difference is 0%, not 6%.

The linked thread has the full information about the setup, but I'll post here for clarity:

The T430s is driving a GTX1050Ti in an Akitio Thunder2 over TB1.

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
AquaeAtrae
(@aquaeatrae)
Eminent Member
Joined: 3 years ago
 

Sorry I missed those. Thanks for the testing and details there.

The 60 FPS internal display limit makes sense and would be the case with any Thunderbolt eGPU displayed internally. Maybe supersample some tests to force both internal and external displays below 60 FPS. That should show what performance hit, if any, can be attributed to the bandwidth provided (16GT/s, I think).

I believe Thunderbolt 1 also has bidirectional bandwidth (upstream + downstream). If so, I don't think we'd see much impact at lower resolutions. You're running at 900p; I think even 1080p may prove much more effective than most expect. Of course, a 4K internal screen may simply be too much for the downstream PCIe channels. I'm curious whether reducing the laptop's resolution from 4K to 2K or 1080p might also reduce the PCIe bandwidth used. I'm guessing it would, but have little information about how Optimus operates. I, at least, would be happy to game at 1080p on a tiny 15.6″ screen in my RV if eGPUs prove effective.

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

I already used as much supersampling in Valley and Heaven as I could (8x) to drop the FPS as close to 60 as possible. You can also see that Fire Strike and Time Spy, which stay under 60 FPS at all times, led to exactly the same results on both monitors. I think it is safe to conclude that there is no penalty in using the internal monitor at this point (beyond that which is introduced by Thunderbolt in general).

All Thunderbolt generations offer bidirectional bandwidth, so the traffic in one direction does not affect the traffic in the opposite direction.

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
nando4
(@nando4)
Noble Member Admin
Joined: 3 years ago
 

@Yukikaze, have you looked at NVidia Inspector "Frame Rate Limiter V2" to lock in a 59/60Hz FPS cap? Then you can do a like-for-like internal vs external LCD comparison. Even better, you can attach an external LCD to your eGPU (direct out), then to your notebook (traffic goes via NVidia Optimus), to see performance differences across the same resolutions your monitor is capable of.

eGPU Setup 1.35    •    eGPU Port Bandwidth Reference Table


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

nando, to be honest, I think I've done enough testing for now, and the point is proven. Every time I've been under 60 fps, the performance on the external (via HDMI direct to the card) and the internal monitors was exactly the same. I'd be more interested in someone else doing a similar test with a different setup to corroborate my results.

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
enjoy
(@enjoy)
Reputable Member
Joined: 3 years ago
 

Here are the results of a friend with no dGPU, Akitio Thunder2 / 13″ MacBook 2015, Thunderbolt 2:

External Display:

Internal Display:

It's a 30% performance drop, even with Thunderbolt 2!

 


These are my results with an Akitio Thunder2 / 15″ MacBook Retina mid-2012 / Thunderbolt 1 with the fake display mod:

External Display:

Internal Display:

Again, a 30% performance drop!

Your results are just amazing and unreal!


And the result of a 1060 on PCIe – desktop PC:

ϟ AKiTiO Thunder2 + EVGA GTX 1060 6GB SC Gaming (macOS Sierra 10.12.4 and Windows 10)
MacBook Pro (Retina, 15-inch, Later 2013) 3.2GHz Quad Core Intel i7-4750HQ / 8 GB 1600 MHz DDR3 / 256GB SSD + 1TB
mini eGPUPCI Express vs. ThunderboltMac CAN gameGaming Laptops vs. MacBook Pro with eGPU


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

Differences from my benchmark: you are not running in full-screen mode, and the results on the external display go as high as 100 FPS. We've already deduced that FPS over a certain mark gets chopped off. That said, I wonder what is going on.

I wonder if the way the Thunderbolt chip is connected to the system matters here (which is a possibility). It could be that the TB1 chip on the T430s is connected directly to the CPU (since without a dGPU, there is nothing using the CPU's direct connection), which would make some sense given that ONLY i7 models with no dGPU have a TB1 chip in the T430s lineup. In contrast, the MBPs may connect the TB chip to the PCH, not directly to the CPU. That way, the image traffic also competes with storage and peripheral communications, whereas the direct connection means the eGPU and the iGPU can "speak" pretty much directly. That would make the T430s a relatively unique snowflake, I suspect.

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
nando4
(@nando4)
Noble Member Admin
Joined: 3 years ago
 
Posted by: enjoy

 

Here are the results of a friend with no dGPU, Akitio Thunder2 / 13″ MacBook 2015, Thunderbolt 2:

  

As Yukikaze points out, internal LCD mode can often be limited by software to 60Hz. The better way to do the benchmark is to peg the refresh to 60Hz on the eGPU using NVidia Inspector "Frame Rate Limiter V2", then compare external vs internal results.

 
Posted by: Yukikaze

 

I wonder if the way that the Thunderbolt chip is connected to the system matters here (which is a possibility). It could be that the TB1 chip on the T430s is connected directly to the CPU (since without a dGPU, there is nothing using the CPU’s direct connection), which would make some sense due to the fact that ONLY i7 models with no dGPU have a TB1 chip on the T430s lineup). In contrast, it could be that the MBPs connect the TB chip to the PCH chip, and not to the CPU directly. This way, the image traffic will also compete will storage and peripheral communications, whereas the direct connection means that the eGPU and the iGPU can “speak” pretty much directly. That would make the T430s a relatively unique snowflake, I suspect.   

Any MBA or 2013+ 13″ MBP has the TB chip on the PCH (0:1c.4).
The 2012 13″ MBP connects it to the CPU (0:1.0).
All 15″ MBPs connect the TB chip to the CPU (0:1.x).

eGPU Setup 1.35    •    eGPU Port Bandwidth Reference Table


ReplyQuote
AquaeAtrae
(@aquaeatrae)
Eminent Member
Joined: 3 years ago
 

I would agree that we need to corroborate Yukikaze's findings against a variety of other setups. We should write up clear steps so anyone (of any experience level) can properly test and report performance and specs in each controlled test. Nando is great with presentation but may be speaking a bit above most users' level now that eGPUs are reaching mainstream consumers. I'm willing to help if it's done in some collaborative document like a wiki or Google Doc. If we had screenshots, I'd be willing to animate a GIF slideshow that could be posted to reddit, etc., with links to the needed tools, guiding people to report to a thread here. After we compile enough results, I'd suggest making some simple visuals summarizing the technology and our findings. If popular perception is misled, we need good, consistent testing to go viral with YouTubers, reddit, etc.

As for connecting directly to the CPU vs through the PCH chipset... I believe most modern laptops route through the PCH and potentially share their Thunderbolt bandwidth with other internal devices. Like the other factors, it's hard to see how much impact this has and under what specific conditions. One could only guess that loading new map areas from an NVMe drive might very briefly consume bandwidth, affecting certain games at particular moments.

I'm always online in Discord if you want to DM me about any of this: AquaeAtrae

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

Things that would share PCH traffic: Ethernet/WiFi, SATA/NVMe, USB and pretty much every device sitting below it in the device hierarchy. To create a relatively easy, reproducible guide, I think we need a low-FPS benchmark: something that is not likely to go over 30-40 FPS with pretty much any GPU we're likely to encounter. The only one that comes to mind that does not cost money is Time Spy Basic, but as a DX12 benchmark it is limited to people running Win10. Running Time Spy on the internal monitor vs the external monitor, on default settings, is something that is easy for people to do. Is Fire Strike Extreme also free? If so, that would be a good second option on DX11.

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
nando4
(@nando4)
Noble Member Admin
Joined: 3 years ago
 

@AquaeAtrae, two important findings when comparing TB3 to desktop performance are:

1. Accelerated internal LCD performance results can be skewed by Optimus pegging the refresh rate to 60Hz. We need a like-for-like comparison by pegging any reference external LCD to 60Hz as well [@Yukikaze is doing amazing work to show this].

2. x4 3.0 via TB3 delivers under 80% of direct x4 3.0 bandwidth: https://egpu.io/forums/mac-setup/pcie-slot-dgpu-vs-thunderbolt-3-egpu-internal-display-test/#post-3658

TB3 is clearly delivering compromised performance compared to a desktop. PC gamers will probably sidestep PC TB3, as we are seeing, and just get a desktop. MacBook owners running macOS derive the most benefit, since their 'desktop' options like a Mac Pro are costly; MacBook eGPUs are accordingly the most prevalent on eGPU.io.

eGPU Setup 1.35    •    eGPU Port Bandwidth Reference Table


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

nando, about point #2, I can offer an explanation that I believe is correct, but without knowing the details of the Thunderbolt communications protocol, this is impossible to verify.

Here goes:

Every PHY introduces overhead in the shape of a preamble sequence, so a certain amount of the bandwidth figure is likely eaten by the PHY itself. In addition, the Thunderbolt protocol likely has overhead of its own: it is a tunnel protocol carrying different protocols across itself, which means it needs some sort of header+packet structure, likely with additional error-correction fields. In the end, the effective bandwidth of TB3 is definitely smaller than the number on the box. And the smaller the packet data being sent, the worse the overhead: if the overhead adds 16 bytes per packet, the effect is much worse for 128-byte packets than for 1024-byte packets.
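A quick sketch of that last point, using the hypothetical 16-byte figure from the post (not TB3's actual header size, which isn't public here):

```python
# Fixed per-packet overhead costs small payloads proportionally more.
# The 16-byte overhead is the hypothetical from the post above.
def efficiency(payload_bytes: int, overhead_bytes: int = 16) -> float:
    """Fraction of the wire that carries useful payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)

print(f"{efficiency(128):.1%}")   # 88.9% efficient with 128-byte packets
print(f"{efficiency(1024):.1%}")  # 98.5% efficient with 1024-byte packets
```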

 

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
AquaeAtrae
(@aquaeatrae)
Eminent Member
Joined: 3 years ago
 

The devices that share PCIe lanes will depend on each motherboard's wiring, so the specific devices may vary, as will their demand on that bandwidth over time. 3DMark tests are consistent and easy, although even Time Spy Basic delivers a little over 60 FPS on my desktop's GTX 1080 at 1080p, and the upcoming GTX 1080 Ti will exceed that significantly. Also, one must pay to upgrade Time Spy for custom resolutions; to match resolutions with the free version, we'd need to guide people to set their external resolution to match before launching it.

@nando4 The 80% efficiency (20% overhead) you cited for Thunderbolt 3 is more significant than I recall seeing elsewhere. Ideally, I'd like to see an apples-to-apples test, for instance a desktop with Thunderbolt 3 driving an eGPU to test the GPU. So far, I haven't seen anyone do this.

Would you agree that the most elusive and critical variable in all this is: "How much bandwidth does a GPU actually consume?"

Using just a piece of tape, TechPowerUp showed that halving PCIe bandwidth to 16GT/s reduced overall FPS by only 6 to 13% depending on resolution (that's with the 20% encoding overhead of PCIe 2.0). Or more accurately, bandwidth rarely reduced FPS at all, averaging out to the slight loss they reported. In other words, even x2 PCIe lanes (16GT/s) are rarely fully utilized by games, even driving a GTX 1080. Once the API sends its calls to the GPU, it's mostly the power of the GPU itself that determines the rendered FPS gamers care about. Consistent frame times would actually be a better metric than average FPS, while also paying attention to the type of game (e.g. open world) and specific moments in it (e.g. traversing regions of the map vs turning in place).

Coming back to the question of internal displays, we have a clearer guess at what the bandwidth requirements likely are (as your table above lays out). I don't believe there are any compression tricks involved with PCIe, Optimus, or X-Connect; at least, I've found nothing to suggest this, particularly given the latency requirements. So your figures should be accurate... but with the key distinction that all this bandwidth is actually bidirectional. Downstream usage should not affect upstream.

Many Variables

Some systems are wired with x4 lanes (32GT/s, less than 32GB/s) while others have x2 lanes (16GT/s). Many share lanes with other devices routed through the PCH chipset. There are acknowledgment signals and feedback overhead both in the calls to the GPU and in the Optimus video streaming back to the CPU's iGPU. Different APIs like DirectX 12, Vulkan, and Mantle utilize CPU cores differently in preparing those calls, so some games demand quad-core CPUs while others don't. There's also PCIe encoding overhead (~1.5% in 3.0, 20% in 2.0) plus Thunderbolt overhead. Any video through the iGPU is capped at 60FPS. How the actual timing frequencies interact may play a huge role in our results, as we see higher-FPS settings take a greater performance hit over PCIe. And there are likely other factors we're not yet fully aware of, including BIOS and software choices. To the extent possible, we should always encourage testers to document these details.

The eGPU market may be discouraging for the majority of PC gamers focused primarily on the added costs. But I expect that will shift as eGPU prices keep dropping and enclosures become more commonplace. The prevalence of Thunderbolt 3 means many laptop owners will soon have the option to upgrade using cheaper desktop GPUs plus a $300 enclosure (or less). Bringing the low cost and long-term upgrade path of desktops to our portable systems is quite compelling and well worth a few hundred bucks. So too is the idea of converting a non-gaming laptop into a gaming beast later on. I see college students and road warriors flocking to this new option when it's time to upgrade. Consumers just need to be conscious of how much GPU horsepower their games will need given any overhead. That's what I hope we can help clarify here.

NOTE: 3DMark and other tools are available from Humble Bundle for just another 15 hours. Always a great deal! 😉

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

Yup, that Humble Bundle is how I ended up with 3DMark Advanced 🙂

Highly recommended for anyone who is at all into benchmarking.

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


ReplyQuote
nando4
(@nando4)
Noble Member Admin
Joined: 3 years ago
 

@Yukikaze, there will be overhead with any encoding. However, Intel advertises TB3 as a 40Gbps channel. That's fat enough to carry 32Gbps of x4 3.0 traffic.

What I'm sceptical about is Intel's encoding scheme. TB uses 2 TX/RX electrical pairs to transfer 4 TX/RX lanes of PCIe traffic, so Intel has halved the physical data-transmission wiring compared to native PCIe.

Consider: 20Gbps per TB "channel" (lane), as TB3 claims, is more per-lane throughput than the yet-to-be-released PCIe 4.0 is specced at (16Gbps a lane). What magic is Intel doing to out-perform the PCI-SIG consortium's findings?

The same bandwidth underperformance of TB3 occurs on 15″ MacBooks, which host the TB3 chip off the CPU (0:1.x) rather than the PCH.

 

eGPU Setup 1.35    •    eGPU Port Bandwidth Reference Table


ReplyQuote
Yukikaze
(@yukikaze)
Prominent Member Moderator
Joined: 3 years ago
 

nando, I am not sure if you are familiar with high-end networking devices, but 25Gbps on a single link already exists and requires no magic: 25GbE Ethernet (28Gbps line rate, with ECC overhead) over SFP28 connectors is shipping today. The standard most similar to a single TB3 channel is 25GBASE-CR-S, which allows up to 3m of copper cabling. Using a similar kind of PHY to pass a different kind of data would not be terribly difficult.

Want to output [email protected] out of an old system on the cheap? Read here.
Give your Node Pro a second Thunderbolt3 controller for reliable peripherals by re-using a TB3 dock (~50$).

"Always listen to experts. They'll tell you what can't be done, and why. Then do it."- Robert A. Heinlein, "Time Enough for Love."


AquaeAtrae
(@aquaeatrae)
Eminent Member
Joined: 3 years ago
 

Here's some good reading...

It's a bit dated, but here's one of the better explanations of PCIe and Thunderbolt bandwidth, lanes, and overhead. Written in 2013, it compares PCIe 2.0 vs 3.0 and Thunderbolt. The Wikipedia pages fill in updated details, such as how PCIe 3.0 cut its encoding overhead from 20% (8b/10b) to 1.54% (128b/130b). If Thunderbolt uses 64b/66b encoding, its overhead would be about 3% plus a small protocol (PCIe) header.
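The overhead figures above fall straight out of the line-code ratios. A minimal sketch (my own arithmetic, just restating the encodings named above):

```python
# Line-code overhead: the fraction of raw bits spent on the code itself
# rather than on payload. payload_bits/total_bits is the code ratio.

def overhead(payload_bits, total_bits):
    """Fraction of the raw line rate lost to the encoding."""
    return 1 - payload_bits / total_bits

codes = {
    "8b/10b   (PCIe 1.x/2.0)": (8, 10),     # 20% overhead
    "128b/130b (PCIe 3.0)":    (128, 130),  # ~1.54% overhead
    "64b/66b  (10GbE et al.)": (64, 66),    # ~3% overhead
}

for name, (payload, total) in codes.items():
    print(f"{name}: {overhead(payload, total):.2%}")
```

This matches the 20% → 1.54% drop the Tested.com/Wikipedia comparison describes, plus the ~3% figure for 64b/66b.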

Theoretical vs. Actual Bandwidth: PCI Express and Thunderbolt (Tested.com)

Wikipedia: PCI Express   Thunderbolt    64b/66b encoding   Optimus

And here's some other original technical briefs...
NVIDIA Optimus - Truly Seamless Switchable Graphics and ASUS UL50Vf (Anandtech)
Light Peak Overview    Thunderbolt: A Potential High-Speed, Multiprotocol Serial Interconnect    Designing a next-generation video interface with thunderbolt technology

TL;DR — some takeaways...

Note: We frequently misstate bandwidth as "32 Gbps" when it's actually GT/s. The stated gigatransfers per second (GT/s) must be reduced by encoding overhead to determine the actual bandwidth available.

Both PCIe and Thunderbolt bandwidth is definitely bidirectional, with each lane composed of separate upstream and downstream wire pairs. Each direction also carries flow-control traffic for the other, but streaming the finished frame buffers back to the Optimus iGPU really shouldn't tax the GPU copy engine much. Anandtech reports:

If you're worried about bandwidth, consider this: In a worst-case situation where sixty 2560x1600 32-bit frames are sent at 60FPS (the typical LCD refresh rate), the copying only requires 983MB/s. An x16 PCI-E 2.0 link is capable of transferring 8GB/s [16 lanes × 5GT/s × 8b/10b ≈ 8GB/s], so there's still plenty of bandwidth left. A more realistic resolution of 1920x1080 (1080p) reduces the bandwidth requirement to 498MB/s. Remember that PCI-E is bidirectional as well, so there's still 8GB/s of bandwidth from the system to the GPU; the bandwidth from GPU to system isn't used nearly as much. There may be a slight performance hit relative to native rendering, but it should be less than 5% and the cost and routing benefits far outweigh such concerns. NVIDIA states that the copying of a frame takes roughly 20% of the frame display time, adding around 3ms of latency.
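Anandtech's 983MB/s and 498MB/s figures are easy to verify: width × height × 4 bytes per 32-bit pixel × 60 frames per second. A quick check (using decimal MB, as the article does):

```python
# Bandwidth needed to stream uncompressed 32-bit frames back across the
# bus, as in Optimus internal-LCD mode. 1 MB = 10^6 bytes here, matching
# the figures quoted from Anandtech.

def framebuffer_mb_per_s(width, height, fps=60, bytes_per_pixel=4):
    """MB/s required to copy raw frames at the given resolution and rate."""
    return width * height * bytes_per_pixel * fps / 1e6

print(framebuffer_mb_per_s(2560, 1600))  # 983.04 MB/s
print(framebuffer_mb_per_s(1920, 1080))  # 497.664 MB/s
```

Both results land on the article's numbers, so the quoted worst case is straight uncompressed pixel math with no driver-side compression assumed.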

The Tested.com article also references an older TechPowerUp test that showed scaling results similar to today's GTX 1080. Bottom line: GPUs have never fully utilized that much bandwidth, or if they do, it's only very briefly.

 "Our testing confirms that modern graphics cards work just fine at slower bus speed, yet performance degrades the slower the bus speed is. Everything down to x16 1.1 and its equivalents (x8 2.0, x4 3.0) provides sufficient gaming performance even with the latest graphics hardware, losing only 5% average in worst-case. [emphasis added] Only at even lower speeds we see drastic framerate losses, which would warrant action."

In other news, I decided to order a new motherboard and CPU for my desktop machine. I specifically chose the GIGABYTE AORUS GA-Z270X-Gaming 7 over more powerful options because I realized I could arrange PCIe devices on it to limit the GPU to either x16, x8, x4, or x2 lanes for testing. [I could also use a piece of tape.] And of course, it supports Thunderbolt 3 so once I have an eGPU to test against, I should be able to do apples-to-apples testing with my GTX 1080 or 780. 

1 x PCI Express x16 slot, running at x16 (PCIEX16)
* For optimum performance, if only one PCI Express graphics card is to be installed, be sure to install it in the PCIEX16 slot.

1 x PCI Express x16 slot, running at x8 (PCIEX8)
* The PCIEX8 slot shares bandwidth with the PCIEX16 slot. When the PCIEX8 slot is populated, the PCIEX16 slot operates at up to x8 mode.

1 x PCI Express x16 slot, running at x4 (PCIEX4)
* The PCIEX4 slot shares bandwidth with the M2P_32G connector. The PCIEX4 slot operates at up to x2 mode when an SSD is installed in the M2P_32G connector.

3 x PCI Express x1 slots
* The PCIEX1_3 slot shares bandwidth with the SATA3 1 connector. The SATA3 1 connector becomes unavailable when the PCIEX1_3 is populated.

(All of the PCI Express slots conform to PCI Express 3.0 standard.)
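Since all of those slots are PCIe 3.0, the usable bandwidth at each forced width is simple to tabulate. A sketch of my own arithmetic (8 GT/s per lane, 128b/130b encoding; TLP/DLLP protocol overhead ignored):

```python
# Effective PCIe 3.0 payload bandwidth for the lane widths this board can
# force for testing (x16/x8/x4/x2). Protocol-level overhead is not counted,
# so real-world figures will be slightly lower.

GTPS_PER_LANE = 8.0       # PCIe 3.0 raw rate per lane
ENCODING = 128 / 130      # 128b/130b line code

def pcie3_gbytes_per_s(lanes):
    """Usable GB/s for a PCIe 3.0 link of the given width."""
    return lanes * GTPS_PER_LANE * ENCODING / 8  # bits -> bytes

for lanes in (16, 8, 4, 2):
    print(f"x{lanes}: {pcie3_gbytes_per_s(lanes):.2f} GB/s")
```

That puts x4 3.0 at roughly 3.94 GB/s, the same ballpark as the "x4 2.0 equivalents" TechPowerUp found sufficient for near-desktop performance.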

I'm holding out for the ASUS XG Station 2 (not yet released in the USA) and debating different Thunderbolt laptops for my work and play away from home. Since my eGPU will be used in an RV, I hope to utilize the internal display. I remain unconvinced that the reduced x2 lanes of, for instance, the Dell XPS 15 would prevent this at 1080p or even 1440p on the internal display.

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts


AquaeAtrae
(@aquaeatrae)
Eminent Member
Joined: 3 years ago
 

Still looking for ways to measure what's actually happening. I found a few interesting tools that may be worth exploring.

Intel's Performance Counter Monitor (now open-sourced on GitHub)

Previous testing with Nvidia's NVAPI

Pending: Add my system information and expected eGPU configuration to my signature to give context to my posts

