With the release of the Titan V, we have now entered deep learning hardware limbo. It is unclear if NVIDIA will be able to keep its spot as the main deep learning hardware vendor in 2018, and both AMD and Intel Nervana will have a shot at overtaking NVIDIA. So for consumers, I cannot recommend buying any hardware right now. The most prudent choice is to wait until the hardware limbo passes. This might take as little as 3 months or as long as 9 months. So why did we enter deep learning hardware limbo just now?
NVIDIA has decided that it needs to cash in on its monopoly position before the competition emerges. It needs the cash in order to defend itself in the next 1-2 years. This is reflected by the choice to price the Titan V at $3000. With TensorCores the Titan V has a new shiny deep learning feature, but at the same time, its cost/performance ratio is abysmal. This makes the Titan V very unattractive. But because there is no alternative, people will need to eat what they are served – at least for now.
The competition is strong. We have AMD, whose hardware is already better than NVIDIA’s and which plans to get itself together to produce some deep learning software that is actually usable. With this step, the cost/performance ratio will easily outmatch NVIDIA cards and AMD will become the new standard. NVIDIA’s cash advantage will help fight AMD off so that we might see very cheap NVIDIA cards in the future. Note that this will only happen if AMD is able to push forward with good software: if AMD falters, NVIDIA cards will remain expensive and AMD will have lost its opportunity to grab the throne.
There is also a new contender in town: the Neural Network Processor (NNP) from Intel Nervana. With several unique features, it packs quite a punch. These new features make me drool: they are exactly what I want as a CUDA developer. The NNP solves most problems I face when I want to write CUDA kernels which are optimized for deep learning. This chip is the first true deep learning chip.
In general, for a 1-chip vs 1-chip ranking, we will see Nervana > AMD > NVIDIA, just because NVIDIA has to service gaming/deep learning/high-performance computing at once, while AMD only needs to service gaming/deep learning, whereas Nervana can just concentrate on deep learning – a huge advantage. The more concentrated a designed architecture, the less junk is on the chip for deep learning.
However, the winner is not determined by pure performance, and not even by pure cost/performance. It is determined by cost/performance + community + deep learning frameworks.
Let’s have a closer look at the individual positions of Nervana, AMD, and NVIDIA to see where they stand.
Nervana’s Neural Network Processors
Why @NaveenGRao is excited about #Intel Nervana NNP: https://t.co/DAqgOYFtoR #AI pic.twitter.com/dbLMdnp63t
— Intel AI (@IntelAI) October 31, 2017
Nervana’s design is very special, mainly due to its large programmable caches (similar to CUDA shared memory), which are 10 times bigger per chip compared to GPUs and 50 times bigger per compute unit compared to GPUs. With this, one will be able to design in-cache algorithms and models. This will speed up inference by at least an order of magnitude, and one will be able to easily train on terabytes of data with small in-cache deep learning models, say, a multi-layer LSTM with 200 units. This will make this chip very attractive for startups and larger companies. Due to a special datatype, Flexpoint, one is able to store more data in caches/RAM and compute faster, yielding even more benefits. All of this could mean a speedup of about 10x compared to current NVIDIA GPUs for everybody. But this only holds if the main obstacles can be overcome: community and software.
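To get a rough feel for how small such an in-cache model is, here is a back-of-the-envelope sketch of the weight footprint of the LSTM mentioned above. The layer count (3), an input size equal to the hidden size, one bias vector per gate, and 2 bytes per weight are all assumptions chosen for illustration. The actual cache size of the NNP is not given here, so the sketch makes no claim about what fits.

```python
def lstm_layer_params(input_size, hidden_size):
    """Weights of one standard LSTM layer: 4 gates, each with input
    weights, recurrent weights, and one bias vector (assumption)."""
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    return 4 * per_gate


def lstm_footprint_bytes(num_layers, input_size, hidden_size, bytes_per_weight):
    """Weight storage of a stacked LSTM: the first layer sees the input,
    every later layer sees the previous layer's hidden state."""
    total = lstm_layer_params(input_size, hidden_size)
    total += (num_layers - 1) * lstm_layer_params(hidden_size, hidden_size)
    return total * bytes_per_weight


# Assumed setup: 3 layers, 200 inputs, 200 hidden units, 16-bit weights.
params_per_layer = lstm_layer_params(200, 200)
footprint = lstm_footprint_bytes(3, 200, 200, 2)

print(params_per_layer)              # weights in one layer
print(round(footprint / 1024**2, 2)) # total footprint in MiB
```

Under these assumptions the whole model needs on the order of a couple of megabytes, which is the scale at which keeping all weights in on-chip memory starts to look plausible.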
For the normal users and researchers, it will all depend on the community. Without community, we will not see in-cache algorithms. Without community, we will not see good software frameworks and it will be difficult to work with the chip. Everybody wants to use solid deep learning frameworks and it is questionable if Neon, Nervana’s deep learning framework, is up for the task. Software comes before hardware. If Nervana only ships pretty chips and does not push the software and community aspect effectively it will lose out to AMD and NVIDIA.
The community and software question is tightly bound to the price. If the price is too high, and students are not able to afford the NNP then no community can manifest itself around it. You do not get robust communities by just catering for industry. Although industry yields the main income for hardware companies, students are the main driver for the community. So if the price is right and students can afford it, then the community and the software will follow. Anything above $3000 will not work out. Anything above $2000 is critical and one would require special discounts for students to create a robust community. An NNP priced at $2000 will be manageable and find some adoption. Anything below $1500 will make Nervana the market leader for at least 2-3 years. An NNP at $1000 would make it extremely tough for NVIDIA and AMD to compete — software would not even be a question here, it follows automatically.
I personally will switch to NNPs if they are priced below $2500. They are just so much superior to GPUs for deep learning and I will be able to do things which are just impossible with NVIDIA hardware. If they are over $2500 then it also reaches my pain point for good hardware. I save up a lot of money to buy hardware — good hardware is just important to me — but I have to live from something.
For ordinary consumers, not only the price will be important, but also how the community is handled. If we do not see Intel immediately pumping resources into the community to start up a solid software machinery, then the NNP is likely to stagnate and die off. Unfortunately, Intel has a long history of mismanaging communities. It would be a shame if this happens, because I really would like to see Nervana succeed.
In summary, Nervana’s NNP will emerge as a clear winner if it is priced below $2000 and if we see strong community and software development within the first few months after its release. With a higher price and less community support, the NNP will be strong, but might not be able to surpass other solutions in terms of cost/performance and convenience. If the software and community efforts fail or if the NNP is priced at $4000, it will likely fail. A price above $2000 will require significant discounts for students for the NNP to be viable.
AMD: Cheap and Powerful – If You Can Use It
AMD’s cards are incredible. The Vega Frontier Edition series clearly outmatches its NVIDIA counterparts, and, from unbiased benchmarks of Volta vs Pascal, it seems that the Vega Frontier will be on a par with or better than a Titan V if it is liquid cooled. Note that the Vega is based on an old architecture while the Titan V is brand new. The new AMD architecture, which will be released in 2018Q3, will increase performance further still.
AMD hopes to advance deep learning hardware by just switching from 32-bit floats to 16-bit floats. This is a very simple and powerful strategy. The chips will not be useful for high-performance computing, but they will be solid for gamers and the deep learning community while development costs will be low because 16-bit float computation is straightforward.
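What the switch from 32-bit to 16-bit floats trades away can be seen in a few lines. This is only an illustrative sketch using Python's standard `struct` module (the `e` format code is IEEE 754 half precision, `f` is single precision). It says nothing about how AMD's hardware implements 16-bit arithmetic, and the example values are my own.

```python
import struct


def to_half_and_back(x):
    """Round a Python float to IEEE 754 half precision (16 bit) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single_and_back(x):
    """Round a Python float to IEEE 754 single precision (32 bit) and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Storage per value: half as many bytes, so twice as many values fit
# into the same cache or RAM.
half_bytes = struct.calcsize("<e")
single_bytes = struct.calcsize("<f")
print(half_bytes, single_bytes)

# Precision: half precision has an 11-bit significand, so not every
# integer above 2048 can be represented exactly any more.
print(to_half_and_back(2049.0))
print(to_single_and_back(2049.0))

# Rounding error for a typical small weight value.
half_error = abs(to_half_and_back(0.1) - 0.1)
single_error = abs(to_single_and_back(0.1) - 0.1)
print(half_error, single_error)
```

The halved storage is the benefit; the coarser rounding is the cost that training code has to tolerate.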
They will not be able to compete in terms of performance with Nervana’s NNP, but the cost/performance might outmatch everything on the market. You can get a liquid cooled Vega Frontier for $700 which might be just a little worse than a $3000 Titan V.
The problem is software. Even if you have this powerful AMD GPU, you will hardly be able to use it – no major framework supports AMD GPUs well enough.
AMD is in limbo itself – in software limbo. It seems they want to abandon OpenCL for HIP but currently they officially still push and support the OpenCL path. If they push through with HIP, and if they put some good deep learning software on the market (not only libraries for convolution and matrix multiplication but full deep learning frameworks, say, HIP support for PyTorch) in the next 9 months then their release of their new GPU in 2018Q3 has the potential to demolish all competitors.
So in summary, if AMD gets its shit together in terms of software, it might become the dominating deep learning hardware solution.
NVIDIA: The Titan
NVIDIA’s position is solid. They have the best software, the best tools, their hardware is good and the community is large, strong and well integrated.
NVIDIA’s main issue is that they have to serve multiple communities: high-performance computing people, deep learning people, and gamers, and this is a huge strain on their hardware. It is expensive to design chips which are custom made for these communities, and NVIDIA’s strategy so far has been to design a one-size-fits-all architecture. This worked until it didn’t. The Titan V is just mediocre all-around.
With the emerging competitors, NVIDIA has two choices: (1) push the price of their cards down until they starve the competition to death, or (2) develop specialized deep learning hardware of their own. NVIDIA has the resources to pursue the first strategy, and it also has the expertise for the second strategy. A new design, however, will take some time and NVIDIA might lose the throne to another company in the meantime. So we might see both strategies played out at once: starving competitors so that NVIDIA can compete until their own deep learning chip hits the market.
In summary, NVIDIA’s throne is threatened, but it has the resources and the expertise to fight against emerging players. We will probably see cheaper NVIDIA cards in the future and chips which are more specialized for deep learning. If NVIDIA does not lower its prices it might (temporarily) pass the throne to another player.
Deep learning hardware limbo means that it makes no sense to invest in deep learning hardware right now, but it also means we will have cheaper NVIDIA cards, usable AMD cards, and ultra-fast Nervana cards quite soon. It is an exciting time and we consumers will profit from this immensely. But for now, we have to be patient. We have to wait. I will keep you updated as the situation changes.
Kremena Gocheva says
Great article with predictions coming true… and also sort of catchment for commenting new developments by some knowledgeable readers!
I have been following the comments for several months now and would like to hear your initial thoughts on the RTX series. The fact that AI is mentioned merely as a tool for raytracing and there’s no emphasis on CUDA cores and tensor cores in the consumer cards seems to suggest there is no big productivity gain to highlight there in terms of deep learning performance. More so since NVIDIA apparently promotes DGX-2 stations for training and RTX for inference only. If this proves to be the case, it would be a trifle disappointing… Can you tell from the specs shared so far, what the RTX series may turn out to mean for low budget DL?
Also, I wonder if you or any of the readers have experience with deep learning using the single precision and half precision announced? Will it, for example, mean that models have to be optimized in new ways, and the older ones would need to be rewritten for training in such an environment?
Tim Dettmers says
For my thoughts on the RTX cards you might want to read my newly updated GPU recommendation blog post.
Hamir Shekhawat says
I just completed my bachelor’s and have dived pretty deep into machine learning. I am tired of using AWS and really want to switch to personal hardware now. How long do you suggest waiting to buy one? The prices of GPUs have come down almost to their MSRP and my budget allows for one 1080 ti currently. I wanna start learning Reinforcement Learning and want to work on DeepMind’s Lab and research papers, but time is of the essence. What do you suggest I should do, and is buying a 1080 ti right now future proof?
P.S. I live in India and since all electronics are imported here, everything is ~22% more expensive and it takes time for new hardware to come to India. Shipping from US is not an option. Life in India is pretty hard when it comes to computer components (EXPENSIVE!)
Tim Dettmers says
This is not an easy situation. It’s difficult to predict how prices will change when the GTX 1180 is released. There were rumors that NVIDIA produced way too many GPUs (due to cryptocurrency demand) and cannot get rid of them, so GPU prices might fall further. However, as you say, time is of the essence.
I am still running a Maxwell Titan X which is already 3 years old and still is running quite okay. I guess you could get 2 years runtime from a GTX 1080 Ti before it becomes slow. However, you probably could get 4-year runtime out of a GTX 1180 if you wait for another month.
Another solution might be to rent a Hetzner GPU server (cheaper than AWS if you have them up all the time) for a few months until the Titan X Turing (February 2019 maybe?) or GTX 1080 Ti hits the market (???).
In the end, it is personal preference. You have to decide if you want to have a cheap GPU now for 2 years, or an expensive GPU later for 4 years.
Just been reviewing your comments over the winter and just as we thought: Amd and Intel are nowhere to be seen advancing deep learning in affordable hardware and most importantly, software that is better than CUDA!!! Nvidia is still the king of deep learning hardware for the foreseeable future…
Can’t wait to order a Titan RTX!!
Tim Dettmers says
I agree, neither AMD nor Intel are currently a threat to NVIDIA.
Did AMD announce any new neural network stuff at Computex this year?
Tim Dettmers says
They released only limited information. It seems they will have a GPU with 32 GB RAM, 1 TB/s bandwidth and something similar to TensorCores. But it was also mentioned that this card will be “expensive” and I am not sure if it will be interesting for most deep learning researchers.
Thomas Brittain says
I finally have a use case for training my own models. Started looking at rigs and was blown away by the market–stuff I was buying in 2016 is now 150% what I paid for it back then?
This article articulated the conclusion I was coming to when I found it. Thank you.
Have been using the Titan Xp gpu for my Monte Carlo based simulations and this card is quite the beast for what I am doing. Code running on the Titan Xp is at least 30% faster than the 1080Ti I have as well!!!
Yes the price is very hard to swallow but at least for the work that I am doing, Titan Xp is definitely the way to go.
I truly hope that Nvidia will bring out a Titan Xv card when the volta based gaming cards are released by year’s end. I am also sure that there will be a new Titan V card released by the end of the year as well (higher speed, 16GB memory…)
Tim Dettmers says
Thanks for those numbers. It is always nice to see some hard numbers for applications — it is usually difficult to come by those numbers!
The only reason I am able to get those performance numbers for the Titan Xp over the 1080ti is that I can use the full 12GB on the Titan Xp, compared to just under 10GB on the 1080ti, because I am able to use the Titan Xp with the tcc driver. The windows device driver always allocates twice as much memory as the Titan Xp in tcc mode. The Titan Xp also uses bidirectional dma transfers, which is noticeable when you are reading/writing multiple gigabytes from a gpu card per second. I am able to do twice as many simulations using 12GB of memory as compared to the 11GB of memory on the 1080ti, and that definitely makes a difference if you want to do any real-time analysis.
Really hoping that a Titan Xv card is in the works by the end of the year…
Tim Dettmers says
Thanks for your feedback. I did not know that the Windows driver behaves in that way. Interesting!
Michael R Bernstein says
Have you heard or seen anything indicating whether the dedicated cryptocurrency mining GPU cards (which are intended to be less expensive than gaming cards) will be useful and cost-effective for deep learning purposes?
 Source: https://wccftech.com/exclusive-convert-cheaper-nvidia-mining-cards-to-full-graphics-cards/
Tim Dettmers says
Currently, there is no hard data out there so it is very difficult to say. I will write an update once more information is known.
You have not updated this column in months??
Maybe you have come to the conclusion that Nvidia will totally dominate AI hardware/software using GPU’s based on CUDA for the foreseeable future…
Anyone else have any comments regarding AMD/Intel vs NVidia with their respective hardware/software solutions??
Tim Dettmers says
Currently, it looks like that. AMD has not come up with any significant improvements in their software, and Intel did not release the Nervana NNP and seems to falter in general due to the poor decisions made in the past year. Currently, there is nothing that stops NVIDIA and they will easily be able to expand their monopoly aggressively. As a result, regular users will suffer. TPUs could help here if more people use them so that they get more cost-effective, but while cost-effective, the initial price is just too high for most people, so TPUs also do not seem to be an option.
It is very depressing at the moment. There will be no development for regular users for many months to come and the developments that we will see will be very incremental since NVIDIA cannot make money from everyday users.
I agree too that it is depressing that the only solution to Ai/HPC development is thru NVidia…
I have been using the 1080Ti for almost 6 months and I just upgraded to Titan Xp, as it makes no sense for me to upgrade to a TITAN V even from TITAN Xp: the V is at least twice as expensive for a 30-40% gain in performance at the most!!! This is once again the curse of having no real competition, which leaves a de facto monopoly to Nvidia to charge what the market can bear without a serious reply from AMD/Intel.
I hope Nvidia keeps the TITAN Xp in production until there is another TITAN version which is price/performance competitive with Xp
I am expecting that Nvidia will release an updated TITAN Xp (Xv??) that will not have the ai functionality but will have pure single float performance as its highest goal.
The 16bit float performance in the TITAN V is impressive but like the Nervana processor, is hard to exploit as compared to 32bit float performance.
I can now fully max out the performance on the TITAN Xp using CUDA 8.0 and allocating almost the full 12GB of memory. I need about a 30 fold increase in performance and then I can do real-time options forecasting. Hope all I need is 4-8 next generation TITAN Xv’s which will bring me under 5 second processing, which is close enough for me.
Hope that Nvidia is listening to the needs of its low end users…
Tim Dettmers says
I feel your pain. Exactly my thoughts. Let’s see where this is going. I hope NVIDIA will do something to not disgruntle all of its ordinary AI/compute users.
Trevor Tanner says
I just bought a tablet that barely came out at the end of March (HP’s Spectre x360 15″) and selected the NVIDIA MX 150 over the RX Vega M GL solely because I want to optimally code CUDA cards that litter academic HPCs. It breaks my heart because the Vega has ~2x-5x better performance than the MX 150…
NVIDIA started working on their next gen deep learning focused chipset 3 years ago. The reason they haven’t released it is because they don’t need to. There’s no competition. Also, there’s no rush. While Intel and AMD are coming, cost is not the only decision-making element. In addition, for enterprise companies buying thousands of NVIDIA cards, those sales contracts are set and signed for 2018, so Intel and AMD won’t kick in until 2019 FY. And then they need to fight an uphill battle while NVIDIA releases their new chipset. I’m not saying Intel and AMD won’t grow their share in the dedicated deep learning market, but that the market itself will be in expansion for the next many years and so everyone gets a bigger slice of the pie, but NVIDIA has the advantage and will continue to outperform Intel and AMD in sales for years to come. Now, if you are a dev or small shop, sure, you can start buying this competing hardware this year, but NVIDIA doesn’t make their money by selling to indies, but by selling cards by the thousands to the enterprise.
Tim Dettmers says
It is not about taking the economic lead in 2018, but the lead in growth. The company that grows fastest will out-compete others over time. You will not feel this growth in sales contracts but by the usage of the scientific or deep learning community. These communities establish the software and tools and then the contracts follow. This is not true in all cases, but in most of them and probably adoption growth and retention rates are the strongest indicators for long-term success. Thus a strong 2018 can be a very clear indicator for success even though the growth in revenue lags behind.
Let’s call a spade a spade: NVidia wandered accidentally into the deep learning bonanza by developing ridiculously fast GPU cards that also doubled as fast floating point machines.
Once NVidia got into the single TFlops, it now became relatively cheap to run deep learning algos in a reasonable amount of time. Most of these deep learning algos have been around for many decades via neural networks. NVidia jumped into the deep end with their Tesla card which evolved from their gaming cards.
There are over 1 billion PC gamers in the world today as well as millions of crypto currency miners with many if not most using NVidia gpu cards.
The above is the VERY statement of economic lead which only comes from growth!!!
AMD and Intel have to bring out hardware that is an order of magnitude faster/cheaper than NVidia WITH no reply from NVidia in order for them to get a toehold in the deep learning market.
Gaming cards will continue to power NVidia gpu’s which will in turn allow NVidia to generate a faster GPU architecture EVERY year for a long time...look at Intel..I remember when there were so many risc cpus that were superior in execution to the 8086 with only one glaring problem: inability to run Windows faster than Intel. Nearly 100% of cloud data centers run Intel Xeon chips (8086 based). Linux is great but is only remotely competitive in the server arena or cell phones, which by the way is dominated once again by Apple if you use profits as a metric.
Whoever makes the biggest bucks , sets the future when economics competes with technology…
Asaf Oron says
Just bought a brand new 1080Ti and started running some tensorflow models with it. It seems to run ok but I expected more. I believe it is a problem other people have encountered as well: how can we tell if our gpu is running properly and if tf is using it optimally or not, a sort of gpu deep learning debugging. Is there a simple, straightforward way to do so? It would be great if you could shed some light on this.
Tim, have you noticed the recent announcements of the Intel CPU with on-package AMD Radeon GPU (11 CU’s, comparable to GTX 1050) and also new AMD APU’s with Ryzen cores and also Vega graphics? Those are coming to small PC’s and laptops, that are much more attractive to apartment-dwelling hobbyists than some big desktop PC. (My only home computer right now is a laptop, with my big computing done on remote servers).
What should it really take to do interesting machine learning on Vega? Does HIP not actually work? If it does, should porting CUDA code be a big problem? I’ve done enough C++ that I think I could hack on CUDA code, and am ok with doing some if it would be useful to ML people. Thanks!
Nice article. Several points though:
1. Titan V is too expensive. We are in limbo because the true valuation of a GPU is derived from memory bandwidth (GB/s per dollar). But the market likes to price GPUs at teraflops per dollar, hence the creation of TensorCores to make a GPU 110+ TFlops, when in practice it only gives CNNs a 3.5 times speedup and LSTMs a 2 times speedup. So in practice, a Titan V is roughly 2x 1080Tis for LSTM and 4x 1080Tis for CNN if and only if you switch to FP16 training, which means code modification. So one Titan V is at price linearity against 4x 1080Ti only for CNN work, and only if you already train at FP16 (which is a hassle, and convergence is crappy and erratic). If you do LSTM or RNN work, I would rather wait for 4x next 384bit GDDR6 @16GT GPU Ampere card. Titan V chopped off 1 HBM stack (losing that much memory bandwidth and capacity is huge), and chopping off the 2 NVLINKs just ticked me off for a $3000 GPU. (Hint: Nvidia, if you are reading this, give us the NVLINKs, and $599 per NVLINK bridge is stupid.)
2. There is no need to do OpenCL, AMD ROCm 1.7 already does Tensorflow. But it is really slow because there is no equivalent of cuDNN. As far as I understand, if they really did the HIP route, which is CUDA conversion and emulation, they might run into legal issues with Nvidia. AMD just doesn’t get it. HBM2 supply is totally wasted on AMD’s crappy software support.
3. As for Intel NNP. You are dreaming for it to be priced at $2500. That’s 2.5 dollar per GB/s bandwidth, and $80 per GB HBM2. That’s not going to happen. In fact V100/Titan V’s HBM is priced right now at 250dollars per GB(Titan V) and 300+dollars for V100(assuming $5000 cloud volume discount). I expect the Intel NNP will MSRP at $8000 and with volume discount to $5000. Training will require a lot of code modification at the Tensorflow/CNTK/MXNET level to be useful. In fact I do see people use it, for inferencing of large models(16-32GB) that used to require DGX-1 or CPU cluster to train. As for training, the effort might not be worth it if it is priced at $5000-$8000.
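The per-gigabyte figures in point 3 can be reproduced with simple division. The sketch below is only a sanity check of that arithmetic. The memory sizes are assumptions: 12 GB for the Titan V and 16 GB for the V100 match the published specs of those cards at the time, while 32 GB and 1000 GB/s for the NNP are merely what the commenter's own ratios imply, not confirmed specifications.

```python
def dollars_per_unit(price_usd, units):
    """Price divided by a capacity figure (GB of memory or GB/s of bandwidth)."""
    return price_usd / units


# name: (price in USD, HBM2 memory in GB); NNP numbers are hypothetical
cards = {
    "Titan V": (3000, 12),
    "V100 at an assumed $5000 after discount": (5000, 16),
    "NNP at a hypothetical $2500": (2500, 32),
}

for name, (price, memory_gb) in cards.items():
    print(f"{name}: ${dollars_per_unit(price, memory_gb):.0f} per GB")

# Bandwidth implied by the "2.5 dollar per GB/s" figure (assumption).
nnp_bandwidth_gb_per_s = 1000
nnp_dollars_per_gb_per_s = dollars_per_unit(2500, nnp_bandwidth_gb_per_s)
print(nnp_dollars_per_gb_per_s)
```

With these inputs, the Titan V and V100 land at the quoted 250 and 300+ dollars per GB, while a $2500 NNP would sit at roughly a third of that, which is the gap the commenter considers unrealistic.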
Tim Dettmers says
These are really good points. Thank you for your comment!
1. Is pretty much true. It seems that NVIDIA took a bet that theoretical 110 TFlops would be a good marketing gimmick that increases sales — and they were right.
2. That is a good point. I did not think of the legal issues before. With the new CUDA licenses, HIP might be in danger.
3. I fully agree. I do not think a $2500 NNP will happen either, mainly due to the HBM2 memory. I think they could have created an ultra-cheap, good performing chip if they had chosen GDDR5 along with their 28nm. With huge caches, those cards would have been able to operate quite competitively and be priced ridiculously low. But of course, Intel knows the money is in data centers and it optimized for that. Maybe this naive short-term view was their own death sentence; let’s see how it plays out.
Very nice comments!!!
I too do not see any real competition for Nvidia unless you count the next version of their GTX card…very sad for Amd/Intel.
I really think both of these companies have missed the boat as software is king and CUDA is milk to the masses right now. As you know, once a technical framework/sdk gets ridiculously popular or embedded into working production solutions, hardware improvements alone will not dispose the king.
The reality is Nvidia is doing a great job with their gamer cards, which are almost as fast as the Tesla cards but run the same CUDA dev environment. I don’t know how Amd/Intel can beat this combination in the near future. Intel NNP looks dead on arrival to me for a few more years, by which time NVidia will have a GPU version which will be just as fast. Nvidia is using a page out of Intel’s history book in that you don’t need the fastest processor but the fastest processor which runs the most binary compatible software.
I am currently running a 1070ti and a 1080ti for my Monte Carlo simulations, and the 1070ti runs my code at approximately 90% of the 1080ti for 50% of the cost. For me to do realtime options pricing, I need to run at least 80 gpu cards, which is still many thousands of dollars even using 1070ti cards, but the future looks really good and therefore I will continue to use Nvidia gpu cards/CUDA for the foreseeable future and continue to follow with much interest the hardware designs for deep learning to piggyback my options pricing code on…
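The 1070ti versus 1080ti comparison above boils down to a performance-per-dollar ratio. A minimal sketch, using only the two relative figures from the comment (90% of the speed at 50% of the cost); the function name is made up for illustration:

```python
def relative_perf_per_dollar(relative_perf, relative_cost):
    """Performance per dollar of a card relative to a baseline card,
    where the baseline has relative_perf = 1.0 and relative_cost = 1.0."""
    return relative_perf / relative_cost


# 1070ti relative to the 1080ti, figures taken from the comment above.
ratio_1070ti = relative_perf_per_dollar(0.9, 0.5)
print(ratio_1070ti)
```

A ratio above 1 means the cheaper card delivers more simulation throughput per dollar for this particular workload, which is why it matters once the card count runs into the dozens.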
Very detailed article you wrote here but what is your take on the specifics of the Titan V (vs Titan Xp or GTX 1080Ti): that is (1) its Tensor Cores and (2) its Mixed Precision capability.
Why do you think these two specifics are a waste, for anyone using PyTorch or TensorFlow today or in 2018 ?
Tim Dettmers says
More deep learning specific features, such as larger caches, are very good for deep learning, but we will not see such features because they are useless for gaming or high-performance computing. TensorCores are quite good for deep learning, but they merely strike a balance between deep learning/HPC/gaming, they are not deep learning centric features. The same goes for mixed precision. AMD and NVIDIA do currently not have deep learning centric hardware. Thus, these features can be seen as a waste compared to fully deep learning centric features.
Dana Ludwig says
Tim, thanks for the great update! It seems the second shoe has dropped and the Titan V pricing is now just a footnote. A few days ago, NVidia restricted CUDA license such that GeForce and Titan V cards are forbidden from being deployed in a data center. So, e.g., Baidu will have to rebuild their data center and we will have to keep our cubicles warm and noisy with multiple-GPU mini-towers. Worth another blog post?:
Tim Dettmers says
Hey Dana! Indeed, it is quite shameful that NVIDIA decided to take this step. I know that in the past this was enforced implicitly since companies were not able to buy GTX cards even if they wanted to (Microsoft/HP/Cray wanted to buy GTX cards, but NVIDIA said that they cannot sell them for datacenters), but now it is made explicit. I would hope that this gives more incentive for people to develop OpenCL code. We will see! I was also thinking of buying an AMD card and getting comfortable with OpenCL. Developing an OpenCL based deep learning framework might be a good exercise for my PhD which I will start next year, so maybe I will do that!
Ok…so they still have tesla’s that you can still program in the cloud and how expensive can this really be!!! If you cannot afford to use Tesla’s in the cloud or at least ONE Gtx card in your desktop or a laptop with a Gtx chip, how in the world do you expect to do any serious deep learning!!! AMD has nothing to do with whether you can use their cards in the cloud and Nvidia has every right to restrict data centers to using proper data center type cards for multi-user use…would you like them to use threadripper processors in the data centers too, overclocked to the max because you cannot afford a gaming machine. I think you need to make some real investment money wise if you are serious about deep learning otherwise forget it…Intel has developed the Nervana chip for serious development work first, then if they are successful, I am sure they will bring out a mass-produced cheaper version. For now realistically, Nvidia is the go-to deep learning company until AMD/Intel get their act together more seriously
Tim Dettmers says
Tell someone in a developing country who can save $1000 a year to buy a $3000 GPU or to rent a $2/h machine. It is just not a solution that works for 90% of people. Deep learning should be for everyone and not only for the privileged few. NVIDIA’s policies restrict usage only to those which are privileged. This is okay for NVIDIA because the top 10% have more money than the bottom 90%, but this is bad for most students or anybody in a developing country.
You really think that using a Titan V compared to a Tesla V100 over the cloud is going to be way cheaper? I think you don’t know the overall cost of data centers, as a Titan V may be a lot cheaper than a V100 card but usually you are paying for the whole VM and not just the cards. Microsoft Azure will soon offer V100 vm’s which will also include infiniband access and I am sure the networking cost is not cheap. This could be an affordable solution for many but I am sure the overall cost difference in the cloud for a Tesla v100 vs a Titan V is small. Developing deep learning applications is not going to be cheap regardless of your economics when you need some serious hardware to research on. Economics is a fact of life and I definitely do not feel I am privileged here in Canada. I am very appreciative of where I live but I do work very hard and I know deep learning is not cheap. When I had no money during college, I remember time sharing on a mini computer shared with over 600 people and this was before the use of PC’s…my tuition fee was not cheap. The PC hardware for gaming is a good starting point for deep learning but Nvidia/AMD and Intel are developing hardware to make a profit. I am sure AI costs lots of money in China as well. As I keep saying, Nvidia is overall the cheapest entry into deep learning with their CUDA environment regardless of whether you purchase the hardware or use it in the cloud, but until they get serious development environment competition, the cost of developing/researching deep learning will be quite high.
Keep up the good work Tim, as I am still looking for cheaper solutions too…
Francis Hillman says
Have you checked out vertex.ai’s attempt at Keras on opencl (http://vertex.ai/blog/announcing-plaidml)? It’s still far from complete but an interesting first approach using an intermediate language they made themselves.
About the tensor cores in the Titan V: AnandTech’s preview shows that they really do deliver in raw matrix multiply performance. Maybe more library optimization is needed?
Tim Dettmers says
The reported Titan V speeds on matrix multiplication are for very large matrix sizes that you will never see in LSTMs. So for such problems the Titan V is fast, but not so much for matrix multiplication in deep learning.
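As a rough illustration of why matrix size matters here, the sketch below estimates arithmetic intensity (FLOPs per byte moved) for a matrix multiplication. The function name and both example sizes are assumptions picked for illustration (a batch of 64 with 1024 hidden units and four gates for the LSTM-like case, an 8192 square matrix for the benchmark-like case); they are not taken from any benchmark. It also assumes each matrix is read or written exactly once in 16-bit precision, which is an idealization.

```python
def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for C = A @ B with A of shape (m, k) and B of shape (k, n).

    Idealized: A and B are each read once and C is written once.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical LSTM-like step: batch 64, hidden size 1024, 4 gates.
lstm_like = matmul_arithmetic_intensity(m=64, k=1024, n=4 * 1024)

# Hypothetical benchmark-like case: large square matrices.
benchmark_like = matmul_arithmetic_intensity(m=8192, k=8192, n=8192)

print(f"LSTM-like intensity:      {lstm_like:.1f} FLOPs/byte")
print(f"Benchmark-like intensity: {benchmark_like:.1f} FLOPs/byte")
```

The smaller the ratio, the more time the card spends waiting on memory rather than computing, so peak FLOPS figures measured on very large matrices say little about this regime.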
Benedikt S. Vogler says
Thanks for the great post. Can you please link to some resource with more information on the TensorCores (or maybe add a link in your blog post)? I don’t follow the hardware market that closely, so this is the first time I have heard of them.
Rene Schaub says
I disagree with this assessment. AMD has, at best, marginally better hardware and no huge price advantage. This won’t keep anybody from staying invested in CUDA. I also disagree on performance. At the end of the day it comes down to the total number of transistors (that can be engaged in parallel computation). This is mostly a function of production scale. Whether a framework allows easier cache computations is irrelevant as 99% of researchers don’t write kernels. Cache is never going to replace memory for training data, so I don’t know what the author means by algorithms in cache; the kernels already do that. Specialized hardware hasn’t really provided a huge boost to speed up the whole deep learning pipeline yet; the 5-10x improvements are localized and not throughout. The gaming market is the best driver for mass market scale and I don’t see that relationship changing anytime soon.
Tim Dettmers says
The GTX cards are on a par with respect to cost/performance, I would agree. However, this might change if NVIDIA decides to not put TensorCores into GTX cards. AMD cards are much more cost-effective when compared to Titan or even Tesla; there is no way you can deny that.
99% of researchers do not write CUDA code, true, but 99% use said code. It only takes one person to write those algorithms and millions of researchers will use them. It is bold to say that no one will develop these algorithms. I, for one, will certainly develop those kernels if the Nervana NNP is cheap.
We have not seen specialized hardware yet — do not be fooled by the name TensorCores. What they really are, are a trade-off between HPC and deep learning functionality and not real specialization. Since convolution and matrix multiplication are about 85-95% of all computation in deep learning one only needs to improve those operations to get large benefits. Nervana’s NNP will be quite good at those operations because it has been designed with matrix multiplication and convolution in mind.
Meanwhile there’s a new NVidia policy that GTX drivers aren’t allowed to be used in data centers except for coin mining(!). I can’t really run a noisy power hungry PC at home (just a laptop) so any graphics card I use will have to be online (dedicated server, not necessarily cloud instance) or colo, so that limits to AMD or the expensive Tesla/Titan V cards. That’s part of what interests me in AMD. Sadly I currently don’t know anyplace to use online AMD servers…
Tim, do you know anything about Google TPU’s? How do they compare to Nervana, Volta so-called tensor cores, etc.? Thanks.
Tim Dettmers says
TPUs are actually not that practical for consumers. TPUv1 is just for inference, and while TPUv2 is for training, in terms of raw performance a GPU will still outmatch it. Where TPUv2 shines is performance per watt. For Google, the electricity bill is a huge problem and not so much how much hardware you have. So while for a normal user having 3 computers with 8 or more TPUs would be impractical, this would be just 3 more servers in a large grid for Google. For consumers, a single computer with standard hardware and with GPUs will always be more convenient and usable. This is similar to how software is important to use the hardware: in this case, you need special hardware to use the TPU and this is a big barrier for everyday consumer usage. I do not think you will see consumer TPUs anytime soon.
I beg to differ… anyone who is researching or writing deep learning applications is actually directly using CUDA to do this… I am sure they have a software department that has already blessed the CUDA environment. CUDA has been around for over 10 years and is way too entrenched to be challenged by AMD anytime soon, even if they may have superior price/performance hardware.
Now, Intel is notoriously bad at writing SDK frameworks that take advantage of their hardware when targeting the masses, and I would not trust anything from them for at least 2 more years. Once Intel starts developing GPU cards similar to AMD/Nvidia then they will be competitive. I looked at AVX-512 code and it is ridiculously hard to program effectively in C compared to writing the same functionality in CUDA. If Intel does not make it easy to write software to take FULL advantage of the Nervana hardware then the NNP will be just another gigantic fail. Nvidia is firing on all cylinders, with the core money coming from GTX gamer cards that are everywhere and relatively easy to program with CUDA. Remember, it took many years for CUDA to take off.
My flat out general statement would be that 99% of researchers have access to Windows and once you use Visual Studio with the latest CUDA sdk, you will develop your solution in CUDA.
I really like what Intel is doing but this is vaporware until I can order hardware and develop a mission critical solution like I can with CUDA….software is king NOT hardware!!
I agree with this comment, as I think AMD has 24-36 months to stay financially alive compared to Nvidia or Intel. Great hardware does not always produce a great company, as great companies definitely outlast great hardware (e.g. Intel x86).
The CUDA software development environment, on Windows at least, makes it incredibly easy to develop ridiculously fast software which is good enough for 99% of deep learning needs. I do like what AMD is doing but I fear it may be too little too late.
Like I said in my previous comment, I will be sticking to CUDA for my Monte Carlo simulations and have no interest in porting to OpenCL… AMD would be smart to directly support CUDA, but I can’t see Nvidia EVER making the CUDA runtime environment open source. I feel that in reality AMD has lost the software race and therefore their hardware has to be ridiculously faster/cheaper than Nvidia hardware… this is basic software development economics. I would easily buy a Tesla card, as developing software is way more expensive than buying the hardware and will continue to get more expensive when advancements such as tensor cores become more popular.
All Nvidia has to come out with is a headless GTX card with 32-bit compute only for a ridiculously low price and it’s game over for AMD…
Bill Ross says
Is the Nervana chip suitable for currency mining? Whereas ML drove up GPU prices for other branches of science , it looks like crypto currency mining is the gorilla in the room for ML, price-wise, but it is driven by a possible bubble, though I wonder if with the US abdication on the world stage, crypto currencies might plausibly replace the dollar in trade? Seems obvious, but I haven’t heard it mentioned, so maybe I’m just out of the loop, or it’s patently impossible.
Tim Dettmers says
I think for many professional miners ASICs are more attractive than GPUs. GPUs are attractive for new currencies that require unusual hashes. If these hashes can profit from 16-bit computation, then the Nervana NNP could be quite useful, but I do not think it would be more useful than, say, an AMD card with high 16-bit FLOPS. I think most hashing algorithms will not be able to take advantage of the cache.
Before Bill Gates founded Microsoft, he voiced his concern at this “software problem” to a computer hobbyist forum – “everyone wants good software but nobody is willing to pay for it…” You have to give him credit at how intensely his solution addressed this… but there’s still a plank in his eye… and you’ve elegantly brought this up by highlighting the role of economics in community software. What Microsoft is doing wrong is that it has created an echo chamber, a jail, a money sucker, out of the promise of good software, a lie… you only need enough software, and a healthy community can fix that – but it needs to be ignited by a hardware foundation that understands the power of this community, one that will not betray its trust in the same way that Microsoft has.
Tim Dettmers says
Very well put! I think you are very right, thanks for your comment!
AMD lags far behind NVIDIA.
Hi Tim, what measures are you using to say that Vega is comparable to Titan V? Are there applications for which Titan V’s tensor operations are important, e.g. what does Google use its TPU’s for? Is making good use of a Vega mostly a matter of porting the required libraries from CUDA to HIP? Thanks!
Tim Dettmers says
These benchmarks are quite revealing: https://www.xcelerit.com/computing-benchmarks/insights/benchmarks-deep-learning-nvidia-p100-vs-v100-gpu/
Since LSTM algorithms are often compute bound if you use fused LSTMs, one can estimate the performance if one looks at FLOPS. The V100, however, computes so fast that it has to wait for memory. So one can directly compare P100 and Vega FLOPS, and from the V100 vs P100 benchmark one knows how close the Vega is to the V100. Since the Titan V and V100 are built on the same architecture they are directly comparable. Overall, the Vega Frontier is about 75% of the speed of a Titan V, but it is significantly cheaper.
It is not sufficient to port, say, cuDNN to OpenCL/HIP (AMD has mostly done this already) but the important part is to have the actual working software. For example, PyTorch that works with AMD cards.
Thanks Tim, the benchmark was interesting. It mentions LSTM’s don’t do that many tensor operations so they don’t benefit that much from the Volta tensor cores. So I was wondering if there are other algorithms/methods that make heavier use of tensor operations.
Regarding Pytorch, my context is that I’m a good Python and C programmer interested in trying something machine-learning related. I haven’t looked at Pytorch but I’m imagining a client library using Python’s C API to connect to NVidia CUDA-to-C bindings, plus some CUDA code to do the actual computation. And I’m wondering if porting that to HIP sounds feasible, though I imagine someone would already have done it in that case. But I could look into it.
Followup: I looked into Pytorch a little and it’s interesting! I want to try to use it. Still don’t know about the porting picture.
Your hardware articles are great, do you think you could also recommend some good reading about software and algorithms? Thanks!
Bingo du says
Informative writing, thanks a lot!
Carlos Perez says
Well, thanks for highlighting the unique opportunity that AMD has for its Vega products.
The Vega is on par with the Titan V with regard to FP16 performance (with the exception of TensorCores). So there is a big opportunity here if they can ramp up their software development. ROCm 1.7 just got released and the TensorFlow HIP port seems to have made good progress.
Tim Dettmers says
Yeah, it sounds like great progress. I would love to see competitive AMD software! This would be very beneficial for all users.
I too would like to see full use of AMD software but the CUDA environment is just too good to give up!!! I am currently running CUDA 8.0 on Windows 10 using VS2017 in both C++ AND C#… works like a charm, but I want max performance at the lowest cost possible. I have developed an options pricing model using Monte Carlo simulations running on CUDA, which means I can use any NVidia device that supports CUDA. AMD is definitely cheaper but the dev environment is terrible for Windows 10. I may look at moving to Linux once I go beyond a 4 GPU setup in one machine but for now… NVidia is king. I have a feeling using FP16 may not be a good idea for evaluating options, but FP32 is fantastic and I can use relatively cheap graphics cards compared to Tesla compute cards. Any deep learning hardware would also be great for running Monte Carlo simulations and therefore I would like to keep a continuing eye on your column…
Tim Dettmers says
Your situation is quite common. I think if AMD can notch up their support and tools it could be the tipping point where people start switching to AMD for the cost. Let’s see what they will do!
I hope you are right, as AMD really has to go even lower price-wise to be competitive with Nvidia/CUDA. I am currently running both 1070 Ti and 1080 Ti cards in my dev box and I would shortly like to add a pre-prod machine with 4 GPUs.
For running my Monte Carlo options pricing model, the 1070 Ti is within 10% of the performance of the 1080 Ti and yet costs 50% less!!! A quad-GPU box of 1070 Tis is half the price of one running 1080 Tis… and I really doubt the Titan V is even 50% faster than this!!! The price/performance king right now is the 1070 Ti. I would like to get an AMD Vega Radeon FE to test out, but I don’t think it can beat the price/performance of the 1070 Ti. I eventually need to build a prod box/rack with 80 cards to do real-time Monte Carlo options pricing…
Steve Steiner says
Thank you for these posts. Based on your earlier one I got a GTX 1060 just to learn. It definitely works for learning frameworks, and I was glad to have your advice on the card. My own choice to use it with Windows 10, however, generated a deep distraction. The card cannot use the TCC driver, and without that driver, as best as I can tell, it is only possible to get 5 of the 6 GB for training single models. It appears WDDM prevents allocation of the entire VRAM address space to a single process. As best I can tell the correct workaround on Windows is to use the TCC driver, which requires a Titan or ‘better’. Do most people use Linux rather than trying to ‘fix’ this?
Tim Dettmers says
I did not know about this, thanks for informing me on this one! Indeed, most people use Linux since most deep learning frameworks had poor Windows support. But it is getting better, so I bet this will be fixed quite soon.
Richard de c says
Great article. Thanks
Thomas J Quintana says
Thanks Tim! To your point I recently ran up quite the AWS bill hoping to get a significant performance from Volta GPUs which in conclusion left a lot to be desired. I’m glad to know more options are on the horizon.
very interesting article.
according to the dates:
– 2018Q3 new AMD architecture
– Intel NNP aren’t available right now
When do you think we will see these technologies integrated into TensorFlow or PyTorch? Because if it happens in mid-2019, perhaps it isn’t worth waiting for me.
I’m looking for my first PC for DeepLearning (AMD TR 1900X, 32GB, 1080Ti, 1000Watt). And perhaps I should buy it during this limbo, and upgrade it with one next generation chip in 2020
Tim Dettmers says
Buying a GTX 1080 Ti and waiting out the limbo to buy a new chip sounds like the way to go for me!
Ahmed adly says
I have just done the same 🙂
Another way to look at it is that Nvidia has dropped the price of the V100 from $8k to $3k, while reducing performance only slightly. Sweet! The V100 has been quite popular, and now the Titan V is extremely attractive to any DL researchers who have been thinking about a V100 but hesitated.
Also, I’m puzzled as to why students’ inability to afford these cards is a concern. Do you, as a grad student, have to buy your own research equipment? That’s pretty much unheard of in any top 100 grad school in the US. It’s not like a few $3k cards would make a dent in any decent DL lab’s budget.
But a big bummer against Titan V for HPC is that ECC support is disabled for Titan V, making it pointless for production grade setups.
Production HPC environments usually turn ECC off to increase performance. So no, not pointless. With a real HPC budget though, it’d be better to get the V100 due to increased memory bandwidth. (would still turn ecc off)
Tim Dettmers says
Indeed, that is also my experience with HPC. For HPC you definitely want to get V100s.
Tim Dettmers says
(1) If you sell 100g chewing gum for $5000 and then you release a 50g chewing gum priced at $50, that does not make the $50 gum cheap. That is not really an argument for the Titan V.
(2) I worked in a top 10 institution and we never had enough GPUs. I know many other labs that do not have enough GPUs. Even NVIDIA-sponsored labs do not have enough GPUs. Of course, $3k is a big burden for deep learning labs. Try to sell a $3k GPU to someone in a developing country!
It’s strange to expect these chips to be cheap. V100 is one of the largest and most complex chips ever produced (if not the largest). Why would you expect it to be priced any lower than $8k? Are there any other chips of similar size (GPUs, FPGAs, Xeons, DSPs, etc) that cost less?
Moreover, V100 has been selling well. According to my contact at a tier 1 system builder, Nvidia sells as many of them as it can make. So it is reasonably priced. Announcing Titan V for people who can’t afford that much, Nvidia slashed the price by more than 60%, with a slight reduction of performance. Also, they still make 1080Ti for those who can’t afford even $3k. So it’s not like it’s impossible to do DL research if you’re a student in a developing country.
We’ve been able to use toys (gaming cards) as tools. These latest cards are not made to be used as toys. This puts them into a category of pro tools. Good pro tools are not cheap.
I haven’t seen any benchmarks comparing AMD Vega to V100 or Titan V, for DL tasks. I highly doubt that it’s competitive, but if I’m wrong, and it’s possible to make a card with this much performance for a fraction of the price (and still make a profit), I’ll take my words back.
Finally, “not having enough GPUs” might be a good thing for the DL field. People have been brute-forcing solutions far too often. Perhaps it’s time to work smarter, not harder. For example, I’ve seen two papers on evolving NN architectures. The first one was from Google, where they used 256 GPUs; the second one was from some guys in England who managed to optimize genetic algorithms to perform a similar task on a single GPU. We need more of the latter. It’s not clear to me that progress in DL today is bottlenecked by the lack of computing power. I personally can’t say “if only I had a rig with 8 V100 cards, I’d solve [insert a hard DL problem here]”. Can you?
Tim Dettmers says
That is true for most people. However, if you have a group of 8 people sharing either 2 Titan Vs or 8 GTX 1080 Ti I would bet the people with the 8 GTX 1080 Ti cards would be much happier!
You mentioned a team in England doing optimization. What paper or code are you referring to? I’m interested in following up.
Trevor Tanner says
I’d really like to know who the team is as well.
Tim Dettmers says
A V100 is not so much different from a Vega Frontier, and the Vega Frontier is 10x cheaper.
Of course NVIDIA is selling a lot: they have a monopoly position right now, and you cannot buy anything else. It is just like back when Windows had a 95% share of all OSes and you could not really buy any other OS.
One reason why V100s sell is that vendors are not allowed to buy GTX cards in high quantities. In terms of technology, Tesla and GTX cards are almost the same. It is merely a marketing trick so that companies that buy GPUs have to pay more, and it works! I talked to many companies who would want to buy GTX cards, but they cannot get offers for them.
I would disagree. Not having enough GPUs only gives the privileged access to research tools. India has a lot of smart minds, but they just cannot afford GPUs so they cannot help pushing the progress on deep learning — that is quite sad! Deep learning should be for everyone!
Nikolaos Tsarmpopoulos says
Very good article, thank you! It appears that NVIDIA aims for the Titan V to compete with AMD’s Vega Frontier Edition based on the former’s 100 teraflops of Tensor Core performance, compared to AMD’s 25 teraflops of FP16 performance. In this case, if software needs to be rewritten to take advantage of Tensor Cores, the Titan V currently appears to be in a similar position to AMD’s Vega Frontier Edition.
Once frameworks are updated to take advantage of NVIDIA’s Tensor Cores and AMD’s OpenCL or HIP libraries (let’s assume AMD manages to catch up with NVIDIA in a more timely manner, this time), wouldn’t a single $3000 (250Watt) Titan V be preferred to 4x $1000+ Vega FEs (300Watt)?
Thanks and regards,
Also, going for a 4-GPU solution requires a massive investment in your platform, probably an extra $2000 over a single-GPU one, because of the upgrades in CPU, RAM, PSU etc.
Here’s an example based on a ThreadRipper 1950X (no hard drive, already $3350): https://fr.pcpartpicker.com/user/EricPB/saved/rNhd6h
Tim Dettmers says
Given the benchmarks, for LSTMs 2 GTX Titan Xp cards are about as fast as a Titan V. If one builds a cost-effective rig, a 2-card Titan V rig is $7500 vs $5500 for a rig with 4 Titan Xp cards. Additionally, you have more RAM and are able to share the hardware with more people. A huge win for the 4 GPU setup. Thus it is not true at all that 4 GPU rigs are more expensive than buying Titan Vs.
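The comparison can be written out as a small sketch. It uses only the figures from the comment above (rig prices of $7500 and $5500, and 2 Titan Xp cards counting as roughly one Titan V on LSTMs); the dictionary layout and field names are made up for illustration, and throughput is assumed to scale linearly with the number of cards, which real multi-GPU training will not quite reach.

```python
# Rig prices and the "2 Titan Xp ~ 1 Titan V" ratio come from the comment above.
# Linear scaling across cards is an assumption of this sketch.
rigs = {
    "2x Titan V": {"cost": 7500, "cards": 2, "titan_v_equiv_per_card": 1.0},
    "4x Titan Xp": {"cost": 5500, "cards": 4, "titan_v_equiv_per_card": 0.5},
}

for name, rig in rigs.items():
    # Total LSTM throughput, measured in "Titan V equivalents".
    rig["throughput"] = rig["cards"] * rig["titan_v_equiv_per_card"]
    # Dollars paid per Titan V equivalent of throughput.
    rig["cost_per_unit"] = rig["cost"] / rig["throughput"]
    print(f"{name}: {rig['throughput']:.1f} Titan V equivalents, "
          f"${rig['cost_per_unit']:.0f} per equivalent")
```

Under these assumptions both rigs deliver the same LSTM throughput, so the difference comes down to price and to how many people can work on separate cards at once.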
I was referring to Nikolaos’ question, “wouldn’t a single $3000 (250Watt) Titan V be preferred to 4x $1000+ Vega FEs (300Watt)?”, by pointing out the overhead costs of hosting 4 (300W) GPUs vs one (250W) GPU.
My question is how you avoid a bandwidth bottleneck using 4 cards, when there are only 2 x16 slots on the motherboard?
There aren’t any X399 motherboards with 4 full x16 PCIe 3.0 slots. Sadly.
Tim Dettmers says
The advertised TensorCore FLOPS are theoretical; in a practical setting one is not able to reach those numbers. Benchmarks on LSTMs show that in most cases they are at most 50% faster than 16-bit floating point computation. However, AMD Vega is 75% faster for 16-bit computation than the benchmarked V100 which means that an AMD Vega is about 25% faster than a Titan V. At the same time one gets water cooling for free on a Vega which widens the performance difference further still. So if the software were there, I would definitely go with an AMD Vega over a Titan V.
Kellen Sunderland says
I’m not following you here. Titan Vs are slower than V100s. If a Vega was 75% faster (link?) than a V100, shouldn’t it also be at least 75% faster than a Titan V?
Tim Dettmers says
Sorry, my comment was confusing. I meant that a Vega was 75% the speed of a Titan V (so slower). Thus its cost/performance ratio is much better.
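A minimal sketch of that cost/performance comparison, using figures quoted elsewhere in this thread: $3000 for the Titan V, $1000 for the Vega Frontier Edition, and the Vega at 75% of the Titan V's speed. The Vega price was given as "$1000+", so the result is best read as an upper bound on its advantage. The function and field names are made up for illustration.

```python
# Prices and the 75% relative speed are the figures quoted in this thread.
titan_v = {"price": 3000, "relative_speed": 1.00}
vega_fe = {"price": 1000, "relative_speed": 0.75}  # "$1000+" in the thread


def speed_per_1000_usd(card):
    """Relative speed bought per $1000 spent on the card itself."""
    return card["relative_speed"] / card["price"] * 1000


titan_v_value = speed_per_1000_usd(titan_v)
vega_fe_value = speed_per_1000_usd(vega_fe)
advantage = vega_fe_value / titan_v_value

print(f"Titan V: {titan_v_value:.2f} relative speed per $1000")
print(f"Vega FE: {vega_fe_value:.2f} relative speed per $1000")
print(f"Vega FE advantage: {advantage:.2f}x")
```

This only covers the price of the card; it says nothing about whether the software needed to reach that speed exists.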
Alexander Veysov says
Very nice blog post indeed.
Simple, clean and concise.
Does N. Matter says
Any thoughts/insight on Graphcore?
Haris Jabbar says
It was about time that we saw a fresh outlook on DL hardware from you and as always it was a treat to read. Thank you for that!
Even though I bought my 1070s a few months back (the 1070 Ti wasn’t announced then) and they serve my needs well, I agree that DL will/should become a cornerstone of the hardware development policy of all three players. Especially since many DL-based products are just on the verge of transitioning from academia to commercial solutions, this is a prime time to offer a complete solution (HW+SW+support) to the research community.
Looking forward to your updates.
Max Plank says
I totally agree with you, Haris!
Interestingly, it looks like the systems research community is trying to provide HW+SW support for different DL hardware and numerous DL frameworks using open competitions: see the new REQUEST tournament on “Reproducible Quality-Efficient Systems Tournaments” attempting to benchmark and co-design deep learning in terms of speed, accuracy and costs at ASPLOS conference: http://cKnowledge.org/request.html .
Maybe it will help to answer which HW+SW is the best?