<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	
	>
<channel>
	<title>
	Comments on: TPUs vs GPUs for Transformers (BERT)	</title>
	<atom:link href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/feed/" rel="self" type="application/rss+xml" />
	<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/</link>
	<description>Making deep learning accessible.</description>
	<lastBuildDate>Sun, 20 Sep 2020 21:40:19 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.0.11</generator>
	<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-65580</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Sun, 24 Nov 2019 23:54:18 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-65580</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-65526&quot;&gt;Jay&lt;/a&gt;.

This is a great question, thank you! The number of tiles is determined by the amount of shared memory you have available. You have between 64kb and 96kb per streaming multiprocessor (SM), but you usually need to reserve half of that memory for double-buffered loads since the memory load latency is pretty high. For very big, modern GPUs, like the Titan RTX, we have about 72 SMs and so a total of 4608 kb of memory for tiles. About half of that is reserved for double buffering, so you have 2304 kb of total memory. You need to distribute this across at least 142 thread blocks to get full compute utilization, which means a maximum size of about 16kb for tiles. Usually, you do not actually want maximum compute utilization since memory movement is the bottleneck, so you run with fewer thread blocks to increase the memory tile size.

Overall, you can thus expect that the memory tile size will span about 16-20 kb between matrices A and B. If you assume the memory is split equally between both matrices, that is about 8192 bytes for the maximum tile size, which is 4096 elements for 16-bit floats. This means that for a large GPU the maximum tile size is about 32x128. You can go a bit larger, but it will quickly eat into computational performance. Note that this calculation is for the largest GPU (Titan RTX). For other GPUs, other tile sizes will provide better performance. Optimal tile size is a difficult concept which often cannot really be determined theoretically. Often the only solution is to write a matrix multiply algorithm for a certain tile size and benchmark it against other tile sizes to find the best one.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-65526">Jay</a>.</p>
<p>This is a great question, thank you! The number of tiles is determined by the amount of shared memory you have available. You have between 64kb and 96kb per streaming multiprocessor (SM), but you usually need to reserve half of that memory for double-buffered loads since the memory load latency is pretty high. For very big, modern GPUs, like the Titan RTX, we have about 72 SMs and so a total of 4608 kb of memory for tiles. About half of that is reserved for double buffering, so you have 2304 kb of total memory. You need to distribute this across at least 142 thread blocks to get full compute utilization, which means a maximum size of about 16kb for tiles. Usually, you do not actually want maximum compute utilization since memory movement is the bottleneck, so you run with fewer thread blocks to increase the memory tile size.</p>
<p>Overall, you can thus expect that the memory tile size will span about 16-20 kb between matrices A and B. If you assume the memory is split equally between both matrices, that is about 8192 bytes for the maximum tile size, which is 4096 elements for 16-bit floats. This means that for a large GPU the maximum tile size is about 32&#215;128. You can go a bit larger, but it will quickly eat into computational performance. Note that this calculation is for the largest GPU (Titan RTX). For other GPUs, other tile sizes will provide better performance. Optimal tile size is a difficult concept which often cannot really be determined theoretically. Often the only solution is to write a matrix multiply algorithm for a certain tile size and benchmark it against other tile sizes to find the best one.</p>
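<p>As a rough sketch of this back-of-the-envelope arithmetic (using the illustrative figures from this comment, not values queried from actual hardware):</p>

```python
# Rough tile-size arithmetic using the illustrative Titan RTX figures
# quoted above; these are assumptions for the sketch, not hardware queries.
SM_COUNT = 72                    # streaming multiprocessors
SHARED_MEM_PER_SM = 64 * 1024    # bytes of shared memory per SM
MIN_THREAD_BLOCKS = 142          # blocks needed for full compute utilization

total_kb = SM_COUNT * SHARED_MEM_PER_SM / 1024       # 4608 kb in total
usable_kb = total_kb / 2                             # half goes to double buffering
tile_bytes = usable_kb * 1024 / MIN_THREAD_BLOCKS    # per-thread-block tile budget
per_matrix_bytes = tile_bytes / 2                    # split equally between A and B
fp16_elements = per_matrix_bytes / 2                 # 2 bytes per 16-bit element

print(round(tile_bytes / 1024, 1))   # roughly 16 kb per tile
print(round(fp16_elements))          # roughly 4100 elements, close to 32x128 = 4096
```

<p>The last number lands a little above 4096 because 2304 kb does not divide evenly by 142 blocks; the 8192-byte / 32x128 figure above is the rounded-down power-of-two tile that fits in that budget.</p>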
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Jay		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-65526</link>

		<dc:creator><![CDATA[Jay]]></dc:creator>
		<pubDate>Sat, 23 Nov 2019 15:32:27 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-65526</guid>

					<description><![CDATA[For a GPU we have the same process, but we use smaller tiles with more processors.
Why? Can&#039;t we use more tiles for a GPU, since that would decrease the load and eventually match up with the TPU? Who decides the number of tiles in use?]]></description>
			<content:encoded><![CDATA[<p>For a GPU we have the same process, but we use smaller tiles with more processors.<br />
Why? Can&#8217;t we use more tiles for a GPU, since that would decrease the load and eventually match up with the TPU? Who decides the number of tiles in use?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58962</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Thu, 11 Jul 2019 13:38:25 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-58962</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58884&quot;&gt;Enrique&lt;/a&gt;.

A TPU can only hold one tile of A and B at any time. The tile is evicted once you move on to the next sub-matrix multiplication. As such, you need to reload the tiles of B more often if you want to aggregate within a specific memory location. Aggregating across the entire length of the matrix would entail fewer tile loads for B, but more tile store executions and would be slower — it is faster (2x) to aggregate in the same location for 16 tiles rather than to change location with every tile. This is a major reason why lowering the precision from 32-bit to 16-bit is so effective: it reduces the factor by which you need to reload tiles on a GPU.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58884">Enrique</a>.</p>
<p>A TPU can only hold one tile of A and B at any time. The tile is evicted once you move on to the next sub-matrix multiplication. As such, you need to reload the tiles of B more often if you want to aggregate within a specific memory location. Aggregating across the entire length of the matrix would entail fewer tile loads for B, but more tile store executions and would be slower — it is faster (2x) to aggregate in the same location for 16 tiles rather than to change location with every tile. This is a major reason why lowering the precision from 32-bit to 16-bit is so effective: it reduces the factor by which you need to reload tiles on a GPU.</p>
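<p>As a hypothetical illustration of how loop order changes the number of tile loads (the 2x8 grid of A tiles and 8x8 grid of B tiles are made-up sizes for this sketch, not taken from any particular GPU or TPU):</p>

```python
# Count tile loads for two loop orderings over a hypothetical tiling:
# A is a 2x8 grid of tiles, B is an 8x8 grid, so C = A @ B is a 2x8 grid.
M, K, N = 2, 8, 8

# Ordering 1: aggregate each C(i, j) in place; an A tile and a B tile
# are reloaded for every (i, j, k) step.
loads_in_place = 0
for i in range(M):
    for j in range(N):
        for k in range(K):
            loads_in_place += 2          # one A tile + one B tile

# Ordering 2: keep A(i, k) resident and stream the k-th row of B past it.
loads_resident = 0
for i in range(M):
    for k in range(K):
        loads_resident += 1              # load A(i, k) once
        for j in range(N):
            loads_resident += 1          # load B(k, j)

print(loads_in_place, loads_resident)    # 256 vs. 144 tile loads
```

<p>Either way, each load counted here moves half as many bytes with 16-bit elements as with 32-bit ones, which is the reload-factor reduction described above.</p>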
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Enrique		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58884</link>

		<dc:creator><![CDATA[Enrique]]></dc:creator>
		<pubDate>Tue, 09 Jul 2019 13:23:41 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-58884</guid>

					<description><![CDATA[Why do you need to load 64 tiles of B for each tile of A? In total, there are 16 tiles of A and 64 tiles of B, but not every tile of B has to be multiplied by each tile of A. It would make more sense to load 8 tiles of B for each tile of A:
for i=1:2
    for k=1:8
        Load tile A(i,k)
        for j=1:8
            Load tile B(k,j)]]></description>
			<content:encoded><![CDATA[<p>Why do you need to load 64 tiles of B for each tile of A? In total, there are 16 tiles of A and 64 tiles of B, but not every tile of B has to be multiplied by each tile of A. It would make more sense to load 8 tiles of B for each tile of A:<br />
for i=1:2<br />
    for k=1:8<br />
        Load tile A(i,k)<br />
        for j=1:8<br />
            Load tile B(k,j)</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-58049</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Fri, 14 Jun 2019 02:33:45 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-58049</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-57072&quot;&gt;Abe&lt;/a&gt;.

I think you are right — I made a mistake here. I will look into this when I have time and update the numbers.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-57072">Abe</a>.</p>
<p>I think you are right — I made a mistake here. I will look into this when I have time and update the numbers.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Abe		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-57072</link>

		<dc:creator><![CDATA[Abe]]></dc:creator>
		<pubDate>Mon, 13 May 2019 14:13:36 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-57072</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-50152&quot;&gt;Tim Dettmers&lt;/a&gt;.

I think &quot;1 Cloud TPU&quot; = &quot;4 TPU Chips&quot; = &quot;8 TPU cores&quot; = &quot;8 TPU processors&quot;.   Sources:

https://cloud.google.com/tpu/docs/deciding-pod-versus-tpu

https://github.com/google-research/bert/issues/67#issuecomment-436351513

From the BERT paper:
&quot;Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total)&quot;

I&#039;m also not sure why the post here mentions &quot;256 chips&quot;, rather than 64.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-50152">Tim Dettmers</a>.</p>
<p>I think &#8220;1 Cloud TPU&#8221; = &#8220;4 TPU Chips&#8221; = &#8220;8 TPU cores&#8221; = &#8220;8 TPU processors&#8221;.   Sources:</p>
<p><a href="https://cloud.google.com/tpu/docs/deciding-pod-versus-tpu" rel="nofollow ugc">https://cloud.google.com/tpu/docs/deciding-pod-versus-tpu</a></p>
<p><a href="https://github.com/google-research/bert/issues/67#issuecomment-436351513" rel="nofollow ugc">https://github.com/google-research/bert/issues/67#issuecomment-436351513</a></p>
<p>From the BERT paper:<br />
&#8220;Training of BERT_LARGE was performed on 16 Cloud TPUs (64 TPU chips total)&#8221;</p>
<p>I&#8217;m also not sure why the post here mentions &#8220;256 chips&#8221;, rather than 64.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-55992</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Tue, 16 Apr 2019 20:09:49 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-55992</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-55910&quot;&gt;Chester&lt;/a&gt;.

That is correct, however, the BERT paper is using the following compute resources:
&lt;blockquote&gt;
Training of BERT BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT LARGE was performed on 16 Cloud TPUs (64 TPU chips total).
&lt;/blockquote&gt;]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-55910">Chester</a>.</p>
<p>That is correct, however, the BERT paper is using the following compute resources:</p>
<blockquote><p>
Training of BERT BASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT LARGE was performed on 16 Cloud TPUs (64 TPU chips total).
</p></blockquote>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Chester		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-55910</link>

		<dc:creator><![CDATA[Chester]]></dc:creator>
		<pubDate>Mon, 15 Apr 2019 11:51:56 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-55910</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-50152&quot;&gt;Tim Dettmers&lt;/a&gt;.

Why are 64 GPUs equivalent to 4 TPU pods? I don&#039;t really get it.
Google says, 1 TPU pod = 64 TPU devices = 256 TPU chips = 512 cores.]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-50152">Tim Dettmers</a>.</p>
<p>Why are 64 GPUs equivalent to 4 TPU pods? I don&#8217;t really get it.<br />
Google says, 1 TPU pod = 64 TPU devices = 256 TPU chips = 512 cores.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Tim Dettmers		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-52558</link>

		<dc:creator><![CDATA[Tim Dettmers]]></dc:creator>
		<pubDate>Wed, 20 Feb 2019 22:26:08 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-52558</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-51703&quot;&gt;Haibin&lt;/a&gt;.

Fixed — thank you!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-51703">Haibin</a>.</p>
<p>Fixed — thank you!</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Haibin		</title>
		<link>https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/comment-page-1/#comment-51703</link>

		<dc:creator><![CDATA[Haibin]]></dc:creator>
		<pubDate>Wed, 06 Feb 2019 21:03:36 +0000</pubDate>
		<guid isPermaLink="false">http://timdettmers.com/?p=686#comment-51703</guid>

					<description><![CDATA[typo: 7.6r-05 seconds -&#062; 7.6e-05 seconds]]></description>
			<content:encoded><![CDATA[<p>typo: 7.6r-05 seconds -&gt; 7.6e-05 seconds</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
