# 利用 FFT 求解压力柏松方程加速烟雾流体模拟

# 前言

往往在基于欧拉网格 2d 烟雾流体模拟中,求解压力场是最费事的一部份,其他步骤莫过于采样上下左右,求散度或梯度,还有平流,而在求解压力场的时候,这需要 Jacobi 迭代法迭代 20~30 次以上才能有好的效果,单单这一步骤的耗时就可能去到 1~2ms 以上,在有一次无意中接触到了 FFT 解柏松方程,就转念一想用迭代法求解烟雾压力场也是个柏松方程,而且 FFT 还是利用 GPU 并行效率快去求解,所以就有今天利用 compote shader 的并行通过 FFT 求解柏松方程,只需要一次就可以求解出压力场。

# NS 方程

先回顾一下流体解算过程

  1. 首先是施加力,就是各种力,各种外力,体积力相加起来。简单点来说就是重力、发射源速度,空气中的扰乱力,漩涡力什么力累加起来得到当前帧的速度
  2. 平流速度,就是利用刚刚算出来的速度来运输速度,就是根据当前像素位置,加上一个速度向量到达目标像素写进去
  3. 根据平流好的速度梯度计算压力场
  4. 然后根据压力场散度重新投影回速度,再平流一次密度,这样就得到最终结果,顺便别忘了做碰撞检测

然后计算压力场的公式是个通过亥姆霍兹 - 霍奇分解 得到一个无旋场和无源场,加上一个散度标志就变成了

w=(u+p)=u+2p.\nabla\cdot\mathbf{w}=\nabla\cdot\left(\mathbf{u}+\nabla p\right)=\nabla\cdot\mathbf{u}+\nabla^2p.

因为是不可压缩流体,所以u=0\nabla\cdot\mathbf{u} = 0,这时候新的速度场 w 散度就是拉普拉斯算子2p\nabla^{2}p

2p=2px2+2py2=pi+1,j+pi1,j+pi,j+1+pi,j14pi,j(δx)2\nabla^{2}p=\frac{\partial^2p}{\partial x^2}+\frac{\partial^2p}{\partial y^2}=\frac{p_{i+1,j}+p_{i-1,j}+p_{i,j+1}+p_{i,j-1}-4p_{i,j}}{\left(\delta x\right)^{2}}

对于柏松方程定义就是2φ=f\nabla^2\varphi=f,当ff 等于 0 时候就是拉普拉斯方程。所以解算压力场也叫求解压力柏松方程,这个就是等下我们要求解的东西,其中最简单求解就是雅可比迭代,他简单而且易于实现所以解算流体很常用的。

# 离散傅立叶变换(DFT)

对于游戏行业最常听到 FFT 就是 FFT 海洋,就是利用几个函数叠加成海洋模拟波的运动,这篇文章就详细讲解一下 FFT

傅立叶变换是一种数学工具,可以将一个函数从时间域(或空间域)转换到频率域,或者转换回来。

而 FFT 是一种 DFT 的高效算法,DFT 就是离散傅立叶变换(Discrete Fourier Transform,DFT)DFT 的公式长这样

Xk=n=0N1xnei2πkNnX_k=\sum_{n=0}^{N-1}x_n\cdot e^{-i2\pi\frac kNn}

其中,k 是频率域中的第 k 个频率分量,n 是时域中的第 n 个数据点,大 N 就是数据总数,i 是虚数单位,举个例子怎么算 DFT,来自维基百科的题目,求出下面的 DFT

x=(x0x1x2x3)=(12ii1+2i)\mathbf{x}=\begin{pmatrix}x_0\\x_1\\x_2\\x_3\end{pmatrix}=\begin{pmatrix}1\\2-i\\-i\\-1+2i\end{pmatrix}

解题过程是这样的

X0=ei2π00/41+ei2π01/4(2i)+ei2π02/4(i)+ei2π03/4(1+2i)=2X1=ei2π10/41+ei2π11/4(2i)+ei2π12/4(i)+ei2π13/4(1+2i)=22iX2=ei2π20/41+ei2π21/4(2i)+ei2π22/4(i)+ei2π23/4(1+2i)=2iX3=ei2π30/41+ei2π31/4(2i)+ei2π32/4(i)+ei2π33/4(1+2i)=4+4i\begin{aligned}X_0&=e^{-i2\pi0\cdot0/4}\cdot1+e^{-i2\pi0\cdot1/4}\cdot(2-i)+e^{-i2\pi0\cdot2/4}\cdot(-i)+e^{-i2\pi0\cdot3/4}\cdot(-1+2i)=2\\X_1&=e^{-i2\pi1\cdot0/4}\cdot1+e^{-i2\pi1\cdot1/4}\cdot(2-i)+e^{-i2\pi1\cdot2/4}\cdot(-i)+e^{-i2\pi1\cdot3/4}\cdot(-1+2i)=-2-2i\\X_2&=e^{-i2\pi2\cdot0/4}\cdot1+e^{-i2\pi2\cdot1/4}\cdot(2-i)+e^{-i2\pi2\cdot2/4}\cdot(-i)+e^{-i2\pi2\cdot3/4}\cdot(-1+2i)=-2i\\X_3&=e^{-i2\pi3\cdot0/4}\cdot1+e^{-i2\pi3\cdot1/4}\cdot(2-i)+e^{-i2\pi3\cdot2/4}\cdot(-i)+e^{-i2\pi3\cdot3/4}\cdot(-1+2i)=4+4i\end{aligned}

结果就是

X=(X0X1X2X3)=(222i2i4+4i)\mathbf{X}=\begin{pmatrix}X_0\\X_1\\X_2\\X_3\end{pmatrix}=\begin{pmatrix}2\\-2-2i\\-2i\\4+4i\end{pmatrix}

这时候我们用一个旋转因子(twiddle factor)ww 来替代这个ee,定义为

WNnk=ej2πNnkW_{\mathrm{N}}^{nk}=e^{-j\frac{2\pi}{N}nk}

Untitled

既然定义了旋转因子,那么复数也要变成笛卡尔坐标表示,其中虚轴是 y,实数轴是 x,一个 a+bi 久变成了(a,b)向量。

他有三个性质

  • 周期性:WNa+N=WNa\mathrm{W_N^{a+N}=W_N^a}
  • 对称性:WNa+N/2=WNa\mathrm{W_N^{a+N/2}=-W_N^a}
  • 缩放性:WN2=WN/21\mathrm{W}_{\mathrm{N}}^2=\mathrm{W}_{\mathrm{N}/2}^1

都很好理解,就是一个圆,周期性就是转一圈,对称性就是转 180 度,其中幅角为正且最小的向量称为nn 次单位向量,记为wn1w^1_n, 其他复数可由他乘积得到wnk=wnk1wn1(2kn)w_n^k=w_n^{k-1}\cdot w_n^1\mathrm{~}(2\leq k\leq n)

顺便说一下欧拉公式

eiθ=cos(θ)+isin(θ)e^{i\theta}=\cos(\theta)+i\sin(\theta)

根据欧拉公式 DFT 就可以变成

XN=n=0N1xnwNk=n=0N1xncos(j2πkN)+sin(j2πkN)iX_N=\sum_{n=0}^{N-1}x_n\cdot w^k_N = \sum_{n=0}^{N-1}x_n\cdot cos(-j2\pi \frac{k}{N})+sin(-j2\pi \frac{k}{N})i

最后我们要在笛卡尔坐标下表示就是

XN=n=0N1xncos(j2πkN)sin(j2πkN)X_N= \sum_{n=0}^{N-1}x_n\cdot (cos(-j2\pi \frac{k}{N}),sin(-j2\pi \frac{k}{N}))

这样就可以复数单位消除,变成一个很好理解的向量,但是他还是复数,所以乘法会变得不一样

(a,b)+(c,d)=(a+c,b+d)(a,b)(c,d)=(acbd,bc+ad)(a,b)+(c,d)=(a+c,b+d)\\(a,b)\cdot(c,d)=(ac-bd,bc+ad)

既然后面是复数,所以前面xnx_n 也是个复数,我们就拿相邻的两个实数组成一个复数。

# 加速 DFT 算法

看完上面你会发现,其中这个算法要O(n2)O(n^2) 的时间,因为要每个数跟其他数相乘,耗时太长,所以我们利用旋转因子ww 的特殊性质来加速算法

对于A(x)=a0+a1w1+a2w2+a3w3++an1wn1A(x)=a_0+a_1\cdot w^1+a_2\cdot w^2+a_3\cdot w^3+\ldots+a_{n-1}\cdot w^{n-1} 考虑将其按照奇偶分组

A(x)=(a0+a2w2+a4w4+an2wn2)+(a1w1+a3w3+a5w5++an1wn1)\begin{aligned}&A(x)=(a_0+a_2\cdot w^2+a_4\cdot w^4\ldots+a_{n-2}\cdot w^{n-2})+(a_1\cdot w^1+a_3\cdot w^3+a_5\cdot w^5+\ldots\\&+a_{n-1}\cdot w^{n-1})\end{aligned}

在奇数里面提取一个 w 出来就变成

A(x)=(a0+a2w2+a4w4+an2wn2)+x(a1+a3w2+a5w4++an1wn2)\begin{aligned}&A(x)=(a_0+a_2\cdot w^2+a_4\cdot w^4\ldots+a_{n-2}\cdot w^{n-2})+x\\&\cdot(a_1+a_3\cdot w^2+a_5\cdot w^4+\ldots+a_{n-1}\cdot w^{n-2})\end{aligned}

A1(x)=(a0+a2w+a4w2+an2wn22)A2(x)=(a1+a3w+a5w2+an1wn22)\begin{aligned}A1(x)&=(a_0+a_2\cdot w+a_4\cdot w^2\ldots+a_{n-2}\cdot w^{\frac{n-2}2})\\\\A2(x)&=(a_1+a_3\cdot w+a_5\cdot w^2\ldots+a_{n-1}\cdot w^{\frac{n-2}2})\end{aligned}

则可以的得到

A(x)=A1(x2)+wA2(x2)A(x)=A1(x^2)+w\cdot A2(x^2)

这时候我们可以把它分为两半

A(ωnk)=A1(ωn2k)+ωnkA2(ωn2k),0kn21A(ωnk+n2)=A1(ωn2k)ωnkA2(ωn2k),n2k+n2n1A(\omega_n^k)=A1(\omega_{\frac n2}^k)+\omega_n^k\cdot A2(\omega_{\frac n2}^k),0\leq k\leq\frac n2-1\\A(\omega_n^{k+\frac n2})=A1(\omega_{\frac n2}^k)-\omega_n^k\cdot A2(\omega_{\frac n2}^k),\frac n2\leq k+\frac n2\leq n-1

下半部分就是有个减号是因为

A(ωnk+n2)=A1(ωn2k+n)+ωnk+n2A2(ωn2k+n)A(\omega_n^{k+\frac n2})=A1(\omega_n^{2k+n})+\omega_n^{k+\frac n2}\cdot A2(\omega_n^{2k+n})

其中ωn2k+n=ωn2kωnn=ωn2k=ωn2k\omega_n^{2k+n}=\omega_n^{2k}\cdot\omega_n^n=\omega_n^{2k}=\omega_{\frac n2}^k

由对称性ωnk+n2=ωnk\omega_n^{k+\frac n2}=-\omega_n^k 所以有个减号

# FFT 蝶式运算

蝶式算法主要的思想就是不断一半切份做 FFT,这样最后原来需要O(n2)O(n^2) 就变成O(logN)O(logN),过程是这样的,

Untitled

说实话第一次看这图我也是一脸懵逼看不懂,但举例一下就很简单

X(W40)=F(W20)+W40G(W20)=n=01x(2n)(W20)n+W40n=01x(2n+1)(W20)nF(W20)=Ff(W10)+W20Gf(W10)=n=00x(4n)(W10)n+W20n=00x(4n+2)(W10)n=x(0)+x(2)W20\begin{aligned}&X(W_{4}^{0})=F(W_{2}^{0})+W_{4}^{0}G(W_{2}^{0})=\sum_{n=0}^{1}x(2n)(W_{2}^{0})^{n}+W_{4}^{0}\sum_{n=0}^{1}x(2n+1)(W_{2}^{0})^{n} \\&F(W_{2}^{0})=F_{f}(W_{1}^{0})+W_{2}^{0}G_{f}(W_{1}^{0}) \\&=\sum_{n=0}^0x(4n)(W_1^0)^n+W_2^0\sum_{n=0}^0x(4n+2)(W_1^0)^n \\&=x(0)+x(2)W_{2}^{0}\end{aligned}

G(W20)=Fg(W10)+W20Gg(W10)=n=00x(4n+1)(W10)n+W20n=00x(4n+3)(W10)n=x(1)+x(3)W20\begin{aligned}&G(W_{2}^{0})=F_{g}(W_{1}^{0})+W_{2}^{0}G_{g}(W_{1}^{0}) \\&\begin{aligned}&=\sum_{n=0}^0x(4n+1)(W_1^0)^n+W_2^0\sum_{n=0}^0x(4n+3)(W_1^0)^n\end{aligned} \\&=x(1)+x(3)W_{2}^{0}\end{aligned}

到这全部 FFT 就写完了。对于更详细的可以看看

一小时学会快速傅里叶变换(Fast Fourier Transform)

# 实现

实现方面因为图片是二维的,所以解压力柏松方程时候,利用 Compute shader 先 x 轴做一次,再 y 轴做一次,然后在逆 FFT 做一次 x 和 y。因为是直接用 Compute shader 实现的,发现比在 Niagara 实现快一倍,在同样 3070ti,512x512,Niagara 要 0.3ms,而自己的 Compute shader 只要 0.1ms

Untitled

Untitled

在做到一半的时候,发现 ue5.4 的 Niagara fluid 已经采用了这个方法,发现他的更高效,所以直接抄了过来,最后是 dispatch compute shader,代码都是一样,只有开关不同,在每帧都要初始压力场 Grid,就是把速度散度写进去

Untitled

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
disgroupshared float2 FFTGroupShared[2*THREADGROUP_SIZEX];
[numthreads(THREADGROUP_SIZEX, 1, 1)]
void MainCS(uint3 DispatchThreadId : SV_DispatchThreadID,uint3 GroupThreadId : SV_GroupThreadID,uint GroupID:SV_GroupID)
{
//FFTSolverPressure(GroupThreadId.x,true, bIsInverse.x);

// discrete sine transform using FFT.
//
// defined as :
// F_k = Sum_{n=1}^{M-1} f_n Sin(Pi k n / M)
//
// f_n = (2/M) Sum_{k=1}^{M-1} F_k Sin(Pi k n / M)
//
//
// Requirement. Lenght of input is M-1, where M is a power of two.
//
// Dispatch:
// with M/2 threads.
// Grid2D:
// The grid2d must have a channel with name "Real"

// Idea. Input. Real Array of length M -1
// {f_i} with i in [1, M-1]
// i.e. {f_i} = { f1, f2, .. f_{M-1} }.
// Step 1)
// odd extension is applied creating {g_i} with i in [0,2M-1]
// i.e. {g_i} = { 0, f1, f2.. f_{M-1}, 0, -f_{M-1}...-f2 -f1 }
//
// Step 2)
// pack pairs of terms in {g_i} as complex array {h_i} of length M
// {h_i} = {g_{2i} + i g_{2 i +1 }
//
// Step 3)
// FFT the complex array {h_i}
//
// H_k = FFT(h_i)
//
// Step 4)
// unpack the H_k to get G_k
// G_k = (H_k + Conjugate(H_{M-k}) - i/2 (H_k - Conjugate(H_{M-k))Exp(i pi k / M)
//
// Step 5)
// construct coefficeints of Sin Transform from
// F^{sin}_k = -i/2 F_k.
// Note: this is purely real, so F_k is pure imaginary.
// also F^sin_0 = 0.
//
// Step 6)
// store output as real array lenght M-1
// { F^sin_1, F^sin_2 ... F^sin_{M-1} }

int2 NumCells = GridSize.xy ;
int ScanlineIdx = GroupID.x;
int ThreadIdx = GroupThreadId.x ;
bool bPoissonSolver = true;

int Mx = bTransformX ? NumCells.x : NumCells.y;
int NumCols = Mx / 2; // requires a dispatch such that NumCols = THREADGROUP_SIZE

int My = bTransformX ? NumCells.y : NumCells.x;

// constants.
float TwoPi = 2.0 * Pi;

// This assumes we have already done the sin-transform in Y
// and now we are doing it in X - to be followed by inv in Y inv in X
bool bApplyPoissonConvolution = (bPoissonSolver) && (bTransformX) && (bIsInverse == false);

bool bCorrectDim = (Mx / 2 == THREADGROUP_SIZEX);
bool bValidSetUp = bCorrectDim;

float GridSpaceDX = dx;
float RScale = 1. / GridSpaceDX;
float WScale = (GridSpaceDX * GridSpaceDX) * 1. / RScale;

bool bFirstRead = ((bTransformX == false) && (bIsInverse == false));
bool bLastWrite = ((bTransformX == false) && (bIsInverse == true));

if (bValidSetUp)
{
// local scratch of complex numbers. ( V.x+ I V.y is a complex number)
float2 V[2];
// load data into scratch. "j" is threadid
// V[0] = h_j
// V[1] = h_{M/2 +j}
{
// load V[0] = h_j
// = float2(g_{2j}, g_{2j +1})
// = float2(data[2j-1], data[2j]) // offset in data
//
// V[1] = h_{M/2 + j}
// = float2(g_{M + 2j}, g_{M + 2j + 1})
// = -float2(g_{M-2j}, g_{M-2j-1} ) // odd sym.
// = -float2( data[M-2j-1], data[M-2j -2]) // offset in data
// loads the data, extended as odd, and packs it into M complex numbers.
{
int j = ThreadIdx; // j in [0, M/2-1]

// holds h_j
int2 pxl = int2(2 * j - 1, ScanlineIdx);
int2 off = int2(1, 0);

// holds h_{M/2 +j}
int2 pxl2 = int2(Mx - 2 * j - 1, ScanlineIdx);

if (bTransformX == false)
{
pxl.xy = pxl.yx;
off.xy = off.yx;
pxl2.xy = pxl2.yx;
}

// boundary are zero, i.e. data[-1], and data[M].
if (j == 0)
{
V[0].x = 0; // g_0 = f_0 = 0
V[1].x = 0; // g_m = 0
}
else
{
// g_{2j} ( = f_2j )
V[0].x = PressureGridUAV.Load(pxl);
// g_{M-2j}
V[1].x = PressureGridUAV.Load(pxl2);
}
// g_{2j + 1}
V[0].y = PressureGridUAV.Load(pxl + off);
// g_{M - 2j - 1}
V[1].y = PressureGridUAV.Load(pxl2- off);

// make values odd
// g_{M + 2j} = -g_{M-2j}
// g_{M + 2j+1} = -g_{M-2j-1}
V[1] *= -1.;
}
}

if (bFirstRead)
{
V[0] *= RScale;
V[1] *= RScale;
}

// Scale. The forward and inverse scale should have the product 2/Mx
// note, this scale was chosen because the pressure values were so high
// that doing two forward transforms (e.g. x and y) would generate garbage
// most likely by exceeding the max value for the grid storage.
{
float Scale = rsqrt(float(NumCols));
V[0] *= Scale;
V[1] *= Scale;
}

// compute FFT
// This is a sequence of Radix-2 FFTs and data exchanges.
{
int IdxS = ThreadIdx;

for (int Ns = 1; Ns < Mx; Ns *= 2)
{
//IdxD = Expand(ThreadIdx, Ns, RADIX);
int IdxD = (ThreadIdx / Ns) * Ns * 2 + (ThreadIdx % Ns);

// Apply the Twiddle
float Theta = TwoPi * float(ThreadIdx % Ns) / float(Ns * 2);
{
float2 Twiddle;
sincos(Theta, Twiddle.y, Twiddle.x);

V[1] = ComplexMultEqs(V[1], Twiddle);
}

// Radix(2) FFT
{
float2 Vo = V[0];
V[0] = Vo + V[1];
V[1] = Vo - V[1];
}

// Exchange data with other threads
{
GroupMemoryBarrierWithGroupSync();
// write to group shared
for (int r = 0, j = IdxD; r < 2; ++r, j += Ns)
{
FFTGroupShared[j] = V[r];
}
GroupMemoryBarrierWithGroupSync();

// read from group shared
for (int r = 0, j = IdxS; r < 2; ++r, j += NumCols)
{
V[r] = FFTGroupShared[j];
}
}
} // end loop
} // end FFT

// group shared memory now hold H[0]...H[M-1].
// note: H[M] = H[0]
// and in the local scratch:
// H_j = V[0]
// H_{j+M/2} = V[1]

if (ThreadIdx == 0)
{
// Use this thread to write F^sin_{M/2}
int k = NumCols; // = M/2
int2 GridIdx = int2(k - 1, ScanlineIdx);

if (bTransformX == false)
{
GridIdx.xy = GridIdx.yx;
}

// unpacking this frequency has a simplified form.
// G_{M/2} = H_{M/2}
// F^Sin_k = -i/2 G_{k}
float Val = 0.5 * V[1].y;

if (bApplyPoissonConvolution)
{
int kx = k;
int ky = ScanlineIdx + 1;

Val /= EigenValue(kx, ky, Mx, My, dx);
}

if (bLastWrite)
{
Val *= WScale;
}

PressureGridUAV[GridIdx] = Val;
}

// Need to rebuild the FSin_k coefficients.
//
// F_k = (1/2) (H_k + Conjugate(H_{M-k)) + (1/2i) ( H_k - Conjugate(H_{M-k}) * Exp(i pi k/M)
// FSin_k = (1/2i) F_k
// note F_k is pure imaginary. FSin_k is pure real. FSin_0 = 0

// Recall; number of threads = M / 2. Each thread should write 2 slots.
// read H_{M-k} from group shared.
{
int j = (ThreadIdx != 0) ? Mx - ThreadIdx : 0; // H_{M} = H_{0}
V[1] = FFTGroupShared[j];
}

// unpack the F_{k} and F_{M-k} coefficients of the transform of the
// odd extension of the original data.
// threadIdx should be [0,M/2-1]
float2 Fk;
float2 FMminusk;
{
int k = ThreadIdx;
float2 Apart = V[0] + ComplexCon(V[1]);

float2 Bpart; // i ( V[0] - ComplexCon(V[1])) Exp(i pi k / M)

{
Bpart = V[0] - ComplexCon(V[1]);
float2 Twiddle;
float Angle = Pi * float(k) / float(Mx);
sincos(Angle, Twiddle.y, Twiddle.x);
Bpart = ComplexMultEqs(Bpart, Twiddle);
Bpart = float2(-Bpart.y, Bpart.x);
}

Fk = 0.5 * (Apart - Bpart);
FMminusk = 0.5 * (ComplexCon(Apart) + ComplexCon(Bpart));
}

float FsinK = 0.5 * Fk.y;
float FsinMmK = 0.5 * FMminusk.y;

// Also due to symmetry, F^sin_0 will be zero. We don't record it.

// Copy the results to the output buffer, But shift the data
// so [0] holds F^sin_1, [1] holds F^sin_2.. etc

// write F^sin_k
if (ThreadIdx != 0)
{
int k = ThreadIdx;
int2 GridIdx = int2(k - 1, ScanlineIdx);

if (bTransformX == false)
{
GridIdx.xy = GridIdx.yx;
}

if (bApplyPoissonConvolution)
{
int kx = k;
int ky = ScanlineIdx + 1;

FsinK /= EigenValue(kx, ky, Mx, My, dx);
}

if (bLastWrite)
{
FsinK *= WScale;
}

PressureGridUAV[GridIdx] = FsinK;
}

// write F^sin_{M-k}
{
int MminusK = Mx - ThreadIdx;
int2 GridIdx = int2(MminusK - 1, ScanlineIdx);

if (bTransformX == false)
{
GridIdx.xy = GridIdx.yx;
}

if (bApplyPoissonConvolution)
{
int kx = MminusK;
int ky = ScanlineIdx + 1;

FsinMmK /= EigenValue(kx, ky, Mx, My, dx);
}

if (bLastWrite)
{
FsinMmK *= WScale;
}

PressureGridUAV[GridIdx] =FsinMmK;
}
}
}