Optimize scaleFFTData for float FFTs
BUG=
Speed up scaleFFTData by about 30% by doing the scaling on 4
complex (8 float) elements at a time.
Some timing measurements using perf measuring
time_fft_time -T -F -f 1 -n 11 -g 2 -c 1000000
Before optimization:
samples pcnt function DSO
_______ _____ ______________________________________ _____________
2364.00 25.9% evenOddButterflyLoopInv [vectors]
1957.00 21.4% radix4SetLoopINV [vectors]
1197.00 13.1% radix4SkipReadINV [vectors]
1009.00 11.0% scaleFFTData [vectors]
After optimization:
samples pcnt function DSO
_______ _____ ______________________________________ _____________
3806.00 25.9% evenOddButterflyLoopInv [vectors]
3523.00 23.9% radix4SetLoopINV [vectors]
2103.00 14.3% radix4SkipReadINV [vectors]
1471.00 10.0% radix4lsGrpLoopinv [vectors]
1134.00 7.7% scaleFFTData [vectors]
The time spent has gone in scaleFFTData has gone down from 11% to 7.7%.
R=aedla@chromium.org, andrew@webrtc.org, kma@webrtc.org
Review URL: https://webrtc-codereview.appspot.com/1574005
git-svn-id: http://webrtc.googlecode.com/svn/deps/third_party/openmax@4148 4adac7df-926f-26a2-2b94-8c16560cd09d
diff --git a/dl/sp/src/omxSP_FFTInv_CCSToR_F32_Sfs_s.S b/dl/sp/src/omxSP_FFTInv_CCSToR_F32_Sfs_s.S
index 2616506..5deaf89 100644
--- a/dl/sp/src/omxSP_FFTInv_CCSToR_F32_Sfs_s.S
+++ b/dl/sp/src/omxSP_FFTInv_CCSToR_F32_Sfs_s.S
@@ -134,6 +134,8 @@
#define dScale D2.F32
#define one S4.F32
+#define qX0 Q2.F32
+#define qX1 Q3.F32
@// Allocate stack memory required by the function
M_ALLOC4 complexFFTSize, 4
@@ -262,15 +264,25 @@
VDIV one, one, fN @ one = dScale[0] = 1 / fftSize
- @// N = subFFTSize ; dataptr = pDst
+ @// subFFTSize = N = complexFFTSize, which is always even and
+ @// greater than 0.
+ CMP subFFTSize, #4
+ BLT scaleFFTData1
scaleFFTData:
- VLD1 {dX0},[pSrc] @// pSrc contains pDst pointer
- SUBS subFFTSize,subFFTSize,#1
- VMUL dX0, dX0, dScale[0]
- VST1 {dX0},[pSrc]!
+ @// Scale 4 complex (8 float) elements at a time
+ VLD1 {qX0, qX1}, [pSrc :256] @// pSrc contains pDst pointer
+ SUBS subFFTSize, subFFTSize, #4
+ VMUL qX0, qX0, dScale[0]
+ VMUL qX1, qX1, dScale[0]
+ VST1 {qX0, qX1}, [pSrc :256]!
BGT scaleFFTData
-
+scaleFFTData1:
+ CMP subFFTSize, #2
+ BLT End
+ VLD1 {qX0}, [pSrc]
+ VMUL qX0, qX0, dScale[0]
+ VST1 {qX0}, [pSrc]!
End:
@// Set return value
MOV result, #OMX_Sts_NoErr