# Conversational Speech generator tool

Tool to generate multiple-end audio tracks to simulate conversational speech
with two or more participants.

The input to the tool is a directory containing a number of audio tracks and
a text file indicating how to time the sequence of speech turns (see the Example
section).

Since the timing of the speaking turns is specified by the user, the generated
tracks may not be suitable for testing scenarios in which there is unpredictable
network delay (e.g., end-to-end RTC assessment).

Instead, the generated pairs can be used when the delay is constant (obviously
including the case in which there is no delay).
For instance, echo cancellation in the APM module can be evaluated using two-end
audio tracks as input and reverse input.

By indicating negative and positive time offsets, one can reproduce cross-talk
(aka double-talk) and silence in the conversation.

### Example

For each end, there is a set of audio tracks, e.g., a1, a2, a3 and a4
(speaker A) and b1, b2 (speaker B).
The text file with the timing information may look like this:

```
A a1 0
B b1 0
A a2 100
B b2 -200
A a3 0
A a4 0
```

The first column indicates the speaker name, the second contains the audio track
file names, and the third the offsets (in milliseconds) used to concatenate the
chunks.
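
The three-column format described above can be read with a few lines of Python. This is only an illustrative sketch, not the tool's own parser: the names `Turn` and `parse_timing` are hypothetical, and skipping blank lines is an assumption rather than documented behavior.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    speaker: str    # first column, e.g. "A"
    track: str      # second column, e.g. "a1"
    offset_ms: int  # third column, in milliseconds; may be negative


def parse_timing(text: str) -> List[Turn]:
    """Parse whitespace-separated 'speaker track offset' lines."""
    turns = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}")
        speaker, track, offset = fields
        turns.append(Turn(speaker, track, int(offset)))
    return turns


EXAMPLE = """\
A a1 0
B b1 0
A a2 100
B b2 -200
A a3 0
A a4 0
"""

turns = parse_timing(EXAMPLE)
print(turns[3])  # Turn(speaker='B', track='b2', offset_ms=-200)
```

Keeping the offset as a signed integer preserves the distinction between a pause (positive) and cross-talk (negative).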

Assume that all the audio tracks in the example above are 1000 ms long.
The tool will then generate two tracks (A and B) that look like this:

**Track A**
```
a1 (1000 ms)
silence (1100 ms)
a2 (1000 ms)
silence (800 ms)
a3 (1000 ms)
a4 (1000 ms)
```

**Track B**
```
silence (1000 ms)
b1 (1000 ms)
silence (900 ms)
b2 (1000 ms)
silence (2000 ms)
```
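
The arithmetic behind these listings can be sketched in Python. The sketch assumes, as the example suggests, that each offset is measured from the end of the immediately preceding turn (whoever spoke it) and that every track is padded with silence to the same total length. The function name `layout` is hypothetical, and rejecting overlapping turns from the same speaker is a choice made for this sketch, not a documented rule of the tool.

```python
def layout(turns, durations_ms):
    """Return {speaker: [(label, duration_ms), ...]} with silence filled in.

    turns: iterable of (speaker, track, offset_ms) in file order.
    durations_ms: mapping from track name to its length in milliseconds.
    """
    # Pass 1: absolute start and end time of every turn.
    placed = []  # (speaker, track, start_ms, end_ms)
    prev_end = 0
    for speaker, track, offset_ms in turns:
        start = prev_end + offset_ms  # assumption: offset is relative to the previous turn's end
        end = start + durations_ms[track]
        placed.append((speaker, track, start, end))
        prev_end = end
    total = max(end for _, _, _, end in placed)

    # Pass 2: per speaker, interleave silence and speech segments.
    tracks = {}
    cursor = {}  # end time of the last segment written for each speaker
    for speaker, track, start, end in placed:
        segments = tracks.setdefault(speaker, [])
        gap = start - cursor.get(speaker, 0)
        if gap < 0:
            raise ValueError(f"{track}: overlaps an earlier turn of speaker {speaker}")
        if gap > 0:
            segments.append(("silence", gap))
        segments.append((track, end - start))
        cursor[speaker] = end

    # Pad every track with trailing silence up to the common length.
    for speaker, segments in tracks.items():
        if total > cursor[speaker]:
            segments.append(("silence", total - cursor[speaker]))
    return tracks


TURNS = [("A", "a1", 0), ("B", "b1", 0), ("A", "a2", 100),
         ("B", "b2", -200), ("A", "a3", 0), ("A", "a4", 0)]
DURATIONS = {name: 1000 for name in ("a1", "a2", "a3", "a4", "b1", "b2")}

result = layout(TURNS, DURATIONS)
for speaker, segments in result.items():
    print(f"Track {speaker}")
    for label, duration in segments:
        print(f"  {label} ({duration} ms)")
```

With the example input, the printed segments match the Track A and Track B listings above. For instance, a2 starts at 2000 + 100 = 2100 ms, which leaves 1100 ms of silence after a1.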

The two tracks can also be visualized as follows (one character represents
100 ms, "." is silence and "*" is speech):

```
t: 0         1         2         3         4         5         6 (s)
A: **********...........**********........********************
B: ..........**********.........**********....................
                                ^ 200 ms cross-talk
        100 ms silence ^
```
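
A diagram like this can be produced from the start and end time of each turn. The following Python sketch is illustrative only: `render` and `INTERVALS` are hypothetical names, and the intervals are the ones implied by the example (all tracks 1000 ms long). It also counts the columns where both speakers are active or both are silent.

```python
def render(intervals, step_ms=100):
    """Map {speaker: [(start_ms, end_ms), ...]} to {speaker: row string}.

    Each character covers step_ms; "*" is speech and "." is silence.
    """
    total = max(end for spans in intervals.values() for _, end in spans)
    columns = -(-total // step_ms)  # ceiling division
    rows = {}
    for speaker, spans in intervals.items():
        cells = ["."] * columns
        for start, end in spans:
            for i in range(start // step_ms, -(-end // step_ms)):
                cells[i] = "*"
        rows[speaker] = "".join(cells)
    return rows


# Turn intervals implied by the example timing file.
INTERVALS = {
    "A": [(0, 1000), (2100, 3100), (3900, 4900), (4900, 5900)],
    "B": [(1000, 2000), (2900, 3900)],
}

rows = render(INTERVALS)
for speaker, row in rows.items():
    print(f"{speaker}: {row}")

# Columns where both ends speak (cross-talk) or both are silent.
pairs = list(zip(rows["A"], rows["B"]))
cross_talk_ms = 100 * sum(a == b == "*" for a, b in pairs)
silence_ms = 100 * sum(a == b == "." for a, b in pairs)
print(cross_talk_ms, silence_ms)  # 200 100
```

The two totals agree with the annotations in the diagram: 200 ms of cross-talk caused by the -200 offset, and 100 ms of mutual silence caused by the 100 offset.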