A New Efficient Approach to Query by Humming

Leon Fu, Xiang-yang Xue
Dept. of Computer Science and Engineering, Fudan University, China
{022021221@fudan.edu.cn, xyxue@fudan.edu.cn}

Abstract

In this paper, we propose a new efficient approach to Query by Humming (QBH), focusing on melody representation and similarity measurement. Since humming errors cannot be avoided and the query response must be as fast as possible, we present a new quantized pitch change description and a dual-dynamic programming algorithm to improve both precision and response time. Our system runs as a web service, uses MIDI songs as the database, and takes in-tune humming as the query. The proposed approach achieves a good balance between precision and response time.

1 Introduction

With the development of multimedia applications on the web, there is more and more audio data to access, especially music. Given the rapid growth of music collections of all kinds, many music retrieval systems have been built to meet users' needs. Most of them are based on predefined free-text retrieval, for example by music title, genre, or artist. In other cases, however, we may not know the profile information of a piece exactly and may only remember a simple, short melody. Content-based music retrieval is then the best choice, and query by humming is one of the most popular content-based directions for large music databases.

A QBH system aims at finding a desired piece of music as quickly as possible from nothing but a sung or whistled tune, which is very useful when you would like to find a song in a music library but have forgotten its title or artist. A QBH system does not need any manually predefined information; instead, it must represent the music content itself. On the one hand, for a large raw music database, the melody has to be extracted automatically, since profile information may not be easy to obtain. On the other hand, when a humming query arrives, we can likewise extract the melody from the humming automatically and match it against the melody database to find the desired piece of music.

Many problems must be solved along the way. We need an effective way to represent the melody, an efficient way to extract the melody from humming, and an efficient way to measure the similarity between the humming and the music database. A music retrieval system must also pay attention not only to the music itself but to human singing habits as well. These difficulties are discussed further below.

The architecture of our QBH system is shown in Figure 1. It consists of two subsystems: an off-line part and an on-line part. MIDI database creation is handled by the off-line subsystem, while humming processing and matching are handled by the on-line subsystem. All MIDI files are processed into features and stored in the database before any humming query arrives. At query time, the system converts the incoming humming into features and matches them against the MIDI feature database. The efficiency of both the humming processing and the similarity measurement therefore needs careful consideration.

Figure 1 System architecture (off-line path: MIDI files are converted into a MIDI feature database; on-line path: the humming input (wave stream) goes through pre-processing, melody representation, similarity measurement, and sorting)
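To make the division of labor in Figure 1 concrete, the off-line/on-line flow can be sketched as follows. This is only an illustration: the helper names (extract_midi_features, extract_humming_features, similarity) are hypothetical placeholders for the components described in the rest of this paper, not code from our system.

```python
# Minimal sketch of the Figure 1 pipeline. The three helper functions
# are assumed placeholders for the components described later.

def build_feature_database(midi_paths, extract_midi_features):
    """Off-line phase: convert every MIDI file into melody features once."""
    return {path: extract_midi_features(path) for path in midi_paths}

def answer_query(wave_stream, feature_db, extract_humming_features, similarity):
    """On-line phase: pre-process the humming, extract its melody
    representation, score it against every stored feature entry,
    and return the database entries sorted by melodic distance."""
    query_features = extract_humming_features(wave_stream)
    scores = {path: similarity(query_features, feats)
              for path, feats in feature_db.items()}
    # Lower score = smaller melodic distance, so sort ascending.
    return sorted(scores, key=scores.get)
```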
Our proposed approach to QBH focuses on a new melody description and its similarity measure in order to improve the efficiency of the on-line retrieval procedure.

2 Related Work

Most recent work on QBH focuses on three components: melody representation, humming pitch extraction, and similarity measurement. An early work [1] first proposed using only the pitch contour to represent a melody. It uses an alphabet of three symbols ('U', 'D', and 'S') for the cases where a note is above, below, or the same as the previous note; such a coarse representation cannot reach ideal performance. In another work [2], the authors argue that it is very difficult to correctly distinguish each repeated note in continuous humming, since there is no obvious energy break when a note is repeated. After merging repeated notes in the humming, they therefore use a two-symbol alphabet ('U' and 'D') to represent the pitches in the music database.
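As an illustration of these contour alphabets, the three-symbol encoding of [1] can be sketched as follows; this is our paraphrase for clarity, not code from that system.

```python
# Illustrative sketch of the three-symbol contour of [1]: each note is
# labeled 'U', 'D', or 'S' depending on whether it is above, below, or
# the same as the previous note.

def uds_contour(pitches):
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(out)

print(uds_contour([60, 62, 62, 59]))  # -> "USD"
```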

However, such a simple pitch contour description alone is not enough. In [3], pitch interval and rhythm are both used; in [4], four basic segment types are used to describe the contour; in [5], a triplet <TPB> (time signature, pitch vector, beat vector) represents the melody. In this paper, we adopt the simple pitch contour description together with a newly defined pitch description.

As to humming pitch extraction, [1] gives a thorough introduction to the human voice model, including the vocal tract model and voice formant frequencies, and tries three methods to track pitch: autocorrelation, maximum likelihood, and cepstrum analysis. Other works such as [2][3] use general pitch extraction methods such as zero-crossing detection, energy change detection, and median filtering. All these methods make pitch detection effective for discrete-note humming, but for continuous melody humming there is not yet any method that extracts pitch well.

The matching algorithm used for the similarity measure also needs consideration. In [1], an approximate string matching algorithm described by Baeza-Yates and Perleberg is used. In [6], a new distance metric between queries and songs is proposed. [2] presents a "hierarchical matching" method whose efficiency in response time is not demonstrated. A recent paper [7] improves existing Dynamic Time Warping (DTW) indexing by introducing envelope transforms to index time-series databases; it achieves high precision and error tolerance but does not detail its speed and efficiency. We therefore propose a new melody representation and an efficient matching algorithm for the similarity measure that consider both response time and precision.

3 Melody Representation

We choose MIDI files because the MIDI format records music as tracks, and every track gives us the note-on value, note-off value, note velocity, and note strength. Since melody is hard to extract from complex music wave streams and little research in that direction has shown promising progress, we adopt MIDI files for melody representation. Many MIDI files are multi-track, however, and we cannot simply merge or remove tracks. Our system handles this in the pre-processing step shown in Figure 2. Firstly, we search for the track from the beginning of the file, because the main track is usually arranged first: we select the first track whose note count is larger than a threshold related to the total note count and the track count of the file. Secondly, we select the notes from the chosen track; for a chord, we average its notes into a single selected note. We do not merge two or more tracks into one, because a mixed track proves quite different from the humming. This procedure rests on three observations: 1. most songs have only one main track; 2. the main track must have sufficient notes; 3. MIDI authors usually put the main track at the beginning of the file.

Figure 2 MIDI pre-process (track selection followed by note selection)
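This pre-processing step can be sketched as follows, assuming the MIDI file has already been parsed into a list of tracks, where each track is a list of chords (a plain note being a one-element chord). Since the exact threshold formula is not specified above, the one used here is only an assumption.

```python
# Sketch of the MIDI pre-processing of Figure 2. The threshold below
# (half the average notes per track) is an assumed form; the paper only
# states that it depends on the total note count and the track count.

def select_main_track(tracks):
    """Pick the first track whose note count exceeds the threshold."""
    total_notes = sum(len(track) for track in tracks)
    threshold = total_notes / (2 * len(tracks))  # assumed formula
    for track in tracks:
        if len(track) > threshold:
            return track
    return max(tracks, key=len)  # fallback: densest track

def select_notes(track):
    """Collapse each chord into a single note by averaging its pitches."""
    return [sum(chord) / len(chord) for chord in track]

tracks = [[(60,), (64, 67), (65,)], [(40,), (43,)]]
print(select_notes(select_main_track(tracks)))  # [60.0, 65.5, 65.0]
```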
As to the humming melody representation, a humming query carries no explicit melody information the way a MIDI file does. All humming pieces are recorded through a microphone and converted into PCM. The note selection process is shown in Figure 3. Firstly, the humming query is converted into quantized PCM data. Secondly, the data is divided into silence-free segments using energy detection and zero-crossing detection; since every stretch of silence represents a humming pause, different segments usually carry different notes. Thirdly, each segment is smoothed with a median filter, and notes are then obtained by pitch detection; depending on the pitch detection results, we select one or two major notes per segment.

Figure 3 Humming pre-process (segmentation, smoothing, pitch detection, note selection)

Considering the efficiency of both the humming representation and the similarity measure, we still adopt the pitch contour description. It is simple but effective for matching in practice, since it ignores the size of the interval between two notes; for example, "Do So" and "Do Mi" are treated as the same. In our system we use a UD string that ignores repeated notes, so every note becomes a U or a D. For example, the pitch sequence (36, 37, 40, 40, 33, 30) is converted into "_UUDD". Repeated notes are also ignored during humming processing, because we cannot guarantee that the exact same note is hummed and detected identically. Note durations are ignored as well, since we cannot expect humming to preserve durations accurately; keeping them can even add noise to the similarity measure.
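The UD-string conversion can be sketched as follows; it is a minimal illustration of the description just defined, and the example reproduces the "_UUDD" string above.

```python
# Sketch of the pitch contour (UD string) description: repeated notes
# are merged, the first note becomes '_', and every later note becomes
# 'U' or 'D' relative to its predecessor.

def pitch_contour(pitches):
    contour = ["_"]
    prev = pitches[0]
    for p in pitches[1:]:
        if p == prev:          # repeated note: ignore
            continue
        contour.append("U" if p > prev else "D")
        prev = p
    return "".join(contour)

print(pitch_contour([36, 37, 40, 40, 33, 30]))  # -> "_UUDD"
```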

Furthermore, we propose a new description called the quantized pitch change description. It ignores tiny differences between notes within a small range of the musical scale; for example, "Do Re" and "Re Do" are treated as the same. Firstly, we find the highest and lowest notes in the selected track. Secondly, we quantize every note as shown in formula (1). As we know, a humming query is only a short piece of music, while a MIDI file is a complete one; moreover, a song usually has a climax, but a humming query may be any short piece of the song, so the two may be sung in different keys and become incomparable. Therefore, we convert the quantized note values into quantized change values relative to the adjacent value, also shown in formula (1), to reduce the effect of different keys. For the example above, we get the pitch description "CCDBA", where "A" stands for 30-32, "B" for 33-35, "C" for 36-38, and "D" for 39-41, and finally the description string "_+0+1-2-1".

Q_k(Note_{k,i}) = floor((Note_{k,i} - min(Note_{k,0}, ..., Note_{k,l})) / a)
QC_k(Note_{k,i}) = Q_k(Note_{k,i}) - Q_k(Note_{k,i-1})                      (1)

where k is the index of the MIDI file in the database, i denotes the i-th note of the k-th MIDI file, l is the length of the MIDI or humming sequence, and a is the quantizing step; in our system, a = 1 + (max(Note_{k,0}, ..., Note_{k,l}) - min(Note_{k,0}, ..., Note_{k,l})) / 8 for MIDI. The query uses a variable step a, always kept equal to the step of the MIDI file currently being matched.

These two descriptions are not complicated but perform well, especially in error tolerance. The pitch contour description carries local information about adjacent note changes, while the quantized pitch change description carries global information about all note changes.

4 Similarity Measure

Figure 4 Similarity measure process (humming data and MIDI data are each converted into pitch contour descriptions and quantized pitch change descriptions, which are then matched in parallel)

Since the MIDI database and the humming query have both been converted into melody descriptions, a suitable similarity measure must be proposed. The on-line response time consists of two parts: the humming pre-processing time, which is independent of the database size, and the matching time, which is not. So if we shorten the time to match each MIDI file in the database by reducing the time complexity of the algorithm, we shorten the whole retrieval procedure. In our system, we propose the dual-dynamic programming algorithm shown in Figure 4.

Dynamic programming has proved to be an effective and efficient string matching algorithm, as in the example of Figures 5 and 6, computed by formula (2):

C[0, j] = 0
C[i, 0] = i                                                                 (2)
C[i, j] = C[i-1, j-1] if P_i = T_j,
          1 + min(C[i-1, j], C[i, j-1], C[i-1, j-1]) otherwise

where C[i, j] (the error value) is the minimum number of errors needed to match P_1...P_i to a suffix of T_1...T_j. Matching the two strings once therefore yields a whole list of error values in the last row. For example, in Figure 5, the bottom-right two values are the ones we need for scoring: the value "2" means that the distance between "CDBA" and "CDDB" is 2, while the value "1" means that the distance between "CDBA" and "DDBA" is 1. A single pass thus produces all the matching results.

       C  D  D  B  A
    0  0  0  0  0  0
 C  1  0  1  1  1  1
 D  2  1  0  1  2  2
 B  3  2  1  1  1  2
 A  4  3  2  2  2  1

Figure 5 Deleting case (query "CDBA" matched against "CDDBA")

       C  D  D  B  A
    0  0  0  0  0  0
 C  1  0  1  1  1  1
 D  2  1  0  1  2  2
 B  3  2  1  1  1  2
 B  4  3  2  2  1  2
 A  5  4  3  3  2  1

Figure 6 Replacing case (query "CDBBA" matched against "CDDBA")

In our approach there are two main melody descriptions, the pitch contour description and the quantized pitch change description, so we run the dynamic programming algorithm on both in parallel; we name this dual-dynamic programming.
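As a concrete check, the recurrence of formula (2) can be sketched as follows; this is our illustration of the standard approximate-matching table, and it reproduces the last row of Figure 5.

```python
# Sketch of formula (2). C[i][j] is the minimum number of errors needed
# to match the first i pattern symbols against some suffix of the first
# j text symbols; the last row scores every end position in one pass.

def error_matrix(pattern, text):
    m, n = len(pattern), len(text)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return C

# Reproduces the last row of Figure 5: [4, 3, 2, 2, 2, 1]
print(error_matrix("CDBA", "CDDBA")[-1])
```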
For pitch contour description matching we can use formula (2) directly, but for quantized pitch change description matching we modify formula (2) into formula (3). Here we adjust the error values by the distance between the two quantized pitch change values instead of simply adding one. There is also the fact that the larger a pitch change is, the more easily a person makes a mistake when humming it, so we must reduce the effect of such errors: in formula (3), the error contribution becomes smaller as the pitch change T_j in the database grows.

C'[0, j] = 0
C'[i, 0] = i                                                                (3)
C'[i, j] = C'[i-1, j-1] if |P_i - T_j| / (|T_j| + 1) < δ,
           |P_i - T_j| / (|T_j| + 1) + min(C'[i-1, j], C'[i, j-1], C'[i-1, j-1]) otherwise

where P_i is the i-th value of the humming description, T_j is the j-th value of the database description, and δ is a threshold. This also resolves the problem that many songs would otherwise tie with the same score; now every song receives a reasonable score.

As a result, each humming query yields two error value matrices, C and C', for every MIDI file in the database. As demonstrated above, the last row of each matrix is what we need to calculate the error score for similarity. We combine the two last-row error lists for each MIDI file using formula (4):

Err_{k,i} = γ * C_{UD}[l1, i] + (1 - γ) * C'_{QC}[l1, i]                    (4)

where l1 is the length of the humming description string, i runs from l1 to the length of the k-th MIDI description string, and γ is a weight, 0.3 in our system.
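These two steps can be sketched as follows. γ = 0.3 follows the text; the default value of δ below is only an assumption, since its exact value is not reported.

```python
# Sketch of the modified recurrence (formula (3)) and the combination
# step (formula (4)). P and T are the quantized pitch change sequences
# of the humming and of one MIDI file.

def error_matrix_qc(P, T, delta=0.3):  # delta value is an assumption
    m, n = len(P), len(T)
    C = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = float(i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Distance normalized by the database pitch change |T_j|:
            # big intervals, harder to hum exactly, are penalized less.
            d = abs(P[i - 1] - T[j - 1]) / (abs(T[j - 1]) + 1)
            if d < delta:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = d + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return C

def combined_errors(C_ud, C_qc, l1, gamma=0.3):
    """Formula (4): blend the last rows of the two matrices over the
    positions i = l1 .. l2 used later by formula (5)."""
    last_ud, last_qc = C_ud[-1], C_qc[-1]
    return [gamma * last_ud[i] + (1 - gamma) * last_qc[i]
            for i in range(l1, len(last_ud))]
```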

Finally, we obtain the similarity between the query and each MIDI file using formula (5) and sort the results for return:

Sim(Hum, Midi_k) = min(Err_{k,l1}, ..., Err_{k,l2}) / l1                    (5)

where l1 is the length of the humming description string and l2 is the length of the k-th MIDI description string.

5 Experiment Results

In practice, if the target song in a 10,000-song database appears in the top 5, 10, or even 20 results, most users will be satisfied; but if the response time on the web exceeds one minute, most users will not have such patience. Since the response time grows linearly with the number of songs in the database, our system must focus on response time as well as precision.

Our system implements the approach above as a web service. For the experiments, we gathered 1904 pieces of MIDI songs and music from the internet and 38 humming queries hummed by people with no musical background. All queries are about 15-20 seconds long, and 12-16 pitch contour and pitch change description values were extracted per query. As discussed above, shortening the similarity measurement time per song is critical, so we compare the new dual-dynamic programming approach with two other approaches. Test 1 uses the traditional dynamic programming approach, ranking only by the pitch contour description. Test 2 ranks by the pitch contour description first and then re-ranks the songs using the similarity of pitch interval and duration plus an extra rhythm description, in order to obtain higher precision.

According to the experimental results shown in Figures 7 and 8, our new approach answers a query in about 4.47 seconds on average (excluding humming pre-processing time), lists the desired song in the top 1 for about 55.3% of queries, and in the top 10 for 84.2%. Test 1 takes about 1.98 seconds per query but places the desired song in the top 1 for only 15.8% of queries and in the top 10 for 31.6%. Test 2 has the best ranking, with 63.2% of desired songs in the top 1 and 89.5% in the top 10, but takes as long as 9.01 seconds per query on average. Our new approach therefore offers the better trade-off between response time and precision.

Figure 7 Queries top rank percentage (percentage of queries whose target falls within the top 1/5/10/20/50, for dual-dynamic programming, Test 1, and Test 2)

Figure 8 Queries response time percentage (distribution of response times from 1 to 11 seconds, for dual-dynamic programming, Test 1, and Test 2)

6 Conclusion

Query by humming is a new application of content-based music retrieval. Many problems remain open, such as handling sung lyrics and building retrieval over unstructured music formats such as MP3. In this paper, we have presented a new approach to the problems of query precision and query response in QBH. We hope to address other problems in this area in future work.

7 Acknowledgements

This work was supported in part by NSF of China under contracts 60003017 and 60373020, 863 Plans under contract 2002AA103065, Shanghai Municipal R&D Foundation under contracts 03DZ15019 and 03DZ14015, and MoE R&D Foundation under contract 104075.

References

[1] A. Ghias, et al., "Query By Humming: Musical Information Retrieval in an Audio Database," Proc. of ACM Multimedia '95, pp. 231-236, 1995.
[2] L. Lu, H. You, and H.-J. Zhang, "A New Approach to Query by Humming in Music Retrieval," Microsoft Research, China.
[3] R. J. McNab, et al., "Towards the Digital Music Library: Tune Retrieval from Acoustic Input," Proc. of Digital Libraries, pp. 11-18, 1996.
[4] A. L. P. Chen, M. Chang, and J. Chen, "Query by Music Segments: An Efficient Approach for Song Retrieval," Proc. of IEEE International Conference on Multimedia and Expo, 2000.
[5] Y. Kim, W. Chai, R. Garcia, and B. Vercoe, "Analysis of a Contour-Based Representation for Melody," Proc. of the International Symposium on Music Information Retrieval, Oct. 2000.
[6] C. Francu and C. G. Nevill-Manning, "Distance Metrics and Indexing Strategies for a Digital Library of Popular Music," Proc. of IEEE International Conference on Multimedia and Expo, 2000.
[7] Y. Zhu and D. Shasha, "Warping Indexes with Envelope Transforms for Query by Humming," Proc. of SIGMOD 2003, June 9-12, 2003, San Diego, CA.