

International Conference on Intelligent Control and Information Processing
August 13-15, 2010 - Dalian, China

A High-efficiency Algorithm for Mining Frequent Itemsets over Transaction Data Streams

Zhaoyang Qu, Peng Li, and Yaying Li


Abstract—The dynamic and unbounded nature of data streams makes traditional frequent itemset mining algorithms no longer applicable. In this paper, according to the characteristics of data streams, we propose a novel algorithm, MFIBA (Mining Frequent Itemsets based on Bitwise AND), which mines frequent itemsets using the bitwise AND operation. The algorithm updates the sliding window in units of basic windows, maintains each item's frequency information in memory in an array structure, and finally obtains all the frequent itemsets by applying bitwise AND operations between items. The arrays are updated dynamically when a basic window is inserted into the sliding window. Analysis and experimental results show that the algorithm performs well.

Manuscript received May 24, 2010. This paper is supported by the Grants of the Jilin Provincial Technology Development Project (No. 20080322) and "The Eleventh Five-Year Plan" Science and Technology Education research project of the Jilin Provincial Department of Education (2008(41)). Zhaoyang Qu is with Northeast Dianli University, Jilin (corresponding author; phone: 13843225791; e-mail: [email protected]). Peng Li is with the School of Information Engineering, Northeast Dianli University, Jilin (e-mail: [email protected]). Yaying Li is with the School of Information Engineering, Northeast Dianli University, Jilin.

I. INTRODUCTION

A data stream DS is a dynamically growing transaction set formed by continually arriving transactions [1]. Data streams exist widely in many areas of daily life, such as network monitoring, meteorological monitoring, sensor networks, and stock analysis. Frequent itemset mining, a critical step of association rule mining, has been researched extensively at home and abroad over the past ten-odd years, and many classical mining algorithms for static data sets have been proposed. However, because the data in a data stream arrive continually and in real time, are unlimited in amount, and cannot be predicted, traditional frequent itemset mining algorithms cannot be applied directly to the data stream environment, so frequent itemset mining over data streams has become a challenging research direction.

Window techniques arose because users are more interested in the recent data. Depending on the time scope, query windows in the data stream environment can be divided into three kinds: the snapshot window, the landmark window, and the sliding window. The snapshot window model can only examine the data within a fixed time range, so it is not used frequently. Algorithms based on the landmark window model mine all the data from a particular time point up to the current time; most of the early algorithms are based on this model. In addition, the FP-stream algorithm is derived from the FP-growth algorithm by adding a tilted-time window that saves pattern information at different time granularities, so the algorithm has good dynamic adaptability [2]. The Lossy Counting algorithm introduces an error parameter ε, conceptually divides the arriving data stream into equal-sized buckets, and gives an algorithm for mining frequent itemsets within a single bucket [3]. In real applications, people are more concerned about the recently arrived data, so mining frequent itemsets in a sliding window over data streams is receiving more and more attention.
The DSFPM algorithm constructs the DSFPM-Tree data structure to save the potential frequent itemsets and slides the window in units of basic windows; because it only processes and saves the potential frequent closed itemsets of every basic window, it improves both time and space efficiency [4].

This paper proposes a method, MFIBA, for efficiently mining frequent itemsets in a sliding window over data streams. The method uses a two-dimensional array to compress and save the potential frequent itemsets of the sliding window, and when a basic window of the data stream arrives, it updates the array incrementally with only a single scan of the transactions.

II. BASIC CONCEPTS OF DATA STREAMS

A transaction data stream DS = {T_1, T_2, ..., T_i, ...} is a transaction sequence formed by an unbounded number of transactions, where T_i is the i-th arrived transaction and Tid is the unique identifier of a transaction in the data stream. Suppose the set I = {i_1, i_2, ..., i_m} is formed by m different items; the data elements of every transaction form a subset of I. The sliding window SW contains the W most recently arrived transactions of the data stream DS. Every itemset A is a subset of the item set I; the number of items in A is called the size of the itemset A, denoted |A|, and an itemset of size |A| is called an |A|-itemset. If a transaction T contains all the items of an itemset A, we say that T contains A.

Definition 1 (Frequent itemset): Suppose A is an itemset, A ⊆ I. In the data stream DS, the number of transactions that contain the itemset A is called the support count of A, denoted f(A). The ratio of the support count of A to the total number of transactions W in the data stream is called the support of A, denoted sup(A), i.e. sup(A) = f(A)/|W|. Let s be a user-specified minimum support threshold; if the support of A is no less than s, then A is called a frequent itemset over the data stream DS.

Definition 2 (Basic window): Partition the continually arriving data stream into blocks containing equal numbers of transactions; every block is called a basic window, denoted w, and the value of w indicates the width of the basic window. The sliding window is formed by k consecutive basic windows, where k is a fixed value, namely W = {w_1, w_2, ..., w_k}. As transactions keep arriving, the newest basic window is inserted into the sliding window and, at the same time, the oldest basic window is deleted from it, so the sliding window is always formed by the latest k basic windows.

Definition 3: Given the user-specified minimum support threshold s and the error parameter ε, let |W| be the width of the sliding window and f_W(A) the support count of an itemset A in the sliding window W. Then: if f_W(A) ≥ s|W|, A is called a frequent itemset over the sliding window W; if f_W(A) ≥ ε|W|, A is called a sub-frequent itemset over the sliding window W; if f_W(A) < ε|W|, A is called an infrequent itemset over the sliding window W.
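As a numerical illustration of Definition 3 (the values of w, s and ε below match the example used later in Section III, while k and hence |W| are hypothetical here):

\[
w = 10, \quad k = 10, \quad |W| = k \cdot w = 100, \quad s = 0.3, \quad \varepsilon = 0.1\,s = 0.03,
\]
\[
f_W(A) \ge s|W| = 30 \;\Rightarrow\; A \text{ is frequent}, \qquad
f_W(A) \ge \varepsilon|W| = 3 \;\Rightarrow\; A \text{ is sub-frequent}, \qquad
f_W(A) < \varepsilon|W| = 3 \;\Rightarrow\; A \text{ is infrequent}.
\]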
Theorem 1: Delete the infrequent itemsets from every basic window; then even for an itemset that is frequent over the entire sliding window, the error of its output support is no more than ε.

Proof: The sliding window W is formed by k basic windows, namely W = {w_1, w_2, ..., w_k}. For the given minimum support threshold s and error parameter ε, suppose that the itemset A is frequent over the entire sliding window, is sub-frequent in the basic windows w_i (1 ≤ i ≤ k), and is infrequent in the basic windows w_j (1 ≤ j ≤ k), namely f_{w_i}(A) ≥ ε|w_i| and f_{w_j}(A) < ε|w_j|. Because f_W(A) = Σ_i f_{w_i}(A) + Σ_j f_{w_j}(A), after deleting the counts of A in the basic windows w_j, the count of A kept for the entire sliding window is Σ_i f_{w_i}(A) = f_W(A) − Σ_j f_{w_j}(A) ≥ s|W| − ε Σ_j |w_j|. Since A is frequent over W it cannot be infrequent in every basic window, so Σ_j |w_j| < |W| and therefore Σ_i f_{w_i}(A) > (s − ε)|W|. Hence, if the infrequent itemsets are deleted from every basic window, then even for an itemset that is frequent over the entire sliding window, the error of its output support is no more than ε.

We can see from Theorem 1 that it suffices to save the sub-frequent itemsets of every basic window in order to output all the frequent itemsets of the sliding window within the error parameter ε.

III. THE MFIBA ALGORITHM

The MFIBA algorithm maintains a buffer in main memory to temporarily save the incoming data of the data stream; when w transactions have arrived, the algorithm takes these w transactions as a basic window and inserts it into the sliding window. The sliding window contains k basic windows, and every basic window is numbered according to the sequence in which it is inserted into the sliding window.

The algorithm consists of two steps:
First, scan and process the basic windows, and save the frequent item information of every basic window in a particular two-dimensional array structure;
Second, mine the two-dimensional arrays and obtain all the frequent itemsets of the sliding window.

A. Constructing the two-dimensional array

The MFIBA algorithm saves all the potential frequent itemsets within the sliding-window scope in a two-dimensional array structure.
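For concreteness, one row of this array can be pictured as an item identifier followed by w occurrence bits. The small C++ sketch below uses our own naming (ItemRow is not from the paper) and shows such a row for an item that occurs in the 1st, 2nd, 6th, 7th and 9th transactions of a width-10 basic window, as item i1 does in the example that follows.

```cpp
#include <vector>

// Sketch of one row of the a x (w+1) two-dimensional array: the first entry
// identifies the item, the remaining w entries are occurrence bits, one per
// transaction of the basic window (1 = the transaction contains the item).
struct ItemRow {
    int item;               // item identifier, e.g. 1 for i1 (hypothetical encoding)
    std::vector<int> bits;  // w bits; the number of 1s is the item's count in the window
};

// Example row: an item occurring in transactions T1, T2, T6, T7 and T9 of a
// width-10 basic window; its frequency count in the window is 5.
ItemRow exampleRow{1, {1, 1, 0, 0, 0, 1, 1, 0, 1, 0}};
```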
Suppose there are a different frequent items in the basic window being processed, the width of the basic window is w, and the item set I = {i_1, i_2, ..., i_m} is formed by m different items; then all the frequency information of the basic window can be saved in an a × (w + 1) two-dimensional array.

Generating the two-dimensional array of the basic window numbered N_i (1 ≤ i ≤ k):

Input: all the transaction data of the basic window numbered N_i (1 ≤ i ≤ k), the minimum support threshold s, and the error parameter ε;
Output: the two-dimensional array of the basic window numbered N_i (1 ≤ i ≤ k).

(1) for (j = 1; j ≤ w; j++) {
(2)   for every distinct item i_l (1 ≤ l ≤ m) in transaction T_j, check whether a one-dimensional array whose first value is i_l already exists;
(3)   if (no such array exists) {
(4)     create a one-dimensional array of length w + 1;
(5)     set the first value of the array to i_l, the (j + 1)-th value to 1, and the other values to 0; }
(6)   else
(7)     set the (j + 1)-th value of the one-dimensional array whose first value is i_l to 1, leaving the other values unchanged;
(8) }
(9) count the 1s in every one-dimensional array; this value is the frequency count of the corresponding item in the basic window;
(10) for (every item i_l in the basic window) {
(11)   if (f_{N_i}(i_l) < ε|N_i|)
(12)     i_l is an infrequent item in basic window N_i, so delete the one-dimensional array that stores i_l;
(13) }
(14) count the remaining one-dimensional arrays, and denote their number by a;
(15) create an a × (w + 1) two-dimensional array;
(16) store the a one-dimensional arrays in the two-dimensional array.

The following instance illustrates the procedure of constructing the two-dimensional array. Suppose the item set is I = {i_1, i_2, ..., i_10} and the data stream episode shown in Table I arrives. These ten transactions are taken as one basic window, and the transactions arrive in time order. The algorithm processes the first item i_1 of transaction T_1 and creates a one-dimensional array of length eleven whose first value is i_1, whose second value is 1, and whose other values are 0. When the second item i_3 of T_1 is processed, a one-dimensional array is created for i_3, with first value i_3, second value 1, and the other values 0; after the first transaction has been processed, four one-dimensional arrays have been created. When the algorithm processes the first item i_3 of transaction T_2, an array for item i_3 was already created while processing T_1, so only the third value of the i_3 array needs to be set to 1, the other values remaining unchanged; after the second transaction has been processed, five arrays are maintained. After the last transaction has been processed we obtain Table II, in which every row represents a one-dimensional array and the last column gives the frequency count of the corresponding item.

Suppose the user-specified minimum support threshold is s = 0.3 and the error parameter is ε = 0.1 × s = 0.03; then all ten different items of this data stream episode are sub-frequent items, so the 10 × 11 two-dimensional array formed by the ten one-dimensional arrays of the sub-frequent items saves all the frequency information of this basic window.
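The construction steps (1)-(16) can be sketched in C++ as follows; this is an illustrative sketch under our own naming (buildWindowArray is not the authors' code), reusing the ItemRow type sketched at the beginning of this subsection and repeated here so the listing is self-contained. It scans the transactions of one basic window once, records each item's occurrence bits, and then deletes the items whose count falls below ε·w.

```cpp
#include <cstddef>
#include <vector>

// Row type as sketched earlier in this subsection (repeated so the listing
// is self-contained).
struct ItemRow {
    int item;                 // item identifier i_l
    std::vector<int> bits;    // w occurrence bits; bits[j] == 1 iff T_{j+1} contains the item
    int count() const {       // frequency count of the item in the basic window
        int c = 0;
        for (int b : bits) c += b;
        return c;
    }
};

// Build the a x (w+1) array of one basic window in a single scan.
// 'transactions' holds the w transactions of the window, each a list of item ids;
// 'eps' is the error parameter used to drop items that are infrequent in this window.
std::vector<ItemRow> buildWindowArray(const std::vector<std::vector<int>>& transactions,
                                      double eps) {
    const std::size_t w = transactions.size();
    std::vector<ItemRow> rows;
    for (std::size_t j = 0; j < w; ++j) {                 // step (1): scan the window once
        for (int item : transactions[j]) {                // step (2): each item of T_{j+1}
            ItemRow* row = nullptr;
            for (ItemRow& r : rows)
                if (r.item == item) { row = &r; break; }
            if (row == nullptr) {                         // steps (3)-(5): first occurrence of the item
                rows.push_back(ItemRow{item, std::vector<int>(w, 0)});
                row = &rows.back();
            }
            row->bits[j] = 1;                             // steps (6)-(7): mark this occurrence
        }
    }
    std::vector<ItemRow> kept;                            // steps (14)-(16): the remaining a rows
    for (const ItemRow& r : rows)
        if (r.count() >= eps * static_cast<double>(w))    // steps (9)-(13): delete infrequent items
            kept.push_back(r);
    return kept;
}
```

Applied to the ten transactions of Table I with ε = 0.03, this sketch keeps all ten items and yields the ten rows listed in Table II.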
B. Mining frequent itemsets

When a user query arrives, the algorithm mines the basic windows in window-number order and finally obtains all the frequent itemsets of the sliding window by combining the frequent itemsets of every basic window. The specific procedure is as follows:

(1) For two rows l_α and l_β (1 ≤ α, β ≤ a, α ≠ β) of the a × (w + 1) two-dimensional array, where row l_α corresponds to item i_j and row l_β corresponds to item i_k, the algorithm performs a bitwise AND operation between the two rows and saves the result in a one-dimensional array of length w + 1; the first value of this array is the itemset i_j i_k, and the other w values are the result of the bitwise AND of the two rows;

(2) Count the 1s in the array that saves the 2-itemset; this is the support count of the 2-itemset. According to the user-specified minimum support threshold s and the error parameter ε, delete the infrequent 2-itemsets;

(3) Perform a bitwise AND operation between the one-dimensional array that saves the 2-itemset i_j i_k and the row of the two-dimensional array that saves an item i_e (1 ≤ e ≤ m, e ≠ j, e ≠ k); save the result in a one-dimensional array whose first value is the itemset i_j i_k i_e and whose other w values are the result of the bitwise AND of the two rows;

(4) Continue in the same way until no further frequent itemsets are generated; the algorithm then terminates.

TABLE I
DATA STREAMS INSTANCE

Tid    Items
T1     i1 i3 i5 i4
T2     i3 i2 i1 i5
T3     i6 i8 i4
T4     i3 i6 i7 i9
T5     i2 i8
T6     i1 i3 i10
T7     i1 i5 i4 i2
T8     i2 i10 i9 i4
T9     i3 i5 i4 i1
T10    i10 i6 i7 i4

TABLE II
TWO-DIMENSIONAL ARRAY OF THE BASIC WINDOW

Item   T1 T2 T3 T4 T5 T6 T7 T8 T9 T10   Count
i1      1  1  0  0  0  1  1  0  1  0      5
i3      1  1  0  1  0  1  0  0  1  0      5
i5      1  1  0  0  0  0  1  0  1  0      4
i4      1  0  1  0  0  0  1  1  1  1      6
i2      0  1  0  0  1  0  1  1  0  0      4
i6      0  0  1  1  0  0  0  0  0  1      3
i8      0  0  1  0  1  0  0  0  0  0      2
i7      0  0  0  1  0  0  0  0  0  1      2
i9      0  0  0  1  0  0  0  1  0  0      2
i10     0  0  0  0  0  1  0  1  0  1      3
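Steps (1)-(4) can be sketched level-wise as follows; again this is an illustrative sketch under assumed names (mineWindow, Bits, Itemset are not from the paper). The occurrence vector of an n-itemset is ANDed with the row of a further item to obtain the vector of an (n+1)-itemset, which is kept only if enough 1 bits survive.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Occurrence bits of an itemset over one basic window: bits[j] == 1 iff
// transaction T_{j+1} of the window contains every item of the itemset.
using Bits = std::vector<int>;
using Itemset = std::set<int>;

static int countOnes(const Bits& b) {
    int c = 0;
    for (int x : b) c += x;
    return c;
}

// Bitwise AND of two occurrence vectors (the core operation of steps (1) and (3)).
static Bits bitwiseAnd(const Bits& x, const Bits& y) {
    Bits r(x.size(), 0);
    for (std::size_t j = 0; j < x.size(); ++j) r[j] = x[j] & y[j];
    return r;
}

// Level-wise mining of one basic window: 'items' maps each kept item to its
// occurrence bits, 'minCount' is the minimum support count (e.g. s * w).
// Returns every itemset whose count reaches minCount, together with its count.
std::map<Itemset, int> mineWindow(const std::map<int, Bits>& items, int minCount) {
    std::map<Itemset, Bits> current;                 // frequent n-itemsets and their bit vectors
    std::map<Itemset, int> result;
    for (const auto& kv : items)                     // level n = 1: the frequent items
        if (countOnes(kv.second) >= minCount) {
            current[Itemset{kv.first}] = kv.second;
            result[Itemset{kv.first}] = countOnes(kv.second);
        }
    while (!current.empty()) {                       // steps (1)-(4): extend until nothing survives
        std::map<Itemset, Bits> next;
        for (const auto& cand : current)
            for (const auto& it : items) {
                if (cand.first.count(it.first)) continue;      // item already in the itemset
                Bits merged = bitwiseAnd(cand.second, it.second);
                int support = countOnes(merged);
                if (support >= minCount) {                     // keep only sufficiently frequent candidates
                    Itemset extended = cand.first;
                    extended.insert(it.first);
                    next[extended] = merged;                   // the map deduplicates itemsets generated
                    result[extended] = support;                //   in different orders, e.g. {i1,i3}/{i3,i1}
                }
            }
        current = std::move(next);
    }
    return result;
}
```

For example, with the rows of Table II and a minimum count of s × w = 0.3 × 10 = 3, ANDing the rows of i1 and i3 gives 1 1 0 0 0 1 0 0 1 0, a support count of 4, so the 2-itemset {i1, i3} is kept in this basic window.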
IV. TESTING AND ANALYSIS

All programs were implemented in Microsoft Visual C++ 6.0 and run on a 2.4 GHz Pentium 4 PC with 512 MB of memory under Windows XP; the synthetic data used in the experiments were generated by the IBM synthetic data generator. The synthetic data stream is denoted T5I4D1000K, where T indicates the average transaction length of the data set, I indicates the average length of the potential frequent itemsets, and D indicates the number of transactions; the number of items is set to 5K, and the other parameters of the data generator take their default values.

A. Comparative analysis

Because the MFI-TransSW algorithm, which is a one-pass scanning algorithm, is also based on the bitwise AND operation, we select MFI-TransSW as the comparison algorithm. We compare the memory consumption of the two algorithms when mining the data set T5I4D1000K under the same minimum support threshold; the experiment uses the minimum support threshold s = 0.005 and the error parameter ε = 0.1s, and sets the width of the basic window to 10000, so that the whole sliding window contains 10 basic windows. The experiment compares the performance of the two algorithms in three phases, namely the window initialization phase, the window sliding phase, and the frequent itemsets mining phase.

Figure 1 shows the memory consumption of the MFIBA algorithm and the MFI-TransSW algorithm as transactions arrive in the window initialization phase. In the MFIBA algorithm, as basic windows arrive before the sliding window is full, more and more one-dimensional arrays are maintained in memory, so the memory consumption keeps increasing. The MFI-TransSW algorithm updates the sliding window in units of single transactions and does not use the error parameter ε to delete the infrequent itemsets of each basic window, so the itemsets it maintains and its memory consumption are both larger than those of the MFIBA algorithm.

As Figure 2 shows, after the tenth basic window has been inserted the sliding window is full, and when the next basic window arrives the window sliding phase starts. Because the information of the oldest basic window is deleted when a new basic window is inserted into the sliding window, the memory consumption in the window sliding phase does not vary much.

When there is a user query, the algorithm generates the frequent itemsets from the maintained two-dimensional array of every basic window. Generating frequent itemsets from the frequent items produces a large number of candidate itemsets, many of which are infrequent, so the frequent itemsets mining phase takes up a considerable amount of memory to hold the candidate itemsets.

B. Performance analysis

We mine the data set T5I4D1000K with the MFIBA algorithm, setting the support threshold s to 0.01, 0.008 and 0.004 respectively, and observe the time and space efficiency of the algorithm. Figure 4 gives the memory consumption of the algorithm in the frequent itemsets mining phase. As the support threshold s declines, the storage space of the algorithm and the running time of every episode increase. When the support threshold is small, its impact on the algorithm's memory consumption is visible, because the smaller the support threshold, the more frequent 1-itemsets each basic window has, and hence the more frequent itemsets are generated.

Fig. 1. Memory consumption in the window initialization phase (memory consumption in MB versus arriving basic window, 1st to 10th; MFIBA vs. MFI-TransSW).

Fig. 2. Memory consumption in the window sliding phase (memory consumption in MB versus arriving transactions, 200K to 1000K; MFIBA vs. MFI-TransSW).

Fig. 3. Memory consumption in the frequent itemsets mining phase (memory consumption in MB versus running time in seconds; MFIBA vs. MFI-TransSW).