0:00:34 | This work was supported in part by grants from the

0:00:37 | United States, including the National Science Foundation.

0:00:43 | Let me go back.

0:00:48 | Okay, so,

0:00:50 | the topic of the talk is classification.

0:00:53 | So,

0:00:53 | in model-based classification, as you are all aware,

0:00:57 | you are given

0:00:58 | a prior distribution on the classes and

0:01:02 | the likelihood function of the observations given the class,

0:01:05 | and given these two things we can come up with the minimum-probability-of-error decision rule,

0:01:10 | which is the well-known maximum a posteriori probability rule,

0:01:13 | which simplifies to the maximum likelihood rule for equally likely classes.

0:01:17 | So that's model-based

0:01:19 | classification: if the model is fully specified, then you can in principle come up with the optimum

0:01:24 | decision.

0:01:25 | In contrast to this is what is known as learning-based classification,

0:01:29 | where everything is data-driven.

0:01:32 | So you are only given examples of the two classes, say,

0:01:35 | and you want to come up with an algorithm which separates these classes.

0:01:39 | The challenge that we wish to address in this scenario is

0:01:43 | that

0:01:44 | very often you encounter situations where you have high-dimensional data; for example, you have

0:01:48 | surveillance video, which can be very large,

0:01:51 | you have hyperspectral images, you have synthetic aperture radar images, and so forth.

0:01:56 | So you have high-dimensional data on the one hand, and very few examples

0:02:00 | compared to the dimensionality of the data on the other hand.

0:02:03 | Now you might say, well, why not just use a generic

0:02:07 | dimensionality reduction technique like,

0:02:09 | say, PCA or LLE or Isomap?

0:02:12 | Well, on the one hand these are really generic methods

0:02:16 | which are, you know, not

0:02:17 | really devised for the classification problem; they optimize other generic measures of merit,

0:02:24 | such as preserving geodesic distances and so forth, on the one hand.

0:02:27 | And on the other hand,

0:02:29 | they haven't been designed with a view to the high-dimensionality problem, where you have very few examples.

0:02:34 | So our approach is to sort of exploit

0:02:37 | what I shall call the

0:02:38 | latent low-dimensional sensing structure.

0:02:41 | Now, to make

0:02:42 | this clear, let's take a cartoon example.

0:02:45 | Let's suppose that

0:02:47 | you are given examples of each class, only two classes here,

0:02:50 | and a learning-based classification algorithm such as an SVM or a kernel SVM

0:02:54 | would simply take the data and learn a classification rule,

0:02:57 | completely ignoring

0:02:58 | whether any sensing structure was present or not.

0:03:02 | In contrast to this is

0:03:04 | what I would call sensing-aware classification, where, let's say, we know that these observations came from some

0:03:09 | underlying sensing process,

0:03:11 | say for example a blurring operator,

0:03:13 | and we may have either full or partial information about the blurring operator,

0:03:17 | together with some noise.

0:03:19 | And the question is: can we exploit knowledge of the fact that these observations came from some underlying sensing structure

0:03:25 | to improve the classification performance?

0:03:28 | Now,

0:03:29 | what we are actually interested in studying here is the fundamental asymptotic limits of classification in

0:03:34 | the

0:03:34 | scenario of high-dimensional data and very few samples.

0:03:38 | To make things more concrete, let's assume that the data dimension and possibly the

0:03:43 | number of samples

0:03:44 | go to infinity,

0:03:45 | while the samples per dimension

0:03:48 | go to zero.

0:03:49 | So this models the setting where you have very few samples of very high-dimensional data.

0:03:54 | But,

0:03:55 | in contrast to a number of studies in the literature which have focused on

0:03:59 | an asymptotically easy situation, we want to fix the problem difficulty asymptotically, meaning that even if the

0:04:04 | dimension increases to infinity,

0:04:06 | it's not going to be easy to classify.

0:04:09 | And

0:04:10 | what this essentially means is that we are fixing the signal-to-noise ratio as the problem scales, and this

0:04:15 | will be made precise in the mathematical model.

0:04:18 | The fundamental question we wish to answer is: what is the asymptotic classification performance

0:04:23 | in this asymptotic regime?

0:04:25 | Does the probability of error go to one-half, which means

0:04:27 | it is no better than random guessing,

0:04:29 | or does it go to the optimum Bayes

0:04:31 | probability of error, which, by the way, is not equal to one-half

0:04:36 | and not equal to zero, which is what I mean by fixing the problem difficulty, or to something else?

0:04:41 | Now,

0:04:42 | to make things more concrete, I have to

0:04:44 | pose a model, so the rest of the talk is based on analysis of a specific

0:04:48 | model;

0:04:49 | because, to understand the core of these issues, we decided to use a simple model.

0:04:53 | The model is simple in that

0:04:55 | the observations are made up of

0:04:58 | a mean location which lies in some sensing subspace; think

0:05:04 | of h as the sensing subspace.

0:05:06 | And given the class label, plus or minus one, you are at the corresponding mean location, and

0:05:10 | then

0:05:11 | you have a scalar Gaussian perturbation along the h axis,

0:05:16 | perturbed further by a vector Gaussian noise perturbation which takes you outside this

0:05:21 | subspace into the general p-dimensional space.

0:05:24 | So that's the sensing model for which we analyze the performance.

0:05:28 | And the condition is that the class-conditional means are different, so we know that the means

0:05:32 | lie in a subspace, and that

0:05:35 | there's a scalar perturbation component along the subspace, followed by a vector Gaussian perturbation that takes you outside the

0:05:40 | subspace.

0:05:43 | So that's the simple model. Now, the goal here is that you are given

0:05:47 | n p-dimensional vectors from each class,

0:05:51 | and you are to come up with a classifier

0:05:52 | and understand the asymptotic classification performance for different scenarios.
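This model can be sketched in code. The sketch below is my own reconstruction, and the names (`sample_model`, `sig_a` for the along-subspace perturbation, `sig_n` for the ambient noise) are assumptions, not the speaker's notation:

```python
import numpy as np

def sample_model(n_per_class, h, sig_a=1.0, sig_n=1.0, seed=None):
    """Draw labeled observations x = (y + w) * h + n for classes y = +/-1."""
    rng = np.random.default_rng(seed)
    p = h.size
    y = np.repeat([1.0, -1.0], n_per_class)            # class labels
    w = rng.normal(0.0, sig_a, size=y.size)            # scalar perturbation along h
    noise = rng.normal(0.0, sig_n, size=(y.size, p))   # ambient Gaussian noise
    x = (y + w)[:, None] * h[None, :] + noise
    return x, y

p = 100
h = np.ones(p) / np.sqrt(p)       # unit-energy sensing direction, ||h|| fixed
x, y = sample_model(20, h, seed=0)
print(x.shape)                    # (40, 100): 20 p-dimensional vectors per class
```

The class-conditional means are taken as plus and minus h, matching the two-class setup described above.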

0:05:57 | now |

0:05:58 | We chose the model to be simple to keep things tractable; we are after an analytical understanding.

0:06:02 | Even though it's fairly simple,

0:06:04 | there is a natural scenario where it does make sense: for example, a sensor network,

0:06:08 | where you could have, let's say, p sensors, p being the dimension of the observation in the previous slide,

0:06:13 | each component being a sensor in this case,

0:06:15 | observing some kind of underlying signal.

0:06:18 | Under one class you are observing h, which is the signal,

0:06:21 | in noise,

0:06:23 | and under the other class you observe the negative of h in noise.

0:06:27 | And the point, of course, is that

0:06:28 | you are given n observations of this weak signal per sensor

0:06:31 | under each class, and

0:06:33 | the question is

0:06:34 | how to come up with a classifier which decides

0:06:36 | whether the next observation belongs to the positive class or

0:06:40 | the negative class.

0:06:42 | Now, moving ahead, the kinds of classifiers that, for the rest of the talk, we will

0:06:46 | consider are the following.

0:06:48 | We will look at the baseline classifier, which is the full Bayes, which means you know everything about

0:06:53 | the model; so what is

0:06:54 | the test which implements that? We review it

0:06:57 | to get familiar with the notation there.

0:06:59 | Then we want to look at what I call the unstructured

0:07:02 | classifier, which means that I know that these are conditionally Gaussian observations, but I don't know the means

0:07:07 | and all the variances and covariances;

0:07:09 | I would then have to estimate everything

0:07:11 | using maximum likelihood estimates.

0:07:13 | How does that perform?

0:07:14 | And then,

0:07:15 | finally, we'll look at structure-based

0:07:17 | approaches to the classification problem.

0:07:19 | In the first case we look at the structure-aware rule with exact knowledge of the sensing subspace:

0:07:22 | how do things behave in that case?

0:07:25 | In the second case we go for a structured maximum likelihood,

0:07:28 | which means that

0:07:29 | we estimate the parameters

0:07:31 | knowing that there is a latent low-dimensional subspace, but without knowing the subspace.

0:07:35 | And finally,

0:07:36 | we'll see that

0:07:37 | we have negative results in these cases, and that will motivate a structured sparsity model.

0:07:44 | So, for the baseline model,

0:07:45 | you can write down the likelihood ratio test, you can churn through the math, and you can come up with

0:07:50 | what is the optimal decision rule.

0:07:52 | It's going to be a linear discriminant rule, and it's based on these parameters, delta and mu; it's

0:07:57 | not important to know exactly what the expressions are.

0:08:00 | Delta stands for the difference in the class-conditional means,

0:08:03 | mu is the average of the class-conditional means, and Sigma is the covariance of the observations; so

0:08:08 | the decision rule depends on these parameters.

0:08:11 | And the misclassification probability can be evaluated in closed form:

0:08:14 | it is in terms of the Q function, which is nothing but the tail probability of a standard normal,

0:08:18 | and in terms of these parameters which were written up

0:08:22 | here.
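Written out for this model (a hedged reconstruction in my own notation: means plus and minus h, so delta = 2h and mu-bar = 0, with Sigma = sig_a^2 h h^T + sig_n^2 I), the rule and its closed-form error are:

```python
import math
import numpy as np

def q_func(t):
    """Standard-normal tail probability Q(t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def bayes_rule(x, h, sig_a=1.0, sig_n=1.0):
    """Oracle linear discriminant sign(delta^T Sigma^{-1} (x - mu_bar))."""
    sigma = sig_a**2 * np.outer(h, h) + sig_n**2 * np.eye(h.size)
    w = np.linalg.solve(sigma, 2.0 * h)     # Sigma^{-1} delta, with delta = 2h
    return np.sign(x @ w)                   # mu_bar = 0 for means +/- h

def bayes_error(h, sig_a=1.0, sig_n=1.0):
    """Closed-form error Q(0.5 * sqrt(delta^T Sigma^{-1} delta))."""
    sigma = sig_a**2 * np.outer(h, h) + sig_n**2 * np.eye(h.size)
    d2 = 4.0 * h @ np.linalg.solve(sigma, h)
    return q_func(0.5 * math.sqrt(d2))

# With unit-energy h this collapses to Q(1 / sqrt(sig_a^2 + sig_n^2)):
h = np.ones(50) / math.sqrt(50.0)
print(bayes_error(h))
```

Note that h is an eigenvector of Sigma, which is why the error depends only on the noise variances once the norm of h is fixed.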

0:08:23 | The important thing is that here we fix the difficulty of the problem as the dimension scales,

0:08:28 | which means that I have to fix the argument of the Q function.

0:08:31 | That amounts essentially to fixing almost everything here, in particular the energy of the sensing vector

0:08:37 | h.

0:08:38 | So we want to keep the norm of h fixed as things scale, and that's an important part of

0:08:42 | this work.

0:08:44 | So that's what the full Bayes rule looks like.

0:08:48 | Now for the case where we know that it's conditionally Gaussian, but we don't know any

0:08:52 | of these parameters. So,

0:08:54 | this is what the Bayes classifier looks like,

0:08:56 | but I don't know the model,

0:08:59 | so I have to estimate all these parameters from the data I'm given.

0:09:03 | So one natural approach is to use a plug-in estimator, which means estimate all these parameters

0:09:08 | using the data given

0:09:10 | and plug them into the optimum decision rule.

0:09:13 | Then you get what is known as the empirical Fisher rule,

0:09:17 | and you can analyze the probability of error; you can get a closed-form expression and

0:09:22 | look at what happens to that probability of error as

0:09:25 | the samples per dimension go down to zero and the dimension increases to infinity,

0:09:28 | while you fix the difficulty level.
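The plug-in rule below is my own minimal reconstruction of this approach, not the speaker's code: estimate every parameter of the linear discriminant by maximum likelihood and substitute. When n is much smaller than p the ML covariance is singular, so a pseudo-inverse stands in purely to keep the sketch runnable.

```python
import numpy as np

def empirical_fisher(x_train, y_train):
    """Plug-in 'empirical Fisher' linear discriminant from ML estimates."""
    m_pos = x_train[y_train > 0].mean(axis=0)          # ML class means
    m_neg = x_train[y_train < 0].mean(axis=0)
    delta_hat = m_pos - m_neg                          # estimated mean difference
    mu_hat = 0.5 * (m_pos + m_neg)                     # estimated average mean
    centered = x_train - np.where(y_train[:, None] > 0, m_pos, m_neg)
    sigma_hat = centered.T @ centered / len(y_train)   # ML pooled covariance
    w = np.linalg.pinv(sigma_hat) @ delta_hat          # pinv: singular when n < p
    return lambda x: np.sign((x - mu_hat) @ w)
```

As the talk argues, when samples per dimension go to zero this rule's error tends to one-half.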

0:09:31 | Now,

0:09:32 | it turns out,

0:09:33 | not surprisingly, that the probability of error goes to one-half,

0:09:36 | which means no better than random guessing.

0:09:38 | Now, that's not surprising, because

0:09:40 | you're trying to estimate far more parameters than you have data for;

0:09:44 | so asymptotically you don't catch up with the load of information that you have to estimate.

0:09:50 | So ignoring the structure and estimating all parameters is not a good idea, and

0:09:54 | let's move on to

0:09:56 | structured approaches.

0:09:58 | So, as a reminder, that's the sensing model,

0:10:01 | and let's suppose, at the one extreme, that we know the sensing structure, which means that I know the subspace

0:10:06 | in which the observations lie,

0:10:08 | the underlying one-dimensional subspace.

0:10:11 | So the natural thing to do in this case is: why not project everything down to the one-dimensional subspace?

0:10:15 | Then it becomes a scalar learning-based classification problem:

0:10:19 | estimate all the parameters

0:10:21 | in that reduced one-dimensional problem using the data you have, via maximum likelihood estimates, and

0:10:26 | see what happens.

0:10:27 | Okay?

0:10:28 | That leads you to what I call the projected empirical Fisher rule,

0:10:32 | and there's an exact expression; as I said, the exact expression is not very important,

0:10:37 | but the idea is that

0:10:38 | you know the sensing subspace, you project everything down to it, and reduce it to a one-dimensional problem.
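A sketch of this projected rule, under my assumption of a one-dimensional subspace spanned by a known vector h (the function and variable names are mine, not the talk's):

```python
import numpy as np

def projected_fisher_known_h(x_train, y_train, h):
    """Project onto the known direction h, then classify the resulting scalars."""
    u = h / np.linalg.norm(h)                 # unit vector spanning the subspace
    z = x_train @ u                           # scalar projections
    m_pos = z[y_train > 0].mean()             # ML estimates in the 1-D problem
    m_neg = z[y_train < 0].mean()
    thresh = 0.5 * (m_pos + m_neg)            # midpoint threshold
    sgn = 1.0 if m_pos > m_neg else -1.0      # orient the decision
    return lambda x: sgn * np.sign(x @ u - thresh)
```

The ambient dimension p never enters the projected problem, which is why the error can approach the Bayes limit as n grows.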

0:10:43 | And the probability of error is shown here,

0:10:46 | asymptotically, as the number of samples goes to infinity.

0:10:49 | It turns out, not surprisingly again, that

0:10:51 | if you keep the difficulty level of the problem fixed and

0:10:54 | send the number of samples to infinity,

0:10:57 | the probability of error goes to the Bayes error probability, which is the optimum thing

0:11:02 | you can do.

0:11:03 | Now, this is to be expected, because

0:11:06 | you know there is a latent one-dimensional structure in this

0:11:10 | problem and you know it exactly, so when you project down to that subspace,

0:11:14 | the actual dimension of the data is irrelevant.

0:11:17 | So p doesn't appear in this equation at all.

0:11:20 | You're in a scalar classification problem, and we know that when you do maximum likelihood

0:11:25 | estimation with an increasing number of samples, you can asymptotically get

0:11:29 | optimal performance

0:11:30 | when the data dimension is fixed.

0:11:32 | So in this case the projection effectively

0:11:35 | takes into account the dimensionality reduction inherent in this problem.

0:11:41 | Now,

0:11:42 | the reality is that in general we don't even know the sensing structure.

0:11:46 | We don't know the sensing subspace, so one might want to estimate the sensing subspace from the data

0:11:50 | you have.

0:11:52 | So what would be one approach to estimate the sensing subspace?

0:11:55 | Well, what we know is that if we look at the difference in the class-conditional means, delta,

0:11:59 | it's actually aligned with h.

0:12:01 | Okay.

0:12:02 | So a natural thing to do is to use the maximum likelihood estimate

0:12:05 | of delta, which was done before,

0:12:08 | and use that as a proxy for h,

0:12:10 | and then project everything down to that delta-hat;

0:12:14 | then you're back to the previous situation.

0:12:17 | And again you get a projected empirical Fisher rule, except that the direction onto which

0:12:21 | you are projecting is not h, because it's not known to you, but the estimated h.
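The naive structure-aware rule can be sketched the same way (again my own reconstruction): use the ML estimate delta-hat as a proxy for the unknown direction.

```python
import numpy as np

def projected_fisher_estimated_h(x_train, y_train):
    """Project onto the noisy ML proxy for h, then classify the scalars."""
    m_pos = x_train[y_train > 0].mean(axis=0)
    m_neg = x_train[y_train < 0].mean(axis=0)
    delta_hat = m_pos - m_neg                 # aligned with h only on average
    u = delta_hat / np.linalg.norm(delta_hat)
    z = x_train @ u
    thresh = 0.5 * (z[y_train > 0].mean() + z[y_train < 0].mean())
    return lambda x: np.sign(x @ u - thresh)
```

The talk's result: when samples per dimension go to zero, delta-hat is so noisy that this rule also degenerates to random guessing.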

0:12:26 | What do you expect to get here?

0:12:28 | Turns out that if you analyze the probability of misclassification error

0:12:32 | as samples per dimension go to zero

0:12:35 | and the difficulty level is fixed,

0:12:37 | the probability of classification error goes to one-half,

0:12:41 | which means that even though you knew there was an underlying one-dimensional sensing structure, and you knew that delta

0:12:45 | was aligned with it,

0:12:47 | trying to estimate it using a maximum-likelihood kind of estimate

0:12:50 | just

0:12:51 | doesn't do the job.

0:12:52 | You are no better than random guessing asymptotically.

0:12:55 | This also suggests that you need additional sensing structure to exploit here.

0:13:00 | Now, although this was not presented in our ICASSP paper, since then we have been able to show that this is

0:13:04 | fundamental, meaning that,

0:13:05 | for this particular problem we are analyzing here,

0:13:08 | without any additional structure on h,

0:13:10 | it's impossible for any learning algorithm

0:13:13 | to do any better than random guessing asymptotically.

0:13:16 | So that's not presented at ICASSP and will be appearing elsewhere, but it's actually a fundamental lower

0:13:20 | bound on the misclassification probability, which actually goes to one-half

0:13:24 | if you don't make any assumptions on the sensing structure.

0:13:27 | So that motivates the need for additional structure on h,

0:13:30 | and one of the structures we would like to study, which is of course a popular thing

0:13:35 | these days,

0:13:36 | is sparsity.

0:13:38 | So,

0:13:38 | uh,

0:13:39 | let's say that the subspace direction is sparse, meaning that

0:13:44 | the energy in h

0:13:46 | is localized in a few components

0:13:48 | compared to the number of dimensions.

0:13:50 | So in particular, let's look at the tail energy of the vector h; here are the magnitudes

0:13:54 | of the vector's components,

0:13:56 | and there are p components.

0:13:58 | Let's pick a truncation point d and look at the energy beyond this truncation point, in the

0:14:02 | tail of the

0:14:04 | h vector here.

0:14:05 | As d and p go to infinity, we want the tail energy to go to zero.

0:14:10 | So that is essentially a statement about the sparsity,

0:14:13 | an asymptotic sparsity of the signal.

0:14:17 | So in this case, a natural thing to do: so far we have used the

0:14:22 | maximum likelihood estimate delta-hat of delta,

0:14:25 | and that didn't work.

0:14:27 | But now you know something more about h, namely that its tail energy goes to zero, so one interesting

0:14:32 | thing you can try is: why not truncate that estimator?

0:14:35 | Truncate the components of the estimator

0:14:36 | and use that as a proxy for delta instead.

0:14:39 | The idea is to keep the estimated components only for indices below some truncation parameter d,

0:14:44 | and then set to zero everything beyond it.

0:14:47 | So that leads to a truncation-based estimate of the direction along h,

0:14:50 | and we analyzed it.

0:14:52 | Let's see how things behave.
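A sketch of the truncation (my reconstruction; it assumes, as in the talk's sparsity model, that h's energy is concentrated in its leading components):

```python
import numpy as np

def truncated_direction(x_train, y_train, d):
    """Keep the first d entries of delta-hat, zero the tail, and normalize."""
    m_pos = x_train[y_train > 0].mean(axis=0)
    m_neg = x_train[y_train < 0].mean(axis=0)
    delta_hat = m_pos - m_neg                 # raw ML estimate of delta
    delta_trunc = np.zeros_like(delta_hat)
    delta_trunc[:d] = delta_hat[:d]           # keep the head, discard the noisy tail
    return delta_trunc / np.linalg.norm(delta_trunc)
```

Here d is the truncation point, which the analysis requires to grow slower than the number of samples.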

0:14:54 | We can show that as the dimension, the number of samples, and the truncation point go to infinity,

0:14:59 | with the truncation point chosen in such a way

0:15:02 | that it grows slower than the number of samples,

0:15:05 | then

0:15:06 | asymptotically we can estimate

0:15:08 | the signal subspace perfectly, meaning that the mean-square error between the truncated estimate and

0:15:14 | the true delta goes to zero; we can asymptotically estimate the one-dimensional subspace. And of

0:15:19 | course, if we can estimate the subspace perfectly asymptotically, it's unsurprising then that,

0:15:22 | as things scale and you keep the difficulty level fixed,

0:15:25 | the probability of classification error goes to the Bayes probability of error.

0:15:28 | So the moral is that knowing the sensing structure,

0:15:30 | with additional sparsity assumptions or some additional structural information,

0:15:34 | can asymptotically give you the Bayes probability of error.

0:15:40 | Here is a little simulation that reinforces some of these insights.

0:15:43 | So here we have fixed the Bayes probability of error, the difficulty, to be 0.1, fixed throughout

0:15:48 | as the dimension scales.

0:15:50 | The energy is fixed to some value, and here are some parameter settings in the model,

0:15:55 | and the number of samples grows slower than the ambient dimension, as shown here.

0:16:01 | And

0:16:01 | the truncation point

0:16:03 | is chosen to grow slower than the number of samples, as shown here.

0:16:08 | And here we assume a polynomial decay for h.

0:16:11 | In the plot here, for example, the green line is h,

0:16:14 | or rather one particular realization of h,

0:16:16 | and so on.

0:16:18 | The red line is the noisy maximum likelihood estimate

0:16:23 | delta-hat;

0:16:24 | they are normalized to have unit energy,

0:16:26 | shown here,

0:16:27 | and the blue one is a truncated version of the red one;

0:16:30 | the truncation point here is exactly twenty or so.

0:16:34 | On the right side is the probability of error on the vertical axis versus the ambient dimension.

0:16:39 | So as the dimension scales,

0:16:41 | the unstructured approach, where you don't know anything about the sensing structure and you try to estimate all the

0:16:48 | parameters using maximum likelihood estimates,

0:16:50 | approaches a probability of error

0:16:52 | of one-half.

0:16:54 | On the other hand, if you didn't know the sensing subspace and you estimated it naively

0:16:59 | using

0:17:00 | simply delta-hat,

0:17:01 | which is the maximum likelihood estimate,

0:17:03 | then you also get one-half.

0:17:05 | But if you use the truncation-based estimate,

0:17:08 | you approach the Bayes-optimal performance.
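The qualitative point of this simulation can be reproduced in a few lines. All the settings below (polynomial decay, p = 4000, n = 40, d = 20, unit noise variances) are my own guesses, not the speaker's exact configuration.

```python
import numpy as np

def alignment(p, n, d, rng):
    """Return |cos| between h and the raw ML vs. truncated direction estimates."""
    h = 1.0 / np.arange(1, p + 1)                # polynomial decay for h
    h /= np.linalg.norm(h)                       # keep ||h|| fixed as p scales
    y = np.repeat([1.0, -1.0], n)
    x = (y + rng.normal(size=2 * n))[:, None] * h + rng.normal(size=(2 * n, p))
    dh = x[y > 0].mean(0) - x[y < 0].mean(0)     # ML estimate of delta = 2h
    dt = np.where(np.arange(p) < d, dh, 0.0)     # truncated version
    cos = lambda v: abs(v @ h) / np.linalg.norm(v)
    return cos(dh), cos(dt)

raw, trunc = alignment(p=4000, n=40, d=20, rng=np.random.default_rng(0))
print(raw, trunc)   # the truncated estimate aligns far better with h
```

In this regime the raw delta-hat is dominated by the accumulated tail noise, while truncation discards most of that noise at a small cost in signal energy, mirroring the red versus blue curves described above.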

0:17:12 | So, to conclude my talk,

0:17:13 | the

0:17:14 | take-away points are that

0:17:16 | for many problems you encounter situations where the number of samples is far fewer than the

0:17:21 | ambient data dimension.

0:17:24 | In addition, there often exists a latent low-dimensional sensing structure which can be exploited.

0:17:30 | If you totally ignore the sensing structure and naively try to estimate everything using maximum likelihood estimates,

0:17:35 | you would probably be no better than random guessing in many scenarios.

0:17:39 | And even having general knowledge of the sensing structure, like knowing that there's a one-dimensional signal h, but not

0:17:43 | knowing what h is,

0:17:44 | and trying to estimate it naively,

0:17:46 | cannot do the job.

0:17:49 | So the message is: if you have general knowledge of the sensing structure plus some additional structure on h,

0:17:54 | then you can often recover the optimum,

0:17:56 | asymptotically optimal, classification performance.

0:17:59 | That is the end of my talk.

0:18:16 | Yeah, I think

0:18:17 | I know what you mean;

0:18:19 | [inaudible]