Machine Learning - Science topic

Explore the latest questions and answers in Machine Learning, and find Machine Learning experts.
Questions related to Machine Learning
Question
2 answers
HITON is an algorithm for Markov blanket discovery of a target variable. The algorithm first appeared in this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1480117/
I am looking for an R implementation of an algorithm that finds the Markov blanket of a target variable.
Any reply would be greatly appreciated.
Thank you.
Relevant answer
Answer
We have implemented a similar algorithm in our R package MXM. The difference from HITON is that it does not find the full Markov blanket, but only parents and children. On the upside, though, it also returns multiple equivalent solutions, it can be applied to a plethora of different data types (continuous, discrete, censored time-to-event, mixed, etc.), it caches results between calls with the same dataset but different hyper-parameter values, it returns p-values for the selections, and it contains other tricks for efficiency.
We are maintaining the code and in the near future we'll release more extensions of the algorithm for multiple solutions and the full Markov blanket.
To get the full Markov blanket with the current version, start from the output of the algorithm and then perform a forward-backward search.
Your feedback is welcome!
regards, IT
Question
12 answers
If I have 1000 instances of the false class and 50 instances of the true class, which practice leads to a better result: selecting false and true instances in their original ratio, or in equal numbers?
Relevant answer
Answer
I guess it depends on which classifier and training algorithm you use. Since your question belongs to the ANN topic, I assume you want to use a neural network classifier (but the answer is valid for many other types of classifiers). If you are using anything similar to an MLP, I would go for an equal number of instances. Otherwise the training algorithm will be dominated by the classification errors on the most numerous class. That is, if you use the standard BP algorithm, the ANN weights will be modified at each presentation of a training example. Since one class has 20 times more examples than the other, most of the training procedure will be spent on minimising the classification error on the most numerous class. In the extreme case, the training algorithm could just learn to classify all examples as 'false' and it would still reach 95% accuracy. If the data set is noisy, trying to learn the exact classification boundaries may give a lower accuracy than 95%. If your training set is balanced (i.e. all classes are equally represented), you force the ANN to learn to distinguish the two classes (e.g. if all examples are classified as 'false', the classification error is 50%). However, if some errors are more dangerous than others (e.g. a false negative in a mammography), you may want to work with unbalanced sets in order to bias the learning procedure towards the most important classes.
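For illustration, here is a minimal Python sketch of one way to build a balanced training set by undersampling the majority class. The arrays X and y are hypothetical NumPy inputs (y holding 0/1 labels); this is just one option among several (oversampling or class weighting are alternatives).
import numpy as np

def undersample_majority(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)          # minority ("true") class
    neg_idx = np.flatnonzero(y == 0)          # majority ("false") class
    # keep as many majority examples as there are minority examples
    neg_keep = rng.choice(neg_idx, size=pos_idx.size, replace=False)
    keep = np.concatenate([pos_idx, neg_keep])
    rng.shuffle(keep)
    return X[keep], y[keep]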
Question
26 answers
I want to develop a prediction model (like time series forecasting) with a BPNN. The sigmoid function is mostly used as the activation function in BPNNs, but the sigmoid function gives an output between 0 and 1. If my expected output is something like 231.54, then how do I calculate the error and proceed with training? In short, I want my network to produce values like 231.54. What should the activation functions for the hidden and output layers then be?
Relevant answer
Answer
It is right that the sigmoid function gives an output between 0 and 1. Regarding the activation function, a transformation of the time series data is needed. It would be better to look at the range of the activation function before you decide how to transform the data. A reverse transformation then restores the data to its initial scale (the original data).
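As a minimal sketch of this idea (the numbers are placeholders): scale the targets into the sigmoid's output range before training, then invert the transform after prediction to recover values like 231.54.
import numpy as np

y = np.array([231.54, 250.0, 210.3])            # hypothetical raw targets
y_min, y_max = y.min(), y.max()

y_scaled = (y - y_min) / (y_max - y_min)        # train the network on these
# ... train the network, obtain outputs in (0, 1) ...
y_pred_scaled = np.array([0.42, 0.90, 0.10])    # placeholder network outputs
y_pred = y_pred_scaled * (y_max - y_min) + y_min  # back to the original scale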
Question
3 answers
How do these three subspace methods perform with respect to classification?
Relevant answer
Answer
In a multidimensional space, PCA finds the directions that maximize the variance, whereas ICA finds the directions that maximize statistical independence. Both methods are used for so-called blind source/signal separation:
They do not assume any class distributions of the data, i.e., all data are assumed to be unlabeled.
Fisher's LDA, however, is a standard classifier that assumes Gaussian-distributed classes having the same covariance matrix. I do not see a clear relation of LDA to PCA or ICA.
Question
3 answers
We often use leave-one-subject-out cross-validation (LOSOXV) for machine learning experiments involving human subjects, to allow for the subject-to-subject variation that occurs and also for the tendency toward autocorrelation in time series data involving a single subject. I am familiar with the paper by Kohavi recommending 10-fold XV, but is there any equivalent paper that supports LOSOXV for this type of subject-dependent data?
Relevant answer
Answer
I think the best paper on cross validation so far has been written by Sylvain Arlot and Alain Celisse: http://arxiv.org/abs/0907.4728 (Section 6.1)
They also discuss leave-one-out (LOO) cross validation (CV) in detail. LOO is probably the best method to estimate the risk when learning a model. For model selection, however, 10-fold CV should be used (see Breiman and Spector 1992).
Although neither paper I mention talks about human subjects, it is clear that, from a statistical point of view, these papers support LOO for your analysis.
Question
3 answers
Does anybody know a solver for a large scale sparse QP that works on the GPU?
Or, more generally, can a GPU speed up solvers for sparse QPs?
Relevant answer
Question
8 answers
For example, price is non-stationary; transform it to the rate of return, which is stationary.
Relevant answer
Answer
It depends on the prediction model. If you use ARMA or ARMA-GARCH modelling, this is the only way. If you use some other kind of non-stationary interpolation (e.g. neural networks), then not necessarily.
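For illustration, a tiny Python sketch of the price-to-return transform mentioned in the question (the price values are placeholders); the original series can be reconstructed from the returns and the first price.
import numpy as np

prices = np.array([100.0, 101.5, 99.8, 102.3])   # hypothetical price series
log_returns = np.diff(np.log(prices))            # r_t = log(p_t / p_{t-1})

# invert the transform: rebuild the prices from the first price and the returns
reconstructed = prices[0] * np.exp(np.cumsum(log_returns))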
Question
38 answers
I am unable to write code for neural networks, as I have found little support for coding them. I want to write code for prediction with neural networks. A simple coding example would help me understand how to build my own network and how to train and test it.
Relevant answer
Answer
Here is another site with simple source code for neural networks:
Question
11 answers
For example, the input values can take only a finite number of states, say 10 symbols {A, B, C, D, E, F, G, H, I, J}.
Relevant answer
Answer
Finite automata described by disjunctive and conjunctive Boolean forms – this is the most general description.
Question
85 answers
A large number of machine-learning models have been built to predict stock prices in the literature. What are the main reasons why no one has achieved success so far?
Relevant answer
Answer
Zhenyu,
1) Machine-learning methods HAVE been successfully used by various individuals and institutional 'in-house' groups, but most 'public' individuals, such as yourself, will NOT learn of 'THE' SPECIFIC methodologies that have yielded 'lucrative' returns and results. When 'huge' money is involved, and this IS the case when 'dealing with' the financial markets, NO ONE is going to publicly 'share' their 'edge' derived from applying THEIR successful methods to trading....hence, you're not likely to hear of, nor see, detailed studies and reports of such successes.
2) MOST 'academic' researchers who publish papers attempting to apply computer-processing algorithms to trading markets simply do NOT truly UNDERSTAND the underlying 'dynamics' of market price behaviors, so 'naive' applications of methodologies are attempted and 'researched', with the result that 'less than stellar' outcomes are generated frequently. To be 'effective' in developing 'successful' trading methods requires a rather 'deep' understanding of 'general underlying dynamic behaviors' of what makes the markets 'tick'. In particular,....
3) Markets (stocks, futures, forex, options, etc) generate data that form (statistically) NON-STATIONARY, time-series of numbers over ANY period of 'time window' that one may want to examine, 'forecast' upon, and trade. 'Prediction' (which is highly 'precise') is essentially impossible, but to a greater or lesser degree, 'forecastability' (less 'precise', but more 'probabilistic') IS applicable to market time-series data, with the exception of what are called 'event shocks', such as USA's 9/11, October of 1987, 'flash crashes', and similar types of 'events'. (From a 'risk-management' standpoint, any 'good' and 'effective' trading strategy/system MUST make provision for such occurrences in order to protect trading capital and prevent financial 'disaster'!)
4) From an engineering (and computer science) perspective, a 'trading system' can be 'thought of' as a 'combined' mathematical/logical TRANSFORM that uses 'appropriately-conditioned' time-series 'market' data as input and then attempts to 'functionally' convert this input into a monotonically-increasing 'capital-capture' output time-series. Before attempting to EFFECTIVELY design such a 'transform', one MUST have a relatively 'decent' understanding of the characteristics AND 'character' of the time-series 'input' data to which the 'transform' is to be applied.....MOST researchers don't have an adequate, NOR realistic, market-dynamics UNDERSTANDING....hence, their market MODELS are 'inadequate' and THIS is another reason why you rarely see public information of 'successful' machine-learning methods as applied to trading the markets.
5) Lastly here, but not 'finally', 'patterns' DO frequently recur in market-oriented time-series data that CAN be 'exploited' when designing a 'transform' such as mentioned in the previous paragraph. 'Pattern', in this context, does not necessarily mean a 'visual stock-chart formation' only....it can be comprised of various 'features' that are often-times 'embedded' in the time-series data. These 'patterns' and 'features' can certainly, and EFFECTIVELY, be discerned by means of machine-learning methods.
So, to summarize, in the spirit of 'hand-waving' guidance, the real 'trick' to applying machine-learning methods, or ANY other methods for that matter, to successfully 'extract capital' from the financial markets is to become well-versed in HOW time-series data (typically 'price' series, but not exclusively) is 'created' by the dynamics of how market-participants behave and act upon the market trading-vehicles of interest. The usual 'admonition' applies: "Apply and Use the proper 'tools' to 'solve' the problem/challenge presented".....THIS requires a PROPER understanding of the various 'elements' of the 'problem at hand'. AND, if YOU want to employ a machine-learning approach to SUCCESSFULLY make money using the financial markets, or just research this as an 'interesting' academic pursuit, you're probably going to have to do the research yourself, OR do it in collaboration with other(s) who have a similar purpose. I hope my thoughts here may be of some assistance in giving some 'direction' for your quest........
Have fun and ENJOY!! (;->)
Buzz
Question
21 answers
My dataset has four categories with 1100 reviews in each category. I want to apply a cross-validation method for finding the optimal value of K for KNN (sentiment classification). How many folds will be required?
Relevant answer
Answer
The standard approaches either assume you are applying (1) K-fold cross-validation or (2) 5x2 Fold cross-validation.
For K-fold, you break the data into K blocks. Then, for i = 1 to K, you make the i-th block the test block and the rest of the data becomes the training data. Train, test, record, and then move to the next block. In this case, the standard value for K is 10. There is some variation; some people will do 5, others 20. It somewhat ties into (a) Type I/Type II errors (and the impact on statistical tests) and (b) whether you are looking to determine either (1) approximate average error or (2) which is the more reliable algorithm.
Note, you should likely do stratified cross-validation, where the class (category) representation in each block is the same as (or close to) that in your 'full' data set.
For notes on K-fold, try two papers: (a) A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection by Ron Kohavi ( http://robotics.stanford.edu/%7Eronnyk/accEst.pdf ) and (b) On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach by Steven L. Salzberg ( http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.183.534 )
Alternately, 5x2-fold cross-validation can be employed. It is generally better at detecting which algorithm is better (K-fold is generally better for determining approximate average error). In this case, randomly divide the data into 2 blocks (or randomly divide each category into two blocks if doing stratified cross-validation). Then, train on block A and evaluate on B. Next, reverse it (train on B and evaluate on A). Then repeat the process: divide the data randomly into two blocks (use a different seed value) and do the two evaluations. Repeat again (until you have done this 5 times). The statistical test performed is a bit different.
Note 2: If you are doing parameter tuning, you have to modify the procedures (for either approach) a bit to include a validation set.
For more on 5x2-fold, you should read the paper "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" by Thomas Dietterich. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.3325
Finally, if you are looking to do statistical tests, keep in mind that both K-fold and 5x2-fold cross-validation are really heuristic approximations. There is some indication that you may need to run them multiple times to get a reliable Different/Not Different judgment. For that, Jeffrey Bradford and Carla Brodley provide both an explanation and a means to make the call in "The Effect of Instance-Space Partition on Significance"; nice paper. It can be found at http://link.springer.com/article/10.1023%2FA%3A1007613918580?LI=true
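For illustration, a minimal scikit-learn sketch of both schemes described above (stratified 10-fold and 5x2 CV). The synthetic data and the KNN classifier are stand-ins for your own reviews and model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_classes=4, n_informative=6, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

# (1) stratified 10-fold cross-validation
cv10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv10).mean())

# (2) 5x2 cross-validation: 5 repetitions of a stratified 2-fold split
cv5x2 = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
print(cross_val_score(clf, X, y, cv=cv5x2))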
Question
2 answers
I'm looking for evaluation methods for data streams.
I want to evaluate the performance of a decision tree algorithm for data stream classification.
Relevant answer
Answer
Classification performance is usually expressed as some sort of error rate. Common methods include a "confusion matrix", which quantifies the number of correctly and incorrectly classified samples.
Question
1 answer
I need an example of user defined callable function for weights in:
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, warn_on_equidistant=True, p=2)
Relevant answer
Answer
You simply need to define a function that works with the n_neighbors distance values and returns n_neighbors weights. For instance, for testing purposes you could define a function that does the same as weights='distance'.
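For illustration, a minimal sketch of such a callable: it receives an array of neighbor distances and must return an array of the same shape containing the weights (here inverse distance, mimicking weights='distance'; the epsilon is an illustrative safeguard).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def inverse_distance(distances):
    # small epsilon avoids division by zero for exact matches
    return 1.0 / (distances + 1e-8)

clf = KNeighborsClassifier(n_neighbors=5, weights=inverse_distance)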
Question
19 answers
The opinion on this diverges in the literature. Some claim that AIC does require the true model to be in the set of candidate models, that is, in the case of linear regression, the true model is a subset of the candidate predictors. Some others say that AIC does not require the truth. As far as I understand the derivation, the truth is not required and, in fact, we aim at the KL-optimal model as proxy to truth and the nature of truth is not relevant to the derivation at any point.
Relevant answer
Answer
you might want to consider
Burnham, K.P. & Anderson, D.R. (2002). Model Selection and Multimodel Inference. A Practical Information-Theoretic Approach. (2nd ed.). NY: Springer.
They vote against the "true model approach"; moreover, they consider the question as such misleading (referring to Taub, 1993: a model is not truth by definition; Box 1976: all models are wrong, but some are useful, etc). Rather, they focus on numerous other model selecting procedures.
Question
6 answers
I heard that the SVM classifier for binary classification needs a small number of positive examples compared to other classifiers. Does anyone know the reason why? And is there any other benefit of using SVM as classifier?
Relevant answer
To answer your second question: SVM may be able to REPRESENT the model using only few examples (the support vectors, hence the name). The proportion of examples included can be indirectly tuned with a regularization parameter.
This soft-margin SVM is always feasible (see Cortes and Vapnik, 1995 – they discuss the matter relative to neural networks).
Question
66 answers
Using a 4-layer ANN with 10 input neurons, 5 hidden neurons in two hidden layers, and an output layer with 2 neurons, the trained network generates almost identical outputs for different input patterns, differing by as little as 0.0000001. Does anyone know what might be the problem? And the possible fix?
Relevant answer
Answer
I would like to clarify one thing about minX and maxX, minY and maxY: is there only one min and one max for each of x and y throughout the entire data set, or is it just from one line of data?
What I mean is, I have a text file which contains hundreds of lines of data; I train the network using data from a line in that file (randomly selecting a line). So, do the min and max of x and y refer to each line specifically or to the entire text file (where I'll have to search for the min and max); or should they be the ideal min and max in real life?
Question
3 answers
Generally we deal with two types of forecasting: point and interval. With statistical methods of forecasting (like ARIMA), we can perform interval prediction considering the uncertainty of the prediction. But what about doing the same with machine learning techniques like ANN and SVM?
Relevant answer
Answer
Interval predictions can be generated in many ways using statistical models, in either a frequentist or Bayesian setting. A way to hedge predictions, i.e. to produce confidence intervals, that in my opinion is particularly interesting is conformal prediction. The theory of conformal prediction, which is based on frequentist statistics, can be applied to any model, even non-statistics-based ones. You can find a tutorial about this here:
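As a rough, model-agnostic sketch of the split-conformal flavour of this idea (the quantile used here is the simple empirical one, slightly optimistic compared with the exact finite-sample version; the SVR is just a stand-in for any ANN/SVM point forecaster):
import numpy as np
from sklearn.svm import SVR

def split_conformal(model, X_train, y_train, X_cal, y_cal, X_new, alpha=0.1):
    model.fit(X_train, y_train)
    # nonconformity scores on a held-out calibration set
    scores = np.abs(y_cal - model.predict(X_cal))
    q = np.quantile(scores, 1 - alpha)
    preds = model.predict(X_new)
    return preds - q, preds + q        # lower and upper interval bounds

# usage sketch: lo, hi = split_conformal(SVR(), X_tr, y_tr, X_cal, y_cal, X_new)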
Question
8 answers
In my research, I need to visualize each tree in a random forest in order to count the number of nodes included in each tree. I use the R language to generate the random forest but couldn't find any command to satisfy my demand. Do you have any idea how I can visualize each tree in a random forest, or how I can calculate the number of nodes in each tree?
Relevant answer
Answer
Thank you, Paul, for all the information you have provided.
Actually, in my work I check many parameters of the random forest, but one that I couldn't calculate yet (and certainly need) is the number of nodes in each tree of the random forest. I use the 'randomForest' package in R to generate my model, but I couldn't find any function that presents the random forest structure (from the point of view of each decision tree's structure).
Named-entity recognition: Better results using Support Vector Machine (SVM) compared to Conditional Random Field (CRF)?
Question
25 answers
Does anyone know if quality of results for Named-entity recognition improves by using an implementation for Support Vector Machines instead of Conditional Random Fields?
Relevant answer
Answer
It usually depends on the task, and as Christian mentioned, training a CRF is much faster. However, I've actually experimented and had extremely good results with the YamCha package, which is an SVM tailored for chunking / labelling tasks. It is very easy to use and has the input format similar to any CRF package (e.g., MALLET, CRF++ or CRFSuite). On the downside, it does take longer to train, especially in the 1 vs. All setting.
Question
7 answers
I'm using the J48 decision tree algorithm in Weka, but it doesn't build a tree.
It only says Number of Leaves: 1 and Size of the tree: 1.
Relevant answer
Answer
"tree size is 1" means that the split on the data in some particular condition is not worthwhile based on the j48 pruning mechanism. you don't have to use pruning mechanism actually and just construct an unpruned tree. there are so many options in j48, you can choose no pruning
Question
16 answers
I want to have information about the size of each tree in random forest (number of nodes) after training. I usually use WEKA but it seems it is unusable in this case.
Relevant answer
Answer
I really think scikit-learn is better than R unless you are already fluent in R. Python is much easier to read/learn, and scikit-learn is optimized at the C level, meaning it will already be fast and suited for bigger datasets. That is not to say the randomForest package in R may not be more extensive than scikit-learn's, but scikit-learn does have random forest functions.
You may also want to look into 'ilastik' if visualization and interactivity are important for your application.
Question
17 answers
I need to do clustering on a dataset composed of both numerical and categorical data. What algorithms do you recommend ?
Relevant answer
Answer
@Afarin Adami: That's a good idea. It's basically the Hamming distance between the categorical variables. You could compute that, and also a Euclidean distance between the numerical variables, and give a weight to each, depending on the number of categorical and non-categorical variables, the range of the numerical features, etc.
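A minimal Python sketch of the weighted combination just described; the column indices and the weight w are illustrative choices, and the numerical columns are assumed to be rescaled beforehand.
import numpy as np

def mixed_distance(a, b, num_idx, cat_idx, w=0.5):
    # Euclidean distance on the (rescaled) numerical columns
    num_d = np.linalg.norm(a[num_idx].astype(float) - b[num_idx].astype(float))
    # Hamming distance (fraction of mismatches) on the categorical columns
    cat_d = np.mean(a[cat_idx] != b[cat_idx])
    return w * num_d + (1 - w) * cat_d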
Question
10 answers
I have to explain the advantages and disadvantages of decision trees versus other classifiers.
Relevant answer
Answer
The main difference is the domain of application. Note that kNN and SVM are used for continuous-valued inputs, unlike decision trees, which are applicable to both continuous and categorical inputs. If you deal with a problem where the inputs are categorical (i.e. discrete) values, even in part, then you have to apply trees.
Question
4 answers
Genetic information gathered from autistic patients is transformed into multidimensional data. This is huge and requires machine learning techniques to create an automated autism detection system. I wonder if there are publications along this track.
Relevant answer
Answer
Thank you for your help. You probably mean this article: Structural, Genetic, and Functional Signatures of Disordered Neuro-Immunological Development in Autism Spectrum Disorder
Basic image processing in R
Question
15 answers
In Matlab, one can invoke commands such as "im2double" and "mat2gray" to convert a bitmap into a numerical matrix and back again to an image. I was wondering whether this can be achieved in R, maybe via additional packages.
Relevant answer
Answer
I've done a fair amount of image analysis and haven't seen a lot in R...but EBImage and Raster packages might have what you want. Python (or Java/ImageJ) as opposed to R seems to have a stronger image analysis community.
Question
2 answers
Relevant answer
Answer
What exactly does 'single series' mean? In SVM regression we use single-series data.
Question
1 answer
I need some real data for experiment.
Relevant answer
Answer
For CT/MRI you can look at http://brainweb.bic.mni.mcgill.ca/brainweb/, a brain CT/MRI simulator with a lot of citations in the literature. For CT/PET you can look at http://www.creatis.insa-lyon.fr/rio/popi-model_original_page, a 4D volume data set where you can also find vector fields. Actually, it is very difficult to find an image set with a vector field; the best alternative is to build your own deformation vector fields for a specific set of data.
Question
5 answers
I need my network to produce outputs like 234, 231, ... (values in the range 200-300).
So I have created a 5-4-1 feed-forward network with a sigmoid activation function in the hidden layer and a linear activation at the output layer. Are these the correct activation functions for my network?
Since I am using the sigmoid function at the hidden layer, the output of the hidden layer will be in [0, 1].
The calculation at the hidden layer is the summation of all weights*inputs, followed by the sigmoid function y = 1/(1+e^-x). Is my thinking correct?
Relevant answer
Answer
All the input vectors and the target vectors have to be scaled between 0 and 1 or -1 and 1. This scaling has to be done by column (different variables may have non-comparable values). To do that you have to establish the max and the min value of each variable (Max and Min). Then you have to decide the range of the new scaled values (High and Low). Now you have to calculate two new coefficients, Scale and Offset, in this way: Scale=(High-Low)/(Max-Min); Offset=(Max*Low-Min*High)/(Max-Min).
At this point you have to rewrite each value of the considered variable in this way:
NewVar=Var*Scale+Offset.
You need to create a set of vectors for each of the dataset variables:
Max=new float[NumVariables]; Min=new float[NumVariables]; Scale=new float[NumVariables]; Offset=new float[NumVariables]. That is because you have to scale each variable in an independent way. High and Low, instead, have to be the same for the whole dataset. For the input vectors, I suggest [0,1] if the values of the dataset represent real quantities (that is: the lowest values represent small quantities) and [-1,+1] if the values represent qualitative features.
After the training of the ANN you can recode the output into the original range of values using the same equation: Var=(NewVar-Offset)/Scale.
Good work
Massimo Buscema
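For illustration, the same per-column Scale/Offset recipe written as a short Python/NumPy sketch instead of the C-style arrays above (function names are illustrative).
import numpy as np

def fit_scaling(X, low=0.0, high=1.0):
    mn, mx = X.min(axis=0), X.max(axis=0)       # per-variable Min and Max
    scale = (high - low) / (mx - mn)
    offset = (mx * low - mn * high) / (mx - mn)
    return scale, offset

def apply_scaling(X, scale, offset):
    return X * scale + offset                   # NewVar = Var*Scale + Offset

def invert_scaling(Xs, scale, offset):
    return (Xs - offset) / scale                # Var = (NewVar - Offset)/Scale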
Question
18 answers
How to calculate 10-fold cross validation paired t-test for classification data?
Relevant answer
Answer
I'm a bit surprised by the way this has gone! Personally, I was thoroughly convinced that a straightforward t-test was NOT the right answer as it may lead to wrong conclusions and the readily available Wilcoxon Signed Rank Test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test is more appropriate. This test is supported by R and GNU Octave (and probably matlab). If you are not familiar with the argument on why t-tests are not appropriate, please see http://jmlr.csail.mit.edu/papers/volume7/demsar06a/demsar06a.pdf (which was previously mentioned by Kapil but I don't think he provided a link).
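As a small sketch of how that test can be run in practice (the ten per-fold accuracies below are placeholder numbers, paired fold by fold for two classifiers):
from scipy.stats import wilcoxon

acc_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.79, 0.81]
acc_b = [0.77, 0.78, 0.80, 0.76, 0.79, 0.75, 0.81, 0.78, 0.77, 0.78]

# Wilcoxon signed-rank test on the paired per-fold accuracies
stat, p_value = wilcoxon(acc_a, acc_b)
print(stat, p_value)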
Question
2 answers
Scenario:
Suppose I have 5 subjects (S1, S2, ..., S5). Data for each subject are recorded in 4 different sessions, which vary in size as well. If I construct a matrix for Subject 1, it looks like the following, where each row presents one complete feature but the number of instances varies.
Sub1 Session 1 = [20*100]
Sub1 Session 2 = [10*100]
Sub1 Session 3 = [6*100]
Sub1 Session 4 = [13*100]
and the variation goes on for the other subjects.
Question: Is it fine to use the above data for SVM, or should it be as follows?
Sub1 Session 1 = [6*100]
Sub1 Session 2 = [6*100]
Sub1 Session 3 = [6*100]
Sub1 Session 4 = [6*100]
Relevant answer
Answer
When you use SVM, the number of dimensions should be the same over all examples. The number of examples per class may vary. However, this might lead to a skewed optimization problem. Binary SVM packages often feature an option to re-weight the optimization problem, which you can use to correct for the bias towards the over-represented class.
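For illustration, here is what such re-weighting looks like in scikit-learn's SVC (used here only as one example implementation; the weight values are arbitrary):
from sklearn.svm import SVC

# weight each class inversely proportional to its frequency
clf_balanced = SVC(kernel='rbf', class_weight='balanced')

# or set explicit per-class weights, e.g. a 20x penalty for errors on class 1
clf_manual = SVC(kernel='rbf', class_weight={0: 1.0, 1: 20.0})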
Question
13 answers
In the negative selection algorithm, how should we find proper parameters like the affinity threshold?
Relevant answer
Answer
Hi Sajjad. Generally, when speaking about one-class classification, negative examples are not to be used for parameter tuning. This is due to the fact that, in a real one-class scenario, parameter tuning using ROC measures is simply not possible by definition. If you have, however, a few negatives available, ROC values can provide useful information for parameter tuning. You should, however, keep in mind to strictly separate training and test data. This might be a problem when negatives are scarce.
Parameters other than threshold parameters can possibly be tuned using the (threshold-free) area under the ROC curve (based on the extra validation set).
One additional comment: be aware that negative selection algorithms might have problems in high-dimensional feature spaces, and methods describing the positive data might be advantageous (or at least computationally feasible) in those cases.
Question
3 answers
I have three features of lengths 100, 25 and 128 for every subject. I want to use SVM. What is the best approach? Of course, scaling/normalization will be done.
Question 1: Should I place these three features in one vector for every subject, or is there any other appropriate way to deal with it?
Question 2: Is feature extraction an art, based on gut feeling more than engineering?
Relevant answer
Answer
In my experience, it is better to represent the individual components of your three features separately and concatenate them to construct a single feature vector. Also, if you are planning to use SVM, you can further extend the vector by using multiple binary features for a single categorical feature (refer to: Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification, Section 2.1).
Choosing features for representing your data may not be a straightforward task. But once they are chosen, you can computationally determine which features are important for your problem. Often cross-validation is performed for feature selection and parameter tuning. There are ready-to-use tools in the case of SVM (refer to the FAQ section on the libsvm site and look for the feature selection tool).
Question
7 answers
I'm trying to collect information on this area, but most existing papers seem to target network-based DoS detection, whereas I could not find much about application-level approaches
Relevant answer
Answer
Antonio,
As you have already established, most of the work in DoS detection is based on network intrusion. Some time ago I did a survey of the current state of intrusion detection and found that most of the literature used KDD Cup 99 as the benchmark for their tests. The literature was filled with this benchmark even though the set has been criticized (and it is the main data set used in benchmarking data mining algorithms for detecting DoS). The core problem is the lack of a good benchmark on which to publish results.
I would suggest looking into more general topics which can be adapted to host level detection of DOS attacks using ML such as:
P. Kola Sujatha, A. Kannan, S. Ragunath, K. Sindhu Bargavi, and S. Githanjali. A Behavior Based Approach to Host-Level Intrusion Detection Using Self-Organizing Maps. In ICETET '08 Proceedings of the 2008 First International Conference on Emerging Trends in Engineering and Technology, pages 1267–1271. IEEE Computer Society Washington, DC, USA, 2008.
H. Kim, J. Smith, and K. Shin. Detecting energy-greedy anomalies and mobile malware variants. In Proceeding of the 6th international conference on Mobile systems, applications, and services, pages 239–252. ACM, 2008.
These might give you some ideas on other topics that might help you.
Hope this helps.
Question
9 answers
I want to compute the conditional cross entropy for a distribution like P(X1,Y|X2), where X1 and X2 are independent of each other.
P(X1,Y|X2) can be written as P(X1,X2,Y)/P(X2) = P(Y|X1,X2)*P(X1|X2)*P(X2)/P(X2) = P(Y|X1,X2)*P(X1|X2) = P(Y|X1,X2)*P(X1), since independence of X1 and X2 gives P(X1|X2) = P(X1). So, ultimately, P(X1,Y|X2) = P(Y|X1,X2)*P(X1). Now what will the entropy of the distribution P(Y|X1,X2)*P(X1) be? How can it be implemented for some raw data in MATLAB?
Relevant answer
Answer
Not really. You may write p(y,s1,s2,s3) = p(s1)p(s2)p(s3)p(y|s1,s2,s3).
Question
2 answers
Mahout is an Apache Foundation project for building scalable machine learning libraries.
Relevant answer
Answer
We are using it for text mining solutions.
Question
6 answers
Most of the literature uses processing of video/camera images. I need a simpler solution, and also need to avoid the ethical issues involved in videoing people.
Relevant answer
Answer
Dear All,
Thanks for your answers – it is really appreciated. The Thermitrack is the most interesting: it seems to be the right mixture between a commercial and a development tool (it still allows one to get the numeric data, which are hidden by most commercial devices).
Question
43 answers
I have 18 input features for a prediction network, so how many hidden layers should I use, and how many nodes should there be in those hidden layers? Is there any formula for deciding this, or is it trial and error?
Relevant answer
Answer
You need to use cross-validation to test the accuracy on the test set. The optimal number of hidden units could easily be smaller than the number of inputs; there is no rule like "multiply the number of inputs by N". If you have a lot of training examples, you can use many hidden units, but sometimes just 2 hidden units work best with little data. Usually people use one hidden layer for simple tasks, but nowadays research in deep neural network architectures shows that many hidden layers can be fruitful for difficult object, handwritten character, and face recognition problems.
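For illustration, a minimal scikit-learn sketch of choosing the hidden layer size by cross-validated grid search; the candidate sizes are arbitrary examples, and X, y stand for your 18-feature data set.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {'hidden_layer_sizes': [(2,), (5,), (10,), (18,), (10, 5)]}
search = GridSearchCV(MLPClassifier(max_iter=2000), param_grid, cv=5)
# search.fit(X, y)              # X, y: your own data (not shown here)
# print(search.best_params_)    # best hidden layer configuration found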
Question
2 answers
The best thesis I found in this field is "Efficient Boosted Ensemble-Based Machine Learning In The Context Of Cascaded Frameworks", written by Teo Susnjak. It is a very useful resource and also a new one, but I need more resources.
Relevant answer
Answer
There is the nice paper by Graf, Vapnik et al., 2004, "Parallel Support Vector Machines: The Cascade SVM". It is also worth looking at other ensemble techniques (bagging, boosting); see e.g. the paper by Valentini, 2004, "Random aggregated and bagged ensembles of SVMs: an empirical bias–variance analysis" or the paper by Wang et al., 2009, "Empirical analysis of support vector machine ensemble classifiers". I'm attaching the Graf publication; the others are easily found on the Internet.
Question
19 answers
I am doing a project on prediction with neural networks. I have searched through a lot of material and found that most programmers prefer MATLAB or C++; Java or C# are very rarely used. Using MATLAB is not allowed for me, so I want to know which I should use, C++ or Java.
Is there any problem with Java or its libraries?
And finally, what should I use, Java or C++? (I am more comfortable with Java but can also become familiar with C++.)
Relevant answer
Answer
It is quite simple – machine learning is a very CPU-time-consuming task. C++ is still much faster than Java or C#, and with MPI you can parallelize this task across very large clusters.
I don't know about MATLAB's efficiency, but it is probably much easier to use than C++ (especially for newbies).
Question
1 answer
By overlapping clustering I mean clustering where an object may belong to several clusters. By extrinsic evaluation I mean that I have the ground truth (a list of correct clusters) and I want to compare the output of my cluster algorithm against the ground truth. The clustering algorithm needs to determine the number of clusters, so the metrics should be robust against the number of clusters.
Relevant answer
Answer
Entropy and Purity are likely the most popular. See Zhao and Karypis, 2001 (http://glaros.dtc.umn.edu/gkhome/fetch/papers/vsclusterTR01.pdf). Several pair counting metrics have also been popular, especially in the Information Retrieval community, including Precision, Recall, F1, and Rand Index. Amigo et al, 2009 (http://nlp.uned.es/docs/amigo2007a.pdf), discuss and compare a number of metrics and propose that BCubed metrics better capture clustering quality in diverse scenarios. However, BCubed metrics are less efficient to compute -- O(n^2), where n is the number of items -- as compared to the rest.
Question
4 answers
I've implemented a GMM for one class of the data. Then I performed ROC analysis and the diagram is not bad. I achieved 83% accuracy with a threshold, but the problem is that the threshold is pretty small: it's 0.36E-25. The probabilities of the test data are pretty small too; they are all near zero. Do you think this is unreasonable, or is it a serious problem?
Relevant answer
Answer
It's typical that the probabilities are very small - this is due to the exponential function's fast decay.
This *can* cause numerical issues. Therefore, it is typically a good idea to compute log(p(x|c)) rather than p(x|c).
The Gaussian function also makes a logarithmic probability computation very easy, so this solution not only helps numerically but also eases implementation.
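As a small sketch of working in the log domain (the GMM parameters here are illustrative names for one class's weights, means and covariances):
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, weights, means, covs):
    # log p(x|c) = logsumexp_k [ log w_k + log N(x; mu_k, Sigma_k) ]
    log_terms = [np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=c)
                 for w, m, c in zip(weights, means, covs)]
    return logsumexp(log_terms)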
Question
2 answers
In a simple experiment, I have a dataset with 200 samples, each with 5 attributes and a binary class label. 3 of these attributes are supposed to have equal values for all samples in the dataset (suppose they are 3 fixed numbers). Now I classified the dataset with a random forest using R, but the result is strange.
With these 3 attributes in the dataset, classification goes wrong and all the samples receive just one label (say label A).
Eliminating these 3 fixed attributes, classification is correct with more than 96% accuracy.
The question is: what is the reason for this different result?
What is the effect of eliminating some fixed equal attributes on the classification problem?
Relevant answer
Answer
It is a good practice to eliminate the attributes which have a constant value for all the samples in your dataset.
There are several reasons for justifying this, which mainly depend on the classification algorithm you are using.
For example, if you call your data matrix X (samples x features) and you have a constant feature, the matrix X'X will have reduced rank, and therefore algorithms that make use of this matrix (e.g. least squares regression) may not find a solution or may run into numerical problems.
Then I guess that in the training of a decision tree (a random forest is just an ensemble of decision trees) this might cause problems in the selection of the split attribute. On which function of your dataset do you perform the splits? Entropy? I guess the presence of a constant attribute may cause numerical problems in the calculation of the function on which the split is based, but this is pure speculation (I don't know the R function for random forests in detail).
Hope this helps!
Cheers
Question
4 answers
What are the latest approaches for object detection and which (if applicable) are the machine learning algorithms used (i.e. SVM, Adaboost, Neural Networks)?
Relevant answer
Answer
HOG+SVM and Haar+boosting classifiers are among the more successful methods. However, you can also try part-based detection/recognition.
Question
6 answers
What are the methods or best predictive methods to use for this kind of data?
Relevant answer
Answer
Check the proceedings of the "Dependable Systems and Networks (DSN)" conference series. You will find a couple of good ideas there. Another starting point is the "Computer Failure Data Repository (CFDR)", which offers HPC system logs and the corresponding analysis papers.
Question
27 answers
I have a text classification task. I used a sliding window method to populate my data set. The problem is that the size of the data set is huge and the data points in my data set are very similar. I would like to reduce the data set without losing informative data points. I am aware of variable selection techniques such as "kruskal.test", "limma", "rfe", "rf", "lasso", .... But how can I choose a suitable method for my problem without performing computationally intensive operations?
Relevant answer
Answer
Hello Noma,
what is your problem exactly? Do you want to have fewer samples or fewer features? Those are two very different things!
Clustering or SOM approaches can be used for reducing the number of samples, while PCA or other dimensionality reduction methods can be used for reducing the number of features you have.
Cheers
Question
7 answers
I am trying to find the baseline of a handwritten word. To understand what a baseline is, see the link below, from Wikipedia (it's the red line in the picture). Do you know any algorithms for doing that?
Relevant answer
Answer
1. Binarize the image
2. Perform closing operation with appropriate structuring element
3. Find the edge of the image
4. Perform a Hough transform on the edge image with the constraint that theta lies in a small range near +90 and -90 degrees, because the text is expected to be horizontal
5. Draw the Hough lines that satisfy a certain threshold
You can extend the result by choosing the best horizontal line in each vertical neighbourhood, to avoid many lines clustered near each other.
I hope this helps. A preliminary result is attached, and a rough code sketch follows below.
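A rough OpenCV sketch of the steps above; the file name, thresholds and kernel size are all guesses that will need tuning for real handwriting images.
import cv2
import numpy as np

img = cv2.imread('word.png', cv2.IMREAD_GRAYSCALE)          # hypothetical input file
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((3, 15), np.uint8))
edges = cv2.Canny(closed, 50, 150)

lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=80)
if lines is not None:
    for rho, theta in lines[:, 0]:
        # keep only near-horizontal lines (theta close to 90 degrees)
        if abs(theta - np.pi / 2) < np.deg2rad(5):
            print(rho, theta)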
Question
1 answer
I am using fuzzy rules to solve one of my research problems. Since the rules are exhaustive, I need some fuzzy rule learning technique that can evolve the rules on its own. Which technique is best: genetic algorithms, neuro-fuzzy, or something else?
Relevant answer
Answer
Evolving fuzzy systems (EFS) can be defined as self-developing, self-learning fuzzy rule-based or neuro-fuzzy systems that have both their parameters and (more importantly) their structure self-adapting on-line.
They are usually associated with streaming data and on-line (often real-time) modes of operation. In a narrower sense they can be seen as adaptive fuzzy systems. The difference is that evolving fuzzy systems assume on-line adaptation of system structure in addition to the parameter adaptation which is usually associated with the term adaptive. They also allow for adaptation of the learning mechanism. Therefore, evolving assumes a higher level of adaptation.
In this definition the English word evolving is used with its core meaning as described in the Oxford dictionary (Hornby, 1974; p.294), namely unfolding; developing; being developed, naturally and gradually.
Often evolving is used in relation to so-called evolutionary and genetic algorithms. The meaning of the term evolutionary is defined in the Oxford dictionary as development of more complicated forms of life (plants, animals) from earlier and simpler forms. EFS consider a gradual development of the underlying (fuzzy or neuro-fuzzy) system structure and do not deal with phenomena specific to evolutionary and genetic algorithms, such as chromosome crossover, mutation, selection and reproduction, parents and offspring.
Question
3 answers
I know Multidimensional Scaling uses only a distance matrix, but Self-Organizing Map requires coordinates of points in the original space. What are some other dimensionality reduction techniques, such as Multidimensional Scaling, that need only a distance matrix rather than point coordinates?
Relevant answer
Answer
Thanks, Evaldas. I just found Isomap works pretty well for my problem. Nice research though !
Question
8 answers
Relevant answer
Answer
It ought to depend on how you use the statistical models. If you use them to analyze/generalize a whole bunch of inputs, yes, that is flawed, as there are human behaviours that do not depend on a large set of inputs, but something more like an internal model. This is basically Chomsky's point, and he's right to the extent that learning approaches which depend on aggregation of data like this are not a good clue to human behaviour. However, that is not the only way you can use statistics. If you look at Anderson's ACT-R model, for example, it is firmly rooted in statistics but at a memory chunk level not at a stimulus level. This is both neurally relatively plausible and capable of modelling complex human behaviours, such as problem solving.
I'm not sure Norvig's points are helpful, either. ACT-R is sort of algorithmic modelling, but it really does make claims of correspondence to the processes in human psychology. That's why it's really cognitive modelling. For some time, artificial intelligence and cognitive science have gone their separate ways. Chomsky's critique of artificial intelligence as not saying much about human language is reasonable in this respect, but that doesn't undermine the use of statistical modelling in disciplines which do care about understanding the processes in human psychology.
Question
4 answers
All of the above have some common ground. In your opinion, what is the real difference between these fields? If you were asked to classify your research area, what would it be?
Relevant answer
Answer
In layman's language, statistics is a way to infer patterns from data based on an existing model; machine learning is a heuristic for having the computer form its own model from the data; data mining and pattern recognition are applications (not methods) that can be done through either statistics or machine learning; and pattern recognition is a sub-field of data mining. Many people would just claim they do all of them, I guess.
I do woodworking and carpentry using routers and saws, etc., BTW ;)
Question
16 answers
What is the latest technology in advanced machine learning (AML), besides natural language processing and neural networks?
Relevant answer
Answer
Subscribe to the Journal of Machine Learning Research (jmlr.csail.mit.edu) – they send you an e-copy of the quarterly publication, and you will obviously get an idea of current research by reading the abstracts. Some of the broad topics are integrated machine learning and machine learning in affective modeling.
Question
23 answers
On the face of it, topic modelling, whether it is achieved using LDA, HDP, NNMF, or any other method, is very appealing. Documents are partitioned into topics, which in turn have terms associated to varying degrees. However, in practice there are some clear issues: the models are very sensitive to the input data (small changes to the stemming/tokenisation algorithms can result in completely different topics); topics need to be manually categorised in order to be useful (often arbitrarily, as the topics often contain mixed content); and topics are "unstable", in the sense that adding new documents can cause significant changes to the topic distribution (less of an issue with large corpora).
In the light of these issues, are topic models simply a toy for NLP/ML researchers to show what can be done with data, or are they really useful in practice? Do any websites/products make use of them?
Relevant answer
Answer
Topic models are good for data exploration, when there is some new data set and you don't know what kinds of structures that you could possibly find in there. But if you did know what structures you could find in your data set, topic models are still useful if you didn't have the time or resources to construct classification models based on supervised machine learning. Lastly, if you did have the time and resources to construct classification models based on supervised learning, topic models would still be useful as extra features to add to the models in order to increase their accuracy. This is the case because topic models act as a kind of "smoothing" that helps combat the sparse data problem that is often seen in supervised learning.
There is previous work in automatically assigning labels to topics discovered by topic models. They work OK, but they're heuristic in nature.
So, topic models are not the be all and end all of data analysis. But they're a useful tool.
Question
60 answers
Complexity of input and output variables.
Relevant answer
Answer
I understand now.
The ANN is a model of your data, so one thing you can do is to measure the complexity of the ANN after it is trained on a given input data. One measure of the complexity of an ANN is the time it takes to train it, or the amount of training data required to train it to a given accuracy.
If you don't like this way of doing things, because the logic is a little circular, then you can measure the complexity of one input data set relative to another input data set.
One way that this can be done is to discretize the two sets of input data, X and Y using the same set of bins (if they are not already discrete).
Thereafter, compute how much X can be compressed by using Y as the template, and also how much Y can be compressed by using X as the template.
The idea behind this approach is that the more complex data set will be more effective in compressing the less complex data set.
Is this what you might like to do?
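If it is, one crude way to operationalize "compress X using Y as the template" is with a general-purpose compressor; this is only a sketch of that idea (zlib standing in for a proper conditional-compression scheme, and the byte strings being hypothetical discretized series).
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def extra_cost(x: bytes, y: bytes) -> float:
    # C(Y+X) - C(Y): the extra cost of X once Y has been seen,
    # normalized by the cost of X on its own.
    return (compressed_size(y + x) - compressed_size(y)) / compressed_size(x)

x = bytes([1, 2, 3, 4] * 250)      # hypothetical discretized series X
y = bytes([1, 2, 3, 4] * 250)      # hypothetical discretized series Y
print(extra_cost(x, y))            # near 0: Y is a good template for X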
Question
19 answers
This is a toy example with four variables v,w,x,y and a binary outcome variable z. What approach would you take to find a statistical or algorithmic model to predict "z"? If you choose to perform the analysis, what is your cross-validated accuracy (10-fold)? What conclusions about the data-generating mechanism would you draw?
Relevant answer
Answer
Simple logistic regression, restricted to data with v=1, and using x, log(y) and their squares, leads to a 100% accurate classifier, even if non-convergence is signalled because of perfect separation.
Question
1 answer
I tried increasing the heap size by editing the RunWeka.ini file, changing it from 256m to 1024m. I am using a large dataset. After changing the file, I saved it and ran Weka again, but I am still getting the error. Was the change not applied, or is 1024m not enough? How do I increase it further?
Relevant answer
Answer
You need to start the program with the following command:
java -Xmx6g your/program/path
This command tries to allocate 6 GB of heap memory.
Change the value accordingly, e.g. -Xmx160m for 160 MB.
Question
8 answers
My claim is that none of the accepted concepts of class are *adequate*. Is this situation OK? Is it OK to do classification without an adequate (central) concept of class available in the model? Do you think this situation cannot be avoided?
CAUTION: This is a fundamental scientific question with radical applied implications.
Relevant answer
Answer
Dear Lev, I would not completely agree (nor disagree) with your claim. My rationale is that in order to train some models, the notion of a "class" must be incorporated in the training data. Of course, one may then say that the notion of "class" is part of the data, rather than part of the model. But if you removed this part of the data, it would be impossible to train any supervised model (SVM, ANN, Bayesian classifier). Now, if you consider unsupervised methods (KNN, etc), they usually rely on the concept of a "class" much more than supervised methods, since they are expressed in terms of a labeling of the samples or a partitioning of the sample set.
On the other hand, we have regression methods which instead of assigning a discrete label to each data point, they only attempt to approximate an unknown function which depends on the data. In these methods, the concept of "class" is usually irrelevant, or at least not strongly enforced.
So my answer, in conclusion, is that whether the concept of "class" is inherently embedded in a learning method depends on both the model and the application.
Neural Networks, for example, can be used both as classifiers (e.g., with a sigmoidal output neuron) and as regression models (e.g., with a linear output unit).
It's quite an open question, though, and you may get many different answers.
Question
2 answers
Is there any way a program can emit code – not by doing string concatenation, but by logically generating code based on observation?
Relevant answer
Answer
What exactly do you mean by 'codes'? What exactly do you mean by 'logically'? Most programs generate output depending on their observation of, e.g., keyboard status.
I think a more detailed description would help people to answer...
cheers
Question
9 answers
For doing cross-validation, is it the parameter x which has to be used, or is it some other parameter? I am confused. For doing 10-fold cross-validation, will -x 10 be fine?
Also, in SVM-light, how do I generate a ROC curve? How do I change threshold values in SVM?
Relevant answer
Answer
We have not tested bob on Windows yet. It is not supported for the moment.
Question
77 answers
Because the present forms of data representation---including vector space, logical, graphs, etc.---have not been uniformly useful, does it mean that, in general, there will not be a single universally preferable form of data representation which would capture previously inaccessible and more complete view of data objects?
CAUTION: This is a serious scientific question with radical applied implications.
Relevant answer
Answer
Prof. Goldfarb, we know you are hinting at the ETS (Evolving Transformation System) representation. I have read some early papers from your group proposing and developing ETS. We would like to know more about the development after the early period, up till now.
Question
23 answers
Neural networks very loosely imitate and are inspired by the human brain, and as such have some of the strengths of the human brain, like discrimination/pattern recognition, and some of its weaknesses, like forgetting and not being able to multiply two n-digit numbers efficiently. Neural networks are used in machine learning, a subset of AI, and in AI itself in many different applications.
Relevant answer
Answer
It really depends on what type of data you use. As with any formal system, the more you provide for a system to use in constructing its analysis, the more likely you are to miss outlier cases (in more general cases). Using 'too much' training data can really be a case of not using the right parameters, or of using too much of the same type of data. For instance, I wouldn't just feed an ANN the exact same-looking data for everything; balance out the cases so that the less likely cases it should catch can be 'roughly' considered. A flaw of any ANN is the input it is given and the problem domain it can be applied to: an ANN can consider a lot, but it should be used carefully when it comes to lots of different situations. Independence of problems can be a big part of this. This is possibly not discussed as much simply because people do not understand the limitations of ANNs. When people don't understand them, they think of them as 'the greatest thing since sliced bread', when they don't yield much more power than some other methods, much like any field of algorithmic research. People find their power in expert systems, and especially in predicting input such as handwriting. Usually people get in trouble when they expect them to be like 'us', when computation is not us or our brains. Computation is defined by computational models, and if we were even to consider replicating the brain, we would have a lot of work ahead of us before we could say more; it looks a little unlikely given the limitations even of approximations in formal systems. This is why it is not discussed among experts as much: they understand this, while non-computer-science audiences may see it as this 'glorious' thing. Machine learning is a very fascinating field, but it is not best thought of as a 'human brain' replication concept, which is not what scientists should be after, since we know mathematics has limitations.
Question
2 answers
I'm researching different semantic distance measures for words and would like to have a test to compare them. I'm looking for the "TOEFL Synonym Questions" from the TOEFL test, which others have used. It is linked here: http://www.aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_(State_of_the_art) but no one at the linked mail addresses has reacted to my requests. Maybe someone here knows where to get the questions?
I'm also open to suggestions for other test sets to benchmark semantic distance measures for words.
Relevant answer
Answer
You can write to Peter Turney to get the TOEFL Synonym questions.
You can find his email here :
I recently did this myself, so I'm quite sure you will be successful! :)
Question
3 answers
So if I use a NN with 5 layers – 3 hidden (h1, h2, h3), plus input and output – to compress an image, such that
size of h1 = h3 and h2 < h1, how do I reconstruct the image from the output?
Relevant answer
Answer
You are talking about the one-hot code. Another extremum would be the binary code, so that the length of a code word equals log2(n), where n is the number of items to encode. Something between n and log2(n) should be an appropriate guess for the number of neurons in h2.
Another thing to consider is that the code for an image doesn't have to consist of zeros and ones but can also be floats. When working with backpropagation, the output values of neurons will usually be float values, so the code for an image is a vector of floats, e.g. 0.1, 0.83, 0.42, ...
I faintly remember that typical codes that were obtained by neural nets resemble Gray codes. I made experiments in the eighties, of course due to computer limitations not with images but with small bit patterns as input and output. Binary encoding has the property that digits have big differences in weight. I think nature hates such inequality, therefore the Gray-like outcome.
Regards,
Joachim
Question
5 answers
Can anyone help with how to classify 20 visual concepts? Suppose we have visual concepts such as cars, cats, mountains, moon, night; I have a data set for these visual concepts that is based on tags and MPEG-7. I trained the data set on a single concept (car) with the help of libsvm, but the results were not satisfying.
The data set contains approx. 15000 images that belong to all these 20 concepts. I've used car as the positive example and all other concepts as negative. Is this a problem with the libsvm tool? I used a linear classifier in libsvm.
Relevant answer
Answer
I am using a linear classifier.
I am using tags and EXIF for the classification.
Attached is a baby tag file. I have a total of 20 concepts.
Question
3 answers
Today's ITS research is dominated by highly ambitious machine learning techniques. However, in my view, education as a process should be mechanized as little as possible. Thus I am working on the development of an ITS using production rules written in JESS.
My concern is whether a rule-based implementation will find acceptance among the publication watchdogs, or whether they would throw it in the dustbin, influenced by the overwhelming popularity of machine learning techniques.
Relevant answer
Answer
Why is there a confusion between ML and RBS? You can do ML over rules, if you plan to induce those rules. You can do ML over other representations also -- including opaque ones like weights in a NN. So using an RBS does not mean you cannot use ML, and vice versa. I don't think someone will throw away something just because it uses rules. The issues are the power/richness of the rules to represent what you want to represent, the inferencing mechanism, how to get the rules in place, how to ensure that they are right, etc.
ML is generally used when some aspects are not clear enough to formulate properly. So what question to ask next, what topic to show next, etc. can be functions which are trained, rather than writing a lot of rules saying when to do what. And in many cases the change of rules over time is another concern, where ML can be put on top to induce or refine the rules over time.
Question
11 answers
If we want to define a size on a concept, what is the best criterion to do that?
Is it possible to use Greedy Set Covering Algorithm and suppose the minimum number of spheres covering positive instances as a Concept complexity measurement?
Relevant answer
Answer
Interesting question. But can you give some examples of concepts you want to learn and how are those concept represented in your input space? Because concepts can be understood in different ways ...
Question
12 answers
At the implementation stage of a machine learning model, is there a rule of thumb for preferring MPI over a GPGPU architecture?
Relevant answer
Answer
Generally, GPUs processing power can be exploited more efficiently for problems where the algorithm operating on the data can be straightforwardly parallelized, e.g., vector to vector operations (element-wise additions/multiplications). On the other hand, when the parallel algorithm is implemented by a small number of parallel threads/processes and they operate on large chunks of data, then you probably should opt for MPI.
Question
1 answer
I was wondering whether there are competitions for Bayesian network structure learning algorithms, inside the machine learning community. I've read some papers on the topic, but it's hard to compare the efficiency of the different approaches, since they often use different metrics (AIC, BIC, MDL, loglikelihood, number of arcs that are not in the original model, ...) and some methods might be better with regard to certain parameters, and worse for others.
So, an "objective" (as much as possible) competition, evaluating the methods on a great number of different metrics, would be great to understand whether some approaches are actually better than others, of if they lay on the same "Pareto front".
Relevant answer
Answer
Hi, to my knowledge there is no specific competition about Bayesian networks. However, you can evaluate the methods on a specific (real) problem. Have a look at http://www.kaggle.com/competitions
What you can do is test your methods in such competitions. Hope this helps.
Question
4 answers
Automated cervical cancer detection
Relevant answer
You can try the ImageJ program.
Select from the menu Image -> Adjust -> Threshold to extract nuclei. Nuclei are the darkest objects.
The quality of cytoplasm extraction depends on image contrast and background properties. For example, you can extract the background (B) from the image (I) first; the background has the brightest pixels. Then you can get the cytoplasm segment (C) as
C = I - B - N
where N is the nuclei segment.
To separate touching cells use Process -> Binary -> Watershed.
The most interesting part is matching the cytoplasm and nucleus of each cell, for example if you plan to calculate the nuclear-cytoplasmic ratio.
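If a scripted alternative to ImageJ is acceptable, here is a rough Python sketch of the same recipe with scikit-image; the file name, threshold choices and size parameters are assumptions and will need tuning per image set:

import numpy as np
from skimage import io, filters, morphology, segmentation, measure
from scipy import ndimage as ndi

img = io.imread('smear.png', as_gray=True)         # hypothetical input image

# nuclei: the darkest objects
nuclei = img < filters.threshold_otsu(img)
nuclei = morphology.remove_small_objects(nuclei, min_size=50)

# background: the brightest pixels; the rest minus nuclei is roughly cytoplasm
background = img > np.percentile(img, 90)           # assumed percentile
cytoplasm = ~background & ~nuclei

# separate touching cells with a watershed on the distance transform
distance = ndi.distance_transform_edt(nuclei)
markers, _ = ndi.label(distance > 0.5 * distance.max())
labels = segmentation.watershed(-distance, markers, mask=nuclei | cytoplasm)

# match nucleus and cytoplasm per cell, e.g. for a nuclear/cytoplasmic ratio
for region in measure.regionprops(labels):
    cell_mask = labels == region.label
    n_area = np.sum(cell_mask & nuclei)
    c_area = np.sum(cell_mask & cytoplasm)
    if c_area > 0:
        print(region.label, n_area / c_area)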
Question
11 answers
Mainly how are training samples selected? Do we assign class labels to each pixel in an image?
Relevant answer
Answer
Demetris, you can see image segmentation as a classification problem. Specifically, you want to classify each pixel in the image according to some criteria, so that all pixels belonging to a region are given the same class label. In this case, your input vectors are intensities of the pixel you want to classify, or the intensities of the neighboring pixels. I know there are many other models which may be more appropriate for segmentation (such as Markov Random Field models), but I think the Neural Network approach could work well in the case where (1) you do not have a clear idea of the segmentation criteria, or how to model it, and (2) you do have several examples of the expected results for given images, which you can use as training data.
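To make the pixel-classification view concrete, here is a small hedged Python/scikit-learn sketch in which each pixel is described by the intensities of its neighbourhood and an MLP is trained on a few labelled pixels; the window size, network size and toy labels are assumptions for illustration only:

import numpy as np
from sklearn.neural_network import MLPClassifier

def neighbourhood_features(img, r=2):
    # stack the (2r+1)x(2r+1) intensity window around every pixel as a feature vector
    padded = np.pad(img, r, mode='reflect')
    h, w = img.shape
    feats = [padded[i:i + h, j:j + w].ravel()
             for i in range(2 * r + 1) for j in range(2 * r + 1)]
    return np.stack(feats, axis=1)                 # shape: (h*w, (2r+1)^2)

img = np.random.rand(64, 64)                       # stand-in for a grayscale image in [0, 1]
X = neighbourhood_features(img)

# labels for a subset of pixels, e.g. from user scribbles (here: a toy rule)
train_idx = np.random.choice(X.shape[0], 500, replace=False)
y_train = (img.ravel()[train_idx] > 0.5).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(X[train_idx], y_train)

seg = clf.predict(X).reshape(img.shape)            # one class label per pixel = segmentation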
Question
2 answers
See my question
Relevant answer
Answer
Metsis et al. describe derivatives of the Enron dataset that were created for anti-spam evaluation:
Question
5 answers
What is the difference between active learning and passive learning?
Relevant answer
Answer
Active learning is about being proactive toward a problem. Research, for example, is a proactive activity: you seek information about how nature works (remember, we do not make the laws of nature; Newton, Einstein, Faraday and countless other scientists observed nature to discover how it works). They arranged experiments and learned that, despite all efforts at a 'controlled' environment, the practical answer never quite matched the theoretical one. Hence they introduced alpha, beta, gamma, etc. as coefficients that bring the practical answer close to the theoretical one, hence the term 'empirical formulae'. Passive learning, on the other hand, is learning as it comes to you: you go to a class and are taught something you never planned to learn, but the teacher taught you maths, physics, etc.
Active learning means going after a problem with an approach to solve it.
I hope this clarifies the difference.
Question
3 answers
Is there any way to identify the best kernel "a priori"?
Relevant answer
Answer
Andreas is right. Chances are you can design your own kernel when enough prior knowledge of the problem at hand is given a priori, e.g. in terms of (invariant) transformations that ought to preserve the meaning of input patterns.
Unfortunately, many such kernels are no longer positive definite (e.g. see jittering kernels or invariant distance substitution kernels), and global optimality as well as convergence might no longer be guaranteed, as the resulting problems can lose their convexity (the kernels become indefinite). This may still be fine for SVMs, depending on the solver you use, and can boost performance, but it can be especially severe for other kernel-based learning algorithms that are restricted to positive definite kernels (e.g. Gaussian processes using marginal likelihood inference).
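For completeness, here is a minimal Python/scikit-learn sketch of plugging a hand-designed kernel into an SVM via a precomputed Gram matrix; the kernel below is just a placeholder, and with an indefinite kernel the usual convergence guarantees may not hold, as noted above:

import numpy as np
from sklearn.svm import SVC

def my_kernel(A, B):
    # placeholder for a problem-specific kernel; here simply a scaled linear kernel
    return A @ B.T / A.shape[1]

X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, 100)
X_test = np.random.rand(20, 10)

clf = SVC(kernel='precomputed')
clf.fit(my_kernel(X_train, X_train), y_train)        # Gram matrix of the training data
y_pred = clf.predict(my_kernel(X_test, X_train))     # rows: test points, columns: training points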
Question
4 answers
Classification Tool
Relevant answer
Answer
I suggest the RWeka package in R, or naive Bayes as implemented in MATLAB :)
Question
6 answers
Looking for some implementation of compressive sensing.
Relevant answer
Answer
Any signal is sparse in some domain, like Fourier, DCT, or wavelet, so we can compress a signal in that domain. Compressed sensing exploits this by using random projections for sampling. The measurement (or sampling) is minimal, and reconstruction uses L1 optimisation or greedy methods to determine the unknown vector in an under-determined algebraic system. A typical application is the single-pixel camera, which is a good way to understand the process.
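As a toy illustration of recovering a sparse vector from a few random projections, here is a hedged Python sketch that uses L1-regularised regression (Lasso) as a basis-pursuit-style solver; the sizes and regularisation strength are arbitrary choices:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 60, 8                        # signal length, number of measurements, sparsity

x_true = np.zeros(n)                        # a k-sparse signal in its sparse domain
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)

A = rng.normal(size=(m, n)) / np.sqrt(m)    # random projection (measurement) matrix
y = A @ x_true                              # m << n compressed measurements

# L1-regularised recovery; greedy methods such as OMP would be an alternative
lasso = Lasso(alpha=1e-3, max_iter=10000, fit_intercept=False)
lasso.fit(A, y)
x_hat = lasso.coef_

print('relative recovery error:', np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))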
Question
4 answers
Classical approaches perform badly on the minority class, even if the global learning is satisfactory.
Relevant answer
Answer
One class classification methods were designed precisely for this type of problem. It is often called novelty detection, or outlier detection or abnormality detection.
The idea is that you learn solely from the majority class (normal) so that any deviations from it are regarded as the minority (abnormal). For example, many common classifiers such as k-means, SVMs, MoGs and nearest neighbour have been adapted for this problem.
David Tax from Delft University is very knowledgeable in this field. His phd thesis is a very good read, as are his numerous papers in this area.
If you use Matlab, there is a toolbox dd_tools which contains these methods.
Finally, when measuring your classifier's performance, I find the balanced error rate (BER) useful. It is the average of the false positive rate and the false negative rate.
Consider 90 positive and 10 negative examples and a classifier that treats all objects as positive. Here, accuracy would be 90%, which sounds really good but, in your situation, hides the poor performance. The BER here would be 0.5*(0% + 100%) = 50%, a poor value that reflects the weak classifier.
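A minimal Python sketch of this idea with scikit-learn: a one-class SVM is trained on the majority class only, and the balanced error rate is computed on a held-out test set (the nu/gamma values and the toy data are assumptions):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_majority = rng.normal(0, 1, size=(1000, 2))          # 'normal' training data only
X_test = np.vstack([rng.normal(0, 1, size=(90, 2)),    # 90 normal test points
                    rng.normal(4, 1, size=(10, 2))])   # 10 abnormal test points
y_test = np.array([1] * 90 + [-1] * 10)                # OneClassSVM convention: +1 inlier, -1 outlier

clf = OneClassSVM(nu=0.05, gamma='scale').fit(X_majority)
y_pred = clf.predict(X_test)

err_normal = np.mean(y_pred[y_test == 1] == -1)        # normal points wrongly flagged
err_abnormal = np.mean(y_pred[y_test == -1] == 1)      # abnormal points missed
ber = 0.5 * (err_normal + err_abnormal)
print('balanced error rate:', ber)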
Hope that is of some use.
Question
8 answers
I have 20 concepts. When I train (1 concept vs all), the problem I face is that training takes a long time, and the accuracy I get when I test is only about 50%.
When training (1 vs 1) or (1 vs k), the accuracy reaches 85%. My question is: which k concepts should I select while training the classifier? How can I find the nearest concept?
Relevant answer
Answer
You could try an NN classifier.
Question
8 answers
I read an article by Pedro Domingos titled "A Few Useful Things to Know about Machine Learning" (Oct 2012), and I do not buy his claim that a 'dumb algorithm' and a tremendous amount of data will provide better results than moderate data and a more clever algorithm. What if adding more data gives you more noise or irrelevant information? The only justification I see is that you can go through more iterations of the data and have more ways to learn from it. I can't see that claim as sufficient or sound enough to be valid.
Relevant answer
Answer
His section "More data beats a cleverer algorithm" follows the previous section "Feature engineering is the key". In this context he is probably right, but with this priority: the representation "is the most important factor" (his quote) for the success of the project.
I think what he means is that IF you got the 'right' representation and the appropriate amount of data you are OK without the fancy algorithms. However, what he should have stated more explicitly is that if you don't have a 'good' representation---which is almost always the case---no amount of data or clever algorithms will produce satisfactory results.
What he and almost everyone else do not realize is that there are fundamentally new representations which could radically change the usual rules of the game.
I'm quite sure that the PR / ML obsession with better algorithms can be explained by the lack of good representations: if you don't have access to a 'good' data representation and you like statistics, you start inventing "cleverer" algorithms. ;--)
Question
31 answers
What are your ideas about the Danger Theory algorithm? There are several algorithms to model and implement Artificial Immune Systems, such as negative selection, clonal selection, and more recently danger theory. I want to work on the Dendritic Cell Algorithm (DCA), but it is a little ambiguous. Any recommendations?
Relevant answer
Answer
It's necessary to pay attention to the differences between the 1st and 2nd generations of AIS. In the first, the approach is self/non-self. In that case we can apply it to an IDS/IPS, but we will have a scalability problem. Remember, we need to define what is self and what is not, and it is difficult to define what is self or not in a big network. This is the problem of the first generation of AIS. In the 2nd generation, inspired by Matzinger's theory, the scalability problem is mitigated. This generation uses signal concepts, and dendritic cells have an important role in the immune system. This approach uses the DCA algorithm (by Greensmith) as the engine of the theory. I recommend you follow the Uwe Aickelin research group, which works with these new AIS concepts.
Sorry for responding so late.
Moisés Danziger
Question
16 answers
Suppose we deal with a dataset with different kinds of attributes (numeric and nominal) and a binary class. How can we find a single number as the Shannon entropy of this dataset (as a representation of the Kolmogorov complexity of this dataset)?
Relevant answer
Answer
If this is a machine learning application, you may be better off working with the Vapnik–Chervonenkis (VC) dimension. http://en.wikipedia.org/wiki/VC_dimension
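If you do want a single entropy-style number despite the caveats, one simple quantity that is often used is the empirical Shannon entropy of the class distribution; a hedged Python sketch (note it only looks at the labels and ignores the attributes and any dependencies between them):

import numpy as np
from collections import Counter

def shannon_entropy(labels):
    # empirical Shannon entropy (in bits) of a sequence of discrete labels
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = [0, 1, 0, 0, 1, 1, 0, 0]              # binary class labels of the dataset
print(shannon_entropy(y))                  # ~0.954 bits for a 5/3 split

A crude extension to mixed attributes is to discretise the numeric columns and sum or average the per-column entropies, but this ignores dependencies between attributes.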
Question
17 answers
My svmlight output for binary classification (+1 for true and -1 for false classes) is like
+1 0.001
-1 -0.34
+1 0.9
+1 055
-1 -0.2
I tried generating an ROC curve in R using the ROCR package, but I found that all the ROCR examples use TPR, FPR and cutoff values. TPR and FPR can be calculated from the above data, but that gives only one value, and I am also confused about the cutoff. In this case the cutoff ranges from -1 to +1, right?
Can anyone help me how to give the values from the above data for drawing ROC in ROCR package.
Relevant answer
Answer
Thanks for the help, Mr. Iordan. Actually, with a lot of browsing I corrected the errors and also managed to plot more than one curve in the ROC figure. But AUC in R was giving me problems, so I finally managed it in Excel, which has a formula to calculate AUC.
For AUC in R I tried RocAUC and pAUC, but nothing worked; I was giving each dataset a different name.
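If stepping outside R and ROCR is an option, the curve and the AUC can also be obtained directly from the raw decision values with Python/scikit-learn (the labels and scores below are toy values, not the ones from the question); note the cutoff is swept over the observed score values rather than being fixed at -1/+1:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, -1, 1, 1, -1, -1, 1])                 # +1 / -1 class labels
scores = np.array([0.8, -0.3, 0.4, 0.1, -0.6, 0.2, 0.9])    # toy SVM decision values

fpr, tpr, thresholds = roc_curve(y_true, scores)            # thresholds come from the scores themselves
auc = roc_auc_score(y_true, scores)

plt.plot(fpr, tpr, label='AUC = %.2f' % auc)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()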
Question
11 answers
Algorithms to deal with unbalanced clusters for classification?
Relevant answer
Answer
One class classification methods are ideal for unbalanced data. They learn the majority class so that any deviations from it are assumed to be the minority class, or faulty class perhaps.
Assistant Professor David Tax at Delft University is an expert in this field and more details can be seen from his papers and particularly his thesis http://homepage.tudelft.nl/n9d04/thesis.pdf
The methods are also implemented in his Matlab toolbox which is free for academic use.
Types of methods include density methods (Gaussian, Parzen windows), distance-based methods (one-class SVMs, SVDD) and reconstruction methods (NNDD).
Question
9 answers
Is cross-validation necessary in neural network training and testing?
Relevant answer
Answer
I hope there is no misunderstanding in terms, and I think that by "cross validation" you mean the use of a verification subset to monitor NN training for early-stopping purposes. Correct? If yes, then it is _absolutely_ necessary. Without it, neural networks can easily be overtrained: you reproduce the training-set observations very nicely but get an almost useless model.
If, on the other hand, you understand "cross validation" in terms of post-training testing, like its worst version, leave-one-out cross-validation, then it is not as useful as an external validation.
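For illustration, a minimal Python sketch of training with a held-out verification subset for early stopping, using scikit-learn's MLP, plus a separate external set for post-training testing (the fractions and patience values are assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_ext, y_train, y_ext = train_test_split(X, y, test_size=0.25, random_state=0)

# early_stopping=True holds out validation_fraction of the training data and
# stops when the validation score has not improved for n_iter_no_change epochs
clf = MLPClassifier(hidden_layer_sizes=(50,),
                    early_stopping=True,
                    validation_fraction=0.15,
                    n_iter_no_change=10,
                    max_iter=1000,
                    random_state=0)
clf.fit(X_train, y_train)

# X_ext / y_ext play the role of an external validation set for post-training testing
print('external validation accuracy:', clf.score(X_ext, y_ext))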
Question
15 answers
More accurately, I mean not the narrow technical discussions but the discussions on the general direction of the corresponding natural science. And this is particularly conspicuous given the transitional nature of our scientific period, or it may be exactly the symptom of this transition. Do you know the reasons? (Is this a symptom of unconscious insecurity?)
Relevant answer
Answer
The question by Lev Goldfarb about "the lack of serious scientific discussions" on the internet is a very real issue. The internet is turning into a giant forest of the same trees. New findings are becoming scarce, except on the commercial sites where you may buy them, and even then you buy products of ideas, not new ideas at all.
These days, ideas have become marketable and they may not appear on the net. There are fewer high-level works that are published than are not published. My question, like Lev Goldfarb's, is: are scientists free to discuss their ideas freely on the net, since most top scientists seem to be hired by multinational corporations or governments?
Question
1 answer
I just wanted to start exploring the possibilities of simulating bioinformatics problems using artificial intelligence.
Relevant answer
Answer
You wish to predict a complex phenomenon. Machine learning (SVM, neural networks) requires experimental data to be available; "prediction" does not come from nothing. So why don't you use an exploratory response surface, a classical method of sequential statistical optimization?
Question
4 answers
I'm fitting some multivariate observation sequences to a hidden Markov model using R's RHmm package, but I keep getting the following error: "Error in BaumWelch(paramHMM, obs, paramAlgo) : Non inversible matrix". Does anyone know what happened? Or does anyone know where I can find other implementations?
Relevant answer
Answer
Hi,
If you have too few observations and too many hidden states, you can get singular matrices. Another reason might be a poor initialization. You can try a different initialization. You could also try adjusting the prior. If your matrix is singular then a different toolbox won't help...
Dave
Question
4 answers
I have installed R and the ROCR package in R. I need to generate an ROC curve from my SVM output data. How do I load my data into R to run ROCR?
Relevant answer
Answer
You want to look at the read.table or read.csv commands. Check out the help; they are pretty straightforward.
Question
5 answers
Does anybody know how to train sequences with an HMM using Kevin Murphy's toolbox? I want to train different sequences and then test which sequence belongs to which class.
Relevant answer
Answer
Let O = # of observation symbols, Q = # of states, T = observation sequence length and G = # of classes. You should first preallocate PI (prior probabilities), A and B:
PI = zeros(G,Q);        % priors, one row per class
A = zeros(Q,Q,G);       % transition matrices
B = zeros(Q,O,G);       % emission matrices
After creating the observation sequences for class g, with size(Seqs) = [number_of_training_samples_in_class_g, T], initialise random parameters:
PI_g = normalise(rand(Q,1));
A_g = mk_stochastic(rand(Q,Q));   % note: Q x Q (state transitions), not O x O
B_g = mk_stochastic(rand(Q,O));
(If you want another HMM topology - left-right, Bakis, linear, ... - adapt this initialisation accordingly.) Then train the HMM for class g:
[LL, PI(g,:), A(:,:,g), B(:,:,g)] = dhmm_em(Seqs, PI_g, A_g, B_g);
For testing, to obtain the log-likelihood that an observation sequence Seqs belongs to class g:
[loglik(g), errors] = dhmm_logprob(Seqs, PI(g,:), A(:,:,g), B(:,:,g));
Question
75 answers
I am using newff from the neurolab package for Python. The network is not learning for some sets of inputs and shows a constant error throughout the learning phase. Here are the error values the network displays:
Epoch: 100; Error: 149.999993147;
Epoch: 200; Error: 149.999992663;
Epoch: 300; Error: 150.0;
Epoch: 400; Error: 150.0;
Epoch: 500; Error: 150.0;
Epoch: 600; Error: 150.0;
Epoch: 700; Error: 150.0;
Epoch: 800; Error: 150.0;
Epoch: 900; Error: 150.0;
Epoch: 1000; Error: 150.0;
Epoch: 1100; Error: 150.0;
Epoch: 1200; Error: 150.0;
Epoch: 1300; Error: 150.0;
Epoch: 1400; Error: 150.0;
Epoch: 1500; Error: 150.0;
Can anybody shed some light on the situation? What might be the possible reason for this kind of results?
Relevant answer
Answer
I think Mehdi Fatan's suggestions are correct but there is much more than that to look for. Here is a check list.
How many inputs? How many hidden layers? (start with 1 and if possible stay there) How many hidden units? What activation function: standard or creative (e.g. non monotonic)? differentiable or binary?
Did you normalize your data within either [-1,1] or [0,1]? (If you don't, the magnitude of the numbers you obtain has no meaning, and even the order of magnitude of the learning rate cannot be estimated other than by trial and error, let alone its value; a short normalisation sketch follows this checklist.)
What is the training algorithm used? How is the output error computed? (certainly it is not a wrong classification rate or percentage, because it exceeds both 1 and 100%, respectively).
The error should also be normalized, otherwise you cannot compare it on problems with different input cardinalities (number of patterns) and output dimensionalities (number of output units).
What is the cost function, the actual error minimized by your algorithm? Is it appropriate for the problem you want to learn, i.e., does it have a minimum where your solution satisfies you? Is this minimum reached smoothly or are there spikes, flats and ravines in the error landscape?
Are your data consistent or contradictory? Is your input related to your output by a reasonable relation (e.g. not random)?
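Regarding the normalisation point above, a minimal Python sketch of scaling inputs and targets to [0, 1] before training and mapping predictions back afterwards; the column-wise min/max values must of course be computed on the training set only:

import numpy as np

def fit_minmax(X):
    # column-wise min and span computed on the training data only
    lo, hi = X.min(axis=0), X.max(axis=0)
    return lo, np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero on constant columns

def to_unit(X, lo, span):
    return (X - lo) / span                       # maps training data into [0, 1]

def from_unit(Xn, lo, span):
    return Xn * span + lo                        # inverse transform for network outputs

X_train = np.random.rand(100, 3) * 50 + 10       # stand-in for raw inputs
y_train = np.random.rand(100, 1) * 300           # stand-in for raw targets

x_lo, x_span = fit_minmax(X_train)
y_lo, y_span = fit_minmax(y_train)

Xn, yn = to_unit(X_train, x_lo, x_span), to_unit(y_train, y_lo, y_span)
# ... train the network on (Xn, yn); then map its outputs back with from_unit(...)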
Question
1 answer
I usually use Latent Dirichlet Allocation to cluster texts. What do you use? Can someone give a comparison between different text clustering algorithms?
Relevant answer
Answer
Bisecting K-means
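For a quick side-by-side feel, here is a hedged Python sketch clustering the same toy documents with LDA topics and with k-means on TF-IDF (bisecting k-means applies the latter recursively); the vectoriser settings and cluster counts are assumptions:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors worry about the markets"]       # toy corpus

# LDA: soft topic assignments over word counts
counts = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(counts)             # one topic distribution per document
lda_clusters = topic_dist.argmax(axis=1)

# k-means on TF-IDF vectors: hard cluster assignments
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)
km_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

print(lda_clusters, km_clusters)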
Question
1 answer
In machine translation there are a number of phrases whose translations are quite fixed and do not depend on the surrounding words either. I have a list of such phrases and want GIZA++ to take advantage of this prior knowledge while making the alignment of source and target language phrases. This should help GIZA++ reduce the effort of the alignment process and also reduce the size of the final phrase table. But how do I do this?
Relevant answer
Answer
You can do this with mGIZA++, which you can find on Qin Gao's web site (http://geek.kyloo.net/). mGIZA++ provides a number of extensions: multi-threading, forced alignment, resuming alignment, and also using partial (manual) alignments.
The idea: if you have some alignment points, then the EM algorithm will not change those points, but find the best alignment compatible with those fixed alignments. We called this constraint alignment.
So, if you can generate a partial alignment for each sentence, using your prior knowledge about those phrase pairs, then you can load this into mGIZA++ and generate the complete alignment.
See: Qin Gao, Nguyen Bach, & Stephan Vogel: A semi-supervised word alignment algorithm with partial manual alignments. ACL 2010: Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR. Proceedings of the workshop, 15-16 July 2010, Uppsala University, Uppsala, Sweden; pp. 1-10.
If you run into problems, please contact Qin, as he implemented the stuff.
Question
5 answers
Machine learning is an essential part of many systems: communications, computer vision, AI, the web, etc. In your field, what do you think is really the top killer application of ML?
Relevant answer
Answer
Two possibilities occur to me:
1) Personal digital assistant (e.g. Apple's Siri). As the technology improves, this will turn your cell phone into a very smart and capable info-servant. Within 5 years, Siri (and its ilk) *will* change the way we interact not only with computers, but also with the outside world. (Of course this assumes the availability of speech recognition and generation as well.)
2) Automated government regulation. Many gov't regulations require businesses to summarize and report their activity on a regular basis (e.g. monthly/quarterly) so regulators can confirm that gov't's many laws are obeyed. In a few years this likely will change. Government will monitor business transactions automatically in real time, looking for patterns that have been found to predict / reveal misbehavior (illegal as well as just non-conforming). This could greatly lighten governmental oversight and make it almost impossible to game the system, since the patterns of misbehavior would be formed through machine learning, not procedural statute. Since regulator patterns would not be publicly visible, they'd be very hard to circumvent. This could reinvent the role of government and regulatory oversight.
Question
1 answer
Basically, do they provide features that are better than SIFT? I went through quite a few papers; everyone talks about translation, viewpoint and illumination invariance and so on. The topic of scale invariance doesn't seem to be clearly stated, or at least I didn't find it.
Relevant answer
Answer
I'm still starting out as well on the practical issues of deep learning, so I may be wrong, but I think that deep learning approaches learn only what you give them as data.
If you give examples of the same thing at different scales and viewpoints, then the deep learning machine will encode that into the network. If not, then I believe some invariance can still exist, but it will be quite limited.
Of course, if you make the input patch size/shape depend on the nature of the input texture/structure, akin to what DoG/SIFT does, then you could have an invariant deep learning machine. I highly doubt that this would be well looked upon, though, since the idea of deep learning is to have raw data as input...
Question
1 answer
According to my review of the related literature, I think sparse coding may be the basis of deep learning, i.e., the features at the lowest level of deep learning structures come from the dictionary of sparse coding. Is this statement right?
Relevant answer
Answer
Dear Yu Zhao, 
Normally, sparse coding is compared with the autoencoder, not with deep learning as a whole. The autoencoder is one of the building blocks of deep learning (in case we use stacked denoising autoencoders). An elaborated comparison between sparse coding and autoencoders can be found in the following link:
In essence, sparse coding can be viewed as an autoencoder with an added sparsity constraint. If you add a sparsity regularization constraint to the original autoencoder formulation, the results tend to be the same as sparse coding.
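To make the "autoencoder plus sparsity constraint" statement concrete, here is a hedged Python/Keras sketch in which an L1 activity penalty on the hidden layer pushes the codes toward sparsity, much like the sparse-coding objective (sizes and the penalty weight are assumptions):

import numpy as np
from tensorflow.keras import layers, models, regularizers

n_in, n_code = 64, 32

model = models.Sequential([
    layers.Input(shape=(n_in,)),
    # the L1 activity regulariser makes the hidden activations (the codes) sparse
    layers.Dense(n_code, activation='relu',
                 activity_regularizer=regularizers.l1(1e-4)),
    layers.Dense(n_in, activation='linear'),       # linear decoder ~ the dictionary
])
model.compile(optimizer='adam', loss='mse')

X = np.random.rand(500, n_in)                      # stand-in for input patches
model.fit(X, X, epochs=5, batch_size=32, verbose=0)

# the decoder weights play the role of the sparse-coding dictionary
dictionary = model.layers[-1].get_weights()[0]     # shape: (n_code, n_in)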
Hope this will help.
Question
6 answers
Does anyone care to share how to learn MATLAB for feature selection? I have no prior background in using MATLAB, so perhaps you could share a website with existing code that I could adapt.
I want to use MATLAB to analyze features in malware API calls.
Relevant answer
Question
3 answers
Since a random forest includes a bunch of random decision trees, it is not clear what we mean by forest size; it could be:
1) number of bits it takes
2) number of decision trees included in forest
3) accuracy
4) execution time
5) a combination of all of the above terms
6) etc.
What is the most important and what is the best?
Relevant answer
Answer
To answer a question about most important and/or best one needs to know the goal. For example, if I was going to use a random forest in an adaptive controller with 512 bytes of RAM, representation bits would be a very important design constraint. If my random forest were to be embedded into a real-time system, execution time would be key. Medical diagnosis would usually maximize some sort of risk-weighted accuracy and sacrifice other criteria.
To summarize: depending on the application different measures will be important.
Question
8 answers
What are the types of lazy classifiers?
Relevant answer
Answer
Different kinds are IB1 and IBk, where k > 1.
Question
5 answers
I want to classify images based on SIFT features, so what tool can I use to extract them? I have used "http://koen.me/research/colordescriptors/" to extract SIFT features, but the problem I am facing is that the file produced after extracting the SIFT features gets too large. I cannot pass that file to the SVM, as one file has approx. 12000 rows. The images I am using have dimensions of e.g. 1024x683. The SIFT feature file must contain less information so that I can pass hundreds of images to the SVM at the same time.
Relevant answer
Answer
Try using SURF features. They are similar but faster. There is an implementation on the Google Code site:
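Another common way to get a fixed-size and much smaller representation per image (instead of thousands of raw SIFT rows) is a bag-of-visual-words histogram; a hedged Python/OpenCV sketch follows, where the image paths and the vocabulary size are assumptions and many training images would be used in practice:

import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def sift_descriptors(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

image_paths = ['img1.jpg', 'img2.jpg']             # hypothetical training images
all_desc = np.vstack([sift_descriptors(p) for p in image_paths])

k = 200                                            # assumed visual vocabulary size
vocab = MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_desc)

def bovw_histogram(path):
    # fixed-length (k-dim) descriptor per image, suitable as one SVM input row
    words = vocab.predict(sift_descriptors(path))
    hist, _ = np.histogram(words, bins=np.arange(k + 1))
    return hist / max(hist.sum(), 1)               # L1-normalised histogram

X = np.array([bovw_histogram(p) for p in image_paths])   # one row per image for the SVM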
Question
5 answers
Will the RapidMiner tool help me in drawing an ROC curve? Moreover, I tried installing it but could not.
Relevant answer
Answer
Thanks, both.
Question
18 answers
This is related to machine learning: digit/object recognition using KNN, which is a supervised learning algorithm based upon instance/lazy learning without generalization, and which is non-parametric in that it makes no assumption of a normal (Gaussian) distribution. I am trying to implement it in Octave, Python and Java.
Relevant answer
Answer
Unfortunately, there is no "best" distance, as the goodness of such parameters always depends on data peculiarities.
However, you might be interested in the so-called tangent distance proposed by Simard et al., which tries to capture invariances with respect to some pattern transformations you expect the data to undergo. In case of OCR, this could be slight rotations, scalings etc.
Unfortunately, the tangent distance is not necessarily a metric. For a simple KNN, it might nevertheless prove valuable.
The original paper "Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation" about this very distance measure is available on the net. Other nice papers on this topic can be easily found as well.
Moreover, here's already some code for a tangent distance implementation provided by D. Keysers:
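As a starting point, scikit-learn's KNN accepts a user-supplied distance function, so a tangent distance (or any other measure) can be plugged in directly; here is a minimal Python sketch with a placeholder distance, noting that brute-force search is the safe choice for arbitrary callables and that the data below are stand-ins:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def my_distance(a, b):
    # placeholder: plain Euclidean distance; swap in e.g. a tangent distance here
    return np.sqrt(np.sum((a - b) ** 2))

X_train = np.random.rand(200, 64)                  # stand-in for flattened digit images
y_train = np.random.randint(0, 10, 200)

knn = KNeighborsClassifier(n_neighbors=3, metric=my_distance, algorithm='brute')
knn.fit(X_train, y_train)
print(knn.predict(np.random.rand(5, 64)))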
Question
14 answers
I have a binary classification problem with a data set that is missing a large amount of data. I have been reading papers about data imputation, but haven't reached a solid conclusion about which method is best. Note that I will most likely be using SVM for classification.
Relevant answer
Answer
Just one more addition to this topic: before using any algorithm, please be sure that the missing data are really missing :). I mean, NULL values are not necessarily 'I do not know' values; they could also be 'not applicable' (e.g. number of children of a 6-year-old male), or there may be no data (yet) (e.g. birth year of the last child). So sometimes ignorance is the best algorithm. That is probably not the case here; it is just a warning to do the data-understanding part of your job as well and find out why the data are missing.
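Whatever imputation method you settle on, it helps to compare a few of them inside a single pipeline, so that imputation is fitted only on the training folds of the cross-validation; a hedged scikit-learn sketch comparing mean imputation with k-NN imputation (the toy data, missingness rate and parameters are assumptions):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy labels
X[rng.random(X.shape) < 0.3] = np.nan        # knock out 30% of the entries

for name, imputer in [('mean', SimpleImputer(strategy='mean')),
                      ('knn', KNNImputer(n_neighbors=5))]:
    # the imputer is refitted inside each training fold, avoiding information leakage
    pipe = make_pipeline(imputer, StandardScaler(), SVC(kernel='rbf'))
    scores = cross_val_score(pipe, X, y, cv=5)
    print(name, scores.mean())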