Science topic

Data Mining - Science topic

Explore the latest questions and answers in Data Mining, and find Data Mining experts.
Questions related to Data Mining
Question
9 answers
I'm a Master's student in the Faculty of Science. However, I can't find supervisors who work in the field of Data Mining. At first, I thought of doing my research on "Recommendation Systems", but since it depends on "Web Text Mining" and "Text Parsing", I thought it would be better to start with the last two. I have done previous research on both topics. The attached research reflects my field of interest. I would be glad if you advised me about the best place I should join: DMCM or elsewhere.
This is the website of DMCM:
Relevant answer
Answer
Dear Dina,
To start your PhD, I would recommend first analyzing the area of data mining in which you want to work, because data mining combines many fields: statistics, computer science, operations research. I am also doing a PhD in predictive analysis in data mining, and one can be good in computer science but perhaps not in statistics, so plan accordingly. Identify your research area, then collect some research papers and identify what new work can be done, and then prepare a synopsis. Most of us say "I have an interest in doing research", but the start is not right. Apart from this, whatever we study in our student life looks easy to us because it is a fully developed subject; as researchers we have to think about and prove something new, and that is where the problems start. Institutions play a great role only when the guide/supervisor under whom you are doing your PhD is good and has knowledge and experience in that area. Hope this information will help you :-).
Question
14 answers
Is there a good, easy-to-use open-source software package (with a graphical interface) for PCA, PCO and other data reduction methods? I know TMEV can do that. I also found some R packages, but I would like something faster.
Relevant answer
Answer
You can try GENESIS or Flexarray! Both are open source; you just have to register...
Question
11 answers
Is there anyone who works on SNA? Do you have any recommended tools for SNA?
I'll start some research related to SNA, especially for Twitter and Facebook, but it's hard for me to develop tools to gather data from them; basically I just want to get the text-based data from both. I'll then use text mining to analyze all that I need.
Could you please give me advice on what kind of tools I can use to get data from Twitter and Facebook?
Relevant answer
Answer
I suggest this book as a good reference for you: Russell (2011), "Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites".
Question
4 answers
Is it possible to compare the reliability of these data sets?
Relevant answer
Answer
The KDD Cup data is a well-known dataset considered for data mining purposes among scholars and researchers. Various machine learning algorithms can be applied to it, either for intrusion detection or for classification. However, since the data is outdated and the various approaches that applied it have reached almost 100% accuracy, real traffic is needed to show that a proposed approach works well on different and new data.
Question
31 answers
"The nature of statistical learning theory" was my favorite book during my PhD, and several years after. I still use "Elements of Information Theory" in my work. Recently a book has opened new horizons for me: "Prediction, Learning, and Games".
Do you have good books to recommend?
Relevant answer
Answer
"Neural Networks for Pattern Recognition" by Christopher M. Bishop
Question
8 answers
I'm trying to solve a link prediction problem for Flickr Dataset. My dataset has 5K nodes and each node has around 27K features, it is sparse.
I want to find similarity between the nodes so that I can predict a link between them if the similarity value is greater than some threshold that I decide.
One more problem is how to define this as a classification problem. I wanted to find overlapping tags for two nodes, so the table contains the nodes and some of their features (there will be thousands), and all of them will be in the positive class only, as I know there is a link between them.
I want to create a test data set with some of the nodes, create a similar table, and label them as positive or negative class. But my problem is that all the data I have is positive, so I think it would never be able to label anything as negative. How do I change this into a classification problem correctly?
Any pointers or help is very much appreciated.
Relevant answer
Answer
You can use rough set theory, evolutionary algorithms, PCA, HMMs, etc. If you need a simple way to solve this problem, MCDM techniques can help you.
Best wishes,
Mohammad
Question
6 answers
We consider a data stream which is not necessarily stationary. We would like to label instances on the fly.
Relevant answer
Answer
It really depends on your type of data. I think all classification algorithms can (and are mostly intended to) do online classification if you have enough training data available, it is just a matter of which one has best performance in your data. Also, whether you want to build a model with training data, save the model, and apply the model to new data; OR alternatively have the classification model trained and applied online... What data are you dealing with? Did you try any algorithms "off line" first?
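To make the on-the-fly option concrete, here is a minimal sketch of incremental (online) classification using scikit-learn's partial_fit; the stream batches, the two classes, and the toy labeling rule are all hypothetical.
```python
# Online classification sketch: the model is updated batch by batch as the
# stream arrives, instead of being trained once on a fixed training set.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])            # the full class set must be declared up front
clf = SGDClassifier()

rng = np.random.default_rng(0)
for _ in range(100):                  # each iteration = one mini-batch from the stream
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)   # made-up labeling rule
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(rng.normal(size=(3, 5))))
```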
Question
12 answers
What are the important factors to look into to improve the quality of software by predicting post-release defects?
Relevant answer
Answer
@Utkarsh: I see one problem with your metrics:
It is not clear what rules we should use to define the number of defects a piece of software has (it is hard to envision what exactly counts as a defect without a definition. Would it be the number of use cases for which it fails, or maybe the number of modules that don't function the way they should?)
I will admit it is a pretty interesting formula, though.
@Sam: For code-oriented metrics, you could try some kind of formal verification, but that's only going to work out if your developers/designers are mathematically and logically inclined. You could try, though, it will increase the reliability of the design at least, and consequentially, it will also increase the reliability of the implementation up to a certain point.
Question
3 answers
I want to develop an application in Java which takes as input the log file of a mail server in CSV format and outputs a graph for it. The graph should be generated by Gephi. Then I want to input that CSV file to KruskalMST.java (a program for finding the minimum spanning tree of the graph generated above). But before giving the CSV file to KruskalMST.java, I will convert all the weights to negative so that instead of getting the minimum spanning tree I get the maximum spanning tree, because I'm interested in the maximum spanning tree. KruskalMST.java will give me a file containing three columns (Source, Target, Weight). This file will again be input to Gephi, which will generate the maximum spanning tree graphically; then I want to do some analysis on these two graphs.
Until now I have done all these things manually and separately. I want to integrate all of them and develop a single application which has Gephi embedded in it.
Please give me some suggestions on how to proceed. Will gephi toolkit be helpful?
Relevant answer
Answer
You might want to take a look at nosql graph databases. See the current list:
Question
3 answers
Can anyone help me find a tool that allows me to download the old tweets in the history of a user. I need to study the content of the tweets of 2011 from a group of users who used a # hashtag.
Relevant answer
Answer
Hi Oscar, 
You can get Twitter data from Podargos Data. It can provide Twitter historical and real-time data. In addition, it can help you retrieve other mobile app data from social networks and e-commerce, and it supports nearly 100 languages.
Podargos's website is http://www.podargos.com.
Here is the data sample:
Question
9 answers
I know that this is an advantage of using decision trees, but usually people get very large and complex models...
Relevant answer
Answer
In the cases I have needed classification, I usually got best results with other methods, but had to use Decision Trees exactly BECAUSE people wanted to be able to see and interpret the decision rules. So on my limited experience the answer would be "every time".
Question
16 answers
Medical data sets are difficult to compare with other data sets, so classification techniques need to be combined to get accurate results.
Relevant answer
Answer
I think it is not a question of whether the data is from the medical field or not, but of the nature of the data. You may like to elaborate on the kind of data you are dealing with -- images, numeric values, qualitative values, etc. -- as well as their uncertainty/reliability aspects. The nature of the classification you are interested in will also be useful -- binary vs. multi-class, overlapping vs. mutually exclusive, etc. Classification is a very well-researched domain -- plenty of techniques and experiences can be found in the literature.
Question
7 answers
I wanted information about the future research scope in the field of data mining, in areas like anomaly detection in medical images and cancer detection.
Relevant answer
Answer
While you can tweak the anomaly detection algorithms for better performance, the biggest challenge is in making sure that the data itself is accurate, complete, and consistent.
Is anyone using Apache Mahout for their research?
Question
20 answers
I'm looking for people working with Apache Mahout in their research.
Relevant answer
Answer
I'm one of the developers of Mahout and I'm always interested in how people use it.
Question
8 answers
There is increasing research that uses data mining in education for different purposes. I want to see if there is a study about the skills of students, using data mining techniques.
Relevant answer
Answer
Safia,
The closest references that I have to what you are looking for are:
1) Nan Li et al. "A Machine Learning Approach for Automatic Student Model Discovery"
2) Tiffany Barnes "The Q-matrix Method: Mining Student Response Data for Knowledge"
Be mindful that you need to design an experiment in which you isolate the variables you are mentioning, and there are some challenges with the questions you want to address. The problem lies in:
1) Isolating the knowledge domain in which the student is involved
2) How to isolate results based on the machine learning technique, its interface, and the knowledge domain
3) Cultural differences among the subjects, which might influence the perceptions of the domain and interface and the mapping of those variables to the specific skills, learning styles, etc. that you want to measure.
While I am not saying that there are no papers on the subject, I am pointing out some of the factors for which you might not find papers dealing with the whole problem. I would suggest decomposing the problem into some of its parts to find better results. I would first look for information on HCI and specific competencies and then see if I could link them to papers on data mining techniques.
Hope this helps
Question
8 answers
I want to work on a categorical data set and find the best features by a filtering method or a wrapper one. There are only a few algorithms suitable for categorical data, so does anybody know any of the following for categorical data sets?
0 - benchmarks
1 - similarity measurements
2 - feature selection methods
3 - accurate clustering
(Sorry for my general question, and thanks for your participation.)
Relevant answer
Answer
You can also look at Jaccard Similarity and Locality Sensitive Hashing. See: Chapter 3 in http://infolab.stanford.edu/~ullman/mmds.html
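For illustration, here is a minimal sketch of the Jaccard similarity mentioned above, applied to two made-up sets of categorical attribute values.
```python
# Jaccard similarity between two sets: |intersection| / |union|.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two hypothetical records described by their categorical attribute values.
print(jaccard({"red", "small", "round"}, {"red", "large", "round"}))  # 0.5
```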
Question
4 answers
PPDM (privacy-preserving data mining).
Relevant answer
Answer
The field of privacy has seen rapid advances in recent years because of increases in the ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy. While the topic of privacy has traditionally been studied in the context of cryptography and information hiding, the recent emphasis on data mining has led to renewed interest in the field.
Some links:
Question
4 answers
Or is it only based on content?
Relevant answer
Answer
As far as I know, the original MediaWiki source that you can download and run on your own servers does not implement page ranking, so I guess that the original Wikipedia did not employ it. However, no one can really review Wikipedia's server and script configuration, so no one can be sure, but my guess would be no.
Question
1 answer
I have obtained a part of the TF-mRNA relationships from the CRSD database. However, I want to get a more comprehensive database of TF-mRNA relationships to validate results. I would be very appreciative if someone could provide a more comprehensive database.
Relevant answer
Answer
Sorry, I cannot help.
Question
11 answers
.
Relevant answer
Answer
Which type of datasets do you need? pcap files? NetFlows? CSV? Labeled? Not labeled? Real? Simulated?
Some botnet datasets have real attacks in pcap and CSV format. These are not specific to intrusion detection, but maybe they are useful to you:
- The Protected Repository for the Defense of Infrastructure Against Cyber Threats (PREDICT) published three botnet datasets up to May 16th, 2013. They are CSV text files where each line is a one-minute aggregation of the number of attempted connections of one IP address.
- The CAIDA organization published research about the Sality botnet (the Sipscan).
An interesting pcap repository is this one. However, most of the captures are not labeled:
Also, most payloads in the pcap datasets are not complete, like in:
where you can find a lot of pcap files from capture-the-flag contests, but you won't get the complete data.
Finally, three complete pcap botnet datasets can be found in (only botnet traffic, so they are labeled):
How to perform feature selection using mutual information?
Question
8 answers
For example, there is a document-term data set with terms as dimensions and to perform feature selection on the terms Mutual Information is used as the measure. After calculating the mutual information between all possible pairs of terms is it correct to set a threshold and select all the terms of the pairs that have an MI value less than the threshold?
Relevant answer
Answer
Consult the following papers: 1) "A Framework of Feature Selection Methods for Text Categorization", by Shoushan Li, Rui Xia, Chengqing Zong, Chu-Ren Huang. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 692-700. 2) "Effective Text Classification by a Supervised Feature Selection Approach", by Tanmay Basu and C. A. Murthy. In Proceedings of ICDM Workshops 2012, pages 918-925. 3) "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", by George Forman. Journal of Machine Learning Research, vol. 3 (2003), pages 1289-1305.
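As a hedged illustration of MI-based term selection: the sketch below scores each term against the class label (the setup used in the papers above) rather than between term pairs, using scikit-learn; the tiny corpus and labels are invented.
```python
# Select the k terms with the highest mutual information with the class label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["cheap pills online", "meeting agenda attached",
        "cheap offer online now", "project meeting notes"]
labels = [1, 0, 1, 0]                       # 1 = spam, 0 = ham (hypothetical)

X = CountVectorizer().fit_transform(docs)   # document-term matrix
selector = SelectKBest(mutual_info_classif, k=3).fit(X, labels)
print(selector.get_support(indices=True))   # indices of the 3 highest-MI terms
```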
Question
1 answer
The categorization is of a sentence/word into a particular class.
Relevant answer
Answer
Can you please explain what exactly you want to do? As per my interpretation, TF-IDF might be one of the solutions.
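If TF-IDF turns out to be what is needed, a minimal sketch with scikit-learn might look like this; the sentences, labels, and the choice of logistic regression are only illustrative.
```python
# Represent sentences as TF-IDF vectors and train a simple classifier on them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

sentences = ["the battery life is great", "the screen cracked quickly",
             "excellent battery and screen", "terrible build quality"]
labels = [1, 0, 1, 0]                        # 1 = positive, 0 = negative (made up)

vec = TfidfVectorizer()
X = vec.fit_transform(sentences)             # sentence-term TF-IDF matrix
clf = LogisticRegression().fit(X, labels)
print(clf.predict(vec.transform(["great screen"])))
```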
Question
2 answers
I am wondering if it is possible to break matrix inversion into a hadoop/mapreduce framework?
Relevant answer
Answer
Hello,
I think the Mahout package contains an implementation for that. Could you please check http://mahout.apache.org/?
Thanks
Mohammed
Question
2 answers
I am interested in applying the methods of evaluation of classifiers in KNIME. For some reason the ROC curve does not apply to regressions.
Can anyone help me?
Relevant answer
Answer
ROC curves apply to classification problems only (in the good old "eye diagram" in digital communications, from which the ROC has been borrowed, there are just two classes, bit "0" and bit "1"!). The following paper might be of interest to you: "A Comparison of Different ROC Measures for Ordinal Regression", Willem Waegeman et al., http://users.dsic.upv.es/~flip/ROCML2006/Papers/waegemanROCML06.pdf . Abstract: Ordinal regression learning has characteristics of both multi-class classification and metric regression because labels take ordered, discrete values. In applications of ordinal regression, the misclassification cost among the classes often differs, and with different misclassification costs the common performance measures are not appropriate. Therefore we extend ROC analysis principles to ordinal regression. We derive an exact expression for the volume under the ROC surface (VUS) spanned by the true positive rates for each class and show its interpretation as the probability that a randomly drawn sequence with one object of each class is correctly ranked. Because the computation of VUS has a huge time complexity, we also propose three approximations to this measure. Furthermore, the properties of VUS and its relationship with the approximations are analyzed by simulation. The results demonstrate that optimizing various measures will lead to different models.
Can we apply data mining techniques on "IDS signature database" for anomaly detection?
Question
4 answers
Can we extract or discover vital patterns from IDS signature database for further use in anomaly detection by the same IDS?
Relevant answer
Answer
I am assuming that you want to do post-processing of alerts to see if other patterns can be discerned, for example to detect APTs. My answer would be that it depends on the pattern you want to find. For example, simple aggregation of data obtained from a signature IDS can provide a basis to see whether it is purely random attacks or systematic hacking. Possible solutions would be: 1) correlation analysis of different alerts vs. targeted machines; 2) deriving sample data to train a machine learning algorithm to detect certain complex attack behaviours. The success of this strategy will depend on the quality of the data and the network design. Hope this helps.
Question
2 answers
Suppose I have a series of paragraphs; how do I know that, when put together, these paragraphs form a coherent article?
Relevant answer
Answer
It depends on the topic and the sequence of what you have. You should start from a wide view of the field and narrow down to your idea and work.
Could you recommend surveys/articles about discovering time sequence anomalies?
Question
1 answer
I have two anomalies connected with time: 1) I have the sequence ABCDE where the time between C and D in one appearance is 1 ms, but in a second appearance it is 10 ms, and I have to discover it. 2) I have two timelines with the same sequences and I have to discover the time shift between them. Thanks in advance, Tomek.
Relevant answer
Answer
Perhaps try the generalized fluctuation test approach? It is implemented in the strucchange package in R. Check out this: http://statmath.wu-wien.ac.at/~zeileis/papers/Zeileis%2BKleiber%2BKraemer-2003.pdf and this to start: http://www.sciencedirect.com/science/article/pii/S0167947309004435, as well as the vignette in the R package. A second option would be Bayesian change point analysis (R package bcp); info here: http://bioinformatics.oxfordjournals.org/content/24/19/2143.short and here: http://www.stat.yale.edu/~jay/Rmeetup/March4.2010/Docs/JSS2007.pdf
Question
1 answer
What new research topics are going on at IIT or IIIT?
Relevant answer
Answer
Please check my website: http://www.abonyilab.com (fuzzy clustering, classification, and association rule mining are still hot topics).
Question
4 answers
Can anyone suggest a way or a site to find Twitter data which contains tweets and user locations? Or is there anyone who has such data?
Relevant answer
Answer
Hi,
Podargos can help you find data from Twitter using machine learning. All you have to do is tell them your needs.
It can provide Twitter historical and real-time data. In addition, it can help you retrieve other mobile app data from social networks and e-commerce, and it supports nearly 100 languages.
Podargos's website is http://www.podargos.com
Here is the data sample:
Question
21 answers
I'm encountering a very high-dimensional classification problem. I'm looking for an efficient feature selection (or extraction) method.
Relevant answer
Answer
Your problem is a classification problem, so you need a supervised method because you have class labels.
There are many algorithms you can use. In order to SELECT a subset of features, I strongly recommend you use a multi-objective evolutionary algorithm (MOEA or MOGA) such as NSGA-II, and your objectives would be maximizing between-class distance, minimizing within-class distance, maximizing F-measure, minimizing FPR, and so on.
If you want to extract new features which construct a lower-dimensional space, I strongly recommend you use LDA or FLDA, which has a great impact on dimensionality reduction for supervised problems.
If your data is noisy, do some preprocessing, such as normalization and outlier removal, before any dimensionality reduction algorithm (MOGA or FLDA).
Ask me if you have more questions about that!
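A minimal sketch of the supervised feature-extraction route (LDA) suggested above, using scikit-learn on a toy dataset; your own data and number of components will differ.
```python
# Supervised dimensionality reduction: project the data onto the directions
# that best separate the class labels (at most n_classes - 1 components).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_low = lda.fit_transform(X, y)              # supervised projection to 2D
print(X_low.shape)                           # (150, 2)
```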
Question
14 answers
And what do you think the next big thing is in Machine Learning/AI/NLP?
Relevant answer
Answer
The next big thing is completely unsupervised algorithms for both image and audio with high accuracy comparable to the human brain.
Question
9 answers
I am looking for good conferences or workshops on the topic of data analytics and big data this year. Is there anything for which the submission deadline has not yet passed?
Question
17 answers
I am trying to combine both methods for classification of human grasping.
Relevant answer
Answer
Hi,
First you have to distinguish whether you want to classify data or to cluster it. If you are interested in classification, you should not use an unsupervised learning scheme like k-means or SOM, because the data clusters may not match the class distributions! For this case you should apply learning vector quantizers (LVQ) or SVMs... If you apply any unsupervised learning to those problems together with post-labeling, the accuracy might not be the best, because the unsupervised learning goal is different from classification!
If you are interested in clustering/quantization, then k-means is obviously not the best choice, because it is sensitive to initialization. SOMs provide more robust learning. Yet visual inspection (as provided by the SOM toolbox) has to be done carefully, because you have to check (it is not always guaranteed, as is frequently and wrongly assumed!) the topology preservation property of the trained map before starting visual inspection. Otherwise, visual inspection can mislead your thinking. But there is another alternative if you are not interested in data visualization and related considerations. A robust, fast variant of k-means, also adopting the idea of neighborhood cooperativeness but realized differently than in SOMs, is the neural gas vector quantizer proposed in 1993 by Thomas Martinetz. If you are just looking for a stable alternative to k-means, I strongly recommend this one.
~ Thomas Villmann, Germany
Question
3 answers
Has anyone ever tried to replace the 2nd-order methods in 2nd-order optimization with 3rd-order ones? What are the results? Better, worse, or not worth trying?
Relevant answer
Answer
Dear Dr. Nazri Mohd Nawi,
For the same accuracy level, a 3rd-order method will in most cases use fewer iterations than a 2nd-order method. However, the number of arithmetic operations per iteration is higher for 3rd-order methods than for 2nd-order methods.
Question
5 answers
Deep learners report very good results on many difficult learning tasks such as speech recognition, object recognition, natural language processing, transfer learning...
The hidden layers are trained one by one using an unsupervised learning procedure, such as autoencoders trained to reconstruct their inputs, where the input of each autoencoder is the hidden layer of the previous one.
At the end, a supervised layer discriminates the inputs.
Relevant answer
Answer
I would think that there are two reasons why they perform well:
1) The use of multiple layers has been known to produce better abstractions for a long time (see for example "An Introduction to Computing with Neural Nets" by Lippmann, where figure 14 explains the types of decision regions that can be obtained by adding layers).
2) It allows the creation of choke points that force the neural network to generalize (see "Reducing the Dimensionality of Data with Neural Networks" by Hinton and Salakhutdinov).
These two reasons may allow you to abstract to higher-level representations. The problem has been that techniques used on shallow networks, such as backpropagation, perform badly on such deep networks, which historically limited research on multiple hidden layers.
There is still much research to be done on deep learning to explain their performance.
Question
27 answers
Which regression model will be better? A linear regression model of higher order or a non-linear regression model of lower order?
Relevant answer
Answer
I agree with Raja Fawad Zafar. The model he suggested has been widely used, especially in studies on environmental factors where relationships between unknown variables are involved. Regarding books on nonlinear regression (R software), please follow the link below (and kindly use it for educational or research purposes only):
Question
8 answers
I am looking for a data set that contains tweets on consumer feedback towards products and services. Benchmark data sets are appreciated as well.
Relevant answer
Answer
Dear Chiung Ching Ho,
Please go through the following site:
Thanks,
Sobhan
Question
5 answers
I want to know how to implement association rule mining in a cloud environment. Is it only limited to outsourcing cloud space for storing intermediate data in the mining process, or is there anything more?
Relevant answer
Answer
You could design a map-reduce algorithm that would run on a computer cluster and discover the association rules. The storage could be distributed, for example using the Hadoop file system. The computer cluster can be in the cloud. That would be a simple way of doing "association rule mining in the cloud".
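A minimal, single-machine sketch of that map-reduce idea, showing only the counting phase for frequent item pairs; the transactions are made up and no real Hadoop cluster is involved.
```python
# Map step: emit (item pair, 1) per transaction. Reduce step: sum the counts.
from itertools import combinations
from collections import defaultdict

transactions = [["bread", "milk"], ["bread", "beer", "milk"], ["milk", "beer"]]

def mapper(transaction):                      # emit (pair, 1) for every item pair
    for pair in combinations(sorted(transaction), 2):
        yield pair, 1

def reducer(pairs):                           # sum counts per pair
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

print(reducer(kv for t in transactions for kv in mapper(t)))
```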
Question
45 answers
Let's have a set of (xi, yi), i = 1, 2, ..., n data points for which we do not know anything at all about the hidden functional form yi = f(xi). We want to find the true function f() with non-parametric methods, i.e. without adopting a model of any form and performing regression analysis. The task is complete only if our function can reconstruct data outside the initially given interval xi in [a, b], i.e. if it has predictive value. Otherwise, we don't care, since there exist many methods (cubic spline interpolation and others) to represent data inside the given x-range [a, b].
So, which method do you think is the 'best' one for solving the above problem?
Relevant answer
Answer
As regards extrapolation, I assume that this is an ill-posed inverse problem. However, it might well be that the maximum entropy approach can be applied to that problem. This is, for instance, used in recovering images that have been spoiled, e.g., by noise. [ http://www.et.byu.edu/~bjeffs/publications/Willis_ICIP_00.pdf ]
But this approach is not completely parameter free. Instead, when using a large parameter set, the bias is at least reduced by choosing the parameters that maximize the entropy. Of course, this does not help a lot if your model is too restricted even with the large parameter set.
Question
3 answers
I am working on a multi-relational classification technique and I want to combine classifiers that are trained on different relations.
Relevant answer
Answer
You could learn a new classifier on the outputs, most effectively when the respective classifiers give continuous (e.g., probabilistic) output.
Otherwise you could think of using one global classifier which can deal with many, possibly unnormalized, features, e.g. random forests or multiple kernel learning.
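A minimal sketch of the first suggestion (learning a new classifier on the outputs of base classifiers) using scikit-learn's stacking; in a real multi-relational setup each base learner would be trained on its own relation's features, which this toy example does not show.
```python
# Stacking: base classifiers produce (probabilistic) outputs, and a
# meta-classifier is trained on those outputs.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),   # trained on the base classifiers' outputs
)
print(stack.fit(X, y).score(X, y))
```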
Question
2 answers
I want to classify spatial image data.
Relevant answer
Answer
Though it depends on the end goal of what features you are looking for, you can, however, find these packages useful for that purpose: sp, rgdal, ggmap.
In spatial images you usually encounter both vector models and raster models. Based on your need and purpose, choose the package that best serves it.
Check out the attached presentation. And do not forget to review http://cran.r-project.org/web/views/Spatial.html for more details.
-Gopalakrishna Palem
Question
3 answers
Existing techniques include generalization, suppression, etc. For ensuring privacy, what anonymization techniques have been proposed in the last 2 years?
Relevant answer
Answer
If your data set is large, you might consider micro-aggregation. Consider 10 records at a time, average the variables for those records, and report the averages. This simple technique has been around for some time. A good answer depends on the nature of the data and the purpose the anonymized data will be put to.
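A minimal sketch of that micro-aggregation idea on synthetic numeric data: sort, form groups of 10 records, and replace each record by its group averages.
```python
# Micro-aggregation sketch: each record is reported as the mean of its group.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))                       # 100 records, 3 numeric variables

order = np.argsort(data[:, 0])                         # sort by one variable (a simple choice)
groups = np.array_split(data[order], len(data) // 10)  # groups of 10 records
anonymized = np.vstack([np.tile(g.mean(axis=0), (len(g), 1)) for g in groups])
print(anonymized[:3])                                  # each record replaced by its group mean
```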
Question
1 answer
Applying LSA to 500 PDF documents extracted from Google (for a certain feature), I got low accuracy once I tried to infer the topic of new documents.
What could be the reason for this?
Relevant answer
Answer
The fact that you used a phrase to select the 500 PDFs from Google does not mean that they contain material on the same topic, or even that they are homogeneous. The first 125 returns are more likely to contain your terms than the next 125. If you are using the 501st return from Google for testing, it is possible that the phrase only occurs once in the whole document, or that the document is an outlier. Try regarding the first 125 PDFs as 'seen' documents and the next 125 as 'unseen'. Are all these PDF topics poorly classified?
When all else fails, doubt your implementation of the algorithm. Check that all the assumptions for the latent semantic analysis are not breached. See e.g. http://en.wikipedia.org/wiki/Latent_semantic_analysis and its references.
Question
4 answers
Given that I have multiple runs of k-means on the same dataset (say $K_1$ to $K_p$), each run with random initialization, and given two points $x$ and $y$ and their distance $d=||x-y||$:
How can I bound the probability of them being assigned to the same cluster over the random runs of the algorithm?
I'm stuck at this sub-problem, which is part of a larger problem I need to solve. Any help would be great, maybe pointers to results or promising methods to grasp this.
[The Effectiveness of Lloyd-Type Methods for the k-Means Problem][1] by Ostrovsky et al. seems to be helpful, but I'm still working out the techniques it uses to see how I can use them to accomplish what I need.
Relevant answer
Isn't this the problem of finding a null model for the Rand index, as discussed by Hubert and Arabie (1985) in their article 'Comparing Partitions'?
Question
11 answers
I wonder if someone has extended the J48 classifier builder implemented in Weka to get an incremental version of the algorithm. For some classifiers, naive Bayes among them, Weka provides a batch and an incremental version, but I think this is not the case for J48. I also wonder if rebuilding the classifier with each new training example would be an alternative to an incremental version. I'd appreciate your help on these issues.
Relevant answer
Answer
For incremental induction of decision trees, you can see http://www-lrn.cs.umass.edu/papers/mlj-id5r.pdf . Incremental induction of decision trees has not been explored much in recent years because re-inducing a decision tree with new instances is less expensive and gives better results than incremental learning of DTs. However, I do not know if stream data has motivated new investigations in this area.
Related to your question, I think it is not useful to induce a new model with each new instance. In data streams there are methods to deal with this kind of problem. One approach I've seen is creating a pool of new instances and, after N new validated instances, inducing a new model. I suggest you look for literature in this area - data streams.
Question
5 answers
Some practical examples related to business would be of great help!
Relevant answer
Answer
K-means is a clustering method.
When you apply a clustering method to your dataset, it allows you to separate your data into groups that maximize the similarity between data in the same group and maximize the dissimilarity between data in different groups. The number of groups is an input parameter of the problem, that is, you choose it.
K-means groups the data and returns k centroids, i.e. k vectors that represent the center points of the groups, and a matrix that assigns each sample in your dataset to a group.
K-means is a hard method, i.e. it assigns exactly one class to each sample.
There is a soft variant of k-means, named fuzzy c-means, that allows you to assign each sample to different groups with a membership value.
Clustering methods belong to unsupervised learning theory, because they allow you to find hidden connections in the data, discovering knowledge.
For this reason, these methods are very relevant in the data mining process, where we want to "mine" knowledge from data.
Unsupervised problems are tricky: since the information we want to extract from the data is not known a priori, the evaluation of the results is not as simple as in classification tasks.
I'm not an expert on business data, but suppose that your dataset represents information about the customers of a shop. Then k-means can group the customers according to a similarity measure; see the sketch below.
Hope the main concepts are clear now! :)
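A minimal sketch of that customer-grouping example with scikit-learn's k-means; the two features (say, yearly spend and visits per month) and the choice of k = 3 are made up.
```python
# Group customers into k clusters and inspect the centroids and assignments.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[200, 2], [220, 3], [1500, 12], [1600, 10], [800, 6], [750, 7]])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

print(km.cluster_centers_)   # the k centroids (group "profiles")
print(km.labels_)            # hard assignment of each customer to a group
```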
Question
3 answers
I am going to analyze big data from data centers. But first of all I want to know in what form (data type) data is stored in data centers (Facebook, Twitter, Amazon, etc.).
What parameters should I use for classification/grouping of big data?
Thanks
Relevant answer
Answer
The classification should be formed based on the goals of what you want to achieve (things you want to output, measure and control).
For example, if your goal is 'high-throughput' then you classify big-data based on the parameters that affect the 'high-throughput' such as 'latency involved in processing the data', 'frequency of data input' and so on....
Try to come up with the business goals first; then you can easily identify the classification and groupings.
- Gopalakrishna Palem
Question
15 answers
I'm working on a piece of research for this - the data set is in the area of inpatient activity (hospitals). I would value your thoughts/ideas on this.
Relevant answer
Answer
"clear, tangible benefit", "get (and hold) their attention" = tie it to performance reviews. ;-)
Seriously, I wonder if the key is getting them easy-to-use tools that are linked to their data warehouse and/or EMRs, and making a short training that counts as admin time/CE/teaching load reduction/etc. - some time-based incentive. Of course, you'd need top-down leadership to make it even get off the ground. Something like Tableau or similar tools that allow the docs to ask and answer their own questions ("how many care gaps did I miss this month, and with what types of pts.?", "is my panel sicker than Dr. B's based on DxCG scores?", "how many labs do I order per patient as compared with the other docs in my hospital?", "do my obese pts. live within a quarter mile of a park, or will recommending outdoor exercise be fruitless and I need another tactic?", etc. I don't work with hospital data; those are just typical family med questions... no doubt the hospital docs have a huge set of their own.)
Once they're hooked on answering their own questions (esp. without having to wait a week or more for a BI answer), I'd imagine it'd be easier to start a cross-doc group, so they can riff off each others' questions.
As for who's done it--I think Kaiser Permanente is doing this already in healthcare, but I'm not sure which divisions or departments. Lots of banks, big businesses, etc. are doing this already, but it still seems to be a novel idea in health care...
Question
5 answers
I have a directed graph in which every node depicts a user and an edge between the user depicts that a mail has been exchanged between them. Weight of an edge shows the number of mails exchanged between the two users. I want to find the most weighted path from a node in this directed graph. I have used GEPHI to generate the attached graph. The graph is divided into different communities based on the weight of the edges. Each community is represented by a different color.
Relevant answer
Answer
Hello Ankit,
How about this approach:
1. Set the "cost" of each edge equal to its negative weight. For instance, if between node i and node j you currently have 5 emails exchanged, set c_{ij} = -5.
2. At the node you are interested in starting, let's call it the root node, create an artificial commodity with a supply of 1 unit.
3. At the node you are interested in ending (or if you are interested in all other nodes, this piece will be embedded in a loop) create an artificial demand of 1 for the commodity created in step 2.
4. Solve a minimum cost network flow (MCNF) problem with the root node as the origin, your other node of interest as its destination and all other nodes acting as transshipment nodes (i.e. flow balance constraints at those nodes equal 0).
Even if you use a generic solver, due to the network structure, network simplex should perform much faster than a typical linear programming solver. Moreover, due to the total unimodularity of the constraint matrix, you're guaranteed to get an integer solution. This means that the unit of flow that needs to travel from origin to destination will not get split up along the way, so your solution will be of the form "use this edge/don't use this edge" for all edges in your graph.
There are other graph optimization libraries (LEDA, Goblin, LEMON) which have even faster specialized MCNF algorithms implemented, so large-scale problems should not be problematic.
Note: If you are familiar with network algorithms, you will recognize what I've just described in steps 1-4 as a simple tweak on the shortest-path problem. The tweak is to make all weights negative (which, typically, a specialized shortest-path algorithm like Dijkstra's can't handle but Bellman-Ford can), such that minimizing this negative total cost is equivalent to maximizing your total sum of weights.
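A minimal sketch of steps 1-4 with networkx's min-cost-flow solver; the node names, mail counts, and the root/target choice are hypothetical, and it assumes the negated graph contains no negative-cost cycles that would divert the flow.
```python
# Steps 1-4 above: negate the edge weights, add a unit supply/demand pair,
# and solve the min-cost network flow problem.
import networkx as nx

G = nx.DiGraph()
mail_counts = {("a", "b"): 5, ("b", "c"): 3, ("a", "c"): 1, ("c", "d"): 4}
for (u, v), w in mail_counts.items():
    G.add_edge(u, v, weight=-w, capacity=1)   # step 1: cost = negative weight

G.nodes["a"]["demand"] = -1    # step 2: supply of 1 at the root (negative demand)
G.nodes["d"]["demand"] = 1     # step 3: demand of 1 at the target

flow = nx.min_cost_flow(G)     # step 4: network simplex under the hood
used_edges = [(u, v) for u in flow for v, f in flow[u].items() if f > 0]
print(used_edges)              # edges carrying the unit of flow, i.e. the heaviest path
```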
Question
5 answers
I need to find the best data mining technique to classify new data or points without a fixed number of classes, unlike with the k-means algorithm. I represent the form of my data with a vector.
Relevant answer
Answer
Do you allow that a new point might not fall into any original class and would form a new class?
If you have classes from a hierarchical clustering algorithm, once you've chosen the cutoff that determines your classes, you can classify a new data point by its distance from each of the classes. For example, if you did hierarchical clustering with average linkage, a new point would fall into the class whose center is closest to it. Other linkages would give you other metrics for the decision (e.g. closest point for single linkage, and so on).
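A minimal sketch of that nearest-centroid rule: compute the mean of each cluster obtained from the hierarchical cutoff and assign a new point to the closest one; the points and labels are invented.
```python
# Assign a new point to the cluster whose centroid is nearest (average-linkage style).
import numpy as np

points = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])             # labels from a hierarchical clustering cutoff

centroids = np.vstack([points[labels == c].mean(axis=0) for c in np.unique(labels)])
new_point = np.array([4.8, 5.0])
print(np.argmin(np.linalg.norm(centroids - new_point, axis=1)))   # -> cluster 1
```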
What is the latest algorithm to find the similarity between data of two or more clusters?
Question
3 answers
I am using CURE clustering, a hierarchical clustering technique.
Relevant answer
Answer
You can see the following papers: "Measuring Similarity between Sets of Overlapping Clusters" by Mark K. Goldberg [http://www.cs.rpi.edu/research/pdf/10-01.pdf] and "Clustering Similarity Comparison Using Density Profiles" by Eric Bae [which can be found at citeseerx.ist.psu.edu/].
Question
8 answers
I read an article by Pedro Domingos titled "A Few Useful Things to Know about Machine Learning" (Oct 2012), and I do not buy into how he says that a 'dumb algorithm' and a tremendous amount of data will provide better results than moderate data and a cleverer algorithm. What if adding more data gives you more noise or irrelevant information? The only justification I see is that you can go through more iterations of the data and that you have more ways to learn from it. I can't see that claim as sufficient or sound enough for this to be valid.
Relevant answer
Answer
His section "More data beats a cleverer algorithm" follows the previous section "Feature engineering is the key". In this context he is probably right, but with this priority: the representation "is the most important factor" (his quote) for the success of the project.
I think what he means is that IF you got the 'right' representation and the appropriate amount of data you are OK without the fancy algorithms. However, what he should have stated more explicitly is that if you don't have a 'good' representation---which is almost always the case---no amount of data or clever algorithms will produce satisfactory results.
What he and almost everyone else do not realize is that there are fundamentally new representations which could radically change the usual rules of the game.
I'm quite sure that the PR / ML obsession with better algorithms can be explained by the lack of good representations: if you don't have access to a 'good' data representation and you like statistics, you start inventing "cleverer" algorithms. ;--)
Question
7 answers
Can anybody tell me about recent sequence mining techniques?
Relevant answer
Answer
Sequence mining is a broad research area. It can cover several types of techniques, such as algorithms for:
  • sequential pattern mining, which consists of discovering patterns common to several sequences;
  • sequential rule mining, which consists of discovering rules appearing in a set of sequences;
  • episode mining, which consists of discovering patterns appearing in a single sequence;
  • periodic pattern mining, which consists of discovering a pattern periodically occurring in a sequence;
  • etc.
Besides, some other closely related topics are sequence prediction, sequence classification, and sequence clustering.
You can search for each of these topics on Google Scholar to see the most recent algorithms. Also, if you want to try some, you can check the SPMF data mining library (I'm the founder), which offers many recent algorithms for mining patterns in sequences.
Question
4 answers
All of the above have some common ground. In your opinion, what is the real difference between these fields? If you were asked to classify your own research area, what would it be?
Relevant answer
Answer
In layman's language, statistics is a way to infer patterns from data based on existing model; machine learning is a heuristics to have the computer form its own model from the data; data mining and pattern recognition are applications (not methods) that can be done through either statistics or machine learning; and pattern recognition is a sub-field of data mining. Many people would just claim they do all of them, I guess.
I do woodworking and carpentry using routers and saws, etc., BTW ;)
Question
14 answers
I propose that the *only* way ahead is via a fundamentally new representational formalism that would clearly reveal what critical information has been missing from the conventional forms of data representation. That is, we need to get to a form of data representation that captures a completely new side of objects that we have been missing so far, which can only be achieved by relying on entirely new representational 'tools'.
Relevant answer
Answer
Indrajeet, you seem to be confusing the mind and the brain all the time. ;--)
Question
17 answers
My svmlight output for binary classification (+1 for the true and -1 for the false class) looks like:
+1 0.001
-1 -0.34
+1 0.9
+1 0.55
-1 -0.2
I tried generating an ROC curve in R using the ROCR package. But I found that in all the ROCR examples, TPR, FPR and cutoff values were used. TPR and FPR can be calculated from the above data, but I will get only one value, and I am also confused about the cutoff. In this case the cutoff ranges from -1 to +1, right?
Can anyone help me with how to supply the values from the above data for drawing an ROC curve with the ROCR package?
Relevant answer
Answer
Thanks for the help, Mr. Iordan. Actually, with a lot of browsing I corrected the errors and could also manage more than one curve in the ROC plot, so I got a plot of multiple curves. But AUC in R was giving me problems, so finally I managed with Excel, which has a formula to calculate AUC.
For AUC in R I tried RocAUC and pAUC, but nothing worked. I was giving different names for each dataset.
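If a cross-check outside R is ever useful, here is a minimal sketch that computes the ROC curve and AUC from (label, score) pairs like the svmlight output above, using scikit-learn rather than the ROCR package discussed in this thread.
```python
# ROC curve and AUC from labels in {-1, +1} and real-valued scores.
from sklearn.metrics import roc_curve, auc

labels = [1, -1, 1, 1, -1]
scores = [0.001, -0.34, 0.9, 0.55, -0.2]

fpr, tpr, thresholds = roc_curve(labels, scores)   # cutoffs are taken from the scores
print(auc(fpr, tpr))
```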
I would like to model the "Modus operandi" in cyber crime from a user's fraudulent cyber-activities. How can this be achieved?
Question
48 answers
Help with effective algorithm.
Relevant answer
Answer
Take a look at Wolfram's "A New Kind of Science" if you would like to simulate some 'intelligent' behavior. For me, it is an old/new approach toward the simulation of unpredictable things. I love this approach. Best regards, Johann.
Question
1 answer
I just wanted to start exploring the possibilities of simulating bioinformatics problems using artificial intelligence.
Relevant answer
Answer
You wish to predict a complex phenomenon. Machine learning (SVMs, neural networks) requires experimental data to be available; "prediction" does not come from nothing. So why don't you use an exploratory response surface, a classical method of sequential statistical optimization?
Question
4 answers
I'm fitting some multivariate observation sequences to a hidden Markov model using R's RHmm package. But I keep getting the following error: "Error in BaumWelch(paramHMM, obs, paramAlgo) : Non inversible matrix". Does anyone know what happened? Or does anyone know where I can find other implementations?
Relevant answer
Answer
Hi,
If you have too few observations and too many hidden states, you can get singular matrices. Another reason might be a poor initialization. You can try a different initialization. You could also try adjusting the prior. If your matrix is singular then a different toolbox won't help...
Dave
Question
12 answers
Lately I have started to get interested in data mining research applied to reducing poverty in some area or country. Does anyone work on this kind of research?
My previous work was on classification methods, especially generalizing binary multisurface proximity methods [by O. L. Mangasarian] to multiclass using OVA. I also did other research related to that model [dealing with missing values, online learning, error bounds, etc.].
I am hoping that I can get pointers/updates on how to relate my new interest [data mining & poverty] to my previous work [classification methods]. Any help/discussion will be greatly appreciated.
Relevant answer
Answer
I think data mining methods will help in identifying interesting patterns in the data set and also give you informative segmentations of the data. That is an area I would love to explore more, though now I am applying them to data from public health research.
Question
24 answers
Big Data is one of the buzzwords of the year, and we have heard arguments along the full spectrum - from having so beautifully many data available that we can solve huge societal challenges, over worries concerning data archiving, to statements such as 'a tsunami full of rubbish that nobody will be able to use' or (more scientific) doubts about being able to deal with the huge heterogeneity in semantics. Now, I would like to get your opinion on where this debate has led us and where it might go.
Relevant answer
Answer
I agree on this one, Raphaël.
There is a trend to have more and more (heterogeneous) data available from diverse sources, and it is not bad to name this 'phenomenon'. In the end, this will not (or rather should not) alter our daily research work. Still, apart from possible disruptions, we might in some cases benefit from newly released information - if we find it ;-)
By the way, I gave up chasing a single model (and format) that suits all. In the end, data is collected and optimized for a specific purpose. If you intend to use it for something else, you have to live with some conversion effort and be careful about its fitness for a purpose other than the one it was initially collected for.
Question
3 answers
I need to perform a PARAFAC decomposition in R and I need to understand how I can implement it. Is there a package that performs PARAFAC and gives me the factor matrices?
Relevant answer
Answer
Multiway tensor decomposition in R, including PARAFAC:
Hope this helps.
Question
7 answers
I'm using the J48 decision tree algorithm with Weka, but it doesn't build a tree.
It only says Number of Leaves:1 and Size of the tree:1
Relevant answer
Answer
A "tree size of 1" means that splitting the data under that particular condition is not considered worthwhile by the J48 pruning mechanism. You don't have to use the pruning mechanism; you can just construct an unpruned tree. There are many options in J48, and you can choose no pruning.
Question
5 answers
I want to classify images based on SIFT features, so what tool can I use to extract them? I have used http://koen.me/research/colordescriptors/ to extract the SIFT features, but the problem I am facing is that the resulting file becomes too large. I cannot pass that file to an SVM, as one file has approximately 12,000 rows. The images I am using have dimensions of e.g. 1024x683. The SIFT feature file must contain less information so that I can pass hundreds of images to the SVM at the same time.
Relevant answer
Answer
Try using SURF features. They are similar but faster. There is an implementation on the Google Code site:
Question
1 answer
I usually use Latent Dirichlet Allocation to cluster texts. What do you use? Can someone give a comparison between different text clustering algorithms?
Relevant answer
Answer
Bisecting K-means
Question
4 answers
Classical approaches perform badly on the minority class, even if the global learning is satisfactory.
Relevant answer
Answer
One-class classification methods were designed precisely for this type of problem. It is often called novelty detection, outlier detection, or abnormality detection.
The idea is that you learn solely from the majority class (normal), so that any deviations from it are regarded as the minority (abnormal). For example, many common classifiers such as k-means, SVMs, MoGs and nearest neighbour have been adapted for this problem.
David Tax from Delft University is very knowledgeable in this field. His PhD thesis is a very good read, as are his numerous papers in this area.
If you use Matlab, there is a toolbox dd_tools which contains these methods.
Finally, when measuring your classifier's performance, I find the balanced error rate (BER) measure useful. It is the average of the false positive and false negative rates.
Consider 90 positive and 10 negative examples and a classifier that treats all objects as positive. Here, accuracy would be 90%, which sounds really good but hides the poor performance for your situation. The BER here would be 0.5*(0% + 100%) = 50%, a poor value that reflects the weak classifier; see the sketch below.
Hope that is of some use.
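A minimal sketch of the BER computation on the 90/10 example above; the labels and the all-positive classifier are exactly the made-up scenario described.
```python
# Balanced error rate vs. plain accuracy on a 90/10 imbalanced toy example.
import numpy as np

y_true = np.array([1] * 90 + [0] * 10)   # 90 positives, 10 negatives
y_pred = np.ones_like(y_true)            # classifier that predicts everything positive

fnr = np.mean(y_pred[y_true == 1] == 0)  # false negative rate: 0.0
fpr = np.mean(y_pred[y_true == 0] == 1)  # false positive rate: 1.0
ber = 0.5 * (fnr + fpr)                  # 0.5, i.e. 50%

accuracy = np.mean(y_pred == y_true)     # 0.9, i.e. 90%, despite the weak classifier
print(accuracy, ber)
```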
Question
23 answers
Recently when I tried to visualize similarities between different groups in a social network, I discovered an interesting phenomenon which I couldn't explain. Can someone give me some insight on this?
The story is as follows: Users can create different groups on our social network website, and we'd like to visualize the similarities between different groups in 2D. We use the group members and their page views within the group as the feature vector of each group, and the similarity between groups is computed as the cosine similarity between their feature vectors. I used multidimensional scaling ( cmdscale in R ) to reduce the data into 2D and visualized the data.
The result of the MDS is points lined up along some lines orthogonal to each other. Can someone explain why this is happening?
Relevant answer
Answer
I think @Kartikeya got it right. The graph actually represents 4 overlapping communities in the underlying network along with some outliers. This could possibly be verified by the topological graph clustering (overlapping community detection) from the underlying network. If you are somehow able to show some level of mapping between the groups identified by using your feature vectors and the communities identified by some overlapping community detection method on the underlying network, It could be a significant contribution towards analyzing the grouping tendency in social networks.
Question
4 answers
Hi everyone, I want to ask: can we apply graph mining techniques to stream data coming from sensor networks? And what can we get from this mining? Thanks very much. Data streams coming from sensor networks are very different from data coming from the web or social networks.
Relevant answer
Answer
Hi, I think a good reference on this topic is http://www.charuaggarwal.net/integrating.pdf . Charu briefly introduces ideas on mining sensor network streams. Hope it helps.
Question
11 answers
If we want to define a size for a concept, what is the best criterion for doing that?
Is it possible to use a greedy set covering algorithm and take the minimum number of spheres covering the positive instances as a concept complexity measurement?
Relevant answer
Answer
Interesting question. But can you give some examples of the concepts you want to learn and of how those concepts are represented in your input space? Concepts can be understood in different ways...
Question
9 answers
cmdscale in R is pretty slow. It takes about 1.5 hours to do multidimensional scaling on 10000 points.
Relevant answer
Answer
High-Throughput Multidimensional Scaling (HiT-MDS).
You may find this useful:
Question
2 answers
See my question
Relevant answer
Answer
Metsis et al. describe derivatives of the Enron dataset that were created for anti-spam evaluation:
Can anyone suggest dissertation question(s) or project idea(s) to research for my final year as an undergraduate of Computer Science in Big Data?
Question
5 answers
I am interested in the Big Data technology, however, I am a newbie to this. I would like to dive in and learn as much as I can in one academic year. My strategy is to do this as part of my final year project. I already have experience in web development, human computer interaction and relational databases. I would greatly appreciate it if anyone can suggest a question that revolves around these three areas, as it seems to me that Big Data is the future of Marketing (though I'm not particularly interested in Marketing).
Relevant answer
Answer
You can work on recommender systems or association rules in marketing.
Question
3 answers
I am working in the data mining area, leading to knowledge acquisition. Can we try our own algorithm in the WEKA tool? How far is it feasible and accepted?
Relevant answer
Answer
Recently I have been working on several classification algorithms and I strongly recommend a Java machine learning library: http://java-ml.sourceforge.net/ . It is rich and already integrated with WEKA. Java-ML is well documented, so you can use it from your source code. In my opinion WEKA is more tool-oriented (you often use WEKA as an application to evaluate some data mining problems).
Question
17 answers
I need to do clustering on a dataset composed of both numerical and categorical data. What algorithms do you recommend ?
Relevant answer
Answer
@Afarin Adami: That's a good idea. It's basically the Hamming distance between the categorical variables. You could compute that, and also a Euclidean distance between the numerical variables, and give a weight to each, depending on the number of categorical and non-categorical variables, the range of the numerical features, etc.
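A minimal sketch of that weighted mixed distance (Hamming on the categorical part, Euclidean on the numerical part); the feature split and the 0.5/0.5 weights are arbitrary choices for illustration.
```python
# Weighted mixed distance for records with both numeric and categorical features.
import numpy as np

def mixed_distance(a_num, a_cat, b_num, b_cat, w_num=0.5, w_cat=0.5):
    euclidean = np.linalg.norm(np.asarray(a_num) - np.asarray(b_num))   # numeric part
    hamming = np.mean([x != y for x, y in zip(a_cat, b_cat)])           # categorical part
    return w_num * euclidean + w_cat * hamming

# Example: two records with (scaled) numeric features and two categorical ones.
print(mixed_distance([0.2, 0.7], ["red", "yes"], [0.5, 0.1], ["blue", "yes"]))
```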
Question
8 answers
In my research, I need to visualize each tree in a random forest in order to count the number of nodes included in each tree. I use the R language to generate the random forest but couldn't find any command to satisfy my need. Do you have any idea how I can visualize each tree in a random forest, or how I can calculate the number of nodes in each tree?
Relevant answer
Answer
Thank you, Paul, for all the information you have provided.
Actually, in my work I check many parameters of the random forest, but one that I couldn't calculate yet (and certainly need) is the number of nodes in each tree of the random forest. I use the 'randomForest' package in R to generate my model, but I couldn't find any function that presents the random forest structure (from the point of view of each decision tree's structure).
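In case a Python cross-check is acceptable (this is scikit-learn, not the R randomForest package used in the question), each fitted tree exposes its node count directly; the iris data is just a placeholder.
```python
# Count the nodes of every tree in a fitted random forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

node_counts = [est.tree_.node_count for est in rf.estimators_]
print(node_counts)   # number of nodes in each tree of the forest
```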
Question
11 answers
I have to normalize data which has values from 100 to 1000 (numeric values). I read some material about normalization techniques, e.g. min-max normalization, cosine normalization, non-monotonic normalization, etc., but I don't know which technique is best for me. I read a paper suggesting that non-monotonic normalization is better, but I haven't found any good material about it.
Relevant answer
Answer
Well, given that your data has a limited range, not much variability between the minimum and maximum, and only one dimension, I think min-max would be a suitable option for normalization.
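For reference, min-max normalization is just a rescaling by the observed range; a minimal NumPy sketch (the values are made up):

```python
import numpy as np

x = np.array([100.0, 250.0, 480.0, 760.0, 1000.0])
# x' = (x - min) / (max - min) maps the data onto [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)   # 100 -> 0.0, 1000 -> 1.0
```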
Question
10 answers
Will it be useful in terms of accuracy and speed of classification?
Relevant answer
Answer
If you are able to provide sensible features that enable such a disease classification, SVM is surely one of the first choices to use. It usually gives state-of-the-art performance, with runtime demands scaling cubically with the number of provided training examples. So everything depends on your dataset's complexity and size, and on whether your task at hand is well-posed (in the sense that the provided features carry enough information to realize the mapping toward class labels).
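As a generic illustration (the synthetic data below stands in for whatever disease-related features are actually available), an SVM with cross-validated accuracy in scikit-learn looks roughly like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix and disease labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Scaling the features first usually matters a lot for SVMs.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print("mean cross-validated accuracy:", scores.mean())
```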
Question
7 answers
I'm trying to collect information on this area, but most existing papers seem to target network-based DoS detection, whereas I could not find much about application-level approaches.
Relevant answer
Answer
Antonio,
As you have already established, most of the work in DoS detection is based on network intrusion. Some time ago I did a survey of the current state of intrusion detection and found that most of the literature used KDD Cup 99 as a benchmark for their tests. The literature was filled with this benchmark even though the set has been criticized (and it is the main data set used in benchmarking data mining algorithms for detecting DoS). The core problem is the lack of a good benchmark on which to publish results.
I would suggest looking into more general topics which can be adapted to host level detection of DOS attacks using ML such as:
P. Kola Sujatha, A. Kannan, S. Ragunath, K. Sindhu Bargavi, and S. Githanjali. A Behavior Based Approach to Host-Level Intrusion Detection Using Self-Organizing Maps. In ICETET '08: Proceedings of the 2008 First International Conference on Emerging Trends in Engineering and Technology, pages 1267-1271. IEEE Computer Society, Washington, DC, USA, 2008.
H. Kim, J. Smith, and K. Shin. Detecting energy-greedy anomalies and mobile malware variants. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services, pages 239-252. ACM, 2008.
These might give you some ideas on other topics that might help you.
Hope this helps.
Question
2 answers
Mahout is an Apache Foundation project for building scalable machine learning libraries.
Relevant answer
Answer
We are using it for text mining solutions.
Question
8 answers
Is it possible to crawl emails, either in the inbox or under a label of an email account?
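For what it's worth, a minimal sketch with Python's standard imaplib, assuming an IMAP-enabled account (the server name, credentials and label are placeholders):

```python
import email
import imaplib

conn = imaplib.IMAP4_SSL("imap.gmail.com")
conn.login("user@example.com", "app-password")   # placeholder credentials
conn.select('"MyLabel"')                          # or "INBOX"

# Fetch and print the subjects of the first ten messages.
_, data = conn.search(None, "ALL")
for num in data[0].split()[:10]:
    _, msg_data = conn.fetch(num, "(RFC822)")
    msg = email.message_from_bytes(msg_data[0][1])
    print(msg["Subject"])
conn.logout()
```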
Question
11 answers
A good researcher in their field is considered good by the number of their publications. How far is that true? Are there any other criteria?
Relevant answer
Answer
Very good question, but I think the number of publications counts if you publish in high-quality journals.
I heard from my supervisor about other criteria, such as:
- whether your research has generated any results/products that are used by industry.
Some interesting info regarding the definition of research:
Question
1 answer
During data entry, we can correct errors before mining by creating entry forms that control question ordering and question reformulation.
Question
2 answers
How can a support vector machine algorithm be implemented in image data mining?
Relevant answer
Answer
X. Song, Z. Duan, and X. Jiang. Comparison of artificial neural networks and support vector machine classifiers for land cover classification in Northern China using a SPOT-5 HRG image. International Journal of Remote Sensing, 33(10), 2012.
J. Knorn, A. Janz, V. C. Radeloff, T. Kuemmerle, J. Kozak, and P. Hostert. Land cover mapping of large areas using support vector machines for a chain classification of neighboring Landsat satellite images. Remote Sensing of Environment, 113:957-964, 2009.
C. Huang, L. S. Davis, and J. R. G. Townshend. An assessment of support vector machines for land cover classification. International Journal of Remote Sensing, 23(4), 2002.
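As a rough illustration of how such per-pixel SVM classification is typically set up, here is a scikit-learn sketch (the synthetic array stands in for a real SPOT/Landsat scene, and the training labels are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
image = rng.random((50, 50, 4))              # height x width x bands
pixels = image.reshape(-1, image.shape[-1])  # (n_pixels, n_bands) features

# A few labelled training pixels (indices and class labels are made up,
# e.g. 0 = water, 1 = forest, 2 = urban).
train_idx = rng.choice(pixels.shape[0], size=200, replace=False)
train_labels = rng.integers(0, 3, size=200)

clf = SVC(kernel="rbf", gamma="scale").fit(pixels[train_idx], train_labels)
class_map = clf.predict(pixels).reshape(image.shape[:2])
print(class_map.shape)  # (50, 50) land-cover map
```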
Question
12 answers
I have a set of data, i.e. some features extracted from an image. I need to train a classifier on them and then test it. How is this possible using the WEKA software? Which is the best classifier?
Relevant answer
Answer
First, install the WEKA tool and read a tutorial to understand how to operate it.
WEKA is the best tool for machine learning and data mining.
As far as classifiers are concerned, you can use them after you import your statistical metrics or quantities into WEKA by reading your column or row data from a .csv or .txt file. WEKA has various classifiers and can generate a decision tree from which you can model and structure your own algorithm.
Basically, WEKA provides a statistical learning / pattern recognition tree that you can adapt to your own algorithm.
You can use trees.J48, a clone of the C4.5 decision tree learner.
Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary.
Follow the above instructions and you will see the value of the WEKA tool.
Good luck!
Question
12 answers
For example, different types of genetic algorithms.
Relevant answer
Answer
There are some good software tools for data mining problems. I have worked with Weka, SPM (Salford Predictive Miner) and Statistica before, and they gave very good results, especially for classification, clustering and visualization problems. However, I don't understand what you mean by "implementation of different types of genetic algorithms" — do you want to write the code yourself, or just apply them to a specific problem? In these tools the code is already implemented and you can simply choose the classifier you want to use.
Question
1 answer
Can anyone kindly explain what the standard deviations in the roughness are? And how can one extract data using correlation lengths or pair correlation functions, as usually calculated in typical AFM software (NT-MDT NTEGRA model, NOVA software)?
Relevant answer
Answer
Have a look at the following book by Yiping Zhao: Characterization of Amorphous and Crystalline Rough Surface: Principles and Applications.
It is an excellent review of surface characterization using real-space and reciprocal-space methods. Alternatively, you can also try to find her papers on the same topic.
Question
2 answers
Gaussian mutation makes small random changes to the individuals in the population. It adds a random number drawn from a Gaussian distribution with mean zero to each vector entry of an individual. The variance of this distribution is determined by the parameters scale and shrink.
My question is: do all the genes in the chromosome get affected by Gaussian mutation?
Is there any paper/material that visualizes Gaussian mutation? There are many (visual) examples of crossover operations, but I can't find any for the Gaussian operator.
Relevant answer
Answer
Regarding your first question: typically, the number of genes that mutate is also chosen at random. Usually this probability is 1 divided by the number of genes, i.e. one gene per chromosome on average.
Regarding the question about visualization: assume that every allele takes values on a numerical scale, i.e. you deal with a multidimensional space. Then the visualization of any Gaussian mutation is straightforward: every chromosome is a point in this space and every mutation is a translation along a subset of the dimensions.
You could even plot the chances of translation from a parent chromosome: the mutation corresponds to a hypersphere around the point/chromosome, whose density/opacity corresponds to the probability.
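To make the first point concrete, here is a small Python sketch of Gaussian mutation with a per-gene mutation rate of 1/n and a standard deviation that shrinks over the generations. The exact semantics of "scale" and "shrink" depend on the specific GA toolbox, so treat the parameters as illustrative:

```python
import numpy as np

def gaussian_mutation(chromosome, generation, max_generations,
                      scale=1.0, shrink=0.5, rng=np.random.default_rng()):
    n = len(chromosome)
    # Standard deviation shrinks linearly over the generations.
    std = scale * (1.0 - shrink * generation / max_generations)
    # Each gene mutates with probability 1/n, i.e. one gene on average.
    mask = rng.random(n) < (1.0 / n)
    noise = rng.normal(0.0, std, size=n)
    return chromosome + mask * noise

child = gaussian_mutation(np.array([0.2, 1.5, -0.7, 3.0]),
                          generation=10, max_generations=100)
print(child)
```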
Question
13 answers
Each group of synonyms (English) is linked to a bibliographic reference. The database (Bibliomac, for Macintosh only) is driven by software called 4D. Unfortunately this database is no longer developed. It's a pity, because the ability to link a group of synonyms with bibliographic references is uncommon and powerful for data mining. Reference Manager can do something similar, but I don't know how to transfer this list from Bibliomac to it because I'm not a computer specialist. Could somebody help me? I can send the database and the software (unfortunately in French).
Relevant answer
Answer
Monika, thank you for your kind remarks. Johannes is right. My main purpose is to convert an old-fashioned and obsolete database (Bibliomac) into a more modern one while keeping the main functionality of Bibliomac: the ability to link a bibliographic reference to a group of synonyms. EndNote is not able to do this. Would Zotero be?
The UMLS and MeSH ontologies are very good for managing synonyms of simple semantic entities (for example: adrenoceptors/adrenergic receptors) but are less efficient for complex queries: for example, VEGF expression/production/secretion by cancers/tumors, or effects of thyroid hormones on the heart. The boolean operator AND does not always work, because it catches the different terms without respecting the hierarchy of the sentence (of, in, by) and broadens the search too much (for example cancer, cancer cells, cancer cell lines). Using the entire sentence and its synonyms avoids this. That is the main reason I think my list could be useful, provided the link with the references is kept. The second reason is that these descriptors were made after reading the full text of the article and cannot be deduced from the title, the abstract or the author's keywords. Ideally, it could be the first step in building a knowledge database with the help of a network of biomedical researchers (following the model of the protein-protein interaction work of the Van Buren Lab). What do you think about this possibility?
Question
11 answers
I have a data set (267 records) with 5 predictor variables, with several missing values in the third variable (categorical, multinomial). I would rather use single imputation to replace those missing values, but some papers say it is not powerful enough to estimate the values. I'd like to know whether there is any multiple imputation method I can use that is powerful and simple. I'm open to any suggestion.
Relevant answer
Answer
That depends on the type of model that you are developing. In previous work, I found it preferable to work with the available information rather than introduce some sort of bias into the data set.
The "right" approach to handling missing data depends on the type of model you are developing. For instance, if there is enough correlation between the predictor variables, you can try to exploit that; chemometrics can provide a way to approach this. The general idea behind this approach:
Is there a java-based framework for clustering algorithm evaluation?
Question
28 answers
To evaluate a clustering algorithm or to compare the performance of two clustering algorithms we need some measure such as error, recall, precision, rand index. Is there a java-based framework for clustering algorithm evaluation?
Relevant answer
Answer
- JSAT: http://goo.gl/gpQ7F (some unsupervised measures)
- Java-ML: http://goo.gl/JUFXY (more unsupervised measures)
- Weka: http://goo.gl/wTyuN (supervised measures available)
- ELKI: http://elki.dbs.ifi.lmu.de/ (seems even better than Weka)
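For cross-checking, the supervised measures mentioned in the question are also readily available outside Java, e.g. in scikit-learn (illustrative Python snippet comparing a predicted partition to ground-truth labels):

```python
from sklearn.metrics import adjusted_rand_score, rand_score, v_measure_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]

print("Rand index:         ", rand_score(true_labels, pred_labels))
print("Adjusted Rand index:", adjusted_rand_score(true_labels, pred_labels))
print("V-measure:          ", v_measure_score(true_labels, pred_labels))
```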
Question
3 answers
I know Multidimensional Scaling uses only a distance matrix, but Self-Organizing Map requires coordinates of points in the original space. What are some other dimensionality reduction techniques, such as Multidimensional Scaling, that need only a distance matrix rather than point coordinates?
Relevant answer
Answer
Thanks, Evaldas. I just found that Isomap works pretty well for my problem. Nice research, though!
Question
3 answers
Is there a well defined structure to keep graphs in memory?
Relevant answer
Answer
The usual ways of representing graphs are either adjacency matrices or lists of nodes (structs, records, objects) that themselves contain lists of pointers to their neighbors (or indexes of the neighbor nodes).
Recently there was a publication about GraphChi, an algorithm where most of the graph is stored on hard disk when RAM is too small.
Another trick is the Bloom filter. It's a probabilistic way of compressing a graph. If you use it, for example, for storing edges, false edges are possible, but not false non-edges. An application might be reachability: when a node can't be reached via the edges encoded in a Bloom filter, then it is certain that it can't be reached in the original graph.
Regards,
Joachim
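As a toy illustration of the adjacency-list and Bloom-filter ideas above (a deliberately simplistic sketch; real Bloom filters use carefully sized bit arrays and tuned hash counts):

```python
import hashlib

# Adjacency-list representation: node -> list of neighbours.
graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}

class TinyBloom:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes, self.bits = n_bits, n_hashes, 0

    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

edges = TinyBloom()
for u, nbrs in graph.items():
    for v in nbrs:
        edges.add((u, v))

print((0, 1) in edges)   # True
print((3, 0) in edges)   # False (though false positives are possible)
```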
Question
8 answers
To evaluate a clustering algorithm or to compare the performance of two clustering algorithms we need some measure such as error, recall, precision, rand index. Is there a java-based framework for clustering algorithm evaluation?
Relevant answer
Answer
Thank you Christian Moewes.
Question
19 answers
I have data input for a neural network with one output. The data range between 0 and X, but I do not know the exact value of X because it changes with time; for example, at time t the max value might be X = 1234, but at t+n, X can take values higher than 1234. I'll use this in an online application to make predictions.
I would like to normalize this data into [-1, +1]. Are there any good methods to do this?
Relevant answer
Answer
Min-max is suitable for cases where the min and max are known; then the [-1, 1] range can be obtained directly.
If the min or max is not known, zero-mean normalization (standardization) is recommended,
that is: (x - mu) / std.
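A small sketch contrasting the two options: min-max mapping to [-1, 1] needs known bounds, while for an online setting a running mean/std can be maintained incrementally (Welford's algorithm) and the z-score optionally squashed into (-1, 1) with tanh (the tanh step is my own illustrative choice):

```python
import math

def minmax_to_pm1(x, lo, hi):
    # Needs the true min and max of the data.
    return 2.0 * (x - lo) / (hi - lo) - 1.0

class RunningZScore:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Welford's online update of mean and (sum of squared deviations).
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0
        z = (x - self.mean) / (std or 1.0)
        return math.tanh(z)   # squashes the z-score into (-1, 1)

rz = RunningZScore()
for v in [3.0, 10.0, 55.0, 1234.0, 8.0]:
    rz.update(v)
print(rz.normalize(2000.0))
```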
Question
3 answers
Does anyone know of applications of Data Envelopment Analysis used as a ranker over different DMUs? How can one construct or learn such a general ranker on large-scale data?
Relevant answer
Answer
It is possible and not too complicated. Please look at my website, www.br.nazu.pl, where you can find the paper "Data Envelopment Analysis Without Input or Output" (it is a chapter in the book "Production Engineering in Making") with a link to the full text.
Question
4 answers
Genetic information gathered from autistic patients is transformed into multidimensional data. This is huge and requires machine learning techniques to create an automated autism detection system. I wonder if there are publications along this track.
Relevant answer
Answer
Thank you for your help. You probably mean this article: Structural, Genetic, and Functional Signatures of Disordered Neuro-Immunological Development in Autism Spectrum Disorder
Question
1 answer
Where can I get sample source code for the k-means algorithm in data mining?
Relevant answer
Answer
First, take a look at this pseudocode:
Then you can take a look at this Python implementation:
Good Luck!
L.
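For reference, a compact NumPy sketch of Lloyd's algorithm (illustrative only; production code should use k-means++ initialisation and handle empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=np.random.default_rng(0)):
    # Initialise centres with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centres as the means of their clusters.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
print(centers)
```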
Question
3 answers
What is the best way to solve the 'ground truth' problem? I need to check the algorithm and evaluate the results.
For my project I have to use genealogical data that was provided by a historical center without any ground truth. I see two options:
1. Label the data manually or with the help of some experts in this field.
2. Use another dataset (just for checking the algorithm and evaluating the results) that is not very related to my project but has been used by many other researchers; for instance, I'm thinking about US census records.
I'm worried that, if I label it manually or with some active learning techniques, a good conference may question how I obtained the ground truth; on the other hand, it is very time consuming to apply the algorithm to datasets that are not related to my project.
Relevant answer
Answer
Domain experts can solve your problem. You can also make your dataset smaller and reduce its dimensions. Then you can pose some questions whose answers you can estimate, and compare those with the algorithm's outputs. I mean, you can use a portion of your dataset, reduced in both size and dimensionality, pose a question with a predictable answer, and compare it with your algorithm's output. This may save you time.
Question
7 answers
When doing automatic classification, we can evaluate its result with precision and recall; that is to say, we can evaluate the accuracy of a classification by counting false positives and false negatives. I'm working on automatic ranking, and I would like to evaluate it. For example, if my algorithm says: 1st is Foo, 2nd is Bar, 3rd is Egg, and I know the true answer is: 1st is Egg, 2nd is Bar, 3rd is Foo, how can I tell whether this is a reasonably good answer or not?
Relevant answer
Answer
Couldn't you basically use the same approach, i.e. comparison to a gold standard? If so, you basically have to identify the possible measures, similar to precision and recall in the case of classification (rank-correlation coefficients are one standard option; see the sketch after this list). Most obviously, I could think of:
- having ranked something at the right position;
- having identified the correct order between two items;
- having identified the correct 'distance' between two items; and
- trying to extend the above towards more elements.
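In practice, rank-correlation coefficients such as Kendall's tau or Spearman's rho capture exactly this kind of agreement between a predicted and a true ordering; a small SciPy sketch for the Foo/Bar/Egg example:

```python
from scipy.stats import kendalltau, spearmanr

# Positions of Foo, Bar, Egg in the predicted and in the true ranking.
predicted = [1, 2, 3]   # Foo 1st, Bar 2nd, Egg 3rd
truth     = [3, 2, 1]   # Foo 3rd, Bar 2nd, Egg 1st

tau, _ = kendalltau(predicted, truth)
rho, _ = spearmanr(predicted, truth)
print("Kendall tau:", tau, "Spearman rho:", rho)   # both -1.0 here
```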
Question
1 answer
Is it possible to detect ATM card fraud in real time using a data warehouse based system? I have a dataset of up to one million transactions in an Oracle database and I need to prepare a fraud detection model based on it. If it is possible, I'd like to request some links to related research papers.
Relevant answer
Answer
Hi Jivan,
There are various papers available via a Google search. One which I've read and am busy applying to other financial data is by Wiese and Omlin (2011):
"Credit Card Transactions, Fraud Detection, and Machine Learning: Modelling Time with LSTM Recurrent Neural Networks"