
# Data Mining - Science topic

Explore the latest questions and answers in Data Mining, and find Data Mining experts.

Questions related to Data Mining

I'm a Master's student in the Faculty of Science. However, I can't find any supervisors who work in the field of Data Mining. At first I thought of doing my research on "Recommendation Systems", but since that depends on "Web Text Mining" and "Text Parsing", I thought it would be better to start with the last two. I have done previous research on both topics. The attached research mirrors my field of interest. I would be glad if you advised me on the best place to join: DMCM or elsewhere.

This is the website of DMCM.

Is there a good, easy-to-use open-source software package (with a graphical interface) for PCA, PCO and other data reduction methods? I know TMEV can do that. I also found some R packages, but I would like something faster.

Is there anyone who works on SNA? Do you have any recommended tools for SNA?

I'll start some research related to SNA, especially for Twitter and Facebook, but it's hard for me to develop tools to gather data from them; basically I just want to get the "text-based data" from both. I'll then use text mining to analyze everything I need.

Could you please advise me on what kind of tools I can use to get data from Twitter and Facebook?

Is it to compare the reliability of these data sets?

"The nature of statistical learning theory" was my favorite book during my PhD, and several years after. I still use "Elements of Information Theory" in my work. Recently a book has opened new horizons for me: "Prediction, Learning, and Games".

Can you recommend any good books?

I'm trying to solve a link prediction problem for a Flickr dataset. My dataset has 5K nodes, and each node has around 27K features; the data is sparse.

I want to find similarity between the nodes so that I can predict a link between them if the similarity value is greater than some threshold that I decide.

One more problem is how to define this as a classification problem. I wanted to find overlapping tags for two nodes, so the table contains node pairs and some of their features (there will be thousands), and all of them belong to the positive class only, since I know there is a link between them.

I want to create a test set with some of the nodes, build a similar table, and label each pair as positive or negative. But my problem is that all the data I have is positive, so the model would never be able to label anything as negative. How do I turn this into a classification problem correctly?

Any pointers or help is very much appreciated.
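Since only positive pairs (existing links) are observed, a common fix is negative sampling: treat randomly drawn node pairs that share no edge as negative examples, giving a balanced binary classification dataset. A minimal sketch, where the 1:1 positive/negative ratio is just a common default, not a requirement:

```python
import random

def build_link_dataset(nodes, edges, ratio=1, seed=42):
    """Build a link-prediction dataset: known edges are positive examples;
    randomly sampled non-edges serve as negatives. Assumes the graph is
    sparse enough that random pairs are usually non-edges."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    positives = [(tuple(sorted(e)), 1) for e in edge_set]
    negatives = []
    while len(negatives) < ratio * len(positives):
        u, v = rng.sample(nodes, 2)
        if frozenset((u, v)) not in edge_set:
            negatives.append((tuple(sorted((u, v))), 0))
    return positives + negatives
```

A classifier trained on pair features (e.g. tag overlap) over this dataset can then output a link probability for unseen pairs.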

We consider a data stream which is not necessarily stationary. We would like to label instances on the fly.

What are the important factors to look into to improve software quality by predicting post-release defects?

I want to develop an application in Java which takes as input the log file of a mail server in CSV format and outputs a graph for it. The graph should be generated by Gephi. Then I want to feed that CSV file to KruskalMST.java (a program for finding the minimum spanning tree of the generated graph). Before passing the CSV file to KruskalMST.java, I will convert all the weights to negative so that instead of the minimum spanning tree I get the maximum spanning tree, which is what I am interested in. KruskalMST.java will give me a file containing three columns (Source, Target, Weight). This file will again be input to Gephi, which will render the maximum spanning tree graphically; then I want to do some analysis on these two graphs.

Until now I have done all these things manually and separately. I want to integrate them into a single application with Gephi embedded in it.

Please give me some suggestions on how to proceed. Will gephi toolkit be helpful?
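On the maximum-spanning-tree step: negating the weights is equivalent to running Kruskal's algorithm with edges sorted in descending order, so the sign flip can be avoided entirely. A minimal sketch of that idea (the Gephi/CSV glue is omitted; a Java version would follow the same structure):

```python
def kruskal_mst(edges, maximize=False):
    """Kruskal's algorithm with a union-find structure. Sorting edges in
    descending weight order (maximize=True) yields a *maximum* spanning
    tree, with no need to negate the weights first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=maximize):
        ru, rv = find(u), find(v)
        if ru != rv:  # adding this edge creates no cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree
```

The returned (Source, Target, Weight) triples are exactly the rows the Gephi import step needs.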

Can anyone help me find a tool that allows me to download the old tweets in a user's history? I need to study the content of tweets from 2011 from a group of users who used a particular hashtag.

I know that this is an advantage of using decision trees, but usually people get very large and complex models...

A medical data set is difficult to compare with other data sets, so classification techniques need to be combined to get accurate results.

I would like information about the future research scope of data mining in areas like anomaly detection in medical images and cancer detection.

Is anyone using Apache Mahout for their research?

I'm looking for people working with Apache Mahout in their research.

There is increasing research that uses data mining in education for different purposes. I want to see whether there is a study that examines the skills of students using data mining techniques.

I want to work on a categorical data set and find the best features by a filter method or a wrapper one. There are only a few algorithms suitable for categorical data, so does anybody know of any

0 - benchmarks

1 - similarity measurements

2 - feature selection methods

3 - accurate clustering methods

for categorical data sets?

(Sorry for my general question, and thanks for your participation.)
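For filter-style feature selection on categorical data, the chi-square statistic between each feature and the class label is one standard relevance score (this is what tools like Weka's ChiSquaredAttributeEval compute). A self-contained sketch for scoring a single feature:

```python
from collections import Counter

def chi2_statistic(feature, labels):
    """Chi-square statistic between a categorical feature and the class
    label; larger values suggest stronger association (a more relevant
    feature). Turning this into a p-value would need a chi-square CDF,
    which is omitted here."""
    n = len(feature)
    f_counts = Counter(feature)
    l_counts = Counter(labels)
    joint = Counter(zip(feature, labels))
    stat = 0.0
    for fv, fc in f_counts.items():
        for lv, lc in l_counts.items():
            expected = fc * lc / n  # count expected under independence
            observed = joint.get((fv, lv), 0)
            stat += (observed - expected) ** 2 / expected
    return stat
```

Ranking all features by this score and keeping the top k is the simplest filter method for categorical data.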

Or is it only based on content?

I have obtained part of the TF-mRNA relationships from the CRSD database. However, I want a more comprehensive database of TF-mRNA relationships to validate my results. I would appreciate any pointers to a more comprehensive database.

How to perform feature selection using mutual information?

For example, take a document-term data set with terms as dimensions, where mutual information is used as the measure for feature selection on the terms. After calculating the mutual information between all possible pairs of terms, is it correct to set a threshold and select all the terms from the pairs whose MI value is below the threshold?
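A note of caution: pairwise term-term MI measures *redundancy*, so the usual recipe is to drop one term of each pair whose MI is *above* a threshold, while *relevance* is typically measured as the MI between a term and the class label. A minimal sketch of MI for two discrete variables (e.g. per-document 0/1 term occurrences):

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete variables given
    as equal-length observation lists. Empirical plug-in estimate."""
    n = len(x)
    px = Counter(x)
    py = Counter(y)
    pxy = Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c in pxy.items():
        p_joint = c / n
        p_indep = (px[xv] / n) * (py[yv] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi
```

Two identical occurrence vectors give the maximum MI (full redundancy), while independent ones give an MI of zero.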

The categorization is of a sentence/word into a particular class.

I am wondering if it is possible to break matrix inversion into a Hadoop/MapReduce framework?

I am interested in applying classifier evaluation methods in KNIME. For some reason the ROC curve does not apply to regression models.

Can anyone help me?

Can we apply data mining techniques on "IDS signature database" for anomaly detection?

Can we extract or discover vital patterns from IDS signature database for further use in anomaly detection by the same IDS?

Suppose I have a series of paragraphs, how do I know when put together, these paragraphs form a coherent article?

Could you recommend surveys/articles about discovering anomalies in time sequences?

I have two anomalies connected with time:
1) I have a sequence ABCDE where the time between C and D is 1 ms in one appearance but 10 ms in a second appearance, and I have to discover this.
2) I have two timelines with the same sequences and I have to discover the time shift between them.
Thanks in advance,
Tomek.

Is there any new research topic going on at IIT or IIIT?

Can anyone suggest a way or a site to find Twitter data that contains tweets and user locations? Or does anyone have such data?

I have encountered a very high-dimensional classification problem. I'm looking for an efficient feature selection (or extraction) method.

And what do you think the next big thing is in Machine Learning/AI/NLP?

I am looking for good conferences or workshops on the topic of data analytics and big data this year. Is there anything where the deadline for submission is not yet expired?

I am trying to combine both methods for classification of human grasping.

Has anyone ever tried to replace 2nd-order methods in optimization with 3rd-order ones? What were the results: better, worse, or not worth trying?

Deep learners report very good results on many difficult learning tasks such as speech recognition, object recognition, natural language processing, and transfer learning.

The hidden layers are trained one by one using an unsupervised learning procedure, such as autoencoders trained to reconstruct their inputs, where the input of each autoencoder is the hidden layer of the previous one.

At the end, a supervised layer discriminates the inputs.

Which regression model will be better? A linear regression model of higher order or a non-linear regression model of lower order?

I am looking for a data set that contains tweets on consumer feedback towards products and services. Benchmark data sets are appreciated as well.

I want to know how to implement association rule mining in a cloud environment. Is it limited to outsourcing cloud space for storing intermediate data in the mining process, or is there anything more?

Let's have a set of (xi, yi), i = 1, 2, ..., n data for which we know nothing at all about the hidden functional form yi = f(xi). We want to find the true function f() with nonparametric methods, i.e. without adopting a model of any form and performing regression analysis. The task is complete only if our function can reconstruct data outside the initially given interval xi ∈ [a, b], i.e. if it has predictive value. Otherwise we don't care, since many methods exist (cubic spline interpolation and others) to represent data inside the given x-range [a, b].

So, which method do you think is the 'best' one for solving the above problem?
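As a concrete example of a nonparametric estimator, a k-nearest-neighbour regressor fits data with no assumed functional form; note, though, that like splines it interpolates well inside [a, b] but extrapolates poorly outside it, so by itself it does not meet the predictability requirement stated above:

```python
def knn_regress(train, x, k=3):
    """Simple k-nearest-neighbour regression: predict f(x) as the mean of
    the y-values of the k training points (xi, yi) closest to x. Purely
    nonparametric; no model of f is assumed."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbours) / len(neighbours)
```

For extrapolation beyond [a, b], one usually has to reintroduce assumptions (e.g. smoothness via Gaussian-process priors), which is exactly the tension the question raises.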

I am working on a multi-relational classification technique and I want to combine classifiers that are trained on different relations.

I want to classify spatial image data.

Existing techniques include generalization, suppression, etc. What anonymization techniques for ensuring privacy have been proposed in the last two years?

Applying LSA to 500 PDF documents extracted from Google (for a certain feature), I got low accuracy when I tried to infer the topic of new documents.

What could be the reason for this?

Suppose I have multiple runs of k-means on the same dataset (say $K_1$ to $K_p$), each run with random initialization. Given two points $x$ and $y$, let their distance be $d = ||x - y||$.

How can I bound the probability of their being assigned to the same cluster over the random runs of the algorithm?

I'm stuck at this sub-problem, which is part of a larger problem I need to solve. Any help would be great, perhaps pointers to results or promising methods.

[The Effectiveness of Lloyd-Type Methods for the k-Means Problem][1] by Ostrovsky et al. seems helpful, but I'm still working through the techniques to see how I can use them to accomplish what I need.

I wonder if someone has extended the J48 classifier implemented in Weka to get an incremental version of the algorithm? For some classifiers, naive Bayes among them, Weka provides both a batch and an incremental version, but I think this is not the case for J48. I also wonder whether rebuilding the classifier with each new training example would be an alternative to an incremental version? I'd appreciate your help with these issues.

Some practical examples related to business would be of great help!

I am going to analyze big data from data centers. But first of all I want to know in what form (data type) data are stored in the data centers of Facebook, Twitter, Amazon, etc.

What parameters should I use for the classification/grouping of big data?

Thanks

I'm working on a piece of research for this - the data set is in the area of inpatient activity (hospitals). I would value your thoughts/ideas on this.

I have a directed graph in which every node depicts a user, and an edge between two users indicates that mail has been exchanged between them. The weight of an edge is the number of mails exchanged between the two users. I want to find the most heavily weighted path from a node in this directed graph. I have used Gephi to generate the attached graph. The graph is divided into different communities based on edge weights; each community is represented by a different color.

I need to find the best data mining technique to classify new data points without a fixed number of classes (unlike the k-means algorithm). My data takes the form of a vector.

What is the latest algorithm to find the similarity between data of two or more clusters?

I am using CURE, a hierarchical clustering technique.

I read an article by Pedro Domingos titled "A Few Useful Things to Know about Machine Learning" (Oct 2012), and I do not buy how he says that a 'dumb algorithm' and a tremendous amount of data will provide better results than moderate data and a more clever algorithm. What if adding more data gives you more noise or irrelevant information? The only justification I see is that you can go through more iterations over the data and have more ways to learn from it. I can't see that claim as sufficient or sound enough for this to be valid.

Can anybody tell me the recent sequence mining techniques?

All of the above share some common ground. In your opinion, what is the real difference between these fields? If you were asked to classify your own research area, what would it be?

I propose that the *only* way ahead is via a fundamentally new representational formalism that would clearly reveal what critical information has been missing from conventional forms of data representation. That is, we need a form of data representation that captures a completely new side of the objects we have been missing so far, which can only be achieved by relying on entirely new representational 'tools'.

My SVMlight output for binary classification (+1 for the true and -1 for the false class) looks like:

+1 0.001

-1 -0.34

+1 0.9

+1 0.55

-1 -0.2

I tried generating an ROC curve in R using the ROCR package, but all the ROCR examples I found used TPR, FPR and cutoff values. TPR and FPR can be calculated from the above data, but that gives only one value, and I am also confused about the cutoff: in this case the cutoff ranges from -1 to +1, right?

Can anyone explain how to feed the values from the above data into the ROCR package to draw an ROC curve?
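For intuition, here is roughly what an ROC computation does with (label, score) data: the cutoff is swept over the observed score values (it is not fixed at -1 to +1), and each cutoff yields one (FPR, TPR) point, so five scores give a small staircase rather than a single point. A minimal sketch (tie handling omitted):

```python
def roc_points(labels, scores):
    """Compute (FPR, TPR) points by sweeping the decision threshold over
    every observed score, from highest to lowest. Labels are +1 / -1.
    Ties in score are not handled specially in this sketch."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```

In ROCR the same sweep happens inside `prediction()`/`performance()`, which is why you pass raw scores rather than thresholded labels.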

I would like to model the "Modus operandi" in cyber crime from a user's fraudulent cyber-activities. How can this be achieved?

Help with effective algorithm.

I just want to start exploring the possibilities of simulating bioinformatics problems using artificial intelligence.

I'm fitting some multivariate observation sequences to a hidden Markov model using R's RHmm package. But I keep getting the following error: "Error in BaumWelch(paramHMM, obs, paramAlgo) : Non inversible matrix". Does anyone know what is happening? Or does anyone know where I can find other implementations?

Lately I am getting interested in data mining research applied to reducing poverty in some area/country. Does anyone work on this kind of research?

My previous work was on classification methods, especially generalizing binary multisurface proximity methods [by O. L. Mangasarian] to multiclass using OVA. I also did other research related to that model [dealing with missing values, online learning, error bounds, etc.].

I am hoping that I can get pointers/updates on how to relate my new interest [data mining & poverty] to my previous work [classification methods]. Any help/discussion will be greatly appreciated.

Big Data is one of the buzzwords of the year, and we have heard arguments along the full spectrum – from having so beautifully many data available that we can solve huge societal challenges, over worries concerning data archiving, to statements such as 'a tsunami full of rubbish that nobody will be able to use' or (more scientific) doubts about being able to deal with the huge heterogeneity in semantics. Now I would like to get your opinion on where this debate has led us and where it might go.

I need to perform a PARAFAC decomposition in R and I need to understand how to implement it. Is there a package that performs PARAFAC and gives me the factor matrices?

I'm using the J48 decision tree algorithm with Weka, but it doesn't build a tree.

It only says Number of Leaves:1 and Size of the tree:1

I want to classify images based on SIFT features; what tool can I use to extract them? I have used http://koen.me/research/colordescriptors/ to extract SIFT features, but the problem I am facing is that the file becomes too large after extraction. I cannot pass that file to the SVM, as one file has approx. 12000 rows. The images I am using have dimensions of e.g. 1024x683. The SIFT feature file must contain less information so that I can pass hundreds of images to the SVM at the same time.

I usually use Latent Dirichlet Allocation to cluster texts. What do you use? Can someone give a comparison between different text clustering algorithms?

Classical approaches perform badly on the minority class, even if the global learning is satisfactory.
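One simple baseline for the minority-class problem is resampling: randomly duplicate minority examples (oversampling) or drop majority examples (undersampling) before training; cost-sensitive learning is the other common family. A sketch of random oversampling:

```python
import random

def oversample_minority(data, seed=0):
    """Random oversampling: duplicate minority-class examples (drawn
    with replacement) until every class has as many examples as the
    largest one. `data` is a list of (features, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append((x, y))
    target = max(len(v) for v in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced
```

Exact duplication can encourage overfitting on the minority class, which is why smoothed variants (e.g. SMOTE, which interpolates synthetic minority examples) are often preferred.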

Recently when I tried to visualize similarities between different groups in a social network, I discovered an interesting phenomenon which I couldn't explain. Can someone give me some insight on this?

The story is as follows: Users can create different groups on our social network website, and we'd like to visualize the similarities between different groups in 2D. We use the group members and their page views within the group as the feature vector of each group, and the similarity between groups is computed as the cosine similarity between their feature vectors. I used multidimensional scaling ( cmdscale in R ) to reduce the data into 2D and visualized the data.

The result of the MDS is points lined up with some lines orthogonal to each other. Can someone explain why this is happening?

Hi everyone, I want to ask: can graph mining techniques be applied to stream data coming from sensor networks, and what can we get from this mining? Data streams coming from sensor networks are very different from data coming from the web or social networks. Thanks very much.

If we want to define a size on a concept, what is the best criterion for doing so?

Is it possible to use a greedy set-covering algorithm and take the minimum number of spheres covering the positive instances as a measure of concept complexity?

cmdscale in R is pretty slow: it takes about 1.5 hours to do multidimensional scaling on 10,000 points.

Can anyone suggest dissertation question(s) or project idea(s) to research for my final year as an undergraduate of Computer Science in Big Data?

I am interested in Big Data technology; however, I am a newbie to this. I would like to dive in and learn as much as I can in one academic year, and my strategy is to do this as part of my final-year project. I already have experience in web development, human-computer interaction and relational databases.
I would greatly appreciate it if anyone could suggest a question that revolves around these three areas, as it seems to me that Big Data is the future of Marketing (though I'm not particularly interested in Marketing).

I am working in the data mining area, leading to knowledge acquisition. Can we try our own algorithm in the WEKA tool? How feasible and accepted is this?

I need to do clustering on a dataset composed of both numerical and categorical data. What algorithms do you recommend?
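One common way to cluster mixed numerical/categorical records is to define a mixed-type dissimilarity such as Gower distance and feed it to a distance-based algorithm like k-medoids or hierarchical clustering (k-prototypes is another option). A minimal sketch of Gower distance, where the attribute layout is an assumption for illustration:

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance for mixed records `a` and `b` (equal-length
    tuples): numeric attributes contribute |a - b| / range, categorical
    ones contribute 0 (match) or 1 (mismatch); the result is the average
    over all attributes. `numeric_ranges` maps each numeric attribute's
    index to its (assumed nonzero) value range."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            total += abs(x - y) / numeric_ranges[i]
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)
```

Precomputing the pairwise Gower matrix lets any distance-matrix clustering method handle the mixed types transparently.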

In my research, I need to visualize each tree in a random forest in order to count the number of nodes in each tree. I use R to generate the random forest but couldn't find any command that satisfies my demand. Do you have any idea how I can visualize each tree in a random forest, or how I can calculate the number of nodes in each tree?

I have to normalize data whose values range from 100 to 1000 (numeric values). I have read some material on normalization techniques, e.g. min-max normalization, cosine normalization, non-monotonic normalization, etc., but I don't know which technique is best for me. I read a paper suggesting non-monotonic normalization is better, but I haven't found any good material about it.

Will it be useful in terms of accuracy and speed of classification?
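For plain numeric values in a range like 100-1000, min-max scaling and z-score standardization are the two standard baselines; normalization costs almost nothing at prediction time, and its accuracy impact depends mainly on whether the classifier is distance-based. A minimal sketch of both:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Classic min-max normalization: linearly rescale values into
    [new_min, new_max]. Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

def z_score_normalize(values):
    """Z-score standardization: shift to zero mean, scale to unit
    (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]
```

Min-max keeps the original distribution shape within fixed bounds, while z-score is less sensitive to a single extreme value shifting the whole scale.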

I'm trying to collect information on this area, but most existing papers seem to target network-based DoS detection, whereas I could not find much about application-level approaches.

Mahout is a solution of Apache Foundation to build scalable machine learning libraries.

Is it possible to crawl emails either in inbox or in a label of an email account?

A good researcher in a field is often judged by the number of their publications. How far is that true? Are there any other criteria?

During data entry, we can correct errors before mining by creating forms through question ordering during question reformulation.

How can a support vector machine algorithm be implemented in image data mining?

I have a set of data, i.e. some features extracted from an image. I need to train a classifier and then test it. How is this possible using the WEKA software, and which classifier is best?

Like Different types of Genetic Algorithms

Can anyone kindly explain what the standard deviations in the roughness are? And how to extract data using correlation lengths or pair correlation functions, which are usually calculated in typical AFM software (NT-MDT, NTEGRA model, NOVA software)?

Gaussian mutation makes small random changes in the individuals in the population. It adds a random number drawn from a Gaussian distribution with mean zero to each vector entry of an individual. The variance of this distribution is determined by the parameters scale and shrink.

My question is: do all the genes in the chromosome get affected by Gaussian mutation?

Is there any paper/material that visualizes Gaussian mutation? There are many visual examples of crossover operations, but I can't find any for the Gaussian operator.
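Whether every gene is perturbed is an implementation choice: some implementations add Gaussian noise to every gene, while others perturb each gene only with some probability. A minimal sketch that exposes that choice via a `rate` parameter (the parameter names here are illustrative, not from any particular GA toolbox):

```python
import random

def gaussian_mutation(individual, scale=1.0, rate=1.0, seed=None):
    """Gaussian mutation: add N(0, scale^2) noise to each gene. With
    rate = 1.0 every gene is perturbed; with rate < 1.0 each gene is
    perturbed independently with that probability. Decaying `scale`
    over generations mimics a shrink-style schedule."""
    rng = random.Random(seed)
    return [g + rng.gauss(0.0, scale) if rng.random() < rate else g
            for g in individual]
```

Plotting a population before and after this operator (a cloud of points blurred by isotropic noise) is an easy way to produce the visualization the question asks about.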

Each group of synonyms (English) is linked to a bibliographic reference. The database (Bibliomac, for Macintosh only) is driven by software called 4D. Unfortunately, this database is no longer developed. It's a pity, because the possibility of linking a group of synonyms with bibliographic references is uncommon and powerful for data mining. Reference Manager can do something similar, but I don't know how to transfer this list from Bibliomac to it, as I'm not a computer scientist. Could somebody help me? I can send the database and the software (unfortunately in French).

I have a data set (267 records) with 5 predictor variables, which contains several missing values in the third variable (categorical, multinomial). I would rather use single imputation to replace those missing values, but some papers say that it is not powerful enough to estimate the values. I'd like to know if there is a multiple imputation method that I can use which is powerful and simple. I'm open to any suggestion.

Is there a java-based framework for clustering algorithm evaluation?

To evaluate a clustering algorithm or to compare the performance of two clustering algorithms we need some measure such as error, recall, precision, rand index. Is there a java-based framework for clustering algorithm evaluation?

I know Multidimensional Scaling uses only a distance matrix, but Self-Organizing Map requires coordinates of points in the original space. What are some other dimensionality reduction techniques, such as Multidimensional Scaling, that need only a distance matrix rather than point coordinates?

Is there a well defined structure to keep graphs in memory?


I have data input for a neural network with one output. The data range between 0 and X, and I do not know the exact value of X because it changes with time: at time t the max value may be X = 1234, but at t + n, X can take values higher than 1234. I'll use it in an online application to do some predictions.

I would like to normalize this data into [-1, +1]. Are there any good methods to do this?
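When the maximum is unknown and can grow over time, one pragmatic option is to track the running min/max seen so far and rescale with the current bounds; another is a bounded squashing function such as tanh(v / c), which needs no maximum at all. A sketch of the running-bounds approach, with the usual streaming caveat that earlier outputs are not revised when the bounds grow:

```python
class RunningScaler:
    """Online rescaling into [-1, +1] when the true maximum is unknown:
    maintain the min and max observed so far and rescale each value with
    the current bounds."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def update(self, v):
        """Fold a newly observed value into the running bounds."""
        self.lo = v if self.lo is None else min(self.lo, v)
        self.hi = v if self.hi is None else max(self.hi, v)

    def scale(self, v):
        """Map v into [-1, +1] using the bounds seen so far."""
        if self.hi == self.lo:
            return 0.0
        return -1.0 + 2.0 * (v - self.lo) / (self.hi - self.lo)
```

If the network is retrained periodically anyway, the drift in scaling as the bounds expand is usually tolerable; otherwise the tanh variant gives a stable mapping at the cost of compressing large values.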

Does anyone know of an application of Data Envelopment Analysis used as a ranker over different DMUs? How can one construct or learn such a general ranker on large-scale data?

Genetic information gathered from autistic patients is transformed into multidimensional data. This data is huge and requires machine learning techniques to create an automated autism detection system. I wonder if there are publications along this track.

Where can I get sample source code for the k-means algorithm in data mining?
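For reference, the core of Lloyd's k-means fits in a few lines of plain Python; library versions (e.g. Weka's SimpleKMeans or scikit-learn's KMeans) add smarter initialization and efficiency on top of the same loop:

```python
import random

def k_means(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means over points given as tuples: assign each
    point to its nearest centroid, move each centroid to the mean of its
    cluster, and repeat until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init: k distinct points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

Note that plain Lloyd's algorithm only finds a local optimum, so production code typically restarts from several random initializations and keeps the best result.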

What is the best way to solve the 'ground truth' problem? I need to check the algorithm and evaluate the results.

For my project I have to use genealogical data that were provided by a historical center without any 'ground truth'. I see two options:

1. label these data manually or with the help of some experts in this field

2. use another dataset (just for checking the algorithm and evaluating the results) that is not very related to my project but was used by many other researchers; for instance, I'm thinking about US census records.

I'm worried that if I label it manually or with some active learning techniques, then for a good conference the question of how I obtained the ground truth will arise; but on the other hand, it is very time-consuming to apply the algorithm to datasets that are not related to my project.

When doing automatic classification, we can evaluate the result with precision and recall; that is, we can evaluate the accuracy of a classification by counting false positives and false negatives. I'm working on automatic ranking, and I would like to evaluate it. For example, if my algorithm says 1st is Foo, 2nd is Bar, 3rd is Egg, and I know the true answer is 1st Egg, 2nd Bar, 3rd Foo, how can I tell whether this is a fairly good answer or not?
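Rank correlation measures such as Kendall's tau or Spearman's rho are standard for exactly this: they score a predicted ranking against a true one by counting concordant and discordant item pairs, rather than exact position matches. A minimal sketch of Kendall's tau (assumes both rankings contain the same items, no ties):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings given as lists of items in
    rank order: +1 means identical rankings, -1 a complete reversal,
    values near 0 mean the rankings are unrelated."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    items = list(pos_a)
    concordant = discordant = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            x, y = items[i], items[j]
            # a pair is concordant when both rankings order it the same way
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

For the Foo/Bar/Egg example above, every pair is ordered oppositely in the two rankings, so tau is -1: the ranking is maximally wrong despite Bar being in the right position.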

Is it possible to detect ATM card fraud in real time using a data warehouse based system? I have up to one million transactions in an Oracle database, and I need to build a fraud detection model based on the available dataset. If it's possible, I'd like to request some links to related research papers.