Science topic

Big Data - Science topic

In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.
Questions related to Big Data
Can anyone suggest dissertation question(s) or project idea(s) in Big Data to research for my final year as a Computer Science undergraduate?
Question
5 answers
I am interested in the Big Data technology, however, I am a newbie to this. I would like to dive in and learn as much as I can in one academic year. My strategy is to do this as part of my final year project. I already have experience in web development, human computer interaction and relational databases. I would greatly appreciate it if anyone can suggest a question that revolves around these three areas, as it seems to me that Big Data is the future of Marketing (though I'm not particularly interested in Marketing).
Relevant answer
Answer
You can work on recommender systems or association rules in Marketing.
Question
4 answers
As data sets become large, retrieving information from them is not easy. So can we introduce the concept of clusters into big data for analysis?
Question
7 answers
Hadoop And MapReduce.
Relevant answer
Answer
These are for teaching rather than true "projects," but here are some introductory MapReduce assignments from our MOOC Introduction to Data Science [1]:
1) Create an Inverted index. Given a set of documents, an inverted index is a dataset with key = word and value = list of document ids in which that word appears. Assume you are given a dataset where key = document id and value = document text. (A minimal sketch of this task appears after this answer.)
2) Implement a relational join. Consider the query SELECT * FROM Order, LineItem WHERE Order.order_id = LineItem.order_id. Your MR job should produce the same information as this query. (Hint: Treat the two tables Order and LineItem as one big concatenated bag of records.)
3) Simple social network task. Consider a simple social network dataset, where key = person and value = some friend of that person. Describe a MapReduce algorithm to count the number of friends each person has.
4) Slightly less simple social network task. Use the same dataset as the previous task. The relationship "friend" is often symmetric, meaning that if I am your friend, you are my friend. Describe a MapReduce algorithm to check whether this property holds. Generate a list of all non-symmetric friend relationships.
5) Sequence matching. Consider a set of sequences where key = sequence id and value = a string of nucleotides, e.g., GCTTCCGAAATGCTCGAA.... Describe an algorithm to trim the last 10 characters from each read, then remove any duplicates generated. (Hint: It's not all that different from the Social Network example.)
6) Matrix Multiply. Assume you have two matrices A and B in a sparse matrix format, where each record is of the form i, j, value. Design a MapReduce algorithm to compute matrix multiplication.
For 3) and 4), you can get "small" data for testing a solution here:
You can get a larger dataset (one that will actually motivate the use of Hadoop) here:
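To make task 1 concrete, here is a minimal in-memory sketch of the inverted index in Python; the toy documents and function names are illustrative only, not part of the course material, and a real job would run the same logic on Hadoop or a similar framework.

    from collections import defaultdict

    def map_phase(doc_id, text):
        # Emit (word, doc_id) pairs, one per distinct word in the document.
        for word in set(text.lower().split()):
            yield word, doc_id

    def reduce_phase(pairs):
        # Group document ids by word: key = word, value = sorted list of doc ids.
        index = defaultdict(set)
        for word, doc_id in pairs:
            index[word].add(doc_id)
        return {word: sorted(ids) for word, ids in index.items()}

    docs = {1: "big data needs new tools", 2: "big clusters process data"}
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    print(reduce_phase(pairs))  # e.g. {'big': [1, 2], 'data': [1, 2], ...}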
Question
13 answers
Libraries left right and centre are cancelling print versions of academic journals and discarding their old journal stocks. When challenged, they say don't worry, this information will all be freely available on the Internet. This is not even completely true nowadays, but what is the long-term future? We are entering a Digital Dark Age, not helped by the fact we have our eyes closed as well. Dangers and factors militating against indefinite free storage include -- energy supply security; energy costs (some data centres use as much electricity as a small town); obsolete storage devices and digital formats; missing software; ephemeral recording media; planned obsolescence; unreliable or complacent custodians; malicious hackers, criminals or terrorists; politically or religiously motivated activists. Some of these points are discussed by Roger Highfield in Daily Telegraph Jan 7 2014 p25.
Relevant answer
Answer
The author John Harris is also very concerned about "a lack of historical sense, a pervasive contempt for the wisdom of the past" (Guardian Feb 21st 2015, p 33). The situation with libraries is worse than I had thought. Manchester has modernised its Central Library, dumping or pulping 240,000 non-fiction books that were "duplicates, outdated, or otherwise obsolete". The Nazis, at least, had more believable reasons for destroying books. As Harris notes, "precious things are now only a corrupted hard drive or system upgrade away from being lost, for good". "Is there a cure for all this past-blindness?" His suggestion: reading. But every few months I find yet another academic library that I used to rely on has "modernised", its old books either dumped or entrusted to that fickle friend, the Internet, to be replaced by a trendy cafe area.
Question
4 answers
I want to analyze a CDR (call detail record). What are the different methods that can be used for this, such as Hadoop, a data mining tool, etc.?
Relevant answer
Answer
Thank you Damien Mather for the suggestion. I think the following two papers will be helpful for those working in this area:
Best-Fit Mobile Recharge Pack Recommendation
Mobile Subscriber Fingerprinting- A BigData Approach
Question
3 answers
The funding agencies and societies have commissioned a multitude of biological databases, 2334 at last count according to the NIF Registry, and the question that we keep hearing is "my renewal is dead" from many of them largely because of innovation. I would really like to know if there is a good reason to rebuild a database every 2-5 years?
Relevant answer
Answer
Dear Kim,
Granted, there is a good reason to archive the data from each publication. Thank you, the work on dataverse and the data citation principles in Force11 is highly relevant. However, when people work to create tools and databases I am still not clear on what exactly gets archived. Furthermore, do you archive if there is no publication associated? I am thinking not about just some desk drawer in my old lab, but about significant and well known resources such as some of the Allen institute brain-maps that may be many terabytes of unpublished data that are widely referenced in literature. Should there be an archive at each time point of the whole data set, parts of it?
My original question deals with how to deal with databases, like DataVerse, in terms of long term support. What happens to DataVerse data in five years? There is no proper paper published on DataVerse, so what happens to all the data archived there? If there was a paper published on it, would it back up all the data in there?
What is your definition or concept of Big Data?
Question
30 answers
Should we be collecting data more intelligently rather than collecting all sorts of garbage and then applying a "statistical hammer" to break it into smaller pieces? Why not use a more intelligent statistical design to collect the data in the first place?
Relevant answer
Answer
This discussion is very interesting! I may just provide an answer to your title question, as we proposed the following definition for Big Data in the recent paper attached: "Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value". In case you don't find it convincing, you will find a survey of other definitions inside the paper. Hope this helps, Marco
Question
2 answers
I am preparing to start my Masters in Computing and Information Science in 2014. I am keen to connect with other people interested in the challenges I have outlined in my ResearchGate profile.
Relevant answer
Answer
sorry I do not have the expertise to answer this question.
Regards
Would it be possible to find a good quantitative definition for BIG DATA? What features define the term BIG? Any consensus?
Question
22 answers
Most people know the term "BIG DATA", but when can DATA be considered BIG? There are many dimensions to take into account: speed, storage, diversity, features, etc. First, what dimensions should we use? And second, is it possible to define a function that returns true or false depending on those variables? Is DATA BIG for computers or for us? This last question is related to the performance of processors. Ten years ago, when I worked with data streams, >10^5 was BIG, but not now.
Relevant answer
Answer
A good overview of Big Data's different definitions can be found in "Undefined By Data: A Survey of Big Data Definitions" - http://arxiv.org/abs/1309.5821 .
Question
4 answers
There is great interest in improving access to publications, but access to primary data is a larger problem in my view, with the increasing number of active researchers producing results that might be expensive and time-consuming to reproduce from scratch even with a publication as a guide. Structural biology has been a leader in providing underlying data as well as completed work in databases such as the Protein Data Bank. The ability of researchers to download data as well as completed structures has proven important for the development of computational methods and for improving structure determinations. Yet in many areas the underlying data is not archived and is disappearing at an average rate of 7% per year after the primary publication - http://www.the-scientist.com/?articles.view/articleNo/38755/title/Raw-Data-s-Vanishing-Act/. If basic data is a significant part of the value of research, how can or should we save this investment?
Relevant answer
Answer
You are right to call research an "investment," and the funding entity is always in a position to demand certain returns. Over the last several years in the USA, as you allude to, recipients of NIH funds have been obliged to make their publications truly public through PubMed Central. But as you correctly note, access to the underlying data is much more important than access to a manuscript. Even the best manuscript fails to glean all insights from a dataset, and the worst may overinterpret or misrepresent. Also, many data are never reported in publications. To ensure maximum return on investment, funding agencies should mandate data archiving in centrally maintained databases like (e.g., in my field) the Gene Expression Omnibus or Sequence Read Archive (US) or Array Express (Europe) and, with some reasonable time delay, divorce this expectation from publication. For example, if I pay for a high-throughput sequencing project with NIH funds, I should be required to make the results public (or at least available to any funded colleagues), along with full experimental details sufficient for replication studies, within a certain period of time regardless of whether I have published. The funding agency should provide a secure and well-maintained database as well as guidance and oversight.
While researching a small study of microarray data that I published in 2013 (http://shootingcupoche.com/publication/235380207), I was surprised that even respected journals with data archiving mandates typically do not enforce them. I am of course not the first or only person to observe this. Less of a surprise was the finding that willingness to submit data seemed to correlate with the quality of a scientific study, and that data submitted post-publication often do not support the conclusions of a publication. Often, they are faulty and would result in instant rejection if reviewers had the time to look at them. My conclusion is that, since researchers and journals demonstrably will only rarely archive data reliably and consistently without impetus, funding agencies should help by instituting mandates and providing secure, well-maintained archives for all types of data.
Question
4 answers
Is it possible to use different data mining methods, such as classification, clustering analysis, or association rule mining, on Big Data? If yes, I would appreciate a practical example if possible.
Relevant answer
Answer
The answer to your question is of course 'yes'. I would suggest that you consult the Stanford University MOOC on Machine Learning, given by Andrew Ng. Please follow the URL below and check out the lectures on 'Large Scale Machine Learning'.
Question
5 answers
Can this problem be solved through supply chain business process reengineering?
Relevant answer
Answer
Hi, Edith,
Your answer is very enlightening, thank you.
Yuran
Question
18 answers
What would you suggest for big data cleansing techniques? Is your big data messy? If so, how (or how much) does it affect your work / research? How do you get rid of noise? How do you verify big data veracity, esp. if the source is social media? I would appreciate any suggestions and/or pointers to recent articles in the media, research papers, or documented best practices on big data verification, quality assessment or assurance.
Relevant answer
Answer
Hi Victoria. I do not really agree with the previous comments / answers. Data cleaning has two parts. The first, and often the most time-consuming, is checking that all values (quantitative or categorical) are as expected (e.g. format, whether you allow NULL / NaN, ...). For this I would use dedicated Perl code, for example. Then (second part), once you are sure about the content / values, you can use statistics. BUT remember that "outlier removal" is first a question of where you put the threshold AND, second and even more important, it should always be justified / confirmed by the experimenters. Variability is inherent to any experimental data. Never remove outliers if you cannot understand why they are outliers. Last but not least: statisticians / biostatisticians often use an arbitrary threshold of 95%. This is a purely arbitrary decision, and it is possible that this leads to under/over fitting. If you want some help, just contact Pierre.lecocq@ngyx.eu.
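As a hedged illustration of that first pass (the file name, column names and allowed values below are invented for the example; the original answer used dedicated Perl code), a few lines of Python can flag unexpected values before any statistics are applied:

    import csv

    # First-pass check: are all values in the expected format, and are NULLs allowed?
    # 'age' must be an integer, 'group' must come from a fixed vocabulary (toy rules).
    ALLOWED_GROUPS = {"control", "treated"}

    def check_row(line_no, row):
        problems = []
        if not row["age"].strip().isdigit():
            problems.append(f"line {line_no}: bad age {row['age']!r}")
        if row["group"] not in ALLOWED_GROUPS:
            problems.append(f"line {line_no}: unknown group {row['group']!r}")
        return problems

    with open("measurements.csv", newline="") as fh:
        for i, row in enumerate(csv.DictReader(fh), start=2):
            for problem in check_row(i, row):
                print(problem)  # review these with the experimenters, do not silently drop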
Question
3 answers
To analyze customer behaviour and customer segmentation in telecommunication
Relevant answer
Answer
Certainly I found it a fascinating and rewarding field in which to work.
You can try. But the Telco will have to spend time and resources on anonymising the data, so I really doubt it.
Question
17 answers
Seeking suggestions.
Relevant answer
Answer
Dear Kavitha Boggula
I would like to share my thoughts, indirectly responding to the question of big data and social network analysis for the design of curricula in technical education. My understanding of your question is that you want to do in-depth social network data analysis as a basis for curriculum design in technical education.
To get connected to your question correctly, please define what social network data analysis means here, and add information on your research design, indicating your overall plan for obtaining answers to the research questions that will guide your study.
Attached is a university-industry model I developed. This is a model relevant to curriculum design.
Good luck.
Question
4 answers
The value of environmental data is widely appreciated in academic and government circles with a maxim ‘you can’t manage what you can’t measure’. Both academic scientists and policy advisors see the benefits of publicising their use of data to make decisions and these can be summarised from project meta-data and peer reviewed publications. I believe commercial users are different; they seek commercial confidentiality to give them competitive advantage. Does anyone know how I might assess how much data are used by business and for what purposes?
Relevant answer
Answer
You can look at the sustainability reports that are nowadays published by most multinational companies. You have to be careful, nevertheless, to know the level and independence of the auditing processes. You can find environmental data and news in these reports and analyse them. In this way you can also get an idea of the business use of environmental data.
Question
14 answers
Trying to categorize the data in a big data environment.
Relevant answer
Answer
I recommend Raven's Eye (http://ravens-eye.net) for big data.
It's available online in Software as a Service pricing packages. Its results are algorithmically derived and preserve the voice of your participants, so it reduces inadvertent interpretive bias. A variety of qualitative paradigms can be used with Raven's Eye (e.g., Phenomenology, Grounded Theory), as can quantitative and statistical approaches.
Raven's Eye is browser-based, so you can use it across multiple platforms (including smartphones and tablets). The graphic user interface is much more intuitive to learn than other such programs, although we also provide ample free tutorials, technicals, and practical information on the website.
Online graphs and downloadable results facilitate many types of visualization. There are plenty of examples online as well, including an analysis of everyday people's perspectives on the meaning of life, reasons for voter preference, analyses of U.S. political party platforms and inaugural addresses of U.S. Presidents.
Data can be analyzed in 65 different languages. This is a great function if you are interested in doing research with people who speak languages other than English. But Raven's Eye analyzes English as well.
Check it out! Big data users love the experience!
What are the recommended database systems for stream data?
Question
6 answers
I am searching for the best possible solution for storing stream data. I need your recommendations and scholarly suggestions about the topic.
Relevant answer
Answer
Usually the nature of stream data implies that the data stream is infinite, so storing all data might be complicated. A common approach is to use event sourcing, where data is stored in a NoSQL datastore such as MongoDB, Cassandra, or HBase. A good overview is provided by the lambda architecture from Nathan Marz of Twitter: http://www.drdobbs.com/database/applying-the-big-data-lambda-architectur/240162604. Quite often, streaming algorithms are applied to the data stream under fixed-memory assumptions, without storing the data at all: http://en.wikipedia.org/wiki/Streaming_algorithm
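As one concrete instance of the fixed-memory streaming approach mentioned above (a sketch, not tied to any particular product): reservoir sampling keeps a uniform random sample of an unbounded stream using constant memory.

    import random

    def reservoir_sample(stream, k):
        # Keep a uniform random sample of size k from a stream of unknown length,
        # using O(k) memory regardless of how many items arrive.
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(1_000_000), 5))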
How to avoid data replication in document-oriented NoSQL?
Question
8 answers
Like MongoDB?
Relevant answer
Answer
Can you be more precise about what kind of data replication you mean? Do you mean replication created by the database (e.g. a replica set for MongoDB) or user-created replication?
Question
7 answers
Web log mining sample data.
Relevant answer
Answer
Hello.
You'd better check this site first:
(a lot of different sample data, but these three might be useful)
Anonymous Microsoft Web Data
MSNBC.com Anonymous Web Data
Syskill and Webert Web Page Ratings
-----------------
Also, the sample data in this site might be useful;
Best of luck & best wishes.
Mete Eminagaoglu
Question
3 answers
We are currently gathering more data than we can process. This data occupies a lot of space in dedicated data centers. Moving it out for analysis is out of the question, especially for streaming data. Also, the analysis may require multiple sources of heterogeneous data. In this context the VM and the data extracted from the analysis would take far less bandwidth than the analyzed data.
Relevant answer
Answer
I doubt this would happen to any great extent commercially in the near future, for a simple reason. Data are typically private rather than shared, and migration of (live) VMs is also quite difficult. If the same party owns the data and the VM, typically migration is not called for (except in a small way like perhaps migrating a VM within the same data center), it is simply data in the cloud that is used by the VM there rather than locally.
Even with all the hype about Big Data, it is not common to find curated and commonly usable large data sets being made available in a cloud setting for many users to access. However, doing this would make data as a service possible in niche contexts, e.g., with genomics data and such. One positive of such data as a service could be addressing privacy or other issues without laborious agreements with and vetting of users; e.g., if patient confidentiality or personally identifiable information are concerns, then it may be better to keep and process the data in the cloud and let users see only the results, rather than let users have the data and fret over what they might do with it. This in some sense would be the reverse of the concern that is usually expressed, viz., how can a user trust what a cloud provider does with their data.
Question
20 answers
Big Data is a hot topic in the world today, due to its importance in shaping business trends and the body of knowledge. Petabyte-scale data poses great challenges to many processes of storage and dissemination. Indexing is one issue that has received a lot of attention from the community.
What indexing algorithms currently exist? What are the weaknesses of the current algorithms?
Relevant answer
Answer
The group of Professor Jens Dittrich works extensively on indexing in Hadoop. Here is the link to selected publications of his group: http://infosys.uni-saarland.de/research/publications.php
List of publications related to your question (all of them are from prof. Dittrich's group):
- Towards Zero-Overhead Static and Adaptive Indexing in Hadoop
- Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)
- Efficient OR Hadoop: Why not both?
- Elephant, Do not Forget Everything! Efficient Processing of Growing Datasets
- Efficient Big Data Processing in Hadoop MapReduce
Question
2 answers
I have a study paper to publish on Hadoop MapReduce. Is any permission required from the author of the original research paper before I publish a study paper based on it? I have cited that paper in my "References". Please help me regarding this issue.
Relevant answer
Answer
ACM Computing Surveys
Question
3 answers
I have been teaching RDBMS in an undergraduate database class. I would like to teach Big Data also. I want to make a smooth transition from RDBMS to Big Data. Can you suggest a textbook or good material which would give me the chance to compare RDBMS and Big Data? Please let me know.
Relevant answer
Answer
Yes, thank you so much for your prompt reply.
Question
1 answer
It is very important to keep a check on scalability and performance using APMs to solve big data problems. How important is the role of automated APMs (Application Performance Management) in solving Big Data problems?
Relevant answer
Answer
Support existing and new applications, voluminous data and storage performance
Which open source cloud platform is better for Hadoop?
Question
18 answers
I'm considering building an OpenStack, CloudStack or Eucalyptus cloud for big data analysis. Can anyone share their experience and give some suggestions and ideas?
Relevant answer
Answer
I'm currently using 4 nodes. Check this tutorial, it may help you. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Anyone have experience working with a large dataset in MATLAB?
Question
65 answers
I'm trying to do an experiment with a large, dynamic and heterogeneous dataset already available in MATLAB files, but the following error shows up: "Cannot display summaries of variables with more than 524288 elements". Googling the problem, it says the matrix is too large; try to put it in several Excel files. Using simply xlswrite('file.xls',array) gives the same problem in Excel: Error using xlswrite (line 220) Excel returned: Error: Object returned error code: 0x800A03EC. How can I place the matrix in different Excel files or sheets by specifying the maximum ranges available? It seems the csv format alleviates the problem, but if I try to put it in a csv file, some of my matrix contains characters, so it gives the following error: Error using dlmwrite (line 118) The input cell array cannot be converted to a matrix. Error in csvwrite (line 43) dlmwrite(filename, m, ',', r, c); Any suggestion is highly appreciated. Dataset address: http://leitang.net/heterogeneous_network.html
Relevant answer
Answer
If you point me the data set I might try to export it to excel for you. I just need a direct link to the data.
Question
9 answers
I would like to work on optimization of algorithms for classification, clustering and logistic regression in data mining over big data.
Relevant answer
Answer
Not sure what the scope is here, but here is something that may help.
We use big-data frameworks to run data-mining methods every day - but just want to let you know that the industry practice today for the combination is not a "synergy" - it is more of a "connector".
What this means is:
1. Right now there exists no distinct set of algorithms specially designed for big data (at least not something that can capture the essence of big data and take advantage of its CAP properties)
2. The killer is the "integration" between big data and data mining, rather than the individual algorithms' performances.
3. As Merry pointed out correctly, creating the algorithms and their implementation is not at all reduced in any way by the big-data facilitation. They still take as much time as before (8 months in Merry's case)
4. The big-data realization, especially in the context of data mining, is a more hardware-demanding task. It is not something that can readily be tested in a traditional lab/academic setting. Amazon, Rackspace etc. are a few helpers here - but a really cost-viable, scalable, decentralized open cluster is still a distant dream for big-data researchers. Consider, for example, the feasibility of testing your new recommendation engine's responsiveness for 1 million concurrent users in a lab setting, with enough detailed logging of its precision and recall.
Improving any of these can help the industry/research.
- GK
Question
2 answers
Please suggest. Thanks in advance.
Relevant answer
Answer
Thanks
Question
3 answers
How can we proceed with big data, and how can we develop an algorithm for that?
Relevant answer
Answer
If you are interested in trying out Hadoop, the easiest way is to download Cloudera's Quickstart VM - Hadoop, Impala, Hive, Pig, Oozie, Mahout, etc. are already installed. The VM also has working examples included.
You will need to read the hardware requirements and make sure your laptop meets them (Mac or PC, both are fine). Here is where you can download the VM (VMware, KVM, or VirtualBox):
After gaining some experience with the Quickstart VM examples, I would recommend reading this article which describes how to use Cloudera Impala to access GDELT (Global Database of Events, Language, and Tone): http://blog.gdelt.org/2013/11/06/fast-gdelt-queries-using-impala-and-parquet/
(Note: GDELT is suspended right now, but the data accumulated through Jan. 16, 2014 should be available to you.)
Question
3 answers
I want to make an application to classify or cluster data from Facebook. I have heard about augmented intelligence, but I don't know which methodology is used in artificial intelligence.
Relevant answer
Answer
Augmented Intelligence is the enhancement of human intelligence using some technological means such as eugenics, gene therapy, brain-computer interfaces, smart drugs, neuroengineering, or some other means.
Augmented Intelligence is usually considered as a futuristic technology. In designing next-generation intelligence applications to power the data-driven planet, it is vital to keep the following simple rule in mind…
Augmented Intelligence = People + Algorithms - Friction
Question
5 answers
Keeping semantic heterogeneity in mind, what are all the possible approaches?
Relevant answer
Answer
Yes, more information would be required in order to answer your question effectively. However, my intuition tells me that you're probably looking for the Simple Knowledge Organisation System (SKOS):
http://www.w3.org/2004/02/skos/
SKOS allows you to define things like broader and narrower relations between various concepts.
However, the WordNet RDF/OWL representation (as referenced by Samir above) is a more linguistic approach.
Happy to keep the dialogue going! Let me know if that helps!
Daniel Lewis
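A minimal sketch of the broader/narrower idea, assuming the rdflib Python package and made-up concept URIs (neither is part of the answer above):

    from rdflib import Graph, URIRef
    from rdflib.namespace import SKOS, RDF

    g = Graph()
    animals = URIRef("http://example.org/concepts/animals")  # hypothetical URIs
    mammals = URIRef("http://example.org/concepts/mammals")

    for concept in (animals, mammals):
        g.add((concept, RDF.type, SKOS.Concept))

    # 'mammals' is a narrower concept than 'animals'; SKOS lets you state both directions.
    g.add((mammals, SKOS.broader, animals))
    g.add((animals, SKOS.narrower, mammals))

    print(g.serialize(format="turtle"))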
Question
6 answers
I have found that Fuzzy C-means algorithms are good for big data or very large data sets, since they are based on soft partitioning. Can you test any clustering algorithm in MATLAB?
Relevant answer
Answer
FCM is already implemented in MATLAB; check the fcmdemo.m file. You can also find a K-Means implementation there, as well as hierarchical clustering techniques. In my experience, and I don't know what scale of big data you have in mind, iterative clustering techniques are memory consuming. Try finding something that requires a bit less memory; for example, Fast Fuzzy C-Means (FFCM) is a method based on FCM that requires less memory per iteration and less time.
If you can predict the shape of your dataset, KNN could sometimes be of great help. Genetic algorithms, or evolutionary techniques in general, can also do well at clustering big data if you have a powerful machine.
After all, there is a variety of approaches you may use!
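Outside MATLAB, a quick way to see the memory trade-off is a mini-batch variant; the sketch below assumes scikit-learn and synthetic data, and is only an illustration of iterative clustering that keeps memory bounded rather than a recommendation of a specific tool.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Synthetic "large" dataset: 1 million points in 10 dimensions.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000_000, 10))

    # Mini-batch k-means touches only batch_size points per iteration,
    # so memory stays bounded even as the dataset grows.
    model = MiniBatchKMeans(n_clusters=5, batch_size=10_000, random_state=0)
    labels = model.fit_predict(X)
    print(model.cluster_centers_.shape)  # (5, 10)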
Question
2 answers
I have data from four societies and I am trying to apply Social Data Mining, but I would like to explore any similar technique related with Big Data.
Relevant answer
Answer
Carlos: A clarification request. Social Data Mining is an application domain for data mining. Big Data refers to the kind of data being mined in some domain -- a term used to denote data volume, data velocity, data variety and such. So, if the data you have from the 4 societies qualifies as Big Data (which, to begin with, is a loose term denoting the 3 V's; some people also add a fourth V -- Data Veracity), then you are dealing with Big Data. Since you are dealing with societal data, you are also dealing with Social Data Mining. Neither is a technique -- an appropriate technique or techniques for mining Big Societal Data is a different issue, I think. Or am I misunderstanding your query?
Question
8 answers
I have proposed a special journal issue on 'Data Mining and Big Data' to the IOSR (International Organization of Scientific Research) journals. Those who are interested can contribute research/review papers. You can write your papers on topics such as data mining, databases, management of big data and other related fields.
Relevant answer
Answer
Yes, you can contribute your paper for the same by mailing me the manuscript for review process.
Question
5 answers
Country and industry specific insights are welcomed. Discussion on the challenges or impediments to the effective leveraging of these solutions would also be helpful.
Relevant answer
Answer
The first challenge that you will encounter is the willingness to open the data for analysis and interpretation. This will be tied to political reasons at a macro level. A more practical issue is the problem of hand-collected data and the error rates in this data. Whether in government or the private sector, the acquisition source and the data entry are major issues.
Another important issue is that depending on the level of organization of the country, the source of information will:
a) Lie on disparate sources
b) Use different technologies
c) May require normalization of fields
d) Be on different locations
e) Be constrained by different regional laws
This adds a complexity issue to any big data project in terms of logistics. Nonetheless, if you overcome these issues, I think people will be more willing to cooperate than in bigger countries.
Question
2 answers
Phylogenetic analysis seems beyond my capacity right now; however, there is so much information in my dataset of seed traits, along with climate and phenology, that any suggestion will be appreciated.
Relevant answer
Answer
Factor Analysis may help you to reduce information amount and show principal relationships between the measured traits.
Question
3 answers
I am wondering about the minimum size of data sets and the minimum length of recorded history needed to ensure reliability of analysis, as well as the variety of recorded parameters and the ability of small businesses to profit from any effort in this direction.
Small businesses usually have tight constraints on obtaining enough of the above.
Relevant answer
Answer
Big data can refer to things other than sample or data size, e.g., an analysis technique takes too much computing power to be able to run it in-house, even on "small" data.
Regardless of size of data, n=20 is probably as small as you'd want to go for *any* reason. http://www.health.ny.gov/diseases/chronic/ratesmall.htm
Reliability is another issue entirely if the point is to make comparisons between groupings of some sort. Then it depends on the relative variation; I use some metrics that require n>700/group to be reliable and some that aren't reliable even when n is around 10k. So the latter are used for informal comparisons instead. Here's one overview of reliability in the context of health care: http://www.rand.org/content/dam/rand/pubs/technical_reports/2009/RAND_TR653.pdf
Approaching the human collective (humanity) the same as neural networks (minds) and vice versa. Good idea or bunk?
Question
8 answers
(The following explanation is kept short for a quick overview. I can and will provide links to scientific papers on the mentioned topics on request.) This question is meant for me to gather ideas, cross-links and feedback before addressing the topic in depth in a paper. Discussion, contributions or even collaborations in regard to this topic would be highly appreciated.
I think humanity is a collective/organism similar to a neural structure making up a brain. The collective comes to a state of activity/inactivity or something in between (see quantum states) on any given ‚problem‘. Individuals - similar to single neurons - are highly relevant for the collective to ‚decide‘ on which state the collective will enter whenever observed (assumption: the collective ‚humanity‘ interacts with other collectives on its level). Looking at individuals that constantly reweigh (change strength of relationships/synapses), reconnect (creation and elimination of synapses), rewire (creation and elimination of neural branches/societal affiliations) and regenerate (creation and elimination of themselves) as part of a collective, we can assign relevancy to individuals. That is, an individual that causes several other individuals to ‚fire‘ the same vote is slightly more relevant for the collective to keep up than individuals who are not as strongly connected.
If we look at current advancements in Big Data Analysis, Quantified Self and Personality Prediction, an assignment of such relevancy to individuals takes place already. At the moment science is able to predict certain personality traits from publicly available data about individuals. From those personality traits we might soon make highly accurate predictions about individuals' choices. Taking this to a higher level, science might soon be able to predict results for political elections and similar kinds of crowd decisions.
As humanity is a collective of humans, each human is a collective of neurons, and possibly each neuron is a collective of a smaller unit (the context of humans, neurons and possible smaller units is important as well!). Thus I suggest looking at humanity as we look at neural networks, and vice versa. This might give us a deeper insight on both levels and possibly new insights into where humanity is heading.
Excursion: In discussion with a friend we came up with the following scenario. At the counter of Starbucks I order my usual ‚Venti Chocolate Mocha with Cream‘ and I am perfectly fine until the barista asks if I want a free chocolate chip cookie with either white or dark chocolate. ‚No‘, obviously, would be no choice. Thus my neurons collectively come up with a decision for a ‚white chocolate chip cookie‘. Transferring this to the human/humanity level: what if the purpose of what we are doing right now is to answer the question of white or dark chocolate preference? … Ah, well … never mind this excursion!
Relevant answer
Answer
While you might draw similarities between neural networks and the collective behavior of special interest groups within society, I would suggest that you abstract both a bit and handle them through network theory. In network theory you can use the diffusion equation to model the propagation of ideas through the network, or other models already existing in graph theory and the theory of networks. Another interesting approach is to use game theory on networks to model the observed behavior. From there, parallels to both architectures can be made based on a single solid area of research.
Question
5 answers
I am not able to find a way to convert the file so as to get all the MS/MS spectra into a text file.
Relevant answer
Answer
Sridhar, AB SCIEX offers a tool called MS Data Converter, a free download at http://www.absciex.com/downloads/software-downloads. It converts to mzML 1.0 and mgf. mgf is actually close to a txt-type representation of all MS/MS spectra after centroiding. With some editing, you might be able to use it for MassBank or similar.
Question
4 answers
What optimization algorithms are available for graph analysis in the context of big data?
Relevant answer
Answer
I will use real-world data, maybe social data or mobile data, so the data is going to be very big. Secondly, I am thinking of optimizing the computation on every node somehow, and I also want to reduce the cost of merging data across different nodes through MapReduce. So is there any algorithm or strategy which can fulfill these criteria and optimize the process? Is there any other strategy which can replace the MapReduce paradigm?
Do you know of a free automatic diagram software?
Question
16 answers
I have to do something very complicated: take all the countries of the world and create an Euler diagram that shows all the international organizations each country is part of. So I would like to know if there is a software tool or app in which I can just say that Egypt is part of the Arab League and of the African Union, and the program creates the two circles with Egypt in the middle, and so on, so that at the end I have the big picture of the whole thing. Any idea, please?
Relevant answer
Answer
The 'dot' program in 'graphviz' will do that. It's free and open-source.
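A tiny sketch of feeding dot that membership data, written here as plain Python that emits a .dot file; the country/organization pairs are just the example from the question, and note that Graphviz renders this as a membership graph rather than a strict Euler diagram.

    # Write a Graphviz .dot file of country -- organization memberships.
    memberships = {
        "Egypt": ["Arab League", "African Union"],
        "Morocco": ["Arab League"],  # illustrative extra entry
    }

    with open("memberships.dot", "w") as fh:
        fh.write("graph memberships {\n")
        for country, orgs in memberships.items():
            for org in orgs:
                fh.write(f'  "{country}" -- "{org}";\n')
        fh.write("}\n")

    # Render with: dot -Tpng memberships.dot -o memberships.png
    # (or neato/fdp for an undirected spring layout)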
Question
2 answers
LDCC is a Nokia project that collected mobile data from users for more than one year.
Relevant answer
Answer
Bad news for you: you surely cannot obtain the data, for confidentiality reasons.
The study designers and data holders seem to be very strict.
E.g. they report "... risking the data flowing to the hands of non-authorized instances" and "... at the back end to protect the integrity of the database as well as anonymity of the participating individuals." --- see http://research.nokia.com/page/11389
In my view it is rather impossible to get the data without a research project (and contracts) with them. Note that just deleting names in the data is not enough to anonymise it, since persons may also be identified through a combination of other variables.
Best,
Matthias
Question
4 answers
"Big Data: A Revolution That Will Transform How We Live, Work, and Think".
Until recently, big data made for interesting anecdotes, but now it has become a major source of new knowledge. Google is better than the Centers for Disease Control at identifying flu outbreaks. Google monitors billions of search terms ("best cough medicine," for example) and adds location details to track outbreaks. When Wal-Mart analyzed correlations using its customer data and weather, it found that before storms, people buy more flashlights but also more Pop-Tarts, even though marketers can't establish a causal relationship between weather and toaster pastries.
Relevant answer
Answer
It could be said that the big data trend is the inevitable consequence of widely available massive storage capability and of devices and systems which generate enormous logs of activity data. We could say that big data is activity data rather than static (customer record) data, because the number of customers of many businesses is certainly not growing exponentially! On the other hand, for data mining, sampling and/or filtering (to get "smaller data") is of course still valid for many modeling and analysis application domains, with exceptions such as outlier detection....
Question
27 answers
Which frameworks are available for BigData processing? Is any framework available involving workflow processing? What additional configuration of distributed environment is required to get real power out of Big Data processing?
Relevant answer
Answer
You can check the different big data processing paradigms and technologies at http://www.slideshare.net/Datadopter/the-three-generations-of-big-data-processing
Question
5 answers
.
Relevant answer
Answer
Many researchers use these: http://snap.stanford.edu/
Regards, David F. Nettleton.
Question
10 answers
Dear all,
TCGA repository contains data from RNA-Seq2 experiments (Illumina) for primary tumors and metastasis of melanoma. Where one could find RNA-Seq data for normal skin or (better) melanocytes? Or perhaps NHEM cell line?
Thank you in advance!
Best regards,
Peter
Relevant answer
Answer
Thank you! Indeed there are 2 samples available at ENCODE and 3 at SRA.
Question
3 answers
Information credibility depends on some of the information quality attributes, mainly accuracy, relevance, coherence and authenticity. Credibility is classified as a presumed, reputed, surface, or experienced property. Credibility is gained and lost based on the quality of the information being delivered as well as the perception and interpretation of that information. With the current explosive rate of data generated by different business processes, as well as the magnitude of information published on the web, Internet information is identified as a resource. These resources lack quality verification mechanisms. Achieving higher levels of information quality assurance requires formalization of information architecture and organization processes. Do you agree? If so, how should we resolve this problem?
Relevant answer
Answer
I think there is no way to have third-party verification throughout the Internet.
The solution is in media literacy of the general population. They have to learn how to scrutinize and question information.
The trouble is, many people do not want to do this because they select information which most fits their pre-conceived notions or worldview.
Question
25 answers
If the former, how can conventional databases handle the complexity, cognitive semantics, and knowledge implications of big data?
If the latter, what are the structural models and basic operational mechanisms of cognitive knowledge bases?
Relevant answer
Answer
Big Data can be successfully stored, manipulated and processed using relational database servers.
In Oracle there are some special object-oriented packages such as Multimedia, XML, Spatial and Topology. The first one is not very strong. Additionally, it is possible to create user-defined types, which can represent complex objects. It is possible to use PL/SQL or Java Stored Procedures (JSP). When we use Java classes, we obtain a very strong tool for processing objects on the database server side. An example of such an approach can be found in "Query by Voice Example and sound similarity based on the Dynamic Time Warping algorithm" (http://yadda.icm.edu.pl/baztech/element/bwmeta1.element.baztech-article-BPOM-0030-0003)
or "Implementation of MFCC vector generation in classification context" (http://yadda.icm.edu.pl/baztech/element/bwmeta1.element.baztech-article-LOD8-0002-0011).
Of course there are many publications by other authors.
In MSSQL, object representation is built in only for Spatial and XML data, but there exists a very strong mechanism of user-defined types using the CLR (Common Language Runtime). Objects can be built using any language of the .NET platform. I built some objects of such types, for example to represent network connections (social networks). Using user-defined aggregates I built an analytical (statistical) library. Unfortunately all descriptions of my work are in Polish. In all cases I obtained very high efficiency. You can find much useful information in the book Professional SQL Server 2005 CLR Programming (http://shop.oreilly.com/product/9780470054031.do).
I believe that both database servers give strong enough tools to let programmers and developers obtain any functionality necessary to store and process Big Data.
Adam
Can someone suggest a good reference on software analytics?
Question
8 answers
I search for references that could possibly be useful for students to understand the area of software analytics. If you use any references in your teaching regarding this topic and could share them, I'd be very appreciative.
Relevant answer
Answer
Maya, look for Tableau: http://www.tableausoftware.com/public/
Question
14 answers
Much of the discussion around big data has centered on technology, but organizations must ask themselves why they would like to pursue big data and what they hope to gain from implementing big data solutions.
Relevant answer
Answer
I think that it is often a question of confidence. Large amounts of data are now available from the Internet, but we lack information about their quality. Decisions based on wrong or incomplete data may be critical. A solution is to implement local and verified data, even if it's big!
Question
17 answers
What are the key problems we have to solve in the big data field for the next 5 years?
Relevant answer
Answer
The key problem is that the data sets (genomic and proteomic in particular) are growing at rates faster than Moore's law. This is only going to get worse with the next generation of gene sequencers, and will certainly continue for at least 5 years.
I find it hard to imagine a solution allowing "true" big data when the data is growing faster than computation. Pruning of "uninteresting" data quickly will be critical. I see some similarity with the LHC trigger system, where there is too much data generated by the particle detectors to process in detail or even store. So there is a hierarchy of "triggers" where things that seem uninteresting at first glance are discarded and possibly "interesting" events/data are passed on to the next level of trigger/filter. As each level has less data to deal with than the previous, more work can be done and more detailed processing can be accomplished.
Some method of very quickly discarding parts of the data that don't matter much is the key, IMHO.
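A toy sketch of that trigger idea (nothing LHC-specific; the thresholds and "events" below are invented for illustration): cheap filters run first on everything, and only survivors reach the more expensive analysis.

    def cheap_trigger(event):
        # Level 1: a fast, rough test that discards most of the stream.
        return event["energy"] > 10

    def expensive_trigger(event):
        # Level 2: a slower, more detailed test applied only to survivors.
        return sum(event["hits"]) / len(event["hits"]) > 3.0

    def pipeline(events):
        level1 = (e for e in events if cheap_trigger(e))
        level2 = (e for e in level1 if expensive_trigger(e))
        return list(level2)  # only these are stored / analysed in full

    events = [{"energy": e, "hits": [e % 7, e % 5, e % 3]} for e in range(100)]
    print(len(pipeline(events)), "events kept out of", len(events))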
Question
9 answers
I am looking for good conferences or workshops on the topic of data analytics and big data this year. Is there anything where the deadline for submission is not yet expired?
Question
5 answers
For an integrative database it is highly desirable to provide an easy search option. Can anyone suggest any examples?
Relevant answer
Answer
The best-looking interface depends on several factors, but you should especially consider spacing, positioning, size and grouping. All these factors have an impact on visual clarity. Also, the user should be notified about what is happening in the background.
What are excellent (world-class track record) academic groups/research institutes doing research on Big Data in Pakistan, Bangladesh, India, Thailand, and Malaysia?
Question
6 answers
...
Relevant answer
Answer
Hi Kenth, none in Bangladesh yet. I guess you are already familiar with what is happening at GP and GPIT. Regards, Adib
Question
3 answers
I am going to analyze big data from data centers. But first of all I want to know in what form (data type) data are stored in data centers (Facebook, Twitter, Amazon, etc.).
What parameters should I use for classification/grouping of big data?
Thanks
Relevant answer
Answer
The classification should be formed based on the goals of what you want to achieve (things you want to output, measure and control).
For example, if your goal is 'high-throughput' then you classify big-data based on the parameters that affect the 'high-throughput' such as 'latency involved in processing the data', 'frequency of data input' and so on....
Try to come up with the business goals first; then you can easily identify the classification and groupings.
- Gopalakrishna Palem
Question
29 answers
This is something I have been pondering lately after attending a number of related academic and industry-led events, yet no definition is ever made clear: The term 'Big Data' has become a very popular buzz word, yet researchers in many scientific and mathematical fields have been analysing and mining large datasets for many years (eg. satellite data, model data). Does it then, refer to big data in a social sciences or business context, or rather, does it more correctly refer to the increasingly accessible, ubiquitous, real-time nature of the multitude of datasets we are now exposed to (e.g. data from sensors, WSNs, crowdsourcing, Web 2.0)? Or indeed both?
Relevant answer
Answer
Michael Stonebraker currently is writing an interesting Blog@CACM series on the different aspects of Big Data. The first four parts are already available:
In these posts he addresses all the different facets of Big Data you are asking about.
Is there any literature on marginal utility of (big) data / information?
Question
15 answers
Everyone talks about big data, data as the "new oil", data as a fourth factor of production, yet I've found few articles on that topic. I'm looking for some empirical research on the marginal utility of big data or information in a BI context. My hypothesis is that the marginal utility of data used for BI/BD purposes is far too small to make adoption profitable for the enterprise if its business model isn't already based on data itself or doesn't show very good potential in terms of scalability (e.g. millions of users/customers to which I can apply the knowledge I derived from my data).
Relevant answer
Answer
In econometrics, the larger the dataset, the smaller the effects that can be identified. This however raises the question whether an estimator is 'relevant', even if it is statistically significant. And in the extreme, when a sample approaches the population, the significance tests and p-values lose their meaning. There has recently been a discussion of this topic in the microeconometrics literature. Turning from estimation to (micro)simulation, there is a clear trend towards large, administrative datasets. This makes it possible to simulate counterfactuals or policy effects at a smaller level, i.e. with more precision. It also allows for a more precise identification of "winners" and "losers". But I do not know of any literature on this subject...
Question
6 answers
We consider a data stream which is not necessarily stationary. We would like to label instances on the fly.
Relevant answer
Answer
It really depends on your type of data. I think all classification algorithms can (and are mostly intended to) do online classification if you have enough training data available; it is just a matter of which one has the best performance on your data. Also, do you want to build a model with training data, save the model, and apply it to new data, or alternatively have the classification model trained and applied online? What data are you dealing with? Did you try any algorithms "off line" first?
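If you do want to update the model online rather than retrain in batch, here is a hedged sketch (assuming scikit-learn; the stream below is synthetic) of an estimator updated incrementally with partial_fit:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(1)
    classes = np.array([0, 1])
    model = SGDClassifier()  # linear classifier trained incrementally

    def make_batch(n=100):
        # Synthetic two-class stream: the label depends on the sign of the first feature.
        X = rng.normal(size=(n, 5))
        y = (X[:, 0] > 0).astype(int)
        return X, y

    for step in range(50):  # 50 mini-batches arriving "on the fly"
        X, y = make_batch()
        model.partial_fit(X, y, classes=classes if step == 0 else None)

    X_test, y_test = make_batch(1000)
    print("accuracy:", model.score(X_test, y_test))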
Is anyone using Apache Mahout for their research?
Question
20 answers
I'm looking for people working with Apache Mahout in their research.
Relevant answer
Answer
I'm one of the developers of Mahout and I'm always interested in how people use it
Question
2 answers
Telecommunication data
Relevant answer
Answer
This document from Strategy& — a consulting team at PwC — will give you immediate answers and a basis for further studies.
Question
1 answer
Is it possible to extract large datasets from Facebook and import them into a Hadoop file system? I think there is a limit on the amount of data that can be queried using FQL.
Relevant answer
Answer
Each query of the stream table is limited to the previous 30 days or 50 posts, whichever is greater; however, you can use time-specific fields such as created_time along with FQL operators (such as < or >) to retrieve a much greater range of posts.
Question
1 answer
Since data is stored only in the data node and requires copying of data back and forth from the name nodes, copying takes much more time than the analysis.
Relevant answer
Answer
I find the following paper interesting
Improving MapReduce Performance in Heterogeneous Environments,
Question
36 answers
Which is going to be a game changer in future research?
Relevant answer
Answer
I think cloud computing and big data are the future research areas when it comes to distributed computing. Data sizes are becoming bigger and bigger, and storage facilities are becoming a huge concern. This is where the cloud comes in, to help with easy storage and accessibility. Both research areas are making waves at the moment and will surely be around for, say, the next 5 years to come.
Question
2 answers
Currently all the giant companies are using NoSQL databases, like HBase, Cassandra, DynamoDB, MongoDB, or Google's Bigtable, to serve their customers. Most are open source and capable of handling Big Data requests, but all of these are still evolving, so is there a need for standardization to maintain ACID properties and rules for data access?
Relevant answer
Answer
Most of these databases were born because their specific use cases were not easily doable with standardized SQL databases.
So a standardization of NoSQL databases makes no sense, because each of them is built for a specific use case.
Question
3 answers
Does the reality that our computers do NOT understand a single word we use lead to the informational chaos we call big data? Would you, therefore, agree that the solution lies in better or newer artificial intelligence?
Relevant answer
Answer
Big data challenge has two facets:
1. The first one is the data itself. Regardless of the data size, the trustworthiness and accuracy of the data are key factors. Even the homogeneity of accuracy could be questionable for a dataset, because it may be sourced from, and manipulated by, many actors.
Analysis of a data with only two records, such as {x=-1, x=1}, could be a challenge even for a human analyst. Either one data record should be discarded, or an oscillation behavior should be concluded.
2. The second facet is the extremely large number of records. Here, computers come into play in order to automatically perform the analysis rules that human analysts have developed. This is done by translating those rules into computer code. Let's forget about storage, manipulation, and other implementation challenges for now.
As you said, the imperfectly translated code is not exactly the same as the original rules, and this could result in big computation overhead and unexpected results. A higher level of human-computer interaction in the form of a programming language, for example, could bring the software code much closer to the original human reasoning.
Beyond that, but not in a near future, artificial intelligence could reach such a level of development at which it could create its own reasoning rules, instead of translating those developed by human.
Question
6 answers
What are the methods or best predictive methods to use for this kind of data?
Relevant answer
Answer
Check the proceedings of the "Dependable Systems and Networks (DSN)" conference series. You will find a couple of good ideas there. Another starting point is the "Computer Failure Data Repository (CFDR)", which offers HPC system logs and the corresponding analysis papers.
Question
10 answers
Cloud and big data
Relevant answer
Answer
Xavier is right, it depends on the kind of data and what operations you are supposed to apply to it. MapReduce is great for indexing/sorting data, but it only achieves good parallelization levels if the data is really splittable and may be combined/recombined (i.e., the reduce part). As a rule of thumb, only use MapReduce if your data set (and operations) fit the "divide and conquer" paradigm.
Just a warning: MapReduce implementations like Hadoop are strongly dependent on the input file format, so if you plan to use something other than text files you may need to implement your own parsers/splitters.
Question
3 answers
Does anybody know a solver for a large scale sparse QP that works on the GPU?
Or, more generally, can a GPU speed up solvers for sparse QPs?
Relevant answer
Question
13 answers
I'm looking for the difference between a data stream and a large data set, based on the occurrence of concept change.
Relevant answer
Answer
I think your question should be: what are the differences between static and dynamic data?
In the case of static data, regardless of its amount, your prediction or classification model is also static, i.e. it does not change over time.
In the case of dynamic data, there might be concept drift, so the classification or prediction model must be updated accordingly. To read more about this topic you can Google "concept drift tutorial" or "concept drift detection". Most of the work on this topic treats dynamic data as a data stream; a small monitoring sketch follows below.
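As a rough illustration of drift detection (not any particular published detector), the sketch below monitors a model's error rate in a sliding window over the stream and flags drift when the recent rate rises well above the rate seen so far; the window size and threshold factor are arbitrary choices.

```python
# Hedged sketch of a simple drift monitor for a prediction stream.
from collections import deque

class SimpleDriftMonitor:
    def __init__(self, window=200, factor=2.0):
        self.window = deque(maxlen=window)   # recent prediction errors (0/1)
        self.total_errors = 0
        self.total_seen = 0
        self.factor = factor

    def update(self, y_true, y_pred):
        error = int(y_true != y_pred)
        self.window.append(error)
        self.total_errors += error
        self.total_seen += 1
        overall_rate = self.total_errors / self.total_seen
        recent_rate = sum(self.window) / len(self.window)
        # drift is suspected when the recent error rate is much worse than overall
        return (len(self.window) == self.window.maxlen
                and recent_rate > self.factor * max(overall_rate, 1e-6))
```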
Question
5 answers
Say, for example, I need 100 TB of data to do some experiments on Big Data. How do I obtain it once I find a source? What medium is usually used here?
Relevant answer
Answer
Every day, Facebook receives 2.7 billion Likes, while 2.5 billion content items are shared on the social networking site. It uses Hadoop to power many of its features, like messaging, as well as to optimize its advertising performance and to conduct data analysis. With Hadoop's data analysis techniques, it determines the effectiveness of features or advertisements against each other for specific demographics, and also leverages the results to tweak features and improve targeting.
Facebook uses Hive, an open source project created by Facebook that is the most widely used access layer within the company for querying Hadoop with a subset of SQL, and HiPal, the social network's homegrown, closed-source end-user tool. It needs all of these to handle and analyze its gigantic volume of data. While Hive gives Facebook business intelligence, HiPal complements it by enabling data discovery, query authoring, charting, and dashboard creation in graphical form.
Question
30 answers
What are the challenges with existing relational data management models? And do you think we will need a new form of data representation to better perform distributed processing?
Relevant answer
Answer
Depends very much on what you are using it for. As a conceptual/logical data model for data modeling? As a lower-level storage and transformation model? In my experience these discussions have a very high risk of degenerating into religious debates if you don't clarify that there is a difference. Note that the original claim of usefulness of the relational model is based on its position at the intermediate level between the conceptual and storage models and the fact that it more or less joins the two smoothly. But I would argue that these days it has become clear that this is no longer necessary. So is the reign of the relational model about to end? I don't think so.
Question
24 answers
Big Data is one of the buzzwords of the year, and we have heard arguments along the full spectrum – from having so beautifully many data available that we can solve huge societal challenges, over worries concerning data archiving, to statements such as 'a tsunami full of rubbish that nobody will be able to use' or (more scientific) doubts about being able to deal with the huge heterogeneity in semantics. Now, I would like to get your opinion on where this debate has led us and where it might go.
Relevant answer
Answer
I agree on this one, Raphaël.
There is a trend towards having more and more (heterogeneous) data available from diverse sources, and it is not a bad thing to name this 'phenomenon'. In the end, this will not (or rather should not) alter our daily research work. Still, apart from possible disruptions, we might in some cases benefit from newly released information - if we find it ;-)
By the way, I gave up chasing a single model (and format) that suits all. In the end, data is collected and optimized for a specific purpose. If you intend to use it for something else, you have to live with some conversion work and be careful about its fitness for a purpose other than the one it was initially collected for.
Question
4 answers
Is there anyone who can suggest a way or a site to find Twitter data that contains tweets and user locations? Or is there anyone who has such data?
Relevant answer
Answer
Hi,
Podargos can help you find data from Twitter using machine learning. All you have to do is tell them your needs.
It can provide historical and real-time Twitter data. In addition, it can help you retrieve data from other mobile apps for social networks and e-commerce, and it supports nearly 100 languages.
Podargos's website is http://www.podargos.com
Here is the data sample: 
Question
2 answers
Here are the specifics:
The other services are based on the id of each database and can be accessed like this
*Note we have aligned some data sources into single views like nervous system connectivity (nif-0000-07732-1), annotations (nlx_149407-1) or research animals (nif-0000-08137-1). These sources envelop other sources and those source ids like BAMS (nif-0000-00018) will respond to the identifiers in the query parameter (q=), but not in the source parameter (/nlx-144509-1).
*For the less xml literate the same data can be played with in a gui: http://neuinfo.org/mynif/search.php?q=*
It would be helpful to know if anyone in neuroinformatics would find this useful, and if you don't I would love to know why.
Relevant answer
Answer
Hello Anita, do you have a list of the data sets you have in your database, other than (http://nif-services.neuinfo.org/servicesv1/v1/summary?q=*), and a protocol for accessing them? Thanks!
Question
3 answers
Prof. Michael Yaffe’s article (http://stke.sciencemag.org/cgi/reprint/sigtrans;6/269/pe13.pdf) about the scientific drunk and lamppost defines an addiction to extensive sequencing data despite the relative lack of novel insight it provides. Why not ask NIH to do the experiment and put funding into comparative tests between different research priority strategies?
Relevant answer
Answer
I'm not speaking about the NIH specifically, but will remain more general, as I do not have NIH-specific experience.
First, rational prioritization generally does not occur in decision-making bodies, as a number of studies have shown over the past several decades. The range of factors that affect a funding-allocation decision in government and institutions includes anchoring and hysteresis (otherwise the failure to link chained decisions), politics and the potential impacts of funding or change, risk aversion in allocation versus risk seeking for chances of breakthroughs, and so on. Second, decisions are often internally complex, due to the constitutive elements in proposals and protocols, with inherent trade-offs.
Since decisions in organizations are often not very open to outside accountability, there are several ways to improve them. These include oversight and audit, research into how decisions are made (feeding back to improve decisions), and allocating by working from the bottom up rather than the top down. For the latter, productivity is increased if the worst allocation is replaced by anything that is better in terms of productivity; that is, you don't have to compete to be the best to be funded, you just have to be better than all of the losers 'plus one' (that one funded, but inferior to your proposal). Sequentially replacing bad decisions with slightly better decisions is often easier to do, and it improves productivity. Token trained social representatives and volunteer patients are insufficient to improve social choice or allocative decisions.
Part of my MSc thesis addressed these issues, though my focus was on willingness-to-pay/play as a neoclassical welfare economic model, but also as one that can include externalities, psychology, and so forth (1996).
A meta-analysis of research funding in over 40 countries found that GDP growth was greater when 30% of research funding was allocated annually to riskier research projects where breakthroughs might happen; there was also some impact of public vs. private funding on GDP growth. The results were presented at a one-day EU-Canada symposium in Toronto around 2004 addressing the next 5-billion-euro five-year research plan, in which most of the funding was allocated to genomics, Europe being worried about US and Japanese progress up to that point. Unfortunately I don't have a specific reference, but I'm sure it was eventually published somewhere, as it was important international economics research.
Question
90 answers
Predictive analytics tells us what will happen; prescriptive analytics tells us what to do about it
Relevant answer
Answer
To begin with, predictive analytics is part of data mining, within data warehousing. It brings out the hidden patterns inside the data and helps to improve the bottom line of a business by providing competitive advantage. Major users are financial institutions, telcos, and actuarial sciences, among other progressive organisations. The trend is towards more business-user-friendly analytics. SAP's purchase of KXEN is just one example of the trend from classic analytics towards automated analytics.
Question
2 answers
Hadoop seems to be popular for ETL tooling and big data, and can be considered for historical databases and data warehouse systems. How can it act when OLTP transactions are required?
Relevant answer
Answer
Speaking as someone working in a Hadoop-based startup: if you want serious OLTP, go get a distributed in-memory database.
The fact is, the main memory of a few machines is easily big enough even for the largest OLTP databases. In fact, most OLTP databases can fit in memory easily on a single machine.
The Hadoop distributed file system, HDFS, is inherently unsuited for OLTP use. Even implementing a transaction log on HDFS is not very satisfactory, due to the very limited rate at which you can commit changes.
The MapR distribution for Hadoop includes an alternative file system which could be used for such a system.
Question
12 answers
Big Data has become one of the framing slogans of this year, data has been called the new oil, and e-Researchers and (digital native) Data Scientists are in demand all over the globe. While I kind of understand that statisticians will be the rock stars of tomorrow, I am wondering what the appropriate technological enablers for Big Data analysis and visualization would be. If you are working in this area - or any related field - which tools do you propose for exploring the ocean of big (or small) data?
Relevant answer
Answer
What is essential is understanding the nature of the data. Nothing beats Exploratory Data Analysis. See John Tukey's Exploratory Data Analysis. Addison-Wesley Publishing Co., 1977.
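For instance, a first exploratory pass in Python might look like the sketch below; the synthetic DataFrame is only a stand-in for a real dataset.

```python
# Minimal exploratory data analysis sketch in the spirit of Tukey.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# synthetic stand-in for a real dataset; replace with pd.read_csv(...) in practice
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temperature": rng.normal(20, 5, 1000),
    "pressure": rng.normal(1013, 8, 1000),
    "failures": rng.poisson(2, 1000),
})

print(df.describe())           # location, spread, and quartiles per column
print(df.isna().mean())        # fraction of missing values per column
df.hist(bins=40)               # marginal distributions
pd.plotting.scatter_matrix(df) # pairwise relationships
plt.show()
```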
Question
4 answers
We're looking to visualize twitter sentiment and twitter user trust in relation to a particular suite of products or services
Question
16 answers
The Graph Databases and the Pre-relational Network Model look similar in their structure, but there should be differences.
Relevant answer
Answer
One should distinguish between a graph model as a paradigm and a graph database model as architecture. The former is a general metaphor or paradigm many existing models conform to: object-oriented, entity-relationship, RDF and many others. The latter is a specific view on database architecture where this graph-orientation is in the focus. In particular, graph databases provide significant support for traversal operations. In this sense, both graph databases and the network data model are specific views on and implementations of the graph-based paradigm. Other paradigms include the relational model, logic-based models, hierarchical models, lattice and partial order based models (see the concept-oriented model).
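As an illustration of the traversal emphasis (this is a toy sketch, not any particular database's API), the code below finds everything reachable within two hops of a node over a plain adjacency list; in a relational store the same query would typically require repeated self-joins.

```python
# Breadth-first traversal over a tiny in-memory graph (illustrative data).
from collections import deque

edges = {"alice": ["bob", "carol"], "bob": ["dave"], "carol": [], "dave": []}

def within_hops(graph, start, max_hops):
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen - {start}

print(within_hops(edges, "alice", 2))   # {'bob', 'carol', 'dave'}
```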
Question
8 answers
For cluster data analysis, query manipulation, and cloud environment analysis in the context of a healthcare dataset.
Relevant answer
Answer
Hadoop is open-source software for reliable, scalable, distributed computing, while CloudSim is a framework for modeling and simulating cloud computing infrastructures and services. In my view it is better to work on Hadoop, which gives you direct experience of working on distributed computing; it is also open source and supports different environments.
Question
2 answers
Managing Big Data in Cloud Computing Environment poses a great challenge. During the days of Distributed Computing, we were busy with transaction management across the participating building blocks, maintaining referential integrity across the distributed environment, etc.
Relevant answer
Answer
Since a conventional RDBMS deals only with structured data, maintaining referential integrity over Big Data is a challenge.
Big Data is mostly semi-structured or unstructured, hence NoSQL databases were designed for handling it.
Question
14 answers
For objective-function based clustering such as k-means clustering, what's the most practical and useful method of choosing k?
First of all, does there exist a single method that is most commonly used for choosing k, regardless of your input data? If not, what is the common method for a certain kind of data? For example, when you have a large amount of data, does a resampling/stability-based method seem to work better? When you have a small amount of data, does a criterion-based method work better (e.g., the Bayesian Information Criterion)? Or do you simply apply the clustering algorithm for each k and use the knee-finding approach?
When you choose a specific method of choosing k, do you take into account which clustering algorithms you are using? Or do you only take into account what kind of data you have? Or neither? Why or why not?
In general, do you believe that choosing k is a key process for clustering? Why or why not?
Relevant answer
Answer
The goal of finding the right number of clusters is important and interests me very much.
I studied this topic during my PhD, so you can take a look at my publications; there are some that treat this.
There are a lot of clustering methods that rely on this initial number of clusters, so it is, I think, important to know this number.
First of all, in some practical cases you can ask the experts on the data you use how many groups are expected. But as scientists they will probably tell you to find out by yourself, to check whether something new can be revealed by your methods. I prefer to let the data speak for itself.
I think there are three main approaches (a small silhouette-based sketch follows below):
- the use of statistical criteria: AIC, BIC, Tibshirani's GAP (which I like very much), MDL;
- the use of classification indices like those mentioned in some previous posts: Calinski and Harabasz, Hartigan, Krzanowski and Lai, Kaufman and Rousseeuw (Silhouette), Dunn, Davies and Bouldin (OK, there are dozens of various indices), etc.;
- the use of specific methods that have been specially developed for one algorithm, in your case k-means: think of X-means, G-means, PG-means, Bayesian k-means.
If you want references to papers, please ask.
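As a small example of the second family (classification indices), the sketch below sweeps candidate values of k and picks the one with the best silhouette score, using scikit-learn on synthetic data; the silhouette is only one of the many indices listed above.

```python
# Choosing k by maximising the silhouette index over a range of candidates.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # synthetic data
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> chosen k:", best_k)
```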
Question
8 answers
The amount of data that is processed annually has exceeded the zettabyte boundary. This is a number that would only have been mentioned in highly theoretical articles two or three decades ago. Such an insurmountable amount of data gave birth to a new term: big data. What do you think is the most important tool that will allow us to handle this explosion of data?
a) Do you think it is the increase in the performance of CPUs, which is currently significantly slower than the growth of data?
b) Do you think it is the new programming languages being introduced, which will make processing this data much easier?
c) Do you think it is some novel data analytics algorithms that will shed light on significantly easing the handling of data?
d) Or anything else?
e) Or are we dead in the water with no hope?
Relevant answer
Answer
These are practical approaches to dealing with Big Data:
(1) MapReduce (Hadoop)
(2) BSP (Apache Hama)
(3) Storm (real-time processing)
Question
5 answers
In medical or dental cosmetic surgery, people are seeking beauty…
Relevant answer
Answer
I have done some research with regard to aesthetics in chess. The model developed seems to work quite well and correlates positively with human expert assessment. No 'machine learning' involved. I suppose a similar discrete computational model could be adapted for other purposes.
Question
4 answers
Research topics required on MANETs and Big Data.
Relevant answer
Answer
I would think that you can link Big Data to MANETs. One method would be to run large simulations and then perform scientific analysis on the very large trace files that are generated. Some typical examples include failure analysis, data reduction and compression with an acceptable loss of properties, and entropy-based approaches to reducing large data sets applied to MANET traces; a small entropy sketch follows below.
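As a toy illustration of the entropy idea (it assumes the trace has already been parsed into records, and the records shown are made up), the sketch below computes the Shannon entropy of each field so that near-constant, uninformative fields can be dropped before further analysis.

```python
# Shannon entropy per field of a (synthetic) trace, as a data-reduction filter.
import math
from collections import Counter

def shannon_entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

trace = [
    {"node": "n1", "event": "send", "proto": "AODV"},
    {"node": "n2", "event": "recv", "proto": "AODV"},
    {"node": "n1", "event": "drop", "proto": "AODV"},
]
for field in trace[0]:
    h = shannon_entropy([rec[field] for rec in trace])
    print(field, round(h, 3), "(candidate for removal)" if h == 0 else "")
```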
Question
3 answers
There are a lot of High Energy Physics simulation scenarios. Do you know examples of such scenarios that come very close to Big Data?
Relevant answer
Answer
Hello Florin. Considering that the amount of simulated data modelling the processes we are interested in (final states of particle decays) is huge and that we use dedicated storage elements for this purpose, and considering also that I am working on a very rare decay where we cut more than 95% of all events and still end up with terabytes of data, the answer is yes: we are dealing with simulations that produce huge amounts of data, which we then reduce with selections. My example is from the CMS experiment at the LHC. I was not familiar with the term Big Data, but when I read about it on Wikipedia the first example was exactly the HEP experiments.
Hope this helps!
Question
4 answers
I was thinking that research in Hadoop is actively pursued by the "big" cloud companies, but that academia is lagging behind, maybe because of the infrastructure needed for testing implementations.
Relevant answer
Answer
This link gives some research areas for students in college who want to work in Hadoop.
Question
9 answers
For example, iterative algorithms such as simulated annealing, SVM, ...
Relevant answer
Answer
Hi Defu, you should have a look at Stratosphere.eu; it supports iterative algorithms on large clusters.
I attached the paper describing it! (This is the full paper: http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf)
Please contact me if you have further questions.
Question
16 answers
In a data set I know that there are specific values which have a greater effect than others, so I want to give them a higher weight. However, if I try to assign it manually, it would be difficult to justify; if I don't, they get the same weight as all the other values in the calculation. So please share your experiences if you know how to assign higher weights mathematically, and some alternatives to OWA.
Relevant answer
Answer
A good overview of applications of the Choquet integral is given in:
M. Grabisch, M. Roubens, "Applications of the Choquet integral in Multicriteria Decision Making", in M. Grabisch, T. Murofushi, M. Sugeno (Eds.), Fuzzy Measures and Integrals: Theory and Applications, Springer-Verlag, Berlin, 2000, pp. 348-374.
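Since the Choquet integral generalises OWA, a minimal OWA sketch may help to see what the question is about: the values are sorted before the weights are applied, so the weights attach to rank positions rather than to particular inputs. The weight vectors below are illustrative only.

```python
# Ordered Weighted Averaging (OWA) aggregation sketch.
import numpy as np

def owa(values, weights):
    v = np.sort(values)[::-1]           # values in descending order
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                     # weights must sum to 1
    return float(np.dot(w, v))

x = [0.2, 0.9, 0.5, 0.7]
print(owa(x, [0.4, 0.3, 0.2, 0.1]))     # "optimistic": emphasises large values
print(owa(x, [0.1, 0.2, 0.3, 0.4]))     # "pessimistic": emphasises small values
print(owa(x, [0.25, 0.25, 0.25, 0.25])) # reduces to the arithmetic mean
```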
Question
45 answers
Due to the fast development of hardware, fast, high-capacity RAM is now available. So what does an in-memory database actually mean?
Relevant answer
Answer
An in-memory database is a kind of data storage where the whole dataset is located in main memory (RAM). The concept is quite old: the first scientific publications can be found in 1984, and some initial in-memory DB implementations appeared in the 1990s. However, it is still not a deeply researched area, and only now is it becoming more popular with the tendency towards less and less expensive DRAM storage on the market.
There is little hope that the speed of disk I/O operations will rise significantly. See the presentation by Jim Gray, "Tape is Dead, Disk is Tape, Flash is Disk, RAM Locality is King".
Here you can find more "In-Memory Key Concepts" from researchers who are standing behind SAP HANA http://epic.hpi.uni-potsdam.de/Home/InMemoryDataMgmt
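As a tiny illustration of the concept (not of SAP HANA or any production system), Python's built-in SQLite engine can hold an entire database in RAM; real in-memory DBMSs add columnar storage, durability mechanisms, and much more.

```python
# The whole database lives in RAM and disappears when the connection closes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 120.0), ("US", 95.5), ("EU", 40.0)])
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
conn.close()
```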
Question
120 answers
NoSQL is a hot topic in database design nowadays. What makes them different from each other?
Relevant answer
Answer
An RDBMS is a completely structured way of storing data, while NoSQL stores data in an unstructured (or semi-structured) way.
Another main difference is that in an RDBMS the amount of data stored mainly depends on the physical memory of the system, while with NoSQL you don't have any such limits, as you can scale the system horizontally.
"Extremely large datasets are often event based transactions that occur in chronological order. Examples are weblogs, shopping transactions, manufacturing data from assembly line devices, scientific data collections, etc. These types of data accumulate in large numbers every second and can take a RDBMS with all of its overhead to its knees. But for OLTP processing, nothing beats the combination of data quality and performance of a well designed RDBMS."
NoSQL is a very broad term and typically is referred to as meaning "Not Only SQL." The term is dropping out of favor in the non-RDBMS community.
You'll find that NoSQL databases have few common characteristics. They can be roughly divided into a few categories:
key/value stores
Bigtable inspired databases (based on the Google Bigtable paper)
Dynamo inspired databases
distributed databases
document databases
This is a huge question, but it's fairly well answered in this Survey of Distributed Databases.
For a short answer:
NoSQL databases may dispense with various portions of ACID in order to achieve certain other benefits--partition tolerance, performance, load distribution, or linear scaling with the addition of new hardware.
As far as when to use them--that depends entirely on the needs of your application.
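As a toy illustration of why key/value stores scale horizontally (this is a sketch, not any real system's API), the code below hashes each key to one of several independent "nodes", so adding nodes spreads both data and load; production systems add replication and consistent hashing on top of this.

```python
# Minimal hash-sharded key/value store; each "node" is just a dict.
import hashlib

class ShardedKV:
    def __init__(self, n_nodes=3):
        self.nodes = [dict() for _ in range(n_nodes)]

    def _node(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]   # key -> owning node

    def put(self, key, value):
        self._node(key)[key] = value

    def get(self, key):
        return self._node(key).get(key)

store = ShardedKV()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```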
Question
18 answers
Recently San Diego State University initiated a new research cluster, called "Human Dynamics in the Mobile Age" (HDMA). I am the coordinator of this cluster. We are trying to figure out a simple and easy-to-understand definition for Human Dynamics. What's your own definition of "human dynamics"? Do you like this term or not?
Relevant answer
Answer
A concise definition of human dynamics could be the following: Human dynamics is a branch of complex systems research in statistical physics. Its main goal is to understand human behavior using methods originally developed in statistical physics.
According to professors Tao Zhou, Xiao-Pu Han and Bing-Hong Wang, the quantitative understanding of human dynamics is one of the most under-explored areas of contemporary science. Quantitative understanding of human behavior provides elementary but important comprehension of the complexity of many human-initiated systems. A basic assumption embedded in previous analyses of human dynamics is that its temporal statistics are uniform and stationary, so that they can be properly described by a Poisson process; accordingly, the inter-event time distribution should have an exponential tail. Recently, however, this assumption has been challenged by extensive evidence, ranging from communication to entertainment to work patterns, that human dynamics obeys non-Poisson statistics with heavy-tailed inter-event time distributions.
We are witnessing rapid changes, however, thanks to the emergence of detailed datasets that capture human behavior, which allow us to follow specific human actions in ultimate detail. Interest in the subject is driven by a need for a general understanding of complex systems. Currently social systems offer some of the best mapped datasets on the dynamics of any complex system, when the action of each component (individual) can be followed in ultimate detail.
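A quick numerical illustration of the Poisson vs. heavy-tail point: for exponentially distributed inter-event times the standard deviation roughly matches the mean, whereas a heavy-tailed (here Pareto) sample shows a hugely inflated spread and extreme upper percentiles. The distribution parameters below are arbitrary.

```python
# Compare exponential (Poisson-process) gaps with a heavy-tailed sample.
import numpy as np

rng = np.random.default_rng(0)
poisson_gaps = rng.exponential(scale=1.0, size=100_000)
heavy_gaps = rng.pareto(a=1.5, size=100_000) + 1   # Pareto tail, alpha = 1.5

for name, gaps in [("exponential", poisson_gaps), ("heavy-tailed", heavy_gaps)]:
    print(name,
          "mean:", round(gaps.mean(), 2),
          "std:", round(gaps.std(), 2),
          "99.9th pct:", round(np.percentile(gaps, 99.9), 1))
```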
Question
3 answers
Recently, we released a framework called windML (http://www.windml.org), which provides easy-to-use access to wind data sources for Python, building upon numpy, scipy, sklearn, and matplotlib. It contains data mining methods and examples for various learning tasks like time-series prediction, classification, clustering, dimensionality reduction, and related tasks. We currently use two data sets, i.e., the NREL western wind integration study and an Australian data set named AEMO. The NREL data set is really awesome, with 10-minute data of 32000 wind turbines for 3 years, but it is partly based on simulations. Does anyone know of another (large, public) data set with spatially distributed wind turbines? Thanks in advance.
Relevant answer
Answer
Dear Oliver.
Are you familiar with this paper:
A. Panangadan, S-S. Ho, A. Talukder "Cyclone Tracking using Multiple Satellite Image Sources" GIS 2009
No guarantee it is public domain, but it is worth asking the authors.
Best regards,
Myra Spiliopoulou
Question
8 answers
Does anyone know of a database of estimates of the number of species/taxa for every (or every terrestrial maybe) 1 degree grid square across the earth?
Many thanks in advance,
Chris
Relevant answer
Answer
Although it will not cover all taxa, and you will need to be careful about the accuracy of the data at small spatial scales, you might start by looking at the IUCN range maps. You could overlay the ranges for all species of interest on a grid to get a count of the number of species per cell.
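As a sketch of the gridding step only (it assumes you have already reduced the data to species occurrence points; real IUCN range maps are polygons and would need a spatial overlay, e.g. with geopandas, instead), the code below counts the number of distinct species per 1-degree cell.

```python
# Species richness per 1-degree grid cell from (species, lat, lon) records.
import pandas as pd

occ = pd.DataFrame({                       # hypothetical occurrence records
    "species": ["A", "A", "B", "C", "C"],
    "lat": [10.2, 10.7, 10.4, -3.1, -3.9],
    "lon": [35.5, 35.9, 35.1, 22.2, 22.8],
})
occ["cell_lat"] = occ["lat"].floordiv(1).astype(int)   # 1-degree bins
occ["cell_lon"] = occ["lon"].floordiv(1).astype(int)
richness = occ.groupby(["cell_lat", "cell_lon"])["species"].nunique()
print(richness)
```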
Question
47 answers
Nowadays everyone is talking about Big Data Analytics. I think it makes excellent sense these days, as cloud computing has emerged as an important area of research. I would also like to start doing some research in this area, but how? Can anyone share his/her experience?
Relevant answer
Answer
Please see my paper:
DOI: 10.1007/s13253-018-00341-3
Question
5 answers
This question is related to the research topics Information, Innovation, and Knowledge.
Relevant answer
Answer
As scientists, we should trust what we have evidence for. That's why I think a sufficient level of transparency is needed to ensure a high quality of information systems, and confidence in them. What makes science superior to other ways of gaining insight is not just the systematic method of verifying or falsifying assumptions, but also that we expose our insights to public criticism. This is the driving force for advancing our insights, and it prevents particular interests and perspectives from leading to biases and blind spots, often unintentional ones. For example, security is not just a function of the crime and terrorism prevented, but also of the resilience of our society to perturbations, which requires freedom and the ability to contradict and innovate.
Question
1 answer
I am developing a didactic proposal for the use of Big Data in Science Teaching, making use of public free tools. I am not considering teaching conventional Big Data with Hadoop, MapReduce, etc. because my pre-service teacher students are not going to work directly with Big Data but rather use it as a modern tool to teach Science and to "answer questions that were previously considered beyond our reach". Have any of you considered this possibility? Do any of you have suggestions?
Relevant answer
Answer
I don't have experience, but there is a course starting in one week about Big Data in education.
Question
1 answer
In traditional parallel and distributed data mining algorithms the issues are data decomposition (data vs. task), data layout (horizontal vs. vertical), load balancing (static vs. dynamic), and the memory model used (shared, distributed, or hybrid). So if we design data mining algorithms on the MapReduce platform, what should the research issues be?
Relevant answer
Answer
There has been significant work building algorithms on top of MapReduce. Some very good papers surveying this topic include: