acmetaya.blogg.se - Rapidminer studio csv file size limit

#RAPIDMINER STUDIO CSV FILE SIZE LIMIT CODE#
#RAPIDMINER STUDIO CSV FILE SIZE LIMIT SERIES#

Put this one merged document back into a dataset with only one example and one attribute


Take those rows that are marked missing (?) and replace them with & & & & & as a placeholder

Convert the dataset to a series of documents (one per row)
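For readers who think in Python rather than in RapidMiner operators, the annotated steps can be pictured roughly as follows. This is only a sketch of my reading of the annotations (replace missing values, one document per row, merge, wrap in a one-example dataset), not the original process. The function name, the `text` attribute name, joining cells with a space and joining documents with a newline are all assumptions.

```python
# Placeholder copied from the annotation; whether the spaces are intended is unknown.
PLACEHOLDER = "& & & & &"


def merge_rows_into_one_cell(rows, placeholder=PLACEHOLDER):
    """Hypothetical helper mirroring the annotated steps of the "hack".

    rows: list of rows, each a list of cell values, with None for missing.
    Returns a dataset with exactly one example and one attribute.
    """
    documents = []
    for row in rows:
        # Step 1: replace missing values with the placeholder.
        cells = [placeholder if cell is None else str(cell) for cell in row]
        # Step 2: turn the row into one document (space-joined: an assumption).
        documents.append(" ".join(cells))
    # Step 3: merge all documents into a single one (newline-joined: an assumption).
    merged = "\n".join(documents)
    # Step 4: put the merged document back into a one-cell dataset.
    return [{"text": merged}]
```

Once everything sits in a single cell, the placeholder marks where the unreadable separator used to be, so the text could be split again on it in a later step.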


Thanks, below is the same code with some annotations so you can see what I was doing with this "hack":

As mentioned earlier, fixing the encoding upstream so that the data comes into RM clean would probably be a smoother way to go.

I'll keep poking around to see if I can get to the bottom of it. However, I was able to create a hack that does basically the same thing:

I played around with regular expressions for a while to try to catch it but could not do so. RM is reading the intended column separator as "missing" rather than as a recognizable Unicode character. The issue is, as I suspected before, that the column separator is not being caught correctly.

I am not a Python programmer, but I will play a bit more with RapidMiner's built-in features to see if I can find the right RegEx for this file:

You still have some funky accented-character issues, which are due to the encoding problems mentioned earlier, but I think it's clear that a key problem is that the default RapidMiner CSV row parsing (escape character) is not cutting off in the right places for your file.
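On the "fix the encoding upstream" idea, a minimal sketch of re-encoding the file to UTF-8 before RapidMiner ever sees it could look like this. The source encoding of the file is not known from this thread, so `cp1252` is purely an assumption, and both function names are hypothetical.

```python
def to_utf8(raw, source_encoding="cp1252"):
    """Decode bytes from an assumed source encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def reencode_file(src_path, dst_path, source_encoding="cp1252"):
    """Hypothetical helper: write a UTF-8 copy of src_path to dst_path."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_utf8(raw, source_encoding))
```

If the guess at the source encoding is wrong, the accented characters will still come out garbled, so it is worth checking a few known words in the output before importing.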

Just going back to the test1.csv file and the 22 vs 276 rows issue, I found pretty quickly that if you save the test1.csv file as an xlsx file (in Excel), and then just use the Read Excel operator instead of Read CSV, you import the 22 rows with no problem:

I cannot attach a text file, so just create your own with the following text:

this is text followed by blank lines

However, after going through the Execute Python operator it is now split over two cells.

We read in a text file and put the string in one cell.

Finally I put it through Execute Python (doing nothing) and all of a sudden it's two cells! I think the blank lines mess up the operator?

I then use Documents to Data to make a dataframe of one cell.

I then save it to my machine and load it in via RapidMiner (Read Document).

Take a look at this example! I create a text document manually and write three lines into it.

Question: have you ensured that RapidMiner is running Python properly? Go to RapidMiner -> Preferences -> Python Scripting and use the "Test" button.

I ran your new Python script and did not get any problems reading the csv file.

No worries about the queries, that's what the forum is for.
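The "one cell becomes two" symptom is what you would expect whenever text containing line breaks is passed through a CSV-style exchange and read back line by line. The sketch below uses only Python's standard `csv` module to show the difference between splitting on raw lines and parsing with quote handling. Where exactly the blank lines fall in the sample text is an assumption, and this does not claim to reproduce what the Execute Python operator does internally.

```python
import csv
import io

# Sample text from the post; the position of the blank lines is an assumption.
text = "this is text\n\n\nfollowed by blank lines"

# Write a one-column, one-example dataset with every field quoted.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(["text"])
writer.writerow([text])
raw = buffer.getvalue()

# Naive approach: treat every physical line as a row. The cell falls apart.
naive_rows = raw.splitlines()

# Quote-aware approach: the embedded line breaks stay inside one cell.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows))   # more lines than the two logical rows
print(len(parsed_rows))  # header row plus one data row
```

If the same file gives different row counts in different readers (as with the 22 vs 276 rows), comparing the physical line count with the quote-aware row count is a quick way to check whether embedded line breaks are the cause.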

I have tried UTF-8, which unfortunately does not fix it, and even if it did I don't think that would solve the problem, as I would still have to save a CSV on my local machine.

If I view the output of the Generate Attributes operator everything is how I would expect; however, as soon as I write a CSV or execute Python, everything goes wrong. Basically: loop through 4000 text files in a directory, convert them to a single dataset, create/remove some attributes and then use the Python operator. The process I have attached is an example of what I am looking to do for this part of the pipeline. The problem is that this is a small process within a bigger pipeline, so I'd prefer to pass the data through RapidMiner and not be saving and reading documents from my local machine during the process.

One possible issue I was thinking of: maybe there are some funky characters in my text that are causing the problem? Or is it an error carried forward in the process? Interestingly, I noticed that in the generated dataset (which works perfectly) the attribute is defined as polynominal. This gives me the impression that the issue may be on my end.

As for the data type, the attribute of interest was defined as text, and I've tried playing around with changing it, but to no avail. When I run the R-generated data on my machine it works fine, so it must be a local issue with my dataset.

I have included the process in this post. The process finishes and I can see the third attribute modified in RapidMiner:

Can you share your process to see what's wrong? I generated an artificial dataset with R (size 240 MB): I tried to reproduce your error, but it worked for me.
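For comparison outside RapidMiner, the "loop through the text files in a directory and convert them to a single dataset" step could be sketched in plain Python like this. The function name, the `*.txt` pattern, the UTF-8 encoding and the `file`/`text` attribute names are all assumptions, not details taken from the attached process.

```python
from pathlib import Path


def load_text_files(directory, pattern="*.txt", encoding="utf-8"):
    """Hypothetical helper: one example per file, with the file's text in one cell."""
    rows = []
    # Sort so the order of examples is reproducible between runs.
    for path in sorted(Path(directory).glob(pattern)):
        rows.append({"file": path.name, "text": path.read_text(encoding=encoding)})
    return rows
```

Keeping each file's full text in a single value, line breaks included, makes it easy to check whether the text is still intact before and after any export step.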