Friday, December 7, 2007

How big, huge, humongous … can a text file be?

Originally TextMaster had a limit of 2.4 GB on text file size. At the time, a text file of that size would have been considered too big to be real. The following e-mail changed that:
“Hey Jake,
We do not have the file merged yet, but I need to QC the merged file and no way to actually open it, there will be about 150K records in it and length will be around 20000.
Paul”
150,000 records in a text file is not unusual, but a record length of 20 K is very unusual. Altogether, the file size would be 3 GB (150,000 × 20,000 characters). Because the file wasn’t available, I had to create it. The data generator did it within a few minutes, and I had a 150 K line fixed-length file with a 20 K record length. I created a connection (file description) which had two fields with a length of 10 K each.
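The size estimate above is easy to check, and a scaled-down test file of the same shape is easy to produce. The following is only a minimal sketch, not TextMaster's data generator: it assumes one byte per character and CRLF line endings, and the function names, the record content, and the sample file name are made up for illustration.

```python
import os
import tempfile

RECORD_LENGTH = 20_000   # characters per record, from the e-mail
RECORD_COUNT = 150_000   # records, from the e-mail
INT32_MAX = 2**31 - 1    # largest signed 32-bit integer


def expected_size(records, record_length, newline=b"\r\n"):
    """Size in bytes of a fixed-length file with one record per line.

    Assumes one byte per character (an assumption, not stated in the post).
    """
    return records * (record_length + len(newline))


def write_fixed_length_file(path, records, record_length, newline=b"\r\n"):
    """Write `records` lines, each exactly `record_length` bytes long."""
    with open(path, "wb") as out:
        for i in range(records):
            # Hypothetical content: the record number, padded with 'x'
            # and cut to the record length so every line has the same size.
            body = str(i).encode("ascii").ljust(record_length, b"x")
            out.write(body[:record_length] + newline)


if __name__ == "__main__":
    full = expected_size(RECORD_COUNT, RECORD_LENGTH)
    print(f"full-size file: {full:,} bytes; above 32-bit range: {full > INT32_MAX}")

    # Write only 100 records here so the demo stays small.
    path = os.path.join(tempfile.mkdtemp(), "HugeFileSample.txt")
    write_fixed_length_file(path, records=100, record_length=RECORD_LENGTH)
    print(os.path.getsize(path) == expected_size(100, RECORD_LENGTH))
```

Without line terminators the total is exactly 3,000,000,000 bytes, which is already larger than the biggest signed 32-bit integer (2,147,483,647).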
The query “Select top(10) * from C:\TMSampleData\HugeFile.tmc” started and reported an error, but execution continued. I got the first ten lines from my HugeFile. I was impressed. When I tried to select all records from the file, an error message appeared but the process continued. A little bit of debugging pointed to the fact that the progress bar control can accept only a 32-bit integer. Dividing by 1000 for file sizes above 2 GB fixed the problem. The fix was released in August, but the blog wasn’t updated. Sorry.
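The idea behind that fix can be sketched in a few lines. This is an illustration only, not TextMaster's actual code: the function names are hypothetical, and the threshold is assumed to be the largest signed 32-bit integer rather than an exact "2 GB" figure.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def progress_scale(total_bytes, factor=1000):
    """Return 1 when the total fits in 32 bits, otherwise the divisor."""
    return 1 if total_bytes <= INT32_MAX else factor


def to_progress_units(position, total_bytes, factor=1000):
    """Convert a byte position and file size into values safe for a
    progress bar that only accepts 32-bit integers (value, maximum)."""
    scale = progress_scale(total_bytes, factor)
    return position // scale, total_bytes // scale


if __name__ == "__main__":
    # Halfway through a 3,000,000,000-byte file.
    value, maximum = to_progress_units(1_500_000_000, 3_000_000_000)
    print(value, maximum, maximum <= INT32_MAX)
```

A fixed divisor of 1000 keeps the values in range for totals up to about 2.1 × 10^12 bytes; beyond that a larger divisor, or a percentage, would be needed.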
Now the limit is 300 GB. Is it enough? Only time will tell.

Saturday, August 4, 2007

TextMaster Beginning

Almost two years ago, a colleague of mine had a problem. At the beginning it seemed to be a trivial thing. He had to check a data file, a fixed-size text file where every line contained distinct client information. The file had about 12 million lines, and every line was 210 characters long. After a few text editors failed to open it, with or without an error message, it turned out that just viewing a few hundred lines is not such a trivial task. At the time, he was working as a Quality Assurance Specialist (or something like that), and on a daily basis he had to check data files coming in and going out. To “check” a file he had to import it into SQL Server and run queries on it. Needless to say, once you move a file into a database, it’s not necessarily the same file.

Text file structure and size checking sounds like a trivial task. But is it really?

- Let’s take the most trivial task of all: counting the number of lines in a text file. If the file has 100 lines, they might be counted manually; for 10,000 lines, a spreadsheet or some text editors will do the job to a certain extent; but for 1 million lines there is hardly any other solution available on the market today. The text file can be imported into SQL Server, but after that it’s really not that file. It’s a table in SQL Server. The conversion can easily turn one broken line in the text file into two rows in SQL Server;

- Or checking whether each and every line has the same number of characters (for fixed-size files) or the same number of fields (for delimited files). Even for a delimited text file with 100 lines and 10 fields per line, it can be a time-consuming and very annoying process. Some spreadsheets will open a comma-delimited text file very quickly, but they will also assume that lines with a different number of fields are meant to be like that. It’s not a spreadsheet software bug. Those programs are designed to behave like that. TextMaster is designed to check every line, determine its length or number of fields, and report the number of lines per line length / field count;

- Pretty often, spot checking a small number of lines randomly selected from a big file is a very handy feature, but one that is very hard to find. TextMaster can export randomly selected lines to another file, based on user criteria, very quickly and easily. If it’s not the only one, then it belongs to a small group of software tools capable of doing it;

- From time to time, a received file will have a file name without much meaning, if any, and an undefined structure. If it was processed before and saved in a profile, its structure and purpose can be determined within a few seconds. In addition, processing instructions might be available.
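To make the first three checks concrete, here is a minimal sketch of how they can be done in a single streaming pass, without loading the file into memory or importing it anywhere. It is not how TextMaster works internally; the function names are hypothetical, the field count is a naive delimiter count that ignores quoting, and the random selection uses reservoir sampling, a standard technique chosen for this example.

```python
import os
import random
import tempfile
from collections import Counter


def profile_lines(path, delimiter=None):
    """Stream the file once and return (line_count, counts).

    `counts` maps line length (or field count, when a delimiter is
    given) to the number of lines that have it.
    """
    counts = Counter()
    total = 0
    with open(path, "rb") as f:
        for raw in f:
            line = raw.rstrip(b"\r\n")  # drop the line terminator
            if delimiter is None:
                key = len(line)
            else:
                # Naive field count: does not handle quoted delimiters.
                key = line.count(delimiter) + 1
            counts[key] += 1
            total += 1
    return total, counts


def sample_lines(path, k, seed=None):
    """Pick k lines uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, "rb") as f:
        for n, raw in enumerate(f):
            if n < k:
                reservoir.append(raw)
            else:
                j = rng.randint(0, n)  # inclusive on both ends
                if j < k:
                    reservoir[j] = raw
    return reservoir


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"a,b,c\r\nd,e,f\r\ng,h\r\n")  # last line is one field short
    print(profile_lines(path))
    print(profile_lines(path, delimiter=b","))
    print(sample_lines(path, 2))
    os.remove(path)
```

A file where every line is well formed produces a single entry in the counts; any extra entry points straight at the suspicious lines.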

That was the beginning. There is a lot more, and a lot more to come. Sign up and we will keep you posted.