The Rain and The Shade

February 8, 2012

Peeping into very large text (XML) files

Filed under: C# Coding,PowerShell — ovaisakhter @ 4:47 pm

Recently I worked with a 40 GB XML file. The main objective was to look at the data in the file, create a compatible data model, and eventually write a routine to import the data into the database.

To start with, I used a virtual machine hosted somewhere in the cloud for all my work. This approach made my life much easier because:

  • The long-running processes did not slow down my own computer, so I could do something else while the processing ran
  • I could just turn off my PC and go home without having to kill the process
  • I got a much faster machine in the VM

I tried to open the file in some well-known editors, but none of them managed it and most simply froze. My initial objective was just to peek into some part of the document, so that I could at least get an idea of what type of data I had to work with. My first thought was to write a small program using XmlReader in C# and dump some part of the file into another file.
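
For reference, the XmlReader idea could look roughly like the sketch below. It is written in PowerShell rather than C# so that it sits alongside the commands in the rest of this post, but it calls the same .NET class. The function name Get-XmlOutline and its parameters are made up for illustration, and instead of dumping raw text it only lists the first few element names with their nesting, which is often enough to guess the data model.

```powershell
# Hypothetical helper: stream through the XML and list the first
# $MaxElements element names, indented by their nesting depth.
function Get-XmlOutline {
    param(
        [string]$Path,
        [int]$MaxElements = 50
    )
    $settings = New-Object System.Xml.XmlReaderSettings
    $settings.IgnoreWhitespace = $true
    # .NET resolves relative paths against the process directory,
    # not the PowerShell location, so pass a full path.
    $fullPath = (Resolve-Path $Path).ProviderPath
    $reader = [System.Xml.XmlReader]::Create($fullPath, $settings)
    try {
        $seen = 0
        # Read() advances one node at a time, so only a small part
        # of the file is ever held in memory.
        while (($seen -lt $MaxElements) -and $reader.Read()) {
            if ($reader.NodeType -eq [System.Xml.XmlNodeType]::Element) {
                ('  ' * $reader.Depth) + $reader.Name
                $seen++
            }
        }
    }
    finally {
        $reader.Close()
    }
}
```

Something like Get-XmlOutline .\big.xml -MaxElements 30 (the file name is only a placeholder) would stop reading as soon as 30 elements have been seen, however large the file is.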

Just before writing the code I stumbled upon the Get-Content cmdlet in Windows PowerShell. Get-Content lets you open a text-based file and perform basic operations on it. So if you want to show a file, you write something like

Get-Content .\1.txt

at the Windows PowerShell prompt. This prints the entire content of the file to the console. Not very useful, is it?

Now, if you want to display only the first 10 lines of this file, you can give the following command

Get-Content .\1.txt -totalcount 10

The best thing about this command is that it does not load the whole file into memory; it reads only the requested number of lines and then stops. You can very easily save the output of this command to a file, like this

Get-Content .\1.txt -totalcount 10 > new1.txt
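
The same idea can be stretched a little to look at a slice from somewhere further into the file. The sketch below is only an illustration: Get-LineRange is a made-up helper name, and it assumes a PowerShell version where Select-Object supports -Skip (2.0 or later). Get-Content still has to read from the start of the file up to the last requested line, but nothing beyond it.

```powershell
# Hypothetical helper: return lines $First..$Last (1-based, inclusive).
function Get-LineRange {
    param(
        [string]$Path,
        [int]$First,
        [int]$Last
    )
    # -TotalCount stops reading after line $Last;
    # Select-Object -Skip discards the lines before $First.
    Get-Content $Path -TotalCount $Last | Select-Object -Skip ($First - 1)
}

# Example: lines 1000 to 1010 of the file used above
# Get-LineRange .\1.txt 1000 1010
```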

You can get more information about this command from the following link:

http://technet.microsoft.com/en-us/library/ee176843.aspx

The link describes other ways of using Get-Content, such as counting the number of lines or reading the last n lines, but these are not practical for very large files: they have to read through the entire file, and some load all of it into memory before performing the requested operation, which defeats the purpose of using the command in the first place.
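
If the end of the file is what you want to see, one way around this is to skip Get-Content and seek directly to the last few kilobytes with a .NET file stream. The following is a minimal sketch under some assumptions: Get-TailText is a made-up name, the file is assumed to be UTF-8, and the result is a raw chunk of text rather than whole lines, so the first line (or character) of the output may be cut off.

```powershell
# Hypothetical helper: return the last $ByteCount bytes of a file as text.
function Get-TailText {
    param(
        [string]$Path,
        [int]$ByteCount = 4096
    )
    $fullPath = (Resolve-Path $Path).ProviderPath
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        # Never ask for more bytes than the file holds.
        $count = [int][Math]::Min([long]$ByteCount, $stream.Length)
        # Jump straight to the last $count bytes; nothing before
        # that position is read.
        [void]$stream.Seek(-$count, [System.IO.SeekOrigin]::End)
        $buffer = New-Object byte[] $count
        $read = 0
        while ($read -lt $count) {
            $n = $stream.Read($buffer, $read, $count - $read)
            if ($n -eq 0) { break }
            $read += $n
        }
        # Assumption: the file is UTF-8 encoded.
        [System.Text.Encoding]::UTF8.GetString($buffer, 0, $read)
    }
    finally {
        $stream.Close()
    }
}
```

PowerShell 3.0 and later also add a -Tail parameter to Get-Content for this purpose, so on a newer version that is probably the simpler thing to try first.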
