Using Matlab “princomp” for Easy Dimension Reduction Using Principal Component Analysis (PCA)

Although I have detailed another way of doing dimension reduction in Matlab, I recently found the command “princomp”, which does everything for you. The following code reads in the .csv files from a directory and reduces each one to a set number of dimensions (“OutputSize” in this case). This is a lot easier than doing it yourself with the eigenvectors etc.:

function ReduceUsingPCA2(DirName,OutputSize)
% Reduce every .csv file in DirName to OutputSize dimensions using PCA

files = dir(fullfile(DirName, '*.csv'));
outdir = [num2str(OutputSize) 'PC'];
if ~exist(outdir, 'dir')
    mkdir(outdir);              % create the output folder once
end

for i=1:length(files)
    % read in csv file from FileName and store in x
    FileName = fullfile(DirName, files(i).name);
    x = csvread(FileName);

    % calculate PCs and project data onto principal components
    [COEFF,SCORE] = princomp(x);

    % build the output name from the input name without its extension
    [~, infile] = fileparts(files(i).name);
    outputfilename = fullfile(outdir, [infile '_' num2str(OutputSize) 'PCs.csv']);
    csvwrite(outputfilename,SCORE(:,1:OutputSize));
end
end

The important call is [COEFF,SCORE] = princomp(x); which takes in your data “x” (one observation per row) and stores its projection into PCA space in “SCORE”. I then output the first “OutputSize” columns of “SCORE” to csv. I still need to find out how to project back into normal space, but I think it should be just as straightforward as this was. For more info on “princomp”, type “help princomp” into Matlab and have a look at the help files. (Newer Matlab releases provide “pca”, which returns the same COEFF and SCORE outputs.)
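On projecting back into normal space: princomp centres the data before projecting, so SCORE = (x - mean(x)) * COEFF, and the columns of COEFF are orthonormal. Keeping only the first k columns, an approximation of the original data should therefore be SCORE(:,1:k) * COEFF(:,1:k)' plus the column means. The sketch below illustrates that relation in plain Python rather than Matlab, on made-up data with a hand-picked basis standing in for COEFF. The function names "project" and "reconstruct" are hypothetical, and this is an illustration of the idea rather than a tested replacement for the Matlab code.

```python
import math

def project(rows, mean, coeff, k):
    """Scores: centre each row, then dot it with the first k columns of coeff."""
    return [[sum((x - m) * coeff[d][c] for d, (x, m) in enumerate(zip(row, mean)))
             for c in range(k)]
            for row in rows]

def reconstruct(scores, mean, coeff):
    """Back-projection: x_hat = scores * coeff(:, 1:k)' + mean."""
    k = len(scores[0])
    return [[m + sum(s[c] * coeff[d][c] for c in range(k))
             for d, m in enumerate(mean)]
            for s in scores]

# Made-up 2-D data lying exactly on the line y = x + 1
rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
mean = [sum(col) / len(rows) for col in zip(*rows)]   # column means

# Hand-picked orthonormal basis standing in for COEFF (an assumption for the
# demo); its first column points along the line the data lies on
r = 1 / math.sqrt(2)
coeff = [[r, -r],
         [r,  r]]

scores = project(rows, mean, coeff, k=1)     # 2-D data reduced to 1-D
restored = reconstruct(scores, mean, coeff)  # back in the original 2-D space
print(restored)
```

Because the toy data really is one-dimensional, a single component restores it up to rounding. With real data the difference between "rows" and "restored" is the reconstruction error.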

Run System Commands in the Background with C#

I needed to run 200-odd batch files sequentially in the background for a speech classifier. The code to do this involved writing a batch file from code, containing the commands I wanted the HTK to run. This is probably not the best way of doing it, but HTK is funny with file paths containing spaces etc., and this was the quickest way.

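// requires: using System.Diagnostics;
Process hvite = new Process();
hvite.EnableRaisingEvents = false;
hvite.StartInfo.FileName = "temp.bat";                    // batch file written out earlier by the program
hvite.StartInfo.CreateNoWindow = true;
hvite.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;  // keep the console window out of sight
hvite.Start();
hvite.WaitForExit();                                      // block until this batch file has finished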
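The same start, hide and wait pattern can be looped over a whole list of commands. The sketch below shows that loop in Python rather than C#. The function name "run_sequentially" and the two demo commands are hypothetical stand-ins for the real batch files, so treat it as an outline of the idea under those assumptions.

```python
import subprocess
import sys

def run_sequentially(commands):
    """Run each command in turn, waiting for it to finish; return the exit codes."""
    # CREATE_NO_WINDOW only exists on Windows; fall back to 0 (no flags) elsewhere
    flags = getattr(subprocess, "CREATE_NO_WINDOW", 0)
    exit_codes = []
    for command in commands:
        # subprocess.run blocks until the command exits, like WaitForExit()
        finished = subprocess.run(command, creationflags=flags,
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
        exit_codes.append(finished.returncode)
    return exit_codes

# Stand-in commands so the sketch runs anywhere; in practice each entry would
# be one of the generated batch files
demo = [[sys.executable, "-c", "pass"],
        [sys.executable, "-c", "import sys; sys.exit(3)"]]
codes = run_sequentially(demo)
print(codes)
```

Collecting the exit codes makes it possible to spot a failed run among a couple of hundred without watching the console.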

Using Matlab and Principal Component Analysis (PCA) to Reduce Dimensionality of .csv Data

This information is really out of date: I have a much easier method, using “princomp” (described above), that does away with doing everything yourself.

I used Matlab to reduce the number of dimensions in my gesture data. After a bit of experimentation with different numbers of dimensions I found I could reduce the number of dimensions by half using PCA and still get quite low errors between the original data and the reduced dimension reconstructed data. Some gesturers made such consistent movements I could use just 2 dimensions to describe almost their entire range of motion.

The method is relatively clear in Matlab, although I am still a bit unsure of the multiple transforms made in the following code. I think I may have performed a few too many, but at least it works! The code “ReduceUsingPCA.m” takes in a directory to perform the conversion on and the number of output dimensions you require. So to convert every .csv in “c:input” to 20-dimensional data you run it as ReduceUsingPCA('c:input',20) in Matlab.

% DirName is the directory of .csv files to work on, OutputSize is no. of
% dimensions to output after PCA
function ReduceUsingPCA(DirName,OutputSize)

files = dir(fullfile(DirName, '*.csv'));
outdir = [num2str(OutputSize) 'PC'];
if ~exist(outdir, 'dir')
    mkdir(outdir);              % create the output folder once
end

for i=1:length(files)
    % read in csv file from FileName and store in x
    FileName = fullfile(DirName, files(i).name);
    x = csvread(FileName);

    m=mean(x);                  % find mean of each column of input matrix
    y=x-ones(size(x,1),1)*m;    % centre the data by subtracting the mean
    c=cov(y);                   % find covariance matrix
    [V,D]=eig(c);               % find eigenvectors (V) and eigenvalues (D) of covariance matrix
    [D,idx] = sort(diag(D));    % take eigenvalues from the diagonal of D and sort ascending, idx stores order to use when ordering eigenvectors
    D = D(end:-1:1)';           % reverse to get descending order
    V = V(:,idx(end:-1:1));     % put eigenvectors in order to correspond with eigenvalues
    V2d=V(:,1:OutputSize);      % (significant Principal Components we use, OutputSize is input variable)
    prefinal=V2d'*y';
    final=prefinal';            % final is centred data projected onto eigenspace (same as y*V2d)

    % build the output name from the input name without its extension
    [~, infile] = fileparts(files(i).name);
    outputfilename = fullfile(outdir, [infile '_' num2str(OutputSize) 'PCs.csv']);

    csvwrite(outputfilename,final);
end
end

The files are saved in a folder named after the number of components (e.g. “20PC”), created in the current Matlab directory, with names such as “filename_20PCs.csv”.
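I picked the number of dimensions by experimenting and comparing reconstruction errors. A different, commonly used heuristic is to look at the sorted eigenvalues (the vector D in the code above): each one is the variance captured by its component, so the running total divided by the overall total gives the fraction of variance retained. The sketch below, in Python rather than Matlab, picks the smallest number of components that reaches a chosen fraction. The function name "components_needed", the 0.95 default and the eigenvalues are all made up for illustration, and this is not the method I actually used.

```python
from itertools import accumulate

def components_needed(eigenvalues, target=0.95):
    """Smallest number of components whose eigenvalues reach `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)   # largest variance first
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return k
    return len(ordered)

# Made-up eigenvalues, deliberately unsorted (eig does not guarantee an order)
eigenvalues = [0.5, 6.0, 0.25, 3.0, 0.25]
k = components_needed(eigenvalues, target=0.85)
print(k)
```

The eigenvalues that are dropped add up to the variance lost, which is why a high retained fraction goes together with a low reconstruction error.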

Converting CSV and Vector Data to Native HTK Format Using C#

The output of my Principal Component Analysis in Matlab, which reduces the dimensionality of gesture data, is in comma-separated values (csv) format: 57-dimension data goes in, X-dimension data comes out in standard csv format. It is better to remove unnecessary information from the gesture data, as it only makes the recognition of gesture (and intent, in my case) more difficult.

The problem is that the HTK, which I am going to use to perform recognition, doesn’t natively accept csv data, so you have to convert to the HTK binary-format parameter files. I chose to do this in C# as I’m familiar with it, but I stumbled across a few problems relating to the conversion between big-endian and little-endian binary data. HTK reads data in the opposite byte order to my PC (although I’m sure I read on their website somewhere that there is automatic detection for this).
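To make the byte-order problem concrete, here is the same 4-byte float packed both ways. It is a small Python illustration rather than C#, and the value 1.0 is just a convenient example.

```python
import struct

value = 1.0
little = struct.pack("<f", value)   # little-endian, the order a typical PC uses
big = struct.pack(">f", value)      # big-endian, the order written to the HTK file

print(little.hex())   # 0000803f
print(big.hex())      # 3f800000
```

The two results contain the same four bytes in reverse order, which is why the conversion only needs to flip each group of four bytes.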

The following code is pretty rough around the edges as it includes a lot of stuff to help me debug it. The program reads a directory and converts all *.csv files into HTK format binary files by reading in the data as floats, converting to bytes, writing a header and then writing the data to a binary file *.csv.bin.

// requires: using System; using System.Globalization; using System.IO;
static void Main(string[] args)
{
    string dir = @"G:PHD Nov 09# programsmatlab worktest";
    DirectoryInfo di = new DirectoryInfo(dir);

    FileInfo[] rgFiles = di.GetFiles("*.csv");
    foreach (FileInfo fi in rgFiles)
    {
        // one sample per line: split on both newline characters and drop empty lines
        string data = File.ReadAllText(fi.FullName);
        string[] plit = data.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        int samples = plit.Length;
        int itemspersample = plit[0].Split(',').Length;

        // now create binary data, each value in a sample (a line in the file)
        // has to be converted from a float to a 4 byte array and then joined to make one
        // large binary block
        byte[] bytedata = new byte[samples * itemspersample * 4];

        for (int i = 0; i < samples; i++)
        {
            string[] items = plit[i].Split(',');
            for (int j = 0; j < itemspersample; j++)
            {
                float f = float.Parse(items[j], CultureInfo.InvariantCulture);
                byte[] temp = BitConverter.GetBytes(f);

                // reverse the byte order (assumes a little-endian PC, HTK wants big-endian)
                int offset = (i * itemspersample * 4) + (j * 4);
                bytedata[offset] = temp[3];
                bytedata[offset + 1] = temp[2];
                bytedata[offset + 2] = temp[1];
                bytedata[offset + 3] = temp[0];
            }
        }

        // now create HTK header 12 bytes long
        byte[] nSamples = BitConverter.GetBytes(samples);                              // 4 bytes: number of samples
        byte[] sampPeriod = BitConverter.GetBytes(100000);                             // 4 bytes: sample period in units of 100 ns
        byte[] sampSize = BitConverter.GetBytes(Convert.ToInt16(itemspersample * 4));  // 2 bytes: bytes per sample
        byte[] parmKind = BitConverter.GetBytes(Convert.ToInt16(9));                   // 2 bytes: 9 = USER

        using (BinaryWriter bw = new BinaryWriter(File.Open(fi.FullName + ".bin", FileMode.Create)))
        {
            // header fields also need their byte order reversed
            Array.Reverse(nSamples);
            Array.Reverse(sampPeriod);
            Array.Reverse(sampSize);
            Array.Reverse(parmKind);
            bw.Write(nSamples);
            bw.Write(sampPeriod);
            bw.Write(sampSize);
            bw.Write(parmKind);
            bw.Write(bytedata);
        }
    }
}

To check it works you run HList, with no config file required as the header explains to HTK everything it needs to know about the data:

G:PHD Nov 09# programsmatlab worktest>hlist -h EO412_10PCs.csv.bin
------------------------- Source: EO412_10PCs.csv.bin -------------------------
Sample Bytes:  40       Sample Kind:   USER
Num Comps:     10       Sample Period: 10000.0 us
Num Samples:   5        File Format:   HTK
------------------------------ Samples: 0->-1 ------------------------------
0:    1838.200 308.910-262.970 401.920 -66.737-499.370 305.260-260.250 -91.974  28.171
1:    1837.700 308.630-263.340 400.810 -67.144-499.920 305.280-260.060 -92.174  27.584
2:    1837.000 308.360-263.750 399.940 -67.870-500.510 305.540-259.960 -91.964  26.922
3:    1836.500 308.160-264.000 398.500 -68.003-501.230 305.790-259.980 -92.138  26.342
4:    1837.000 308.360-263.750 399.940 -67.870-500.510 305.540-259.960 -91.964  26.922
----------------------------------- END -----------------------------------
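The header fields HList reports can also be checked by hand, since the header is just four big-endian integers of 4, 4, 2 and 2 bytes. The sketch below, in Python rather than C#, packs a tiny made-up data set in the same layout as the converter and then unpacks the header again. The function names "write_htk" and "read_htk_header" are hypothetical, and it only covers the plain USER case used here.

```python
import struct

USER = 9  # parmKind value the C# code writes; HList reports it as USER

def write_htk(rows, samp_period=100000):
    """Pack equal-length rows of floats as the bytes of a big-endian HTK parameter file."""
    width = len(rows[0])
    # header: nSamples (4 bytes), sampPeriod (4), sampSize (2), parmKind (2)
    header = struct.pack(">iihh", len(rows), samp_period, width * 4, USER)
    body = b"".join(struct.pack(">%df" % width, *row) for row in rows)
    return header + body

def read_htk_header(blob):
    """Unpack the 12-byte header into the fields HList prints."""
    n_samples, samp_period, samp_size, parm_kind = struct.unpack(">iihh", blob[:12])
    return {
        "num_samples": n_samples,
        "sample_period_us": samp_period / 10.0,   # stored in units of 100 ns
        "sample_bytes": samp_size,
        "num_comps": samp_size // 4,              # 4 bytes per float
        "parm_kind": parm_kind,
    }

# Made-up rows: two samples of three components each
rows = [[1838.2, 308.91, -262.97],
        [1837.7, 308.63, -263.34]]
blob = write_htk(rows)
info = read_htk_header(blob)
print(info)
```

Feeding it a header with 5 samples of 40 bytes and a period of 100000 gives 10 components and 10000.0 us, matching the HList listing above.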